Introduction
TL;DR:
- AI sales forecasting often fails because of data semantics (schemas, time meaning, leakage), not model choice.
- Model your sources as sales + calendar + price + promo + inventory/stockouts, then build a stable training/inference view.
- Enforce point-in-time correctness for time-series feature joins to prevent leakage.
- Treat stockouts as censored demand and track them explicitly.
Part 2 gives you a practical data model and validation rules you can lift into a warehouse or lakehouse.
Why it matters: Forecasting systems are “data products.” If the data contract breaks, no model can rescue production.
1) Data contract goals
Lock four things:
- Target y definition (units vs revenue vs net sales)
- Time granularity + timezone + cutoff/close process
- Feature availability at prediction time (no leakage)
- Separation of zero sales vs unobserved/censored (stockouts, store closed)
Why it matters: Leakage is usually introduced by joins, and it can look great offline while failing in production.
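One way to make the four items above reviewable is to write them down as a small, versionable object. This is a minimal sketch, not a standard: the class name `ForecastContract`, its field names, and the allowed values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical vocabularies; adjust to your own business definitions.
ALLOWED_TARGETS = {"units", "revenue", "net_sales"}
ALLOWED_AVAILABILITY = {"known_in_advance", "observed_after_close"}


@dataclass(frozen=True)
class ForecastContract:
    target: str                 # what y measures
    granularity: str            # e.g. "daily"
    timezone: str               # IANA name, e.g. "UTC"
    daily_cutoff: str           # "HH:MM" local time when a day's sales are closed
    feature_availability: dict  # feature name -> when its value is known
    sellable_flags: tuple       # all must be 1 for y=0 to mean "zero sales"

    def problems(self) -> list:
        """Return human-readable contract violations (empty list if none)."""
        issues = []
        if self.target not in ALLOWED_TARGETS:
            issues.append(f"unknown target: {self.target}")
        for name, when in self.feature_availability.items():
            if when not in ALLOWED_AVAILABILITY:
                issues.append(f"feature {name} has unknown availability: {when}")
        if not self.sellable_flags:
            issues.append("no flags separate zero sales from unobserved days")
        return issues


contract = ForecastContract(
    target="units",
    granularity="daily",
    timezone="UTC",
    daily_cutoff="02:00",
    feature_availability={
        "price": "known_in_advance",
        "stockout_flag": "observed_after_close",
    },
    sellable_flags=("is_open", "is_listed"),
)
print(contract.problems())
```

Keeping the contract in code means a change to the target definition or the cutoff shows up in review rather than in a silent metric shift.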
2) Canonical training / inference view (recommended)
Retail benchmarks such as M5 ship sales separately from their context tables (calendar, prices), and the same split maps cleanly to production modeling.
Recommended columns: ds, series_id, y, is_open, is_listed, stockout_flag, price, promo_flag, event_name.
Rule of thumb: y=0 should mean “sellable but not sold.” If is_open=0 or is_listed=0, treat y as missing (or exclude the row) rather than as a zero.
Why it matters: Once this view is stable, you can swap platforms/models without rebuilding everything.
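The rule of thumb above can be sketched as a row type plus one masking function. The column names come from the recommended list; the class name `ViewRow`, the function `mask_unsellable`, and the `"sku|store"` format for `series_id` are assumptions made for this example.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ViewRow:
    ds: str                    # ISO date, e.g. "2024-01-01"
    series_id: str             # assumed here to be "sku_id|store_id"
    y: Optional[float]         # None means "not observed", never "zero"
    is_open: int
    is_listed: int
    stockout_flag: int
    price: Optional[float]
    promo_flag: int
    event_name: Optional[str]


def mask_unsellable(row: ViewRow) -> ViewRow:
    """Set y to None when the item could not have been sold that day."""
    if row.is_open == 0 or row.is_listed == 0:
        return replace(row, y=None)
    return row


closed_day = ViewRow("2024-01-01", "sku1|store9", 0.0, 0, 1, 0, 2.5, 0, "New Year")
quiet_day = ViewRow("2024-01-02", "sku1|store9", 0.0, 1, 1, 0, 2.5, 0, None)

print(mask_unsellable(closed_day).y)  # store closed: y becomes missing
print(mask_unsellable(quiet_day).y)   # open and listed: a genuine zero is kept
```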
3) Source tables blueprint
| Domain | Table | Key | Notes |
|---|---|---|---|
| Target | fact_sales | ds, sku_id, store_id | observed sales |
| Calendar | dim_calendar | ds | holidays/events |
| Price | fact_price | sku_id, store_id, effective_from~to | effective-dated |
| Promo | fact_promo_plan | sku_id, store_id, start~end | planned vs actual split |
| Inventory | fact_inventory_snapshot | as_of_ts, sku_id, store_id | derive stockouts |
| Static | dim_product, dim_store | ids | categories/regions |
Why it matters: Separating these domains lets you manage time semantics and “future-known” features correctly.
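The effective-dated price table is the one most likely to be joined incorrectly, so here is a small interval lookup as a sketch. It assumes half-open intervals `[effective_from, effective_to)` and `effective_to=None` for a price that is still current; the source table only says "effective_from~to", so check which boundary convention your system uses. The function name `price_on` is hypothetical.

```python
from datetime import date


def price_on(ds, price_rows):
    """Return the price in effect on day ds, or None if no interval covers it.

    Assumes half-open intervals [effective_from, effective_to) and
    effective_to=None for a price that is still current.
    """
    matches = [
        row for row in price_rows
        if row["effective_from"] <= ds
        and (row["effective_to"] is None or ds < row["effective_to"])
    ]
    if len(matches) > 1:
        # Overlaps would silently duplicate sales rows in a join.
        raise ValueError(f"overlapping price intervals on {ds}")
    return matches[0]["price"] if matches else None


prices = [
    {"effective_from": date(2024, 1, 1), "effective_to": date(2024, 1, 10), "price": 4.0},
    {"effective_from": date(2024, 1, 10), "effective_to": None, "price": 3.5},
]
print(price_on(date(2024, 1, 9), prices))
print(price_on(date(2024, 1, 10), prices))
```

Raising on overlapping intervals is a deliberate choice here: an overlap in a warehouse join fans out rows and inflates y without any error.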
4) Time semantics + leakage prevention
Use both:
- event_time: when it actually happened
- as_of_time: when your system knew it
Then enforce point-in-time joins so features reflect what was available at the label time.
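A point-in-time lookup can be sketched in plain Python as follows. It is a simplified stand-in for what a feature store or an as-of join does, and it assumes observed features (so `event_time` is never later than `as_of_time`). The function name `point_in_time_value` and the record layout are hypothetical.

```python
from datetime import datetime


def point_in_time_value(history, cutoff):
    """Return the newest feature value the system already knew at cutoff.

    history: records with "event_time", "as_of_time" and "value".
    Records that arrived after the cutoff are invisible, even if the
    event they describe happened earlier.
    """
    visible = [rec for rec in history if rec["as_of_time"] <= cutoff]
    if not visible:
        return None
    # Latest event wins; for the same event, the latest correction wins.
    newest = max(visible, key=lambda rec: (rec["event_time"], rec["as_of_time"]))
    return newest["value"]


history = [
    {"event_time": datetime(2024, 1, 5), "as_of_time": datetime(2024, 1, 5), "value": 10},
    # Late-arriving record: happened on Jan 6, loaded on Jan 8.
    {"event_time": datetime(2024, 1, 6), "as_of_time": datetime(2024, 1, 8), "value": 3},
    # Correction to the Jan 5 record, loaded on Jan 7.
    {"event_time": datetime(2024, 1, 5), "as_of_time": datetime(2024, 1, 7), "value": 12},
]

print(point_in_time_value(history, datetime(2024, 1, 6)))
print(point_in_time_value(history, datetime(2024, 1, 8)))
```

A join on `event_time` alone would hand the Jan 6 label the value 3, which production could not have seen until Jan 8.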
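Generate lag and rolling features against an explicit cutoff, so that no lag or window reaches past the data available at prediction time (don’t backfill from future data). Azure Machine Learning's AutoML documentation describes lag and rolling-window feature concepts for forecasting.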
Why it matters: Most time-series failures come from join semantics, not model architecture.
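The explicit-cutoff rule for lag and rolling features can be illustrated with two small functions. This is a sketch under the assumption of daily data held in a dict keyed by date; the names `lag_feature` and `rolling_mean` are hypothetical and not taken from any library.

```python
from datetime import date, timedelta


def lag_feature(y_by_day, target_day, lag_days, cutoff_day):
    """y from lag_days before target_day, or None if that day is past the cutoff."""
    source_day = target_day - timedelta(days=lag_days)
    if source_day > cutoff_day:
        return None  # the value exists only in hindsight; refuse to use it
    return y_by_day.get(source_day)


def rolling_mean(y_by_day, cutoff_day, window_days):
    """Mean of y over the window_days ending at cutoff_day (inclusive)."""
    days = [cutoff_day - timedelta(days=i) for i in range(window_days)]
    values = [y_by_day[d] for d in days if y_by_day.get(d) is not None]
    return sum(values) / len(values) if values else None


# Historical table that already contains days after the cutoff,
# as a backfilled training set would.
y_by_day = {date(2024, 1, d): float(d) for d in range(1, 11)}
cutoff = date(2024, 1, 7)
target = date(2024, 1, 10)

print(lag_feature(y_by_day, target, 7, cutoff))  # Jan 3, before the cutoff
print(lag_feature(y_by_day, target, 1, cutoff))  # Jan 9, after the cutoff
print(rolling_mean(y_by_day, cutoff, 3))         # Jan 5 to Jan 7
```

The second call returns nothing even though the table holds a value for Jan 9, which is exactly the case that looks fine offline and fails in production.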
5) Stockouts and censored demand
Stockouts can censor true demand and introduce systematic underestimation.
Minimum practice: create stockout_flag, then either exclude stockout periods from training or run a demand-recovery step on them.
Why it matters: If you treat stockouts as “zero demand,” you train the model to under-order.
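A tiny numeric illustration of the bias, with hypothetical numbers. The flag rule used here (end-of-day on-hand quantity of zero or less means stockout) is an assumption for the example; real rules often also look at intraday snapshots. Only the "exclude" option from the minimum practice is shown, not a demand-recovery method.

```python
def stockout_flag(on_hand_qty):
    """1 if the snapshot shows nothing left to sell, else 0 (assumed rule)."""
    return 1 if on_hand_qty <= 0 else 0


def mean_sales(rows, exclude_stockouts):
    """Average y, optionally skipping days where sales were capped by stock."""
    kept = [r["y"] for r in rows if not (exclude_stockouts and r["stockout_flag"])]
    return sum(kept) / len(kept) if kept else None


# (units sold, end-of-day on-hand quantity), hypothetical values.
snapshots = [(10, 5), (4, 0), (12, 3), (0, 0)]
rows = [{"y": y, "stockout_flag": stockout_flag(qty)} for y, qty in snapshots]

naive = mean_sales(rows, exclude_stockouts=False)
filtered = mean_sales(rows, exclude_stockouts=True)
print(naive, filtered)
```

On the two stockout days, sales of 4 and 0 are lower bounds on demand, so averaging them in pulls the estimate down.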
6) Data quality rules as executable tests
Great Expectations formalizes validation rules as Expectation Suites, and can publish results as Data Docs.
Core validations:
- Uniqueness: (ds, series_id)
- Ranges: y>=0, price>0, promo_start<=promo_end
- Semantics: stockout_flag present; closed/unlisted days excluded
Why it matters: Automated data tests can shorten incident response from hours to minutes, because a failing rule points at the broken table rather than at a degraded forecast.
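The core validations can be sketched as plain Python checks. This is deliberately not Great Expectations code: it is a library-free stand-in that shows the logic each rule would carry into an Expectation Suite. The function names `validate_view` and `validate_promos` and the failure messages are hypothetical.

```python
def validate_view(rows):
    """Return (row_index, message) pairs for canonical-view rule violations."""
    failures = []
    seen = set()
    for i, row in enumerate(rows):
        key = (row["ds"], row["series_id"])
        if key in seen:
            failures.append((i, "duplicate (ds, series_id)"))
        seen.add(key)
        if row.get("y") is not None and row["y"] < 0:
            failures.append((i, "y < 0"))
        if row.get("price") is not None and row["price"] <= 0:
            failures.append((i, "price <= 0"))
        if "stockout_flag" not in row:
            failures.append((i, "stockout_flag missing"))
        closed = row.get("is_open") == 0 or row.get("is_listed") == 0
        if closed and row.get("y") is not None:
            failures.append((i, "closed/unlisted day still carries y"))
    return failures


def validate_promos(promos):
    """Return indexes of promos whose start is after their end (ISO dates)."""
    return [i for i, p in enumerate(promos) if p["promo_start"] > p["promo_end"]]


rows = [
    {"ds": "2024-01-01", "series_id": "A", "y": 3, "price": 2.5,
     "stockout_flag": 0, "is_open": 1, "is_listed": 1},
    {"ds": "2024-01-01", "series_id": "A", "y": -1, "price": 0,
     "is_open": 0, "is_listed": 1},
]
for index, message in validate_view(rows):
    print(index, message)
```

Returning every failure instead of stopping at the first one makes a single run usable as an incident report.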
Conclusion
- Model your data as sales + calendar + price + promo + inventory/stockouts, then expose a stable canonical view.
- Enforce point-in-time correctness to prevent leakage.
- Track stockouts explicitly because they create censored demand.
- Make quality rules executable via validation suites.
Summary
- Stable schema beats unstable modeling
- Point-in-time joins prevent leakage
- Stockouts are censorship, not “zero demand”
- Quality rules must be automated
References
- [M5 Forecasting - Accuracy (Data), Accessed 2026-02-08](https://www.kaggle.com/c/m5-forecasting-accuracy/data)
- [M5 dataset overview (prices, promotions, holidays), Accessed 2026-02-08](https://colab.research.google.com/github/ikyath/M5-Forecasting-Accuracy-Kaggle/blob/master/M5_Forecast_Encoder_Decoder_Final.ipynb/)
- [Point-in-time feature joins - Databricks Docs, Accessed 2026-02-08](https://docs.databricks.com/aws/en/machine-learning/feature-store/time-series)
- [Point-in-time feature joins - Azure Databricks, Accessed 2026-02-08](https://learn.microsoft.com/en-us/azure/databricks/machine-learning/feature-store/time-series)
- [FreshRetailNet-50K: Stockout-annotated censored demand, 2025-05-22](https://arxiv.org/abs/2505.16319)
- [Censored Demand Estimation in Retail, Accessed 2026-02-08](https://dl.acm.org/doi/10.1145/3154489)
- [Expectation Suite - Great Expectations Docs, Accessed 2026-02-08](https://docs.greatexpectations.io/docs/0.18/reference/learn/terms/expectation_suite/)
- [Data Docs - Great Expectations Docs, Accessed 2026-02-08](https://docs.greatexpectations.io/docs/0.18/reference/learn/terms/data_docs)
- [Lag features for forecasting in AutoML, Accessed 2026-02-08](https://learn.microsoft.com/en-us/azure/machine-learning/concept-automl-forecasting-lags?view=azureml-api-2)
- [Time series cross-validation - FPP3, Accessed 2026-02-08](https://otexts.com/fpp3/tscv.html)