AI Sales Forecasting: Data Modeling Template for Demand Forecasting (Part 2)

Introduction

TL;DR:
- AI Sales Forecasting often fails due to data semantics (schemas, time meaning, leakage), not model choice.
- Model your sources as sales + calendar + price + promo + inventory/stockouts, then build a stable training/inference view.
- Enforce point-in-time correctness for time-series feature joins to prevent leakage.
- Treat stockouts as censored demand and track them explicitly.

In this Part 2, you’ll get a practical data model and validation rules you can lift into a warehouse/lakehouse.

Why it matters: Forecasting systems are “data products.” If the data contract breaks, no model can rescue production.

1) Data contract goals

Lock four things:

- Target y definition (units vs revenue vs net sales)
- Time granularity + timezone + cutoff/close process
- Feature availability at prediction time (no leakage)
- Separation of zero sales vs unobserved/censored (stockouts, store closed)

Why it matters: Leakage is usually introduced by joins, and it can look great offline while failing in production.

2) Canonical training / inference view (recommended)

Retail benchmarks such as M5 commonly keep sales separate from context tables (calendar, prices), and that split maps cleanly to production modeling.

Recommended columns: ds, series_id, y, is_open, is_listed, stockout_flag, price, promo_flag, event_name.

Rule of thumb: y=0 should mean “sellable but not sold.” If is_open=0 or is_listed=0, treat as missing/excluded.
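
As a minimal sketch of the rule of thumb above, the snippet below turns y into a missing value whenever the item was not sellable. The helper name `training_target` and the sample rows are hypothetical; only the column names come from the recommended view.

```python
from typing import Optional

def training_target(y: float, is_open: int, is_listed: int) -> Optional[float]:
    """Return y on sellable days; return None (missing) when closed or unlisted."""
    if is_open == 0 or is_listed == 0:
        return None  # not sellable: the zero carries no demand signal
    return y

# Hypothetical rows from the canonical view.
rows = [
    {"ds": "2024-01-01", "series_id": "sku1_store1", "y": 0, "is_open": 1, "is_listed": 1},
    {"ds": "2024-01-02", "series_id": "sku1_store1", "y": 0, "is_open": 0, "is_listed": 1},
    {"ds": "2024-01-03", "series_id": "sku1_store1", "y": 5, "is_open": 1, "is_listed": 1},
]

targets = [training_target(r["y"], r["is_open"], r["is_listed"]) for r in rows]
```

The first row keeps its zero because the item was sellable and simply did not sell; the second becomes missing because the store was closed.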

Why it matters: Once this view is stable, you can swap platforms/models without rebuilding everything.

3) Source tables blueprint

| Domain | Table | Key | Notes |
| --- | --- | --- | --- |
| Target | `fact_sales` | ds, sku_id, store_id | observed sales |
| Calendar | `dim_calendar` | ds | holidays/events |
| Price | `fact_price` | sku_id, store_id, effective_from~to | effective-dated |
| Promo | `fact_promo_plan` | sku_id, store_id, start~end | planned vs actual split |
| Inventory | `fact_inventory_snapshot` | as_of_ts, sku_id, store_id | derive stockouts |
| Static | `dim_product`, `dim_store` | ids | categories/regions |
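
To illustrate how an effective-dated table such as `fact_price` could be resolved onto the daily grain, here is a small dependency-free sketch. It assumes the interval is inclusive on both ends and that intervals for one (sku_id, store_id) must not overlap; the function name `price_on` and the sample prices are hypothetical. In a warehouse this would normally be a range join in SQL.

```python
from datetime import date
from typing import Optional

# Hypothetical fact_price rows for a single (sku_id, store_id).
# Assumption: [effective_from, effective_to] is inclusive on both ends.
price_rows = [
    {"effective_from": date(2024, 1, 1), "effective_to": date(2024, 1, 14), "price": 9.99},
    {"effective_from": date(2024, 1, 15), "effective_to": date(2024, 1, 31), "price": 7.99},
]

def price_on(ds: date, rows: list) -> Optional[float]:
    """Return the price in effect on ds, or None if no interval covers it."""
    matches = [r["price"] for r in rows
               if r["effective_from"] <= ds <= r["effective_to"]]
    if len(matches) > 1:
        # Overlapping intervals would silently duplicate rows in a join.
        raise ValueError(f"overlapping price intervals on {ds}")
    return matches[0] if matches else None

boundary_old = price_on(date(2024, 1, 14), price_rows)
boundary_new = price_on(date(2024, 1, 15), price_rows)
uncovered = price_on(date(2024, 2, 1), price_rows)
```

Raising on overlaps, rather than picking one match, keeps the (ds, series_id) key unique downstream.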

Why it matters: Separating these domains lets you manage time semantics and “future-known” features correctly.

4) Time semantics + leakage prevention

Use both:

- event_time: when it actually happened
- as_of_time: when your system knew it

Then enforce point-in-time joins so features reflect what was available at the label time.
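
One way to picture a point-in-time join is a lookup that returns the latest record whose as_of_time is not after the cutoff. The sketch below uses plain Python and a hypothetical promo history for one series; `value_as_of` is an illustrative helper, not an API from any feature store.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one series: (as_of_time, value),
# sorted by as_of_time (when the system learned the value).
promo_history = [
    (datetime(2024, 1, 1, 8), "none"),
    (datetime(2024, 1, 5, 8), "10_pct_off"),
    (datetime(2024, 1, 9, 8), "bogo"),
]

def value_as_of(history, cutoff):
    """Latest value with as_of_time <= cutoff, or None if nothing was known yet."""
    times = [t for t, _ in history]
    idx = bisect_right(times, cutoff)  # count of records known at the cutoff
    return history[idx - 1][1] if idx else None

known_jan6 = value_as_of(promo_history, datetime(2024, 1, 6))
known_before_any = value_as_of(promo_history, datetime(2023, 12, 31))
```

A plain equality join on ds would instead pick up whatever value the table holds today, including corrections that arrived after the prediction was made.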

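For lag/rolling features, generate them with an explicit cutoff so each value uses only observations before the prediction point (don’t backfill from future data). Azure Machine Learning's AutoML documentation covers lag and rolling-window feature concepts for forecasting.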
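
A minimal sketch of cutoff-aware lag and rolling features follows. It assumes a single series stored as a list ordered by ds, with None for missing days; the function name and default window are illustrative choices.

```python
from typing import Optional, Sequence, Tuple

def lag_and_rolling(
    y: Sequence[Optional[float]], t: int, lag: int = 1, window: int = 3
) -> Tuple[Optional[float], Optional[float]]:
    """Features for position t, built only from observations strictly before t."""
    history = y[:t]  # explicit cutoff: nothing at or after t is visible
    lag_value = history[-lag] if len(history) >= lag else None
    recent = [v for v in history[-window:] if v is not None]
    rolling_mean = sum(recent) / len(recent) if recent else None
    return lag_value, rolling_mean

# Hypothetical daily sales; the spike at the last position must not leak
# into the features computed for that same position.
y = [4, 6, 5, 9, 100]
lag_1, mean_3 = lag_and_rolling(y, t=4)
first_lag, first_mean = lag_and_rolling(y, t=0)
```

For multi-step horizons the cutoff would sit at the forecast origin rather than at t, so the lag has to be at least as long as the horizon.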

Why it matters: Many time-series failures trace back to join semantics rather than model architecture.

5) Stockouts and censored demand

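Stockouts censor true demand: observed sales are capped by available inventory, so a model trained on them tends to underestimate systematically. Minimum practice: create stockout_flag, then either exclude flagged rows from training or run a demand-recovery step.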
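
The minimum practice could look like the sketch below. It assumes a hypothetical end-of-day `on_hand` quantity derived from `fact_inventory_snapshot`, flags a stockout when on_hand is zero or less, and leaves days without a snapshot unflagged. The `y_train` column is an illustrative name for the masked target. Real rules are often stricter, for example flagging partial-day stockouts.

```python
def add_stockout_flag(sales_rows, on_hand_by_key):
    """Attach stockout_flag and a masked training target to each sales row."""
    out = []
    for r in sales_rows:
        on_hand = on_hand_by_key.get((r["ds"], r["series_id"]))
        # Assumption: no snapshot means "not flagged", not "in stock".
        flag = 1 if on_hand is not None and on_hand <= 0 else 0
        out.append({
            **r,
            "stockout_flag": flag,
            # Censored days are excluded from training by masking y.
            "y_train": None if flag else r["y"],
        })
    return out

# Hypothetical inputs.
sales = [
    {"ds": "2024-01-01", "series_id": "a", "y": 2},
    {"ds": "2024-01-02", "series_id": "a", "y": 1},
    {"ds": "2024-01-03", "series_id": "a", "y": 4},
]
on_hand = {("2024-01-01", "a"): 3, ("2024-01-02", "a"): 0}

flagged = add_stockout_flag(sales, on_hand)
```

Keeping the original y next to y_train means a later demand-recovery step can still see what was actually sold.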

Why it matters: If you treat stockouts as “zero demand,” you train the model to under-order.

6) Data quality rules as executable tests

Great Expectations formalizes validation rules as Expectation Suites, and can publish results as Data Docs.

Core validations:

- Uniqueness: (ds, series_id)
- Ranges: y>=0, price>0, promo_start<=promo_end
- Semantics: stockout_flag present; closed/unlisted days excluded
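
The core validations can be sketched as plain, library-agnostic Python checks. This is not the Great Expectations API; in that tool the same rules would be expressed as expectations in a suite. The function names and sample rows are hypothetical.

```python
from collections import Counter

def validate_view(rows):
    """Return a list of human-readable violations for canonical-view rows."""
    errors = []
    # Uniqueness: one row per (ds, series_id).
    key_counts = Counter((r["ds"], r["series_id"]) for r in rows)
    errors += [f"duplicate key {k}" for k, n in key_counts.items() if n > 1]
    for r in rows:
        key = (r["ds"], r["series_id"])
        if r["y"] < 0:
            errors.append(f"negative y at {key}")
        if r["price"] <= 0:
            errors.append(f"non-positive price at {key}")
        if "stockout_flag" not in r:
            errors.append(f"stockout_flag missing at {key}")
        if r["is_open"] == 0 or r["is_listed"] == 0:
            errors.append(f"closed/unlisted day not excluded at {key}")
    return errors

def validate_promos(promos):
    """Range check for promo windows: promo_start <= promo_end."""
    return [f"promo_start after promo_end for {p['sku_id']}"
            for p in promos if p["promo_start"] > p["promo_end"]]

good_rows = [
    {"ds": "2024-01-01", "series_id": "a", "y": 2, "price": 5.0,
     "is_open": 1, "is_listed": 1, "stockout_flag": 0},
]
bad_rows = [
    {"ds": "2024-01-01", "series_id": "a", "y": -1, "price": 0.0,
     "is_open": 0, "is_listed": 1},
    {"ds": "2024-01-01", "series_id": "a", "y": 2, "price": 5.0,
     "is_open": 1, "is_listed": 1, "stockout_flag": 0},
]
bad_promos = [{"sku_id": "s1", "promo_start": "2024-02-10", "promo_end": "2024-02-01"}]

good_errors = validate_view(good_rows)
bad_errors = validate_view(bad_rows)
promo_errors = validate_promos(bad_promos)
```

Returning all violations, rather than stopping at the first, makes a failed run easier to triage.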

Why it matters: Automated data tests catch a broken contract at load time, which can shorten incident response from hours to minutes.

Conclusion

- Model your data as sales + calendar + price + promo + inventory/stockouts, then expose a stable canonical view.
- Enforce point-in-time correctness to prevent leakage.
- Track stockouts explicitly because they create censored demand.
- Make quality rules executable via validation suites.

Summary

- Stable schema beats unstable modeling
- Point-in-time joins prevent leakage
- Stockouts are censorship, not “zero demand”
- Quality rules must be automated

Recommended Hashtags

#ai-sales-forecasting #demand-forecasting #time-series #data-modeling #point-in-time #data-quality #retail-analytics #mlops

References

- [M5 Forecasting - Accuracy (Data), Accessed 2026-02-08](https://www.kaggle.com/c/m5-forecasting-accuracy/data)
- [M5 dataset overview (prices, promotions, holidays), Accessed 2026-02-08](https://colab.research.google.com/github/ikyath/M5-Forecasting-Accuracy-Kaggle/blob/master/M5_Forecast_Encoder_Decoder_Final.ipynb/)
- [Point-in-time feature joins - Databricks Docs, Accessed 2026-02-08](https://docs.databricks.com/aws/en/machine-learning/feature-store/time-series)
- [Point-in-time feature joins - Azure Databricks, Accessed 2026-02-08](https://learn.microsoft.com/en-us/azure/databricks/machine-learning/feature-store/time-series)
- [FreshRetailNet-50K: Stockout-annotated censored demand, 2025-05-22](https://arxiv.org/abs/2505.16319)
- [Censored Demand Estimation in Retail, Accessed 2026-02-08](https://dl.acm.org/doi/10.1145/3154489)
- [Expectation Suite - Great Expectations Docs, Accessed 2026-02-08](https://docs.greatexpectations.io/docs/0.18/reference/learn/terms/expectation_suite/)
- [Data Docs - Great Expectations Docs, Accessed 2026-02-08](https://docs.greatexpectations.io/docs/0.18/reference/learn/terms/data_docs)
- [Lag features for forecasting in AutoML, Accessed 2026-02-08](https://learn.microsoft.com/en-us/azure/machine-learning/concept-automl-forecasting-lags?view=azureml-api-2)
- [Time series cross-validation - FPP3, Accessed 2026-02-08](https://otexts.com/fpp3/tscv.html)