Introduction

  • TL;DR:

    • AI Sales Forecasting often fails due to data semantics (schemas, time meaning, leakage), not model choice.
    • Model your sources as sales + calendar + price + promo + inventory/stockouts, then build a stable training/inference view.
    • Enforce point-in-time correctness for time-series feature joins to prevent leakage.
    • Treat stockouts as censored demand and track them explicitly.

In this Part 2, you’ll get a practical data model and validation rules you can lift into a warehouse/lakehouse.

Why it matters: Forecasting systems are “data products.” If the data contract breaks, no model can rescue production.


1) Data contract goals

Lock four things:

  1. Target y definition (units vs revenue vs net sales)
  2. Time granularity + timezone + cutoff/close process
  3. Feature availability at prediction time (no leakage)
  4. Separation of zero sales vs unobserved/censored (stockouts, store closed)
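
One way to make these four decisions concrete is to write them down as a small, versioned object. The sketch below is only an illustration: `ForecastContract`, `FeatureSpec`, and every field name are hypothetical, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    known_in_advance: bool  # e.g. calendar or planned promos
    min_lag_days: int = 0   # for observed features: how old the value is


@dataclass(frozen=True)
class ForecastContract:
    target: str        # "units", "revenue", or "net_sales"
    grain: str         # e.g. "day"
    timezone: str      # timezone that defines the day boundary
    daily_cutoff: str  # local time after which a day's sales are closed
    features: tuple = ()

    def usable_features(self, horizon_days: int) -> list:
        """Names of features that can be filled `horizon_days` ahead.

        An observed feature lagged by k days is available only when
        k >= horizon; future-known features are always available.
        """
        return [
            f.name
            for f in self.features
            if f.known_in_advance or f.min_lag_days >= horizon_days
        ]
```

With this shape, a 7-day-ahead forecast would keep `y_lag_7` but drop `y_lag_1`, which is the "no leakage" rule expressed as a check rather than a convention.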

Why it matters: Leakage is usually introduced by joins, and it can look great offline while failing in production.


2) Canonical training/inference view

Retail benchmarks such as M5 commonly ship sales separately from context tables (calendar, prices), a split that maps cleanly to production modeling.

Recommended columns: ds, series_id, y, is_open, is_listed, stockout_flag, price, promo_flag, event_name.

Rule of thumb: y=0 should mean “sellable but not sold.” If is_open=0 or is_listed=0, treat as missing/excluded.
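
The rule of thumb can be sketched as a small function over one row of the view. The `CanonicalRow` class and `training_target` helper are hypothetical names; the columns are the recommended ones listed above, and `None` is assumed to stand for "unobserved".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CanonicalRow:
    ds: str
    series_id: str
    y: Optional[float]
    is_open: int
    is_listed: int
    stockout_flag: int
    price: Optional[float]
    promo_flag: int
    event_name: Optional[str]


def training_target(row: CanonicalRow) -> Optional[float]:
    """Return y only when the item was sellable that day.

    A closed store or an unlisted item yields None (missing),
    so a real zero is never confused with "could not be sold".
    """
    if row.is_open == 0 or row.is_listed == 0:
        return None
    return row.y
```

Rows that come back as `None` can then be dropped or masked in the loss, depending on what the downstream model supports.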

Why it matters: Once this view is stable, you can swap platforms/models without rebuilding everything.


3) Source tables blueprint

Domain    | Table                   | Key                                 | Notes
----------+-------------------------+-------------------------------------+------------------------
Target    | fact_sales              | ds, sku_id, store_id                | observed sales
Calendar  | dim_calendar            | ds                                  | holidays/events
Price     | fact_price              | sku_id, store_id, effective_from~to | effective-dated
Promo     | fact_promo_plan         | sku_id, store_id, start~end         | planned vs actual split
Inventory | fact_inventory_snapshot | as_of_ts, sku_id, store_id          | derive stockouts
Static    | dim_product, dim_store  | ids                                 | categories/regions
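
As an illustration of why effective-dated tables need their own lookup logic, here is a minimal sketch of resolving the price for a given day. `PriceRow` and `price_on` are hypothetical names, and the half-open interval (`effective_to` exclusive) is an assumption; the source table may well use inclusive end dates.

```python
from datetime import date
from typing import NamedTuple, Optional


class PriceRow(NamedTuple):
    sku_id: str
    store_id: str
    effective_from: date
    effective_to: date  # assumed exclusive, so adjacent periods never overlap
    price: float


def price_on(rows, sku_id: str, store_id: str, ds: date) -> Optional[float]:
    """Return the price in effect on `ds`, or None if no period covers it."""
    matches = [
        r.price
        for r in rows
        if r.sku_id == sku_id
        and r.store_id == store_id
        and r.effective_from <= ds < r.effective_to
    ]
    if len(matches) > 1:
        # Overlapping periods are a data-contract violation, not a tie to break.
        raise ValueError(f"overlapping price periods for {sku_id}/{store_id} on {ds}")
    return matches[0] if matches else None
```

Raising on overlaps instead of picking a row keeps an ambiguous price from leaking silently into the training view.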

Why it matters: Separating these domains lets you manage time semantics and “future-known” features correctly.


4) Time semantics + leakage prevention

Use both:

  • event_time: when it actually happened
  • as_of_time: when your system knew it

Then enforce point-in-time joins so features reflect what was available at the label time.
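
A point-in-time join boils down to "take the latest feature value whose as_of_time is not after the label time". The sketch below shows that rule for a single series in plain Python. `point_in_time_lookup` is a hypothetical helper, and treating a value recorded exactly at the label time as available (`<=`) is an assumption that you may need to tighten to `<`.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_lookup(feature_history, label_time: datetime):
    """Latest feature value known at `label_time`.

    feature_history: list of (as_of_time, value), sorted by as_of_time.
    Returns None when nothing was known yet.
    """
    times = [t for t, _ in feature_history]
    # Index of the first record strictly after label_time.
    idx = bisect_right(times, label_time)
    return feature_history[idx - 1][1] if idx > 0 else None
```

Feature-store products implement the same idea at table scale; this version is only meant to make the rule easy to test.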

For lag/rolling features, generate them with an explicit cutoff so that no window reaches past the prediction time (don’t backfill from future data). Azure Machine Learning's AutoML documentation covers lag and rolling-window feature concepts for forecasting.
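
To show what an explicit cutoff looks like in practice, here is a minimal sketch for one series. `lag_feature` and `rolling_mean` are hypothetical helpers, and `history` is assumed to be a dict mapping each date to the observed y.

```python
from datetime import date, timedelta


def lag_feature(history, target_ds: date, lag_days: int, cutoff: date):
    """y from `lag_days` before `target_ds`, if it was observed by `cutoff`."""
    source_ds = target_ds - timedelta(days=lag_days)
    if source_ds > cutoff:
        # The value exists only in the future relative to the cutoff.
        return None
    return history.get(source_ds)


def rolling_mean(history, cutoff: date, window_days: int):
    """Mean of y over the `window_days` days ending at `cutoff` (inclusive)."""
    days = [cutoff - timedelta(days=i) for i in range(window_days)]
    values = [history[d] for d in days if history.get(d) is not None]
    return sum(values) / len(values) if values else None
```

Note that `lag_feature` returns `None` even when `history` already contains the later value, which is exactly the situation in backtests where leakage tends to creep in.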

Why it matters: Most time-series failures come from join semantics, not model architecture.


5) Stockouts and censored demand

Stockouts censor true demand: observed sales are capped at available inventory, which introduces systematic underestimation. Minimum practice: create stockout_flag, then either exclude flagged periods from training or run a demand-recovery step.
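
A minimal sketch of the "flag, then exclude" option follows. It assumes one inventory snapshot per day and series with a hypothetical `on_hand` quantity, and it treats a missing snapshot as "not stocked out", which is an assumption you may prefer to handle as unknown instead.

```python
def add_stockout_flag(sales_rows, on_hand_by_key):
    """Attach stockout_flag to each sales row.

    sales_rows: list of dicts with "ds" and "series_id".
    on_hand_by_key: dict mapping (ds, series_id) to on-hand units.
    """
    flagged = []
    for row in sales_rows:
        on_hand = on_hand_by_key.get((row["ds"], row["series_id"]))
        # Assumption: no snapshot means we cannot claim a stockout.
        flag = 1 if on_hand is not None and on_hand <= 0 else 0
        flagged.append({**row, "stockout_flag": flag})
    return flagged


def exclude_censored(rows):
    """Keep only rows whose sales were not capped by a stockout."""
    return [r for r in rows if r["stockout_flag"] == 0]
```

A demand-recovery step would replace `exclude_censored` with an estimate of the uncapped demand; that estimation is out of scope for this sketch.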

Why it matters: If you treat stockouts as “zero demand,” you train the model to under-order.


6) Data quality rules as executable tests

Great Expectations formalizes validation rules as Expectation Suites, and can publish results as Data Docs.

Core validations:

  • Uniqueness: (ds, series_id)
  • Ranges: y>=0, price>0, promo_start<=promo_end
  • Semantics: stockout_flag present; closed/unlisted days excluded
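
The core validations can be sketched as plain functions that return a list of failures. This is deliberately not the Great Expectations API; it is a stdlib stand-in with hypothetical names (`validate_view`, `validate_promos`) that shows what each rule checks before you port it to a validation suite.

```python
from collections import Counter


def validate_view(rows):
    """Return human-readable failures for canonical-view rows (list of dicts)."""
    failures = []
    # Uniqueness: one row per (ds, series_id).
    counts = Counter((r["ds"], r["series_id"]) for r in rows)
    for key, n in counts.items():
        if n > 1:
            failures.append(f"duplicate key {key}")
    for r in rows:
        key = (r["ds"], r["series_id"])
        # Ranges: y >= 0 and price > 0 when present.
        if r.get("y") is not None and r["y"] < 0:
            failures.append(f"negative y at {key}")
        if r.get("price") is not None and r["price"] <= 0:
            failures.append(f"non-positive price at {key}")
        # Semantics: stockout_flag present, closed/unlisted days excluded.
        if r.get("stockout_flag") is None:
            failures.append(f"missing stockout_flag at {key}")
        if r.get("is_open") == 0 or r.get("is_listed") == 0:
            failures.append(f"closed/unlisted row not excluded at {key}")
    return failures


def validate_promos(promos):
    """Check promo_start <= promo_end (ISO date strings or date objects)."""
    return [
        f"promo_start after promo_end for {p['sku_id']}"
        for p in promos
        if p["promo_start"] > p["promo_end"]
    ]
```

Running both functions in the pipeline and failing the job on a non-empty result is one simple way to make the rules executable.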

Why it matters: Automated data tests can shorten incident response, because a failing rule points directly at the broken table, column, and date.


Conclusion

  • Model your data as sales + calendar + price + promo + inventory/stockouts, then expose a stable canonical view.
  • Enforce point-in-time correctness to prevent leakage.
  • Track stockouts explicitly because they create censored demand.
  • Make quality rules executable via validation suites.

Summary

  • Stable schema beats unstable modeling
  • Point-in-time joins prevent leakage
  • Stockouts are censorship, not “zero demand”
  • Quality rules must be automated


References

  • (M5 Forecasting - Accuracy (Data), Accessed 2026-02-08)[https://www.kaggle.com/c/m5-forecasting-accuracy/data]
  • (M5 dataset overview (prices, promotions, holidays), Accessed 2026-02-08)[https://colab.research.google.com/github/ikyath/M5-Forecasting-Accuracy-Kaggle/blob/master/M5_Forecast_Encoder_Decoder_Final.ipynb/]
  • (Point-in-time feature joins - Databricks Docs, Accessed 2026-02-08)[https://docs.databricks.com/aws/en/machine-learning/feature-store/time-series]
  • (Point-in-time feature joins - Azure Databricks, Accessed 2026-02-08)[https://learn.microsoft.com/en-us/azure/databricks/machine-learning/feature-store/time-series]
  • (FreshRetailNet-50K: Stockout-annotated censored demand, 2025-05-22)[https://arxiv.org/abs/2505.16319]
  • (Censored Demand Estimation in Retail, Accessed 2026-02-08)[https://dl.acm.org/doi/10.1145/3154489]
  • (Expectation Suite - Great Expectations Docs, Accessed 2026-02-08)[https://docs.greatexpectations.io/docs/0.18/reference/learn/terms/expectation_suite/]
  • (Data Docs - Great Expectations Docs, Accessed 2026-02-08)[https://docs.greatexpectations.io/docs/0.18/reference/learn/terms/data_docs]
  • (Lag features for forecasting in AutoML, Accessed 2026-02-08)[https://learn.microsoft.com/en-us/azure/machine-learning/concept-automl-forecasting-lags?view=azureml-api-2]
  • (Time series cross-validation - FPP3, Accessed 2026-02-08)[https://otexts.com/fpp3/tscv.html]