Introduction
TL;DR:
- AI sales forecasting models must be evaluated using genuine forecasts on unseen data, not training residuals.
- Use rolling forecasting origin (rolling-origin CV) with explicit choices: horizon, step, window type, and refit policy.
- Report WAPE + MASE (and pinball loss for quantiles) and compare everything against two fixed baselines: seasonal naive + ETS.
In this lecture-style part, you’ll build a backtest setup that matches deployment conditions and produces a decision-ready report.
Why it matters: A “lab-only” backtest is the fastest route to production forecast incidents.
1) Prerequisites
1.1 Lock operational equivalence
- Horizon must match lead time / ordering cadence.
- Decide: expanding vs rolling window, and whether to refit each fold.
FPP3 emphasizes evaluating forecast accuracy using genuine forecasts on new data.
Why it matters: If backtesting conditions differ from ops, offline scores won’t predict online performance.
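To keep these decisions explicit rather than buried in notebook code, it can help to capture them in a single configuration object. The following is a minimal sketch; the `BacktestSpec` name and fields are illustrative choices, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BacktestSpec:
    horizon: int            # forecast steps ahead; should match lead time / ordering cadence
    step: int               # how far the forecast origin advances between folds
    window: Optional[int]   # fixed rolling-window length, or None for an expanding window
    refit: bool             # refit the model at every forecast origin?

# Example: 14-day horizon, weekly re-forecast, expanding window, refit each fold
spec = BacktestSpec(horizon=14, step=7, window=None, refit=True)
```

Freezing the dataclass makes the spec immutable, so the same settings are guaranteed to apply to every fold of the backtest.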
2) Step-by-step backtesting (rolling-origin)
Step 1 — Create rolling-origin folds
FPP3 describes “evaluation on a rolling forecasting origin” where the origin moves forward in time.
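The rolling-origin idea can be sketched in a few lines of plain Python; the `rolling_origin_folds` name and signature below are illustrative, not an existing library API:

```python
def rolling_origin_folds(n_obs, horizon, step=1, window=None, min_train=1):
    """Yield (train_idx, test_idx) pairs as the forecast origin rolls forward.

    window=None  -> expanding window (all history up to the origin)
    window=int   -> rolling window of that fixed length
    """
    origin = max(min_train, window or min_train)
    while origin + horizon <= n_obs:
        start = 0 if window is None else origin - window
        train_idx = list(range(start, origin))
        test_idx = list(range(origin, origin + horizon))
        yield train_idx, test_idx
        origin += step

# 10 observations, 2-step-ahead forecasts, origin advancing by 2 each fold
folds = list(rolling_origin_folds(n_obs=10, horizon=2, step=2, min_train=4))
# every train set ends exactly where its test window begins
```

Each fold trains only on data strictly before the forecast origin, which is what makes the resulting errors genuine out-of-sample forecasts rather than residuals.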
Step 2 — Make refit policy explicit
sktime provides an evaluate utility for time-series CV-style backtesting.
Why it matters: Refit vs no-refit changes the meaning of your results.
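The difference between the two policies can be made concrete with a toy forecaster. The mean "model" below is deliberately trivial and the `backtest` helper is a sketch, not sktime's API; the point is only where fitting happens relative to the folds:

```python
def fit(train):
    # Toy forecaster: predict the historical mean for every future step
    return sum(train) / len(train)

def backtest(series, folds, refit=True):
    model = None
    errors = []
    for train_idx, test_idx in folds:
        if refit or model is None:
            model = fit([series[i] for i in train_idx])  # refit at this origin
        # else: reuse the model fitted at the first origin (no-refit policy)
        errors.extend(abs(series[i] - model) for i in test_idx)
    return sum(errors) / len(errors)

series = [1, 2, 3, 4, 5]
folds = [([0, 1, 2], [3]), ([0, 1, 2, 3], [4])]
refit_score = backtest(series, folds, refit=True)
no_refit_score = backtest(series, folds, refit=False)
```

With refit, later folds benefit from newer training data; without it, the score measures how a frozen model degrades over time. Both are valid questions, but they are different questions.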
3) Two mandatory baselines
- Seasonal naive baseline
- ETS baseline via statsmodels ETSModel
Why it matters: If your AI model can’t beat ETS, it likely doesn’t deserve production complexity.
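The seasonal naive baseline is simple enough to write by hand, which also makes it a useful sanity check on your pipeline. The helper below is an illustrative sketch (for ETS, statsmodels' ETSModel is the natural choice and is not reimplemented here):

```python
def seasonal_naive(history, horizon, season_length):
    """Forecast each future step with the observed value one season earlier."""
    return [history[-season_length + (h % season_length)] for h in range(horizon)]

# Season of length 3: the forecast repeats the last full season
fc = seasonal_naive([1, 2, 3, 4, 5, 6], horizon=4, season_length=3)
# -> [4, 5, 6, 4]
```

If a complex model cannot beat this one-liner on your backtest, that is a strong signal before any ETS comparison even starts.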
4) Metrics that work for sales forecasting
- WAPE (volume-weighted percentage error): Hyndman provides the definition and intuition.
- WAPE in practice: AWS Forecast documents WAPE as an evaluation metric.
- MASE + standard metrics: AutoGluon lists WAPE and MASE among its forecasting metrics.
- Quantile forecasts: evaluate pinball/quantile loss (pinball loss definition).
Why it matters: Retail often has many small/zero values; WAPE/MASE are more stable than MAPE-only reporting.
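All three metrics are short enough to implement directly; the versions below follow the standard definitions (WAPE as sum of absolute errors over sum of absolute actuals, MASE scaled by the in-sample seasonal-naive MAE, pinball loss for a quantile q), though the function names are our own:

```python
def wape(actual, forecast):
    """Weighted absolute percentage error: sum|e| / sum|y| (volume-weighted)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(abs(a) for a in actual)

def mase(actual, forecast, train, season_length=1):
    """Mean absolute error scaled by the in-sample seasonal-naive MAE."""
    naive_mae = sum(abs(train[i] - train[i - season_length])
                    for i in range(season_length, len(train))) / (len(train) - season_length)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / naive_mae

def pinball(actual, forecast, q):
    """Average pinball (quantile) loss at quantile level q."""
    return sum((a - f) * q if a >= f else (f - a) * (1 - q)
               for a, f in zip(actual, forecast)) / len(actual)
```

Note that WAPE stays defined when individual actuals are zero (only the total must be nonzero), which is exactly the failure mode that breaks per-point MAPE on intermittent retail demand.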
5) Verification: report template + release gates
5.1 A minimum decision-ready report
Include:
- Overall WAPE/MASE
- Promo vs non-promo slices
- Top-revenue SKUs slice
- Quantile losses if probabilistic
M5 is a widely cited retail benchmark emphasizing hierarchical and weighted evaluation (WRMSSE).
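Slice-level reporting reduces to accumulating numerators and denominators per segment. A minimal sketch (the `sliced_wape` helper and its input shape are illustrative assumptions):

```python
def sliced_wape(rows):
    """rows: iterable of (segment, actual, forecast) -> per-segment WAPE dict."""
    num, den = {}, {}
    for seg, a, f in rows:
        num[seg] = num.get(seg, 0.0) + abs(a - f)
        den[seg] = den.get(seg, 0.0) + abs(a)
    return {seg: num[seg] / den[seg] for seg in num}

report = sliced_wape([
    ("promo", 10, 8),
    ("promo", 10, 12),
    ("base", 20, 20),
])
# promo WAPE = (2 + 2) / 20 = 0.2; base WAPE = 0.0
```

Because WAPE is a ratio of sums, per-slice values are computed from the raw errors in each slice, never by averaging per-item percentages.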
5.2 Leakage and split correctness
scikit-learn’s TimeSeriesSplit exists to avoid training on the future; keep spacing and ordering constraints in mind.
Why it matters: A single “overall score” hides the exact segments that blow up operations.
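An ordering check like the one below can be run on any fold set, whether it comes from TimeSeriesSplit or a custom splitter. The `check_no_leakage` helper is a sketch of the constraint, not a scikit-learn utility:

```python
def check_no_leakage(folds, gap=0):
    """Every train index must precede every test index by at least `gap` steps."""
    for train_idx, test_idx in folds:
        assert max(train_idx) + gap < min(test_idx), "fold trains on the future"
    return True

# Train ends at index 1, test starts at index 3, one-step gap: OK
check_no_leakage([([0, 1], [3, 4])], gap=1)
```

A nonzero gap matters when features are built from lagged aggregates: without it, the last training rows can encode information that overlaps the test window.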
Troubleshooting
- Offline good, production bad → backtest doesn’t match rolling deployment or cutoff rules.
- TimeSeriesSplit unstable → irregular spacing or panel-series mixing; use per-series backtests or a dedicated forecasting evaluation workflow.
- Promo periods regress → promo features not future-known or mislabeled; gate promo slices explicitly.
Conclusion
- Use rolling-origin backtesting and genuine forecasts.
- Fix two baselines: seasonal naive + ETS.
- Report WAPE/MASE and pinball loss for quantiles, with segment gates.
Summary
- Rolling-origin CV that matches ops
- Explicit window + refit policy
- Two fixed baselines (seasonal naive, ETS)
- WAPE/MASE + quantile loss where relevant
- Segment-level gates (promo, top SKUs)
Recommended Hashtags
#ai-sales-forecasting #demand-forecasting #time-series #backtesting #rolling-origin #wape #mase #ets #mlops
References
- [Time series cross-validation - FPP3](https://otexts.com/fpp3/tscv.html) (accessed 2026-02-09)
- [Evaluating point forecast accuracy - FPP3](https://otexts.com/fpp3/accuracy.html) (accessed 2026-02-09)
- [Forecasting with sktime (rolling evaluation)](https://www.sktime.net/en/latest/examples/01_forecasting.html) (accessed 2026-02-09)
- [evaluate (time-series CV) - sktime](https://www.sktime.net/en/stable/api_reference/auto_generated/sktime.forecasting.model_evaluation.evaluate.html) (accessed 2026-02-09)
- [TimeSeriesSplit - scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) (accessed 2026-02-09)
- [ETSModel - statsmodels](https://www.statsmodels.org/stable/generated/statsmodels.tsa.exponential_smoothing.ets.ETSModel.html) (accessed 2026-02-09)
- [WAPE - Rob Hyndman](https://robjhyndman.com/hyndsight/wape.html) (2025-08-08)
- [Forecast metrics (WAPE/MASE) - AutoGluon](https://auto.gluon.ai/dev/tutorials/timeseries/forecasting-metrics.html) (accessed 2026-02-09)
- [WAPE metric - AWS Forecast](https://docs.aws.amazon.com/forecast/latest/dg/metrics.html) (accessed 2026-02-09)
- [M5 Forecasting - Accuracy - Kaggle](https://www.kaggle.com/competitions/m5-forecasting-accuracy) (accessed 2026-02-09)
- [M5 results and conclusions](https://www.sciencedirect.com/science/article/pii/S0169207021001874) (2022-01-01)
- [Pinball loss definition - Lokad](https://www.lokad.com/pinball-loss-function-definition/) (2012-02-01)