Introduction

  • TL;DR:

    • AI Sales Forecasting must be evaluated using genuine forecasts on unseen data, not training residuals.
    • Use rolling forecasting origin (rolling-origin CV) with explicit choices: horizon, step, window type, and refit policy.
    • Report WAPE + MASE (and pinball loss for quantiles) and compare everything against two fixed baselines: seasonal naive + ETS.

In this lecture-style part, you’ll build a backtest setup that matches deployment conditions and produces a decision-ready report.

Why it matters: A “lab-only” backtest is the fastest route to production forecast incidents.


1) Prerequisites

1.1 Lock operational equivalence

  • Horizon must match lead time / ordering cadence.
  • Decide: expanding vs rolling window, and whether to refit each fold.

FPP3 emphasizes evaluating forecast accuracy using genuine forecasts on new data.

Why it matters: If backtesting conditions differ from ops, offline scores won’t predict online performance.
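The choices above are worth pinning down in one place before any fold is cut. The sketch below is illustrative (the class and field names are ours, not from any library); it simply makes horizon, step, window type, and refit policy explicit and validated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacktestConfig:
    horizon: int       # forecast steps ahead; must match lead time / ordering cadence
    step: int          # periods the forecast origin advances between folds
    window: str        # "expanding" or "rolling"
    window_size: int   # minimum (expanding) or fixed (rolling) training length
    refit: bool        # retrain at every fold vs fit once and freeze parameters

    def __post_init__(self):
        if self.window not in ("expanding", "rolling"):
            raise ValueError("window must be 'expanding' or 'rolling'")


# example: 4-week-ahead forecasts, weekly cadence, fixed 104-week training window
cfg = BacktestConfig(horizon=4, step=1, window="rolling", window_size=104, refit=True)
```

Freezing the config as a single object also makes it easy to record alongside backtest results, so offline scores can be traced to the exact evaluation conditions.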


2) Step-by-step backtesting (rolling-origin)

Step 1 — Create rolling-origin folds

FPP3 describes “evaluation on a rolling forecasting origin” where the origin moves forward in time.
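A minimal sketch of such folds, independent of any library (function name and defaults are ours): each fold trains up to the origin, tests on the next `horizon` points, then the origin advances by `step`.

```python
def rolling_origin_folds(n_obs, horizon, step=1, window="expanding", window_size=24):
    """Yield (train_indices, test_indices) pairs with the origin moving forward.

    "expanding" keeps all history up to the origin; "rolling" keeps only the
    most recent window_size observations.
    """
    folds = []
    origin = window_size
    while origin + horizon <= n_obs:
        start = 0 if window == "expanding" else origin - window_size
        train = list(range(start, origin))
        test = list(range(origin, origin + horizon))
        folds.append((train, test))
        origin += step
    return folds


folds = rolling_origin_folds(n_obs=30, horizon=3, step=3, window="expanding", window_size=24)
# in every fold, all test indices lie strictly after all train indices
```

Note that the last fold ends exactly at the final observation; partial final horizons are dropped rather than padded, which keeps fold scores comparable.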

Step 2 — Make refit policy explicit

sktime provides an evaluate function for time-series-CV-style backtesting, with the refit/update policy as an explicit argument.

Why it matters: Refit vs no-refit changes the meaning of your results.
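To see why the policy matters, here is a standalone toy backtest of a "historical mean" forecaster under both policies. It mirrors the refit/no-refit distinction that sktime's evaluate exposes via its strategy argument, but it is an illustration, not that API.

```python
import numpy as np


def backtest_mean(y, horizon, window_size, refit=True):
    """Backtest a trivial historical-mean forecaster.

    refit=True  -> re-estimate the mean from the training window at every fold.
    refit=False -> estimate once at the first origin and freeze the parameter.
    Returns the mean absolute error across folds.
    """
    errors, frozen_mean = [], None
    for origin in range(window_size, len(y) - horizon + 1):
        if refit or frozen_mean is None:
            frozen_mean = float(np.mean(y[origin - window_size:origin]))
        forecast = np.full(horizon, frozen_mean)
        errors.append(np.abs(y[origin:origin + horizon] - forecast).mean())
    return float(np.mean(errors))


# a series with a level shift: freezing parameters should hurt
y = np.array([10, 12, 11, 13, 30, 31, 29, 32, 30, 31], dtype=float)
mae_refit = backtest_mean(y, horizon=1, window_size=4, refit=True)
mae_frozen = backtest_mean(y, horizon=1, window_size=4, refit=False)
```

On this series the frozen model keeps forecasting the pre-shift level, so its error is far larger; report which policy you used, because the two numbers answer different questions about production behavior.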


3) Two mandatory baselines

  • Seasonal naive baseline
  • ETS baseline via statsmodels ETSModel

Why it matters: If your AI model can’t beat ETS, it likely doesn’t deserve production complexity.
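The ETS baseline can come straight from statsmodels' ETSModel; seasonal naive is simple enough to pin down in a few lines. A minimal numpy sketch (function name is ours): repeat the last observed season forward.

```python
import numpy as np


def seasonal_naive(y, horizon, season_length):
    """Forecast each future step with the value from one season earlier.

    Equivalent to tiling the last observed season across the horizon.
    """
    y = np.asarray(y, dtype=float)
    last_season = y[-season_length:]
    reps = int(np.ceil(horizon / season_length))
    return np.tile(last_season, reps)[:horizon]


# two full weekly seasons of daily sales, then a 10-day-ahead forecast
history = [5, 7, 9, 8, 6, 20, 22] * 2
fc = seasonal_naive(history, horizon=10, season_length=7)
```

Fix both baselines once and reuse them in every backtest; a baseline that changes between experiments cannot serve as a release gate.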


4) Metrics that work for sales forecasting

  • WAPE (volume-weighted percentage error): Hyndman provides the definition and intuition.
  • WAPE in practice: AWS Forecast documents WAPE as an evaluation metric.
  • MASE + standard metrics: AutoGluon lists WAPE and MASE among its forecasting metrics.
  • Quantile forecasts: evaluate pinball/quantile loss (see the Lokad definition in the references).

Why it matters: Retail often has many small/zero values; WAPE/MASE are more stable than MAPE-only reporting.
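These three metrics are short enough to implement directly; a numpy sketch following the standard definitions (function names are ours):

```python
import numpy as np


def wape(y_true, y_pred):
    """Weighted absolute percentage error: sum|error| / sum|actual| (volume-weighted)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.abs(y_true - y_pred).sum() / np.abs(y_true).sum())


def mase(y_true, y_pred, y_train, season_length=1):
    """Mean absolute scaled error: MAE scaled by the in-sample seasonal-naive MAE."""
    y_train = np.asarray(y_train, float)
    scale = np.abs(y_train[season_length:] - y_train[:-season_length]).mean()
    err = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)).mean()
    return float(err / scale)


def pinball_loss(y_true, y_pred_q, q):
    """Pinball (quantile) loss for a forecast of quantile q in (0, 1)."""
    diff = np.asarray(y_true, float) - np.asarray(y_pred_q, float)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))
```

Note how pinball loss is asymmetric: at q = 0.9, under-forecasting by 2 units costs 1.8 while over-forecasting by 2 costs only 0.2, which is exactly what makes it suitable for service-level-driven quantile forecasts.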


5) Verification: report template + release gates

5.1 A minimum decision-ready report

Include:

  • Overall WAPE/MASE
  • Promo vs non-promo slices
  • Top-revenue SKUs slice
  • Quantile losses if probabilistic

M5 is a widely cited retail benchmark emphasizing hierarchical and weighted evaluation (WRMSSE).
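A sliced report is a small pandas groupby away. The sketch below assumes a tidy frame with illustrative column names ('segment', 'y_true', 'y_pred'); the schema is ours, not a standard.

```python
import pandas as pd


def wape_report(df):
    """WAPE per segment plus an OVERALL row, as a decision-ready summary."""
    def _wape(g):
        return (g["y_true"] - g["y_pred"]).abs().sum() / g["y_true"].abs().sum()

    rows = {seg: _wape(g) for seg, g in df.groupby("segment")}
    rows["OVERALL"] = _wape(df)
    return pd.Series(rows, name="wape")


df = pd.DataFrame({
    "segment": ["promo", "promo", "non_promo", "non_promo"],
    "y_true":  [100, 50, 10, 40],
    "y_pred":  [90, 60, 12, 38],
})
report = wape_report(df)
```

In this toy example the overall WAPE of 0.12 hides a promo slice roughly 1.7x worse than non-promo, which is exactly the kind of gap a segment gate should catch.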

5.2 Leakage and split correctness

scikit-learn’s TimeSeriesSplit exists precisely so you never train on the future; keep its assumptions in mind (equally spaced, time-ordered rows from a single series).
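A quick check of its ordering guarantee: with the defaults, each successive fold extends the training set forward in time, and every training index precedes every test index. (The `gap` parameter can insert a buffer between train and test when features embed recent lags.)

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered observations
tscv = TimeSeriesSplit(n_splits=3, gap=0)

splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    # no training on the future: all train indices precede all test indices
    assert train_idx.max() < test_idx.min()
```

With 10 samples and 3 splits, the default test size is 10 // (3 + 1) = 2, giving folds that test on indices [4, 5], [6, 7], and [8, 9] while the training window expands.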

Why it matters: A single “overall score” hides the exact segments that blow up operations.


Troubleshooting

  1. Offline good, production bad → backtest doesn’t match rolling deployment or cutoff rules.
  2. TimeSeriesSplit unstable → irregular spacing or panel-series mixing; use per-series backtests or a dedicated forecasting evaluation workflow.
  3. Promo periods regress → promo features not future-known or mislabeled; gate promo slices explicitly.

Conclusion

  • Use rolling-origin backtesting and genuine forecasts.
  • Fix two baselines: seasonal naive + ETS.
  • Report WAPE/MASE and pinball loss for quantiles, with segment gates.

Summary

  • Rolling-origin CV that matches ops
  • Explicit window + refit policy
  • Two fixed baselines (seasonal naive, ETS)
  • WAPE/MASE + quantile loss where relevant
  • Segment-level gates (promo, top SKUs)

#ai-sales-forecasting #demand-forecasting #time-series #backtesting #rolling-origin #wape #mase #ets #mlops

References

  • [Time series cross-validation - FPP3, Accessed 2026-02-09](https://otexts.com/fpp3/tscv.html)
  • [Evaluating point forecast accuracy - FPP3, Accessed 2026-02-09](https://otexts.com/fpp3/accuracy.html)
  • [Forecasting with sktime (rolling evaluation), Accessed 2026-02-09](https://www.sktime.net/en/latest/examples/01_forecasting.html)
  • [evaluate (timeseries CV) - sktime, Accessed 2026-02-09](https://www.sktime.net/en/stable/api_reference/auto_generated/sktime.forecasting.model_evaluation.evaluate.html)
  • [TimeSeriesSplit - scikit-learn, Accessed 2026-02-09](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html)
  • [ETSModel - statsmodels, Accessed 2026-02-09](https://www.statsmodels.org/stable/generated/statsmodels.tsa.exponential_smoothing.ets.ETSModel.html)
  • [WAPE - Rob Hyndman, 2025-08-08](https://robjhyndman.com/hyndsight/wape.html)
  • [Forecast metrics (WAPE/MASE) - AutoGluon, Accessed 2026-02-09](https://auto.gluon.ai/dev/tutorials/timeseries/forecasting-metrics.html)
  • [WAPE metric - AWS Forecast, Accessed 2026-02-09](https://docs.aws.amazon.com/forecast/latest/dg/metrics.html)
  • [M5 Forecasting - Accuracy - Kaggle, Accessed 2026-02-09](https://www.kaggle.com/competitions/m5-forecasting-accuracy)
  • [M5 results and conclusions, 2022-01-01](https://www.sciencedirect.com/science/article/pii/S0169207021001874)
  • [Pinball loss definition - Lokad, 2012-02-01](https://www.lokad.com/pinball-loss-function-definition/)