Introduction

  • TL;DR:

    • AI Sales Forecasting must be evaluated using genuine forecasts on unseen data, not training residuals.
    • Use rolling forecasting origin (rolling-origin CV) with explicit choices: horizon, step, window type, and refit policy.
    • Report WAPE + MASE (and pinball loss for quantiles) and compare everything against two fixed baselines: seasonal naive + ETS.

In this lecture-style part, you’ll build a backtest setup that matches deployment conditions and produces a decision-ready report.

Why it matters: A “lab-only” backtest is the fastest route to production forecast incidents.


1) Prerequisites

1.1 Lock operational equivalence

  • Horizon must match lead time / ordering cadence.
  • Decide: expanding vs rolling window, and whether to refit each fold.

FPP3 emphasizes evaluating forecast accuracy using genuine forecasts on new data.

Why it matters: If backtesting conditions differ from ops, offline scores won’t predict online performance.
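The choices above are worth pinning down in one place before any fold is cut. The sketch below is illustrative (the class and field names are ours, not from any library); it simply makes horizon, step, window type, and refit policy explicit and validated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacktestConfig:
    horizon: int       # forecast steps ahead; must match lead time / ordering cadence
    step: int          # periods the forecast origin advances between folds
    window: str        # "expanding" or "rolling"
    window_size: int   # minimum (expanding) or fixed (rolling) training length
    refit: bool        # retrain at every fold vs fit once and freeze parameters

    def __post_init__(self):
        if self.window not in ("expanding", "rolling"):
            raise ValueError("window must be 'expanding' or 'rolling'")


# example: 4-week-ahead forecasts, weekly cadence, fixed 104-week training window
cfg = BacktestConfig(horizon=4, step=1, window="rolling", window_size=104, refit=True)
```

Freezing the config as a single object also makes it easy to record alongside backtest results, so offline scores can be traced to the exact evaluation conditions.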


2) Step-by-step backtesting (rolling-origin)

Step 1 — Create rolling-origin folds

FPP3 describes “evaluation on a rolling forecasting origin” where the origin moves forward in time.
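A minimal sketch of such folds, independent of any library (function name and defaults are ours): each fold trains up to the origin, tests on the next `horizon` points, then the origin advances by `step`.

```python
def rolling_origin_folds(n_obs, horizon, step=1, window="expanding", window_size=24):
    """Yield (train_indices, test_indices) pairs with the origin moving forward.

    "expanding" keeps all history up to the origin; "rolling" keeps only the
    most recent window_size observations.
    """
    folds = []
    origin = window_size
    while origin + horizon <= n_obs:
        start = 0 if window == "expanding" else origin - window_size
        train = list(range(start, origin))
        test = list(range(origin, origin + horizon))
        folds.append((train, test))
        origin += step
    return folds


folds = rolling_origin_folds(n_obs=30, horizon=3, step=3, window="expanding", window_size=24)
# in every fold, all test indices lie strictly after all train indices
```

Note that the last fold ends exactly at the final observation; partial final horizons are dropped rather than padded, which keeps fold scores comparable.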

Step 2 — Make refit policy explicit

sktime provides an evaluate function for time-series-CV-style backtesting, with the refit/update policy as an explicit argument.

Why it matters: Refit vs no-refit changes the meaning of your results.
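To see why the policy matters, here is a standalone toy backtest of a "historical mean" forecaster under both policies. It mirrors the refit/no-refit distinction that sktime's evaluate exposes via its strategy argument, but it is an illustration, not that API.

```python
import numpy as np


def backtest_mean(y, horizon, window_size, refit=True):
    """Backtest a trivial historical-mean forecaster.

    refit=True  -> re-estimate the mean from the training window at every fold.
    refit=False -> estimate once at the first origin and freeze the parameter.
    Returns the mean absolute error across folds.
    """
    errors, frozen_mean = [], None
    for origin in range(window_size, len(y) - horizon + 1):
        if refit or frozen_mean is None:
            frozen_mean = float(np.mean(y[origin - window_size:origin]))
        forecast = np.full(horizon, frozen_mean)
        errors.append(np.abs(y[origin:origin + horizon] - forecast).mean())
    return float(np.mean(errors))


# a series with a level shift: freezing parameters should hurt
y = np.array([10, 12, 11, 13, 30, 31, 29, 32, 30, 31], dtype=float)
mae_refit = backtest_mean(y, horizon=1, window_size=4, refit=True)
mae_frozen = backtest_mean(y, horizon=1, window_size=4, refit=False)
```

On this series the frozen model keeps forecasting the pre-shift level, so its error is far larger; report which policy you used, because the two numbers answer different questions about production behavior.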


3) Two mandatory baselines

  • Seasonal naive baseline
  • ETS baseline via statsmodels ETSModel

Why it matters: If your AI model can’t beat ETS, it likely doesn’t deserve production complexity.
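The ETS baseline can come straight from statsmodels' ETSModel; seasonal naive is simple enough to pin down in a few lines. A minimal numpy sketch (function name is ours): repeat the last observed season forward.

```python
import numpy as np


def seasonal_naive(y, horizon, season_length):
    """Forecast each future step with the value from one season earlier.

    Equivalent to tiling the last observed season across the horizon.
    """
    y = np.asarray(y, dtype=float)
    last_season = y[-season_length:]
    reps = int(np.ceil(horizon / season_length))
    return np.tile(last_season, reps)[:horizon]


# two full weekly seasons of daily sales, then a 10-day-ahead forecast
history = [5, 7, 9, 8, 6, 20, 22] * 2
fc = seasonal_naive(history, horizon=10, season_length=7)
```

Fix both baselines once and reuse them in every backtest; a baseline that changes between experiments cannot serve as a release gate.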


4) Metrics that work for sales forecasting

  • WAPE (volume-weighted percentage error): Hyndman provides the definition and intuition.
  • WAPE in practice: AWS Forecast documents WAPE as an evaluation metric.
  • MASE + standard metrics: AutoGluon lists WAPE and MASE among its forecasting metrics.
  • Quantile forecasts: evaluate pinball/quantile loss (see the Lokad definition in the references).

Why it matters: Retail often has many small/zero values; WAPE/MASE are more stable than MAPE-only reporting.
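These three metrics are short enough to implement directly; a numpy sketch following the standard definitions (function names are ours):

```python
import numpy as np


def wape(y_true, y_pred):
    """Weighted absolute percentage error: sum|error| / sum|actual| (volume-weighted)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.abs(y_true - y_pred).sum() / np.abs(y_true).sum())


def mase(y_true, y_pred, y_train, season_length=1):
    """Mean absolute scaled error: MAE scaled by the in-sample seasonal-naive MAE."""
    y_train = np.asarray(y_train, float)
    scale = np.abs(y_train[season_length:] - y_train[:-season_length]).mean()
    err = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)).mean()
    return float(err / scale)


def pinball_loss(y_true, y_pred_q, q):
    """Pinball (quantile) loss for a forecast of quantile q in (0, 1)."""
    diff = np.asarray(y_true, float) - np.asarray(y_pred_q, float)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))
```

Note how pinball loss is asymmetric: at q = 0.9, under-forecasting by 2 units costs 1.8 while over-forecasting by 2 costs only 0.2, which is exactly what makes it suitable for service-level-driven quantile forecasts.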


5) Verification: report template + release gates

5.1 A minimum decision-ready report

Include:

  • Overall WAPE/MASE
  • Promo vs non-promo slices
  • Top-revenue SKUs slice
  • Quantile losses if probabilistic

M5 is a widely cited retail benchmark emphasizing hierarchical and weighted evaluation (WRMSSE).
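A sliced report is a small pandas groupby away. The sketch below assumes a tidy frame with illustrative column names ('segment', 'y_true', 'y_pred'); the schema is ours, not a standard.

```python
import pandas as pd


def wape_report(df):
    """WAPE per segment plus an OVERALL row, as a decision-ready summary."""
    def _wape(g):
        return (g["y_true"] - g["y_pred"]).abs().sum() / g["y_true"].abs().sum()

    rows = {seg: _wape(g) for seg, g in df.groupby("segment")}
    rows["OVERALL"] = _wape(df)
    return pd.Series(rows, name="wape")


df = pd.DataFrame({
    "segment": ["promo", "promo", "non_promo", "non_promo"],
    "y_true":  [100, 50, 10, 40],
    "y_pred":  [90, 60, 12, 38],
})
report = wape_report(df)
```

In this toy example the overall WAPE of 0.12 hides a promo slice roughly 1.7x worse than non-promo, which is exactly the kind of gap a segment gate should catch.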

5.2 Leakage and split correctness

scikit-learn’s TimeSeriesSplit exists precisely so you never train on the future; keep its assumptions in mind (equally spaced, time-ordered rows from a single series).
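A quick check of its ordering guarantee: with the defaults, each successive fold extends the training set forward in time, and every training index precedes every test index. (The `gap` parameter can insert a buffer between train and test when features embed recent lags.)

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered observations
tscv = TimeSeriesSplit(n_splits=3, gap=0)

splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    # no training on the future: all train indices precede all test indices
    assert train_idx.max() < test_idx.min()
```

With 10 samples and 3 splits, the default test size is 10 // (3 + 1) = 2, giving folds that test on indices [4, 5], [6, 7], and [8, 9] while the training window expands.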

Why it matters: A single “overall score” hides the exact segments that blow up operations.


Troubleshooting

  1. Offline good, production bad → backtest doesn’t match rolling deployment or cutoff rules.
  2. TimeSeriesSplit unstable → irregular spacing or panel-series mixing; use per-series backtests or a dedicated forecasting evaluation workflow.
  3. Promo periods regress → promo features not future-known or mislabeled; gate promo slices explicitly.

Conclusion

  • Use rolling-origin backtesting and genuine forecasts.
  • Fix two baselines: seasonal naive + ETS.
  • Report WAPE/MASE and pinball loss for quantiles, with segment gates.

Summary

  • Rolling-origin CV that matches ops
  • Explicit window + refit policy
  • Two fixed baselines (seasonal naive, ETS)
  • WAPE/MASE + quantile loss where relevant
  • Segment-level gates (promo, top SKUs)

#ai-sales-forecasting #demand-forecasting #time-series #backtesting #rolling-origin #wape #mase #ets #mlops

References

  • [Time series cross-validation - FPP3, Accessed 2026-02-09](https://otexts.com/fpp3/tscv.html)
  • [Evaluating point forecast accuracy - FPP3, Accessed 2026-02-09](https://otexts.com/fpp3/accuracy.html)
  • [Forecasting with sktime (rolling evaluation), Accessed 2026-02-09](https://www.sktime.net/en/latest/examples/01_forecasting.html)
  • [evaluate (timeseries CV) - sktime, Accessed 2026-02-09](https://www.sktime.net/en/stable/api_reference/auto_generated/sktime.forecasting.model_evaluation.evaluate.html)
  • [TimeSeriesSplit - scikit-learn, Accessed 2026-02-09](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html)
  • [ETSModel - statsmodels, Accessed 2026-02-09](https://www.statsmodels.org/stable/generated/statsmodels.tsa.exponential_smoothing.ets.ETSModel.html)
  • [WAPE - Rob Hyndman, 2025-08-08](https://robjhyndman.com/hyndsight/wape.html)
  • [Forecast metrics (WAPE/MASE) - AutoGluon, Accessed 2026-02-09](https://auto.gluon.ai/dev/tutorials/timeseries/forecasting-metrics.html)
  • [WAPE metric - AWS Forecast, Accessed 2026-02-09](https://docs.aws.amazon.com/forecast/latest/dg/metrics.html)
  • [M5 Forecasting - Accuracy - Kaggle, Accessed 2026-02-09](https://www.kaggle.com/competitions/m5-forecasting-accuracy)
  • [M5 results and conclusions, 2022-01-01](https://www.sciencedirect.com/science/article/pii/S0169207021001874)
  • [Pinball loss definition - Lokad, 2012-02-01](https://www.lokad.com/pinball-loss-function-definition/)