Demystifying AI Forecasting:
From Black Box to Boardroom Answer

Producing a demand forecast is not the hard part. Explaining it is. When an executive asks "how did it come up with $1.5M for next month?" the answer requires opening the black box - feature matrices, model competition, SHAP attribution, and forecastability scoring. This article walks through every layer.

Paul Ausserer, Marquis Data · May 2026 · 15 min read

The question nobody prepares for

You ran the Python pipeline. The models ran. The FVA numbers look good. The 12-month forward forecast is sitting in the Excel workbook ready to share. You walk into the executive meeting, present the numbers - and then it happens.

"This says $1.5M for May. How did it come up with that? What went into it?"

If your answer is "the algorithm analyzed the historical data," you have already lost the room. Not because the forecast is wrong - it may be excellent - but because an unexplained forecast is an untrustworthy forecast. The finance team will not load it into the budget. The procurement manager will not size the purchase orders against it. The operating partner will not present it to the board.

Producing a forecast is the easy part. Explaining it - in dollar terms, by feature, at every horizon, for every grain - is where most forecasting initiatives either earn credibility or lose it permanently.

This article shares some of the tools we use to answer that question - the feature matrix that shows what the model was actually trained on, the FVA framework for selecting the right model per entity, SHAP values that convert model output into dollar-level attribution per feature, and the forecastability scorecard that determines which entities belong in an automated forecast at all. None of these are exotic. Together they make "how did it come up with that number" a question with a real answer.

A note on methodology
Demand forecasting is a deep field with many valid methodologies. The approaches here reflect what we have seen produce the best practical results in PE-owned manufacturing environments - not a claim that these are the only right answers. There are practitioners with far broader expertise in the discipline. One we have worked with directly and recommend without reservation is Nicolas Vandeput, whose work on demand forecasting and inventory optimization is worth reading by anyone building serious forecasting capability.

What the model actually sees: the training feature matrix

A machine learning demand forecast does not see the business. It sees a table. Each row is one month of data for one grain - a specific combination of facility, end market, and product category. Each column is a feature: a number the model can use to learn the relationship between past conditions and future revenue.

The feature matrix for a typical manufacturing entity contains 30 or more columns, organized into five groups:

  • Revenue history (y actuals and lag features) - the target variable y plus lag_1 through lag_12: what revenue was 1 month ago, 2 months ago, and so on back through the training window
  • Rolling statistics - 3-month and 6-month rolling mean and standard deviation, capturing recent trend and volatility
  • Calendar features - month of year, quarter, year trend, and year-over-year growth rate, giving the model awareness of seasonality and direction
  • Backlog lag features - open order balance 1, 2, 3, and 6 months prior, the most direct leading indicator of near-term revenue
  • Quote activity features - aggregate new-quote value lagged 1, 2, 3, 6, and 12 months, plus a normalized quote index (current month quotes divided by trailing 12-month average) so the model reads whether the pipeline is running hot or cold relative to its own history
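The feature groups above can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline's actual code: the function name `build_feature_rows`, the input series, and the choice to compute `quote_index` from the latest known month are all assumptions, and the calendar columns are omitted for brevity.

```python
from statistics import mean, stdev

def build_feature_rows(revenue, backlog, quotes):
    """Build one feature row per month from three aligned monthly series.

    All inputs are lists ordered oldest month first. The last row returned
    is the prediction row: every feature is known, but y is None.
    """
    rows = []
    n = len(revenue)
    # Start at month 12 so that lag_12 exists for every row.
    for t in range(12, n + 1):
        row = {"y": revenue[t] if t < n else None}
        for k in range(1, 13):                      # lag_1 .. lag_12
            row[f"lag_{k}"] = revenue[t - k]
        for k in (1, 2, 3, 6):                      # open order balance lags
            row[f"backlog_lag{k}"] = backlog[t - k]
        for k in (1, 2, 3, 6, 12):                  # new-quote value lags
            row[f"quote_lag{k}"] = quotes[t - k]
        for w in (3, 6):
            # Rolling stats use only months before t, so the target never leaks in.
            window = revenue[t - w:t]
            row[f"roll_mean_{w}"] = mean(window)
            row[f"roll_std_{w}"] = stdev(window)
        # Assumption: index = latest known month of quotes / trailing 12-month mean.
        row["quote_index"] = quotes[t - 1] / mean(quotes[t - 12:t])
        rows.append(row)
    return rows

# Synthetic series, for illustration only.
revenue = [100.0 + i for i in range(24)]
backlog = [40.0 + i for i in range(24)]
quotes = [10.0] * 24
rows = build_feature_rows(revenue, backlog, quotes)
prediction_row = rows[-1]   # y is None here; the model must estimate it
```

Every feature in a row is read from months strictly before the target month, which is what allows the final row to be fully populated even though its revenue is unknown.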

The final row of the training matrix is the prediction row. All feature columns are known, because they are drawn from prior months, but the target variable y (revenue) is null. That null is what the model must estimate. The table below illustrates the structure for a fictional mid-size industrial entity.

Training Data Shape - Southeast Fabrication · Industrial · Aerospace (Illustrative)
Each row is one month of data for this grain. The model learns from all rows. The last row shows what the feature matrix looks like at prediction time - revenue is unknown and must be estimated.
Column groups: Revenue history (y and its lags) · Calendar and rolling stats (roll_mean_3, yoy_growth) · Backlog (open order lags) · Quotes (dollar amounts and index). The row marked ► is the prediction target.

| Month | y (Actuals) | lag_1 | lag_2 | lag_3 | lag_6 | roll_mean_3 | yoy_growth | backlog_lag1 | backlog_lag2 | backlog_lag3 | quote_lag1 | quote_lag3 | quote_index |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Oct 2024 | 4,210,000 | 3,980,000 | 4,050,000 | 3,860,000 | 3,740,000 | 4,013,333 | +6.2% | 1,640,000 | 1,580,000 | 1,490,000 | 318,000 | 294,000 | 1.04 |
| Nov 2024 | 4,480,000 | 4,210,000 | 3,980,000 | 4,050,000 | 3,820,000 | 4,146,667 | +8.4% | 1,720,000 | 1,640,000 | 1,580,000 | 341,000 | 318,000 | 1.11 |
| Dec 2024 | 5,140,000 | 4,480,000 | 4,210,000 | 3,980,000 | 3,910,000 | 4,610,000 | +10.1% | 1,890,000 | 1,720,000 | 1,640,000 | 362,000 | 341,000 | 1.18 |
| Jan 2025 | 3,620,000 | 5,140,000 | 4,480,000 | 4,210,000 | 4,050,000 | 4,413,333 | +5.3% | 1,380,000 | 1,890,000 | 1,720,000 | 271,000 | 362,000 | 0.88 |
| Feb 2025 | 3,830,000 | 3,620,000 | 5,140,000 | 4,480,000 | 4,120,000 | 4,196,667 | +4.8% | 1,440,000 | 1,380,000 | 1,890,000 | 288,000 | 271,000 | 0.93 |
| Mar 2025 | 4,060,000 | 3,830,000 | 3,620,000 | 5,140,000 | 4,210,000 | 3,836,667 | +5.6% | 1,520,000 | 1,440,000 | 1,380,000 | 304,000 | 288,000 | 0.98 |
| Apr 2025 ► | ? | 4,060,000 | 3,830,000 | 3,620,000 | 4,480,000 | 3,970,000 | +5.2% | 1,610,000 | 1,520,000 | 1,440,000 | 322,000 | 304,000 | 1.02 |
Apr 2025 is the prediction row. All 12 feature columns shown here are fully known from prior months. The model combines them to estimate y, which stays null until the month closes. The full feature matrix also includes lag_12, roll_mean_6, roll_std_3, roll_std_6, quarter, and year trend - 30+ columns in total. quote_index = current month quotes divided by trailing 12-month average (1.0 = on pace, above 1.0 = pipeline running ahead of history).

This is exactly what the gradient boosted models (XGBoost, LightGBM, CatBoost) receive. They do not know what month it is in any human sense - they see numbers in columns. The model learns that when backlog_lag1 is elevated and quote_index is above 1.0, revenue the following month tends to be above average. That pattern, learned from hundreds of rows across the training history, is what produces the forecast.

Model competition, walk-forward validation, and FVA

The Marquis IQ pipeline does not pick a model before seeing the data. It runs all candidate models against each grain independently and lets accuracy determine the winner. The process is walk-forward (expanding window) validation - a rigorous time-series-specific alternative to random cross-validation that respects temporal ordering and prevents data leakage.

Walk-forward validation

In walk-forward validation, the model is first trained on months 1 through N and tested on the months that follow, starting with N+1 and N+2. It is then retrained on months 1 through N+1 and tested on N+2, N+3, and onward. The window expands with each step, and accuracy is measured at 1-month and 6-month lead times across all out-of-sample test windows, so each test window has to reach at least six months past its training cutoff. This mimics the real forecasting environment: the model always trains on the past and predicts the future, never the reverse.
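The expanding-window scheme can be sketched as a small generator. This is an illustrative sketch under stated assumptions: `walk_forward_splits`, `initial_train`, and `horizon` are hypothetical names, and the horizon is set to 6 so that both the 1-month and 6-month lead times fall inside every test window.

```python
def walk_forward_splits(n_months, initial_train, horizon):
    """Yield (train_indices, test_indices) for expanding-window validation.

    Indices are 0-based month positions. Training always ends before
    testing begins, so no future month is ever seen during fitting.
    """
    cutoff = initial_train
    while cutoff + horizon <= n_months:
        train = list(range(cutoff))                    # everything up to the cutoff
        test = list(range(cutoff, cutoff + horizon))   # the next `horizon` months
        yield train, test
        cutoff += 1                                    # expand the window by one month

splits = list(walk_forward_splits(n_months=36, initial_train=24, horizon=6))
for train, test in splits:
    one_month_lead = test[0]    # position scored for the 1-month lead time
    six_month_lead = test[-1]   # position scored for the 6-month lead time
```

With 36 months of history and a 24-month initial window, this yields seven overlapping test windows, each of which contributes one error at each lead time.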

Forecast Value Added (FVA)

The primary model selection criterion is Forecast Value Added: FVA = MAE(Naive) - MAE(Model). A positive FVA means the model outperforms the naive seasonal baseline. A negative FVA means the model is worse than simply repeating last year's same month. FVA is calculated for every model on every grain. The model with the highest positive FVA is selected. If no model beats naive, the naive baseline is used as the forecast - adding complexity for no accuracy gain is not a trade-off worth making.

Why naive is the right floor: the seasonal naive forecast is free, instant, completely transparent, and surprisingly difficult to beat on stable, seasonal businesses. It is not a strawman - it is a genuine benchmark. Any model that earns a negative FVA relative to naive is a liability, not an asset.
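The selection rule can be written out directly from the FVA definition. The sketch below is a hedged illustration rather than the pipeline's implementation: the function names are hypothetical, the model names are only labels, and the forecast values are made up to show one positive and one negative FVA.

```python
def mae(actuals, forecasts):
    """Mean absolute error in the units of the series (dollars here)."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)

def seasonal_naive(history, horizon, season=12):
    """Repeat the value from the same month one year earlier."""
    start = len(history) - season
    return [history[start + (h % season)] for h in range(horizon)]

def select_by_fva(actuals, naive_forecast, model_forecasts):
    """Return (winner, fva_by_model); fall back to naive if nothing beats it."""
    naive_mae = mae(actuals, naive_forecast)
    fva = {name: naive_mae - mae(actuals, f) for name, f in model_forecasts.items()}
    best = max(fva, key=fva.get)
    winner = best if fva[best] > 0 else "naive"
    return winner, fva

# Illustrative numbers only.
history = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210]
actuals = [105, 112, 125]
naive = seasonal_naive(history, horizon=3)          # [100, 110, 120]
candidates = {"xgboost": [104, 113, 124], "ets": [110, 120, 130]}
winner, fva = select_by_fva(actuals, naive, candidates)
fallback, _ = select_by_fva(actuals, naive, {"ets": [110, 120, 130]})
```

In the first call the better candidate has a positive FVA and wins; in the second, the only candidate has a negative FVA, so the naive baseline is kept.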

Accuracy metrics reported per entity

  • MAPE (Mean Absolute Percentage Error) - Average absolute error as % of actual. Interpretable but sensitive to near-zero months. Primary metric for executive reporting.
  • MAE (Mean Absolute Error) - Average absolute dollar error. Used for FVA calculation. Scale-dependent - not comparable across entities of different size.
  • MASE (Mean Absolute Scaled Error) - MAE scaled by the naive baseline error. MASE below 1.0 means the model beats naive. Comparable across entities and horizons.
  • RMSE (Root Mean Square Error) - Penalizes large errors more heavily than MAE. Useful for identifying models that occasionally miss badly vs. consistently miss modestly.
  • Bias (Mean Error, signed) - Average signed error, with error taken as forecast minus actual. Persistent positive bias = model systematically over-forecasts. Persistent negative bias = systematically under-forecasts.
  • FVA (Forecast Value Added) - MAE(Naive) minus MAE(Model). The primary model selection criterion. Positive = model earns its place. Negative = use naive instead.
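All six metrics can be computed from three aligned lists: actuals, model forecasts, and naive forecasts. The following is a minimal sketch with a hypothetical function name. It assumes error is defined as forecast minus actual, and it scales MASE by the naive error over the same test window, which follows the description above but differs from the textbook variant that scales by in-sample naive error.

```python
from math import sqrt

def accuracy_metrics(actuals, forecasts, naive_forecasts):
    """Compute the per-entity accuracy metrics for one test window."""
    n = len(actuals)
    errors = [f - a for a, f in zip(actuals, forecasts)]   # forecast minus actual
    mae = sum(abs(e) for e in errors) / n
    naive_mae = sum(abs(f - a) for a, f in zip(actuals, naive_forecasts)) / n
    return {
        # MAPE divides by the actual, so zero or near-zero months distort it.
        "MAPE": 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actuals)) / n,
        "MAE": mae,
        "MASE": mae / naive_mae,          # below 1.0 means the model beats naive
        "RMSE": sqrt(sum(e * e for e in errors) / n),
        "Bias": sum(errors) / n,          # positive means over-forecasting
        "FVA": naive_mae - mae,           # positive means the model adds value
    }

# Illustrative numbers only.
metrics = accuracy_metrics(
    actuals=[100, 200, 400],
    forecasts=[110, 190, 420],
    naive_forecasts=[90, 220, 380],
)
```

MASE and FVA carry the same information about whether the model beats naive (MASE below 1.0 exactly when FVA is positive), but MASE is a ratio and FVA is in dollars.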

SHAP: the dollar-level attribution that answers the boardroom question

SHAP (SHapley Additive exPlanations) is a game-theory-based method for attributing a model's prediction to individual input features in a way that is both mathematically rigorous and practically interpretable. For gradient boosted tree models (XGBoost, LightGBM, CatBoost), SHAP values are computed efficiently using TreeSHAP, an algorithm that leverages the tree structure to calculate exact attribution values without approximation.

The question SHAP answers is exactly the boardroom question: of the $487K the model forecasted for next month, how much came from recent revenue history, how much from the backlog position, and how much from quoting activity? Each feature gets a dollar-denominated contribution - positive if it pushed the forecast up, negative if it pulled it down. The baseline (the model's average prediction across all training data) plus the sum of all SHAP values equals the final forecast exactly.
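The additivity property can be checked on a toy example. The sketch below is not TreeSHAP: it computes exact Shapley values by brute-force enumeration of feature subsets, which is only feasible for a handful of features. The `toy_model`, the background rows, and the row being explained are all hypothetical stand-ins for a trained model and its training data.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley attribution for one row x, by enumerating feature subsets."""
    names = list(x)
    n = len(names)

    def value(subset):
        # Average prediction when features in `subset` take x's values
        # and all other features take each background row's values.
        total = 0.0
        for b in background:
            total += predict({k: (x[k] if k in subset else b[k]) for k in names})
        return total / len(background)

    baseline = value(frozenset())          # average prediction over the background
    phi = {}
    for i in names:
        others = [k for k in names if k != i]
        contrib = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = frozenset(combo)
                contrib += weight * (value(s | {i}) - value(s))
        phi[i] = contrib
    return baseline, phi

def toy_model(row):
    # Hypothetical stand-in for a trained gradient boosted model.
    bump = 20_000 if row["quote_index"] > 1.0 else 0
    return 0.5 * row["lag_1"] + 0.4 * row["backlog_lag1"] + bump

background = [
    {"lag_1": 400_000, "backlog_lag1": 150_000, "quote_index": 0.9},
    {"lag_1": 500_000, "backlog_lag1": 180_000, "quote_index": 1.1},
    {"lag_1": 450_000, "backlog_lag1": 160_000, "quote_index": 1.0},
]
x = {"lag_1": 480_000, "backlog_lag1": 140_000, "quote_index": 1.2}
baseline, phi = shapley_values(toy_model, x, background)
forecast = toy_model(x)
# Additivity: baseline + sum of contributions reproduces the forecast.
gap = baseline + sum(phi.values()) - forecast
```

Here the below-average backlog receives a negative dollar contribution and the above-average prior-month revenue a positive one, and `gap` is zero up to floating-point rounding.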

SHAP Feature Attribution - Midwest Industrial · Distribution · Commercial (LightGBM, Illustrative)
Dollar contribution of each feature to the M+1 and M+2 forward forecasts. Positive = feature pushed forecast up. Negative = feature pulled forecast down. Baseline + sum of all contributions = final forecast.
| Feature | M+1 ($) | M+1 % | M+2 ($) | M+2 % |
| --- | --- | --- | --- | --- |
| Model Baseline (average prediction) | +512,400 | - | +512,400 | - |
| Revenue: 3 Months Ago | -31,840 | -95.0% | -2,910 | -108.0% |
| Backlog: 6 Months | -28,760 | -86.0% | -29,440 | -1092.0% |
| Year Trend | +14,210 | +42.0% | +14,580 | +541.0% |
| Revenue: 12 Months Ago | +11,320 | +34.0% | +12,860 | +477.0% |
| Backlog: 2 Months | -8,440 | -25.0% | -8,620 | -319.0% |
| Rolling Std (6m) | +7,980 | +24.0% | +8,210 | +304.0% |
| Revenue: 6 Months Ago | +6,870 | +21.0% | +12,630 | +468.0% |
| Rolling Avg (3m) | -5,220 | -16.0% | -4,960 | -184.0% |
| Month of Year | +5,140 | +15.0% | +5,380 | +199.0% |
| Backlog: 1 Month | +4,820 | +14.0% | +3,410 | +126.0% |
| Backlog: 3 Months | -4,490 | -13.0% | -5,020 | -186.0% |
| YoY Growth Rate | +2,870 | +9.0% | -110 | -4.0% |
| Revenue: Prior Month | +2,640 | +8.0% | -4,980 | -184.0% |
| Rolling Avg (6m) | +1,910 | +6.0% | +2,580 | +96.0% |
| Quarter | +310 | +1.0% | +390 | +14.0% |
| = M+1 / M+2 Forecast | +487,720 | - | +516,400 | - |

Reading the SHAP output, the answer to "how did it come up with $487K?" is now precise: the model started from a $512K baseline; the 2-, 3-, and 6-month backlog lags pulled the forecast down by a combined $41K (the backlog position has been running soft); the year trend and 12-month revenue history added back $25K (the business is growing); a below-average rolling 3-month average removed another $5K; and seasonality (Month of Year) added $5K. The remaining features account for the rest, and the baseline plus all contributions equals the forecast exactly.

This is the translation layer that makes AI forecasting defensible. A plant controller can read the SHAP output and say "the model is forecasting lower because my backlog position has deteriorated over the last six months - that makes sense." An executive can see that year trend is consistently adding positive contribution across all grains, confirming the business's growth trajectory is being captured. An anomaly - a feature contributing an unexpectedly large positive or negative value - surfaces immediately and triggers investigation.

Forecastability scoring: not everything belongs in an automated forecast

Before the pipeline selects a model for any entity, it runs a forecastability analysis. The question it answers is not "which model is best?" but "should this entity be in an automated forecast at all?" Applying sophisticated statistical models to demand signals that are too sparse, too erratic, or too short to learn from does not produce better forecasts - it produces confident wrong numbers.

Demand classification: the Syntetos-Boylan framework

The pipeline classifies each entity's demand pattern using two measures: ADI (Average Demand Interval), the average number of periods between non-zero demand occurrences, and CV² (squared coefficient of variation), the standard deviation of non-zero demand sizes divided by their mean, squared, which captures how variable demand size is when demand does occur. Together they define four demand classes:

  • Smooth (low ADI, low CV²) - demand arrives regularly with consistent size. Candidate for ETS, XGBoost, LightGBM, CatBoost.
  • Erratic (low ADI, high CV²) - demand arrives regularly but in highly variable amounts. ETS with manual review; ML models may overfit.
  • Intermittent (high ADI, low CV²) - demand arrives infrequently but consistently sized when it does. Croston's method (SBA variant) is the correct approach.
  • Lumpy (high ADI, high CV²) - demand is both infrequent and highly variable. Automated forecasting is not reliable; reorder-point planning is more appropriate.
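The four classes above can be assigned with a short function. This is a sketch under stated assumptions: the ADI cutoff of 1.32 is the value commonly cited for this classification and is not stated in this article, the CV² cutoff of 0.49 matches the scorecard note below, ADI is approximated as total periods divided by non-zero periods, and `classify_demand` is a hypothetical name.

```python
from statistics import mean, pstdev

ADI_CUTOFF = 1.32   # assumed: commonly cited cutoff, not given in the article
CV2_CUTOFF = 0.49   # above this, demand size is treated as highly variable

def classify_demand(series):
    """Return (class_name, adi, cv2) for one entity's demand history."""
    nonzero = [d for d in series if d > 0]
    if not nonzero:
        raise ValueError("series has no non-zero demand")
    adi = len(series) / len(nonzero)                 # 1.0 = demand every period
    cv2 = (pstdev(nonzero) / mean(nonzero)) ** 2     # variability of non-zero sizes
    if adi < ADI_CUTOFF:
        label = "Smooth" if cv2 < CV2_CUTOFF else "Erratic"
    else:
        label = "Intermittent" if cv2 < CV2_CUTOFF else "Lumpy"
    return label, adi, cv2

# Synthetic demand histories, for illustration only.
smooth = classify_demand([100, 110, 95, 105, 100, 98])[0]
erratic = classify_demand([10, 200, 35, 10, 200, 35])[0]
intermittent = classify_demand([0, 50, 0, 0, 52, 0, 48, 0])[0]
lumpy = classify_demand([0, 10, 0, 0, 200, 0, 0, 35])[0]
```

Only the non-zero periods enter the CV² calculation, so a series with many zero months is not counted as variable merely because it is sparse; sparsity is captured by ADI instead.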

The classification directly determines which models are candidates for selection. Running XGBoost on a lumpy demand entity wastes compute and produces a forecast no planning team should trust. The forecastability scorecard below shows this classification in action across a sample of entities, with completely anonymized part numbers.

Forecastability Detail Scorecard - Sample Entities (Illustrative, Anonymized)
Forecastability score 0-100: above 60 = automated forecast; 35-60 = use with manual review; below 35 = reorder-point planning. SBC Class based on ADI and CV². MASE below 1.0 means the model beats naive.
| SKU | Facility | History | ADI | CV² | SBC Class | MASE | MAPE % | Forecastability | Recommended Approach |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FG-1840 | Site-A | 48mo | 1.000 | 0.451 | Smooth | 0.179 | 3.5% | 91.7 | XGBoost + ETS (strong value) |
| FG-3024 | Site-A | 47mo | 1.000 | 0.207 | Smooth | 0.244 | 9.6% | 92.0 | XGBoost + ETS (strong value) |
| FG-1222 | Site-B | 48mo | 1.000 | 0.388 | Smooth | 0.266 | 64.5% | 88.9 | XGBoost + ETS (moderate value) |
| FG-0220 | Site-A | 46mo | 1.020 | 0.369 | Smooth | 0.523 | 14.0% | 86.6 | ETS (moderate value) |
| FG-2183 | Site-A | 47mo | 1.000 | 0.592 | Erratic | 0.223 | 16.2% | 86.7 | ETS with Manual Review |
| FG-3050 | Site-B | 47mo | 1.040 | 0.646 | Erratic | 0.160 | 11.8% | 86.3 | ETS with Manual Review |
| FG-4420 | Site-B | 34mo | 1.390 | 0.409 | Intermittent | 0.142 | 6.6% | 85.2 | Croston's Method |
| FG-1780 | Site-A | 48mo | 1.000 | 0.056 | Smooth | 0.704 | 17.2% | 85.1 | ETS (moderate value) |
| FG-2940 | Site-A | 44mo | 1.050 | 0.474 | Smooth | 0.555 | 31.5% | 55.0 | ETS - review required |
| FG-0381 | Site-B | 43mo | 1.867 | 0.434 | Lumpy | 0.526 | 52.0% | 28.4 | Reorder-Point Planning |
ADI: Average Demand Interval (1.0 = demand every period). CV²: Squared Coefficient of Variation (above 0.49 = erratic). SBC Class = Syntetos-Boylan Classification. MASE below 1.0 = model beats naive. Forecastability score combines MASE, demand regularity, and history depth. Part numbers are illustrative - not derived from any specific client dataset.
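The routing implied by the score bands and demand classes can be sketched as a simple decision function. The order in which the rules are applied, the function name `recommend_approach`, and the returned labels are assumptions made for illustration. The scoring formula itself is not reproduced, because the article does not specify how MASE, regularity, and history depth are weighted.

```python
def recommend_approach(score, sbc_class):
    """Map a forecastability score (0-100) and SBC class to a planning route."""
    # Below 35, or lumpy demand: leave the automated forecast entirely.
    if sbc_class == "Lumpy" or score < 35:
        return "Reorder-point planning"
    # Intermittent demand goes to Croston rather than a standard model.
    if sbc_class == "Intermittent":
        return "Croston's method"
    # Erratic demand, or a score in the 35-60 band: forecast, then review.
    if sbc_class == "Erratic" or score <= 60:
        return "Forecast with manual review"
    # Above 60 and smooth: fully automated.
    return "Automated forecast"

# Score and class pairs taken from the illustrative scorecard above.
routes = [
    recommend_approach(91.7, "Smooth"),
    recommend_approach(86.7, "Erratic"),
    recommend_approach(85.2, "Intermittent"),
    recommend_approach(55.0, "Smooth"),
    recommend_approach(28.4, "Lumpy"),
]
```

Putting the exclusion rule first means a lumpy entity never reaches a model, regardless of how high its score happens to be.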

Reading the scorecard top to bottom, the pattern is clear: smooth entities with 40+ months of history and MASE well below 1.0 are strong automated forecast candidates. Erratic entities are forecasted but flagged for manual review before loading into the MRP upload. Intermittent entities route to Croston rather than a standard time-series model. Lumpy entities at the bottom of the list are excluded from the automated forecast entirely and routed to reorder-point planning - a more reliable approach for demand that is both rare and unpredictable.
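For the intermittent route, Croston's method with the Syntetos-Boylan Approximation (SBA) can be sketched in a few lines. This is a minimal illustration, not the pipeline's implementation: the smoothing constant `alpha=0.1` and the initialization from the first non-zero demand are assumptions, and other initialization conventions exist.

```python
def croston_sba(demand, alpha=0.1):
    """Per-period demand-rate forecast using Croston's method with SBA.

    Demand size and the interval between demands are smoothed separately
    and updated only in periods where demand is non-zero.
    """
    size = None          # smoothed non-zero demand size
    interval = None      # smoothed number of periods between demands
    periods_since = 0
    for d in demand:
        periods_since += 1
        if d > 0:
            if size is None:
                # Assumed initialization: first demand and its arrival period.
                size, interval = float(d), float(periods_since)
            else:
                size = alpha * d + (1 - alpha) * size
                interval = alpha * periods_since + (1 - alpha) * interval
            periods_since = 0
    if size is None:
        return 0.0       # no demand observed at all
    # SBA multiplies Croston's size/interval ratio by (1 - alpha / 2).
    return (1 - alpha / 2) * size / interval

# Synthetic series: 10 units every third period.
rate = croston_sba([0, 0, 10, 0, 0, 10, 0, 0, 10])
```

The result is an expected demand per period, not a prediction of when the next order will arrive, which is why it suits entities whose orders are infrequent but consistently sized.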

The Excel output: the accountability layer

The pipeline's final deliverable is not the forecast number. It is an Excel workbook that documents every decision made to produce that number - and gives the planning team the tools to validate it, challenge it, or override it. This workbook is the answer to the boardroom question made tangible.

The workbook contains five tabs that tell the complete story:

  • Actuals vs. Forecast - period-by-period actual revenue alongside the model's in-sample fitted values and the out-of-sample forward forecast. The visual gap between actual and fitted is the residual - the part of demand the model did not explain. Large persistent residuals in a specific direction are a signal that something structural changed that the model has not captured.
  • FVA by Model and Grain - the full FVA scorecard showing every model's walk-forward accuracy at 1-month and 6-month horizons. The winning model per grain is highlighted. Grains where no model beats naive are flagged in red with a recommendation to investigate data quality or grain definition before the next run.
  • SHAP Attribution - the dollar-level feature contribution table for M+1 and M+2 for every grain in the run. This is the tab that answers "how did it come up with that number." Paste rows from this tab into a board deck and the forecast becomes auditable.
  • Forecastability Scorecard - every entity classified by demand pattern, scored, and assigned a recommended modeling approach. The planning team uses this tab to decide what goes into the MRP upload versus what gets manual planning intervention.
  • Forward Forecast - the actionable output: 12-month rolling forecast by entity, grain, and horizon, ready for validation and MRP upload. This is the tab the procurement manager acts on.

The workbook is regenerated automatically on every pipeline run. Numbers change as new actuals arrive. The SHAP attribution updates to reflect the current feature values. The FVA scorecard re-evaluates model competition with the expanded history. Nothing in the workbook is stale - it is always the current state of the model's understanding of the business.

Related reading: "From ERP to MRP: Building a Forecasting Competency That Runs Itself" covers the full pipeline architecture. "Why Manufacturing Demand Planning Fails" covers the data foundation this pipeline requires.

Common questions

Questions about demand forecast explainability, model validation, and forecastability scoring.

What is SHAP and why is it the right tool for explaining demand forecasts?
SHAP (SHapley Additive exPlanations) is a method rooted in cooperative game theory that assigns each input feature a contribution value representing its dollar impact on a specific prediction. Unlike feature importance metrics that describe global model behavior across the entire dataset, SHAP produces local explanations specific to each individual prediction. For a demand forecast, this means knowing not just that backlog is generally an important feature, but that for this specific entity in this specific month, the 6-month backlog lag pulled the forecast down by $28,760. For gradient boosted tree models, TreeSHAP computes these values exactly and efficiently, making it practical to run on every grain in every period.
What is walk-forward validation and why does it matter for time series models?
Walk-forward validation - also called expanding window validation or time-series cross-validation - trains the model on historical data through period N and tests it on periods N+1 and beyond, then expands the training window and repeats. Unlike standard k-fold cross-validation, which randomly assigns observations to folds, walk-forward validation never allows the model to train on future data to predict the past. This is critical for time series because standard cross-validation produces optimistic accuracy estimates on temporal data - the model implicitly learns future patterns during training and appears more accurate than it will be in production. Walk-forward validation measures the accuracy the model will actually achieve when deployed.
What is the SBC demand classification and which model should each class use?
The Syntetos-Boylan Classification (SBC) categorizes demand patterns using two dimensions: ADI (Average Demand Interval, measuring demand frequency) and CV² (measuring demand size variability). Smooth demand (ADI near 1.0, low CV²) is regular and consistent - ETS, XGBoost, LightGBM, and CatBoost are all candidates. Erratic demand (ADI near 1.0, high CV²) arrives regularly but in highly variable amounts - ETS is more reliable than ML models that may overfit to the variability. Intermittent demand (high ADI, low CV²) arrives infrequently but consistently sized - Croston's method with the Syntetos-Boylan bias correction (SBA) is the correct statistical approach. Lumpy demand (high ADI, high CV²) is both infrequent and variable - automated statistical forecasting is unreliable and reorder-point planning is more appropriate.
What is the normalized quote index and why include it as a feature?
The normalized quote index is computed as the current month's total new-quote value divided by the trailing 12-month average. A value of 1.0 means quoting activity is exactly on pace with the trailing 12-month average. Above 1.0 means the pipeline is running hot relative to history. Below 1.0 means the pipeline is soft. The value of normalization is that it makes quoting activity comparable across periods and removes the absolute scale of the business - a growing company will have higher raw quote volumes every year, but the normalized index correctly shows whether the pipeline is ahead or behind its own historical pace. As a leading indicator, the quote index is particularly useful for 3-to-6-month horizon forecasts where the backlog lags have already been absorbed into near-term revenue.

A forecast you can explain is a forecast your team will use.

The Marquis IQ forecasting pipeline produces SHAP attribution, FVA validation, forecastability scoring, and the Excel workbook that makes every number defensible - automatically, every period.