Producing a demand forecast is not the hard part. Explaining it is. When an executive asks "how did it come up with $1.5M for next month?" the answer requires opening the black box - feature matrices, model competition, SHAP attribution, and forecastability scoring. This article walks through every layer.
You ran the Python pipeline. The models ran. The FVA numbers look good. The 12-month forward forecast is sitting in the Excel workbook ready to share. You walk into the executive meeting, present the numbers - and then it happens.
"This says $1.5M for May. How did it come up with that? What went into it?"
If your answer is "the algorithm analyzed the historical data," you have already lost the room. Not because the forecast is wrong - it may be excellent - but because an unexplained forecast is an untrustworthy forecast. The finance team will not load it into the budget. The procurement manager will not size the purchase orders against it. The operating partner will not present it to the board.
Producing a forecast is the easy part. Explaining it - in dollar terms, by feature, at every horizon, for every grain - is where most forecasting initiatives either earn credibility or lose it permanently.
This article shares some of the tools we use to answer that question - the feature matrix that shows what the model was actually trained on, the FVA framework for selecting the right model per entity, SHAP values that convert model output into dollar-level attribution per feature, and the forecastability scorecard that determines which entities belong in an automated forecast at all. None of these are exotic. Together they make "how did it come up with that number" a question with a real answer.
A machine learning demand forecast does not see the business. It sees a table. Each row is one month of data for one grain - a specific combination of facility, end market, and product category. Each column is a feature: a number the model can use to learn the relationship between past conditions and future revenue.
The feature matrix for a typical manufacturing entity contains 30 or more columns, organized into five groups:
The final row of the training matrix is the prediction row. All feature columns are known, because they are built from prior months, but the target variable y (revenue) is null. That null is what the model must estimate. The table below illustrates the structure for a fictional mid-size industrial entity.
| Month | y (Actuals) | lag_1 | lag_2 | lag_3 | lag_6 | roll_mean_3 | yoy_growth | backlog_lag1 | backlog_lag2 | backlog_lag3 | quote_lag1 | quote_lag3 | quote_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Oct 2024 | 4,210,000 | 3,980,000 | 4,050,000 | 3,860,000 | 3,740,000 | 4,013,333 | +6.2% | 1,640,000 | 1,580,000 | 1,490,000 | 318,000 | 294,000 | 1.04 |
| Nov 2024 | 4,480,000 | 4,210,000 | 3,980,000 | 4,050,000 | 3,820,000 | 4,146,667 | +8.4% | 1,720,000 | 1,640,000 | 1,580,000 | 341,000 | 318,000 | 1.11 |
| Dec 2024 | 5,140,000 | 4,480,000 | 4,210,000 | 3,980,000 | 3,910,000 | 4,610,000 | +10.1% | 1,890,000 | 1,720,000 | 1,640,000 | 362,000 | 341,000 | 1.18 |
| Jan 2025 | 3,620,000 | 5,140,000 | 4,480,000 | 4,210,000 | 4,050,000 | 4,413,333 | +5.3% | 1,380,000 | 1,890,000 | 1,720,000 | 271,000 | 362,000 | 0.88 |
| Feb 2025 | 3,830,000 | 3,620,000 | 5,140,000 | 4,480,000 | 4,120,000 | 4,196,667 | +4.8% | 1,440,000 | 1,380,000 | 1,890,000 | 288,000 | 271,000 | 0.93 |
| Mar 2025 | 4,060,000 | 3,830,000 | 3,620,000 | 5,140,000 | 4,210,000 | 3,836,667 | +5.6% | 1,520,000 | 1,440,000 | 1,380,000 | 304,000 | 288,000 | 0.98 |
| Apr 2025 ► | ? | 4,060,000 | 3,830,000 | 3,620,000 | 4,480,000 | 3,970,000 | +5.2% | 1,610,000 | 1,520,000 | 1,440,000 | 322,000 | 304,000 | 1.02 |
This is exactly what the gradient boosted models (XGBoost, LightGBM, CatBoost) receive. They do not know what month it is in any human sense - they see numbers in columns. The model learns that when backlog_lag1 is elevated and quote_index is above 1.0, revenue the following month tends to be above average. That pattern, learned from hundreds of rows across the training history, is what produces the forecast.
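As a rough illustration of how such a matrix can be assembled, the sketch below builds lag features and a prediction row from a plain revenue list. The function name `build_feature_rows` is hypothetical, the revenue values are the fictional actuals implied by the table above, and the rolling-mean definition (mean of the three prior months) is an assumption; the pipeline's exact feature definitions are not given here.

```python
from statistics import mean

def build_feature_rows(months, revenue, lags=(1, 2, 3)):
    """Return one feature dict per month, plus a final prediction row with y=None."""
    start = max(max(lags), 3)  # need enough history for every lag and the 3-month mean
    rows = []
    # Iterate one step past the end of history to create the prediction row.
    for t in range(start, len(revenue) + 1):
        row = {
            "month": months[t] if t < len(months) else "next",
            "y": revenue[t] if t < len(revenue) else None,
        }
        for k in lags:
            row[f"lag_{k}"] = revenue[t - k]
        # Assumed definition: mean of the three prior months only,
        # so the target month never leaks into its own features.
        row["roll_mean_3"] = mean(revenue[t - 3:t])
        rows.append(row)
    return rows

months = ["Jul 2024", "Aug 2024", "Sep 2024", "Oct 2024", "Nov 2024",
          "Dec 2024", "Jan 2025", "Feb 2025", "Mar 2025"]
revenue = [3_860_000, 4_050_000, 3_980_000, 4_210_000, 4_480_000,
           5_140_000, 3_620_000, 3_830_000, 4_060_000]

rows = build_feature_rows(months, revenue)
prediction_row = rows[-1]  # y is None; every lag column is already known
```

Every feature in the prediction row comes from months that have already closed, which is why the model can score it before the target month's actuals exist.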
The Marquis IQ pipeline does not pick a model before seeing the data. It runs all candidate models against each grain independently and lets accuracy determine the winner. The process is walk-forward (expanding window) validation - a rigorous time-series-specific alternative to random cross-validation that respects temporal ordering and prevents data leakage.
In walk-forward validation, the model is first trained on months 1 through N and tested on months N+1 and N+2. It is then retrained on months 1 through N+1 and tested on months N+2 and N+3. The window expands with each step, and accuracy is measured at 1-month and 2-month lead times across all out-of-sample test windows. This mimics the real forecasting environment: the model always trains on the past and predicts the future, never the reverse.
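The expanding-window scheme can be sketched as a small index generator. This is a minimal illustration, not the pipeline's code: the function name `walk_forward_splits` and the parameters `min_train` and `horizon` are assumptions made for the example.

```python
def walk_forward_splits(n_obs, min_train, horizon=2):
    """Yield (train_idx, test_idx) pairs for expanding-window validation."""
    # Each step adds one more month to the training window and
    # tests on the `horizon` months that immediately follow it.
    for end in range(min_train, n_obs - horizon + 1):
        train_idx = list(range(0, end))
        test_idx = list(range(end, end + horizon))
        yield train_idx, test_idx

# Ten months of history, at least six months of training data per fold.
splits = list(walk_forward_splits(n_obs=10, min_train=6, horizon=2))
for train_idx, test_idx in splits:
    print(f"train on months 1-{len(train_idx)}, "
          f"test on months {test_idx[0] + 1} and {test_idx[-1] + 1}")
```

Every test index is strictly later than every training index in the same fold, which is the property that separates this scheme from random cross-validation.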
The primary model selection criterion is Forecast Value Added: FVA = MAE(Naive) - MAE(Model), where both errors are measured on the same out-of-sample test windows. A positive FVA means the model outperforms the seasonal naive baseline. A negative FVA means the model is worse than simply repeating last year's same month. FVA is calculated for every model on every grain. The model with the highest positive FVA is selected. If no model beats naive, the naive baseline is used as the forecast - adding complexity for no accuracy gain is not a trade-off worth making.
Why naive is the right floor: the seasonal naive forecast is free, instant, completely transparent, and surprisingly difficult to beat on stable, seasonal businesses. It is not a strawman - it is a genuine benchmark. Any model that earns a negative FVA relative to naive is a liability, not an asset.
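The FVA rule described above can be sketched in a few lines. The helper names (`mae`, `seasonal_naive`, `select_by_fva`) and all numbers below are made up for illustration; only the formula FVA = MAE(Naive) - MAE(Model) and the fall-back-to-naive rule come from the text.

```python
def mae(actual, forecast):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def seasonal_naive(history, horizon, season=12):
    """Repeat the value from the same month one season earlier."""
    return [history[len(history) - season + (h % season)] for h in range(horizon)]

def select_by_fva(actual, naive_forecast, model_forecasts):
    """Return (winner, fva_by_model); fall back to naive if nothing beats it."""
    naive_mae = mae(actual, naive_forecast)
    fva = {name: naive_mae - mae(actual, fc) for name, fc in model_forecasts.items()}
    best = max(fva, key=fva.get, default=None)
    if best is None or fva[best] <= 0:
        return "seasonal_naive", fva
    return best, fva

# Illustrative data: 12 months of history, then 2 months of held-out actuals.
history = [100, 120, 90, 95, 105, 110, 130, 125, 115, 100, 140, 160]
actual = [104, 126]
naive = seasonal_naive(history, horizon=2)  # [100, 120]

winner, fva = select_by_fva(
    actual,
    naive,
    {"model_a": [103, 124], "model_b": [110, 135]},
)
```

In this toy case `model_a` wins with an FVA of 3.5, while `model_b` has a negative FVA and would never be selected, even if it were the only candidate.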
SHAP (SHapley Additive exPlanations) is a game-theory-based method for attributing a model's prediction to individual input features in a way that is both mathematically rigorous and practically interpretable. For gradient boosted tree models (XGBoost, LightGBM, CatBoost), SHAP values are computed efficiently using TreeSHAP, an algorithm that leverages the tree structure to calculate exact attribution values without approximation.
The question SHAP answers is exactly the boardroom question: of the $487K the model forecasted for next month, how much came from recent revenue history, how much from the backlog position, and how much from quoting activity? Each feature gets a dollar-denominated contribution - positive if it pushed the forecast up, negative if it pulled it down. The baseline (the model's average prediction across all training data) plus the sum of all SHAP values equals the final forecast exactly.
| Feature | M+1 ($) | M+1 % | M+2 ($) | M+2 % |
|---|---|---|---|---|
| Model Baseline (average prediction) | +512,400 | - | +512,400 | - |
| Revenue: 3 Months Ago | -31,840 | -95.0% | -2,910 | -108.0% |
| Backlog: 6 Months | -28,760 | -86.0% | -29,440 | -1092.0% |
| Year Trend | +14,210 | +42.0% | +14,580 | +541.0% |
| Revenue: 12 Months Ago | +11,320 | +34.0% | +12,860 | +477.0% |
| Backlog: 2 Months | -8,440 | -25.0% | -8,620 | -319.0% |
| Rolling Std (6m) | +7,980 | +24.0% | +8,210 | +304.0% |
| Revenue: 6 Months Ago | +6,870 | +21.0% | +12,630 | +468.0% |
| Rolling Avg (3m) | -5,220 | -16.0% | -4,960 | -184.0% |
| Month of Year | +5,140 | +15.0% | +5,380 | +199.0% |
| Backlog: 1 Month | +4,820 | +14.0% | +3,410 | +126.0% |
| Backlog: 3 Months | -4,490 | -13.0% | -5,020 | -186.0% |
| YoY Growth Rate | +2,870 | +9.0% | -110 | -4.0% |
| Revenue: Prior Month | +2,640 | +8.0% | -4,980 | -184.0% |
| Rolling Avg (6m) | +1,910 | +6.0% | +2,580 | +96.0% |
| Quarter | +310 | +1.0% | +390 | +14.0% |
| = M+1 / M+2 Forecast | +487,720 | - | +516,400 | - |
Reading the SHAP output, the answer to "how did it come up with $487K?" is now precise: the model started from a $512K baseline. Revenue from three months ago was the largest single drag, at about $32K. Three backlog lags (6, 2, and 3 months) pulled the forecast down by a combined $42K (recent backlog has been running soft), while the year trend and 12-month revenue history added back about $26K (the business is growing). A below-average rolling 3-month average removed another $5K, and seasonality (Month of Year) added $5K. The remaining features make smaller adjustments, and by construction the baseline plus all contributions sums to the forecast.
This is the translation layer that makes AI forecasting defensible. A plant controller can read the SHAP output and say "the model is forecasting lower because my backlog position has deteriorated over the last six months - that makes sense." An executive can see that year trend is consistently adding positive contribution across all grains, confirming the business's growth trajectory is being captured. An anomaly - a feature contributing an unexpectedly large positive or negative value - surfaces immediately and triggers investigation.
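To make the additivity property concrete, the sketch below computes exact Shapley values by brute-force enumeration of feature coalitions for a toy model. This is not TreeSHAP and not the pipeline's code: the function `shapley_values`, the toy `predict` function, and the three-feature data (in $K) are all assumptions chosen so the example runs with the standard library alone. Brute force is only practical for a handful of features, which is why tree models use TreeSHAP instead.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values for one row x, enumerating every feature coalition."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take x's values; the rest are filled from
        # each background row. The average is the expected prediction.
        total = 0.0
        for b in background:
            mixed = [x[i] if i in coalition else b[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    baseline = value(frozenset())  # average prediction over the background rows
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return baseline, phi

# Toy stand-in for a trained model; columns: lag_1, backlog_lag1, quote_index.
def predict(row):
    return 100 + 0.6 * row[0] + 0.5 * row[1] + 40 * row[2]

background = [[480, 150, 0.95], [520, 170, 1.05], [500, 160, 1.00]]
x = [460, 140, 1.10]  # the row being explained

baseline, phi = shapley_values(predict, x, background)
reconstructed = baseline + sum(phi)  # matches predict(x) up to float rounding
```

Each entry of `phi` is in the same units as the prediction, which is what allows a SHAP table to be read in dollars.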
Before the pipeline selects a model for any entity, it runs a forecastability analysis. The question it answers is not "which model is best?" but "should this entity be in an automated forecast at all?" Applying sophisticated statistical models to demand signals that are too sparse, too erratic, or too short to learn from does not produce better forecasts - it produces confidently wrong numbers.
The pipeline classifies each entity's demand pattern using two measures: ADI (Average Demand Interval), the average number of periods between non-zero demand occurrences, and CV² (squared coefficient of variation), which measures how much demand size varies relative to its mean when demand does occur. Together they define four demand classes: smooth, erratic, intermittent, and lumpy.
The classification directly determines which models are candidates for selection. Running XGBoost on a lumpy demand entity wastes compute and produces a forecast no planning team should trust. The forecastability scorecard below shows this classification in action across a sample of entities, with completely anonymized part numbers.
| SKU | Facility | History | ADI | CV² | SBC Class | MASE | MAPE % | Forecastability | Recommended Approach |
|---|---|---|---|---|---|---|---|---|---|
| FG-1840 | Site-A | 48mo | 1.000 | 0.451 | Smooth | 0.179 | 3.5% | 91.7 | XGBoost + ETS (strong value) |
| FG-3024 | Site-A | 47mo | 1.000 | 0.207 | Smooth | 0.244 | 9.6% | 92.0 | XGBoost + ETS (strong value) |
| FG-1222 | Site-B | 48mo | 1.000 | 0.388 | Smooth | 0.266 | 64.5% | 88.9 | XGBoost + ETS (moderate value) |
| FG-0220 | Site-A | 46mo | 1.020 | 0.369 | Smooth | 0.523 | 14.0% | 86.6 | ETS (moderate value) |
| FG-2183 | Site-A | 47mo | 1.000 | 0.592 | Erratic | 0.223 | 16.2% | 86.7 | ETS with Manual Review |
| FG-3050 | Site-B | 47mo | 1.040 | 0.646 | Erratic | 0.160 | 11.8% | 86.3 | ETS with Manual Review |
| FG-4420 | Site-B | 34mo | 1.390 | 0.409 | Intermittent | 0.142 | 6.6% | 85.2 | Croston's Method |
| FG-1780 | Site-A | 48mo | 1.000 | 0.056 | Smooth | 0.704 | 17.2% | 85.1 | ETS (moderate value) |
| FG-2940 | Site-A | 44mo | 1.050 | 0.474 | Smooth | 0.555 | 31.5% | 55.0 | ETS - review required |
| FG-0381 | Site-B | 43mo | 1.867 | 0.434 | Lumpy | 0.526 | 52.0% | 28.4 | Reorder-Point Planning |
Reading the scorecard top to bottom, the pattern is clear: smooth entities with 40+ months of history and MASE well below 1.0 are strong automated forecast candidates. Erratic entities are forecasted but flagged for manual review before loading into the MRP upload. Intermittent entities route to Croston rather than a standard time-series model. Lumpy entities at the bottom of the list are excluded from the automated forecast entirely and routed to reorder-point planning - a more reliable approach for demand that is both rare and unpredictable.
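The ADI and CV² measures described above can be computed directly from a demand history. The sketch below is illustrative only: the function names are hypothetical, ADI is approximated as total periods divided by non-zero periods, and the cutoffs (ADI 1.32, CV² 0.49) are the commonly cited Syntetos-Boylan values. The pipeline's actual cutoffs are not stated and may differ, so individual scorecard rows need not match this sketch.

```python
from statistics import mean, pvariance

def adi_cv2(demand):
    """Return (ADI, CV²) for a sequence of per-period demand quantities."""
    nonzero = [d for d in demand if d > 0]
    if not nonzero:
        raise ValueError("demand history contains no non-zero periods")
    # Common approximation: total periods per period with demand.
    adi = len(demand) / len(nonzero)
    # Squared coefficient of variation of the non-zero demand sizes.
    cv2 = pvariance(nonzero) / mean(nonzero) ** 2
    return adi, cv2

def classify(adi, cv2, adi_cut=1.32, cv2_cut=0.49):
    """Assign a demand class using assumed (commonly cited) cutoffs."""
    if adi < adi_cut:
        return "Smooth" if cv2 < cv2_cut else "Erratic"
    return "Intermittent" if cv2 < cv2_cut else "Lumpy"

# Illustrative history: demand in 4 of 8 periods, with similar order sizes.
demand = [10, 0, 12, 0, 0, 8, 10, 0]
adi, cv2 = adi_cv2(demand)
demand_class = classify(adi, cv2)
```

Passing the cutoffs as parameters keeps the classification rule separate from the measurement, so a team can tune the thresholds without touching the ADI and CV² calculation.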
The pipeline's final deliverable is not the forecast number. It is an Excel workbook that documents every decision made to produce that number - and gives the planning team the tools to validate it, challenge it, or override it. This workbook is the answer to the boardroom question made tangible.
The workbook contains five tabs that tell the complete story:
The workbook is regenerated automatically on every pipeline run. Numbers change as new actuals arrive. The SHAP attribution updates to reflect the current feature values. The FVA scorecard re-evaluates model competition with the expanded history. Nothing in the workbook is stale - it is always the current state of the model's understanding of the business.
Related reading: From ERP to MRP: Building a Forecasting Competency That Runs Itself covers the full pipeline architecture. Why Manufacturing Demand Planning Fails covers the data foundation this pipeline requires.
The Marquis IQ forecasting pipeline produces SHAP attribution, FVA validation, forecastability scoring, and the Excel workbook that makes every number defensible - automatically, every period.