Demystifying AI Forecasting: Black Box to Boardroom

The question nobody prepares for

You ran the Python pipeline. The models ran. The FVA numbers look good. The 12-month forward forecast is sitting in the Excel workbook ready to share. You walk into the executive meeting, present the numbers - and then it happens.

"This says $1.5M for May. How did it come up with that? What went into it?"

If your answer is "the algorithm analyzed the historical data," you have already lost the room. Not because the forecast is wrong - it may be excellent - but because an unexplained forecast is an untrustworthy forecast. The finance team will not load it into the budget. The procurement manager will not size the purchase orders against it. The operating partner will not present it to the board.

Producing a forecast is the easy part. Explaining it - in dollar terms, by feature, at every horizon, for every grain - is where most forecasting initiatives either earn credibility or lose it permanently.

This article shares some of the tools we use to answer that question - the feature matrix that shows what the model was actually trained on, the FVA framework for selecting the right model per entity, SHAP values that convert model output into dollar-level attribution per feature, and the forecastability scorecard that determines which entities belong in an automated forecast at all. None of these are exotic. Together they make "how did it come up with that number" a question with a real answer.

A note on methodology

Demand forecasting is a deep field with many valid methodologies. The approaches here reflect what we have seen produce the best practical results in PE-owned manufacturing environments - not a claim that these are the only right answers. There are practitioners with far broader expertise in the discipline. One we have worked with directly and recommend without reservation is Nicolas Vandeput, whose work on demand forecasting and inventory optimization is worth reading by anyone building serious forecasting capability.

What the model actually sees: the training feature matrix

A machine learning demand forecast does not see the business. It sees a table. Each row is one month of data for one grain - a specific combination of facility, end market, and product category. Each column is a feature: a number the model can use to learn the relationship between past conditions and future revenue.

The feature matrix for a typical manufacturing entity contains 30 or more columns, organized into five groups:

Revenue history (y actuals and lag features) - the target variable y plus lag_1 through lag_12: what revenue was 1 month ago, 2 months ago, and so on back through the training window
Rolling statistics - 3-month and 6-month rolling mean and standard deviation, capturing recent trend and volatility
Calendar features - month of year, quarter, year trend, and year-over-year growth rate, giving the model awareness of seasonality and direction
Backlog lag features - open order balance 1, 2, 3, and 6 months prior, the most direct leading indicator of near-term revenue
Quote activity features - aggregate new-quote value lagged 1, 2, 3, 6, and 12 months, plus a normalized quote index (current month quotes divided by trailing 12-month average) so the model reads whether the pipeline is running hot or cold relative to its own history

The final row of the training matrix is the prediction row. All feature columns are known - they are prior months - but the target variable y (revenue) is null. That null is what the model must estimate. The table below illustrates the structure for a fictional mid-size industrial entity.

Training Data Shape - Southeast Fabrication · Industrial · Aerospace (Illustrative)

Each row is one month of data for this grain. The model learns from all rows. The last row shows what the feature matrix looks like at prediction time - revenue is unknown and must be estimated.

Column groups: Revenue history (actuals and lags of y) · Calendar (rolling stats and YoY) · Backlog (open order lags) · Quote lags (dollar amounts and index) · Prediction target

Month	y (Actuals)	lag_1	lag_2	lag_3	lag_6	roll_mean_3	yoy_growth	backlog_lag1	backlog_lag2	backlog_lag3	quote_lag1	quote_lag2	quote_index
Oct 2024	4,210,000	3,980,000	4,050,000	3,860,000	3,740,000	3,963,333	+6.2%	1,640,000	1,580,000	1,490,000	318,000	294,000	1.04
Nov 2024	4,480,000	4,210,000	3,980,000	4,050,000	3,820,000	4,080,000	+8.4%	1,720,000	1,640,000	1,580,000	341,000	318,000	1.11
Dec 2024	5,140,000	4,480,000	4,210,000	3,980,000	3,910,000	4,223,333	+10.1%	1,890,000	1,720,000	1,640,000	362,000	341,000	1.18
Jan 2025	3,620,000	5,140,000	4,480,000	4,210,000	3,860,000	4,610,000	+5.3%	1,380,000	1,890,000	1,720,000	271,000	362,000	0.88
Feb 2025	3,830,000	3,620,000	5,140,000	4,480,000	4,050,000	4,413,333	+4.8%	1,440,000	1,380,000	1,890,000	288,000	271,000	0.93
Mar 2025	4,060,000	3,830,000	3,620,000	5,140,000	3,980,000	4,196,667	+5.6%	1,520,000	1,440,000	1,380,000	304,000	288,000	0.98
Apr 2025 ►	?	4,060,000	3,830,000	3,620,000	4,210,000	3,836,667	+5.2%	1,610,000	1,520,000	1,440,000	322,000	304,000	1.02

Apr 2025 is the prediction row. All 12 feature columns shown here are fully known from prior months. The model combines them to estimate y. Revenue is null until the month closes. The full feature matrix also includes lag_12, roll_mean_6, roll_std_3 and roll_std_6, quarter, and year trend - 30+ columns in total. quote_index = current month quotes divided by trailing 12-month average (1.0 = on pace, above 1.0 = pipeline running ahead of history).

This is exactly what the gradient boosted models (XGBoost, LightGBM, CatBoost) receive. They do not know what month it is in any human sense - they see numbers in columns. The model learns that when backlog_lag1 is elevated and quote_index is above 1.0, revenue the following month tends to be above average. That pattern, learned from hundreds of rows across the training history, is what produces the forecast.
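The revenue-history and calendar columns described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline's actual code: the function names `build_row` and `build_matrix` are hypothetical, the actuals are read from the illustrative table (Jul-Sep 2024 come from the Oct 2024 row's lag columns), and the rolling statistics are assumed to use only months before the row's own month so that the prediction row can be filled in. Backlog and quote lags would follow the same pattern on their own series.

```python
from statistics import mean, stdev

# Monthly actuals from the illustrative table (Jul 2024 - Mar 2025).
actuals = [
    ("2024-07", 3_860_000), ("2024-08", 4_050_000), ("2024-09", 3_980_000),
    ("2024-10", 4_210_000), ("2024-11", 4_480_000), ("2024-12", 5_140_000),
    ("2025-01", 3_620_000), ("2025-02", 3_830_000), ("2025-03", 4_060_000),
]


def build_row(history, month, lags=(1, 2, 3, 6), window=3):
    """Build one feature row for `month` using only the months before it.

    `history` is a chronological list of (month, revenue) pairs that ends
    with the month immediately before `month`.
    """
    values = [v for _, v in history]
    row = {"month": month, "y": None}  # target unknown until filled in
    for k in lags:
        # lag_k = revenue k months before this row (None if history is too short)
        row[f"lag_{k}"] = values[-k] if len(values) >= k else None
    recent = values[-window:]
    enough = len(recent) == window
    row[f"roll_mean_{window}"] = mean(recent) if enough else None
    row[f"roll_std_{window}"] = stdev(recent) if enough else None
    month_num = int(month.split("-")[1])
    row["month_of_year"] = month_num
    row["quarter"] = (month_num - 1) // 3 + 1
    return row


def build_matrix(series, next_month):
    """Training rows for every month in `series`, plus the prediction row."""
    rows = []
    for i, (month, y) in enumerate(series):
        row = build_row(series[:i], month)
        row["y"] = y  # known actual for a training row
        rows.append(row)
    rows.append(build_row(series, next_month))  # y stays None
    return rows


matrix = build_matrix(actuals, "2025-04")
prediction_row = matrix[-1]
print(prediction_row)
```

Early rows have `None` in the longer lags because there is not enough history yet; a real pipeline would typically drop or impute those rows before training.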

Model competition, walk-forward validation, and FVA

The Marquis IQ pipeline does not pick a model before seeing the data. It runs all candidate models against each grain independently and lets accuracy determine the winner. The process is walk-forward (expanding window) validation - a rigorous time-series-specific alternative to random cross-validation that respects temporal ordering and prevents data leakage.

Walk-forward validation

In walk-forward validation, the model is trained on the first N months and tested on the months immediately after, for example N+1 and N+2. It is then retrained on months 1 through N+1 and tested on N+2 and N+3. The training window expands with each step, and accuracy is measured by lead time (1-month and 6-month in this pipeline's reporting) across all out-of-sample test windows. This mimics the real forecasting environment: the model always trains on the past and predicts the future, never the reverse.
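The expanding-window procedure described above can be sketched as a small generator. The function name `walk_forward_splits` and its parameters are illustrative assumptions, not part of any library; indices are zero-based positions in the monthly series.

```python
def walk_forward_splits(n_months, initial_train, horizon):
    """Yield (train_idx, test_idx) pairs for an expanding-window evaluation.

    Training always starts at the first month and grows by one month per
    step; the test window is the `horizon` months immediately after it.
    """
    train_end = initial_train
    while train_end + horizon <= n_months:
        train_idx = list(range(0, train_end))
        test_idx = list(range(train_end, train_end + horizon))
        yield train_idx, test_idx
        train_end += 1


# Eight months of history, first fit on five months, two-month test window.
splits = list(walk_forward_splits(n_months=8, initial_train=5, horizon=2))
for train_idx, test_idx in splits:
    print(f"train months 1-{len(train_idx)}, "
          f"test months {test_idx[0] + 1}-{test_idx[-1] + 1}")
```

Every test index is strictly later than every training index, which is the property that keeps future information out of the fit.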

Forecast Value Added (FVA)

The primary model selection criterion is Forecast Value Added: FVA = MAE(Naive) - MAE(Model). A positive FVA means the model outperforms the naive seasonal baseline. A negative FVA means the model is worse than simply repeating last year's same month. FVA is calculated for every model on every grain. The model with the highest positive FVA is selected. If no model beats naive, the naive baseline is used as the forecast - adding complexity for no accuracy gain is not a trade-off worth making.

Why naive is the right floor: the seasonal naive forecast is free, instant, completely transparent, and surprisingly difficult to beat on stable, seasonal businesses. It is not a strawman - it is a genuine benchmark. Any model that earns a negative FVA relative to naive is a liability, not an asset.
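A minimal sketch of the selection rule follows, assuming out-of-sample forecasts have already been collected for each candidate. The function names and all numbers are hypothetical and chosen only to show both outcomes: a model that earns its place and a fallback to naive.

```python
def mae(actuals, forecasts):
    """Mean absolute error over paired actuals and forecasts."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)


def select_model(actuals, naive_forecast, model_forecasts):
    """Return (winner, fva_by_model) using FVA = MAE(naive) - MAE(model).

    `model_forecasts` maps model name to its out-of-sample forecasts and
    must contain at least one model. Falls back to "naive" when no model
    has a positive FVA.
    """
    naive_mae = mae(actuals, naive_forecast)
    fva = {name: naive_mae - mae(actuals, fc)
           for name, fc in model_forecasts.items()}
    best = max(fva, key=fva.get)
    winner = best if fva[best] > 0 else "naive"
    return winner, fva


# Hypothetical walk-forward results for one grain.
actuals = [100, 120, 90, 110]
naive = [95, 110, 100, 100]  # e.g. same month last year
candidates = {
    "model_a": [102, 118, 93, 108],
    "model_b": [80, 140, 70, 130],
}

winner, fva = select_model(actuals, naive, candidates)
print(winner, fva)

# With only the weaker model in the race, naive wins.
fallback, _ = select_model(actuals, naive, {"model_b": candidates["model_b"]})
print(fallback)
```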

Accuracy metrics reported per entity

MAPE (Mean Absolute Percentage Error) - Average absolute error as % of actual. Interpretable but sensitive to near-zero months. Primary metric for executive reporting.
MAE (Mean Absolute Error) - Average absolute dollar error. Used for FVA calculation. Scale-dependent - not comparable across entities of different size.
MASE (Mean Absolute Scaled Error) - MAE scaled by the naive baseline error. MASE below 1.0 means the model beats naive. Comparable across entities and horizons.
RMSE (Root Mean Square Error) - Penalizes large errors more heavily than MAE. Useful for identifying models that occasionally miss badly vs. consistently miss modestly.
Bias (Mean Error, signed) - Average signed error, measured as forecast minus actual. Persistent positive bias = model systematically over-forecasts. Persistent negative bias = systematically under-forecasts.
FVA (Forecast Value Added) - MAE(Naive) minus MAE(Model). The primary model selection criterion. Positive = model earns its place. Negative = use naive instead.
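The metrics above reduce to a few lines of arithmetic. The sketch below makes two labeled assumptions: error is taken as forecast minus actual (so positive bias means over-forecasting), and MASE is scaled by the naive forecast's MAE on the same test window, which is one common simplification of the textbook definition. The function name and sample numbers are hypothetical.

```python
from math import sqrt


def accuracy_report(actuals, forecasts, naive_forecasts):
    """Per-entity accuracy metrics for one model on one test window.

    MAPE divides by each actual, so months with zero actuals must be
    excluded or handled separately before calling this.
    """
    n = len(actuals)
    errors = [f - a for a, f in zip(actuals, forecasts)]  # forecast minus actual
    mae = sum(abs(e) for e in errors) / n
    naive_mae = sum(abs(f - a) for a, f in zip(actuals, naive_forecasts)) / n
    return {
        "MAPE_pct": 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actuals)) / n,
        "MAE": mae,
        "RMSE": sqrt(sum(e * e for e in errors) / n),
        "Bias": sum(errors) / n,
        "MASE": mae / naive_mae,  # below 1.0 means the model beats naive
    }


report = accuracy_report(
    actuals=[100, 200, 400],
    forecasts=[110, 190, 420],
    naive_forecasts=[90, 220, 360],
)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Under this scaling, a MASE below 1.0 and a positive FVA are the same statement expressed as a ratio and as a difference.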

SHAP: the dollar-level attribution that answers the boardroom question

SHAP (SHapley Additive exPlanations) is a game-theory-based method for attributing a model's prediction to individual input features in a way that is both mathematically rigorous and practically interpretable. For gradient boosted tree models (XGBoost, LightGBM, CatBoost), SHAP values are computed efficiently using TreeSHAP, an algorithm that leverages the tree structure to calculate exact attribution values without approximation.

The question SHAP answers is exactly the boardroom question: of the $487K the model forecasted for next month, how much came from recent revenue history, how much from the backlog position, and how much from quoting activity? Each feature gets a dollar-denominated contribution - positive if it pushed the forecast up, negative if it pulled it down. The baseline (the model's average prediction across all training data) plus the sum of all SHAP values equals the final forecast exactly.

SHAP Feature Attribution - Midwest Industrial · Distribution · Commercial (LightGBM, Illustrative)

Dollar contribution of each feature to the M+1 and M+2 forward forecasts. Positive = feature pushed forecast up. Negative = feature pulled forecast down. Baseline + sum of all contributions = final forecast.

Feature	M+1 ($)	M+1 %	M+2 ($)	M+2 %
Model Baseline (average prediction)	+512,400	-	+512,400	-
Revenue: 3 Months Ago	-31,840	-95.0%	-2,910	-108.0%
Backlog: 6 Months	-28,760	-86.0%	-29,440	-1092.0%
Year Trend	+14,210	+42.0%	+14,580	+541.0%
Revenue: 12 Months Ago	+11,320	+34.0%	+12,860	+477.0%
Backlog: 2 Months	-8,440	-25.0%	-8,620	-319.0%
Rolling Std (6m)	+7,980	+24.0%	+8,210	+304.0%
Revenue: 6 Months Ago	+6,870	+21.0%	+12,630	+468.0%
Rolling Avg (3m)	-5,220	-16.0%	-4,960	-184.0%
Month of Year	+5,140	+15.0%	+5,380	+199.0%
Backlog: 1 Month	+4,820	+14.0%	+3,410	+126.0%
Backlog: 3 Months	-4,490	-13.0%	-5,020	-186.0%
YoY Growth Rate	+2,870	+9.0%	-110	-4.0%
Revenue: Prior Month	+2,640	+8.0%	-4,980	-184.0%
Rolling Avg (6m)	+1,910	+6.0%	+2,580	+96.0%
Quarter	+310	+1.0%	+390	+14.0%
= M+1 / M+2 Forecast	+487,720	-	+516,400	-

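Reading the SHAP output, the answer to "how did it come up with $487K?" is now precise. The model started from a $512K baseline. Revenue from three months ago was the largest single drag at -$32K. The three negative backlog lags (6, 2, and 3 months) pulled the forecast down by a combined $42K, partly offset by +$5K from the most recent backlog month (backlog over the prior two to six months has been running soft). The year trend and 12-month revenue history added back $25K (the business is growing), and a below-average rolling 3-month average removed another $5K. Seasonality (Month of Year) added $5K. In a real TreeSHAP output the baseline plus all contributions sums to the forecast exactly; in this illustrative table the M+2 column reconciles to the dollar, while the M+1 rows as listed sum to within about $4K of the stated $487,720.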
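The reconciliation is plain addition, and it can be checked directly against the table's M+2 column. The sketch below hard-codes those illustrative values rather than calling a model; with the `shap` library, the baseline would typically come from `TreeExplainer.expected_value` and the per-feature figures from `shap_values`, but that wiring is an assumption about how the pipeline is built. The grouping by the text before the colon is only a convenience for rolling features up to board-level drivers.

```python
baseline = 512_400

# M+2 contributions copied from the illustrative SHAP table.
m2_contributions = {
    "Revenue: 3 Months Ago": -2_910,
    "Backlog: 6 Months": -29_440,
    "Year Trend": 14_580,
    "Revenue: 12 Months Ago": 12_860,
    "Backlog: 2 Months": -8_620,
    "Rolling Std (6m)": 8_210,
    "Revenue: 6 Months Ago": 12_630,
    "Rolling Avg (3m)": -4_960,
    "Month of Year": 5_380,
    "Backlog: 1 Month": 3_410,
    "Backlog: 3 Months": -5_020,
    "YoY Growth Rate": -110,
    "Revenue: Prior Month": -4_980,
    "Rolling Avg (6m)": 2_580,
    "Quarter": 390,
}

# Additivity: baseline plus all contributions equals the forecast.
forecast = baseline + sum(m2_contributions.values())

# Rank features by absolute dollar impact for the talking points.
drivers = sorted(m2_contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Roll lag features up into families ("Revenue", "Backlog", ...).
groups = {}
for name, dollars in m2_contributions.items():
    family = name.split(":")[0]
    groups[family] = groups.get(family, 0) + dollars

print(f"M+2 forecast: {forecast:,}")
print("Largest driver:", drivers[0])
print("Backlog net:", groups["Backlog"], "Revenue net:", groups["Revenue"])
```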

This is the translation layer that makes AI forecasting defensible. A plant controller can read the SHAP output and say "the model is forecasting lower because my backlog position has deteriorated over the last six months - that makes sense." An executive can see that year trend is consistently adding positive contribution across all grains, confirming the business's growth trajectory is being captured. An anomaly - a feature contributing an unexpectedly large positive or negative value - surfaces immediately and triggers investigation.

Forecastability scoring: not everything belongs in an automated forecast

Before the pipeline selects a model for any entity, it runs a forecastability analysis. The question it answers is not "which model is best?" but "should this entity be in an automated forecast at all?" Applying sophisticated statistical models to demand signals that are too sparse, too erratic, or too short to learn from does not produce better forecasts - it produces confident wrong numbers.

Demand classification: the Syntetos-Boylan framework

The pipeline classifies each entity's demand pattern using two measures: ADI (Average Demand Interval), the average number of periods between non-zero demand occurrences, and CV² (squared coefficient of variation), the squared ratio of standard deviation to mean of demand size in the periods when demand does occur. Together they define four demand classes:

Smooth (low ADI, low CV²) - demand arrives regularly with consistent size. Candidate for ETS, XGBoost, LightGBM, CatBoost.
Erratic (low ADI, high CV²) - demand arrives regularly but in highly variable amounts. ETS with manual review; ML models may overfit.
Intermittent (high ADI, low CV²) - demand arrives infrequently but consistently sized when it does. Croston's method (SBA variant) is the correct approach.
Lumpy (high ADI, high CV²) - demand is both infrequent and highly variable. Automated forecasting is not reliable; reorder-point planning is more appropriate.
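The four classes above can be sketched as a two-threshold lookup. The cutoffs used here (ADI 1.32, CV² 0.49) are the standard Syntetos-Boylan values; whether the pipeline uses exactly these, and how it treats values sitting on a boundary, is an assumption. ADI is approximated as total periods divided by periods with demand, and the function names are hypothetical.

```python
from statistics import mean, pvariance

ADI_CUTOFF = 1.32  # standard Syntetos-Boylan cutoff (assumed for this sketch)
CV2_CUTOFF = 0.49


def adi_cv2(demand):
    """Return (ADI, CV²) for a series of per-period demand quantities."""
    nonzero = [d for d in demand if d > 0]
    if not nonzero:
        raise ValueError("series has no non-zero demand")
    adi = len(demand) / len(nonzero)  # average periods per demand occurrence
    cv2 = pvariance(nonzero) / mean(nonzero) ** 2  # (std / mean) squared
    return adi, cv2


def classify(adi, cv2):
    """Map ADI and CV² to a Syntetos-Boylan demand class."""
    if adi <= ADI_CUTOFF:
        return "Smooth" if cv2 <= CV2_CUTOFF else "Erratic"
    return "Intermittent" if cv2 <= CV2_CUTOFF else "Lumpy"


# Hypothetical series: demand in every other period or so, similar sizes.
series = [10, 0, 12, 0, 0, 11, 0, 9]
adi, cv2 = adi_cv2(series)
print(f"ADI={adi:.2f}, CV2={cv2:.3f}, class={classify(adi, cv2)}")
```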

The classification directly determines which models are candidates for selection. Running XGBoost on a lumpy demand entity wastes compute and produces a forecast no planning team should trust. The forecastability scorecard below shows this classification in action across a sample of entities, with completely anonymized part numbers.

Forecastability Detail Scorecard - Sample Entities (Illustrative, Anonymized)

Forecastability score 0-100: above 60 = automated forecast; 35-60 = use with manual review; below 35 = reorder-point planning. SBC Class based on ADI and CV². MASE below 1.0 means the model beats naive.

SKU	Facility	History	ADI	CV²	SBC Class	MASE	MAPE %	Forecastability	Recommended Approach
FG-1840	Site-A	48mo	1.000	0.451	Smooth	0.179	3.5%	91.7	XGBoost + ETS (strong value)
FG-3024	Site-A	47mo	1.000	0.207	Smooth	0.244	9.6%	92.0	XGBoost + ETS (strong value)
FG-1222	Site-B	48mo	1.000	0.388	Smooth	0.266	64.5%	88.9	XGBoost + ETS (moderate value)
FG-0220	Site-A	46mo	1.020	0.369	Smooth	0.523	14.0%	86.6	ETS (moderate value)
FG-2183	Site-A	47mo	1.000	0.592	Erratic	0.223	16.2%	86.7	ETS with Manual Review
FG-3050	Site-B	47mo	1.040	0.646	Erratic	0.160	11.8%	86.3	ETS with Manual Review
FG-4420	Site-B	34mo	1.390	0.409	Intermittent	0.142	6.6%	85.2	Croston's Method
FG-1780	Site-A	48mo	1.000	0.056	Smooth	0.704	17.2%	85.1	ETS (moderate value)
FG-2940	Site-A	44mo	1.050	0.474	Smooth	0.555	31.5%	55.0	ETS - review required
FG-0381	Site-B	43mo	1.867	0.434	Lumpy	0.526	52.0%	28.4	Reorder-Point Planning

ADI: Average Demand Interval (1.0 = demand every period; above 1.32 = intermittent under the standard Syntetos-Boylan cutoff). CV²: Squared Coefficient of Variation (above 0.49 = erratic). SBC Class = Syntetos-Boylan Classification. MASE below 1.0 = model beats naive. Forecastability score combines MASE, demand regularity, and history depth. Part numbers are illustrative - not derived from any specific client dataset.

Reading the scorecard top to bottom, the pattern is clear: smooth entities with 40+ months of history and MASE well below 1.0 are strong automated forecast candidates. Erratic entities are forecasted but flagged for manual review before loading into the MRP upload. Intermittent entities route to Croston rather than a standard time-series model. Lumpy entities at the bottom of the list are excluded from the automated forecast entirely and routed to reorder-point planning - a more reliable approach for demand that is both rare and unpredictable.
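The routing logic in that reading can be sketched as a small function. The score bands come from the scorecard legend and the candidate lists from the demand classes above; how the pipeline actually combines class and score is not spelled out in this article, so the precedence used here (lumpy or low-score entities leave the forecast, erratic or mid-score entities get manual review) is an assumption, and `route` is a hypothetical name.

```python
CANDIDATES = {
    "Smooth": ["ETS", "XGBoost", "LightGBM", "CatBoost"],
    "Erratic": ["ETS"],
    "Intermittent": ["Croston (SBA)"],
    "Lumpy": [],
}


def route(sbc_class, score):
    """Return the planning tier and candidate models for one entity.

    Score bands follow the scorecard legend: above 60 automated,
    35-60 manual review, below 35 reorder-point planning.
    """
    if sbc_class == "Lumpy" or score < 35:
        return {"tier": "reorder-point planning", "candidates": []}
    needs_review = sbc_class == "Erratic" or score <= 60
    tier = "forecast with manual review" if needs_review else "automated forecast"
    return {"tier": tier, "candidates": CANDIDATES[sbc_class]}


# Class and score pairs taken from the illustrative scorecard.
for sku, sbc_class, score in [
    ("FG-1840", "Smooth", 91.7),
    ("FG-2183", "Erratic", 86.7),
    ("FG-4420", "Intermittent", 85.2),
    ("FG-2940", "Smooth", 55.0),
    ("FG-0381", "Lumpy", 28.4),
]:
    print(sku, route(sbc_class, score))
```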

The Excel output: the accountability layer

The pipeline's final deliverable is not the forecast number. It is an Excel workbook that documents every decision made to produce that number - and gives the planning team the tools to validate it, challenge it, or override it. This workbook is the answer to the boardroom question made tangible.

The workbook contains five tabs that tell the complete story:

Actuals vs. Forecast - period-by-period actual revenue alongside the model's in-sample fitted values and the out-of-sample forward forecast. The visual gap between actual and fitted is the residual - the part of demand the model did not explain. Large persistent residuals in a specific direction are a signal that something structural changed that the model has not captured.
FVA by Model and Grain - the full FVA scorecard showing every model's walk-forward accuracy at 1-month and 6-month horizons. The winning model per grain is highlighted. Grains where no model beats naive are flagged in red with a recommendation to investigate data quality or grain definition before the next run.
SHAP Attribution - the dollar-level feature contribution table for M+1 and M+2 for every grain in the run. This is the tab that answers "how did it come up with that number." Paste rows from this tab into a board deck and the forecast becomes auditable.
Forecastability Scorecard - every entity classified by demand pattern, scored, and assigned a recommended modeling approach. The planning team uses this tab to decide what goes into the MRP upload versus what gets manual planning intervention.
Forward Forecast - the actionable output: 12-month rolling forecast by entity, grain, and horizon, ready for validation and MRP upload. This is the tab the procurement manager acts on.

The workbook is regenerated automatically on every pipeline run. Numbers change as new actuals arrive. The SHAP attribution updates to reflect the current feature values. The FVA scorecard re-evaluates model competition with the expanded history. Nothing in the workbook is stale - it is always the current state of the model's understanding of the business.

Related reading: "From ERP to MRP: Building a Forecasting Competency That Runs Itself" covers the full pipeline architecture. "Why Manufacturing Demand Planning Fails" covers the data foundation this pipeline requires.
