Beyond Accuracy: When Forecasts Meet Cash Flow

Inventory is the moment when a forecast stops being a spreadsheet exercise and starts costing money.

A demand model can look elegant in validation. It can shave a few hundredths off RMSE, win a leaderboard, and make the data science team briefly feel like civilization has advanced. Then the warehouse over-orders slow-moving stock, the store runs out of fast-moving items, and the finance team discovers that “better accuracy” is not the same thing as better cash flow.

That is the useful irritation behind “Beyond Accuracy: Evaluating Forecasting Models by Multi-Echelon Inventory Cost,” an arXiv paper that evaluates forecasting models by passing their predictions through an inventory simulator rather than stopping at forecast error.¹

The paper’s core move is simple: compare forecasting models not only by RMSE and MAE, but also by what happens after the forecast becomes an order quantity. Over-forecasting creates holding cost. Under-forecasting creates shortage cost. In a distribution-center-to-store network, errors can also propagate upstream and downstream. In other words, the model is not being judged as a prediction machine alone. It is being judged as part of a decision pipeline.

That matters because the common business misconception is not that accuracy is useless. Accuracy is useful. The misconception is more specific: the model with the best forecast error is assumed to be the model with the best operating economics. This paper gives a clean counterexample. LSTM achieves the best RMSE in the reported single-echelon test, while Temporal CNN produces the lowest average inventory cost and the highest fill rate.

That small ranking reversal is the whole article.

The real object is the decision pipeline, not the forecast

The paper studies seven forecasting models on a controlled subset of the M5 Walmart dataset: the FOODS_1 department in California (CA_FOODS_1). The model list is deliberately broad rather than exotic:

Category	Models tested	Role in the comparison
Simple baseline	Naive lag-1	Tests whether complexity beats yesterday-as-today
Classical time series	Holt-Winters, ARIMA(1,1,1)	Represents traditional forecasting practice
Machine learning	Gradient Boosting Regressor, XGBoost	Tests tabular nonlinear models with engineered features
Deep learning	LSTM, Temporal CNN	Tests global sequence models that learn temporal structure

The dataset is reshaped into daily demand series and enriched with ordinary retail features: lags, rolling means, calendar indicators, event indicators, and SNAP variables. The final 28 days are held out for testing, the previous 28 days are used for validation, and training uses the earlier history. Classical models are fitted per series; machine-learning and deep-learning models are trained globally across the selected panel.
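The split and feature construction described above can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the lag set (1, 7, 28), and the 7-day rolling window are assumptions chosen for the example, and the demand series is synthetic.

```python
from statistics import mean

HOLDOUT = 28  # days in each of the validation and test windows, per the article


def split_series(series, holdout=HOLDOUT):
    """Split one daily demand series chronologically into train/val/test."""
    if len(series) <= 2 * holdout:
        raise ValueError("series too short for two holdout windows")
    train = series[: -2 * holdout]
    val = series[-2 * holdout : -holdout]
    test = series[-holdout:]
    return train, val, test


def lag_features(series, t, lags=(1, 7, 28), window=7):
    """Build features for day t using only observations strictly before t.

    The lag set and rolling window are illustrative assumptions.
    """
    if t < max(max(lags), window):
        raise ValueError("not enough history for the requested features")
    feats = {f"lag_{k}": series[t - k] for k in lags}
    feats[f"rolling_mean_{window}"] = mean(series[t - window : t])
    return feats


demand = [float(i % 5) for i in range(120)]  # synthetic series, not M5 data
train, val, test = split_series(demand)
print(len(train), len(val), len(test))
print(lag_features(demand, t=100))
```

Keeping every feature strictly backward-looking is the design point: a rolling mean that includes day t would leak the target into the input.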

So far, this is a fairly standard forecasting experiment. Useful, but not yet special.

The important step comes next. Each point forecast is converted directly into an order quantity. If demand is $D_t$ and the model forecasts $\hat{D}_t$, the simulator uses the forecast as the order:

$$ q_t = \hat{D}_t $$

The newsvendor cost then penalizes two different mistakes:

$$ C_t = h \max(q_t - D_t, 0) + b \max(D_t - q_t, 0) $$

where $h$ is the overage or holding cost and $b$ is the underage or shortage cost.

This is the conversion that many AI demos politely avoid. Once the forecast becomes $q_t$, error is no longer abstract. Positive and negative errors have different business meanings. A model that makes slightly larger errors in the wrong direction can be more expensive than a model with a better average error score. A model with similar RMSE can produce a different fill rate. The inventory system is a translation layer, and it translates statistical mistakes into operational consequences.
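A small worked example makes the asymmetry concrete. The demand and order numbers are illustrative, not taken from the paper; $h = 1$ and $b = 5$ match one of the cost settings the paper tests. Suppose actual demand is $D_t = 10$ and two forecasts miss by the same two units in opposite directions:

$$ q_t = 12: \quad C_t = 1 \cdot \max(12 - 10, 0) + 5 \cdot \max(10 - 12, 0) = 2 $$

$$ q_t = 8: \quad C_t = 1 \cdot \max(8 - 10, 0) + 5 \cdot \max(10 - 8, 0) = 10 $$

Both forecasts have an absolute error of 2 and contribute identically to RMSE and MAE, yet the under-forecast costs five times as much.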

Accuracy still matters, but it is not the final ranking

The headline result is that modern learning models outperform classical baselines on both forecasting and inventory metrics. That part is not shocking. Retail demand is nonlinear, intermittent, calendar-sensitive, and generally less polite than textbook time series. Naive, Holt-Winters, and ARIMA are not designed to absorb all of that structure.

The more interesting part is the separation between forecast accuracy and inventory performance.

Model	RMSE	MAE	Avg. cost/day	Fill rate	Cost reduction vs Naive	Fill gain vs Naive
Naive lag-1	2.909	1.505	4.521	0.534	0.0%	0.0 pp
Holt-Winters ES	2.677	1.487	4.182	0.583	7.5%	4.9 pp
ARIMA(1,1,1)	2.636	1.486	4.258	0.572	5.8%	3.8 pp
GBR	2.296	1.293	3.854	0.605	14.8%	7.1 pp
XGBoost	2.294	1.289	3.839	0.606	15.1%	7.2 pp
LSTM	2.207	1.243	3.704	0.620	18.1%	8.6 pp
Temporal CNN	2.260	1.293	3.674	0.632	18.7%	9.8 pp

LSTM is the strongest model by RMSE and MAE. Temporal CNN is not far behind on RMSE, but it wins on average daily cost and fill rate. The difference is not huge, but it is conceptually important. Forecasting accuracy and operational value are correlated, not identical.
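The reversal is easy to check by sorting the table above on different columns. The snippet below only re-ranks the reported numbers; the variable names are arbitrary.

```python
# (RMSE, average cost per day, fill rate) copied from the table above
results = {
    "Naive lag-1": (2.909, 4.521, 0.534),
    "Holt-Winters ES": (2.677, 4.182, 0.583),
    "ARIMA(1,1,1)": (2.636, 4.258, 0.572),
    "GBR": (2.296, 3.854, 0.605),
    "XGBoost": (2.294, 3.839, 0.606),
    "LSTM": (2.207, 3.704, 0.620),
    "Temporal CNN": (2.260, 3.674, 0.632),
}

by_rmse = sorted(results, key=lambda m: results[m][0])  # lower is better
by_cost = sorted(results, key=lambda m: results[m][1])  # lower is better
by_fill = sorted(results, key=lambda m: results[m][2], reverse=True)  # higher is better

for rank, (a, b, c) in enumerate(zip(by_rmse, by_cost, by_fill), start=1):
    print(f"{rank}. RMSE: {a:<16} cost: {b:<16} fill: {c}")
```

The same sort shows a second, quieter reversal among the classical models: ARIMA ranks ahead of Holt-Winters on RMSE but behind it on cost and fill rate.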

This is where the paper becomes more useful than a normal model leaderboard. A standard summary would say: “deep learning works best.” True, but blunt. The better reading is: “the model that best minimizes symmetric statistical error is not automatically the model that best manages asymmetric operating cost.”

That distinction matters in real supply chains because the two sides of forecast error rarely have equal consequences. In some categories, a stockout damages revenue, customer retention, and shelf availability. In others, overstock creates markdowns, spoilage, storage cost, or working-capital drag. RMSE does not know which side hurts more. The inventory simulator does.

The sensitivity test checks whether the ranking survives different shortage costs

The paper then varies the shortage penalty $b$ while keeping the holding cost fixed at $h = 1$. This is best read as a sensitivity test, not a second thesis. Its purpose is to ask whether the model ranking depends too heavily on one assumed shortage-cost setting.

Model	Avg. cost when $b=2$	Avg. cost when $b=5$	Avg. cost when $b=10$
Naive lag-1	2.259	4.521	8.291
Holt-Winters ES	2.157	4.182	7.558
ARIMA(1,1,1)	2.179	4.258	7.722
GBR	1.933	3.854	7.054
XGBoost	1.926	3.839	7.028
LSTM	1.858	3.704	6.781
Temporal CNN	1.888	3.674	6.652

The ranking is broadly stable. Deep learning remains strongest, boosted trees remain competitive, and classical models remain behind. LSTM has the lowest average cost at $b=2$, while Temporal CNN leads at $b=5$ and $b=10$.

The interpretation should be precise. This test supports the claim that the main result is not fragile under the paper’s tested shortage-penalty values. It does not prove that Temporal CNN is universally superior for every retailer, SKU class, cost ratio, replenishment lead time, or service-level target. Please do not take a three-column sensitivity table and turn it into a procurement policy. That would be efficient, but only in the way a toaster is efficient at legal reasoning.

One technical detail is especially important: under the paper’s point-forecast-to-order rule, changing $b$ affects the cost calculation, but not the order quantity or fill rate for a fixed forecast. In a classical newsvendor decision with a full demand distribution, a higher shortage penalty would normally push the optimal order quantity upward through a critical-ratio rule. Here, the simulator uses point forecasts directly as orders. That makes the experiment clean for comparing forecasting models, but it also limits how far the result can be generalized to fully optimized inventory control.
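Because orders are fixed, the average cost is linear in $b$: with $h = 1$ it equals average overage plus $b$ times average underage. A rough sketch, assuming that linearity and using only the rounded values in the sensitivity table, backs out the two components for the two deep models. The decomposition and the break-even penalty are derived here for illustration; they are not figures reported by the paper.

```python
def split_cost(cost_b2, cost_b5, h=1.0):
    """Recover average overage and underage from costs at b=2 and b=5.

    Assumes cost(b) = h * overage + b * underage with fixed orders.
    """
    underage = (cost_b5 - cost_b2) / (5 - 2)
    overage = (cost_b2 - 2 * underage) / h
    return overage, underage


# Rounded values copied from the sensitivity table: (b=2, b=5, b=10)
table = {
    "LSTM": (1.858, 3.704, 6.781),
    "Temporal CNN": (1.888, 3.674, 6.652),
}

parts = {}
implied = {}
for model, (c2, c5, c10) in table.items():
    over, under = split_cost(c2, c5)
    parts[model] = (over, under)
    implied[model] = over + 10 * under  # consistency check against the b=10 column
    print(f"{model}: overage={over:.3f} underage={under:.3f} "
          f"implied b=10 cost={implied[model]:.3f} (table: {c10})")

# Penalty at which the two cost lines cross (approximate, from rounded inputs)
(o_l, u_l), (o_t, u_t) = parts["LSTM"], parts["Temporal CNN"]
break_even_b = (o_t - o_l) / (u_l - u_t)
print(f"break-even b is roughly {break_even_b:.1f}")
```

On these rounded inputs, Temporal CNN carries slightly more overage but less underage than LSTM, and the implied break-even penalty falls between the tested values of 2 and 5. That is consistent with LSTM leading at $b=2$ and Temporal CNN leading at $b=5$ and $b=10$.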

The two-echelon extension is the right business direction, but the numeric evidence is thinner

The paper extends the framework from a single-echelon setting to a two-echelon system: one distribution center supplies multiple stores. Store demand is aggregated at the DC level, the DC places orders based on aggregate demand, and fulfillment can be allocated proportionally when inventory is insufficient.

Conceptually, this is the right move. A retailer does not run one isolated newsvendor problem in a vacuum. Store-level decisions interact with DC replenishment, upstream aggregation, and allocation rules. Forecast error can therefore migrate through the network. A store-level miss is local; a DC-level miss can affect many outlets.
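The DC-to-store mechanics described above can be sketched for a single day. This is a simplified illustration under stated assumptions, not the paper's simulator: there is no lead time, no inventory carried between days, and the function names and example numbers are hypothetical.

```python
def allocate_proportionally(store_orders, dc_stock):
    """Split DC stock across stores in proportion to what each ordered.

    If stock covers all orders, every store is shipped its full order.
    """
    total = sum(store_orders)
    if total <= dc_stock:
        return [float(q) for q in store_orders]
    share = dc_stock / total
    return [q * share for q in store_orders]


def simulate_day(store_orders, store_demand, dc_stock):
    """One day: ration DC stock to stores, then serve store demand."""
    shipped = allocate_proportionally(store_orders, dc_stock)
    served = [min(s, d) for s, d in zip(shipped, store_demand)]
    dc_shortfall = max(sum(store_orders) - dc_stock, 0.0)
    return shipped, served, dc_shortfall


# Hypothetical numbers: three stores order from their own forecasts,
# while the DC stocked 15 units based on an aggregate forecast.
store_orders = [4, 6, 10]
store_demand = [5, 4, 9]
dc_stock = 15

shipped, served, dc_shortfall = simulate_day(store_orders, store_demand, dc_stock)
print("shipped:", shipped)
print("served:", served)
print("DC shortfall:", dc_shortfall)
```

In this toy case a single DC-level under-forecast reduces shipments to all three stores at once, which is the propagation effect the two-echelon framing is meant to expose.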

The business implication is also clear: the value of forecasting depends on where the forecast enters the system. A model used only for store replenishment may improve shelf availability locally. A model used at the distribution center may affect network-wide allocation, upstream purchase orders, and the timing of replenishment across stores.

However, this article should not overstate the evidence here. In the accessible arXiv HTML, the detailed numeric tables cover single-echelon performance and shortage-penalty sensitivity. The two-echelon framework is described methodologically and discussed as an extension, but the strongest reported numerical evidence is still the single-echelon comparison.

That does not make the two-echelon contribution irrelevant. It means the contribution is partly architectural: the paper shows how to propagate forecasts into a network-level simulator, which is closer to how supply chains actually operate. For business readers, that architectural step may be more valuable than another fractional improvement in RMSE.

What each experiment is actually doing

A useful way to read the paper is to separate the purpose of each test. Not every table is trying to prove the same thing.

Paper component	Likely purpose	What it supports	What it does not prove
Seven-model forecast comparison	Main evidence	Learning-based models improve forecast and inventory KPIs on the CA_FOODS_1 test split	Universal model dominance across all retail categories
Single-echelon newsvendor simulation	Main evidence	Forecasts can be evaluated by cost and fill rate, not only RMSE/MAE	Fully optimized replenishment under lead times, safety stock, or service constraints
Shortage-penalty variation	Robustness/sensitivity test	Deep-model advantages remain broadly stable as shortage cost changes	Stability under all cost structures or stochastic ordering policies
Two-echelon DC-store formulation	Exploratory/architectural extension	Forecast evaluation can be propagated into a network setting	A complete empirical proof of multi-echelon superiority with detailed network KPIs
Discussion of future probabilistic forecasts	Boundary and next-step framing	Point forecasts are not the end state for inventory-aware forecasting	Current results already solve service-level optimization

This separation is not academic bookkeeping. It prevents a common reading error: treating every part of a paper as equally proven. The main empirical evidence says the forecasting-to-inventory evaluation is useful and that deep models perform strongly on this controlled retail subset. The two-echelon part says the same logic can be extended into a more realistic network formulation. Those are related claims, but they are not identical claims.

The business lesson is not “use deep learning”; it is “evaluate the whole chain”

The easiest executive takeaway would be: use Temporal CNN or LSTM for demand forecasting. That is not wrong, but it is too small.

The larger lesson is that forecasting should be evaluated as a component inside an operating system:

Data → Forecast → Order Quantity → Inventory Outcome → Cash Flow

Most analytics teams optimize the second step. Operations teams live with the last three.

This paper’s practical value is that it makes the handoff explicit. A forecast is not a finished product. It is an input into a replenishment rule. That rule creates inventory. Inventory creates cost, service levels, and working-capital consequences.

For a business building forecasting infrastructure, the evaluation dashboard should therefore include at least three layers:

Evaluation layer	Example metric	What it tells management
Forecast layer	RMSE, MAE	Whether the model predicts demand accurately
Decision layer	Order quantity, stockout/overstock frequency	How forecast errors translate into actions
Financial/service layer	Holding cost, shortage cost, fill rate	Whether the system improves business outcomes
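A minimal sketch of such a three-layer report is shown below. It assumes the paper's point-forecast rule (order equals forecast) and defines fill rate as units served over units demanded, which is an assumption of this example rather than a definition quoted from the paper. The function name, metric keys, and sample data are hypothetical.

```python
from math import sqrt


def evaluate(forecasts, demand, h=1.0, b=5.0):
    """Score one forecast series on forecast, decision, and cost/service layers.

    Orders equal forecasts. Fill rate is units served / units demanded
    (an assumed definition for this sketch).
    """
    n = len(demand)
    errors = [f - d for f, d in zip(forecasts, demand)]
    over = [max(e, 0.0) for e in errors]    # units ordered but not demanded
    under = [max(-e, 0.0) for e in errors]  # units demanded but not ordered
    total_demand = sum(demand)
    return {
        # Forecast layer
        "rmse": sqrt(sum(e * e for e in errors) / n),
        "mae": sum(abs(e) for e in errors) / n,
        # Decision layer
        "stockout_freq": sum(u > 0 for u in under) / n,
        "overstock_freq": sum(o > 0 for o in over) / n,
        # Financial/service layer
        "holding_cost": h * sum(over) / n,
        "shortage_cost": b * sum(under) / n,
        "fill_rate": 1.0 - sum(under) / total_demand if total_demand else 1.0,
    }


demand = [3, 0, 5, 2]      # toy data, not from the paper
model_a = [2, 1, 4, 3]     # misses low twice, high twice
model_b = [4, 1, 6, 1]     # misses high three times, low once

report_a = evaluate(model_a, demand)
report_b = evaluate(model_b, demand)
for name, report in (("A", report_a), ("B", report_b)):
    print(name, {k: round(v, 3) for k, v in report.items()})
```

The two toy models are built to tie on RMSE and MAE while differing on every decision-layer and cost-layer metric, which is exactly the situation a forecast-only dashboard cannot see.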

This is also where the paper quietly challenges how many AI projects are sold internally. A model can be technically better and still operationally irrelevant if no one maps its outputs into decisions. Conversely, a modest accuracy improvement can be valuable if it reduces the expensive errors: the stockouts that matter, the overstock that ties up cash, the DC-level miss that cascades into store allocation problems.

The ROI pathway is not “AI improves forecasting.” That sentence has been used so often it should probably be composted.

The better pathway is: better temporal representations improve demand estimates; better demand estimates change order quantities; better order quantities reduce shortage and holding costs; lower cost and higher fill rate justify investment in the forecasting pipeline. Each arrow in that chain needs to be tested. This paper tests several of them in one framework.

Where the result applies, and where it should not be stretched

The paper is useful because it is concrete. It is also limited because it is concrete.

The strongest interpretation applies to retailers with daily SKU-store demand patterns that resemble the selected M5 CA_FOODS_1 subset, using short-horizon forecasts and a simple point-forecast replenishment rule. In that setting, the paper gives a credible demonstration that model choice can affect inventory cost and service metrics, not merely forecast error.

The boundaries are equally important:

The experiment uses one department/state subset, not the full diversity of retail categories.
The reported test horizon is 28 days, which is useful for controlled comparison but short for seasonal, promotional, and regime-change analysis.
The ordering rule maps point forecasts directly to order quantities, rather than optimizing order quantities from predictive distributions.
The setup does not fully model price elasticity, promotion lift, substitution, perishability, lead-time uncertainty, supplier constraints, or capacity limits.
The two-echelon framing is directionally important, but the most detailed reported numerical evidence remains the single-echelon table and shortage-cost sensitivity table.

These are not fatal flaws. They define the correct use of the paper. The study should be treated as a decision-evaluation template, not as a final answer to retail inventory optimization.

In fact, the next step is fairly obvious: replace point forecasts with probabilistic or quantile forecasts and choose order quantities according to service targets and cost ratios. That would align the forecasting model more directly with the newsvendor decision. The current paper already points in that direction. It just does not fully travel there.
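What that next step could look like is sketched below, using the standard newsvendor critical ratio $b / (b + h)$ and an empirical quantile of predictive demand samples. This is not something the paper implements: the sample values are hypothetical, and the helper names are chosen for this example.

```python
import math


def critical_ratio(h, b):
    """Newsvendor critical ratio: the demand quantile to order up to."""
    return b / (b + h)


def empirical_quantile(samples, p):
    """Smallest sample value whose empirical CDF is at least p."""
    if not samples or not 0.0 < p <= 1.0:
        raise ValueError("need samples and 0 < p <= 1")
    ordered = sorted(samples)
    index = math.ceil(p * len(ordered)) - 1
    return ordered[index]


# Hypothetical predictive samples for one SKU-day (not from the paper)
demand_samples = [0, 0, 1, 1, 2, 2, 3, 4, 6, 9]

orders = {}
for b in (2, 5, 10):
    p = critical_ratio(h=1, b=b)
    orders[b] = empirical_quantile(demand_samples, p)
    print(f"b={b}: critical ratio={p:.3f}, order quantity={orders[b]}")
```

Unlike the point-forecast rule, the order quantity here rises with the shortage penalty, so both cost and fill rate would respond to $b$ rather than cost alone.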

Forecasting finally has to pay rent

The most useful contribution of this paper is not that Temporal CNN beats LSTM on one inventory metric, or that deep learning beats ARIMA on a food-sales subset. Those results are interesting, but they are not the durable idea.

The durable idea is evaluation discipline.

A forecast should not be celebrated merely because it reduces error. It should be examined through the decisions it triggers. Does it reduce average inventory cost? Does it improve fill rate? Does it remain useful when shortage costs change? Does it behave differently at the store level and the DC level? Does it improve the system, or only the metric?

That last question is where many AI projects become uncomfortable. Good. They should.

Forecast accuracy is a proxy. Cash flow is not. The paper’s strength is that it forces the two to meet in the same room, under the supervision of an inventory simulator. That is a small methodological step, but a large managerial correction.

Cognaptus: Automate the Present, Incubate the Future.

¹ Swata Marik, Swayamjit Saha, and Garga Chatterjee, “Beyond Accuracy: Evaluating Forecasting Models by Multi-Echelon Inventory Cost,” arXiv:2603.16815, 2026. https://arxiv.org/html/2603.16815
