TL;DR for operators

Most forecasting teams already have decent numerical forecasters. Their problem is not that ARIMA, ETS, Lag-Llama, Chronos, or internal demand models suddenly forgot how Tuesdays work. The problem is that many important forecast shocks arrive as text: heat-wave notices, maintenance schedules, holiday effects, price caps, promotions, policy changes, store closures, one-off events, and all the other messy little business facts that refuse to fit politely into a clean covariate table.

The paper behind this article studies how LLMs can help with that specific problem: context-aided forecasting, where text is necessary to forecast correctly.[1] Its answer is more useful than the usual “prompt it better” confetti. The authors propose four strategies around Direct Prompting:

| Strategy | What it is really for | Operational use |
| --- | --- | --- |
| FxDP | Diagnosis | Check whether the model understands the context and whether it can apply that understanding numerically. |
| CorDP | Accuracy through correction | Keep your existing forecaster, then ask the LLM to adjust its probabilistic forecast where the text matters. |
| IC-DP | Accuracy through examples | Give the model one solved context-forecast example so it can imitate the causal pattern. |
| RouteDP | Efficiency | Use a smaller model for routine cases and route hard cases to a larger model under a fixed budget. |

The business reading is simple: do not ask one heroic prompt to be analyst, statistician, auditor, and finance controller. That is how prompts become tiny consulting firms with no liability insurance. Build a layered system instead. Use FxDP to diagnose. Use CorDP when you already trust your base forecast. Use IC-DP when contextual patterns repeat. Use RouteDP when large-model calls are too expensive to spray across every task.

The key warning is equally simple: explaining the effect of context is not the same as executing it in a forecast. The paper calls this the Execution Gap. Even strong models can say the right thing and still output the wrong numbers. In forecasting, this is not a philosophical problem. It is the difference between “the store is closed” and a demand forecast that still expects customers to materialise through the locked door.

Forecasting breaks when the important variable arrives as a paragraph

The paper uses the Context-Is-Key benchmark, a collection of context-aided forecasting tasks built to ensure that textual information is not decorative. The benchmark includes 71 manually designed tasks, 2,644 time series, and seven real-world domains including energy, economics, transport, retail, climatology, mechanics, and public safety. The point is not to reward models for generic time-series extrapolation. The point is to test whether they can use text when the text changes what the forecast should be.

That distinction matters. Many business dashboards already have numerical models that perform acceptably on ordinary weeks. The difficult weeks are different. A city has a heat wave. A sensor was down for maintenance in the historical window but will not be down in the forecast window. A holiday suppresses traffic. A constraint caps the future value. A retailer runs a promotion for only part of the prediction horizon.

The paper evaluates forecasts using RCRPS, a Region-of-Interest version of the Continuous Ranked Probability Score. Lower is better. The metric prioritises the windows where context is expected to matter and penalises constraint violations. That is an important design choice because a model can look competent on average while failing exactly where the business needed contextual intelligence. A good overall forecast that misses the closure window is like a restaurant reservation system that works beautifully except on Friday nights. Lovely in theory. Operationally adorable.
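
To make the "competent on average, wrong where it matters" point concrete, here is a minimal sketch of a sample-based CRPS with extra weight on a region of interest. It is not the benchmark's RCRPS: the equal split between the window and the rest (`roi_weight=0.5`) is an assumption, and the scaling and constraint-violation penalty are left out. The function names and the toy numbers are illustrative.

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    samples = np.asarray(samples, dtype=float)
    term_obs = np.mean(np.abs(samples - y))
    term_spread = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term_obs - term_spread)

def roi_weighted_crps(paths, truth, roi_mask, roi_weight=0.5):
    """Average per-step CRPS, weighting the region of interest separately.

    paths: (n_samples, horizon) forecast sample paths
    truth: (horizon,) realised values
    roi_mask: (horizon,) booleans, True where the context is expected to matter
    roi_weight: share of the score given to the ROI (an assumption of this sketch)
    """
    paths = np.asarray(paths, dtype=float)
    truth = np.asarray(truth, dtype=float)
    roi_mask = np.asarray(roi_mask, dtype=bool)
    per_step = np.array(
        [crps_from_samples(paths[:, t], truth[t]) for t in range(truth.shape[0])]
    )
    if roi_mask.all() or not roi_mask.any():
        return float(per_step.mean())  # no split possible, fall back to plain mean
    return float(
        roi_weight * per_step[roi_mask].mean()
        + (1.0 - roi_weight) * per_step[~roi_mask].mean()
    )

# Toy case (made-up numbers): the store is closed at step 2, the forecast ignores it.
truth = np.array([10.0, 10.0, 0.0, 10.0])
paths = np.tile(np.array([10.0, 10.0, 10.0, 10.0]), (5, 1))
closure = np.array([False, False, True, False])

plain_score = roi_weighted_crps(paths, truth, np.zeros(4, dtype=bool))
roi_score = roi_weighted_crps(paths, truth, closure)
```

In this toy case the plain average is 2.5 while the ROI-weighted score is 5.0: the same forecast looks twice as bad once the closure window is given its own share of the score.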

The baseline method is Direct Prompting: feed the LLM the history, the context, and the forecast timestamps, then ask it to produce forecasts. Direct Prompting is attractive because it is simple and does not require training. It is also too blunt to be the whole architecture. The paper’s contribution is to split the forecasting problem into four operational categories.
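
As a rough illustration of what "feed the LLM the history, the context, and the forecast timestamps" can look like, the sketch below assembles such a prompt as a plain string. The tag layout, the function names, and the sample data are all assumptions made for this article, not the paper's prompt template, and no model is called.

```python
from __future__ import annotations
from typing import Sequence

def format_series(points: Sequence[tuple[str, float]]) -> str:
    """Render (timestamp, value) pairs, one per line."""
    return "\n".join(f"({ts}, {value})" for ts, value in points)

def build_direct_prompt(
    history: Sequence[tuple[str, float]],
    context: str,
    forecast_timestamps: Sequence[str],
) -> str:
    """Assemble a Direct Prompting request (hypothetical layout)."""
    return (
        "<context>\n" + context + "\n</context>\n"
        "<history>\n" + format_series(history) + "\n</history>\n"
        "Forecast the value at each timestamp below, "
        "one (timestamp, value) pair per line:\n"
        + "\n".join(forecast_timestamps)
    )

# Made-up data for illustration only.
prompt = build_direct_prompt(
    history=[("2024-07-01 10:00", 41.0), ("2024-07-01 11:00", 43.5)],
    context="The store is closed for maintenance from 12:00 to 14:00.",
    forecast_timestamps=["2024-07-01 12:00", "2024-07-01 13:00"],
)
```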

The paper’s real contribution is a toolbox, not a master prompt

A weaker article would summarise the paper as “four prompting methods improve forecasting.” Accurate enough, and not especially useful. The sharper reading is that the paper decomposes LLM forecasting into separate jobs:

| Job | Paper strategy | What the paper directly tests | Business interpretation |
| --- | --- | --- | --- |
| Understand the context | FxDP | Can the model explain the forecast effect, and does that correlate with better forecasts? | Use explanations as diagnostic telemetry, not as decorative reasoning. |
| Modify a quantitative forecast | CorDP | Can LLMs improve an existing probabilistic forecast using text? | Keep the forecasting stack; add an LLM correction layer. |
| Reuse recurring contextual patterns | IC-DP | Does one solved example improve forecast accuracy? | Maintain a small library of effect-matched examples. |
| Spend model capacity selectively | RouteDP | Can a model rank task difficulty and route hard tasks upward? | Treat large-model inference as a budgeted escalation path. |

This is why this article is organised by category rather than by leaderboard. The paper is not really about one method winning forever. It is about choosing the right operational lever for the failure mode in front of you.

That is also where the likely misconception sits. The easy version is: “LLMs are good at reading context, so better prompts will make them good forecasters.” The paper’s evidence says: not quite. Reading context, explaining the effect, applying the effect numerically, preserving probabilistic calibration, and controlling inference cost are different capabilities. Bundling them into one prompt is convenient. Convenience is often where system design goes to nap.

FxDP: diagnosis is cheaper than guessing why the forecast failed

FxDP, or Direct Prompting with Forecast Effect Explanation, asks the model to first explain how the textual context should affect the forecast, then produce the forecast. The important part is not that the explanation magically improves accuracy. The important part is that the explanation becomes an evaluable object.

The authors build ground-truth forecast effects for the benchmark tasks and then evaluate model explanations against them. They validate LLM-judge evaluation against human annotators. In the appendix, the LLM judges reach high agreement with human majority votes on a 60-item validation subset: GPT-5.2 and Claude Sonnet 4.5 both reach 91.7% agreement, while Gemini 2.5 Pro reaches 93.3%; the LLM majority reaches 98.3%. That makes FxDP more than “look, the model wrote a paragraph.” It becomes a reasonably structured diagnostic protocol.

The main diagnostic finding is the Execution Gap. Models often know what the context means but fail to turn that knowledge into a sufficiently improved forecast. In Figure 3, larger models generally produce more accurate forecast-effect explanations. A threshold appears around the 14B scale, where models move from mostly failing to explain effects to often explaining them correctly. The best frontier models reach about 75% full success: accurate explanation plus improved forecast. But the grey middle segment remains: accurate explanation, insufficient forecast improvement.

This is the section operators should not skim. In business forecasting, a correct natural-language explanation can feel reassuring. “The heat wave will increase electricity demand for the affected hours.” Wonderful. Now show me the distribution. If the numerical forecast does not actually lift the affected hours, the model has merely narrated the forecast it failed to make.

The appendix makes the diagnosis more credible. The authors test different improvement thresholds from 30% to 80%. Absolute success changes, as expected, but the model ranking remains broadly stable and the Execution Gap persists. That makes the gap less likely to be an artefact of a single arbitrary threshold. The paper also examines failure types: models may omit exact quantitative details, mishandle maintenance periods in historical data, or make basic time-window and percentage mistakes. Smaller models need better context reasoning; larger models need better translation from reasoning to numerical execution.
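
The diagnostic logic can be sketched in a few lines: combine a judged explanation with a relative forecast improvement and see which bucket each case lands in as the threshold moves. The bucket names, the choice of reference score, and the example cases are assumptions for illustration; the paper's exact category definitions may differ.

```python
from __future__ import annotations
from collections import Counter

def classify_fxdp(
    explanation_correct: bool,
    reference_score: float,
    context_score: float,
    threshold: float = 0.5,
) -> str:
    """Bucket one task. Scores are lower-is-better; reference_score must be > 0.

    reference_score is the score of whatever forecast the improvement is
    measured against (an assumption of this sketch).
    """
    improvement = (reference_score - context_score) / reference_score
    if not explanation_correct:
        return "explanation_failure"
    return "full_success" if improvement >= threshold else "execution_gap"

def sweep_thresholds(cases, thresholds=(0.3, 0.5, 0.8)):
    """Count buckets per threshold for (explanation_correct, reference, context) cases."""
    return {
        t: Counter(classify_fxdp(ok, ref, ctx, threshold=t) for ok, ref, ctx in cases)
        for t in thresholds
    }

# Made-up cases: two correct explanations with different gains, one wrong explanation.
cases = [(True, 1.0, 0.3), (True, 1.0, 0.6), (False, 1.0, 0.9)]
buckets = sweep_thresholds(cases)
```

The useful output is not any single count but how the `execution_gap` bucket behaves as the threshold tightens: if it never empties, the model is narrating effects it does not apply.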

The business lesson is uncomfortable but useful: explanation quality is a leading diagnostic, not a substitute for forecast evaluation. FxDP is best used for vendor testing, model monitoring, and post-mortem analysis. It tells you whether the model failed because it misunderstood the context or because it understood the context and still could not do the numbers. Those are different remediation paths.

CorDP: let the LLM correct the forecast instead of cosplaying the forecaster

CorDP is the most immediately enterprise-shaped idea in the paper. Instead of asking the LLM to forecast from scratch, the method gives it a probabilistic forecast from a quantitative model and asks it to correct the parts affected by textual context.

This is a better fit for how forecasting systems are actually deployed. Companies do not throw away a tuned forecasting stack every time a language model learns a new trick. They have baselines, backtests, monitoring, model owners, and occasionally someone in finance who still remembers last quarter’s “temporary spreadsheet workaround.” CorDP respects that machinery.

The paper tests two variants:

| CorDP variant | Mechanism | Better fit |
| --- | --- | --- |
| Median-CorDP | The LLM repeatedly corrects the median forecast to generate a context-informed distribution. | Full-horizon shape changes and hard constraints. |
| SampleWise-CorDP | The LLM corrects each sampled forecast path from the base distribution. | Partial-window effects where only some timestamps should move. |
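
The structural difference between the two variants is easy to see in code. In the sketch below, `corrector` is a hypothetical stand-in for the LLM call (here a deterministic stub that zeroes a closure window), and the base samples are made up; only the control flow is meant to mirror the descriptions in the table.

```python
import numpy as np

def median_cordp(base_samples, corrector, n_out):
    """Correct the median path n_out times.

    With a real, stochastic LLM the repeated calls would differ and form a
    distribution; with the deterministic stub below they are identical.
    """
    median_path = np.median(base_samples, axis=0)
    return np.stack([corrector(median_path) for _ in range(n_out)])

def samplewise_cordp(base_samples, corrector):
    """Correct every sampled path, keeping the base forecaster's spread."""
    return np.stack([corrector(path) for path in base_samples])

def closure_corrector(path):
    """Stub for the LLM: 'the store is closed at steps 2 and 3'."""
    corrected = np.array(path, dtype=float)  # copy, leave the input untouched
    corrected[2:4] = 0.0
    return corrected

# Made-up base forecast: 3 sample paths over a 4-step horizon.
base = np.array([
    [10.0, 11.0, 12.0, 13.0],
    [12.0, 13.0, 14.0, 15.0],
    [14.0, 15.0, 16.0, 17.0],
])
median_out = median_cordp(base, closure_corrector, n_out=5)
samplewise_out = samplewise_cordp(base, closure_corrector)
```

Note what each variant preserves: the sample-wise version keeps the base model's uncertainty outside the corrected window, while the median version depends entirely on the corrector for spread.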

The aggregate result is strong: CorDP achieves the best performance for 13 of 17 evaluated LLMs, with improvements of up to 50% over Direct Prompting. Most winning CorDP configurations use Lag-Llama as the base forecaster, which is not a trivial detail. The correction layer is only as useful as the forecast it is correcting. If the baseline is poor, the LLM starts from a bad map and then proudly annotates the wrong city.

The paper also reports that CorDP improves on the base quantitative forecaster for most models in the relevant comparisons. But Direct Prompting remains best for four models: Llama-3.1-405B, Qwen2.5-14B, Qwen2.5-72B, and Gemini-2.5-Pro. That exception is valuable. It prevents the result from turning into a universal recipe. Some models may already be strong direct forecasters; some may not benefit from being anchored to a base forecast.

The task-type analysis is especially useful for deployment. SampleWise-CorDP tends to do well on partial Region-of-Interest tasks, where the context affects a subset of the prediction window. Median-CorDP is stronger on tasks where the whole forecast shape changes and on constraint-heavy tasks; in some constraint cases, large models achieve zero or near-zero constraint violation. The appendix examples support this interpretation rather than introducing a second thesis: they show where each correction style fits in practice.

For business users, CorDP is the first pilot I would run. It is not the most glamorous method. That is part of its charm. It keeps the existing forecast, adds a context-aware adjustment layer, and can be evaluated against the same proper scoring rules the team already uses. The relevant question is not “Can an LLM replace our forecasting system?” The better question is “Can an LLM reduce error in the windows where text changes the answer?” A less theatrical question, yes. Also the one that might survive procurement.

IC-DP: one good example can beat another round of prompt incense

IC-DP adds one solved example to the prompt: history, context, forecast timestamps, and ground truth. The example comes from the same task type but uses different data and context. In plain terms, the model sees a worked example of how a contextual effect should translate into a forecast.
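
A minimal sketch of that prompt shape follows: one solved example (context, history, ground truth) placed ahead of the task to solve. The section labels, dictionary keys, and data are assumptions for illustration, not the paper's template.

```python
from __future__ import annotations
from typing import Sequence

Series = Sequence[tuple[str, float]]

def fmt(points: Series) -> str:
    """Render (timestamp, value) pairs, one per line."""
    return "\n".join(f"({ts}, {value})" for ts, value in points)

def build_icdp_prompt(
    example: dict,
    history: Series,
    context: str,
    forecast_timestamps: Sequence[str],
) -> str:
    """Prepend one solved example to the target task (hypothetical layout).

    `example` needs the keys 'context', 'history', and 'ground_truth'.
    """
    solved = (
        "Solved example\n"
        f"Context: {example['context']}\n"
        f"History:\n{fmt(example['history'])}\n"
        f"Ground truth:\n{fmt(example['ground_truth'])}\n"
    )
    target = (
        "Task to solve\n"
        f"Context: {context}\n"
        f"History:\n{fmt(history)}\n"
        "Forecast these timestamps:\n" + "\n".join(forecast_timestamps)
    )
    return solved + "\n" + target

# Made-up example and task, both of the 'temporary shutdown' type.
example = {
    "context": "The branch is closed on day 3 for a public holiday.",
    "history": [("day 1", 100.0), ("day 2", 104.0)],
    "ground_truth": [("day 3", 0.0), ("day 4", 101.0)],
}
icdp_prompt = build_icdp_prompt(
    example,
    history=[("day 1", 50.0), ("day 2", 52.0)],
    context="The store is closed on day 3 for maintenance.",
    forecast_timestamps=["day 3", "day 4"],
)
```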

The result is substantial. IC-DP improves performance for 14 of 16 tested models. Small models gain 14–56%, while mid-size and large models show 20–40% gains. Llama-3.1-405B improves by 25%. GPT-4o improves by 48%. The biggest frontier models show a more mixed pattern: GPT-5.2 and Gemini-2.5-Pro improve modestly, while Claude Sonnet 4.5 is already near optimal under Direct Prompting and shows no aggregate improvement.

That pattern is exactly what we would expect if IC-DP is acting less like generic prompt decoration and more like a local behavioural template. It helps most when the model needs to learn the mapping from context to forecast effect. It helps less when the model already performs that mapping well.

The appendix table reinforces the task-specific nature of the gain. IC-DP provides outsized improvements on full-Region-of-Interest tasks and improves constraint handling for several smaller models. Its advantage is not evenly spread over every forecasting condition. That matters because enterprise teams should not build example libraries by keyword similarity alone. The useful similarity is causal or operational: same type of effect, same kind of forecast adjustment, similar placement in the forecast horizon.

A holiday traffic drop and a one-day store closure may have more in common for forecasting than two texts that both mention “summer.” Example retrieval should be effect-aware. A compact internal library might label examples as “temporary uplift,” “temporary shutdown,” “historical sensor artefact removed in future,” “hard cap,” “weekday holiday suppression,” or “promotion affects only evening hours.” That is not fancy. It is the kind of boring taxonomy that makes systems work.
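
Effect-aware retrieval does not need anything exotic. The sketch below assumes each stored example carries a hand-assigned effect label and a window type, then picks by effect first and window placement second. The class, the field names, and the library entries are hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class SolvedExample:
    effect: str   # e.g. "temporary shutdown", "temporary uplift", "hard cap"
    window: str   # "partial" or "full" placement in the forecast horizon
    text: str     # the solved example as it would appear in the prompt

def pick_example(
    library: list[SolvedExample], effect: str, window: str
) -> SolvedExample | None:
    """Prefer same effect and same window; fall back to same effect only."""
    same_effect = [ex for ex in library if ex.effect == effect]
    for ex in same_effect:
        if ex.window == window:
            return ex
    return same_effect[0] if same_effect else None

# Made-up library: note that matching ignores surface vocabulary entirely.
LIBRARY = [
    SolvedExample("temporary uplift", "partial", "Summer promotion, evenings only"),
    SolvedExample("temporary shutdown", "full", "Week-long summer plant shutdown"),
    SolvedExample("temporary shutdown", "partial", "Holiday traffic drop on one day"),
    SolvedExample("hard cap", "full", "Regulated price ceiling"),
]

# A one-day store closure retrieves the holiday example, not a 'summer' one.
match = pick_example(LIBRARY, effect="temporary shutdown", window="partial")
```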

Cost matters here too. The paper reports that IC-DP is only about 1.1–1.5× more expensive in USD than Direct Prompting for the main API models, with similar runtime. That is still extra cost, but it is not the same as fine-tuning, model retraining, or calling the largest model for every forecast. IC-DP is a memory substitute: not permanent model memory, but operational memory embedded in the prompt.

RouteDP: “use the big model” becomes a budget dial

RouteDP addresses the issue every production team meets after the demo: cost. Larger models perform better on average, but using them for every forecast is wasteful when many tasks are easy. RouteDP asks a router model to score task difficulty from 0 to 1 using the history and context. The hardest $k$ tasks go to a large model; the rest stay with the small model.
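
The routing step itself is a top-k selection over difficulty scores. A minimal sketch, assuming the router's scores are already available as a dictionary (the scores and task names below are made up, and rounding the budget down is a choice of this sketch):

```python
from __future__ import annotations
import math

def route_tasks(difficulty: dict[str, float], budget_fraction: float) -> dict[str, str]:
    """Send the hardest share of tasks to the large model, the rest to the small one.

    difficulty: task id -> router score in [0, 1], higher means harder
    budget_fraction: share of tasks allowed to use the large model
    """
    if not 0.0 <= budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be between 0 and 1")
    k = math.floor(budget_fraction * len(difficulty))  # never exceed the budget
    hardest = set(sorted(difficulty, key=difficulty.get, reverse=True)[:k])
    return {task: ("large" if task in hardest else "small") for task in difficulty}

# Made-up router scores for five tasks.
scores = {"t1": 0.10, "t2": 0.85, "t3": 0.30, "t4": 0.55, "t5": 0.20}
assignment = route_tasks(scores, budget_fraction=0.2)
```

With a 20% budget over five tasks, only the single highest-scoring task (`t2`) is escalated.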

In the main setup, the large model is Llama-3.1-405B, while Qwen models serve as small models and routers. The paper compares learned routing against random routing and an ideal upper bound. With Qwen2.5-0.5B as both the small model and its own router, routing only 20% of tasks to the large model improves average RCRPS from 0.592 to 0.316: a 46.6% improvement over the small-model baseline. The router also captures 66% of the area between random and ideal routing.
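
The headline percentage follows directly from the two reported scores. Since lower RCRPS is better, the improvement is the relative reduction from the small-model baseline:

```latex
\[
\text{relative improvement}
= \frac{0.592 - 0.316}{0.592}
= \frac{0.276}{0.592}
\approx 0.466 = 46.6\%
\]
```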

That result is operationally important because it turns model scale into a controllable spend curve. You no longer need to choose between “cheap but weak” and “strong but expensive” as a binary decision. You can route 20%, 40%, 60%, or whatever your service-level agreement and budget tolerate.
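
Treating the routed share as a dial suggests a simple selection rule: sweep a few budgets on a validation set and keep the smallest one that meets the accuracy target. The sketch below assumes the sweep has already been run; the scores in `sweep` are placeholders, not results from the paper.

```python
from __future__ import annotations

def smallest_budget(scores_by_fraction: dict[float, float], target: float) -> float | None:
    """Return the smallest routed fraction whose score meets the target.

    scores_by_fraction: routed fraction -> validation score (lower is better)
    Returns None when no tested budget reaches the target.
    """
    for fraction in sorted(scores_by_fraction):
        if scores_by_fraction[fraction] <= target:
            return fraction
    return None

# Placeholder sweep values for illustration only.
sweep = {0.0: 0.60, 0.1: 0.45, 0.2: 0.32, 0.4: 0.27, 1.0: 0.20}
chosen = smallest_budget(sweep, target=0.35)
```

Here the rule settles on routing 20% of tasks. A latency or cost ceiling could be added as a second filter in the same loop.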

There are two subtleties worth preserving. First, each model tends to be its own best router, at least in the tested setup. Qwen2.5-0.5B is especially effective at routing for itself. That suggests routing quality may depend on the model’s own uncertainty or task perception rather than a universal difficulty scale. Second, routing helps the smallest models most because they have the largest performance gap to close. The paper reports that routing 20% of tasks improves Qwen2.5-0.5B by 46.6%, while Qwen2.5-14B improves by 16.6% under comparable best-router conditions.

The appendix tests RouteDP beyond the main Direct Prompting setup. Route scores can also be used with IC-DP and CorDP downstream, and the paper observes similar improvements across several models. This is best read as an exploratory integration result: promising, practically relevant, but not a fully optimised production router. Still, the implication is strong. Difficulty scoring can sit above the forecasting method. It can decide which cases deserve expensive treatment, regardless of whether the downstream forecast is produced by DP, IC-DP, or CorDP.

The evidence map: what supports what, and what it does not prove

The paper is large, with many appendix tables and figures. Not every table carries the same inferential weight. For operators, the useful reading is:

| Evidence | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Figure 3, FxDP categories | Main diagnostic evidence | Models can explain context yet fail to apply it; the Execution Gap is real in this benchmark. | That explanations are sufficient for reliable forecasting. |
| Human/LLM judge validation | Evaluation validation | Explanation scoring is reasonably reliable for these tasks. | That LLM judges are safe for all domains or ambiguous business contexts. |
| Threshold analysis from 30–80% | Robustness/sensitivity test | The Execution Gap is not just a 50% threshold artefact. | That the exact gap size transfers to production. |
| Table 1, CorDP aggregate RCRPS | Main accuracy evidence | Forecast correction often beats Direct Prompting and can reach large gains. | That every base forecaster or model benefits. |
| CorDP task-group tables | Subgroup analysis | Median and SampleWise variants suit different task shapes. | That the paper has found a universal variant-selection rule. |
| Figure 4 and Table 11, IC-DP | Main accuracy evidence | One effect-matched example improves most tested models. | That arbitrary examples or keyword-matched examples will help. |
| Table 2 and Figure 5, RouteDP | Main efficiency evidence | Routing hard tasks to a large model can capture much of the attainable gain. | That zero-shot routing is production-optimal. |
| IC-CorDP and Route-with-other-methods | Exploratory integration | The strategies can combine. | That all combinations have been exhaustively tuned. |
| Cost and latency appendix | Implementation detail | IC-DP and CorDP can be near DP-like in cost/time under reported settings; routing creates a compute dial. | That costs remain stable under different vendors, token prices, or longer business context. |

This map is important because papers often tempt readers into treating every appendix result as another claim. Here, the appendices mostly clarify robustness, task-type behaviour, implementation cost, and method combination. They do not eliminate the need to validate on your own forecast portfolio.

How to translate the paper into an enterprise forecasting stack

The practical architecture implied by the paper is modular:

  1. Keep the numerical forecaster. Use the model your team already trusts as the baseline. If it is bad, fix that first. No LLM correction layer wants to inherit a haunted spreadsheet.

  2. Add context parsing and task labelling. Identify whether the context implies a partial-window event, full-horizon shape change, hard constraint, historical artefact, recurring seasonal event, or irrelevant note.

  3. Run FxDP for diagnostic samples. Do not run expensive explanations on every task unless auditability requires it. Use FxDP on model evaluations, post-mortems, vendor comparisons, and anomalous forecast windows.

  4. Pilot CorDP where the base forecast is strong. Test Median-CorDP and SampleWise-CorDP separately. Choose by task class, not by vibes. Track RCRPS or an equivalent proper score on the context-sensitive windows.

  5. Build an effect-matched example library. For IC-DP, examples should match the forecast effect, not just the vocabulary. One good example can be worth more than a longer instruction paragraph. Shocking development: examples teach.

  6. Introduce RouteDP as a budget controller. Score task difficulty and route only the hardest cases to the larger model. Sweep routing budgets: 10%, 20%, 40%, and so on. Choose the smallest escalation rate that meets accuracy and latency targets.

  7. Monitor calibration, not only point error. The paper uses probabilistic forecasting. If a method improves median accuracy while destroying interval calibration, the dashboard may look cleaner while risk management gets worse. This is how organisations discover “accuracy” was not the only thing they cared about.
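
For the calibration point in step 7, one simple check is empirical interval coverage: how often the realised value falls inside the central prediction interval built from the forecast samples. This is a minimal sketch with made-up data; it is a generic calibration check, not a metric taken from the paper.

```python
import numpy as np

def interval_coverage(paths, truth, level=0.8):
    """Share of timestamps where truth lies inside the central `level` interval.

    paths: (n_samples, horizon) forecast sample paths
    truth: (horizon,) realised values
    A well-calibrated 80% interval should cover roughly 80% of outcomes
    when measured over many timestamps and tasks.
    """
    paths = np.asarray(paths, dtype=float)
    truth = np.asarray(truth, dtype=float)
    tail = (1.0 - level) / 2.0
    lower = np.quantile(paths, tail, axis=0)
    upper = np.quantile(paths, 1.0 - tail, axis=0)
    inside = (truth >= lower) & (truth <= upper)
    return float(inside.mean())

# Made-up check: 101 samples spanning 0..100 at each of 4 steps,
# so the 80% interval at every step is roughly [10, 90].
paths = np.tile(np.arange(0.0, 101.0)[:, None], (1, 4))
truth = np.array([50.0, 5.0, 95.0, 20.0])
coverage = interval_coverage(paths, truth, level=0.8)
```

In this toy case two of the four outcomes fall inside the interval, giving a coverage of 0.5. Tracking the same number before and after adding a correction layer shows whether the layer tightened intervals it had no business tightening.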

The strongest near-term business case is not “LLMs as autonomous forecasters.” It is “LLMs as context-aware correction and allocation layers around existing forecasting systems.” That is less cinematic. It is also more deployable.

Where this evidence stops

The paper is deliberately focused on CiK-style tasks where textual context is essential, relevant, and comparatively well-structured. Real enterprise context is often messier. It may be irrelevant, duplicated, stale, contradictory, political, legally sensitive, or too long for a prompt to handle cleanly. The authors explicitly note that more unconstrained settings, including irrelevant or excessively long context, require further benchmark development.

The benchmark tasks also make the context effect fairly crisp: heat waves, holidays, constraints, maintenance periods, caps, scenario windows. That is useful for controlled evaluation. It is not the whole enterprise world. In real demand planning, context may interact with pricing, competitor actions, channel inventory, weather, campaign spend, and human overrides. The forecast effect may not be known cleanly enough to judge explanation correctness with the same confidence.

The model set is broad, covering Qwen, Llama, GPT, Gemini, and Claude families, but the results are still conditional on particular model versions, prompts, decoding settings, 25 forecast samples, and the CiK evaluation setup. The relative ordering of strategies is more portable than the exact numbers. Treat the 46.6% routing improvement, the CorDP gains of up to 50%, the 14–56% IC-DP gains, and the roughly 75% FxDP full-success rate as evidence of mechanism, not guaranteed procurement math.

Cost claims also need local validation. The paper reports cost and runtime under its experimental infrastructure. Your inference pricing, hosted model throughput, context length, privacy constraints, and audit requirements may move the economics. Especially if your “context” is a 40-page PDF that someone renamed final_final_reallyfinal.pdf.

Conclusion: stop asking one prompt to do four jobs

The paper’s most useful contribution is architectural discipline. It shows that context-aided forecasting with LLMs is not one problem. It is at least four:

  • Can the model understand what the text implies?
  • Can it apply that implication numerically?
  • Can examples teach recurring context patterns cheaply?
  • Can expensive models be reserved for cases that justify them?

FxDP, CorDP, IC-DP, and RouteDP answer those questions separately. That separation is the point. A direct prompt is a convenient baseline, but production forecasting needs more than convenience. It needs diagnostics, correction, memory, and routing.

For business readers, the recommendation is not to replace forecasting systems with LLMs. The recommendation is to surround existing forecasters with context-aware modules. Diagnose with FxDP. Correct with CorDP. Reuse patterns with IC-DP. Spend selectively with RouteDP. The future of LLM forecasting may be less about asking the model for the answer and more about deciding which role the model is allowed to play.

That is not as magical as “the model forecasts everything.” Good. Magic is a poor operating model.

Notes

Cognaptus: Automate the Present, Incubate the Future.


  1. Arjun Ashok, Andrew Robert Williams, Vincent Zhihao Zheng, Irina Rish, Nicolas Chapados, Étienne Marcotte, Valentina Zantedeschi, and Alexandre Drouin, “Beyond Naïve Prompting: Strategies for Improved Context-aided Forecasting with LLMs,” arXiv:2508.09904, https://arxiv.org/abs/2508.09904