Forecasting the Forecast: Why Agentic AI Is Learning to Doubt Itself

Forecasting is where executive optimism goes to be measured.

A sales team says the pipeline is healthy. A policy team says the election risk is manageable. A trading desk says the market has mostly priced in the event. Everyone has a probability. Few people have a disciplined process for updating it.

That is also the problem with many AI forecasters. They can produce a number quickly, sometimes impressively, sometimes with the emotional stability of a quarterly sales forecast. But the harder question is not whether an AI can answer, “What is the probability?” The harder question is whether it can revise that probability as evidence arrives, remember why it changed its mind, and avoid turning a confidence score into decorative typography.

Kevin Murphy’s paper, Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs, is interesting because it treats forecasting less as a prompt-writing problem and more as a state-management problem.¹ The proposed system, the Bayesian Linguistic Forecaster (BLF), is not merely another retrieval-augmented model with a nice leaderboard number. It is a forecasting harness built around three mechanisms: a structured linguistic belief state, multi-trial aggregation in logit space with shrinkage, and hierarchical calibration.

The reported result is strong. On the paper’s main ForecastBench comparison, BLF using Gemini 3.1 Pro reaches an overall Brier Index (BI) of 73.3, ahead of Cassi at 70.8, Grok 4.20 at 70.5, GPT-5 zero-shot at 70.2, a non-LLM crowd-plus-empirical baseline at 69.9, and Foresight-32B at 69.3. But the more useful story is not “AI beats benchmark.” That headline is already tired, and frankly it should be taxed.

The useful story is architectural: BLF shows that better forecasting agents may need less theatrical reasoning and more disciplined bookkeeping.

The system is built around a belief, not a transcript

Most tool-using agents follow a familiar pattern. They receive a question, search the web or call a tool, append results to the context window, search again, append again, then eventually answer. This is context accumulation. It feels reasonable until the context becomes a landfill.

BLF uses a different unit of memory. At each step, the model maintains a semi-structured belief state containing a probability estimate, confidence level, evidence for the event, evidence against it, and open questions to investigate next. The model does not simply collect facts. It updates a working judgment.
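As a minimal sketch of what such a belief state could look like in a Python harness: the class name `BeliefState`, the field names, and the three-level confidence scale are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BeliefState:
    """Hypothetical container for the five fields described above."""

    probability: float  # current estimate that the event resolves "Yes"
    confidence: str = "low"  # assumed scale: "low", "medium", "high"
    evidence_for: List[str] = field(default_factory=list)
    evidence_against: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must lie in [0, 1]")


# Illustrative values only; none of these numbers come from the paper.
belief = BeliefState(
    probability=0.5,
    open_questions=["Is the label rendered dynamically or baked into an image?"],
)
belief.evidence_for.append("Official order mandates the new name")
belief.probability = 0.65
```

The point of the fixed fields is that every update has to say where the probability went and which evidence moved it, rather than leaving that buried in a transcript.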

A simplified view looks like this:

Question
  → initial belief
  → choose action: search / fetch data / inspect source / submit
  → observe evidence
  → update belief state
  → repeat
  → aggregate multiple trials
  → calibrate final probability

That change sounds small. It is not.

In a normal retrieval-augmented forecast, the model’s “memory” is basically a growing pile of retrieved text. The model must rediscover what matters each time it reasons. BLF instead asks the model to carry forward a compact representation of its current view. The belief state becomes the agent’s notebook: not every document, not every search result, but the current probability and the reasons that probability should move.
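The difference from context accumulation can be sketched in a few lines. This is a toy loop under stated assumptions: `run_trial`, `fake_observe`, and `fake_update` are hypothetical names, the belief is a plain dictionary, and the update function is a stub standing in for what would be an LLM call in the real system.

```python
from typing import Callable, Dict

Belief = Dict[str, object]  # assumed keys: "p", "for", "against", "open"


def run_trial(
    belief: Belief,
    observe: Callable[[str], str],
    update: Callable[[Belief, str], Belief],
    max_steps: int = 5,
) -> Belief:
    """Carry one compact belief forward instead of appending every document."""
    for _ in range(max_steps):
        open_questions = belief["open"]
        if not open_questions:
            break  # nothing left to investigate, so submit
        evidence = observe(open_questions[0])
        # The update REPLACES the state; only what it writes into the belief survives.
        belief = update(belief, evidence)
    return belief


def fake_observe(query: str) -> str:
    """Stub for a search or data tool."""
    return f"result for {query}"


def fake_update(belief: Belief, evidence: str) -> Belief:
    """Stub for the model's belief revision; the 0.1 step is made up."""
    return {
        "p": round(belief["p"] - 0.1, 2),
        "for": belief["for"],
        "against": belief["against"] + [evidence],
        "open": belief["open"][1:],
    }


start = {"p": 0.6, "for": [], "against": [], "open": ["q1", "q2"]}
final = run_trial(start, fake_observe, fake_update)
```

Note what the loop never does: it never concatenates raw search results into a growing prompt. Whatever the next step needs has to be written into the belief.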

This is why the paper’s use of the word “Bayesian” needs careful handling. BLF is Bayesian in style, not in the formal sense. The paper is explicit that the belief update is an LLM forward pass, not a likelihood-based posterior calculation. There is no explicit likelihood, no marginalization, and no guarantee that the update satisfies the consistency requirements of Bayesian inference.

That distinction matters. The contribution is not “LLMs now do proper Bayesian inference.” They do not suddenly become statisticians because a JSON field says belief. The contribution is more practical: a structured belief slot can force the model to expose and carry forward the shape of an update. It is a scaffold, not a theorem.

And in enterprise AI, scaffolds are often where the money is.

The agent loop makes forecasting sequential instead of ceremonial

Forecasting is path-dependent. The useful second question depends on what the first search reveals.

Suppose the task is to forecast whether a website will update a politically sensitive geographic label by a certain date. A shallow agent might search the policy change, notice a government order, and raise the probability. A better forecaster asks a narrower second question: does this website use dynamic map labels, or static images that require manual updating?

The paper’s qualitative trace illustrates this exact pattern. In one AIBQ2 example, several BLF trials forecast whether WorldAtlas.com would display “Gulf of America” before July 1, 2025. One trial initially moved toward the event after finding evidence of the executive order and related map-label changes. Later, it discovered that WorldAtlas used static map images, not merely dynamic labels. That evidence changed the relevant mechanism, so the trial cut its probability and correctly forecast “No.”

This example is not the main statistical proof. It is a mechanism demonstration. It shows the kind of thing a belief-state agent can do when it is working properly: not just add more facts, but identify the next uncertainty that actually matters.

The paper’s agent loop supports several action types: web search with date restrictions, reading saved search-result files through a summarizer, URL lookup, source-specific data tools for time-series or Wikipedia-style questions, and final submission. Tool availability is source-dependent. DBnomics questions, for example, bypass the LLM and use a custom KNN model. This is an important detail. BLF is not pretending one agent behavior is universally optimal. Sometimes the right “agent” is a boring statistical tool that does the job and does not ask for applause.

That source-aware design is one reason the paper is more operationally useful than a generic agent demo. Business forecasting systems are rarely one problem. They are a portfolio of problem types: market prices, macro indicators, product launches, regulation, supply-chain incidents, competitor actions, and messy text-only events. A single prompt policy is too blunt. BLF’s meta-controller is still primitive, but the design direction is sensible: choose tools and action policies based on the source and question type.

Multi-trial aggregation treats the model as noisy, which is polite but necessary

A single LLM forecast is not just a forecast. It is one sample from a stochastic process: one search path, one interpretation path, one final number. Anyone who has run the same agent twice knows the feeling. The model can sound equally serious while walking down two different alleys.

BLF runs multiple independent trials per question and combines them. The basic intuition is familiar: if one analyst may miss the key clue, ask several analysts. The less comforting version is that your expensive model is not a deterministic oracle. It is a noisy instrument, so measure more than once.

The paper discusses several aggregation choices. Arithmetic averaging has a clean formal justification under Brier-style scoring: by the usual convexity argument, the averaged probability scores at least as well as the average individual trial. But the paper finds that averaging in logit space works better empirically on ForecastBench.
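For readers who want the convexity argument spelled out, here is one way to write it. The notation is ours, not the paper's: $p_1, \dots, p_K$ are the trial probabilities, $\bar{p}$ is their arithmetic mean, and $y \in \{0, 1\}$ is the outcome. The statement is for the raw squared-error Brier score, where lower is better.

$$ (\bar{p} - y)^2 = \frac{1}{K}\sum_{k=1}^{K} (p_k - y)^2 - \frac{1}{K}\sum_{k=1}^{K} (p_k - \bar{p})^2 \le \frac{1}{K}\sum_{k=1}^{K} (p_k - y)^2 $$

The subtracted term is the variance of the trials, so the averaged forecast gains exactly as much as the trials disagree. The inequality holds for every outcome, and therefore in expectation. It says nothing about whether some other aggregation rule does better, which is where the empirical logit-space result comes in.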

The logit transform is:

$$ \operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right) $$

Averaging in logit space preserves more extremity when independent trials agree. It also handles failed or neutral trials more gracefully. If four trials strongly lean one way and one trial falls back to 0.5 because of a timeout, probability-space averaging drags the result toward the center in proportion to that one failed trial. Logit-space averaging maps 0.5 to a logit of zero, so the failed trial dilutes the average log-odds but pulls the final probability less far toward the center.
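A small numeric check makes the timeout case concrete. This is a sketch with made-up trial values (four trials at 0.9, one fallback at 0.5); the function names and the clipping constant `eps` are our assumptions, not the paper's code.

```python
import math
from typing import Sequence


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mean_prob(ps: Sequence[float]) -> float:
    """Plain arithmetic average of probabilities."""
    return sum(ps) / len(ps)


def mean_logit(ps: Sequence[float], eps: float = 1e-6) -> float:
    """Average in log-odds space, then map back to a probability."""
    clipped = [min(max(p, eps), 1.0 - eps) for p in ps]  # avoid log(0)
    return sigmoid(sum(logit(p) for p in clipped) / len(clipped))


trials = [0.9, 0.9, 0.9, 0.9, 0.5]  # hypothetical: last trial timed out
print(round(mean_prob(trials), 3))   # 0.82
print(round(mean_logit(trials), 3))  # 0.853
```

The gap here is about three points of probability. It grows as the agreeing trials become more extreme, which is exactly the regime where a single failed trial would otherwise do the most damage.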

The system also explores shrinkage. When trial forecasts disagree strongly, BLF can shrink the aggregated forecast toward a prior: the crowd estimate for market questions, an empirical base rate for dataset questions, or 0.5 when no useful prior is available. This is the right instinct. Disagreement among trials is not an inconvenience to hide; it is information about uncertainty.
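One way to turn that instinct into code is to let the spread of the trial logits set how hard the aggregate is pulled toward the prior. The weighting rule below is an illustrative assumption, not the paper's formula, and `shrink_to_prior` and `strength` are hypothetical names.

```python
import math
from statistics import mean, pstdev
from typing import Sequence, Tuple


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def shrink_to_prior(
    trials: Sequence[float], prior: float, strength: float = 1.0
) -> Tuple[float, float]:
    """Return (shrunk probability, disagreement in logit units).

    Assumes every trial and the prior lie strictly between 0 and 1.
    """
    logits = [logit(p) for p in trials]
    spread = pstdev(logits)  # disagreement across trials
    weight = 1.0 / (1.0 + strength * spread)  # equals 1.0 when trials agree exactly
    blended = weight * mean(logits) + (1.0 - weight) * logit(prior)
    return sigmoid(blended), spread


# Made-up trial values for illustration.
agree, agree_spread = shrink_to_prior([0.8, 0.8, 0.8, 0.8, 0.8], prior=0.5)
split, split_spread = shrink_to_prior([0.95, 0.9, 0.2, 0.6, 0.3], prior=0.5)
```

Returning the spread alongside the probability is deliberate: downstream rules can then treat a confident consensus and a noisy average differently even when the headline number is similar.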

For business use, this is the transferable lesson: do not just aggregate answers. Aggregate disagreement.

A forecasting agent should expose whether it found five independent paths to the same probability or five charming routes to confusion. The first supports action. The second supports caution, escalation, or more data collection. Same average probability, different management meaning.

Calibration is where the forecast becomes usable

A forecast that says 80% should behave like 80% over time. That sentence is boring, so naturally many AI products ignore it.

BLF applies Platt scaling as a post-processing calibration step. On ForecastBench, the paper uses hierarchical Platt scaling with per-source intercept offsets. The reason is straightforward: different sources have different base rates and different error patterns. A global calibration curve may over-correct one source while under-correcting another.

This is especially visible in the paper’s calibration ablations. For the full BLF system, calibration has limited marginal effect, because the agent is already relatively strong. For zero-shot baselines, hierarchical calibration matters much more. In the reported ForecastBench ablation, hierarchical calibration improves the zero-shot system with crowd and empirical priors by 2.7 overall BI, driven largely by a 4.9 BI gain on dataset questions. Global Platt scaling can over-shrink predictions derived from strong empirical priors, while hierarchical calibration preserves source-specific structure.
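The shape of a hierarchical Platt correction can be sketched as a shared slope and intercept on the logit, plus an intercept offset per source. The exact parameterization is our assumption based on the description above, and every parameter value below is made up; in practice they would be fit on resolved historical questions.

```python
import math
from typing import Dict


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def calibrate(
    p: float, source: str, slope: float, intercept: float, offsets: Dict[str, float]
) -> float:
    """Shared slope and intercept, plus a per-source intercept offset."""
    shift = offsets.get(source, 0.0)  # unseen sources fall back to the global curve
    return sigmoid(slope * logit(p) + intercept + shift)


# Hypothetical offsets: one source whose events resolve "Yes" less often.
offsets = {"market": 0.0, "dataset_low_base_rate": -0.8}

same = calibrate(0.7, "market", slope=1.0, intercept=0.0, offsets=offsets)
lower = calibrate(0.7, "dataset_low_base_rate", slope=1.0, intercept=0.0, offsets=offsets)
```

With a single global curve, the only way to fix the low-base-rate source would be to move the intercept for everyone. The per-source offset lets the correction stay local.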

That is a practical warning. Enterprise AI teams love attaching confidence scores to outputs. But an uncalibrated confidence score is not risk management. It is decoration with decimals.

For decisions such as credit monitoring, supplier risk, compliance alerts, market event forecasts, or demand planning, calibration is not a final polish. It is the difference between a probability that can enter a decision rule and a probability that merely looks mature in a dashboard.

The main evidence says BLF is better, but the baseline is stronger than the slogan

The paper’s headline comparison is based on two ForecastBench tranches: one with forecast due date October 26, 2025, and another with forecast due date November 9, 2025. Together they include 400 questions: 200 market questions and 200 dataset questions. Because dataset questions can have multiple resolution dates, the evaluation contains 791 binary resolution events.

The main Brier Index comparison is:

Method	Market BI	Dataset BI	Overall BI	Interpretation
BLF (Pro)	83.8	62.7	73.3	Best point estimate across all three columns
Cassi	82.0	59.6	70.8	Strong leaderboard method, but lower dataset performance
GPT-5 zero-shot	80.4	60.1	70.2	Strong one-shot baseline with crowd signal
Grok 4.20	79.5	61.4	70.5	Point estimate below BLF, but comparison is partly underpowered
Foresight-32B	81.0	57.6	69.3	Competitive trained forecaster, weaker on this dataset subset
Crowd + empirical prior, no LLM	81.5	58.3	69.9	The annoying baseline that refuses to be weak

Two interpretations matter.

First, BLF’s overall score is not just a leaderboard ornament. The paper reports paired bootstrap comparisons and finds BLF significantly better than Cassi, GPT-5, and Foresight on overall BI. It also significantly beats the no-LLM crowd-plus-empirical baseline overall, with the dataset subset contributing a clear part of that gap.

Second, the market side is more subtle. The crowd signal is extremely strong for market questions. The no-LLM baseline reaches 81.5 market BI. BLF reaches 83.8. That is directionally better, but the paired test against the baseline on market questions does not produce the same clean statistical story as the dataset side. The only market gap against external methods that reaches significance is against Foresight.

This is not a weakness of the paper. It is a useful correction to lazy interpretation. In markets, the crowd is not a straw man. If a forecasting agent claims to improve market predictions, it must beat a price-like prior that already contains substantial information. BLF appears to add value, but the evidence is more convincing overall and on dataset questions than as a sweeping “AI beats markets” claim. Fortunately, the paper does not need that slogan. We have enough bad slogans already.
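For teams that want to run the same kind of check on their own backtests, a paired bootstrap over per-question scores is short to write. This is a generic sketch, not the paper's procedure: the resampling unit, the number of resamples, and the synthetic scores are all assumptions, and scores are taken to be higher-is-better.

```python
import random
from typing import List, Tuple


def paired_bootstrap(
    a: List[float], b: List[float], n_resamples: int = 2000, seed: int = 0
) -> Tuple[float, float]:
    """Return (mean of a - b, share of resamples in which a fails to beat b)."""
    if len(a) != len(b):
        raise ValueError("scores must be paired question by question")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0.0:
            not_better += 1
    return observed, not_better / n_resamples


# Synthetic per-question scores, for illustration only.
scores_a = [0.70 + 0.01 * (i % 5) for i in range(40)]
scores_b = [0.68 + 0.01 * ((3 * i) % 5) for i in range(40)]
gap, share_not_better = paired_bootstrap(scores_a, scores_b)
```

Pairing matters because both systems face the same questions. Resampling the differences removes question difficulty from the comparison, which is why a small gap on a strong baseline can still be detectable, and why a larger gap on a noisy subset may not be.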

The ablations show a stack, not one magic component

The paper’s ablation work is valuable because it separates several possible explanations. BLF could be better because it searches, because it tracks beliefs, because it averages trials, because it calibrates, because it uses a strong base model, or because the comparison happens to be favorable. The authors test these possibilities across several configurations and base LLMs.

The useful reading is not “every component always helps equally.” It does not. The evidence is more conditional:

Component or test	Likely purpose	What it supports	What it does not prove
ForecastBench SOTA comparison	Main evidence	BLF has the best reported point estimate and significant overall gains against several public methods	It does not prove live forecasting dominance across all domains
Belief-state ablations	Ablation	Structured belief tracking can improve a search-enabled agent beyond raw text accumulation	It does not prove the update is formally Bayesian
Multi-trial aggregation tests	Ablation and variance control	More trials improve reliability; logit-space aggregation is empirically useful	It does not prove logit averaging is theoretically optimal
Calibration ablations	Calibration/sensitivity test	Hierarchical calibration especially helps weaker or zero-shot regimes with source-specific base rates	It does not mean calibration is the largest marginal gain for full BLF
Leakage audit	Backtesting validity check	The evaluation makes a serious attempt to prevent future information leakage	It does not eliminate all residual leakage risk
Cross-LLM analysis	Robustness/exploratory extension	The harness helps some models more than others, especially Kimi and Pro/Flash	It does not identify the exact causal mechanism inside the model
Ensemble analysis	Exploratory negative result	Same-harness LLM ensembles may lack enough diversity to help	It does not rule out more diverse ensemble designs

The cross-LLM results are especially useful for enterprise builders. In the full crowd/empirical-prior regime, BLF with Pro reaches 73.3 overall BI. BLF with Flash reaches 72.1, Kimi reaches 72.0, GPT-5 reaches 71.5, and Sonnet reaches 71.3. Kimi is particularly interesting: its NoBel baseline (the variant without the structured belief state) is much weaker, but the BLF scaffold lifts it substantially. In one reported regime, Kimi improves from 65.8 to 72.0 overall BI, a 6.2-point gain.

That is not merely a model ranking story. It suggests that a good harness can partially compensate for a weaker or cheaper model, especially when the model otherwise struggles with tool discipline or submission behavior. The paper reports that Kimi’s NoBel mode successfully submitted on only 6.2% of market questions, while the structured belief-state mode raised that to 82.4%. In plainer language: the scaffold helped the model finish the job instead of wandering around the tool loop like a consultant looking for the meeting room.

The gains are not uniform. BLF helps Pro and Flash significantly, but the improvements for Sonnet and GPT-5 do not reach statistical significance. The paper’s hypothesis is that the belief state only helps if the model actually uses the belief slots as reasoning operands rather than treating them as write-only logs. A consistency check shows models do read the final belief at submission, but that does not prove the per-step updates are well grounded. The authors suggest a stronger future test: overwrite the belief mid-loop with a counterfactual value and observe whether later searches and final forecasts move accordingly.

That would be a genuinely useful intervention. It asks whether the belief state is causally steering the agent or merely documenting what the agent would have done anyway.

Market questions reward investigation; dataset questions reward the right tool

The paper repeatedly shows an asymmetry between market-style questions and dataset questions.

Market questions are judgmental. They often require reading incentives, institutional constraints, news context, and ambiguous public signals. Sequential search helps because the next useful query depends on the last clue. A structured belief state also helps because the agent must keep track of competing evidence and unresolved questions.

Dataset questions are different. Many are transformed time-series problems: will a value from yfinance, FRED, DBnomics, Wikipedia, or ACLED exceed a reference value at future dates? Sometimes the best approach is not a long reasoning chain. It is retrieving the correct historical series, using a simple model, and avoiding leakage. For DBnomics, the paper bypasses the LLM entirely and uses a KNN model. For yfinance, the paper notes that all methods are essentially at chance, consistent with the random-walk nature of short-horizon stock-price forecasting.

This is a major business lesson. “Agentic” should not mean “let the model talk longer.” It should mean “route the problem to the right procedure.”

For a regulatory event, the agent may need search, source inspection, and belief revision. For a commodity price threshold, it may need a time-series model and calibrated uncertainty. For a contract-risk alert, it may need document retrieval and source-specific rules. The intelligent part is not always the LLM. Sometimes the intelligent part is knowing when the LLM should stop being the hero.
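In code, that routing can be as plain as a lookup table. The sketch below is hypothetical throughout: the handler names and return strings are placeholders, and the only mapping taken from the paper is that DBnomics questions skip the LLM in favor of a KNN model.

```python
from typing import Callable, Dict


def knn_forecast(question: str) -> str:
    """Placeholder for a statistical model that never calls the LLM."""
    return "statistical model, no LLM"


def agent_forecast(question: str) -> str:
    """Placeholder for the search-and-revise agent loop."""
    return "LLM agent loop with search"


# Source-specific overrides; anything unlisted goes to the agent loop.
ROUTES: Dict[str, Callable[[str], str]] = {
    "dbnomics": knn_forecast,
}


def route(source: str, question: str) -> str:
    handler = ROUTES.get(source.lower(), agent_forecast)
    return handler(question)
```

The table is boring on purpose. Adding a new problem type means adding a row and a handler, not rewriting a prompt and hoping.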

The leakage controls are part of the contribution, not an appendix chore

Backtesting future-event forecasts is dangerous because the future has already happened. If the model sees post-resolution information through web search, tool calls, or parametric memory, the benchmark becomes clairvoyance with extra steps.

The paper treats this problem seriously. Its backtesting framework uses four layers of defense:

search-engine date filtering;
LLM-based filtering of search results that may contain post-cutoff information;
algorithmic date clamping for data tools;
URL blocking for resolution sources that might reveal the answer.
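Two of these layers are simple enough to sketch directly. The helpers below are assumptions about how date clamping and URL blocking might be written, not the paper's implementation; treating the cutoff as inclusive is our choice, and `example.com` is a placeholder host.

```python
from datetime import date
from typing import List, Set, Tuple
from urllib.parse import urlparse


def clamp_series(
    series: List[Tuple[date, float]], cutoff: date
) -> List[Tuple[date, float]]:
    """Drop every observation dated after the forecast cutoff (cutoff kept)."""
    return [(d, v) for d, v in series if d <= cutoff]


def is_blocked(url: str, blocked_hosts: Set[str]) -> bool:
    """True if the URL's host is a blocked host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in blocked_hosts)


# Made-up series values; the cutoff matches one of the forecast due dates.
series = [
    (date(2025, 10, 24), 101.2),
    (date(2025, 10, 26), 100.7),
    (date(2025, 10, 27), 103.9),  # after the cutoff, must not reach the model
]
visible = clamp_series(series, cutoff=date(2025, 10, 26))
blocked = is_blocked("https://data.example.com/series/42", {"example.com"})
```

These two are deterministic, which is their appeal. The search-side layers are the ones that need a classifier and therefore an audit.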

The post-hoc audit reports that among search results the agent actually saw in ForecastBench tranche A, 21 out of 1,377 kept results were classified as leaks, an undetected leakage rate of about 1.5%. The runtime filter caught 320 of 341 leaks but also dropped 577 clean results, a false-positive cost the authors accept for stronger leakage defense.
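The reported counts are enough to recompute the filter's operating point. The recall and precision figures below are derived here from those counts, not quoted from the paper, and they treat the audit's leak labels as ground truth.

```python
# Counts as reported for ForecastBench tranche A.
leaks_total = 341
leaks_caught = 320
kept_results = 1377
clean_dropped = 577

leaks_missed = leaks_total - leaks_caught  # 21
residual_rate = leaks_missed / kept_results  # share of kept results that leak
filter_recall = leaks_caught / leaks_total  # share of leaks the filter removed
filter_precision = leaks_caught / (leaks_caught + clean_dropped)  # share of drops that were leaks

print(f"{residual_rate:.1%} {filter_recall:.1%} {filter_precision:.1%}")
# 1.5% 93.8% 35.7%
```

Read together, the numbers describe a filter tuned for recall: it discards nearly two clean results for every leak it catches, in exchange for letting very few leaks through.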

This is not glamorous, but it is exactly the kind of work that makes an evaluation worth reading. Without leakage control, a forecasting benchmark rewards systems that accidentally read tomorrow’s newspaper. That may be useful for a fictional hedge fund, but it is less useful for product design.

The paper also checks parametric leakage by using models whose knowledge cutoffs precede the ForecastBench tranches. On AIBQ2, Kimi’s cutoff creates two detected leakage cases; the authors discuss them rather than burying them. Good. Benchmarks do not become trustworthy because researchers sound confident. They become trustworthy when the boring failure modes are inspected.

What Cognaptus would infer for business forecasting systems

The paper directly shows that BLF performs strongly on a specific backtested binary forecasting benchmark, with gains supported by ablations and paired analysis. That is the evidence.

The business interpretation is broader but should be stated as inference, not as proof. For enterprise forecasting agents, BLF suggests four design principles.

Design principle	Operational consequence	ROI relevance	Boundary
Maintain an explicit belief state	The agent carries forward probability, evidence, counter-evidence, and open questions	Reduces repeated reasoning and makes forecasts auditable	Works only if the model actually uses the state, not merely writes it
Run multiple independent trials	The system can measure disagreement across search and reasoning paths	Supports escalation rules and uncertainty-aware decisions	Costs more tokens and time
Aggregate and shrink disagreement	Noisy or conflicting trials are pulled toward a defensible prior	Avoids overconfident decisions based on one lucky or unlucky run	Requires a meaningful prior or enough historical backtest data
Calibrate by source type	Probabilities become more usable across heterogeneous domains	Enables thresholds, alerts, and portfolio-level risk controls	Needs labeled historical outcomes and ongoing recalibration

For a business, the immediate application is not “replace the strategy team with an AI forecaster.” Please do not. The immediate application is to build forecasting workflows where the AI leaves a structured trail: what it believed, why it moved, what evidence contradicted it, how much independent trials disagreed, and how calibrated the resulting probabilities have been historically.

That trail is valuable even when the forecast is wrong. Especially then.

A wrong forecast with a readable belief path can be audited. Was the prior bad? Did the search miss a source? Did the agent over-weight one document? Did calibration fail for this event class? A wrong forecast with only a final paragraph is just a confident intern, now with an API bill.

The limits are not decorative; they define the deployment boundary

BLF is restricted to binary outcomes in the reported work. Many business decisions require categorical, numerical, or continuous forecasts: demand volumes, price ranges, churn rates, project completion dates, cash-flow distributions. The paper discusses these as future work, but they are not solved here.

The evaluation is also backtested. The leakage controls are careful, but live forecasting is still the real test. Live settings introduce delayed resolution, changing data availability, adversarial information, tool failures, and feedback loops. Backtesting can rank designs, but it does not fully simulate operational pressure.

The system is also not cheap in the casual sense. The paper reports that agentic methods consume roughly 50–100 million tokens across five trials on the full dataset, with a Gemini 3.1 Pro full-method evaluation costing about $250 under the stated pricing and caching assumptions. A single question takes one to eight minutes depending on API latency and loop length. That may be perfectly acceptable for high-value strategic forecasts. It is not appropriate for every real-time alert.

Finally, the strongest practical results depend on source-specific tools, priors, and benchmark structure. The crowd signal is powerful for market questions. Empirical priors are powerful for some dataset sources with extreme base rates. A company deploying this architecture in procurement, compliance, sales, or geopolitical risk would need its own outcome database, calibration regime, source taxonomy, and monitoring loop. Copying the prompt is not the product.

The real lesson is doubt with memory

BLF’s most important idea is not that AI should be uncertain. That is easy. Many models are uncertain in the same way a weather app is uncertain when it says 50% and then it rains anyway.

The better idea is that uncertainty should have memory.

A useful forecasting agent should know what it believed before the latest evidence, why the evidence moved the probability, which uncertainty remains unresolved, whether independent trials agree, and whether its past 80% forecasts have behaved like 80% forecasts. That is not a larger context window. That is a different discipline.

The paper’s results are strong enough to take seriously, but not broad enough to turn into prophecy. BLF is a backtested binary forecaster with careful controls, useful ablations, and a very clear architectural message. It does not prove that agentic AI can see the future. It shows that if we insist on asking AI to forecast, we should at least make it keep score, remember its reasons, and calibrate its confidence before sending the number upstairs.

A modest proposal, yes. Also, apparently, progress.

**Cognaptus: Automate the Present, Incubate the Future.**

¹ Kevin Murphy, “Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs,” arXiv:2604.18576v3, 2026. https://arxiv.org/abs/2604.18576
