TL;DR for operators

FutureX is less interesting as a leaderboard and more interesting as an operating model for evaluating AI agents that claim to forecast the future. The benchmark runs a live loop: collect future-facing questions from curated web sources, ask agents to predict before the answer exists, wait for resolution, crawl the answer, and score the prior prediction. That matters because most “forecasting” evaluations are either historical backtests with leakage risk or static datasets quietly ageing into trivia.

The paper’s headline result is sober. Search-and-reasoning agents do better overall than base LLMs, and Grok-4 leads the reported overall ranking, followed by Gemini-2.5-flash Deep Research, GPT-o4-mini with Think&Search, and Seed1.6 from Doubao. But the hardest tasks — open-ended, high-volatility future events — remain ugly. That is the point. A benchmark that flatters every model is not a benchmark; it is a brochure with axis labels.

For business teams, FutureX is a useful template for procurement and internal evaluation. Do not ask only “which model scored highest?” Ask: Can the agent search fresh sources? Does it distinguish reliable from fake information? Does more tool use improve results or just create theatre? Does performance survive domain shifts? Does the evaluation window match the decision window? The paper does not prove agents can replace analysts. It shows how to start measuring whether they deserve to sit anywhere near the analyst workflow.

Forecasting agents do not need a crystal ball; they need a calendar

Forecasting is where AI demos go to become management problems.

A chatbot can summarise last quarter’s earnings transcript, produce a competitor memo, or explain why a stock moved yesterday. That is useful, but it is still mostly retrospective. The harder commercial question is forward-looking: what will happen next week, which signal will move, which ranking will change, which release will matter, which event will resolve in a way that affects a decision?

That is precisely the terrain of FutureX, a live benchmark for LLM agents in future prediction [1]. Its design choice is simple and brutal: the answer must not exist when the model predicts. No quiet answer-key leakage. No historical web search polluted by hindsight. No “forecasting” task where the model can accidentally retrieve the resolved outcome and pretend it had foresight. The benchmark turns evaluation into a scheduled operation.

That is why the mechanism matters more than the leaderboard. FutureX is not merely asking, “Which model is best?” It is asking whether modern agents can operate inside a live information cycle: gather fresh evidence, form a prediction, wait for reality, and then be judged.

Very rude of reality to insist on arriving after the slide deck.

The core innovation is the evaluation loop

FutureX starts by building a source base. The authors begin with 2,008 candidate websites, use LLM-assisted screening to narrow that pool, then manually review the remainder to arrive at 195 curated websites. The sources span prediction markets, news sites, entertainment rankings, government data, and real-time data platforms. From these sources, the benchmark constructs future events across domains such as politics, sports, crypto, culture, finance, business, technology, weather, health, science, and space.

The daily loop has four stages:

| Stage | What happens | Why it matters operationally |
| --- | --- | --- |
| Event database construction | Curate source websites and maintain the source pool | Evaluation quality starts with source quality, not model glamour |
| Future event curation | Convert live source material into prediction questions and filter out easy, harmful, or subjective events | Forecasting tasks need hygiene, otherwise the benchmark becomes noise collection |
| Agent daily prediction | Run models before events resolve, with a time budget per question | The agent must act under temporal constraint, like a real workflow |
| Answer acquisition | Crawl resolved events later, extract outcomes, and score previous predictions | Evaluation becomes prospective rather than retrospective |

This is the paper’s real contribution. The model results are downstream of the machine that produces them.
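The loop is easier to reason about once it is written down as a scheduler. The sketch below is a minimal illustration of the last two stages (predict before resolution, score after), not the authors' pipeline. The `Event` record, the `run_day` function, and the `predict` and `fetch_outcome` callables are hypothetical names introduced here, and exact-match scoring stands in for the benchmark's real metrics.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Optional, Tuple


@dataclass
class Event:
    """Hypothetical record for one future-facing question."""
    question: str
    resolves_on: date
    prediction: Optional[str] = None
    predicted_on: Optional[date] = None
    outcome: Optional[str] = None


def run_day(
    today: date,
    events: List[Event],
    predict: Callable[[str], str],
    fetch_outcome: Callable[[str], Optional[str]],
) -> List[Tuple[Event, float]]:
    """One daily pass: predict on open events, then resolve the due ones.

    Returns (event, score) pairs for events scored on this pass.
    """
    scored = []
    for ev in events:
        if ev.prediction is None and today < ev.resolves_on:
            # Agent daily prediction: only allowed while the answer does not exist.
            ev.prediction = predict(ev.question)
            ev.predicted_on = today
        elif ev.outcome is None and today >= ev.resolves_on:
            # Answer acquisition: a failed crawl returns None and is retried next pass.
            ev.outcome = fetch_outcome(ev.question)
            if ev.outcome is not None and ev.prediction is not None:
                # Exact match is a placeholder for the real scoring rules.
                scored.append((ev, 1.0 if ev.prediction == ev.outcome else 0.0))
    return scored


# Illustrative run with made-up data.
events = [Event("Who wins the final?", date(2025, 8, 3))]
run_day(date(2025, 7, 28), events, lambda q: "Team A", lambda q: None)
results = run_day(date(2025, 8, 4), events, lambda q: "Team B", lambda q: "Team A")
```

Note what the second call cannot do: it cannot overwrite the stored prediction with "Team B". The prediction is frozen once made, and that one constraint is what separates a forecast from a lookup.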

The authors report around 500 events per week, 70–100 high-quality events per day for evaluation, and 1,272 events in the July 20 to August 3 evaluation window. The answer acquisition pipeline crawls resolved events multiple times per day and reaches a reported success rate above 97% in the stable online version. That last number is not decorative. In a live benchmark, answer availability is the difference between an evaluation system and a pile of unresolved prompts.

The benchmark also adopts a one-week prediction window. This is a practical compromise. A shorter window gives faster feedback but fewer and noisier events. A longer window gives more coverage but slows learning and increases operational pressure. One week is not universally optimal, but it is a credible default for measuring whether an agent can work with non-trivial uncertainty without turning evaluation into archaeology.

FutureX separates lookup, search, and actual forecasting

A useful part of FutureX is its four-tier design. The tiers are not just difficulty labels. They correspond to different kinds of agent competence.

| Tier | Event type | What it mainly tests | Business analogue |
| --- | --- | --- | --- |
| Level 1: Basic | Few-choice questions | Lightweight selection from limited options | Simple directional calls or templated classification |
| Level 2: Wide Search | Many-choice questions with multiple correct answers | Exhaustive discrimination without false positives | Compliance lists, vendor screening, incident triage |
| Level 3: Deep Search | Open-ended, low-volatility events | Multi-step source navigation and synthesis | Analyst research where the answer is discoverable but not pre-packaged |
| Level 4: Super Agent | Open-ended, high-volatility events | Forecasting under ambiguity and moving signals | Market, geopolitical, product, or demand forecasting where facts do not sit still |

This tiering prevents a common evaluation mistake: averaging together tasks that require different abilities. A model that does well on Level 1 may be good at selecting plausible options. That does not mean it can forecast a volatile numerical target. A model that searches aggressively may improve on Level 3 but still fail on Level 4 because the world, annoyingly, is not a database.

The scoring scheme reinforces this separation. Single-choice events use simple 0/1 accuracy. Multi-choice events use F1 because missing a valid answer and adding a wrong one are both costly. Open-ended ranking events award full credit for exact ordered matches and partial credit for overlap. Numerical events are scored relative to recent volatility, using the past seven-day standard deviation as a tolerance scale.

That is a thoughtful design choice. In real forecasting, being “wrong by a little” is not the same as being wrong by another planet. But the tolerance cannot be arbitrary either. Tying it to recent volatility gives the benchmark a way to distinguish forecast error from normal movement.
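The four scoring rules can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's exact formulas. Accuracy and F1 are standard, but the partial-credit rule for rankings (`partial_weight`) and the quadratic decay in `score_numeric` are assumptions chosen here to show the shape of the idea: error measured in units of recent volatility. All function names are hypothetical.

```python
from statistics import stdev


def score_single_choice(pred, truth):
    """0/1 accuracy."""
    return 1.0 if pred == truth else 0.0


def score_multi_choice(pred, truth):
    """F1 over answer sets: penalises both missed and spurious options."""
    pred, truth = set(pred), set(truth)
    hits = len(pred & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)


def score_ranking(pred, truth, partial_weight=0.8):
    """Full credit for an exact ordered match, partial credit for overlap.

    ASSUMPTION: partial credit is partial_weight times the fraction of true
    items that appear anywhere in the prediction. The weight is illustrative.
    """
    if not truth:
        return 0.0
    if list(pred) == list(truth):
        return 1.0
    overlap = len(set(pred) & set(truth))
    return partial_weight * overlap / len(truth)


def score_numeric(pred, truth, last_seven_days):
    """Score a numeric forecast against recent volatility.

    ASSUMPTION: quadratic decay, reaching zero when the error equals one
    sample standard deviation of the past seven daily values.
    """
    sigma = stdev(last_seven_days)  # needs at least two observations
    if sigma == 0:
        return 1.0 if pred == truth else 0.0
    return max(0.0, 1.0 - ((pred - truth) / sigma) ** 2)


# Made-up example: the same absolute miss on a calm series and a jumpy one.
calm = [100, 100.5, 99.5, 100, 100.5, 99.5, 100]
jumpy = [90, 110, 95, 105, 85, 115, 100]
calm_score = score_numeric(103, 100, calm)    # miss is large relative to sigma
jumpy_score = score_numeric(103, 100, jumpy)  # same miss, small relative to sigma
```

The last two lines carry the design point: a three-unit miss scores zero on the calm series and close to full marks on the jumpy one, because the tolerance moves with the series rather than being fixed by hand.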

The leaderboard says search helps, but not all search is intelligence

The main results cover 25 systems: base LLMs, LLMs with integrated thinking and search, open-source deep research agents using SmolAgent and AgentOrchestra, and closed-source deep research systems. The overall score weights the four difficulty tiers at 10%, 20%, 30%, and 40%, respectively, deliberately giving more influence to harder tasks.
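The effect of that weighting is easy to check by hand. The sketch below applies the 10/20/30/40 weights stated above to two sets of per-tier scores. The scores are made up for illustration and do not correspond to any system in the paper, and `overall_score` is a hypothetical helper name.

```python
# Tier weights as described in the text: Level 1 to Level 4.
TIER_WEIGHTS = {1: 0.10, 2: 0.20, 3: 0.30, 4: 0.40}


def overall_score(tier_scores, weights=TIER_WEIGHTS):
    """Weighted sum of per-tier scores; every weighted tier must be present."""
    missing = set(weights) - set(tier_scores)
    if missing:
        raise ValueError(f"missing tier scores: {sorted(missing)}")
    return sum(weights[t] * tier_scores[t] for t in weights)


# Hypothetical systems: A is strong on easy tiers, B is steadier on hard ones.
system_a = {1: 0.90, 2: 0.80, 3: 0.30, 4: 0.05}
system_b = {1: 0.60, 2: 0.50, 3: 0.50, 4: 0.20}

plain_a = sum(system_a.values()) / 4   # 0.5125
plain_b = sum(system_b.values()) / 4   # 0.45
weighted_a = overall_score(system_a)   # 0.36
weighted_b = overall_score(system_b)   # 0.39
```

On a plain average, system A looks better. Under the tier weights, system B does. The ranking an operator sees depends on a weighting decision that was made before any model was run.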

The reported overall ranking places Grok-4 first, followed by Gemini-2.5-flash Deep Research, GPT-o4-mini with Think&Search, and Seed1.6 from Doubao. The general pattern is unsurprising but still important: search-and-reasoning systems outperform plain base models overall, especially as the tasks become more complex.

But the paper’s more useful observation is not “search good.” That is too crude, and also exactly the kind of sentence that gets promoted into a procurement memo before anyone has checked the invoices.

The real point is that search quality, planning quality, source reliability, and latency interact. FutureX finds that base LLMs can perform surprisingly well on Level 1 and Level 2 tasks. Doubao-Seed1.6-Thinking is especially strong on these easier tiers, even outperforming some agents with web search. That suggests limited-choice future tasks may still reward internal knowledge, pattern recognition, and straightforward reasoning more than expensive browsing.

The relationship changes at harder levels. On Level 3 and Level 4 events, external information becomes more important because the answer often depends on recent developments. Models without timely retrieval struggle to produce meaningful answers. Yet even search-enhanced systems remain weak on the hardest open-ended, high-volatility tasks. Retrieval is necessary, but not sufficient. A model can collect thirty tabs and still not understand the event.

This is the benchmark’s most business-relevant lesson: more browsing is not the same as better judgment.

The hardest tier is where the analyst-replacement story breaks

A predictable misreading of this paper would be: “LLM agents are now forecasting like analysts.”

No. They are not.

FutureX includes a human comparison using 40 industry experts from accounting, consulting, and investment banking backgrounds. The authors randomly sample 300 questions and compare expert scores against model scores using the same metrics. Humans significantly outperform LLM agents on Level 1, Level 3, and Level 4. Some models surpass humans on Level 2, plausibly because many-choice tasks require exhaustive option comparison, where humans may not check every possibility.

That distinction matters. Agents can be useful precisely where exhaustive, structured comparison is tedious. They may scan many options faster than humans. They may maintain checklists more consistently. They may retrieve current information without getting bored, which is a genuine advantage because boredom has quietly ruined many expensive workflows.

But on deep, open-ended, volatile tasks, the human gap remains. The paper’s hardest tier demands wide-scope information search, synthesis of ambiguous evidence, and probabilistic reasoning under uncertainty. Current systems often fail to score at all. That is not a small caveat. That is the boundary between “analyst assistant” and “analyst replacement.”

For enterprise use, that boundary should determine routing. Use agents to gather evidence, generate candidate scenarios, monitor signals, and force structured comparison. Keep human review around high-volatility, high-impact judgments. The benchmark does not make humans obsolete. It makes lazy evaluation obsolete.

The side studies are diagnostics, not a second thesis

The paper includes several focused and out-of-benchmark analyses. They are useful, but they should not be read with the same weight as the main FutureX leaderboard.

| Analysis | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Missing prediction simulation | Robustness / sensitivity test | Small missing-prediction rates do not appear to dominate weekly score variance | That all future live runs will be equally robust under API failures |
| Past vs future prediction | Diagnostic comparison | Separates retrieval-after-resolution from forecasting-before-resolution | That past retrieval skill directly equals forecasting skill |
| SmolAgent planning analysis | Implementation diagnostic | Better plans use more reliable sources, stronger coverage, and clearer actions | That the same planning coefficients apply to closed-source agents |
| Search count analysis | Behavioural diagnostic | Aggressive retrieval can correlate with stronger performance, and lower latency can matter | That more queries always improve accuracy |
| Wall Street analyst comparison | Exploratory extension | LLMs show some capability in Q2 2025 EPS and revenue forecasting | That models beat professional analysts overall |
| Fake website test | Safety stress test | Some deep research agents can be misled by fabricated web content | That one model family is universally robust to misinformation |
| Real-time search test | Exploratory efficiency test | Fresh, low-signal retrieval remains difficult for specialised research agents | That sports retrieval generalises to all time-sensitive domains |

This separation is important because the paper’s main benchmark and its out-of-benchmark studies use different data and model sets. The Wall Street comparison, for example, examines Q2 2025 forecasts for S&P 500 constituents against sell-side analyst consensus from Yahoo Finance. The top-performing models beat professional analysts on 37.5% of revenue prediction tasks and 32.3% of EPS tasks. Gemini-2.5-pro has average win rates of 33.0% for revenue and 33.7% for EPS, excluding ties. No model crosses a 50% win-rate threshold, and model MAPE remains higher than analyst MAPE.

That is not a victory parade. It is a sign of emerging usefulness under a bounded task. Useful, yes. Dominant, no. Please keep the confetti in the drawer.
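Two metrics carry that comparison: a win rate that excludes ties, and MAPE. A minimal sketch of both follows, using made-up numbers. The assumption here is that a "win" means a smaller absolute error than the analyst consensus on the same item. The article does not spell out the paper's exact definition, and the function names are hypothetical.

```python
def win_rate_excluding_ties(model_preds, analyst_preds, actuals):
    """Share of decided comparisons where the model's absolute error is smaller.

    ASSUMPTION: a win is a strictly smaller absolute error; equal errors are
    ties and are dropped from the denominator.
    """
    wins = losses = 0
    for m, a, y in zip(model_preds, analyst_preds, actuals):
        m_err, a_err = abs(m - y), abs(a - y)
        if m_err < a_err:
            wins += 1
        elif m_err > a_err:
            losses += 1
    decided = wins + losses
    return wins / decided if decided else float("nan")


def mape(preds, actuals):
    """Mean absolute percentage error, in percent. Assumes non-zero actuals."""
    return 100 * sum(abs(p - y) / abs(y) for p, y in zip(preds, actuals)) / len(actuals)


# Made-up revenue figures for four hypothetical companies.
actuals = [100, 200, 50, 80]
model = [110, 195, 50, 70]
analyst = [105, 190, 50, 78]

rate = win_rate_excluding_ties(model, analyst, actuals)  # 1 win, 2 losses, 1 tie
model_mape = mape(model, actuals)      # 6.25
analyst_mape = mape(analyst, actuals)  # 3.125
```

The two metrics answer different questions. Win rate counts how often the model was closer. MAPE measures how far off it was on average. A model can win a respectable share of head-to-heads and still carry a worse MAPE if its misses are large, which is why both numbers belong in the same report.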

The fake website study is more alarming. The authors construct fabricated web pages designed to push agents toward implausible target answers. Across five examples, GPT-o3 Deep Research, Doubao Deep Research, and Qwen3-235B Deep Research are consistently misled, despite retrieving additional sources. Gemini-2.5-Pro Deep Research is reported as unaffected in those examples, refusing to incorporate or cite the fake site. This is not enough to declare one system safe. It is enough to show that retrieval-augmented forecasting agents need adversarial source tests before they touch sensitive workflows.
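An internal version of that stress test needs a way to audit each run. The sketch below assumes the test harness can see the agent's final answer and its cited URLs. It flags a run if the agent cited a seeded fake domain or adopted the planted answer. `AgentRun`, `audit_run`, and the `.example` domains are hypothetical, and this is not the paper's procedure.

```python
from dataclasses import dataclass
from typing import List, Set
from urllib.parse import urlparse


@dataclass
class AgentRun:
    """Hypothetical record of one agent answer and the sources it cited."""
    answer: str
    cited_urls: List[str]


def audit_run(run: AgentRun, fake_domains: Set[str], planted_answer: str) -> dict:
    """Check one run against a seeded fake source.

    ASSUMPTION: answers are short strings, so a case-insensitive exact match
    is enough to detect adoption of the planted answer.
    """
    cited_hosts = {urlparse(u).hostname or "" for u in run.cited_urls}
    cited_fake = sorted(h for h in cited_hosts if h in fake_domains)
    adopted = run.answer.strip().lower() == planted_answer.strip().lower()
    return {
        "cited_fake": cited_fake,
        "adopted_planted_answer": adopted,
        "misled": bool(cited_fake) or adopted,
    }


# Made-up runs: one agent takes the bait, one does not.
fakes = {"fake-news.example"}
baited = AgentRun("Team Z", ["https://fake-news.example/story", "https://real.example/a"])
clean = AgentRun("Team A", ["https://real.example/a"])

baited_report = audit_run(baited, fakes, "team z")
clean_report = audit_run(clean, fakes, "team z")
```

Tracking citation and adoption separately is deliberate. An agent that cites the fake page but rejects its claim has a different failure mode from one that silently absorbs the planted answer without citing anything.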

The real-time search case study points in the same direction. Using five examples from live esports match scores, the paper finds that some dedicated deep research agents do not clearly outperform general Think&Search models in fresh, low-signal retrieval. That matters because many business questions are not just uncertain; they are time-sensitive. An agent that finds the answer tomorrow may still be correct and still be useless.

What Cognaptus infers for business use

The paper directly shows that FutureX can run a prospective, live benchmark over diverse future events and that current LLM agents perform very unevenly across difficulty tiers, domains, and retrieval conditions. It also shows that search-and-reasoning systems lead overall, while hard open-ended volatility remains a major weakness.

The business inference is more specific: companies should evaluate forecasting agents as workflows, not as model names.

A useful internal evaluation should include at least five design elements borrowed from FutureX:

| Evaluation element | How to adapt it inside a company | Why it matters |
| --- | --- | --- |
| Prospective testing | Ask the agent to make predictions before outcomes are known | Prevents historical leakage and accidental hindsight |
| Difficulty tiers | Separate simple choices, exhaustive multi-choice, open-ended stable tasks, and volatile forecasts | Stops easy tasks from hiding hard-task failure |
| Source governance | Maintain approved source lists and score source reliability | Reduces false confidence from low-quality retrieval |
| Delay-aware scoring | Choose an evaluation window matching the business decision cycle | Makes metrics operational rather than decorative |
| Adversarial misinformation tests | Seed controlled fake or low-quality sources in a safe test environment | Reveals whether the agent trusts the web too eagerly |
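Two of these elements, prospective testing and difficulty tiers, can be enforced in the reporting layer rather than left to good intentions. The sketch below assumes each prediction is logged with timestamps and a tier label. It drops any prediction that was not made at least `min_lead` before resolution, then reports mean scores per tier instead of one blended number. The record fields, function names, and the one-day default are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable


@dataclass(frozen=True)
class LoggedPrediction:
    """Hypothetical log entry for one scored prediction."""
    question_id: str
    tier: int
    predicted_at: datetime
    resolved_at: datetime
    score: float


def is_prospective(p: LoggedPrediction, min_lead: timedelta) -> bool:
    """True if the prediction was logged at least min_lead before resolution."""
    return p.resolved_at - p.predicted_at >= min_lead


def tier_report(
    preds: Iterable[LoggedPrediction],
    min_lead: timedelta = timedelta(days=1),
) -> Dict[int, float]:
    """Mean score per tier, counting only genuinely prospective predictions."""
    buckets: Dict[int, list] = {}
    for p in preds:
        if not is_prospective(p, min_lead):
            continue  # late or leaked predictions never reach the report
        buckets.setdefault(p.tier, []).append(p.score)
    return {tier: sum(s) / len(s) for tier, s in sorted(buckets.items())}


# Made-up log: the last entry was "predicted" after the outcome was public.
resolved = datetime(2025, 8, 3, 12, 0)
log = [
    LoggedPrediction("q1", 1, resolved - timedelta(days=7), resolved, 1.0),
    LoggedPrediction("q2", 1, resolved - timedelta(days=7), resolved, 0.0),
    LoggedPrediction("q3", 4, resolved - timedelta(days=7), resolved, 0.2),
    LoggedPrediction("q4", 4, resolved + timedelta(hours=2), resolved, 1.0),
]
report = tier_report(log)  # {1: 0.5, 4: 0.2}
```

Without the filter, the leaked entry would lift the Level 4 average from 0.2 to 0.6. That is exactly the kind of flattering number the prospective design exists to prevent.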

This is especially relevant for procurement. A vendor demo may show a research agent producing a beautiful memo with citations. FutureX asks a harsher question: did the memo help before the answer was known, did the sources deserve trust, and did the forecast survive resolution? That is the difference between a research assistant and a very articulate browser history.

For product teams, the roadmap implication is tiered deployment. Low-volatility, evidence-heavy tasks can be automated earlier. High-volatility, high-impact forecasts should remain human-supervised, with agents used for signal collection, scenario generation, and contradiction checks. The model may write the briefing. It should not quietly become the investment committee.

Where the benchmark still has boundaries

FutureX is strong because it is live, broad, and operationally disciplined. But its results still depend on the benchmark’s source universe, event templates, scoring choices, model availability, and evaluation window.

Some model omissions are practical rather than conceptual. The paper notes API stability, policy limitations, and refusal behaviour for some systems. AgentOrchestra is only tested on Level 1 and Level 2 because it is computationally intensive, so it is excluded from the overall results. Search-count comparisons exclude GPT-o4-mini and GPT-4o Think&Search because their search counts could not be measured. These are not fatal flaws. They are reminders that live evaluation is infrastructure, and infrastructure has seams.

The human comparison is also a rough indicator, not a courtroom verdict. The experts and models did not answer exactly the same question set, so the reported gaps may vary with question distribution and annotator background. Still, the directional message is credible: current agents are not consistently at human expert level on the tasks that matter most.

The scoring system, while thoughtful, also encodes values. Weighting Level 4 at 40% is reasonable if the goal is to reward difficult forecasting. A different operator might weight stable retrieval more heavily because their business process is not speculative trading but compliance monitoring. FutureX offers a template, not a universal KPI handed down from the mountain.

The useful future is not autonomous prophecy

FutureX’s most valuable contribution is not that Grok-4 tops a chart or that search agents beat base models. Leaderboards age quickly. Mechanisms age more slowly.

The durable lesson is that forecasting agents need live accountability. They should be tested before outcomes are known, scored after reality resolves, evaluated by difficulty and domain, inspected for source quality, and attacked with misinformation before attackers do it in production. That is the boring machinery behind trustworthy AI forecasting. Naturally, it is also the part most demos skip.

For operators, the right takeaway is measured optimism with a clipboard. Agents are becoming useful in forecasting workflows, especially where broad search, option comparison, and structured synthesis matter. But the frontier is still fragile where volatility, ambiguity, and adversarial information collide.

A crystal ball is fantasy. A cron job with clean sources, delayed scoring, and brutal error analysis is much more useful.

Cognaptus: Automate the Present, Incubate the Future.


1. Zhiyuan Zeng et al., “FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction,” arXiv:2508.11987, 2025, https://arxiv.org/pdf/2508.11987