Opening — Why this matters now

Agentic AI has entered its confident phase. Papers, demos, and product pitches increasingly imply that large language model (LLM)–powered agents can already “do research”: formulate hypotheses, run experiments, and even write papers end to end. The uncomfortable question is not whether they look busy—but whether they actually rediscover truth.

FIRE-BENCH lands precisely on this fault line. Instead of asking agents to generate novel papers (hard to verify) or to optimize a single metric (easy but shallow), it asks something far more revealing: Can an AI agent independently rediscover known scientific insights when given only the original research question?

The answer, empirically, is: not reliably.

Background — The evaluation trap we’ve built

Evaluating AI research agents has drifted into two unsatisfying extremes:

  1. Paper-generation benchmarks — expressive, impressive, and largely judged by other LLMs.
  2. Leaderboard-style tasks — objective and scalable, but focused on narrow engineering wins rather than scientific reasoning.

Both miss the core of science: designing appropriate experiments and drawing conclusions that are faithful to evidence.

FIRE-BENCH proposes a third path: constrained rediscovery. Agents are handed a high-level research question extracted from a real, peer-reviewed ML paper. They must then plan, implement, execute, and conclude—without access to the original methodology or results. Their output is evaluated against the actual empirical findings, claim by claim.
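
To make the protocol concrete, here is a minimal sketch of what a constrained rediscovery task and its agent loop could look like. The `RediscoveryTask` fields and the `agent.investigate` interface are illustrative assumptions, not the benchmark's actual schema or API.

```python
from dataclasses import dataclass


@dataclass
class RediscoveryTask:
    """Hypothetical container for one FIRE-BENCH-style task.

    The agent only ever sees the research question; the ground-truth
    claims are held back for the evaluator.
    """
    paper_id: str                   # source paper (methodology and results hidden)
    research_question: str          # the sole prompt handed to the agent
    ground_truth_claims: list[str]  # atomic findings, used for scoring only


def run_rediscovery(agent, task: RediscoveryTask) -> list[str]:
    """Ask the agent to plan, implement, execute, and conclude on its own.

    `agent.investigate` is an assumed interface; it should return the
    agent's concluded claims for later comparison with the ground truth.
    """
    return agent.investigate(task.research_question)
```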

Analysis — What FIRE-BENCH actually tests

At its core, FIRE-BENCH formalizes research as a full-cycle process, not a code-writing exercise.

From papers to rediscovery tasks

Each benchmark task is derived from a recent, high-impact ML analysis paper (ICLR, ICML, NeurIPS 2024–2025). The authors decompose each paper into a research-problem tree:

  • Root node: the broad research question
  • Intermediate nodes: logical sub-questions
  • Leaf nodes: concrete experiments tied to figures or tables

Instead of forcing agents to replicate a known experiment, FIRE-BENCH selects an intermediate node as the task prompt. This preserves exploratory freedom while keeping evaluation grounded in verifiable evidence.
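
A rough sketch of such a research-problem tree, and of how the intermediate nodes that serve as task prompts might be enumerated, is shown below. The `ProblemNode` class and its fields are hypothetical, not the authors' actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProblemNode:
    """One node in a research-problem tree (illustrative only)."""
    question: str
    children: list[ProblemNode] = field(default_factory=list)
    evidence: str | None = None  # leaf nodes point at a concrete figure or table

    @property
    def is_leaf(self) -> bool:
        return not self.children


def intermediate_nodes(root: ProblemNode) -> list[ProblemNode]:
    """Collect non-root, non-leaf nodes: the candidates used as task prompts."""
    found: list[ProblemNode] = []

    def walk(node: ProblemNode, depth: int = 0) -> None:
        if depth > 0 and not node.is_leaf:
            found.append(node)
        for child in node.children:
            walk(child, depth + 1)

    walk(root)
    return found
```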

Evaluation that cares about meaning, not vibes

Agent conclusions are broken down into atomic empirical claims and compared against human-authored ground truth using precision, recall, and F1. This avoids holistic “paper quality” judgments and instead asks a blunt question: Did the agent recover the key findings, or not?
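
As a sketch of the scoring mechanics, the snippet below computes claim-level precision, recall, and F1 given a set of matched claim pairs. How predicted claims are judged equivalent to ground-truth claims is the benchmark's own matching step and is abstracted away here as an input.

```python
def claim_level_scores(
    predicted: list[str],
    gold: list[str],
    matches: set[tuple[int, int]],
) -> tuple[float, float, float]:
    """Precision, recall, and F1 over atomic empirical claims.

    `matches` holds (predicted_index, gold_index) pairs already judged
    equivalent by whatever matching procedure the evaluator uses.
    """
    matched_pred = {p for p, _ in matches}   # predicted claims that hit something
    matched_gold = {g for _, g in matches}   # ground-truth claims that were recovered

    precision = len(matched_pred) / len(predicted) if predicted else 0.0
    recall = len(matched_gold) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

For example, an agent that states four claims, two of which match two of five ground-truth findings, scores precision 0.5, recall 0.4, and F1 ≈ 0.44.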

Findings — The numbers, without excuses

Across 30 benchmark tasks and multiple frontier agents, results are sobering.

Overall performance

| Agent | Avg F1 | Std Dev |
|---|---|---|
| Claude Code (Sonnet-4) | 46.7 | ±23.4 |
| Codex (gpt-5-medium) | 41.9 | ±25.4 |
| OpenHands (gpt-5) | 37.9 | ±23.0 |
| OpenHands (o4-mini) | 31.9 | ±17.6 |

Even the strongest agent fails to rediscover more than half of the ground-truth claims—and variance is enormous. Identical tasks, different runs, wildly different outcomes.

Where agents fail

Error analysis reveals a consistent pattern:

  • Research planning failures: flawed experimental design, missing controls, goal drift
  • Conclusion failures: misreading results, overgeneralization, unsupported claims

Notably, low-level coding errors are not the dominant problem anymore. The bottleneck has moved upstream—to thinking.

False positives aren’t creative detours

One might hope agents are discovering alternative but valid insights. In practice, most false positives are simply contradictory or irrelevant. Genuinely plausible alternative conclusions are rare (≈5–10%). This is not creative science—it’s confusion.

Cost vs. insight

| Agent | Avg Cost / Task | Avg F1 |
|---|---|---|
| Claude Code | $0.84 | 46.7 |
| Codex | $0.15 | 41.9 |
| OpenHands (gpt-5) | $0.72 | 37.9 |

Higher performance generally costs more, but efficiency varies widely. Codex, in particular, shows that shorter, more disciplined execution traces can deliver respectable results at low cost—an underrated design signal.
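
One back-of-the-envelope way to read the cost table is F1 points per dollar. This is not a metric the benchmark reports; it is just quick arithmetic over the numbers above.

```python
# Quick cost-efficiency comparison derived from the table above (F1 points per dollar).
results = {
    "Claude Code":       {"cost": 0.84, "f1": 46.7},
    "Codex":             {"cost": 0.15, "f1": 41.9},
    "OpenHands (gpt-5)": {"cost": 0.72, "f1": 37.9},
}

for agent, r in results.items():
    print(f"{agent}: {r['f1'] / r['cost']:.0f} F1 points per dollar")
# Claude Code: 56, Codex: 279, OpenHands (gpt-5): 53
# Codex buys far more insight per dollar, even though Claude Code has the highest absolute F1.
```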

Implications — What FIRE-BENCH quietly tells us

Three uncomfortable takeaways stand out:

  1. Agentic AI is not yet a scientist. It can execute workflows, but struggles with experimental judgment.
  2. More capable models help—but don’t solve the problem. Planning and inference remain brittle even with frontier backbones.
  3. Evaluation is destiny. If we keep rewarding surface competence, we’ll keep getting agents that look productive while missing the point.

FIRE-BENCH doesn’t just benchmark agents—it diagnoses them. And the diagnosis is clear: the hard part of science was never running experiments. It was knowing which ones mattered, and why.

Conclusion — Science is more than a pipeline

FIRE-BENCH is a reminder that scientific discovery is not a linear automation problem. It is an epistemic one. Until agents can reliably design controls, reason about causality, and align conclusions with evidence, claims of “AI researchers” should be read with restraint.

Progress will come not from bigger models alone, but from better agent architectures, stronger planning priors, and evaluation frameworks that punish confident nonsense.

FIRE-BENCH raises the bar. Current agents, for now, are still reaching for it.

Cognaptus: Automate the Present, Incubate the Future.