FIRE-BENCH: Playing Back the Tape of Scientific Discovery

A demo can make an AI research agent look impressive in ten minutes. Give it a task, watch it create files, install packages, run experiments, generate tables, and write something that sounds like a conclusion. Productivity theater, now with terminal logs.

The harder question is less cinematic: did it actually discover the right thing?

That is the point of FIRE-Bench, a benchmark designed to evaluate whether autonomous AI agents can rediscover established scientific findings when given only a high-level research question, not the original paper’s method or answer.¹ The benchmark does not ask agents to write a plausible paper. It does not ask them to chase a leaderboard metric. It asks them to complete a constrained version of scientific work: plan an empirical investigation, implement it, run it, and synthesize conclusions that match verified findings from real machine-learning research.

The result is not flattering. The strongest evaluated agent, Claude Code using Claude-4-Sonnet, reaches an average claim-level F1 score of 46.7. Codex with gpt-5-medium reaches 41.9. OpenHands with gpt-5 reaches 37.9, and OpenHands with o4-mini reaches 31.9. The spread is also large, with standard deviations of roughly 18 to 25 F1 points, meaning that the same agent can perform quite differently across repeated runs.

So the business lesson is not “AI cannot do research.” That would be too easy, and also wrong. The sharper lesson is this: current research agents can often perform the visible labor of research, but remain fragile at the invisible labor: choosing the right experimental structure, controlling confounders, and turning evidence into claims without quietly wandering off the epistemic road.

A surprising amount of AI strategy still confuses activity with competence. FIRE-Bench is useful because it punishes that confusion.

The benchmark starts where normal demos usually stop

Most evaluations of research agents fall into two familiar camps.

One camp asks agents to generate full papers. This is expressive but difficult to verify at scale. The output may look scientific, but evaluating whether it is actually correct requires human expertise, experimental replication, or another LLM-as-judge layer. That can become a hall of mirrors with better formatting.

The other camp asks agents to optimize a clear metric: improve accuracy, solve a programming task, or reproduce a known benchmark score. This is easier to evaluate, but it narrows “research” into engineering execution. Useful, yes. Sufficient, no.

FIRE-Bench is positioned between those extremes. It uses recent empirical machine-learning papers as source material, then transforms them into rediscovery tasks. Each task begins from a high-level research question and gives the agent the relevant experimental scope (such as datasets, models, or evaluation criteria) but withholds the original authors’ detailed method and conclusion.

That design choice matters. If the agent had the original paper, the task would become reproduction. If the agent had only a vague topic, evaluation would become subjective. FIRE-Bench instead creates a controlled rediscovery setting: open enough to require planning, constrained enough to be judged against known evidence.

The benchmark contains 30 tasks, each derived from an empirical analysis paper about LLM behavior. The source papers are selected from recent high-impact venues and filtered for practical evaluability: open inputs, compute-light execution, and non-trivial but verifiable findings. In plain English, the benchmark tries to avoid both toy tasks and impossible science projects.

The agent is not being asked to invent relativity. It is being asked to rediscover a well-specified empirical insight from modern ML research. Apparently, that is already hard enough.

The headline result is weak performance with unstable execution

FIRE-Bench evaluates agent conclusions at the claim level. The ground-truth finding from the source paper is decomposed into atomic empirical claims. The agent’s final conclusion is decomposed the same way. Precision measures how many generated claims are correct. Recall measures how many ground-truth claims were recovered. F1 balances both.

That is a better test than asking whether the final write-up “sounds right.” Science is not a vibes business, despite occasional evidence to the contrary.
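The claim-level arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the function name `claim_scores` and the sample claim IDs are hypothetical, and the matching step is assumed to have already happened, so a shared ID stands in for "this generated claim matches this ground-truth claim." Scores come out as fractions; the table below reports them on a 0 to 100 scale.

```python
def claim_scores(predicted: set[str], truth: set[str]) -> dict[str, float]:
    """Return precision, recall, and F1 for sets of already-matched claim IDs."""
    matched = predicted & truth  # generated claims that correspond to a ground-truth claim
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: four ground-truth claims, three generated claims,
# two of which match.
truth = {"t1", "t2", "t3", "t4"}
predicted = {"t1", "t2", "extra"}
scores = claim_scores(predicted, truth)
print({name: round(100 * value, 1) for name, value in scores.items()})
```

In this toy case two of three generated claims are correct (precision 66.7) and two of four target claims are recovered (recall 50.0), which gives an F1 of 57.1.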

Agent	Precision (mean ± SD)	Recall (mean ± SD)	F1 (mean ± SD)
Claude Code / Claude-4-Sonnet	52.1 ± 26.1	48.3 ± 24.8	46.7 ± 23.4
Codex / gpt-5-medium	44.8 ± 24.1	49.0 ± 28.5	41.9 ± 25.4
OpenHands / gpt-5	41.7 ± 22.7	41.4 ± 24.9	37.9 ± 23.0
OpenHands / o4-mini	36.8 ± 18.5	36.6 ± 19.2	31.9 ± 17.6

The obvious reading is that stronger agents perform better. Claude Code leads, Codex follows, and OpenHands improves when its backbone moves from o4-mini to gpt-5. Model strength helps.

The less comfortable reading is that model strength does not solve the core problem. No evaluated system recovers half of the target claims on average: the highest mean recall is 49.0. The standard deviations are not decorative; they tell us the process is unstable. In business terms, that means you are not buying a reliable analyst. You are buying a probabilistic workflow that may produce a strong run, a weak run, or a beautifully formatted mistake.

This is where FIRE-Bench becomes more interesting than a leaderboard. If the paper reported only average scores, the conclusion would be predictable: better models are better. Fine. Invoice paid. The useful part is that the authors inspect where the failures arise.

The real bottleneck is not coding; it is experimental judgment

A lazy interpretation would say agents failed because coding agents are not mature enough. The paper does not support such a simple diagnosis.

The authors’ trajectory inspection suggests that environment setup and basic execution are not the dominant bottlenecks. The agents can often install packages, write scripts, and run experiments. The failure modes concentrate elsewhere: research planning and conclusion formation.

That distinction is important. In an enterprise context, many teams treat agentic AI as a workflow automation problem: connect tools, give permissions, add memory, wrap everything in a dashboard, and the system becomes “autonomous.” FIRE-Bench suggests the missing capability is not just tool access. It is methodological discipline.

The paper’s example of medical racial bias is especially revealing. The underlying source insight depends on control-based reasoning: remove racial indicators, then selectively reintroduce them, so that the effect of race labels can be isolated from clinical content. Across evaluated agents and runs, the agents fail to recover that control structure. Instead, they tend to introduce race information directly without establishing the proper baseline.

That is not a small implementation error. It is a failure to understand what the experiment is for.
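The control structure described above (strip the race labels first, then reintroduce one label at a time) can be sketched as follows. This is a toy illustration under stated assumptions, not the source paper's procedure: the term list `RACE_TERMS`, the helper names, and the sample vignette are all hypothetical, and a real study would need far more careful text handling than a regular expression.

```python
import re

# Hypothetical, non-exhaustive term list used only for illustration.
RACE_TERMS = ["Black", "White", "Asian", "Hispanic"]
_RACE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in RACE_TERMS) + r")\s+"
)


def neutralize(vignette: str) -> str:
    """Remove race descriptors to form the race-neutral baseline."""
    return _RACE_PATTERN.sub("", vignette)


def build_conditions(vignette: str) -> dict[str, str]:
    """Return the neutral baseline plus one variant per reintroduced label."""
    baseline = neutralize(vignette)
    conditions = {"baseline": baseline}
    for term in RACE_TERMS:
        # Reintroduce exactly one label; the clinical content stays identical.
        conditions[term] = baseline.replace("patient", f"{term} patient", 1)
    return conditions


vignette = "A 54-year-old Black patient presents with chest pain and shortness of breath."
conditions = build_conditions(vignette)
for name, text in conditions.items():
    print(f"{name}: {text}")
```

The point of the design is that every variant differs from the baseline by the label alone, so any difference in model output between a variant and the baseline can be attributed to the label rather than to the clinical content.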

This is the difference between running a comparison and designing a comparison that means something. Many business analytics tasks have the same shape. A model may compare two customer segments, two marketing campaigns, two branches, or two loan cohorts. But unless the comparison controls for the right confounders, the conclusion can be efficient, confident, and wrong. The holy trinity of bad analytics.

FIRE-Bench makes that failure visible in a scientific setting.

Easy tasks look like pipelines; hard tasks require controls

The paper reports that agents perform better on tasks with relatively direct experimental procedures. Examples include “Lost in the Middle,” “Persona with Catch,” “CoT Without Prompting,” and “Hallucination Snowballing,” where the best observed F1 scores reach 91.7, 88.6, 82.6, and 80.9, respectively.

These are not necessarily trivial tasks. But their structure is more pipeline-like: define the input, run the evaluation, observe the pattern. Agents are comparatively good when the research path is close to executable specification.

Performance drops when the task requires multi-step design, causal isolation, or careful control construction. FIRE-Bench formalizes this using a difficulty rubric with three axes:

Difficulty dimension	What it asks	Why it matters for agents
Conceptual decomposition	Does the task require one direct test or a multi-stage research design?	Agents often struggle to convert a broad question into the right sequence of subquestions.
Confound and causality burden	Does the task require explicit controls or counterfactual comparisons?	Agents may run plausible experiments that do not isolate the variable of interest.
Measurement and analysis complexity	Does the result require calibration, sensitivity analysis, or nuanced interpretation?	Agents may produce data but miss the actual empirical pattern.

This is the useful managerial translation: research-agent risk rises when the task moves from “execute this known analytical recipe” to “decide what recipe would make the answer credible.”

That does not mean agents are useless. It means their reliability profile is uneven. They are more suitable for structured empirical routines than for open-ended causal diagnosis. The former can be delegated with guardrails. The latter should still have a human method owner.

A disappointing sentence, perhaps. Also a cheaper one than discovering the problem after a board presentation.

False positives are mostly not hidden discoveries

One possible defense of rediscovery benchmarks is that an agent may find a valid alternative insight that differs from the original paper. In that case, the benchmark might unfairly penalize creativity.

The FIRE-Bench authors check this. They categorize false-positive claims into contradictory, unrelated, overgeneralized, and alternative conclusions. The “alternative” category is the generous one: plausible hypotheses or patterns related to the task but not supported by the original result.

The results are not especially romantic. Most false positives are contradictory or unrelated. Depending on the agent, contradictory plus unrelated claims account for 76.4% to 95.0% of false positives. Alternative conclusions are rare, ranging from 4.5% to 10.9%.

So when agents miss the target, they are usually not performing brave independent science. They are more often making claims that conflict with the ground truth or do not address the research question.

That matters because “agent creativity” is often used as a soft excuse for evaluation failure. FIRE-Bench does not prove agents cannot generate novel insights. It does show that, in this benchmark, deviations from the verified findings are usually not useful alternatives. They are errors wearing a lab coat.

The evaluation stack is not perfect, but it is more serious than vibes

FIRE-Bench still uses LLM-assisted evaluation. That should be acknowledged without turning the entire article into a ritual disclaimer.

The claim extraction and matching process is automated using LLM-based components. To validate this, the authors conduct human evaluation on a subset of reference instances. They report claim-extraction precision of 0.95, recall of 0.86, and F1 of 0.89. This does not make the evaluator infallible, but it gives evidence that the automated scoring approximates human decomposition reasonably well.

The paper also validates its problem-tree construction process. Human annotators score extracted problem trees on criteria such as research-question groundedness, experiment completeness, hallucination elimination, structural coherence, and question-conclusion alignment. Reported averages are high: 5.0 for groundedness, 5.0 for completeness, 4.8 for hallucination elimination, 5.0 for structural coherence, and 4.8 for alignment.

These appendix checks are not a second thesis. Their purpose is narrower: they support the benchmark’s internal validity. They do not prove that FIRE-Bench captures all forms of scientific discovery. They do make the benchmark harder to dismiss as arbitrary prompt judging.

Paper component	Likely purpose	What it supports	What it does not prove
Main performance table	Main evidence	Current agents have limited rediscovery performance and high variance.	Agents are incapable of all scientific work.
Error-stage analysis	Diagnostic evidence	Failures concentrate in planning and conclusion formation.	Exact causal attribution for every failure mode.
False-positive categorization	Diagnostic check	Most deviations are contradictory or unrelated, not valid alternative insights.	Agents cannot ever discover useful alternatives.
Difficulty stratification	Validity and sensitivity check	Harder tasks behave as expected under the rubric.	Difficulty labels are universally correct.
Data-contamination analysis	Robustness check	No strong systematic pre-cutoff advantage appears after conditioning on difficulty.	Training contamination is impossible.
Claim-extraction human evaluation	Evaluator reliability check	Automated claim decomposition is reasonably aligned with human judgment.	Automated judging is flawless.

This distinction is important for readers who want to use FIRE-Bench as an operational lens. The benchmark is not a final exam for scientific intelligence. It is a structured stress test for full-cycle empirical reasoning.

That is already useful.

Cost efficiency changes the procurement question

The paper also reports operational cost. Claude Code achieves the highest average F1 and also incurs the highest total API cost in the evaluated setting: $12.67 total and $0.84 per task. Codex achieves a lower average F1 of 41.9, but at an estimated $2.21 total and $0.15 per task. OpenHands with gpt-5 costs $10.74 total, while OpenHands with o4-mini costs $8.90.

The exact dollar amounts should not be overinterpreted. Pricing, model defaults, and agent implementations change. The authors themselves note assumptions in estimating Codex cost. But the pattern is strategically relevant: cost efficiency depends not only on model price, but also on execution behavior. Shorter action sequences and more disciplined tool use can matter.

For business adoption, that shifts the procurement question.

The naive question is: “Which model is best?”

The better question is: “Which agent produces the highest verified claim quality per dollar, under the task type we actually need?”

That requires measuring outputs at the level of supported claims, not just completed workflows. An agent that writes a 20-page market analysis at low cost may still be expensive if five pages are unsupported. Conversely, a slower and more expensive agent may be justified in high-stakes analytical settings if it reduces false claims and improves recall.

FIRE-Bench does not give a universal ROI formula. It gives the skeleton of one.
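One hypothetical way to start filling in that skeleton is a crude "F1 points per dollar" ratio. The sketch below uses only the average F1 and per-task cost figures quoted above for Claude Code and Codex; the ratio itself and the function name `f1_per_dollar` are assumptions made for illustration, not a metric defined by the paper.

```python
# Figures quoted in the text: average F1 and estimated cost per task (USD).
agents = {
    "Claude Code / Claude-4-Sonnet": {"f1": 46.7, "cost_per_task": 0.84},
    "Codex / gpt-5-medium": {"f1": 41.9, "cost_per_task": 0.15},
}


def f1_per_dollar(f1: float, cost_per_task: float) -> float:
    """Average F1 points obtained per dollar of per-task spend (illustrative only)."""
    if cost_per_task <= 0:
        raise ValueError("cost_per_task must be positive")
    return f1 / cost_per_task


# Rank agents from most to least F1 per dollar.
ranking = sorted(agents, key=lambda name: f1_per_dollar(**agents[name]), reverse=True)
for name in ranking:
    print(f"{name}: {f1_per_dollar(**agents[name]):.1f} F1 points per dollar")
```

A ratio like this rewards cheap agents heavily, so in a high-stakes setting it would make sense to pair it with a minimum acceptable precision or recall rather than use it alone.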

What businesses should take from FIRE-Bench

The paper directly shows that current frontier agents struggle on constrained rediscovery tasks drawn from empirical ML research. It also shows that failures are not dominated by simple execution problems. Planning, controls, and conclusion fidelity are the fragile parts.

Cognaptus’ business inference is that agent deployment should be evaluated at three levels.

First, evaluate workflow completion. Did the agent run? Did it access the right data? Did it produce an output? This is the level most demos show.

Second, evaluate methodological adequacy. Did the agent choose a design that can actually answer the question? Did it include the right baselines, controls, and sensitivity checks? This is where many agent demos become less charming.

Third, evaluate claim fidelity. Are the final claims supported by the evidence generated? Are unsupported claims separated from supported findings? Are limitations attached to the right part of the argument rather than sprinkled like parsley?

For internal analytics, finance, compliance, market research, and scientific R&D support, this suggests a practical governance pattern:

Deployment layer	Recommended control	FIRE-Bench lesson
Structured execution	Let agents automate repeatable code and data workflows.	Agents are relatively stronger when the path is well specified.
Experimental design	Require human approval of plans, controls, and evaluation criteria.	Hard failures often begin before code is written.
Evidence synthesis	Use claim-level review rather than document-level approval.	Polished conclusions can hide unsupported or missed claims.
Model/vendor evaluation	Benchmark on internal rediscovery tasks with known answers.	Real evaluation should test whether agents recover verified findings, not whether they sound useful.
Cost management	Track verified insight per dollar, not just token spend.	Efficient action traces can outperform more expensive wandering.

The most immediately useful idea is internal rediscovery. A company does not need to wait for a perfect public benchmark. It can create its own FIRE-Bench-like test set from past analyses, audit reports, pricing studies, product experiments, or operational investigations. Give the agent the original business question and available data, but hide the final report. Then score whether it recovers the key claims.

That is far better than asking whether the agent “seems smart.” Many things seem smart. Some of them are just verbose.
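An internal rediscovery task of the kind described above could be represented roughly as follows. Everything here is a hypothetical sketch: the `RediscoveryTask` record, the sample pricing question, the file names, and the reviewer judgments are invented for illustration. The one design point it tries to capture is that the key claims from the final report live in the task record but never appear in what the agent sees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RediscoveryTask:
    question: str                  # the original business question
    data_sources: tuple[str, ...]  # data the agent is allowed to use
    key_claims: tuple[str, ...]    # taken from the hidden final report

    def agent_brief(self) -> str:
        """Text shown to the agent: question and data only, never the key claims."""
        sources = "\n".join(f"- {source}" for source in self.data_sources)
        return f"Question: {self.question}\nAvailable data:\n{sources}"


def recovered_share(task: RediscoveryTask, recovered: list[bool]) -> float:
    """Share of key claims that a reviewer marked as recovered by the agent."""
    if not task.key_claims or len(recovered) != len(task.key_claims):
        raise ValueError("need exactly one reviewer judgment per key claim")
    return sum(recovered) / len(task.key_claims)


task = RediscoveryTask(
    question="Did the 2023 price change affect churn among small accounts?",
    data_sources=("billing_events.csv", "account_segments.csv"),
    key_claims=(
        "Churn rose among small accounts after the change.",
        "The rise persists after controlling for contract length.",
        "Large accounts showed no measurable change.",
    ),
)
brief = task.agent_brief()
share = recovered_share(task, [True, False, True])  # hypothetical reviewer judgments
print(brief)
print(f"Recovered {share:.0%} of key claims")
```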

Boundaries: what FIRE-Bench does and does not settle

FIRE-Bench is built from 30 empirical analysis tasks in machine learning, mostly around LLM behavior. That gives the benchmark coherence, but also limits its coverage. It does not directly measure wet-lab science, hardware engineering, social science fieldwork, legal reasoning, or corporate strategy.

The benchmark also evaluates rediscovery, not open-ended novelty. This is a feature for verification, but a boundary for interpretation. A system optimized for rediscovery may not be the same as a system optimized for exploratory hypothesis generation. The paper partially addresses the concern that agents might generate valid alternative findings: within this benchmark, such cases are rare.

The contamination analysis is also necessarily coarse. The authors stratify performance by task difficulty and publication timing relative to model knowledge cutoffs, and they find no consistent pre-cutoff advantage. That weakens the simple memorization objection. It does not prove that no benchmark content appeared in training data. Training corpora are not transparent enough for that level of certainty.

Finally, the evaluation pipeline uses LLM-assisted claim extraction and matching. Human validation supports its reliability, but automated semantic judging remains an approximation. Still, compared with holistic paper scoring, claim-level evaluation is a methodological improvement. It at least asks the right question: which specific claims were recovered, missed, or hallucinated?

The agent researcher is not dead; it is underqualified

FIRE-Bench is not anti-agent. In fact, it is more useful than most optimistic agent papers because it describes what progress would have to look like.

Better research agents will need stronger planning priors, not just better coding ability. They will need mechanisms for proposing controls, checking whether an experiment identifies the target effect, and refusing to conclude when the evidence is insufficient. They will also need better self-auditing at the claim level: each final claim should be traceable to a specific result, comparison, or statistical pattern.
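The traceability requirement can be made concrete with a small check. This is a sketch under assumptions, not something the paper specifies: the `Claim` record, the result IDs, and the sample claims are hypothetical. The idea is simply that a final claim with no evidence link, or with a link to a result that was never produced, gets flagged before it reaches a reader.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    evidence_ids: list[str]  # IDs of the results this claim relies on


def untraceable_claims(claims: list[Claim], result_ids: set[str]) -> list[Claim]:
    """Return claims that cite no evidence or cite a result that does not exist."""
    return [
        claim
        for claim in claims
        if not claim.evidence_ids
        or any(eid not in result_ids for eid in claim.evidence_ids)
    ]


# Hypothetical run: two results were actually produced.
result_ids = {"exp1_table", "exp2_ablation"}
claims = [
    Claim("Accuracy drops when the answer sits mid-context.", ["exp1_table"]),
    Claim("The effect generalizes to all models.", []),
    Claim("Longer prompts reduce the gap.", ["exp3_missing"]),
]
flagged = untraceable_claims(claims, result_ids)
for claim in flagged:
    print("UNTRACEABLE:", claim.text)
```

A check like this only verifies that a link exists; whether the cited result actually supports the claim still needs a human or a separate evaluator.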

For businesses, the near-term opportunity is not replacing analysts or researchers wholesale. It is building hybrid workflows where agents handle structured execution and humans retain responsibility for method design and evidence interpretation. Over time, more of that judgment may become automatable. But FIRE-Bench shows that we should measure that transition, not narrate it into existence.

The fashionable phrase is “AI researcher.” FIRE-Bench’s quieter phrase is more accurate: “research agent under evaluation.”

That is less exciting. It is also how serious systems get built.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zhen Wang, Fan Bai, Zhongyan Luo, Jinyan Su, Kaiser Sun, Xinle Yu, Jieyuan Liu, Kun Zhou, Claire Cardie, Mark Dredze, Eric P. Xing, and Zhiting Hu, “FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights,” arXiv:2602.02905, 2026. https://arxiv.org/abs/2602.02905
