From Benchmarks to Beakers: Stress‑Testing LLMs as Scientific Co‑Scientists

Benchmarks are clean. Research is not.

A benchmark asks a model to answer a question, then politely stops. A research workflow asks the model to form a hypothesis, test it, read the result, notice what went wrong, adjust the plan, and try again without wandering into scientific nonsense. One is a quiz. The other is a beaker with a budget, a deadline, and a surprisingly expensive simulation queue.

That difference is at the center of “Evaluating Large Language Models in Scientific Discovery,” which introduces Scientific Discovery Evaluation, or SDE.¹ The paper’s real contribution is not simply another leaderboard for science-flavored AI. We have plenty of those already, and some are very good at making executives believe that a model with a strong science score is basically a junior principal investigator with better uptime. SDE tests a more inconvenient question: does performance on static scientific questions predict performance inside a discovery loop?

The answer is: sometimes, incompletely, and not in the way procurement decks would prefer.

The paper evaluates frontier LLMs across two levels. First, it builds 43 expert-curated research scenarios across biology, chemistry, materials science, and physics, then samples 1,125 scenario-grounded questions from them. Second, it evaluates models on eight iterative discovery projects where models propose hypotheses, receive feedback from computational or experimental oracles, and refine their proposals over multiple rounds.

That two-level design makes the paper useful for business readers. It does not merely ask whether LLMs “know science.” It asks where knowledge translates into discovery, where it fails, and where a model can still stumble into useful search behavior despite weak granular understanding. That is a more practical question, and also a more annoying one. Naturally, it is the one that matters.

The paper compares quiz competence with discovery competence

Most science benchmarks test isolated correctness. GPQA, MMMU, and similar benchmarks ask whether a model can answer difficult scientific questions. That matters. But the paper argues that scientific discovery has a different structure: questions are not isolated; they sit inside workflows.

SDE therefore begins from research projects rather than from free-floating question pools. Domain experts define realistic projects, decompose them into modular research scenarios, and then build questions tied to those scenarios. A retrosynthesis project, for example, naturally depends on scenarios such as retrosynthesis, reaction mechanism reasoning, and forward reaction prediction. A symbolic regression project depends more on computation, core physics knowledge, and statistics.

That decomposition is the quiet machinery of the paper. It allows the authors to compare three things that normal benchmarks often collapse into one:

Evaluation layer	What it tests	Why it matters for deployment
Scenario-grounded questions	Whether a model can solve research-relevant subproblems	Useful for diagnosing weak links in an R&D workflow
Project-level discovery loops	Whether a model can propose, test, and refine hypotheses	Useful for testing whether the model can operate under feedback
Scenario-to-project mapping	Whether subproblem accuracy predicts downstream discovery	Useful for deciding whether to route, constrain, or reject a model

This is not just a methodological nicety. In business terms, SDE turns “Which model is best?” into “Which model is safe and useful for this specific stage of this specific scientific workflow?” The second question is less glamorous. It is also the only question that resembles actual R&D management.

High science benchmark scores do not survive contact with the lab bench

The first major comparison is between decontextualized science QA and scenario-grounded scientific questions.

The paper reports that state-of-the-art models reach scores of 0.71 in biology, 0.60 in chemistry, 0.75 in materials, and 0.60 in physics on SDE’s scenario-grounded benchmark. By contrast, GPT-5 reaches 0.84 on MMMU-Pro and 0.86 on GPQA-Diamond. The gap is not a rounding error. It says that a model can look impressive on general science exams while still being meaningfully weaker on tasks that resemble pieces of real discovery work.

A more revealing example comes from within chemistry. GPT-5 scores 0.85 on retrosynthesis planning questions but only 0.23 on NMR-based structure elucidation. Same broad domain. Same model. Very different practical reliability.

That is the first business lesson: domain labels are too coarse. “Good at chemistry” is not an operational capability. A procurement team evaluating an AI assistant for drug discovery, materials screening, or lab planning should not ask for a general chemistry benchmark and call it a day. It needs scenario-level evidence.

The paper’s question-level tests are main evidence, not decoration. Their purpose is to measure scenario-specific competence under controlled scoring. They do not prove that a model can run an entire lab process, but they do reveal where the process may break if the model is inserted without guardrails.

A model that is strong at retrosynthesis multiple-choice questions but weak at spectroscopy interpretation may still be useful. It just should not be trusted as a general chemistry co-pilot. Give it a lane. Paint the lane. Add a barrier.

Reasoning helps, but “more reasoning” is not a magic solvent

The next comparison is between reasoning-enabled models and their non-reasoning counterparts.

The paper finds that explicit test-time reasoning improves performance in many scenario-grounded tasks. The clearest example is DeepSeek-R1 versus DeepSeek-V3.1, where the reasoning model improves across biology, chemistry, materials, and physics. In one Lipinski Rule of Five task, accuracy rises from 0.65 to 1.00 when reasoning is enabled. That result is unsurprising in the right way: Lipinski assessment requires integrating multiple molecular descriptors and threshold conditions. It is exactly the kind of task where stepwise reasoning should help.
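The threshold logic behind a Rule of Five check is small enough to write down, which helps show why stepwise reasoning suits it. The following is a minimal sketch, assuming the descriptors have already been computed elsewhere (for example by a cheminformatics toolkit). The `Descriptors` class, the function names, and the example values are illustrative and not taken from the paper. It uses the common convention that a molecule passes with at most one violation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    mol_weight: float       # molecular weight in daltons
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> list[str]:
    """Return the names of the Rule of Five thresholds that are exceeded."""
    checks = {
        "mol_weight > 500": d.mol_weight > 500,
        "log_p > 5": d.log_p > 5,
        "h_bond_donors > 5": d.h_bond_donors > 5,
        "h_bond_acceptors > 10": d.h_bond_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Common convention: at most one violation is tolerated."""
    return len(lipinski_violations(d)) <= max_violations


# Illustrative values only, not measurements from the paper.
small = Descriptors(mol_weight=180.2, log_p=1.2, h_bond_donors=1, h_bond_acceptors=4)
bulky = Descriptors(mol_weight=720.0, log_p=6.3, h_bond_donors=2, h_bond_acceptors=12)
print(passes_rule_of_five(small), lipinski_violations(small))
print(passes_rule_of_five(bulky), lipinski_violations(bulky))
```

Four thresholds and a count: a model that skips any one of them, or miscounts the violations, gets the answer wrong. That is the kind of bookkeeping that explicit reasoning tends to protect.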

But the paper then complicates the story. Increasing reasoning effort for GPT-5 produces only modest gains on SDE. Between medium and high reasoning effort, reported accuracy barely moves in biology, materials, and physics, and improves more visibly only in chemistry. Scaling model size from GPT-5 nano to mini to full GPT-5 helps, but the authors also observe signs of slowing improvement when comparing recent frontier models. GPT-5 does not simply dominate o3 across scenarios, and GPT-5-chat does not clearly separate itself from GPT-4o once reasoning is isolated.

This is not an argument that scaling is dead. That obituary has been written too many times by people who later had to pretend they were speaking metaphorically. The sharper point is that scientific discovery stresses competencies that generic reasoning benchmarks do not fully reward: problem formulation, hypothesis refinement, experimental interpretation, validity checking, and tool-mediated execution.

The paper’s scaling and reasoning analyses function partly as robustness and sensitivity tests. They ask whether the same improvement patterns seen in coding and math transfer cleanly into discovery-grounded science. The answer is mixed. Reasoning is useful, but the marginal return is not uniform; scale helps, but it does not erase scenario-specific failure.

For R&D teams, this changes the purchasing logic. Paying for more expensive reasoning should not be a default ritual. It should be justified by task type. Use high-reasoning models where the workflow genuinely requires multi-step derivation, evidence integration, or hypothesis revision. Do not burn inference budget to ask a very expensive model to be confidently mediocre at a bottleneck it was never trained to handle.

Model diversity is weaker when models fail together

A common enterprise instinct is to ensemble models. If one model is unreliable, ask several and take the majority vote. This sounds sensible because committees are traditionally used to convert uncertainty into minutes.

SDE shows why this approach has limits. The paper finds that top-performing models from different providers often rise and fall on the same scenarios. In chemistry and physics, pairwise correlations among GPT-5, Grok-4, DeepSeek-R1, and Claude Sonnet 4.5 are greater than 0.8. Even where overall model performance differs, the models frequently converge on the same difficult questions and the same wrong answers. In MOF synthesis, for example, four top models make the same mistake on four of 22 questions.

The authors then construct SDE-hard: 86 questions, two from each research scenario, selected from cases where top models make the most mistakes. All evaluated LLMs score below 0.12 on this hard subset. GPT-5 Pro improves meaningfully and answers nine questions that all other tested models miss, but at roughly 12 times the cost of GPT-5, according to the paper’s discussion. That is an improvement, not a rescue mission.
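The arithmetic is simple: 43 scenarios times two questions gives 86. A selection rule of this kind can be sketched as below. This is an assumption about the general shape of the procedure, not the authors' code: the input format, the tie-breaking rule, and the scenario and question names are all hypothetical.

```python
from collections import defaultdict


def select_hard_subset(error_counts, per_scenario=2):
    """Pick the questions that the most top models got wrong, per scenario.

    error_counts: iterable of (scenario, question_id, n_models_wrong).
    Ties are broken by question id so the selection is reproducible.
    """
    by_scenario = defaultdict(list)
    for scenario, question_id, n_wrong in error_counts:
        by_scenario[scenario].append((n_wrong, question_id))

    hard = {}
    for scenario, items in by_scenario.items():
        ranked = sorted(items, key=lambda item: (-item[0], item[1]))
        hard[scenario] = [question_id for _, question_id in ranked[:per_scenario]]
    return hard


# Hypothetical error tallies: how many of four top models missed each question.
tallies = [
    ("nmr_elucidation", "q1", 4),
    ("nmr_elucidation", "q2", 1),
    ("nmr_elucidation", "q3", 3),
    ("mof_synthesis", "q1", 0),
    ("mof_synthesis", "q2", 4),
    ("mof_synthesis", "q3", 4),
]
hard_subset = select_hard_subset(tallies)
print(hard_subset)
```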

This evidence is important because it challenges a lazy version of “multi-model governance.” If model errors are independent, ensembling can help. If the models share training distributions, interface habits, and representational blind spots, the ensemble may simply agree on the wrong answer with better typography.
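A toy example makes the dependence on error correlation concrete. The sketch below uses made-up answer sheets for three hypothetical models, where `X` marks a wrong answer. Nothing here comes from the paper's data.

```python
from collections import Counter


def majority_vote(answers):
    """Most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions, truth):
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


def ensemble_accuracy(model_answers, truth):
    """Vote question by question across models, then score the votes."""
    votes = [majority_vote(column) for column in zip(*model_answers)]
    return accuracy(votes, truth)


truth = list("ABCDABCD")

# Each model is wrong on a different question.
disjoint_errors = [list("XBCDABCD"), list("ABXDABCD"), list("ABCDAXCD")]

# All models are wrong on the same two questions, with the same wrong answer.
shared_errors = [list("XXCDABCD"), list("XXCDABCD"), list("XXCDABCX")]

for label, sheets in (("disjoint", disjoint_errors), ("shared", shared_errors)):
    singles = [accuracy(sheet, truth) for sheet in sheets]
    print(label, "single models:", singles, "ensemble:", ensemble_accuracy(sheets, truth))
```

With disjoint errors, voting recovers every answer. With shared errors, the ensemble does no better than its best member: the committee has simply ratified the mistake.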

For business use, model diversity must be measured, not assumed. A serious evaluation should ask:

Governance tactic	When it helps	When it fails
Majority voting across models	Errors are weakly correlated	Frontier models share the same blind spots
Escalation to expensive reasoning	A small hard subset is economically important	The hard subset remains mostly unsolved
Specialist routing	Scenario strengths differ by model	Routing is based only on broad domain labels
Tool verification	Outputs can be checked by simulators, databases, or validators	The workflow lacks reliable external oracles

The paper’s shared-failure analysis is main evidence for model-risk management. It does not prove that ensembles are useless. It proves that ensemble value must be tested at the level of the scenario, not assumed at the level of the vendor logo.
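Measuring diversity at the scenario level can start with something as plain as a pairwise correlation over per-scenario accuracies. The sketch below uses Pearson correlation on hypothetical scores. The model names and numbers are placeholders, and the paper's exact correlation procedure may differ.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracy per scenario (same scenario order for every model).
scenario_accuracy = {
    "model_a": [0.82, 0.25, 0.60, 0.40],
    "model_b": [0.80, 0.30, 0.55, 0.45],
    "model_c": [0.40, 0.70, 0.50, 0.65],
}

correlations = {
    (a, b): pearson(scenario_accuracy[a], scenario_accuracy[b])
    for a, b in combinations(scenario_accuracy, 2)
}
for pair, r in correlations.items():
    print(pair, round(r, 2))
```

A pair that correlates strongly adds little as a second opinion. A pair that correlates weakly or negatively is where voting or routing has a chance to earn its cost.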

The discovery loop tests capabilities that quizzes cannot see

The project-level part of SDE is where the paper becomes more interesting for real deployment.

The authors introduce sde-harness, a framework that places models inside iterative discovery loops. Each project defines a hypothesis space, an oracle or simulator, and a selection rule. The model proposes candidates. The oracle evaluates them. Promising candidates are retained. The model receives feedback and proposes again. This is not full laboratory autonomy, but it is much closer to scientific work than answering static questions.
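The propose, evaluate, retain, and feed back cycle can be sketched in a few lines. This is not the sde-harness API, which this article does not describe in detail. It is a toy loop under stated assumptions: the hypothesis space is a single real number, the oracle is a simple function with a known peak, and the `propose` function is a random perturbation standing in for an LLM call.

```python
import random


def oracle(x: float) -> float:
    """Toy scoring function standing in for a simulator; peak at x = 3."""
    return -(x - 3.0) ** 2


def propose(parents: list[float], batch: int, rng: random.Random) -> list[float]:
    """Stand-in for the LLM: random guesses at first, then perturb survivors."""
    if not parents:
        return [rng.uniform(-10.0, 10.0) for _ in range(batch)]
    return [rng.choice(parents) + rng.gauss(0.0, 0.5) for _ in range(batch)]


def discovery_loop(rounds: int = 20, batch: int = 8, keep: int = 4, seed: int = 0):
    rng = random.Random(seed)
    pool: list[tuple[float, float]] = []  # (score, candidate), best first
    best_per_round: list[float] = []
    for _ in range(rounds):
        parents = [candidate for _, candidate in pool]  # feedback to the proposer
        candidates = propose(parents, batch, rng)
        scored = [(oracle(c), c) for c in candidates]   # oracle evaluates
        pool = sorted(pool + scored, reverse=True)[:keep]  # selection rule
        best_per_round.append(pool[0][0])
    return pool, best_per_round


pool, best_per_round = discovery_loop()
print("best candidate:", pool[0][1], "score:", pool[0][0])
```

Because the pool always keeps its top candidates, the best score never gets worse from round to round. Whether it improves quickly is the proposer's job, which is exactly the part the paper hands to the LLM.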

The eight projects span retrosynthesis pathway design, molecule optimization, transition metal complex optimization, crystal structure discovery, protein sequence optimization, gene editing, symbolic regression, and solving an Ising model. The point is not that these eight exhaust science. They do not. The point is that they create a testbed where models must operate under feedback, validity constraints, and search pressure.

The project-level evaluation is main evidence for discovery behavior. It supports claims about iterative hypothesis generation under the paper’s chosen workflows. It does not prove general scientific autonomy, and the authors do not need it to. A benchmark becomes more useful when it stops pretending to be the universe.

Several project results are especially revealing.

In transition metal complex (TMC) optimization, GPT-5, DeepSeek-R1, and Claude Sonnet 4.5 all find the optimal high-polarisability solution within a TMC space of 1.37 million candidates across five random seeds. Claude Sonnet 4.5 converges quickly, while DeepSeek-R1 produces the broadest and most balanced Pareto frontier when trading off polarisability against the HOMO-LUMO gap. GPT-5-chat-latest performs worse, suggesting that reasoning matters in this workflow.
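For readers less familiar with the term, a Pareto frontier is the set of candidates that no other candidate beats on every objective at once. A minimal sketch follows, assuming both objectives are to be maximized and using invented (polarisability, gap) pairs. The paper's actual objective directions and values are not reproduced here.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (polarisability, HOMO-LUMO gap) pairs in arbitrary units.
candidates = [(10, 1.0), (8, 2.0), (6, 3.0), (7, 1.5), (10, 0.5), (5, 3.0)]
front = pareto_front(candidates)
print(front)
```

A quick converger may find one excellent point on this frontier. A balanced explorer fills in more of it, which matters when the final choice depends on a trade-off the team has not yet settled.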

In symbolic regression, LLM-guided methods outperform PySR by large margins on the reported accuracy metrics. DeepSeek-R1 achieves the best in-distribution accuracy and the lowest in-distribution normalized mean squared error (NMSE), while GPT-5 matches the best out-of-distribution (OOD) accuracy and has the lowest reported OOD NMSE among LLM methods. This is a case where LLMs seem to contribute useful global priors over symbolic structure, rather than merely running local search with more words.
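NMSE is worth a concrete definition, since it is easy to misread as raw error. One common form divides the squared error by the spread of the targets around their mean, so that always predicting the mean scores 1.0 and a perfect fit scores 0.0. The sketch below uses that form as an assumption. The paper's exact normalization may differ.

```python
def nmse(y_true, y_pred):
    """Squared error normalized by the targets' spread around their mean."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean = sum(y_true) / len(y_true)
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    spread = sum((t - mean) ** 2 for t in y_true)
    if spread == 0:
        raise ValueError("targets are constant, so NMSE is undefined")
    return squared_error / spread


targets = [1.0, 2.0, 3.0, 4.0]
print(nmse(targets, targets))                # perfect fit
print(nmse(targets, [2.5, 2.5, 2.5, 2.5]))   # always predicting the mean
```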

In crystal structure discovery, LLM-based search achieves strong validity and metastability results compared with diffusion baselines. GPT-5 records the highest S.U.N. (stable, unique, and novel) rate among reported methods, and DeepSeek-R1 and Grok-4 show competitive metastability rates. That result is interesting because crystal structure generation is not an obvious natural-language task. The model is not “understanding crystals” in a magical sense; it is serving as a proposal mechanism inside a constrained search loop.

These are the optimistic results. They matter. They show that LLMs can already be useful in scientific search when the task has structured representations, external evaluation, and a feedback loop that converts speculation into scored candidates.

But optimism has a boundary, and the paper is useful precisely because it draws one.

Scenario accuracy predicts some projects, then betrays you

The most business-relevant comparison in the paper is between scenario-level accuracy and project-level success.

Sometimes the relationship is straightforward. Strong performance on molecular property prediction, SMILES manipulation, gene manipulation, protein localization, and algebra corresponds to better performance in related projects such as organic molecule optimization, gene editing, symbolic regression, and protein design. Weak performance in quantum information and condensed matter theory corresponds to weak performance on the all-to-all Ising model, where most models fail to beat the simulated annealing baseline.
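For context on what the models are being asked to beat, a simulated annealing baseline for an all-to-all Ising model fits in a short script. The sketch below is generic and rests on assumptions: the energy is taken as E(s) = -Σ_{i<j} J_ij s_i s_j with random ±1 couplings, the cooling schedule is geometric, and every parameter value is illustrative. The paper's Hamiltonian, problem sizes, and baseline settings are not reproduced here.

```python
import math
import random


def random_couplings(n, seed=0):
    """Symmetric all-to-all couplings with random +1/-1 entries, zero diagonal."""
    rng = random.Random(seed)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            J[i][j] = J[j][i] = rng.choice((-1.0, 1.0))
    return J


def energy(J, s):
    n = len(s)
    return -sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))


def anneal(J, steps=5000, t_start=2.0, t_end=0.05, seed=0):
    rng = random.Random(seed)
    n = len(J)
    s = [rng.choice((-1, 1)) for _ in range(n)]
    e = energy(J, s)
    best_e, best_s = e, s[:]
    for k in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (k / max(steps - 1, 1))
        i = rng.randrange(n)
        # Energy change from flipping spin i.
        delta = 2 * s[i] * sum(J[i][j] * s[j] for j in range(n) if j != i)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            s[i] = -s[i]
            e += delta
            if e < best_e:
                best_e, best_s = e, s[:]
    return best_e, best_s


J = random_couplings(12)
best_e, best_s = anneal(J)
print("best energy found:", best_e)
```

The baseline knows no physics beyond the energy function. That is what makes it a useful bar: a model that cannot beat blind, patient flipping is not contributing much insight.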

That is the intuitive part: if a model cannot handle the building blocks, the project suffers.

Then comes the betrayal.

In TMC-related scenarios, no model demonstrates high proficiency in granular tasks such as oxidation state, spin state, and redox potential prediction. Yet GPT-5, DeepSeek-R1, and Claude Sonnet 4.5 perform strongly in TMC optimization, finding high-polarisability candidates and exploring Pareto frontiers. The paper interprets this as evidence that explicit structure-property mastery is not always required for useful discovery. A model may still navigate the search space productively by recognizing optimization directions and supporting serendipitous exploration.

The opposite also happens. Top models score well on retrosynthesis-related questions, reaction mechanisms, and forward reaction prediction, but struggle to generate valid multi-step synthesis routes. On the Pistachio Hard benchmark, GPT-4o solves 60% of targets, outperforming GPT-5 at 53%, GPT-5-chat at 49%, Claude Sonnet 4.5 at 53%, and DeepSeek-R1 at 42%. Some traditional search methods still outperform all tested LLMs. The failure mode is not simply “bad chemistry.” It includes invalid SMILES, route-format problems, reaction validity failures, and errors that accumulate over a long chain.
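The compounding effect is easy to underestimate. As a back-of-the-envelope sketch, assume each step in a route is valid independently with the same probability. That is a simplification, not a claim from the paper, since real route errors are neither independent nor uniform. Under it, a route of n steps is fully valid with probability p to the power n.

```python
def route_validity(per_step_validity: float, n_steps: int) -> float:
    """Chance that every step is valid, assuming independent and identical steps."""
    return per_step_validity ** n_steps


for p in (0.99, 0.95, 0.90):
    row = [round(route_validity(p, n), 2) for n in (1, 5, 10)]
    print(f"per-step validity {p}: routes of 1, 5, 10 steps -> {row}")
```

Under this assumption, a model whose individual steps are valid 90% of the time produces a fully valid ten-step route only about 35% of the time. Strong local accuracy and weak route-level success are not a contradiction. They are multiplication.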

This is the heart of the paper. Knowledge and discovery are related, but not linearly.

A model can fail local scientific questions yet still be useful as a search heuristic. A model can pass local questions yet fail under long-horizon validity constraints. That distinction matters for companies because many AI adoption failures happen exactly here: teams evaluate what is easy to score, then deploy into what is hard to control.

The right question is not: “Does the model understand the field?”

The better question is: “Which failure mode becomes expensive when this model is placed inside our workflow?”

The paper’s tests serve different evidentiary roles

SDE is easy to misread as a single scoreboard. It is better read as a set of diagnostic instruments. Each experiment has a different purpose.

Paper component	Likely purpose	What it supports	What it does not prove
Scenario-grounded questions	Main evidence for subtask competence	Models vary sharply by research scenario	A model can run an end-to-end project
Reasoning and model-size comparisons	Sensitivity test	Reasoning and scale help, but returns diminish unevenly	Scaling no longer matters
Cross-model error correlations	Robustness and risk analysis	Frontier models share difficult failure modes	All ensembles are useless
SDE-hard subset	Stress test	Current models remain weak on the hardest scenario-linked questions	GPT-5 Pro or any future model solves discovery
Eight project loops	Main evidence for iterative discovery behavior	Some models can guide useful search under oracles	General autonomous science is solved
Retrosynthesis failure analysis	Implementation-detail diagnosis	Validity and format constraints can dominate project success	The models lack all chemistry knowledge
Project-scenario mapping	Mechanistic interpretation	Question accuracy partially predicts project outcomes	Subtask scores are sufficient for deployment

This table is not just a reading aid. It is the deployment template. Before adopting an LLM for R&D, a firm should classify its own evaluation evidence in the same way. Which tests are main evidence? Which are ablations? Which are robustness checks? Which are just pretty demos wearing a lab coat?

The business value is workflow diagnosis, not leaderboard worship

For companies building or buying AI for R&D, the paper points toward a practical evaluation architecture.

First, define the workflow. Not “chemistry assistant,” “biology copilot,” or “materials AI.” Those are brochure labels. Define the actual stages: literature triage, candidate generation, protocol design, structure validation, simulation orchestration, result interpretation, route planning, or experiment prioritization.

Second, decompose the workflow into scenario-level capabilities. SDE’s method is useful because it treats a project as a composition of modules. A business version could map a pharmaceutical lead-optimization workflow into descriptor prediction, scaffold modification, toxicity filtering, synthesis feasibility, and assay interpretation. A materials workflow could map into structure generation, stability prediction, text-mined synthesis constraints, powder X-ray diffraction (PXRD) interpretation, and safety classification.

Third, test both local accuracy and closed-loop behavior. A model that performs well on local questions may still fail when outputs must remain valid across many steps. Retrosynthesis is the warning label. Conversely, a model with imperfect local knowledge may still add value if an external oracle can score its proposals and guide exploration. TMC optimization is the more encouraging case.

Fourth, route models by workflow role. The paper finds no single model wins across all projects. That should push firms away from one-model monoculture. But routing must be empirical. A model that is strong in symbolic regression may not be the right model for synthesis planning. A model that gives balanced Pareto exploration may not be the fastest converger. A model that is cheap may be good enough for broad candidate generation but not for hard edge cases.
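Empirical routing can begin as a lookup over measured scenario scores. The sketch below picks, for each workflow role, the cheapest model that clears an accuracy threshold and flags roles where none does. The model names, scores, costs, and threshold are all hypothetical placeholders, and the rule itself is one possible policy rather than anything the paper prescribes.

```python
def route(scores, costs, threshold):
    """Assign each role the cheapest model whose measured accuracy meets the threshold.

    scores: {role: {model: accuracy}}, costs: {model: relative cost per call}.
    A role maps to None when no model qualifies, which signals that the step
    needs a validator, a specialist tool, or human review instead.
    """
    plan = {}
    for role, by_model in scores.items():
        eligible = [m for m, acc in by_model.items() if acc >= threshold]
        if eligible:
            # Cheapest first; break cost ties in favor of higher accuracy.
            plan[role] = min(eligible, key=lambda m: (costs[m], -by_model[m]))
        else:
            plan[role] = None
    return plan


# Hypothetical measurements from an in-house evaluation.
scores = {
    "candidate_generation": {"cheap": 0.72, "mid": 0.78, "premium": 0.81},
    "route_planning": {"cheap": 0.41, "mid": 0.58, "premium": 0.74},
    "spectra_interpretation": {"cheap": 0.18, "mid": 0.22, "premium": 0.30},
}
costs = {"cheap": 1.0, "mid": 4.0, "premium": 12.0}

plan = route(scores, costs, threshold=0.70)
print(plan)
```

The `None` entry is the useful output. It marks the step where no amount of model shopping helps and the budget should go to verification instead.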

Fifth, add validators where failure is structural. If invalid SMILES, invalid routes, malformed outputs, or failed tool calls dominate, the solution is not necessarily a larger model. It may be schema enforcement, constrained decoding, domain-specific validators, retrieval over reaction templates, simulator integration, or post-processing repair. Boring engineering, yes. But science has survived worse.
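A structural validator does not need to know any chemistry to catch a large class of failures. The sketch below checks a proposed route for missing fields, an unproduced target, and reactants that are neither in stock nor made by another step. The route schema, field names, and compound labels are hypothetical, and a real pipeline would add chemistry-aware checks, such as parsing structures and validating reactions, on top of this.

```python
def validate_route(target, steps, stock):
    """Return a list of structural problems; an empty list means the route passes.

    steps: list of dicts with a "product" string and a "reactants" list.
    stock: set of compounds assumed to be purchasable.
    This checks format and connectivity only, not chemical plausibility.
    """
    errors = []
    products = {step.get("product") for step in steps if step.get("product")}
    if target not in products:
        errors.append("target is never produced")
    for index, step in enumerate(steps):
        if not step.get("product") or not step.get("reactants"):
            errors.append(f"step {index}: missing product or reactants")
            continue
        for reactant in step["reactants"]:
            if reactant not in stock and reactant not in products:
                errors.append(
                    f"step {index}: reactant {reactant!r} is neither in stock nor produced"
                )
    return errors


stock = {"A", "B", "C"}
good_route = [
    {"product": "T", "reactants": ["I1", "C"]},
    {"product": "I1", "reactants": ["A", "B"]},
]
bad_route = [
    {"product": "T", "reactants": ["I1", "Z"]},
    {"product": "I1", "reactants": []},
]
print(validate_route("T", good_route, stock))
print(validate_route("T", bad_route, stock))
```

A gate like this can sit between the model and the oracle, so that malformed proposals are rejected or sent back for repair before they consume simulation budget.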

What Cognaptus infers, and what the paper directly shows

The paper directly shows that SDE creates a tighter evaluation bridge between research-relevant questions and iterative scientific projects. It shows that frontier models score lower on SDE’s scenario-grounded tasks than on standard science benchmarks. It shows diminishing returns from reasoning effort and scaling in some discovery-oriented contexts. It shows shared failure modes among top models. It shows that project-level performance only partially follows scenario-level accuracy.

Cognaptus infers a broader operational lesson: AI evaluation for R&D should be designed as workflow insurance. The benchmark should not merely rank models; it should identify where model failure becomes costly, where tool verification can contain risk, and where model routing can create better performance-per-dollar.

What remains uncertain is how far these results generalize beyond the paper’s scope. SDE covers four domains, 43 scenarios, and eight projects. That is substantial, but not universal. Earth sciences, social sciences, many engineering workflows, clinical trial design, regulatory science, and wet-lab automation at scale are not fully represented. Project-level tests use a subset of frontier models because evaluation is expensive. The closed-loop experiments rely on one broad evolutionary-search protocol and a specific prompting setup. Commercial API endpoints can also shift under provider-side updates, which complicates reproducibility.

These limitations do not weaken the paper’s practical value. They define it. SDE should be read less as the final exam for AI scientists and more as a scaffold for building better exams.

From model selection to research-system design

The most tempting interpretation of SDE is that LLMs are not ready to be autonomous scientists. That is true, but it is also too easy. Most serious R&D teams were not going to hand the lab keys to a chatbot anyway. At least, not while sober.

The more useful interpretation is that scientific AI systems should be designed around complementarity. LLMs can propose, explain, recombine, and explore. Oracles, simulators, databases, validators, and human experts must score, constrain, and correct. In some workflows, the model’s value is expert-like reasoning. In others, it is structured serendipity: generating plausible directions that a hard evaluator can test.

That means the future scientific co-scientist is unlikely to be a single model sitting in a chat window. It will be a system: model routers, scenario diagnostics, tool interfaces, validity gates, experiment logs, cost controls, safety checks, and escalation paths. Less glamorous than artificial superintelligence. More likely to work before the next funding cycle.

SDE’s contribution is to make that system view measurable. It shows that benchmarks must move from isolated answers to discovery loops; from broad domains to scenarios; from vendor comparison to failure-mode diagnosis; from “Can it answer?” to “Can it improve under evidence without breaking the workflow?”

The beaker, inconveniently, is where the truth lives.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zhangde Song et al., “Evaluating Large Language Models in Scientific Discovery,” arXiv:2512.15567v2, 8 May 2026, https://arxiv.org/html/2512.15567.
