When LLMs Stop Guessing and Start Calculating

A simulation job does not care how elegant the prompt was.

It cares whether the input files are valid, whether the parameters are compatible, whether the previous step produced the right intermediate state, whether the solver converged, and whether the final number actually means what the workflow says it means. This is where the romance of “AI scientists” usually meets the concrete wall of scientific computing. The model can sound like a postdoc. The machine still wants the correct INCAR tag.

The paper behind this article, An Agentic Framework for Autonomous Materials Computation, takes that wall seriously.¹ It does not ask a general-purpose LLM to “do materials science” by vibes, memory, or confident autocomplete. It wraps LLMs inside an expert-informed agentic framework for VASP-based first-principles materials computation, then tests the system on structural relaxation, band structure, adsorption energy, and transition-state tasks.

The important shift is not that the LLM becomes brilliant. It is that the LLM is no longer allowed to freelance.

The agent is a workflow machine, not a free-form scientist

The paper’s central design choice is deceptively simple: materials simulations already have established scientific procedures, so the agent should operate through those procedures rather than invent them on demand.

The authors formalize expert practice into Workflows. A workflow represents the high-level strategy for a scientific goal, such as structural relaxation, band structure calculation, adsorption energy calculation, or transition-state analysis. Each workflow is then executed through reusable Modular Components: file reading, file writing, command-line execution, data extraction, error handling, and LLM-based parameter generation.
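The split between workflows and modular components can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Workflow` class, the component functions, and the parameter values are all hypothetical, and the LLM call and the solver run are replaced by stubs.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Component = Callable[[State], State]


@dataclass
class Workflow:
    """An expert-defined procedure: an ordered list of reusable components."""
    name: str
    steps: List[Component] = field(default_factory=list)

    def run(self, state: State) -> State:
        for step in self.steps:
            state = step(state)
            state.setdefault("log", []).append(step.__name__)
        return state


def read_inputs(state: State) -> State:
    # File reading: refuse to continue without the essential input files.
    missing = {"POSCAR", "POTCAR", "KPOINTS"} - set(state["files"])
    if missing:
        raise FileNotFoundError(f"missing inputs: {sorted(missing)}")
    return state


def generate_parameters(state: State) -> State:
    # Stand-in for the LLM call. The values are illustrative assumptions.
    state["incar"] = {"IBRION": 2, "ISIF": 2, "NSW": 100}
    return state


def write_incar(state: State) -> State:
    # File writing: render the parameters as INCAR text.
    state["files"]["INCAR"] = "\n".join(
        f"{key} = {value}" for key, value in state["incar"].items()
    )
    return state


def run_solver(state: State) -> State:
    # Command-line execution placeholder; no solver is invoked here.
    state["ran"] = True
    return state


WORKFLOWS = {
    "structural_relaxation": Workflow(
        "structural_relaxation",
        [read_inputs, generate_parameters, write_incar, run_solver],
    ),
}

result = WORKFLOWS["structural_relaxation"].run(
    {"files": {"POSCAR": "...", "POTCAR": "...", "KPOINTS": "..."}}
)
print(result["log"])
```

The point of the shape is that the model only fills in one step. The order of steps, the required files, and the execution path are fixed by the workflow.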

That architecture matters because most failure in this setting is not a failure to write fluent scientific language. It is a failure to preserve procedural discipline.

A simplified version of the mechanism looks like this:

Stage	What the system does	Why it matters
User request	Receives the simulation goal and essential files such as POSCAR, POTCAR, and KPOINTS	Grounds the task in actual computational inputs
Workflow selection	Maps the request to a predefined expert workflow	Prevents the model from inventing invalid scientific procedures
Hierarchical prompting	Supplies domain background, current state, intermediate outputs, and output constraints	Turns parameter generation into a bounded task
Modular execution	Writes files, runs commands, parses outputs, and handles errors	Makes the agent operate in the computational environment, not just in text
Result extraction	Parses files such as OUTCAR into user-readable outputs	Separates “the job ran” from “the result was measured”
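The last row of the table, result extraction, can be sketched as follows. This is a rough illustration rather than the paper's parser: the function name is hypothetical, the sample text is invented, and the two line patterns reflect how OUTCAR output is commonly formatted, which should be checked against the VASP version in use.

```python
import re
from typing import Optional, Tuple

# Pattern for the total free energy lines written during a run.
TOTEN = re.compile(r"free\s+energy\s+TOTEN\s*=\s*(-?\d+\.\d+)\s*eV")
# Message written when an ionic relaxation meets its stopping criterion.
CONVERGED = "reached required accuracy"


def extract_relaxation_result(outcar_text: str) -> Tuple[Optional[float], bool]:
    """Return (last reported energy in eV, whether relaxation converged)."""
    energies = TOTEN.findall(outcar_text)
    final_energy = float(energies[-1]) if energies else None
    return final_energy, CONVERGED in outcar_text


# Invented excerpt that only mimics the layout of a real OUTCAR.
sample = (
    "  free  energy   TOTEN  =       -10.84321000 eV\n"
    "  free  energy   TOTEN  =       -10.90210000 eV\n"
    " reached required accuracy - stopping structural energy minimisation\n"
)

energy, converged = extract_relaxation_result(sample)
print(energy, converged)
```

Returning the energy and the convergence flag as separate values keeps "a number was found" apart from "the relaxation finished properly".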

The LLM still plays an important role. It generates context-aware simulation parameters, especially for files such as VASP’s INCAR. But it does this inside a scaffold of scientific constraints and executable components.
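A bounded parameter-generation step might look like the sketch below. The section titles, function names, and reply format are assumptions made for illustration; the paper's actual prompts are not reproduced here.

```python
import json
import re
from typing import Dict


def build_prompt(background: str, task_state: str,
                 intermediate: Dict[str, str], constraints: str) -> str:
    """Assemble the layers of context in a fixed order."""
    sections = [
        ("DOMAIN BACKGROUND", background),
        ("CURRENT STATE", task_state),
        ("INTERMEDIATE OUTPUTS", json.dumps(intermediate, indent=2)),
        ("OUTPUT CONSTRAINTS", constraints),
    ]
    return "\n\n".join(f"[{title}]\n{body}" for title, body in sections)


def parse_incar_reply(reply: str) -> Dict[str, str]:
    """Accept only 'TAG = value' lines; anything else is rejected."""
    tags: Dict[str, str] = {}
    for raw in reply.strip().splitlines():
        # Drop trailing comments introduced by '#' or '!'.
        line = re.split(r"[#!]", raw, maxsplit=1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep or not key.strip() or not value.strip():
            raise ValueError(f"not a TAG = value line: {raw!r}")
        tags[key.strip().upper()] = value.strip()
    return tags


prompt = build_prompt(
    background="VASP structural relaxation of a bulk crystal.",
    task_state="Step 2 of 4: generate INCAR parameters.",
    intermediate={"POSCAR": "read", "KPOINTS": "read"},
    constraints="Reply with TAG = value lines only.",
)
print(parse_incar_reply("IBRION = 2  # conjugate gradient\nISIF = 2"))
```

Rejecting a chatty reply at the parser is the cheap part of the discipline: free-form text never reaches the input file.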

That is the mechanism-first lesson. The model supplies flexible reasoning. The framework supplies discipline. In scientific computing, discipline is not decorative. It is the product.

The benchmark tests execution, not eloquence

The paper also contributes a benchmark, and this is where the work becomes more than another nice demo.

The benchmark covers 80 practical computational scenarios across four common materials-computation task types. The authors also state that the underlying benchmark involves over 100 materials, with data sourced from public databases or literature and then computationally reproduced and validated.

The four task categories are useful because they represent different levels of workflow difficulty:

Task	Scientific role	Evaluation focus
Structural Relaxation (SR)	Finds a stable, low-energy structure	Energy and structural similarity using SOAP descriptors
Band Structure (BS)	Characterizes electronic properties such as band gaps	Predicted band gap against reference values
Adsorption Energy (AE)	Measures molecule–surface interaction, important in catalysis	Adsorption, surface, adsorbate, and gas-phase energies
Transition State (TS)	Locates reaction barriers and mechanism feasibility	Initial/final energies, NEB interpolation, reaction energy, activation barrier
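The quantities in the last two rows follow textbook definitions, sketched below. Sign conventions vary between groups and the paper's exact conventions are not restated in this article, so treat the function names and signs as assumptions. Taking the highest NEB image as the barrier top is a common approximation, not a guarantee that the saddle point was found. The energies in the example are placeholders, not benchmark data.

```python
from typing import Sequence


def adsorption_energy(e_slab_with_adsorbate: float,
                      e_clean_slab: float,
                      e_adsorbate: float) -> float:
    """E_ads = E(slab + adsorbate) - E(slab) - E(adsorbate), all in eV."""
    return e_slab_with_adsorbate - e_clean_slab - e_adsorbate


def reaction_energy(e_initial: float, e_final: float) -> float:
    """Energy of the final state relative to the initial state."""
    return e_final - e_initial


def activation_barrier(image_energies: Sequence[float]) -> float:
    """Highest energy along the path relative to the first image."""
    return max(image_energies) - image_energies[0]


# Placeholder energies (eV) chosen only to make the arithmetic easy to follow.
path = [0.0, 0.5, 0.75, 0.25, -0.5]
print(adsorption_energy(-105.0, -100.0, -4.0))
print(reaction_energy(path[0], path[-1]))
print(activation_barrier(path))
```

Each output depends on several separate calculations, which is why a single bad intermediate energy can leave a workflow completed but wrong.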

This is not a benchmark where the model answers questions about science. It is a benchmark where the system must execute scientific computation.

That distinction is not a minor detail. A chatbot can produce a plausible answer in seconds. A transition-state data point, according to the paper, may require 36 to 72 hours on a conventional 28-core CPU node, because it involves initial structure construction, transition-state search using NEB, and validation. When computation is this expensive, a “nearly correct” workflow is often just a slow way to burn compute.

A serious benchmark must therefore ask two separate questions:

Did the task complete?
Was the result accurate?

The paper wisely refuses to collapse those into one feel-good metric.
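Keeping the two questions apart is easy to express in code. The sketch below is an illustration only: the `TaskResult` record, the 5% relative tolerance, and the sample values are assumptions, since the paper's scoring rule is not described in this article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResult:
    task: str
    completed: bool
    predicted: Optional[float] = None
    reference: Optional[float] = None


def completion_rate(results: List[TaskResult]) -> float:
    """Share of tasks that ran to the end."""
    return sum(r.completed for r in results) / len(results)


def accuracy_rate(results: List[TaskResult], rel_tol: float = 0.05) -> float:
    """Share of ALL tasks whose value lies within rel_tol of the reference."""
    def is_accurate(r: TaskResult) -> bool:
        if not r.completed or r.predicted is None or r.reference is None:
            return False
        return abs(r.predicted - r.reference) <= rel_tol * abs(r.reference)
    return sum(is_accurate(r) for r in results) / len(results)


# Invented sample: three tasks finish, but only two land near the reference.
results = [
    TaskResult("SR", True, predicted=-10.0, reference=-10.1),
    TaskResult("BS", True, predicted=1.10, reference=1.12),
    TaskResult("TS", True, predicted=0.40, reference=0.80),
    TaskResult("AE", False),
]
print(completion_rate(results), accuracy_rate(results))
```

In this toy sample the completion rate is 0.75 and the accuracy rate is 0.5. A single blended score would hide exactly the case that matters: the task that finished and was wrong.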

Completion is not accuracy, and the paper proves it

The authors evaluate six models: DeepSeek-V3, GPT-4o, Qwen3-32B, o4-mini, Gemini-2.5 Pro, and Claude-3.7 Sonnet. Each model is tested with and without the agent framework.

The broad result is clear: agent support improves both completion rate and accuracy across the tested models. For GPT-4o, the paper reports completion rising from 66.46% to 97.92%, and result accuracy rising from 45.74% to 73.07%. That is not a small prompt tweak. That is the difference between a model that often fails to run the scientific process and a system that usually gets through the process with more reliable outputs.

But the more interesting result is the gap between “ran successfully” and “was scientifically right.”

The authors report that most task types reach nearly complete execution after agent integration. Accuracy also improves: structural relaxation rises above 95% on average, while band structure and adsorption energy exceed 80%. Then transition-state tasks spoil the party, as they should. Completion improves substantially, but accuracy remains persistently low.

This is the paper’s most useful correction to a common misconception:

Reader belief	Paper’s correction	Why it matters
If the agent finishes the workflow, the science is probably reliable	Transition-state tasks can complete yet still produce inaccurate results	Operational reliability and scientific validity are different layers
Bigger models should solve most of the issue	Agentic structure improves all models, including open-source ones	Architecture can narrow model gaps, but does not erase domain difficulty
Tool use is enough	The hard cases require parameter understanding, context consistency, and inspection	Tool access without scientific governance is just faster failure

The transition-state result is especially important because it keeps this article from becoming an “agents solve science” brochure. The agent improves execution. It does not magically remove the complexity of reaction-path calculations.

Good. Science needed that reality check.

The failure cases read like an engineering requirements document

The paper’s failure analysis is short, but it is probably the most directly useful part for anyone building agentic systems outside materials science.

The authors identify three recurring failure modes.

First, LLMs sometimes initialize incorrect or missing INCAR tags. For example, they may fail to set tags such as LHFCALC or AEXX for hybrid functional calculations, or they may invent non-existent tags. This is the most basic failure: the job may not even start.
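A pre-flight check for this first failure mode could look like the sketch below. It is an assumption-laden illustration: `KNOWN_TAGS` is a tiny hand-picked subset of real VASP tags, `check_incar` is a hypothetical helper, and `HYBRID_MODE` is deliberately made up to stand in for an invented tag.

```python
from typing import Dict, List

# Small illustrative subset of real INCAR tags; a real check needs the full list.
KNOWN_TAGS = {
    "ENCUT", "EDIFF", "EDIFFG", "IBRION", "ISIF", "NSW", "POTIM",
    "ISMEAR", "SIGMA", "LHFCALC", "AEXX", "HFSCREEN",
}


def check_incar(tags: Dict[str, object], hybrid: bool = False) -> List[str]:
    """Return a list of problems; an empty list means no issue was found."""
    problems = [
        f"unknown tag: {name}"
        for name in sorted(tags)
        if name.upper() not in KNOWN_TAGS
    ]
    if hybrid:
        lhfcalc = str(tags.get("LHFCALC", "")).strip(".").upper()
        if lhfcalc not in {"TRUE", "T"}:
            problems.append("hybrid calculation requires LHFCALC = .TRUE.")
        if "AEXX" not in tags:
            problems.append("AEXX not set explicitly for a hybrid calculation")
    return problems


# HYBRID_MODE is a made-up tag, standing in for a hallucinated one.
print(check_incar({"ENCUT": 520, "HYBRID_MODE": ".TRUE."}, hybrid=True))
print(check_incar({"LHFCALC": ".TRUE.", "AEXX": 0.25}, hybrid=True))
```

A check like this costs milliseconds and runs before any compute is spent, which is the right place to catch a tag that does not exist.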

Second, LLMs struggle with tag interdependence. The paper gives the example of IBRION and POTIM. IBRION controls the ionic relaxation algorithm, while the meaning of POTIM depends on that algorithm. Setting either tag in isolation is not enough. The combination must be scientifically and computationally coherent.
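The dependence can be made explicit with a small lookup. The mapping below is a simplified reading of the VASP documentation, not an authoritative table, and both function names are hypothetical. Values of IBRION not listed are left to the documentation.

```python
from typing import Dict, List


def potim_role(ibrion: int) -> str:
    """Describe what POTIM means for a given IBRION (simplified)."""
    if ibrion == 0:
        return "molecular dynamics time step (fs)"
    if ibrion in (1, 2, 3):
        return "step-size scaling for ionic relaxation"
    if ibrion in (5, 6):
        return "finite-difference displacement (Angstrom)"
    return "not covered by this sketch; consult the VASP documentation"


def check_ionic_settings(tags: Dict[str, float]) -> List[str]:
    """Flag combinations where POTIM and IBRION do not fit together."""
    issues: List[str] = []
    if "IBRION" not in tags:
        if "POTIM" in tags:
            issues.append("POTIM is set but IBRION is not, so its meaning is ambiguous")
        return issues
    ibrion = int(tags["IBRION"])
    if ibrion == 0 and "POTIM" not in tags:
        issues.append("IBRION = 0 needs an explicit POTIM time step")
    if ibrion == -1 and "POTIM" in tags:
        issues.append("IBRION = -1 does not move ions, so POTIM has no effect")
    return issues


print(potim_role(2))
print(check_ionic_settings({"IBRION": 0, "NSW": 1000}))
print(check_ionic_settings({"IBRION": 2, "POTIM": 0.5}))
```

The same number, say 0.5, is a reasonable relaxation step scale and a very different thing as a time step. A validator has to read the pair, not the tag.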

Third, LLMs can fail to manage workflow context across steps. In transition-state calculations using NEB, the paper notes that if cell relaxation with ISIF=3 is used in the initial or final structural relaxation, the resulting inconsistent cells may cause the later NEB interpolation to fail.
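A guard for this third failure mode is to compare the endpoint cells before interpolating. The sketch below assumes a simple POSCAR layout (comment line, one positive scaling factor, three lattice-vector lines); the function names, tolerance, and sample cells are illustrative assumptions.

```python
from typing import List

Lattice = List[List[float]]


def read_lattice(poscar_text: str) -> Lattice:
    """Read the scaled lattice vectors from lines 3 to 5 of a POSCAR."""
    lines = poscar_text.splitlines()
    scale = float(lines[1].split()[0])  # assumes a single positive factor
    return [
        [scale * float(x) for x in lines[i].split()[:3]]
        for i in (2, 3, 4)
    ]


def cells_match(a: Lattice, b: Lattice, tol: float = 1e-4) -> bool:
    """True when every lattice component agrees within tol (Angstrom)."""
    return all(
        abs(x - y) <= tol
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


# Invented endpoint cells: the final state was relaxed with a changing cell.
initial = "initial state\n1.0\n4.0 0.0 0.0\n0.0 4.0 0.0\n0.0 0.0 20.0\n"
final = "final state\n1.0\n4.05 0.0 0.0\n0.0 4.05 0.0\n0.0 0.0 20.0\n"

if not cells_match(read_lattice(initial), read_lattice(final)):
    print("endpoint cells differ; do not start NEB interpolation")
```

The check belongs to the workflow, not to the model: it inspects what the previous step actually produced instead of trusting what the model remembers setting.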

These failures are not random hallucinations in the casual chatbot sense. They are state, dependency, and protocol failures.

That matters for business deployment because many enterprise processes have the same shape:

Materials-computation failure	Enterprise analogue
Missing required simulation tag	Missing required regulatory field
Incompatible parameter combination	Conflicting contract, accounting, or risk assumptions
Bad state passed from one step to the next	Workflow automation built on stale or inconsistent intermediate data
Completion without scientific accuracy	Process automation that closes tickets but creates hidden downstream errors

The paper is about VASP and DFT, but the failure taxonomy travels well. In serious workflows, the agent’s job is not only to answer. It must preserve valid state, respect interdependencies, and know when completion is not enough.

Open-source models become more usable when the workflow carries expertise

One of the paper’s more business-relevant experiments compares open-source and proprietary models.

Without the agent framework, proprietary models show a clear advantage. The paper’s group-level plots report open-source models at 58.59% completion without agent support versus 78.67% for proprietary models. With agent support, open-source models rise to 92.88%, while proprietary models reach 99.58%.

Accuracy follows the same pattern. Open-source models improve from 37.73% to 64.84% with the agent. Proprietary models improve from 48.73% to 76.45%.

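The conclusion is not “open-source is now equal.” It is more practical than that: workflow structure narrows the completion gap from about 20 percentage points to under 7, even though the accuracy gap stays at roughly 11 points.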
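The gap arithmetic can be checked directly from the group-level figures quoted above. The dictionary layout and the `gap` helper are just a convenient way to organize the numbers.

```python
# Group-level figures quoted above, in percent: (without agent, with agent).
completion = {"open_source": (58.59, 92.88), "proprietary": (78.67, 99.58)}
accuracy = {"open_source": (37.73, 64.84), "proprietary": (48.73, 76.45)}


def gap(metric: dict, index: int) -> float:
    """Proprietary minus open-source, in percentage points."""
    return round(metric["proprietary"][index] - metric["open_source"][index], 2)


print("completion gap:", gap(completion, 0), "->", gap(completion, 1))
print("accuracy gap:  ", gap(accuracy, 0), "->", gap(accuracy, 1))
```

The completion gap shrinks sharply while the accuracy gap barely moves. The scaffold makes open-source models run; it does not by itself make them as accurate as the proprietary group.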

For organizations handling sensitive R&D data, that matters. A locally deployable model wrapped in a strong domain workflow may be more attractive than sending proprietary structures, simulation setups, or experimental hypotheses to an external API. The paper itself points toward secure local deployment as a benefit of stabilizing open-source models through agentic structure.

Still, the business interpretation needs discipline. Open-source deployment is not automatically cheaper or safer. Local infrastructure, expert workflow design, validation, model monitoring, and compute management all have costs. The paper shows that agentic scaffolding can make open-source models more viable. It does not show that every firm should immediately migrate scientific computing to local agents and declare victory over vendors. Please do not print that on a conference banner.

Reasoning models help, but scaffolding still does the heavy lifting

The paper also compares reasoning-capable models with standard models. The group-level results again support the architecture story.

Reasoning models improve from 78.74% to 99.79% completion with agent support, while non-reasoning models improve from 61.21% to 98.72%. For result accuracy, reasoning models rise from 47.58% to 77.16%, while non-reasoning models rise from 39.17% to 66.15%.

Reasoning helps. That should not surprise anyone. Multi-step simulations require planning, contextual memory, and sensitivity to interdependent parameters. A model better at long-range reasoning should perform better.

But the agent effect is larger than the gap between the two model groups. Both groups benefit because the framework constrains the problem. It decomposes tasks, preserves context, enforces output formats, executes commands, and extracts results. The LLM is important, but it is not the only source of intelligence. Some of the intelligence is encoded in the workflow library. Some is in the parser. Some is in the benchmark. Some is in the decision to measure completion and accuracy separately.
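The size comparison is simple arithmetic on the accuracy figures quoted above. The variable names are only for organization.

```python
# Result accuracy in percent, as quoted above: (without agent, with agent).
accuracy = {"reasoning": (47.58, 77.16), "non_reasoning": (39.17, 66.15)}

# Gain from adding the agent, per model group, in percentage points.
agent_gain = {
    group: round(after - before, 2)
    for group, (before, after) in accuracy.items()
}

# Lead of reasoning models over non-reasoning models, in percentage points.
reasoning_lead = {
    "without_agent": round(accuracy["reasoning"][0] - accuracy["non_reasoning"][0], 2),
    "with_agent": round(accuracy["reasoning"][1] - accuracy["non_reasoning"][1], 2),
}

print(agent_gain)
print(reasoning_lead)
```

Adding the agent is worth roughly 27 to 30 points of accuracy for either group. Switching from a non-reasoning to a reasoning model is worth about 8 to 11.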

This is a healthier mental model for AI deployment. Instead of asking, “Which model is smartest?”, ask:

System layer	Practical question
Workflow library	Have experts encoded the right procedure?
Prompt hierarchy	Does the model receive the right context at the right step?
Execution layer	Can the system run real tools and capture failures?
Validation layer	Does it measure scientific correctness, not just task completion?
Governance layer	Does a human review the cases where automation is structurally weak?

That last row is where many agent projects quietly go to die. They confuse “the model can call tools” with “the system is governed.”

The business value is throughput with auditability, not scientist replacement

The natural lazy headline is that AI agents are getting closer to autonomous science. True, but too broad to be useful.

The more actionable interpretation is narrower: R&D teams can encode expert computational procedures into auditable agent workflows, then use LLMs to handle variable parameter generation and procedural coordination inside those guardrails.

That has several practical implications.

First, automation value is strongest where workflows are frequent, expensive, and protocol-driven. Structural relaxation, band structure, and adsorption energy calculations fit this pattern better than open-ended scientific reasoning. The agent can reduce repetitive labor and improve throughput because the tasks have known procedural skeletons.

Second, the framework can support reproducibility. A workflow library, modular execution components, and explicit parsing scripts create a trail of what was selected, generated, run, and extracted. That is very different from asking a model to produce a final answer and hoping its hidden reasoning was charming.

Third, benchmark design becomes part of product design. If a company builds an internal scientific agent, it should not only demo five impressive cases. It should maintain task suites that distinguish completion, numerical accuracy, convergence quality, and failure type. A system that cannot tell the difference between “ran” and “right” is not ready for expensive science.
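An internal task suite along these lines might report outcomes by type rather than as one score. The sketch below is hypothetical: the `Outcome` categories loosely mirror the failure modes discussed earlier, and the sample run is invented.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class Outcome(Enum):
    CORRECT = "ran and within tolerance"
    INACCURATE = "ran but outside tolerance"
    NOT_CONVERGED = "solver did not converge"
    INVALID_TAG = "missing or non-existent tag"
    INCOMPATIBLE_TAGS = "incoherent tag combination"
    INCONSISTENT_STATE = "bad state passed between steps"


def summarize(outcomes: Iterable[Outcome]) -> Dict[str, object]:
    """Report completion, accuracy, and a breakdown of everything else."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    completed = counts[Outcome.CORRECT] + counts[Outcome.INACCURATE]
    return {
        "completion": completed / total,
        "accuracy": counts[Outcome.CORRECT] / total,
        "failures": {
            kind.name: n for kind, n in counts.items()
            if kind is not Outcome.CORRECT
        },
    }


# Invented run of ten tasks, for illustration only.
run = (
    [Outcome.CORRECT] * 6
    + [Outcome.INACCURATE] * 2
    + [Outcome.NOT_CONVERGED, Outcome.INVALID_TAG]
)
print(summarize(run))
```

A breakdown like this tells the team where to invest next: a validator for tags, a state check between steps, or a human review gate for results that run but miss.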

Fourth, local deployment becomes more plausible when the agent carries part of the domain expertise. In the paper’s group-level results, open-source models wrapped in the agent framework reach 92.88% completion. For firms concerned about confidential materials structures, proprietary catalysts, or unpublished R&D pipelines, this is not a footnote. It is a deployment pathway.

Here is the clean separation:

Level	What the paper directly shows	What Cognaptus infers for business use	What remains uncertain
Technical result	Agent support improves completion and accuracy across six LLMs on VASP-based benchmark tasks	Agentic design should focus on workflow encoding, not only model selection	Whether the same gains hold across other scientific software stacks
Operational value	Routine tasks become more reliably executable	R&D teams can reduce repetitive setup and execution overhead	Actual ROI depends on compute cost, human review cost, and failure recovery
Model strategy	Open-source models improve substantially with agent scaffolding	Local deployment becomes more credible for sensitive workflows	Local systems still require infrastructure, monitoring, and expert maintenance
Scientific boundary	Transition-state accuracy remains difficult	Human review gates are still required for complex reaction-path work	Fully autonomous discovery remains unproven

This is not less exciting than “AI replaces researchers.” It is more useful.

Transition-state work is the warning label

The transition-state results should shape how this paper is read.

Transition-state calculations are not just another row in the benchmark table. They require locating saddle points and estimating activation barriers, often through NEB-based workflows. They are sensitive to initialization, interpolation paths, convergence behavior, and intermediate geometry choices. The paper’s own analysis notes that even when explicit runtime errors disappear, transition-state outputs can still fail because of poor convergence, numerical errors from surface reconstruction, or divergent optimization from imprecise interpolation.

That is the difference between workflow automation and scientific judgment.

The agent can execute the ritual. It may still need a scientist to decide whether the ritual produced a valid result.

For enterprise AI, this is the useful pattern: automate the parts where procedure is stable, add validation where outputs are measurable, and preserve expert review where correctness depends on subtle domain interpretation. The failure is not needing humans. The failure is pretending you do not.

Boundaries: this is not yet autonomous scientific discovery

The paper positions the work as a step toward fully automated scientific discovery. That is fair, but the boundary should be kept visible.

The study is demonstrated in VASP-based first-principles materials computation. Its strength comes from domain-specific workflows, expert-curated benchmark data, structured inputs, and measurable outputs. Those are advantages, not limitations to be embarrassed about. They are exactly why the system works.

But they also define the boundary.

The paper does not prove that a general LLM can autonomously conduct open-ended scientific discovery. It does not remove the need for expert workflow design. It does not solve transition-state accuracy. It does not validate physical-lab synthesis or experimental confirmation. It does not show that an agent can formulate the right research question, choose the right computational campaign, and close the loop from hypothesis to real-world material deployment without expert oversight.

What it does show is more concrete: if scientific work can be decomposed into validated computational workflows, an agent can substantially improve execution reliability and result accuracy compared with standalone LLM use.

That is already enough.

The real lesson: stop asking the model to be the system

The paper’s most durable contribution is architectural. It shows that reliable scientific automation is not produced by letting an LLM “think harder” in a blank text box. It comes from putting the model inside a system that knows the domain’s procedures, tools, files, states, metrics, and failure modes.

The agent stops the LLM from guessing in public. It makes the model calculate inside a controlled workflow.

That is the broader lesson for any organization building serious AI agents. Do not start with the fantasy of autonomy. Start with the anatomy of the work: the required inputs, the valid procedures, the hidden dependencies, the execution environment, the measurable outputs, and the cases where completion is not correctness.

Then build the agent around that.

The future of useful AI in science may be less like a genius in a chat window and more like a disciplined lab operator who knows when to run the protocol, when to parse the output, and when to call the actual scientist back into the room.

Less glamorous, perhaps. Much less likely to waste a 72-hour computation because it hallucinated a tag.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zeyu Xia et al., “An Agentic Framework for Autonomous Materials Computation,” arXiv:2512.19458, 2025, https://arxiv.org/html/2512.19458.
