Think Like a Scientist: When LLMs Stop Guessing and Start Reasoning

Factory dashboards are full of curves. Temperature curves, vibration curves, pressure curves, yield curves, defect curves. Most AI systems are happy to predict the next point on the curve and call it intelligence. Useful, yes. Scientific, not quite.

Engineers often want something more stubbornly old-fashioned: an equation. Not because equations look elegant in a slide deck, although they do help meetings feel temporarily civilized. They want equations because equations can be inspected, simulated, challenged, simplified, embedded into control systems, and argued over by humans who still prefer causes to vibes.

That is the promise of symbolic regression: given observations, recover a compact mathematical expression that explains the relationship behind the data. The catch is that symbolic regression has always been a search problem wearing a lab coat. Choose the wrong operators, function library, regularization strength, or structural assumptions, and the true equation either disappears from the search space or drowns inside it.

The paper “Think like a Scientist: Physics-guided LLM Agent for Equation Discovery” introduces KeplerAgent, a physics-guided LLM agent that tries to address this bottleneck by changing the role of the language model.¹ The LLM is not asked to hallucinate the final law from raw data. Instead, it coordinates scientific tools, extracts structural clues, and uses those clues to configure symbolic regression engines such as PySINDy and PySR.

That distinction matters. The paper is not mainly saying, “LLMs are better at guessing equations now.” That would be the obvious, slightly boring reading. The sharper claim is this: LLMs become more useful when they act as coordinators of constrained search. They help decide how to search before letting the solver search.

In other words, the model is less Newton under an apple tree and more a reasonably competent research assistant who knows which instrument to pick up next. Given the current state of AI agents, that is already ambitious enough.

Equation discovery fails when the search space is politely allowed to explode

Symbolic regression sounds simple when written as a product demo:

Data in. Equation out.

The actual workflow is less magical. A symbolic regression system needs a hypothesis space: allowed variables, operators, polynomial degree, derivative terms, functional forms, sparsity assumptions, and stopping criteria. Every design choice matters.

A small hypothesis space is efficient but brittle. If the true equation requires a rational term, a sine function, or a derivative term that the library does not include, the method cannot discover it. A large hypothesis space is generous but expensive. It may contain the truth, along with an impressive number of mathematically legal distractions. Search then becomes slow, unstable, and occasionally creative in the worst possible way.
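
To make the brittleness concrete, here is a minimal sketch in plain Python. The data-generating law, the two candidate libraries, and the helper `fit_two_terms` are all invented for illustration; none of them come from the paper. The same least-squares fit is run twice: once with a library that lacks the sine term, once with a library that contains it.

```python
import math

def fit_two_terms(f1, f2, xs, ys):
    """Least-squares fit of y ~ c1*f1(x) + c2*f2(x) via the 2x2 normal equations."""
    a11 = sum(f1(x) * f1(x) for x in xs)
    a12 = sum(f1(x) * f2(x) for x in xs)
    a22 = sum(f2(x) * f2(x) for x in xs)
    b1 = sum(f1(x) * y for x, y in zip(xs, ys))
    b2 = sum(f2(x) * y for x, y in zip(xs, ys))
    det = a11 * a22 - a12 * a12
    c1 = (b1 * a22 - a12 * b2) / det  # Cramer's rule
    c2 = (a11 * b2 - a12 * b1) / det
    mse = sum((c1 * f1(x) + c2 * f2(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return c1, c2, mse

# Hypothetical ground truth: y = 0.5*x + 2*sin(x), sampled without noise.
xs = [0.1 * i for i in range(61)]
ys = [0.5 * x + 2.0 * math.sin(x) for x in xs]

# Library {x, x^2}: the sine term is not in the search space at all.
narrow = fit_two_terms(lambda x: x, lambda x: x * x, xs, ys)

# Library {x, sin(x)}: the true structure is representable.
wider = fit_two_terms(lambda x: x, math.sin, xs, ys)

print("narrow library (c1, c2, mse):", narrow)
print("wider library  (c1, c2, mse):", wider)
```

The narrow library still returns coefficients, which is the uncomfortable part: nothing in the output announces that the right term was never allowed.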

A useful mental model is:

$$ H = \{\text{expressions buildable from variables, operators, and constraints}\} $$

If the operator set is large and the expression depth increases, the number of possible expressions grows combinatorially: each extra level of nesting multiplies the candidates rather than adding to them. Symbolic regression is not merely trying to fit numbers. It is trying to choose a language in which the system is allowed to speak.
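
A rough way to feel the growth is to count expression trees directly. The sketch below counts syntactic trees only, so algebraically equivalent expressions are counted separately, and the operator counts are arbitrary assumptions chosen for illustration.

```python
def count_expressions(n_leaves, n_unary, n_binary, max_depth):
    """Count syntactic expression trees of depth <= max_depth.

    A tree is a leaf, a unary operator applied to a shallower tree,
    or a binary operator applied to two shallower trees.
    """
    count = n_leaves  # depth 0: variables and constants only
    for _ in range(max_depth):
        count = n_leaves + n_unary * count + n_binary * count * count
    return count

# Assumed toy grammar: 3 leaves, 2 unary operators, 4 binary operators.
for depth in range(4):
    print(depth, count_expressions(3, 2, 4, depth))
# 0 3
# 1 45
# 2 8193
# 3 268517385
```

Three levels of nesting already produce hundreds of millions of candidates from a very small grammar, which is why deciding what to leave out is most of the job.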

Human scientists rarely attack this problem by brute force. They inspect plots. They notice oscillation, saturation, symmetry, conservation, separability, scale behavior, or known physical structure. Then they narrow the candidate equation family before fitting anything seriously.

The paper’s critique of many LLM-based symbolic regression systems is that they still behave too directly:

$$ \text{data} \rightarrow \text{candidate equation} \rightarrow \text{score} \rightarrow \text{repeat} $$

That is automation, but not much scientific reasoning. KeplerAgent instead follows a slower-looking but smarter workflow:

$$ \text{data} \rightarrow \text{structural diagnosis} \rightarrow \text{constrained search} \rightarrow \text{equation} \rightarrow \text{simulation check} $$

The middle step is the point. Without it, the LLM is just another generator of plausible-looking expressions. With it, the LLM becomes a search-space manager.

KeplerAgent’s real job is hypothesis pruning, not equation poetry

KeplerAgent is built as a ReAct-style tool-using agent. It receives the dataset, tool specifications, a workspace listing, and an experience log. At each step, the model decides whether to analyze the data, inspect generated plots, discover symmetry, run PySINDy, run PySR, or stop.
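
The control flow of such a loop can be sketched in a few lines. Everything here is a stand-in: the tools are stubs, the error values are made up, and `scripted_policy` is a hard-coded rule where the real system would consult an LLM. It is meant to show the shape of the loop, not the paper's implementation.

```python
def analyze_data(ws):
    ws["analyzed"] = True
    return "summary statistics written to workspace"

def discover_symmetry(ws):
    ws["symmetry"] = "rotation"
    return "approximate rotation generator found"

def fit_sparse_model(ws):
    # Stub solver: the error values are invented for illustration.
    ws["error"] = 0.15 if "symmetry" in ws else 0.70
    return f"fit finished with error {ws['error']:.2f}"

TOOLS = {
    "analyze_data": analyze_data,
    "discover_symmetry": discover_symmetry,
    "fit_sparse_model": fit_sparse_model,
}

def scripted_policy(ws, log):
    """Hard-coded stand-in for the LLM's next-action decision."""
    if "analyzed" not in ws:
        return "analyze_data"
    if "error" not in ws:
        return "fit_sparse_model"
    if ws["error"] > 0.2 and "symmetry" not in ws:
        return "discover_symmetry"
    if ws["error"] > 0.2:
        return "fit_sparse_model"
    return "stop"

def run_agent(policy, tools, workspace, max_steps=6):
    """Observe, choose a tool, act, append to the experience log, repeat."""
    log = []
    for _ in range(max_steps):
        action = policy(workspace, log)
        if action == "stop":
            break
        observation = tools[action](workspace)
        log.append((action, observation))
    return log

workspace = {}
experience_log = run_agent(scripted_policy, TOOLS, workspace)
for action, observation in experience_log:
    print(f"{action}: {observation}")
```

The `max_steps` cap is a design choice worth keeping in any real version: a loop that can call tools should also have a reason to stop.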

The architecture is deliberately modular:

| Component | What it does | Why it matters |
| --- | --- | --- |
| Python interpreter | Performs exploratory data analysis and creates plots or summaries | Gives the agent initial evidence about data shape, noise, and variable behavior |
| Visual subagent | Interprets scientific plots | Converts visual patterns into possible functional forms |
| Symmetry discovery tool | Estimates Lie symmetries in differential equation systems | Turns physical structure into constraints |
| PySINDy | Performs sparse identification of dynamical systems | Efficient when the true equation is sparse in a suitable library |
| PySR | Performs genetic-programming-based symbolic regression | Handles more flexible expression structures but needs careful configuration |
| Workspace and experience log | Stores intermediate files, results, and tool history | Prevents the agent from acting like every step is its first cup of coffee |

The mechanism is not that the LLM “knows physics” in some mystical sense. The mechanism is that it uses partial evidence to restrict the solver’s choices. For PySINDy, that may mean selecting polynomial degree, derivative order, normalization, threshold, or symmetry constraints. For PySR, it may mean proposing a template expression and limiting operators or nesting.
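
One way to picture that restriction is as a function from clues to a configuration object. The sketch below is hypothetical throughout: `SearchConfig`, its fields, the clue names, and the rules in `configure` are invented for illustration and do not mirror the actual PySINDy or PySR parameters.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class SearchConfig:
    """Hypothetical solver settings; field names are illustrative only."""
    backend: str = "sparse_regression"
    polynomial_degree: int = 3
    unary_operators: List[str] = field(default_factory=list)
    symmetry: Optional[str] = None

def configure(clues: Set[str]) -> SearchConfig:
    """Turn soft observations into hard search-space settings."""
    config = SearchConfig()
    if "oscillation" in clues:
        # Periodic structure needs operators a polynomial library lacks.
        config.backend = "genetic_programming"
        config.unary_operators += ["sin", "cos"]
    if "rotation_symmetry" in clues:
        # A symmetry becomes a constraint on the admissible terms.
        config.symmetry = "rotation"
    if "smooth_low_order" in clues:
        # Less evidence of complexity means a smaller library.
        config.polynomial_degree = 2
    return config

print(configure({"oscillation"}))
print(configure({"rotation_symmetry", "smooth_low_order"}))
```

The point of writing it this way is auditability: every narrowing of the search space is traceable to a named clue.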

This is a practical design lesson. In many enterprise AI systems, the most valuable thing an LLM can do is not produce the final answer. It is to choose the right tool, translate messy context into tool parameters, read the intermediate output, and decide what should happen next.

That is less glamorous than “autonomous discovery.” It is also much closer to how useful automation actually works.

The strongest mechanism is the conversion of soft clues into hard constraints

The paper’s most important move is translation.

A visual pattern is soft evidence. A symmetry estimate is soft evidence. A hypothesis about separability is soft evidence. KeplerAgent turns those into hard configuration choices: use this operator family, try this template, constrain this search space, rerun SINDy with this symmetry generator.

This matters because symbolic regression backends are powerful but obedient. PySR and PySINDy do not magically know which hypothesis space is scientifically reasonable. They search what they are given. Feed them too broad a space and they burn compute. Feed them too narrow a space and they miss the truth with perfect discipline.

KeplerAgent’s role is therefore not to replace symbolic regression. It wraps symbolic regression with a reasoning loop.

A simplified version of the mechanism looks like this:

| Agent observation | Translation step | Solver consequence |
| --- | --- | --- |
| Trajectory suggests oscillation | Include trigonometric candidates or templates | Search can represent periodic structure |
| Dynamics show coupled variables | Treat equations as a system, not isolated scalar tasks | Search can exploit relationships across equations |
| Symmetry tool returns approximate generator | Interpret it as a known exact symmetry when justified | PySINDy can search within an equivariant subspace |
| Previous tool call fails with high error | Adjust library, constraints, or solver choice | Search becomes iterative rather than one-shot |

This is why the paper’s mechanism-first reading is stronger than a leaderboard reading. The benchmark numbers are useful, but the architecture explains why the numbers move.

The algebraic benchmark shows efficiency; the dynamical benchmark shows the thesis

The paper evaluates KeplerAgent on two regimes: LSR-Transform from LLM-SRBench and a custom DiffEq benchmark of coupled ODE/PDE systems.

LSR-Transform is important because it reduces memorization. It transforms classical physics equations by changing which variable is treated as the target, so a model cannot simply recall a famous formula in its standard textbook form. The benchmark tests whether the method can rediscover the structure under a less familiar presentation.
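
The idea of changing the target variable can be illustrated with a familiar law. This sketch only demonstrates the concept; it makes no claim about how the benchmark itself generates its transformed problems, and the sample values are arbitrary.

```python
import math

G = 6.674e-11  # gravitational constant, SI units

def gravitational_force(m1, m2, r):
    """Textbook form: the force is the target."""
    return G * m1 * m2 / r ** 2

def distance_law(m1, m2, force):
    """Same physics, solved for r: a less familiar target."""
    return math.sqrt(G * m1 * m2 / force)

# Arbitrary sample points for illustration.
samples = [(5.0e3, 2.0e3, 10.0), (1.0e4, 7.5e2, 3.0), (3.0e2, 9.0e3, 42.0)]
original_rows = [(m1, m2, r, gravitational_force(m1, m2, r)) for m1, m2, r in samples]

def retarget_to_distance(row):
    """Re-label the same observations so that r becomes the target."""
    m1, m2, r, force = row
    return (m1, m2, force), r

transformed_rows = [retarget_to_distance(row) for row in original_rows]

for inputs, target in transformed_rows:
    print(inputs, "->", target, "recovered:", distance_law(*inputs))
```

The data are identical, but the expression to be discovered now contains a square root and a division by the force, which is not the form anyone memorizes.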

On LSR-Transform, the reported results are mixed but informative:

| Method | Symbolic accuracy | Reported average NMSE | Runtime | Tokens |
| --- | --- | --- | --- | --- |
| PySR | 37.84% | 2.82 | 2440s | n/a |
| LLM-SR | 31.53% | 0.09 | 2118s | 209k |
| KeplerAgent @1 | 35.14% | 1.50 | 238s | 42k |
| KeplerAgent @3 | 42.34% | 1.21 | 698s | 125k |

The obvious temptation is to declare victory because KeplerAgent @3 reaches the highest symbolic accuracy. That is true, but the more careful interpretation is better.

Single-run KeplerAgent does not lead on symbolic accuracy: KeplerAgent @1 reaches 35.14% against PySR's 37.84%, although PySR needs roughly ten times the runtime (2440s versus 238s). LLM-SR has a much lower average NMSE (0.09), but the paper argues that this may reflect optimization toward numerical fit rather than exact symbolic recovery. For symbolic regression, a low-error approximation is useful, but it is not the same as recovering the governing form.

So the LSR-Transform result supports a moderate claim: KeplerAgent can improve symbolic recovery under modest parallel attempts while using substantially less runtime and fewer tokens than LLM-SR. It is evidence for efficiency and orchestration. It is not yet the most dramatic proof of “scientific reasoning.”
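
The gap between numerical fit and symbolic recovery is easy to reproduce on a toy problem. In the sketch below, the "true law" and the polynomial surrogate are invented for illustration, and `nmse` uses one common definition (squared error divided by the target's variance), which may differ from the normalization used in the paper.

```python
import math

def nmse(pred, target):
    """Squared error normalized by the target's variance (one common definition)."""
    mean = sum(target) / len(target)
    error = sum((p - t) ** 2 for p, t in zip(pred, target))
    spread = sum((t - mean) ** 2 for t in target)
    return error / spread

def true_law(x):
    return math.sin(x)

def polynomial_surrogate(x):
    # Fits well near zero but has the wrong symbolic form.
    return x - x ** 3 / 6.0

def grid(lo, hi, n):
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

train = grid(-1.0, 1.0, 41)
wide = grid(-4.0, 4.0, 41)

nmse_train = nmse([polynomial_surrogate(x) for x in train], [true_law(x) for x in train])
nmse_wide = nmse([polynomial_surrogate(x) for x in wide], [true_law(x) for x in wide])

print("NMSE on the fitted range:", nmse_train)
print("NMSE on a wider range:   ", nmse_wide)
```

On the fitted range the surrogate looks excellent. Outside it, the error exceeds the variance of the signal, which is what a wrong functional form tends to do once you leave the data that produced it.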

The DiffEq benchmark is where the paper’s central thesis becomes much more convincing.

The authors construct 10 systems governed by ordinary and partial differential equations, each with two dependent variables and two equations to recover. This matters because coupled dynamical systems contain exactly the kind of structure—symmetry, interaction, derivative terms, long-horizon behavior—that brute-force equation guessing handles poorly.

The results are sharper:

| DiffEq metric | PySR | LLM-SR | KeplerAgent |
| --- | --- | --- | --- |
| Symbolic accuracy, clean data | 40% | 30% | 75% |
| Symbolic accuracy, noisy data | 15% | 10% | 45% |
| Pointwise NMSE, clean data | 0.16 | 0.26 | 0.04 |
| Pointwise NMSE, noisy data | 5.89 | 4.80 | 0.15 |
| Long-term NMSE, clean data | 1.56 | 2.18 | 1.65 |
| Long-term NMSE, noisy data | 2.80 | 2.62 | 0.33 |
| Runtime, clean data | 119s | 3648s | 120s |
| Runtime, noisy data | 120s | 4048s | 147s |
| Tokens, clean data | n/a | 182k | 23k |
| Tokens, noisy data | n/a | 184k | 30k |

The noisy-data result is the commercial signal hiding inside the academic table. Clean benchmarks are useful. Noisy benchmarks are closer to industrial reality, where sensors drift, derivatives are estimated, sampling is uneven, and the universe refuses to format itself as a CSV for graduate students.

On noisy DiffEq systems, KeplerAgent reaches 45% symbolic accuracy, compared with 15% for PySR and 10% for LLM-SR. Its pointwise NMSE is also far lower, at 0.15 against 5.89 and 4.80. The long-term simulation metric is especially important: it tests whether discovered equations remain useful when integrated forward, not just whether they match a one-step derivative target.

That distinction matters for digital twins, robotics, process control, and engineering simulation. A model that fits one-step data but explodes during rollout is not a model. It is a delayed failure with a neat equation attached.
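
A small simulation shows how this happens. The sketch assumes a harmonic oscillator as the true system and a candidate model with one small spurious term; both systems, the step size, and the horizon are chosen purely for illustration, and `nmse` again uses a variance normalization that may not match the paper's.

```python
def rk4_rollout(deriv, state, dt, steps):
    """Integrate a first-order system forward with classical Runge-Kutta."""
    trajectory = [state]
    for _ in range(steps):
        k1 = deriv(state)
        k2 = deriv(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
        k3 = deriv(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
        k4 = deriv(tuple(s + dt * k for s, k in zip(state, k3)))
        state = tuple(
            s + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
        trajectory.append(state)
    return trajectory

def true_deriv(state):
    x, v = state
    return (v, -x)  # undamped oscillator

def candidate_deriv(state):
    x, v = state
    return (v, -x + 0.05 * v)  # small spurious term acting as negative damping

def nmse(pred, target):
    mean = sum(target) / len(target)
    error = sum((p - t) ** 2 for p, t in zip(pred, target))
    return error / sum((t - mean) ** 2 for t in target)

truth = rk4_rollout(true_deriv, (1.0, 0.0), dt=0.01, steps=6000)

# Pointwise check: compare predicted and true derivatives on the true trajectory.
pointwise = nmse(
    [candidate_deriv(s)[1] for s in truth],
    [true_deriv(s)[1] for s in truth],
)

# Rollout check: integrate the candidate on its own and compare positions.
rollout = rk4_rollout(candidate_deriv, (1.0, 0.0), dt=0.01, steps=6000)
long_term = nmse([s[0] for s in rollout], [s[0] for s in truth])

print("pointwise derivative NMSE:", pointwise)
print("long-horizon rollout NMSE:", long_term)
```

The candidate passes the one-step test comfortably and fails the rollout, because a small error in the derivative compounds into a growing amplitude over time.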

The appendix trace is the paper’s most revealing case study

The reaction-diffusion trace in the appendix is more than a decorative example. It shows how the agent behaves when the problem is difficult.

The sequence is instructive:

1. KeplerAgent first runs PySINDy with default parameters on the reaction-diffusion data. The result has a high MAPE of 70.033%.
2. It then calls the symmetry discovery tool. The tool returns a Lie generator matrix close to a rotation generator.
3. The agent interprets the approximate numerical result as exact rotational symmetry and reruns SINDy with symmetry constraints.
4. The resulting equations improve substantially, with MAPE falling to 15.584%.
5. Further changes, such as increasing polynomial degree, do not improve the result.

This is the mechanism in miniature. The symmetry tool does not directly discover the final equation. It produces an intermediate structural clue. The LLM interprets the clue, cleans it into a physically meaningful exact form, and passes it into a downstream solver as a constraint.
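
The "cleaning" step can be sketched for the simplest case. The matrix values below are made up, the tolerance is arbitrary, and the projection applies only to linear two-variable dynamics; this illustrates the idea of snapping a noisy generator to an exact one and is not how the paper's symmetry tool or PySINDy imposes constraints.

```python
import math

ROTATION = ((0.0, -1.0), (1.0, 0.0))  # exact generator of planar rotations

def frobenius(matrix):
    return math.sqrt(sum(v * v for row in matrix for v in row))

def snap_to_rotation(generator, tol=0.1):
    """Return the exact rotation generator if the estimate is close to it.

    A generator's scale and sign are arbitrary, so the estimate is
    normalized and compared against both +ROTATION and -ROTATION.
    """
    norm = frobenius(generator)
    if norm == 0.0:
        return None
    scale = norm / frobenius(ROTATION)
    for sign in (1.0, -1.0):
        diff = [
            [g / scale - sign * r for g, r in zip(g_row, r_row)]
            for g_row, r_row in zip(generator, ROTATION)
        ]
        if frobenius(diff) < tol:
            return ROTATION
    return None

def project_equivariant(a):
    """Project a 2x2 linear model onto matrices that commute with ROTATION.

    Those matrices have the form [[p, -q], [q, p]]: two free
    coefficients instead of four.
    """
    p = (a[0][0] + a[1][1]) / 2.0
    q = (a[1][0] - a[0][1]) / 2.0
    return ((p, -q), (q, p))

noisy_estimate = ((0.03, -0.98), (1.02, -0.02))  # hypothetical tool output
print(snap_to_rotation(noisy_estimate))
print(snap_to_rotation(((1.0, 0.0), (0.0, 1.0))))  # a scaling generator: None
print(project_equivariant(((1.0, -2.0), (2.2, 0.8))))
```

The projection is where the payoff appears: once the symmetry is treated as exact, half the coefficients stop being free, and the solver has less room to fit noise.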

That is the paper’s best argument for agentic scientific reasoning. The LLM is not merely sampling formulas. It is bridging between numerical diagnostics and symbolic modeling choices.

But the same trace also exposes the system’s weakness. After the useful symmetry-guided step, the agent continues trying repetitive SINDy variations even after evidence suggests the low-degree library already captured what it could. The authors note that the agent should have used better noise analysis or explored alternative tools rather than staying too loyal to one path.

This is a useful limitation because it is specific. The problem is not “AI is imperfect,” which is the sort of sentence that can be safely deleted from almost any article. The problem is that the agent lacks a sufficiently formal state representation of the hypothesis space and does not always learn enough from failed tool calls.

For business systems, this is exactly where orchestration quality becomes decisive. Tool use alone is not strategy. A bad agent with many tools is just a more expensive way to be confused.

The evidence stack is stronger when separated by purpose

Not every experiment in the paper is trying to prove the same thing. Mixing them together creates a bland “method beats baselines” story. Separating their purpose gives a cleaner reading.

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| LSR-Transform benchmark | Main comparison on transformed algebraic equations | KeplerAgent can improve symbolic accuracy with multiple runs and lower runtime/token use | It does not show universal superiority on numerical fit |
| DiffEq clean data | Main evidence on coupled dynamical systems | Physics-guided orchestration strongly improves symbolic recovery | It does not eliminate all long-horizon simulation issues |
| DiffEq noisy data | Robustness test under measurement noise | Structure-guided search is much more robust than baselines in this setup | It does not prove robustness across all real-world sensor regimes |
| Long-term prediction curves | Practical validation for discovered dynamics | Some discovered equations remain useful under simulation rollout | It does not turn symbolic recovery into guaranteed operational stability |
| Reaction-diffusion reasoning trace | Implementation and behavior analysis | The agent can translate discovered symmetry into solver constraints | It also reveals repetitive tool use and imperfect stopping behavior |

This distinction matters because the business meaning of the paper is not “deploy this tomorrow.” The business meaning is that agentic scientific modeling becomes more credible when the agent’s intermediate reasoning is tied to domain tools and validation checks.

The point is not autonomy. The point is controlled narrowing.

What business readers should take from KeplerAgent

For Cognaptus readers, the value of this paper is not limited to physics. Most companies are not trying to rediscover reaction-diffusion equations before lunch. Some are, but they probably have scarier problems already.

The broader design pattern is more transferable:

$$ \text{raw observations} \rightarrow \text{domain diagnostics} \rightarrow \text{constrained model search} \rightarrow \text{auditable candidate model} $$

This pattern applies to many technical domains where black-box prediction is insufficient:

| Research component | Enterprise analogue | Operational value | Boundary |
| --- | --- | --- | --- |
| Observational physical data | Sensor data, process logs, experimental measurements | Converts raw traces into candidate mechanisms | Works best where mechanistic structure actually exists |
| Symmetry and visual diagnostics | Domain-specific feature tests, invariance checks, failure-mode detectors | Narrows the model search space | Requires validated diagnostic tools |
| PySINDy / PySR backends | Specialized solvers, optimizers, simulators, statistical engines | Uses mature tools instead of forcing the LLM to solve everything | Solver configuration remains a technical skill |
| Workspace and experience log | Audit trail for agent decisions | Helps governance, reproducibility, and debugging | Logs are not proof of correctness |
| LLM orchestration | Workflow controller and translator | Reduces manual iteration in model discovery | Needs guardrails against repetitive or poorly justified tool calls |

The most immediate use case is not fully autonomous scientific discovery. That phrase has a habit of aging badly. The more realistic use case is assisted model engineering: systems that help experts generate, constrain, test, and document interpretable models faster.

In industrial modeling, the expensive part is often not fitting one model. It is the loop: inspect data, choose assumptions, configure a solver, check failure, revise assumptions, and document why the result is trustworthy enough to use. KeplerAgent automates parts of that loop.

That is not trivial. It is also not magic. Pleasantly, we can survive without magic.

Where the paper stops and business inference begins

The paper directly shows that KeplerAgent improves symbolic recovery on the authors’ selected benchmarks, especially for coupled differential equation systems and noisy DiffEq settings. It also shows that agentic tool orchestration can reduce runtime and token usage relative to LLM-SR in the reported experiments.

Cognaptus’ business inference is that LLM agents are better positioned as workflow orchestrators than as standalone answer engines in technical domains. The LLM should not be forced to carry the entire problem in its weights. It should manage tools, constraints, intermediate artifacts, and validation paths.

What remains uncertain is the cost of adapting this approach to real enterprise settings. A production version would need domain-specific diagnostic tools, robust noise handling, stronger stopping criteria, trace inspection, versioned experiments, human review, and integration with existing simulation or control pipelines.

The paper’s own discussion points in this direction. The authors note that the current toolset is small. Adding more tools could help, but it also creates context-management problems. Tool specifications can bloat the prompt. The agent may forget earlier findings or choose tools poorly. Suggested future directions include workflow graphs, subagents, retrieval over tool specifications, and a more formal state space for symbolic regression.

That last idea is particularly important. A structured state would represent the current hypothesis space, active constraints, candidate equations, failures, and remaining uncertainty. Without that, an agent may look thoughtful while merely looping through parameter variations. We have all seen meetings like this.
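
What such a structured state might look like is an open design question. The sketch below is one hypothetical shape, not something the paper specifies: `HypothesisState`, its fields, the stall rule, and the error values in the usage are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class HypothesisState:
    """Hypothetical record of what has been constrained and what has been tried."""
    active_constraints: Set[str] = field(default_factory=set)
    attempts: List[Tuple[tuple, float]] = field(default_factory=list)

    @staticmethod
    def _key(config: Dict[str, object]) -> tuple:
        # Order-independent fingerprint of a solver configuration.
        return tuple(sorted(config.items()))

    def already_tried(self, config: Dict[str, object]) -> bool:
        return any(key == self._key(config) for key, _ in self.attempts)

    def record(self, config: Dict[str, object], error: float) -> None:
        self.attempts.append((self._key(config), error))

    def stalled(self, patience: int = 2, min_gain: float = 0.01) -> bool:
        """True when the last `patience` attempts did not beat the earlier best by `min_gain`."""
        if len(self.attempts) <= patience:
            return False
        earlier_best = min(err for _, err in self.attempts[:-patience])
        recent_best = min(err for _, err in self.attempts[-patience:])
        return recent_best > earlier_best - min_gain

state = HypothesisState()
state.record({"degree": 2, "symmetry": "none"}, 0.70)      # made-up errors
state.record({"degree": 2, "symmetry": "rotation"}, 0.16)
state.active_constraints.add("rotation")
state.record({"degree": 3, "symmetry": "rotation"}, 0.16)
print("stalled after three attempts:", state.stalled())
state.record({"degree": 4, "symmetry": "rotation"}, 0.17)
print("stalled after four attempts: ", state.stalled())
print("repeat of degree 3?", state.already_tried({"symmetry": "rotation", "degree": 3}))
```

With even this much structure, "try another polynomial degree" becomes a decision the agent has to justify against its own record, and a stall becomes a signal to switch tools rather than parameters.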

The real lesson is architectural, not theatrical

The headline phrase “think like a scientist” could easily become marketing fog. In this paper, it has a more precise meaning.

Thinking like a scientist does not mean producing a confident explanation in scientific prose. It means using observations to infer structure, using structure to constrain hypotheses, using tools to test candidate models, and using failures to decide what to try next.

KeplerAgent only partially achieves this. The successful parts are genuinely interesting: symmetry discovery, template-guided PySR, workspace memory, experience logs, and solver configuration. The weaker parts are equally informative: repetitive tool calls, imperfect noise awareness, and the absence of a formal hypothesis-state representation.

For AI system design, that is the useful lesson. Bigger language models may improve direct equation guessing, but the more durable path is probably orchestration: give the model instruments, let it reason over intermediate outputs, and force it to leave an audit trail.

The future of scientific AI will not be a chatbot staring at a spreadsheet until a formula appears. It will be a workflow where models inspect, constrain, simulate, compare, and revise.

Less oracle. More lab notebook.

That is not as flashy as “AI discovers laws of nature.” It is also much more likely to be useful.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, and Rose Yu, “Think like a Scientist: Physics-guided LLM Agent for Equation Discovery,” arXiv:2602.12259, 2026. https://arxiv.org/abs/2602.12259
