Code That Thinks, Models That Don’t: What SymPyBench Reveals About LLM Scientific Reasoning

Calculator.

That is the boring object hiding inside many “AI reasoning” debates. In technical work, the uncomfortable question is not whether a language model can explain a formula with academic confidence. It is whether the model can still get the answer right after the numbers change, the wording shifts, the unit conversion becomes annoying, and no multiple-choice option politely waves from the corner saying, “Pick me.”

This is where SymPyBench becomes interesting. The paper introduces a physics benchmark built around parameterized problems, step-by-step reasoning, and executable Python code using SymPy and Pint as ground truth.¹ That sounds like benchmark plumbing. It is not. It is closer to an audit machine: the same underlying problem can be re-instantiated with different numbers, phrased in different ways, and presented in different answer formats. The model is no longer being asked, “Can you answer this one problem?” It is being asked, “Do you actually have a stable procedure?”

The distinction matters because much of enterprise AI adoption is quietly moving from text assistance toward technical workflows: engineering review, scientific search, tutoring, compliance checks, simulation support, data analysis, and operations diagnostics. In those settings, a model that is correct once but unstable across nearly identical variants is not “almost reliable.” It is a liability wearing a lab coat.

SymPyBench’s core message is therefore not that LLMs are bad at physics. That would be too easy, and frankly a little lazy. The sharper finding is that ordinary accuracy hides different failure modes: arithmetic errors, unit errors, format dependence, answer-generation failures, and cases where the model understands the symbolic structure but collapses during numerical execution. The benchmark’s value is diagnostic. It tells us not just whether the model failed, but what kind of failure we are probably looking at.

SymPyBench is a benchmark, but its real contribution is the evaluation mechanism

The paper starts from a practical weakness in many scientific reasoning benchmarks. A conventional dataset gives a model a fixed question, then scores whether the answer matches. That is convenient, scalable, and dangerously tidy. It collapses reasoning into a single endpoint.

SymPyBench changes the test object. Each problem is represented as a parameterized scientific task with variables, constants, reasoning steps, and executable Python code. The benchmark then creates controlled variants along three axes:

Variation axis	What changes	What should stay stable
Textual variation	The wording of the same problem	The physical law and solution method
Numerical variation	Input values are perturbed, typically within a controlled range	The computation procedure
Format variation	Free-form, MC-Symbolic, and MC-Numerical versions	The underlying answer logic

This is the mechanism that makes the paper useful. A single benchmark result can tell us that a model answered correctly. A dynamic benchmark can ask whether the answer survives perturbation. That is a different standard.

The paper’s construction pipeline is also worth understanding because it explains why the benchmark can support this kind of test. The authors begin with open-source physics problems, filter out problems that depend on diagrams or previous context, convert each remaining problem into a structured representation, generate parameterized templates, create textual variants, and synthesize Python functions that compute the answer. The code is then validated against the original structured outputs. Failed code is discarded.

That last step is not glamorous, but it is the hinge. The benchmark is not merely storing answers. It is storing a procedure for generating answers. This allows the same problem family to be tested across many inputs. In business terms, it moves evaluation from a static checklist toward repeatable stress testing.
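To make the idea of "storing a procedure" concrete, here is a minimal sketch of a parameterized problem and its numerical variants. The `ProblemTemplate` class, the `make_variants` helper, the cart problem, and the ±20% perturbation range are all illustrative assumptions, not the paper's actual schema or pipeline.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProblemTemplate:
    """Hypothetical parameterized problem: text template, base values, solver."""
    text: str
    base_values: Dict[str, float]
    solve: Callable[..., float]


def make_variants(template: ProblemTemplate, n: int, spread: float = 0.2,
                  seed: int = 0) -> List[dict]:
    """Perturb each input within +/- spread of its base value and recompute the answer."""
    rng = random.Random(seed)  # seeded so the same variants can be regenerated
    variants = []
    for _ in range(n):
        values = {
            name: round(base * rng.uniform(1 - spread, 1 + spread), 3)
            for name, base in template.base_values.items()
        }
        variants.append({
            "question": template.text.format(**values),
            "inputs": values,
            "answer": template.solve(**values),  # ground truth comes from code
        })
    return variants


# Illustrative problem: distance covered from rest under constant acceleration.
cart = ProblemTemplate(
    text="A cart starts from rest and accelerates at {a} m/s^2 for {t} s. "
         "How far does it travel, in metres?",
    base_values={"a": 2.0, "t": 5.0},
    solve=lambda a, t: 0.5 * a * t ** 2,
)

for variant in make_variants(cart, n=3):
    print(variant["question"], "->", round(variant["answer"], 3))
```

The point of the design is that no answer is typed in by hand: every variant's ground truth is recomputed from the same function, so a model that memorized one instance gets no credit on the next.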

The dataset contains 15,045 undergraduate-level physics problems. The domain mix is broad but uneven: mechanics accounts for 33.80%, electricity and magnetism for 26.76%, modern physics for 12.68%, waves and oscillations for 11.27%, thermodynamics for 8.45%, and optics for 7.04%. The question types are also uneven by design: 71.52% are free-form, while MC-Symbolic and MC-Numerical each make up 14.24%. The imbalance matters because free-form questions are where generation, formatting, arithmetic, and multi-step dependencies come out to play. Multiple choice, as usual, is a padded room.

Executable ground truth changes what “correct” means

The most important design choice is the use of executable Python code. For each retained problem, the benchmark has a function that can calculate the ground-truth answer for new parameter values. SymPy handles symbolic algebra and equation solving; Pint handles units and dimensional consistency.

That means the benchmark can generate new numerical instances without manually writing every answer. More importantly, it can test whether a model has learned a stable relationship rather than memorized a static instance.
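A rough sketch of what an executable, unit-aware ground truth looks like follows. To stay dependency-free it uses plain Python rather than SymPy and Pint, so the `Quantity` class, the dimension tuples, the average-speed problem, and the 1% tolerance are hand-rolled assumptions and not either library's API or the paper's code.

```python
from dataclasses import dataclass
from typing import Tuple

# Dimension exponents in the order (length, mass, time). Hand-rolled for this sketch.
Dim = Tuple[int, int, int]


@dataclass(frozen=True)
class Quantity:
    value: float  # magnitude expressed in SI base units
    dim: Dim

    def __truediv__(self, other: "Quantity") -> "Quantity":
        # Dividing quantities subtracts their dimension exponents.
        return Quantity(self.value / other.value,
                        tuple(a - b for a, b in zip(self.dim, other.dim)))


LENGTH: Dim = (1, 0, 0)
TIME: Dim = (0, 0, 1)
SPEED: Dim = (1, 0, -1)


def km(x: float) -> Quantity:
    return Quantity(x * 1000.0, LENGTH)


def minutes(x: float) -> Quantity:
    return Quantity(x * 60.0, TIME)


def average_speed(distance: Quantity, duration: Quantity) -> Quantity:
    """Ground-truth function: recomputes the answer for any inputs."""
    result = distance / duration
    if result.dim != SPEED:
        raise ValueError(f"expected a speed, got dimensions {result.dim}")
    return result


def is_correct(model_value_si: float, truth: Quantity, rel_tol: float = 0.01) -> bool:
    """Compare a model's numeric answer (assumed to be in m/s) with the ground truth."""
    return abs(model_value_si - truth.value) <= rel_tol * abs(truth.value)


truth = average_speed(km(3.0), minutes(10.0))
print(truth.value)              # ground-truth speed in m/s
print(is_correct(5.0, truth))   # model answered in m/s
print(is_correct(18.0, truth))  # model reported the km/h figure as if it were m/s
```

The last call is the interesting one: 18 is the right number in km/h and the wrong number in m/s, which is exactly the kind of error a string-matching grader cannot classify and a unit-aware one can.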

This is where SymPyBench differs from many existing scientific QA benchmarks. The paper compares it with ScienceQA, SciBench, SciEval, JEEBench, MMLU Physics, and PhysicsQA. Many of those benchmarks are useful, but they generally lack some combination of parameterized numerical variation, textual variation, executable code, and unit validation. SymPyBench’s claim is not that earlier benchmarks are worthless. The claim is that they are weaker instruments for diagnosing scientific reasoning under controlled variation.

The difference can be summarized this way:

Evaluation style	What it can reveal	What it may hide
Fixed-answer benchmark	Whether the model got this instance right	Whether the model generalizes across equivalent variants
Multiple-choice benchmark	Whether the model can recognize or select a plausible answer	Whether it can generate, compute, format, and validate a solution
Code-backed dynamic benchmark	Whether the model’s procedure survives controlled changes	Still limited by task domain, data construction, and evaluation design

The third row is the paper’s real contribution. It is not merely a bigger physics quiz. It is an attempt to evaluate the stability of reasoning.

Accuracy is only the opening bid

The paper reports the usual metrics: exact match accuracy and partial accuracy. Exact match requires the model to solve the whole problem correctly. Partial accuracy scores subproblems within structured multi-part solutions. That distinction matters because many physics problems contain multiple connected parts; failing one step can poison the next.

But the paper’s stronger move is to add robustness metrics:

Metric	What it asks	Why it matters
Exact Match Accuracy	Did the model solve the full problem correctly?	Measures end-to-end correctness
Partial Accuracy	How many subparts were correct?	Separates total failure from partial competence
Consistency Score	Did the model answer all perturbed variants correctly?	Measures stability across equivalent problem families
Confusion Rate	Did performance hover around 40–60% across variants?	Flags unstable or possibly guessing behavior
Complete Failure Rate	Did the model fail all variants in a group?	Identifies persistent blind spots

This metric set is more useful than a leaderboard because it separates “sometimes right” from “procedurally reliable.” A model can have respectable exact-match accuracy and still show poor consistency. That is the kind of result that should make procurement teams put down the champagne.
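The three robustness metrics are easy to compute once results are grouped by problem family. The sketch below follows the definitions as summarized in the table above (all variants correct, accuracy in the 40–60% band, all variants wrong); the function name, the data layout, and the sample results are assumptions for illustration, and the paper's exact formulas may differ in detail.

```python
from typing import Dict, List


def robustness_metrics(groups: Dict[str, List[bool]]) -> Dict[str, float]:
    """groups maps a problem-family id to per-variant correctness flags."""
    n_groups = len(groups)
    total = sum(len(flags) for flags in groups.values())
    correct = sum(sum(flags) for flags in groups.values())
    all_correct = confused = all_wrong = 0
    for flags in groups.values():
        accuracy = sum(flags) / len(flags)
        if accuracy == 1.0:
            all_correct += 1
        elif accuracy == 0.0:
            all_wrong += 1
        elif 0.4 <= accuracy <= 0.6:
            confused += 1
    return {
        "instance_accuracy": correct / total,
        "consistency_score": all_correct / n_groups,
        "confusion_rate": confused / n_groups,
        "complete_failure_rate": all_wrong / n_groups,
    }


# Made-up results for four problem families with four variants each.
results = {
    "projectile": [True, True, True, True],
    "capacitor": [True, False, True, False],
    "doppler": [False, False, False, False],
    "ideal_gas": [True, True, True, False],
}
print(robustness_metrics(results))
```

In this toy data the model answers 9 of 16 instances correctly, yet only one family in four is solved across every variant. That is the same shape of gap the headline table shows.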

The headline table is revealing. Anthropic Sonnet-3.7 has the highest exact match accuracy among the reported models at 65.48% and the highest consistency score at 42.42%. Gemini-2.0-Flash has the highest partial accuracy at 71.43% and exact match accuracy of 64.49%. Llama4-Maverick-17B-128E reaches 64.17% exact match and 69.92% partial accuracy. GPT-4 Turbo lands at 53.73% exact match and 33.33% consistency. Qwen2.5-7B-Instruct, by contrast, has 16.44% exact match and only 5.66% consistency.

The obvious reading is that stronger models do better. True, but not very interesting. The more useful reading is that even the leading models do not turn 60–65% exact-match accuracy into comparable consistency. Sonnet-3.7’s 42.42% consistency is the best reported result, yet it still means fewer than half of problem groups are solved correctly across all perturbed variants.

That gap is the article’s central point: a model may look competent on isolated instances while remaining unstable as a reasoning system.

The appendix tests are not side quests; they explain the failure modes

The paper’s detailed performance analysis focuses especially on Llama4-Maverick-17B-128E, Llama4-Scout-17B-16E, and Llama3.1-405B-Instruct. These tests are not merely extra tables. They serve different diagnostic purposes.

Test or analysis	Likely purpose	What it supports	What it does not prove
Textual variant performance	Robustness/sensitivity test	Surface paraphrasing alone does not strongly change accuracy for the three analyzed models	That models are robust to all prompt changes
Question type performance	Main diagnostic evidence	Format changes expose different skill bottlenecks	That multiple-choice success equals true understanding
Five response iterations	Stability check across repeated sampling	Aggregate accuracy is stable across iterations	That individual answers are always stable
Cross-type conditional accuracy	Error-source diagnosis	Many failures are format- or execution-specific	That conceptual understanding is fully solved
Case studies	Exploratory qualitative evidence	Examples illustrate numerical inconsistency, hallucination, and simplification bias	Population-level rates for every failure class

This matters because a superficial summary would flatten all of these into “models fail sometimes.” The paper is more specific than that. Its strongest business value comes from distinguishing failure categories.

The textual variant results show minimal variation across three phrasings for the selected models. Maverick’s exact match ranges from 63.54% to 64.86%; Scout’s from 49.56% to 50.57%; 405B’s from 34.03% to 35.24%. That suggests ordinary paraphrasing is not the dominant source of error in these tests.

So where do the failures come from?

The question-format table gives the clue. Maverick achieves 95.70% exact match on MC-Symbolic questions, 65.23% on MC-Numerical, and 57.69% on free-form. Scout shows a similar pattern: 81.51% on MC-Symbolic, 47.56% on MC-Numerical, and 44.44% on free-form. 405B behaves differently: 57.21% on MC-Symbolic, 59.42% on MC-Numerical, and only 24.95% on free-form.

The interpretation is not “multiple choice is easier,” although yes, thank you, standardized testing has entered the chat. The more precise interpretation is that symbolic selection, numerical computation, and free-form generation are different capabilities. A model may recognize the correct algebraic form while still botching the numerical answer. It may know the formula and still mishandle units. It may solve a multiple-choice version but fail when forced to produce a complete, formatted answer from scratch.

That distinction is operationally important. If a model fails because it lacks the concept, you need better training or a different model. If it fails because it cannot execute arithmetic or unit conversion reliably, you need tool routing, calculators, validators, and structured output checks. These are very different remedies.

Symbolic competence is not numerical reliability

The paper’s most practical evidence comes from cross-type conditional accuracy. The authors ask: when a model fails in one format, how often does it succeed on the same underlying problem in another format?

For Maverick, when it fails on free-form questions, it still gets the corresponding MC-Symbolic version right 95.45% of the time. When it fails on MC-Numerical questions, it gets the MC-Symbolic version right 95.00% of the time. Scout shows the same pattern at lower levels: 81.25% and 79.55%. 405B is weaker, with conditional MC-Symbolic success rates around 60–65%.

This is the paper’s most useful diagnostic insight. For the stronger analyzed models, many failures are not pure conceptual failures. They are failures of execution: generating a complete answer, performing numerical computation, converting units, and maintaining formatting discipline. The model may know the relevant physical relationship but still fail the workbench version of the task.
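The conditional measurement itself is simple to reproduce on your own evaluation logs. Here is a minimal sketch, assuming one record per underlying problem with a correctness flag per format; the field names and sample records are invented for illustration.

```python
from typing import Dict, List, Optional

# One record per underlying problem: was each format answered correctly?
Record = Dict[str, bool]


def conditional_accuracy(records: List[Record], failed_on: str,
                         succeeded_on: str) -> Optional[float]:
    """P(correct on `succeeded_on` | wrong on `failed_on`), or None if no failures."""
    failures = [r for r in records if not r[failed_on]]
    if not failures:
        return None
    return sum(r[succeeded_on] for r in failures) / len(failures)


records = [
    {"free_form": False, "mc_numerical": False, "mc_symbolic": True},
    {"free_form": False, "mc_numerical": True, "mc_symbolic": True},
    {"free_form": True, "mc_numerical": True, "mc_symbolic": True},
    {"free_form": False, "mc_numerical": False, "mc_symbolic": False},
]
print(conditional_accuracy(records, "free_form", "mc_symbolic"))
print(conditional_accuracy(records, "mc_numerical", "mc_symbolic"))
```

A high value means the model usually recognizes the right expression even when it cannot produce the right number, which points to execution rather than concept as the bottleneck.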

That is not a small weakness. It is exactly the weakness that appears in real workflows.

A technical assistant in a company does not merely select from symbolic expressions. It writes a calculation, explains it, carries units, follows local assumptions, and produces an answer that another system or human can use. If the model understands the symbolic structure but fumbles the execution, the enterprise fix is not to clap louder at the model. The fix is to design the workflow so the model does not perform fragile computation unaided.

A useful production pattern would look like this:

LLM interprets the problem
        ↓
Structured variables and assumptions are extracted
        ↓
Calculator / symbolic engine / unit validator executes the computation
        ↓
LLM explains the result and flags uncertainty
        ↓
Human or rule-based review handles high-risk cases

SymPyBench does not directly test this production architecture. That is our inference from its results. But the inference is well aligned with the failure pattern: if the bottleneck is execution rather than concept recognition, then externalizing execution is a rational design move.
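As a rough illustration of that inference, the flow above can be sketched in a few lines. Everything here is hypothetical: `interpret` is a hard-coded stand-in for the LLM call, and the `Extraction` shape, the `FORMULAS` whitelist, and the escalation messages are assumptions rather than anything the paper specifies.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Extraction:
    """What the interpretation step is assumed to return."""
    formula: str                  # name of a whitelisted computation
    inputs: Dict[str, float]      # values already in SI units
    assumptions: List[str] = field(default_factory=list)


def interpret(question: str) -> Extraction:
    """Stand-in for the LLM call; a real system would parse model output here."""
    return Extraction(
        formula="kinetic_energy",
        inputs={"mass_kg": 2.0, "speed_m_s": 3.0},
        assumptions=["non-relativistic speed"],
    )


# Deterministic computations live outside the model:
# formula name -> (required inputs, function, output unit).
FORMULAS = {
    "kinetic_energy": (
        ("mass_kg", "speed_m_s"),
        lambda mass_kg, speed_m_s: 0.5 * mass_kg * speed_m_s ** 2,
        "J",
    ),
}


def run_pipeline(question: str) -> dict:
    extraction = interpret(question)
    if extraction.formula not in FORMULAS:
        return {"status": "escalate", "reason": "unknown formula"}
    required, compute, unit = FORMULAS[extraction.formula]
    missing = [name for name in required if name not in extraction.inputs]
    if missing:
        return {"status": "escalate", "reason": f"missing inputs: {missing}"}
    value = compute(**{name: extraction.inputs[name] for name in required})
    return {"status": "ok", "value": value, "unit": unit,
            "assumptions": extraction.assumptions}


print(run_pipeline("A 2 kg cart moves at 3 m/s. What is its kinetic energy?"))
```

The model's job shrinks to choosing a formula and extracting inputs; the arithmetic is done by code that returns the same answer every time, and anything outside the whitelist is escalated instead of improvised.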

Free-form answers expose the mess that multiple choice politely conceals

Free-form questions dominate the dataset, and they are harder for a reason. The paper notes that free-form tasks often contain two to three connected sub-questions. The model must solve steps sequentially, and errors can propagate. It must also generate the final answer cleanly, with appropriate units and formatting.

This is closer to real technical work than multiple choice is. A model deployed in engineering support will not usually receive four answer options. It will receive a messy request, possibly with missing information, and will need to decide what can be computed, what assumptions are necessary, and what should be escalated.

The paper’s case studies make this concrete. One example involves a diffusion-time question that omits the diffusion coefficient, a necessary value for computing the answer. Gemini does not ask for the missing coefficient. Instead, it fabricates context, refers to an image that was not provided, and proceeds with a hallucinated calculation. The authors present this as a test of reasoning integrity when inputs are under-specified and suggest that future work could formalize hallucination-rate measurement.

That case should not be overgeneralized into a universal hallucination statistic. It is an illustrative example, not a complete population estimate. But it does highlight a failure mode that enterprises care about: the model may treat missing information as an invitation to invent the spreadsheet.
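A guard against this failure mode can be very small. The sketch below refuses to compute when a required input is absent. The input names are hypothetical, and the formula t = x² / (2D) is the standard one-dimensional diffusion estimate, used here as an assumption because the article does not state which expression the original problem expects.

```python
from typing import Dict, Optional

REQUIRED_INPUTS = ("distance_m", "diffusion_coefficient_m2_s")


def diffusion_time(inputs: Dict[str, Optional[float]]) -> dict:
    """Return the time in seconds, or a clarification request if inputs are missing."""
    missing = [name for name in REQUIRED_INPUTS if inputs.get(name) is None]
    if missing:
        # Do not guess a constant: hand the gap back to the user.
        return {"status": "needs_clarification", "missing": missing}
    x = inputs["distance_m"]
    d = inputs["diffusion_coefficient_m2_s"]
    return {"status": "ok", "time_s": x ** 2 / (2 * d)}


print(diffusion_time({"distance_m": 1e-3}))
print(diffusion_time({"distance_m": 1e-3, "diffusion_coefficient_m2_s": 1e-9}))
```

The same check doubles as an audit probe: strip a required value from a known-good prompt and see whether the system asks for it or invents one.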

The paper also describes implicit simplification bias in advanced physics topics, such as relativistic mechanics. In some cases, a model identifies relevant variables but defaults to simpler Newtonian expressions even when the problem requires a more advanced formulation. This is a different failure class from arithmetic error. It is not a calculator problem; it is a governing-equation-selection problem.
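The size of that error is easy to see numerically. The sketch below compares the Newtonian kinetic energy ½mv² with the relativistic (γ − 1)mc² at several speeds; the 1% adequacy threshold and the function names are illustrative assumptions, not values from the paper.

```python
import math

C = 299_792_458.0  # speed of light in m/s (exact by definition)


def newtonian_ke(mass_kg: float, speed: float) -> float:
    return 0.5 * mass_kg * speed ** 2


def relativistic_ke(mass_kg: float, speed: float) -> float:
    gamma = 1.0 / math.sqrt(1.0 - (speed / C) ** 2)
    return (gamma - 1.0) * mass_kg * C ** 2


def newtonian_is_adequate(speed: float, rel_tol: float = 0.01) -> bool:
    """True if the Newtonian value is within rel_tol of the relativistic one."""
    exact = relativistic_ke(1.0, speed)
    return abs(newtonian_ke(1.0, speed) - exact) <= rel_tol * exact


for fraction in (0.01, 0.1, 0.8):
    v = fraction * C
    ratio = newtonian_ke(1.0, v) / relativistic_ke(1.0, v)
    print(f"v = {fraction}c  Newtonian/relativistic = {ratio:.4f}  "
          f"adequate: {newtonian_is_adequate(v)}")
```

At 0.8c the Newtonian formula captures less than half of the true kinetic energy. No calculator fixes that, because the arithmetic is flawless and the equation is wrong; a routing rule that checks the speed regime before picking a formula does.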

The distinction matters:

Failure type	Example pattern	Likely operational fix
Arithmetic or unit error	Correct formula, wrong numerical result	Calculator, unit validator, executable reference checks
Free-form generation error	Partial reasoning but incomplete or badly formatted answer	Structured output schema, answer parser, review layer
Missing-information hallucination	Invents constants, images, or assumptions	Clarification policy, required-assumption checklist
Simplification bias	Uses an easier but wrong physical model	Domain-specific rules, expert review, better task routing
Conceptual inconsistency	Fails across simplified formats too	Model selection, fine-tuning, or rejecting use case

This is why SymPyBench is useful beyond physics. It encourages organizations to stop treating “model error” as one bucket. Different errors have different economics.

What the paper directly shows, and what businesses should infer

The paper directly shows three things.

First, dynamic, code-backed benchmarking can expose reasoning instability that fixed benchmarks may hide. Because SymPyBench can perturb numbers, wording, and formats while preserving the underlying physics, it tests generalization more rigorously than one-shot answer matching.

Second, top-performing models still show significant gaps between exact-match accuracy and consistency. The strongest reported consistency score is 42.42%, which is good relative to the field but not comforting if you are imagining autonomous scientific reliability.

Third, format matters. MC-Symbolic performance can be dramatically higher than free-form or MC-Numerical performance, especially for Maverick and Scout. This suggests that many errors arise from execution and generation burdens rather than pure absence of conceptual knowledge.

Cognaptus’ business inference is more practical: companies should evaluate technical LLM deployments using variant-based audits, not just benchmark leaderboards. A model chosen because it performs well on a static dataset may fail once deployed into a workflow where inputs shift, units vary, and users omit key details.

The audit should include at least four checks:

Audit check	Business question answered
Parameter perturbation	Does the model remain correct when numbers change?
Format variation	Is success dependent on multiple-choice scaffolding?
Unit and calculation validation	Are errors coming from execution rather than reasoning?
Missing-information tests	Does the model ask for clarification or hallucinate?

This is not merely “more evaluation.” It is better evaluation. The goal is to identify which parts of the workflow can be automated safely and which parts require tools, constraints, or review.

The business value is cheaper diagnosis, not a magic reasoning score

For enterprises, the tempting use of SymPyBench would be to rank models and pick the winner. That is useful, but it is not the main value. Leaderboards age quickly. Failure modes age more slowly.

The real value is diagnostic cost reduction. A company building an AI assistant for engineering, analytics, or technical support does not only need to know which model performs best today. It needs to know why the model fails and which control reduces that failure.

If the model’s main weakness is numerical computation, attach a calculator. If it fails under missing inputs, force clarification before execution. If it performs well with symbolic choices but poorly in free-form generation, separate reasoning from answer formatting. If it shows complete failure on whole problem groups, do not route those tasks to the model at all. Radical, I know: sometimes the correct AI architecture is “don’t ask the AI.”

This suggests a useful governance pattern:

SymPyBench-style signal	Operational interpretation	Recommended control
High accuracy, low consistency	Model can solve some instances but is unstable	Stress-test variants before deployment
High MC-Symbolic, low MC-Numerical	Formula recognition exceeds computation reliability	Use external computation and unit tools
High free-form failure, high MC success	Generation/formatting is a bottleneck	Use templates, schemas, and parsers
High confusion rate	Model behaves uncertainly across variants	Escalate, ensemble, or require confidence checks
High complete failure rate	Persistent blind spot	Block task class or require expert handling

Notice what is missing here: motivational language about AI “unlocking innovation.” The useful lesson is more boring and more profitable. Build systems that know where the model is brittle.
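The governance table can be read as a small rule set. Here is one hedged way to encode it: the control wording comes from the table, while the metric names, the thresholds, and the sample numbers are hypothetical and would need tuning against your own audit data.

```python
from typing import Dict, List

# Hypothetical thresholds; none of these values come from the paper.
THRESHOLDS = {
    "consistency_gap": 0.15,        # exact-match accuracy minus consistency
    "format_gap": 0.20,             # gap between two format accuracies
    "confusion_rate": 0.20,
    "complete_failure_rate": 0.10,
}


def recommend_controls(m: Dict[str, float]) -> List[str]:
    """Map audit signals to the controls listed in the governance table."""
    controls = []
    if m["exact_match"] - m["consistency"] > THRESHOLDS["consistency_gap"]:
        controls.append("stress-test variants before deployment")
    if m["mc_symbolic"] - m["mc_numerical"] > THRESHOLDS["format_gap"]:
        controls.append("use external computation and unit tools")
    if m["mc_symbolic"] - m["free_form"] > THRESHOLDS["format_gap"]:
        controls.append("use templates, schemas, and parsers")
    if m["confusion_rate"] > THRESHOLDS["confusion_rate"]:
        controls.append("escalate, ensemble, or require confidence checks")
    if m["complete_failure_rate"] > THRESHOLDS["complete_failure_rate"]:
        controls.append("block task class or require expert handling")
    return controls


# Made-up audit results for one task class.
audit = {
    "exact_match": 0.64, "consistency": 0.40,
    "mc_symbolic": 0.95, "mc_numerical": 0.65, "free_form": 0.58,
    "confusion_rate": 0.10, "complete_failure_rate": 0.05,
}
for control in recommend_controls(audit):
    print("-", control)
```

Running the rules per task class rather than per model keeps the output actionable: the same model can be cleared for one workflow and blocked from another.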

Boundaries: physics is not the whole enterprise, and benchmarks are not deployment

SymPyBench is strong because it is controlled. That also defines its limits.

The benchmark is physics-focused, largely text-based, and built through a pipeline that includes LLM-generated structured representations, templates, variations, and code. The authors manually review remaining problems and report filtering/error rates across stages, including 12% filtered at the Python code stage due to function signature mismatch, incorrect output, or unit errors. Still, any synthetic or semi-synthetic benchmark carries construction assumptions.

The dataset also does not automatically prove performance in chemistry labs, civil engineering workflows, financial modeling, legal compliance, or production operations. Those domains have their own ontologies, constraints, and failure costs. A unit error in physics and a covenant interpretation error in finance are not the same creature, even if both can ruin your afternoon.

The paper’s model comparisons should also be read as benchmark evidence, not universal model rankings. Models change, prompting protocols change, and deployment stacks add tools. A model that performs poorly unaided may perform much better when paired with symbolic solvers, retrieval systems, validators, or domain-specific workflows.

Finally, the hallucination and simplification-bias examples are valuable but should not be treated as fully quantified rates across all models. They are case-level evidence and exploratory extensions. Their importance lies in identifying what future evaluations should measure systematically.

From quiz-taking models to auditable technical systems

SymPyBench is not just another benchmark with a larger table. Its deeper contribution is methodological: evaluate scientific reasoning as a stable procedure, not as a single answer.

That shift is overdue. Static benchmarks reward answer matching. Multiple-choice formats can reward recognition. Free-form technical work requires something harsher: the ability to extract variables, choose the right governing equation, compute correctly, carry units, handle missing information, and remain stable when the same problem is perturbed.

The paper shows that current LLMs are not uniformly hopeless. Some are surprisingly strong in symbolic recognition. Some retain reasonable aggregate performance across repeated response iterations. Some failures are not conceptual at all; they are execution failures that better workflow design can reduce.

But the paper also shows why “the model got 65% exact match” is not enough. A technical AI system must be judged by consistency, confusion, and complete failure patterns. Otherwise, the organization is not evaluating reasoning. It is admiring a lucky answer.

SymPyBench’s most useful lesson for business is therefore simple: do not ask whether the model can sound scientific. Ask whether its procedure survives contact with changed numbers, changed wording, changed formats, and missing information.

Physics is unforgiving. Benchmarks should be too.

Cognaptus: Automate the Present, Incubate the Future.

¹ Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, and Babak Damavandi, “SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code,” arXiv:2512.05954, December 5, 2025, https://arxiv.org/abs/2512.05954.
