Scientific Reasoning Under the Microscope: How PRiSM Stress-Tests the New Generation of Multimodal Models

Grades are comforting.

A model solves 80% of the benchmark, the leaderboard smiles, the demo team relaxes, and someone in procurement quietly starts asking whether the engineering team still needs that many humans. This is usually the part where reality coughs politely.

Scientific reasoning is not just answer-getting. It is the ability to read a diagram without hallucinating a missing digit, preserve units across transformations, choose the right physical law, survive a paraphrased version of the same question, and admit when the problem is underspecified. In other words, it is not one skill. It is a stack of fragile skills wearing a lab coat.

The PRiSM paper, "PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation," puts that stack under a microscope [1]. The paper introduces a dynamic multimodal benchmark of 24,750 university-level mathematics and physics problem instances. Each instance can include generated figures, parameterized text, paraphrased variants, structured reasoning steps, and Python code for ground-truth generation and verification.

That alone would make PRiSM another serious benchmark. Useful, yes. Exciting, if one’s idea of excitement includes unit consistency checks. But the more important contribution is what PRiSM reveals: models that look capable under overall accuracy can still be unstable under variation, weak at executable solution synthesis, poor at diagnosing flawed reasoning, and strangely allergic to saying, “I do not have enough information.”

That last behavior is not a small defect. It is the difference between a scientific assistant and a very confident intern with a graphics tablet.

The headline result is not accuracy; it is the gap between accuracy and reliability

The paper evaluates seven models across Tasks I–IV: Qwen-72B, LLaMA-4 Maverick, Mistral Medium-3, Claude-3.7 Sonnet, Gemini-2.5 Pro, GPT-4o, and o4-mini-high. PRiSM reports not only overall accuracy, but also three diagnostic measures:

| Metric | What it measures | Why it matters |
| --- | --- | --- |
| Overall Accuracy | Correct answers across all tested variations | Useful, but can hide instability across equivalent versions of the same problem |
| TRUE Score | Share of problems solved consistently across variations | Closer to “reliable understanding” than raw accuracy |
| Volatility Rate | Share of problems where performance fluctuates around the middle | Signals sensitivity to paraphrases, visuals, or input perturbations |
| Total Failure Rate | Share of problems where the model fails every variation | Reveals problems that are not merely phrasing-sensitive but fundamentally unsolved |

This metric design is the paper’s most important editorial clue. PRiSM is not asking, “Can the model sometimes get the answer?” It asks, “Does the model understand the problem well enough that equivalent versions remain equivalent?”
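To make the four measures concrete, here is a minimal sketch of how they could be computed from per-variation results. The cutoffs (`true_threshold`, `volatile_band`) and the data are assumptions chosen for illustration; this article does not reproduce the paper's exact definitions, so treat the numbers as placeholders.

```python
from dataclasses import dataclass


@dataclass
class Diagnostics:
    overall_accuracy: float
    true_score: float
    volatility_rate: float
    total_failure_rate: float


def diagnose(results, true_threshold=0.75, volatile_band=(0.25, 0.75)):
    """Summarize results given as {problem_id: [bool per variation]}.

    true_threshold and volatile_band are illustrative assumptions,
    not values taken from the PRiSM paper.
    """
    per_problem = {pid: sum(v) / len(v) for pid, v in results.items()}
    n_problems = len(per_problem)
    n_attempts = sum(len(v) for v in results.values())
    n_correct = sum(sum(v) for v in results.values())
    low, high = volatile_band
    rates = list(per_problem.values())
    return Diagnostics(
        overall_accuracy=n_correct / n_attempts,
        true_score=sum(p >= true_threshold for p in rates) / n_problems,
        volatility_rate=sum(low <= p < high for p in rates) / n_problems,
        total_failure_rate=sum(p == 0 for p in rates) / n_problems,
    )


# Made-up data: four problems, four variations each.
example = {
    "p1": [True, True, True, True],
    "p2": [True, False, True, False],
    "p3": [True, True, False, False],
    "p4": [False, False, False, False],
}
report = diagnose(example)
print(report)
```

In this toy case half of all attempts are correct, yet only one problem in four is solved reliably. That is the kind of gap a single accuracy figure hides.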

That distinction matters because scientific and engineering work rarely arrives as a clean exam question. It arrives as a graph pasted into a report, a diagram from a vendor, a partially labeled chart, a unit convention inherited from a previous team, or a text description that changed slightly because someone wrote it at 11:47 p.m. on a Friday. A model that succeeds only on one surface form is not reliable. It is lucky with formatting.

The results show why this matters. In Task I, which varies input values and paraphrases the problem text, Gemini-2.5 Pro and o4-mini-high reach strong overall accuracy, around 78–81%. But TRUE Score tells a narrower story: o4-mini-high reaches 60%, Gemini-2.5 Pro reaches 56.7%, while Qwen-72B has overall accuracy above 50% but a TRUE Score of only 8.3%.

That is the paper’s first useful slap on the wrist. A model can appear broadly competent while failing to solve the same underlying problem consistently. In a leaderboard, that looks like “respectable accuracy.” In production, it looks like users receiving different answers because the prompt was reworded.

Task II repeats the pressure from the visual side. The input values remain fixed, but visual perturbations such as noise or handwriting style vary. Gemini-2.5 Pro and o4-mini-high again perform strongly, with TRUE Scores of 63.3% and 61.7%. Mistral Medium-3, by contrast, shows a large gap between overall accuracy and reliable consistency. The business translation is plain: image understanding is not a single pass/fail capability. It is a stability problem.

Task III, reasoning with correction, is more subtle. The model is shown partially incorrect step-by-step reasoning and must identify and correct the mistake. Here the task is no longer just “solve.” It is “audit a reasoning trace.” That is exactly what many business users expect from an AI assistant reviewing calculations, worksheets, engineering notes, or student submissions. The paper reports that Qwen-72B and Mistral Medium-3 suffer substantial TRUE Score drops, while stronger models remain more consistent. The appendix adds a qualitative breakdown: in a 100-sample review, Gemini-2.5 Pro explicitly identified and corrected errors in 84% of observed cases, compared with 62% for Qwen-72B. Qwen also silently corrected 24% and uncritically accepted the flawed reasoning 14% of the time; Gemini’s corresponding numbers were 6% and 10%.

That is a useful distinction. Silent correction may produce a better final answer, but it fails as feedback. Uncritical acceptance is worse: it turns an error into a certified-looking error. For education, engineering review, compliance documentation, or scientific workflow automation, the audit trail matters.

Then Task IV ruins the party.

Programmatic solution synthesis asks models to generate executable Python code for solving generic multimodal scientific problems. PRiSM executes the generated code on input variations and compares the results against Python-grounded reference answers. Overall accuracy is low across the board: o4-mini-high leads at 21.4%, GPT-4o reaches 17.9%, Claude-3.7 Sonnet reaches 17.3%, and several models sit near or below 15%. Total Failure Rates are brutal, mostly around 75–79%.

The authors note two major failure classes: syntactic errors, such as missing imports or malformed parameters, and conceptual mistakes, such as incorrect symbolic derivations or substitutions. They also report that none of the models consistently handled unit conversion, so the evaluation used SI-unit inputs.
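A rough sketch of what "execute the generated code on input variations" can look like is below. It is not PRiSM's harness: the `solve` entry point, the case format, and the free-fall example are all hypothetical, and a real system would run untrusted code in a sandbox rather than with a bare `exec`.

```python
import math


def run_candidate(source, cases, entry_point="solve", rel_tol=1e-6):
    """Run model-written source against (inputs, expected) cases.

    entry_point and the case format are hypothetical conventions.
    Returns (accuracy, load_failure_label_or_None).
    """
    namespace = {}
    try:
        exec(source, namespace)  # sandbox this in any real system
    except SyntaxError as err:
        return 0.0, f"syntax error: {err.msg}"
    except Exception as err:
        return 0.0, f"load error: {type(err).__name__}"
    func = namespace.get(entry_point)
    if not callable(func):
        return 0.0, "missing entry point"
    passed = 0
    for inputs, expected in cases:
        try:
            passed += math.isclose(func(**inputs), expected, rel_tol=rel_tol)
        except Exception:
            pass  # a runtime failure counts as a wrong answer
    return passed / len(cases), None


# Free fall from rest: impact speed v = sqrt(2 * g * h), SI units throughout.
G = 9.81
CASES = [({"h": h}, math.sqrt(2 * G * h)) for h in (1.0, 5.0, 20.0)]

GOOD = "import math\ndef solve(h):\n    return math.sqrt(2 * 9.81 * h)\n"
MISSING_IMPORT = "def solve(h):\n    return math.sqrt(2 * 9.81 * h)\n"
WRONG_PHYSICS = "def solve(h):\n    return 9.81 * h\n"

for name, src in [("good", GOOD),
                  ("missing import", MISSING_IMPORT),
                  ("wrong physics", WRONG_PHYSICS)]:
    print(name, run_candidate(src, CASES))
```

The missing-import candidate and the wrong-physics candidate both score zero, but for different reasons. A harness that records which one occurred tells you whether to fix the scaffolding or the science.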

This is not a small implementation detail. It means “the model can explain the solution” and “the model can produce a reusable computational solver” are still very different capabilities. Anyone building AI into scientific analytics, engineering design, or technical QA should keep those boxes separate. A fluent explanation is not an executable method. A correct final number is not a validated calculation pipeline. And no, wrapping it in Python does not magically turn vibes into verification.

PRiSM is built to make models fail in diagnostically useful ways

The benchmark’s construction matters because it explains why these failures become visible.

PRiSM is generated through PrismAgent, an agentic pipeline that starts from undergraduate-level physics and mathematics materials. The pipeline extracts questions, reasoning steps, figures, numerical variables, and outputs. It then builds structured maps of inputs and outputs, parameterizes the problem, generates paraphrases, synthesizes Python solution functions, and reconstructs diagrams through plotting utilities.

The important move is grounding. PRiSM does not merely store final answers. It connects the problem text, variables, reasoning, generated figure, and executable Python code. The code uses libraries such as SymPy for symbolic manipulation and Pint for unit consistency. Generated code is checked through dimensional validation and numerical validation. Generated figures are executed and manually reviewed for consistency with the problem.
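The paper relies on Pint for unit consistency. As a dependency-free illustration of the same idea, the sketch below tracks dimensions as exponent tuples and refuses to add mismatched quantities. The `Quantity` class and the displacement example are stand-ins invented for this article, not the paper's code or Pint's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    value: float
    dims: tuple  # exponents of (length, mass, time)

    def __mul__(self, other):
        return Quantity(self.value * other.value,
                        tuple(a + b for a, b in zip(self.dims, other.dims)))

    def __truediv__(self, other):
        return Quantity(self.value / other.value,
                        tuple(a - b for a, b in zip(self.dims, other.dims)))

    def __add__(self, other):
        if self.dims != other.dims:
            raise ValueError(f"cannot add dimensions {self.dims} and {other.dims}")
        return Quantity(self.value + other.value, self.dims)


DIMENSIONLESS = (0, 0, 0)
LENGTH, TIME = (1, 0, 0), (0, 0, 1)
VELOCITY, ACCEL = (1, 0, -1), (1, 0, -2)


def displacement(v0, a, t):
    """x = v0 * t + 0.5 * a * t^2, with dimensions carried along."""
    half = Quantity(0.5, DIMENSIONLESS)
    return v0 * t + half * a * t * t


v0 = Quantity(3.0, VELOCITY)
a = Quantity(2.0, ACCEL)
t = Quantity(4.0, TIME)
x = displacement(v0, a, t)

# Dimensional validation: the result must be a length.
assert x.dims == LENGTH
# Numerical validation against an independent hand calculation.
assert x.value == 3.0 * 4.0 + 0.5 * 2.0 * 4.0 ** 2
```

The two assertions mirror the two checks described above: one catches a formula whose units do not work out, the other catches a formula that is dimensionally fine but numerically wrong.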

This design gives the benchmark two powers that static datasets usually lack.

First, it can generate controlled variations of the same underlying problem. Change values. Rephrase the text. Perturb the visual presentation. Mask variables. Inject flawed reasoning. Ask for code. Because the benchmark has a structured representation behind each problem, it can test whether the model is solving the underlying structure or merely responding to the surface.
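A structured representation of this kind can be sketched in a few lines. The `ProblemTemplate` class, its fields, and the example problem are hypothetical; they show the general shape of a parameterized problem with paraphrases and a reference solver, not PrismAgent's actual data model.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProblemTemplate:
    paraphrases: list   # str.format templates that share the same fields
    ranges: dict        # variable name -> (low, high)
    solver: Callable    # reference solution as plain Python

    def instance(self, rng):
        """Sample new values, pick a phrasing, and compute the ground truth."""
        values = {name: round(rng.uniform(lo, hi), 2)
                  for name, (lo, hi) in self.ranges.items()}
        text = rng.choice(self.paraphrases).format(**values)
        return {"text": text, "inputs": values, "answer": self.solver(**values)}


template = ProblemTemplate(
    paraphrases=[
        "A car starts at {v0} m/s and accelerates at {a} m/s^2 for {t} s. "
        "Find its final speed.",
        "Moving initially at {v0} m/s, a car speeds up at {a} m/s^2. "
        "What is its speed after {t} s?",
    ],
    ranges={"v0": (0, 20), "a": (1, 5), "t": (2, 10)},
    solver=lambda v0, a, t: v0 + a * t,
)

rng = random.Random(0)  # fixed seed so the variants are reproducible
variants = [template.instance(rng) for _ in range(3)]
for variant in variants:
    print(variant["text"], "->", variant["answer"])
```

Because the answer is always recomputed from the sampled values, every variant carries its own ground truth. Surface form and numbers can change freely while the underlying problem stays the same.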

Second, it can separate failure types. A wrong answer under visual perturbation is not the same as a wrong answer under paraphrase. Failure to correct a student error is not the same as failure to generate executable code. Refusing to ask for clarification is not the same as misreading a graph. These are different defects, and they imply different engineering fixes.

The dataset itself is weighted toward physics: 81.82% physics and 18.18% mathematics. Within physics, kinematics alone accounts for 36.36% of instances, while magnetism and electricity each represent 12.73%, work and energy 10.91%, and thermodynamics 9.09%. Mathematics includes calculus, application of calculus, and set theory. The appendix reports broad concept coverage: 450 physics concepts and 110 mathematics concepts.

The problems also require multi-step reasoning. PRiSM reports an average of 6.91 reasoning steps per problem, with a minimum of 4 and maximum of 11. This matters because short problems often reward shortcut pattern matching. Longer structured problems expose whether the model can preserve intermediate dependencies.

The multimodal element is not decorative. The appendix reports that problems have, on average, 4.43 input variables and 2.20 output variables. Figures contain an average of 3.50 input variables, while text contains 3.39. Some variables are figure-only; some are text-only. This design forces actual multimodal fusion instead of allowing the model to solve from the text alone while pretending the image mattered. That text-only shortcut is a familiar weakness of multimodal benchmarks.

The appendix is not filler; it explains the mechanism of failure

Many benchmark papers put the interesting behavior in appendices, because apparently readers enjoy treasure hunts. PRiSM’s appendix is worth reading because it explains why aggregate scores move the way they do.

For Tasks I and II, the authors sampled 500 examples and analyzed model outputs. Although the initial analysis focused on LLaMA-4 Maverick, they report similar behaviors across Gemini-2.5 Pro and Qwen-72B. The categories are operationally useful:

| Failure mode | Likely purpose in the paper | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Modality conflict | Main error analysis for robustness tasks | Models may override correct symbolic reasoning with misleading visual cues | It does not prove vision is always harmful; it shows fusion can be poorly reconciled |
| Ambiguity-induced errors | Robustness/sensitivity analysis | Small visual ambiguities, such as unclear digits or decimal points, can shift reasoning | It does not isolate OCR from reasoning failure perfectly |
| Visual misreading | Robustness/sensitivity analysis | Dense annotations and subtle numeric changes remain difficult | It does not imply all visual tasks are equally fragile |
| Multi-step reasoning gaps | Mechanism analysis | Models may collapse intermediate steps and lose transformations | It does not show that chain-of-thought alone solves the problem |
| Numerical precision | Error typology | Correct symbolic setup can still fail through inappropriate approximation | It does not reduce scientific reasoning to arithmetic accuracy |
| Formula misapplication | Conceptual failure analysis | Models still confuse physical principles, setups, and circuit structures | It does not identify whether failure came from training gaps or inference instability |

The most revealing category is modality conflict. In one described case, the model correctly solves a problem algebraically but then allows a misleading visual cue to override the symbolic solution. This is not a simple vision failure. It is a coordination failure between perception and reasoning.

That failure mode matters for business systems because multimodal workflows often combine charts, text, tables, and diagrams. A model may extract useful visual cues but lack a robust internal mechanism for reconciling them with computed results. When the visual signal and symbolic calculation disagree, the model needs a conflict-resolution procedure. Without it, the “multimodal” system becomes a polite argument among unreliable subsystems.
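One possible shape for such a conflict-resolution step is sketched below. It is an illustration written for this article, not something the paper proposes; the function name, the status labels, and the 2% tolerance are all assumptions.

```python
import math


def reconcile(computed, read_from_figure, rel_tol=0.02):
    """Compare a computed value with a value read off a figure.

    Neither source is allowed to silently override the other:
    disagreement is surfaced as a conflict instead of being resolved
    by whichever signal arrived last. rel_tol is an assumed tolerance.
    """
    if read_from_figure is None:
        return {"value": computed, "status": "computed_only"}
    if math.isclose(computed, read_from_figure, rel_tol=rel_tol):
        return {"value": computed, "status": "agree"}
    return {
        "value": None,
        "status": "conflict",
        "detail": f"computed {computed} but figure suggests {read_from_figure}",
    }


print(reconcile(9.8, 9.81))   # small reading error, treated as agreement
print(reconcile(9.8, 98.0))   # likely a misread decimal point, escalated
```

The design choice is that a conflict returns no value at all. Downstream code then has to re-inspect the figure or escalate to a human, rather than pass along a number that only one subsystem believes.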

Ambiguity-induced and visual misreading errors are equally practical. The appendix describes cases where handwritten values or subtle digits were misread. In one example, a zoomed-in image allowed the model to correct the error. That suggests the issue is not always conceptual ignorance. Sometimes the model needs a better visual inspection strategy: crop, zoom, re-read, compare magnitudes, then solve. Human analysts do this automatically. AI systems often pretend the first glance was enough. Adorable, in the way an expensive mistake can be adorable.

The multi-step reasoning gap is different. Here the model begins with clear reasoning but collapses several transformations into one step, especially around exponentials, unit conversions, or algebraic simplification. The failure is not absence of reasoning; it is loss of granularity. In production, this points toward a design principle: do not merely ask the model to “think step by step.” Force intermediate artifacts that can be checked.
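As a minimal sketch of "intermediate artifacts that can be checked," the code below replays a reasoning trace step by step and reports the first value that cannot be reproduced from what came before. The trace format and the kinetic-energy example are assumptions made for illustration.

```python
import math


def first_bad_step(steps, givens, rel_tol=1e-6):
    """Return the label of the first step that fails to reproduce.

    Each step is (label, claimed_value, recompute), where recompute
    derives the expected value from the givens and earlier steps.
    Returns None if every step checks out.
    """
    known = dict(givens)
    for label, claimed, recompute in steps:
        expected = recompute(known)
        if not math.isclose(claimed, expected, rel_tol=rel_tol):
            return label
        known[label] = claimed
    return None


givens = {"m": 2.0, "v_kmh": 36.0}

# Granular trace: convert units, square, then apply KE = 0.5 * m * v^2.
granular = [
    ("v_ms", 10.0, lambda k: k["v_kmh"] / 3.6),
    ("v_sq", 100.0, lambda k: k["v_ms"] ** 2),
    ("ke", 100.0, lambda k: 0.5 * k["m"] * k["v_sq"]),
]

# Collapsed trace: one step, and the km/h to m/s conversion was dropped.
collapsed = [
    ("ke", 1296.0, lambda k: 0.5 * k["m"] * (k["v_kmh"] / 3.6) ** 2),
]

print(first_bad_step(granular, givens))
print(first_bad_step(collapsed, givens))
```

The collapsed trace fails as a single opaque step. The granular one would have pinpointed the unit conversion, which is the practical argument for demanding smaller, checkable steps.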

Task V shows the most business-relevant weakness: models would rather answer than ask

PRiSM’s Task V introduces ambiguity by masking 20–30% of input variables and replacing them with symbolic placeholders. Models are instructed to request clarification when necessary. The evaluation uses an LLM-as-judge framework, with LLaMA-4 Maverick as judge, and the authors manually review 100 random samples from o4-mini-high, Gemini-2.5 Pro, and Qwen-72B to validate and deepen the analysis.

The result is bleak in a very familiar way. Models rarely defer or ask clarifying questions. Symbolic reasoning appears in only about 4–5% of cases. Deferral appears in only about 3–4%. Instead, models often continue by making unsupported assumptions, inferring missing values from visual context, or using phrases such as “for simplicity, assume…”

That phrase should make every enterprise AI evaluator sit up.

“For simplicity, assume…” is sometimes acceptable in a classroom derivation. It is dangerous in a business workflow unless the assumption is clearly labeled, user-approved, and downstream-safe. In technical QA, missing variables are not an invitation to improvise. In engineering calculations, undocumented assumptions become liability. In financial analytics, “for simplicity” is where small errors go to become expensive.

The paper’s qualitative examples highlight three behaviors:

  1. models infer missing parameters from the image rather than asking;
  2. models make default assumptions without explicit clarification;
  3. models produce partial symbolic expressions that obscure variable dependencies.

The third behavior is easy to underappreciate. A symbolic answer is not automatically good. If the answer leaves dependencies hidden or unsimplified, the user may not understand which missing variable drives the result. For decision support, that weakens interpretability. The model has not hallucinated a number, but it has also not made the uncertainty operational.
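The behavior Task V is looking for can be stated in code. The sketch below masks a fraction of the inputs and then names the missing variables instead of guessing them. The function names, the use of `None` as the placeholder, and the example solver are hypothetical; PRiSM itself replaces masked values with symbolic placeholders inside the problem.

```python
import random


def mask_inputs(inputs, fraction, rng):
    """Hide roughly `fraction` of the inputs (at least one) as None."""
    n_masked = max(1, round(len(inputs) * fraction))
    hidden = set(rng.sample(sorted(inputs), n_masked))
    return {name: (None if name in hidden else value)
            for name, value in inputs.items()}


def answer_or_ask(inputs, solver):
    """Solve only when every input is present; otherwise say what is missing."""
    missing = sorted(name for name, value in inputs.items() if value is None)
    if missing:
        return {"status": "needs_clarification", "missing": missing}
    return {"status": "answered", "value": solver(**inputs)}


def position(x0, v0, a, t):
    return x0 + v0 * t + 0.5 * a * t * t


full = {"x0": 1.0, "v0": 3.0, "a": 2.0, "t": 4.0}
masked = mask_inputs(full, 0.25, random.Random(1))

print(answer_or_ask(full, position))
print(answer_or_ask(masked, position))
```

Listing the missing variable by name is the operational version of uncertainty. The user learns exactly which input the answer depends on, and no default is assumed on their behalf.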

Task V is partly qualitative and partly judged by another model, so it deserves careful interpretation. It should not be treated as a precise universal measurement of all ambiguity handling. But as a diagnostic signal, it is valuable. It exposes a behavior that appears across many real AI products: answer completion is rewarded more strongly than uncertainty management.

That is a product problem, not merely a model problem.

What the paper directly shows

The paper’s direct contribution is not “vision-language models (VLMs) are bad at science.” That would be lazy, and also false. The stronger models in PRiSM clearly solve many difficult instances. Gemini-2.5 Pro and o4-mini-high perform strongly on robustness tasks, especially compared with weaker models. The point is more specific: current VLMs remain uneven across the components required for reliable scientific reasoning.

PRiSM directly shows four things.

First, robustness differs from accuracy. Overall accuracy can hide large drops in consistent problem-level understanding. TRUE Score is useful because it shifts attention from “How often did the model get a variation right?” to “How often did it solve the underlying problem reliably across variations?”

Second, visual reasoning creates both information and risk. Figures are necessary in PRiSM; variables can appear only in images, only in text, or in both. But visual cues can also mislead models or override correct symbolic reasoning. Multimodal fusion is not solved by adding an image encoder and hoping the tokens become friends.

Third, diagnostic reasoning is separate from answer generation. In Task III, a model may correct an answer silently, accept flawed reasoning, or explicitly identify the error. These are different behaviors with different business value. A tutoring assistant, engineering reviewer, or scientific copilot should not merely output the corrected result; it should identify the fault in the reasoning chain.

Fourth, executable program synthesis remains weak. Task IV’s low accuracy and high total failure rates show that generating reusable code for scientific problem solving is still much harder than producing a fluent explanation or final answer. This is particularly important for companies that hope to use AI to automate analysis pipelines, not just answer one-off questions.

What Cognaptus infers for business use

The business implication is not “use PRiSM as-is for every domain.” PRiSM is focused on synthetic university-level math and physics problems. Enterprise workflows have their own document styles, measurement conventions, diagrams, policies, and messy legacy systems. A factory maintenance report is not a kinematics exam. A civil engineering plan is not a generated matplotlib figure.

The better inference is methodological: PRiSM shows what a serious evaluation stack should look like when multimodal AI is expected to reason, not merely describe.

For business teams, the evaluation question should shift from:

“Does the model answer correctly on our sample cases?”

to:

“Does the model remain correct when the same case is paraphrased, visually perturbed, partially underspecified, and converted into an executable workflow?”

That shift changes procurement, QA, and product design.

| Business use case | PRiSM-inspired evaluation test | What to measure |
| --- | --- | --- |
| Engineering QA copilots | Rephrase the same technical problem, vary numerical inputs, and perturb diagrams | Consistency across equivalent cases, not only average accuracy |
| Scientific document analysis | Compare text-only, image-only, and combined inputs | Whether the model reconciles modality conflicts |
| Technical education platforms | Insert common student errors into reasoning steps | Whether the model explicitly diagnoses the error |
| Analytics automation | Ask the model to generate executable code from symbolic problem descriptions | Runtime success, unit handling, and output correctness |
| Operations support with incomplete forms | Mask required variables and instruct the model to ask questions | Deferral rate, assumption labeling, and symbolic dependency clarity |

This matters for ROI because AI failures in technical workflows are rarely evenly distributed. A model that is correct 80% of the time may still be unusable if the remaining 20% includes silent unit errors, diagram misreadings, unsupported assumptions, or code that runs only when the moon is emotionally available.

The operational value lies in cheaper diagnosis. A PRiSM-style benchmark helps teams identify where the model fails before it is embedded into a workflow. If the weakness is visual misreading, add zoom-and-crop inspection. If the weakness is unit conversion, enforce Pint-like validation. If the weakness is program synthesis, separate explanation from executable code and add runtime feedback. If the weakness is ambiguity, redesign the interface so asking for clarification is treated as success, not refusal.

This is the difference between “model selection” and “system design.” Model selection asks which model scores highest. System design asks which failure modes remain after the model is wrapped in tools, validators, UI constraints, and human escalation.

PRiSM is useful because it makes that second conversation concrete.

The benchmark is strong, but its boundaries matter

PRiSM’s boundaries are not embarrassing footnotes. They define how the results should be used.

First, the benchmark is synthetic and concentrated in mathematics and physics. This gives PRiSM control and scalability, but it also means performance may not transfer directly to messy real-world domains such as medical imaging workflows, industrial inspection, legal evidence review, or financial modeling. The structure is transferable; the scores are not automatically transferable.

Second, the dataset is physics-heavy. With more than 80% of instances in physics and a large share in kinematics, PRiSM is especially informative for symbolic, quantitative, diagram-based reasoning. It is less informative for domains where scientific reasoning depends on experimental design, uncertainty estimation from empirical data, literature synthesis, or long-horizon causal inference.

Third, the figures are generated through a controlled plotting ecosystem. That makes visual variation systematic, but it is still different from screenshots, hand-drawn notes, scanned engineering diagrams, medical images, or noisy field photos. PRiSM tests meaningful multimodal reasoning, but it does not exhaust the visual messiness of the enterprise world. The enterprise world has PDFs from 2008. Enough said.

Fourth, Task V partly relies on an LLM-as-judge framework. The authors acknowledge potential bias and perform manual review, which strengthens the analysis. Still, ambiguity handling should be interpreted as a qualitative diagnostic finding rather than a final calibrated metric.

Finally, Task IV evaluates code generation without giving models an execution tool for self-correction. The authors explicitly note future work on allowing models to run and repair their code. In practical business systems, code agents often do receive runtime feedback. So the Task IV result should not be read as “models cannot ever synthesize scientific code.” It should be read as “single-pass executable scientific code generation remains unreliable, especially when units, symbolic derivations, and multimodal inputs are involved.”

That is still a serious warning.

The new benchmark lesson: evaluate the behavior you actually need

PRiSM’s real message is simple: scientific reasoning is not one benchmark number. It is a chain of behaviors.

The model must parse the text. Read the figure. Preserve the variables. Apply the right law. Keep units consistent. Perform multi-step transformations. Correct flawed reasoning. Generate executable artifacts when needed. And when information is missing, it must resist the urge to cosplay omniscience.

Most evaluation pipelines do not test that full chain. They test final answers. That is easier to score, easier to display, and easier to misunderstand.

PRiSM gives us a more useful vocabulary. Overall accuracy tells us whether the model can often solve. TRUE Score tells us whether it can solve reliably. Volatility tells us whether surface changes destabilize it. Total Failure Rate tells us where it fundamentally breaks. Task-specific analyses show whether the break comes from visuals, paraphrases, correction, code, or ambiguity.

For business leaders, the lesson is not to wait for perfect models. The lesson is to stop buying demos as if they were guarantees. A multimodal AI system used in technical work should be evaluated like a technical worker: not only by whether it gets some answers right, but by whether it handles variation, documents assumptions, checks its own work, and knows when to escalate.

That is less glamorous than a leaderboard. It is also much closer to reality.

PRiSM does not prove that today’s multimodal models cannot reason scientifically. It proves something more useful: when they fail, they fail in structured, testable, and often predictable ways.

And once failure becomes testable, it becomes engineerable.

Cognaptus: Automate the Present, Incubate the Future.


  1. Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, and Babak Damavandi, “PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation,” arXiv:2512.05930, December 5, 2025, https://arxiv.org/abs/2512.05930