The Benchmark Drop Is Not the Verdict: Re-reading GSM-Symbolic with Statistics

A benchmark result lands on the desk. The chart is clean. The message is dramatic. A model performs well on the original math questions, then worse on symbolic variants. Someone in the meeting says the obvious thing: “So it cannot really reason.”

That sentence is attractive because it is simple. It is also the kind of sentence that should be forced to pass through a statistical checkpoint before being allowed near procurement, product strategy, or a LinkedIn post with too many lightning emojis.

The paper “The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic” does exactly that checkpoint work.¹ It revisits the GSM-Symbolic benchmark, which had reported consistent performance drops across LLMs when GSM8K math problems were transformed into template-generated variants. The original interpretation was strong: if a model fails when names and numbers change, perhaps it is not performing genuine reasoning but leaning on memorized surface patterns.

The re-evaluation does not say that all is well. It does not claim that LLMs are secretly perfect reasoners wearing modest statistical clothing. The paper’s point is sharper: a raw performance drop is not automatically a reasoning verdict. First, the drop must be statistically reliable. Second, the variant dataset must not have changed something besides superficial form. Third, the remaining failures should be diagnosed rather than bundled into one grand theory of model stupidity.

That is the article’s path: evidence first, conclusion later. A strange idea, apparently still experimental in AI discourse.

The original claim rests on a real concern, but a fragile inference

GSM8K is a dataset of grade-school math word problems. GSM-Symbolic was designed to test whether LLM performance on GSM8K reflected real problem-solving ability or possible exposure to benchmark items during training. The idea was sensible: take 100 original GSM8K questions, generate 50 variants for each by changing names and numbers, and compare performance on the originals versus the variants.
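To make the design concrete, here is a minimal sketch of template-based variant generation. The template, the name list, and the sampling range are invented for illustration; GSM-Symbolic's actual templates and value ranges are not reproduced here.

```python
import random

# Hypothetical template and names; the benchmark's real templates differ.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"
NAMES = ["Ava", "Noor", "Mateo", "Kenji"]

def make_variants(n_variants, seed=0, low=2, high=99):
    """Return (question, answer) pairs that all share one solution structure."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        a, b = rng.randint(low, high), rng.randint(low, high)
        question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b)
        variants.append((question, a + b))  # ground truth follows the template's formula
    return variants

variants = make_variants(50)
```

Note that `low` and `high` are design decisions. Whoever picks the sampling range also picks how large the numbers in the variants get.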

If models truly solve the underlying problem structure, superficial variation should not hurt much. If they rely on memorized wording, familiar templates, or brittle pattern matching, performance should decline. This is a reasonable research design. The trouble begins when the observed decline is treated as a single phenomenon with a single cause.

The re-evaluation asks three practical questions:

| Question | What the paper tests | Why it matters |
| --- | --- | --- |
| Is the variant drop statistically reliable? | Reproduction using Generalised Linear Mixed Models with per-question random effects | Prevents over-reading noisy benchmark deltas |
| Did the variants change numeric difficulty? | A “large number effect” metric and distribution comparison | Separates arithmetic burden from reasoning failure |
| If a drop remains, what kind of failure is it? | Alternative natural-language and code prompt formats, plus error analysis | Replaces one blanket claim with model-specific failure profiles |

The first move is the most important. The paper treats each original question and its variants as related observations, not as thousands of independent little votes in a referendum on machine intelligence. That matters because some question templates are simply harder than others. If an evaluation ignores item-level variability, it risks confusing the quirks of a small question sample with a general law about LLM reasoning.

First evidence layer: only half the re-evaluated models show a significant variant effect

The authors re-evaluate 20 open-weight models. They exclude the four proprietary OpenAI models from the original GSM-Symbolic study for practical and economic reasons, and also omit Phi-3-small-128k-instruct because of technical difficulties. The resulting scope is therefore narrower than “all LLMs,” but still broad enough to test whether the reported phenomenon is uniform across accessible models.

Their key statistical tool is a Generalised Linear Mixed Model, or GLMM. The outcome is binary: correct or incorrect answer. The fixed effect is whether a question belongs to GSM-Base or GSM-Variants. The random effect is the question template, allowing each original problem family to have its own baseline difficulty.
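One common way to write such a model, assuming a logit link and a normally distributed random intercept per template (the notation is ours, not taken from the paper):

$$ \operatorname{logit} P(y_{ij}=1) = \beta_0 + \beta_1\,\text{variant}_{ij} + u_j, \qquad u_j \sim \mathcal{N}(0, \sigma_u^2) $$

Here $y_{ij}$ records whether the answer to instance $i$ of template $j$ was correct, $\text{variant}_{ij}$ is 1 for GSM-Variants and 0 for GSM-Base, and $u_j$ is the baseline difficulty of template $j$. A negative $\beta_1$ would mean lower odds of a correct answer on variants.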

In plain language, the model asks:

After accounting for the fact that some math problem templates are naturally harder than others, is being a symbolic variant still associated with lower odds of a correct answer?

That is a better question than “did the average score go down?” Average scores are easy to report and easy to overinterpret. They are also very good at looking confident while quietly ignoring uncertainty.

The first result cuts the headline in half. Out of 20 open-weight models, only 10 show a statistically significant variant effect when considered individually under the original GSM prompt. In the appendix, the authors add that the effect is significant in 50% of tested models under the standard threshold, but in only 10% after applying the Holm-Bonferroni correction across models.
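For readers who have not met Holm-Bonferroni, the step-down procedure is short enough to sketch. The p-values below are made up for illustration and are not the paper's results.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Step-down rule: the k-th smallest p-value is compared to alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject

# Invented p-values, one per hypothetical model.
p_values = [0.001, 0.012, 0.020, 0.040, 0.300]
uncorrected = [p <= 0.05 for p in p_values]
corrected = holm_bonferroni(p_values)
```

On these invented inputs, four of five tests pass the uncorrected 0.05 threshold and two survive the correction. The more models one tests, the stricter the bar for the smallest p-values becomes.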

Some drops are large. For example, phi-2 falls from 60.0% on GSM-Base to 39.7% on GSM-Variants, a raw drop of about 20.3 percentage points (reported as 20.26). Others are smaller or statistically weak. Gemma-7b shows almost no drop in the reproduction: 48.0% on GSM-Base versus 47.7% on GSM-Variants. Mistral-7B-v0.3 even improves slightly on the variants, from 33.0% to 35.5%.

This does not make the benchmark useless. It makes the benchmark less theatrical. The right reading is not “nothing happened.” The right reading is: “something happened for some models, but not uniformly enough to justify a blanket claim about LLMs.”

That distinction is not pedantry. For a business evaluating models, it changes the decision from category-level panic to model-level diagnosis.

Why per-question random effects change the meaning of the evidence

The GLMM is not just statistical decoration. It protects against a specific error: treating language materials as fixed when they should be treated as random samples from a larger population of possible tasks.

Suppose one original problem template is unusually fragile. Its variants may expose a model weakness, but that weakness may not generalize across the task family. Conversely, a model may fail badly on a few templates while behaving normally elsewhere. If all variant instances are treated as independent observations, the apparent evidence becomes artificially large. The model is not being judged by 5,000 fully independent tests; it is being judged by 100 template families with repeated variants.
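A rough way to see how much evidence is lost to clustering is Kish's design-effect approximation. This is a cruder device than the paper's GLMM, and the intra-cluster correlation values below are assumptions, not estimates from the benchmark.

```python
def effective_sample_size(n_clusters, cluster_size, icc):
    """Kish's approximation for equal-sized clusters: n_eff = n / (1 + (m - 1) * icc)."""
    n_total = n_clusters * cluster_size
    design_effect = 1 + (cluster_size - 1) * icc
    return n_total / design_effect

# 100 templates x 50 variants, as in the benchmark; the ICC values are assumed.
for icc in (0.0, 0.1, 0.3):
    print(icc, round(effective_sample_size(100, 50, icc)))
```

With no within-template correlation the 5,000 instances count in full. As the correlation rises, the effective sample shrinks toward the number of templates.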

This is why the paper’s statistical framing is business-relevant. Enterprises rarely care whether a model wins a benchmark argument in the abstract. They care whether the model will fail in their own workflows: claims processing, finance Q&A, legal retrieval, customer support, operational planning. Those workflows also contain repeated task families. Some are easy, some are pathological, and some look easy until a variable changes by one word.

A serious evaluation should therefore ask not only “what is the mean score?” but also:

| Evaluation habit | Weak version | Stronger version |
| --- | --- | --- |
| Benchmark comparison | Model A scored higher than Model B | Is the difference statistically reliable across item families? |
| Error review | The model failed several examples | Are failures concentrated in specific templates or distributed broadly? |
| Deployment inference | Benchmark drop means reasoning failure | Which failure mode predicts our actual workflow risk? |
| Vendor claim | “Robust reasoning” | Robust against what perturbation, under what uncertainty model? |
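The "concentrated or distributed" question can be checked with a few lines of bookkeeping. The sketch below assumes results arrive as `(template_id, is_correct)` pairs, which is a hypothetical format rather than the paper's.

```python
from collections import defaultdict

def accuracy_by_template(results):
    """results: iterable of (template_id, is_correct). Returns {template_id: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # template_id -> [n_correct, n_total]
    for template_id, is_correct in results:
        totals[template_id][0] += int(is_correct)
        totals[template_id][1] += 1
    return {t: correct / total for t, (correct, total) in totals.items()}

# Toy data: template "t3" fails every time, the others never fail.
results = [("t1", True)] * 4 + [("t2", True)] * 4 + [("t3", False)] * 4
per_template = accuracy_by_template(results)
overall = sum(ok for _, ok in results) / len(results)
```

The overall accuracy of roughly 67% says nothing about where the failures sit. The per-template view shows that one family accounts for all of them.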

The paper’s first contribution is therefore methodological before it is philosophical. It reminds the field that benchmark interpretation needs inference, not just arithmetic.

Second evidence layer: the variants changed the number distribution

The next evidence layer is more damaging to the simple story.

The original GSM-Symbolic interpretation leaned on the idea that the variants changed surface form without meaningfully changing the range of numbers. If true, then a performance drop would more plausibly reflect brittleness to symbolic variation. The re-evaluation challenges that assumption.

The authors extract numerically expressed integers from GSM-Base and GSM-Variants and compare their distributions. They find a systematic shift toward larger numbers in GSM-Variants. A two-sample Kolmogorov-Smirnov test confirms the difference, with a K-S statistic of 0.12 and $p < 0.001$.
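The K-S statistic itself is simply the largest vertical gap between two empirical cumulative distributions. A minimal sketch on toy integers (not the benchmark's data):

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in set(a) | set(b):  # the gap can only change at observed values
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Toy integers chosen so the second sample leans toward larger values.
base_numbers = [2, 3, 5, 8, 10, 12]
variant_numbers = [3, 5, 8, 12, 40, 95]
d = ks_statistic(base_numbers, variant_numbers)
```

This returns only the statistic. Turning it into a p-value requires the K-S distribution, which a statistics library such as SciPy (`scipy.stats.ks_2samp`) handles.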

To test whether this matters for performance, the authors define a large-number-effect metric. Conceptually, it sums the digit load of the numerically expressed integers in a question:

$$ L(q)=\sum_{n \in N(q)} \log_{10}(\max(|n|,1)) $$

The exact implementation ignores spelled-out numbers and fractions, takes absolute values for negative numbers, and treats zero as one before taking the logarithm. The intuition is simple: larger numbers tend to require more digits, interact differently with tokenization, appear less frequently in training data, and often make arithmetic harder.
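A minimal sketch of the metric, following the description above. The regular expression is our assumption about what counts as a "numerically expressed integer"; the paper's extraction code may differ.

```python
import math
import re

# Assumption: an integer is a digit run that is not part of a decimal or a fraction,
# so 2.5 and 3/4 are skipped. Thousands separators (1,000) are not handled.
INTEGER = re.compile(r"(?<!\d)(?<!\d[./])-?\d+(?!\d|[./]\d)")

def large_number_score(question):
    """Sum of log10(max(|n|, 1)) over the numerically expressed integers in the text."""
    integers = [int(token) for token in INTEGER.findall(question)]
    return sum(math.log10(max(abs(n), 1)) for n in integers)

score = large_number_score("Sam has 100 apples and 0 pears, then loses 1000.")
```

In the example, 100 contributes 2, 0 contributes 0 (it is treated as 1), and 1000 contributes 3. Spelled-out numbers such as "two" never match, so they contribute nothing.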

This is not a minor nuisance. The paper reports that larger numbers predict decreased performance in pooled GSM-Base and GSM-Variants questions for 6 models under the canonical GSM prompt, 8 models under the simple natural-language prompt, and 7 models under the structured natural-language prompt. Code prompts mostly remove the significance of the large-number effect, except for gemma-2-2b under the structured code prompt.

That pattern matters. If code prompts reduce the number effect, then at least part of the failure is arithmetic execution rather than problem-structure reasoning. A model can understand the solution path and still mishandle the calculation. Conversely, a model can perform arithmetic but fail to bind the right quantities. These are not the same failure. Treating them as one thing is analytically convenient and operationally lazy.

After controlling for the large-number effect, the residual variant effect disappears in just over half of the previously significant cases. In the remaining cases, the variant effect persists. This is exactly the kind of result that should make analysts slower, not louder.
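One plausible way to express "controlling for" the number effect is to add $L(q)$ as a second fixed effect in a mixed logistic model. The specification below is our sketch; the paper's exact model may differ.

$$ \operatorname{logit} P(y_{ij}=1) = \beta_0 + \beta_1\,\text{variant}_{ij} + \beta_2\,L(q_{ij}) + u_j, \qquad u_j \sim \mathcal{N}(0, \sigma_u^2) $$

Here $y_{ij}$ is correctness on instance $i$ of template $j$, $\text{variant}_{ij}$ flags GSM-Variants, and $u_j$ is a per-template random intercept. If $\beta_1$ loses significance once $\beta_2$ is in the model, the drop is better explained by number magnitude than by variant status as such.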

The benchmark is not debunked. It is decomposed.

Third evidence layer: prompt variants split one failure into several mechanisms

The paper then moves from “is there a reliable drop?” to “what could be causing it?” This part of the study is best read as diagnostic and exploratory, not as a second universal benchmark leaderboard.

The authors test four alternative prompt formats for models that showed a significant variant effect in the original reproduction:

| Experiment | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Simple natural-language prompt | Test sensitivity to the exact GSM-style phrasing | Whether the original few-shot wording contributes to the drop | That rewording alone solves reasoning weakness |
| Structured natural-language prompt | Provide explicit Given / To find / Solution scaffolding | Whether variable binding and state tracking improve when externally organized | That all structure helps all models |
| Simple code prompt | Offload arithmetic to generated Python execution | Whether arithmetic execution contributes to the variant drop | That program-aided reasoning fixes conceptual errors |
| Structured code prompt | Combine explicit structure with executable calculation | Whether structured reasoning plus tool use reduces failures | That more rigid format is always better |

The results are not tidy. Good. Tidy results in model evaluation often mean the interesting parts were swept under the rug.

For four models (Phi-3.5-mini-instruct, gemma-2-9b, Mathstral-7B-v0.1, and Mistral-7B-Instruct-v0.1), the variant effect becomes statistically insignificant across all four alternative prompt formats. For the remaining models, the pattern differs by prompt and model.

The paper identifies several likely failure profiles.

First, fragile variable binding. For gemma-2b, the variant effect remains under the simple reworded natural-language prompt but disappears under the structured natural-language prompt. The authors interpret this as evidence that the model benefits when the prompt forces it to organize entities, values, and target quantities explicitly. In business terms: the model may not need “more intelligence” as much as better scaffolding. Boring, useful, and therefore underrated.

Second, arithmetic limitations. For phi-2, the variant effect remains under natural-language adaptations but disappears with code prompts. For Phi-3.5-mini-instruct and Mistral-7B-Instruct-v0.1, the original GSM-prompt drop can be fully explained by sensitivity to larger numbers. This suggests that arithmetic burden, not lack of reasoning structure, is the main culprit for those cases.

Third, possible reliance on learned patterns. Gemma-2-9b and Mathstral-7B-v0.1 show the variant effect only under the original GSM prompt, while being relatively immune to the number effect. The authors do not find a better explanation than possible reliance on learned patterns from training data. That is not proof of memorization, but it is enough to keep the hypothesis alive.

Fourth, dual-task interference. Some models appear to struggle not because the math is impossible, but because the prompt asks them to solve the math while also following an unfamiliar response format. Meta-Llama-3-8B, for example, shows sharp accuracy drops and many empty answers under the natural-language prompt perturbations. Meta-Llama-3-8B-Instruct and gemma-7b-it show related but different sensitivities under structured natural-language and code formats.

This is the paper’s most useful diagnostic lesson: the same benchmark symptom can come from different mechanisms.

| Observed symptom | Possible mechanism | Operational response |
| --- | --- | --- |
| Drop vanishes under structured NL prompt | Weak variable binding or state tracking | Add extraction scaffolds, intermediate slots, schema-guided reasoning |
| Drop vanishes under code prompt | Arithmetic execution burden | Use tools, calculators, deterministic computation, validation checks |
| Drop appears only under original prompt | Prompt-pattern reliance or format sensitivity | Test prompt variants before claiming capability differences |
| Empty or malformed outputs increase under format changes | Dual-task interference | Reduce formatting burden or separate reasoning from formatting |
| Residual drop persists after number correction | Genuine variant sensitivity remains plausible | Run deeper item-level and mechanistic analysis |

Notice the practical difference. “The model lacks reasoning” is a diagnosis with no treatment plan. “The model loses variable bindings under surface perturbation” suggests a remedy. “The model fails arithmetic on larger integers” suggests a different remedy. “The model collapses under formatting constraints” suggests yet another.

A benchmark should not merely shame a model. It should tell you what to fix.

The appendix is not decoration; it protects the interpretation

The supplementary sections are important because they clarify which evidence is central and which evidence is diagnostic support.

The reproduction tables are main evidence. They show the first statistical narrowing: only half the models show an individually significant variant effect under the original GSM prompt, and far fewer survive multiple-comparison correction.

The alternative prompt experiments are diagnostic extensions. They are not meant to crown a new best prompt. They help isolate whether the original effect is sensitive to wording, structure, arithmetic offloading, or code-format burden.

The large-number analysis is both confound check and explanatory test. It directly challenges the assumption that GSM-Variants preserved numeric difficulty. It also tests whether number magnitude predicts accuracy after pooling GSM-Base and GSM-Variants.

The failure-mode appendix is qualitative and interpretive support. It defines error classes for natural-language and code outputs, then gives examples of looping calculations, implausible negative answers, empty responses, missing functions, syntax errors, name errors, and other execution failures. This section does not prove a complete cognitive theory of LLM failure. It gives texture to the statistical results: the failures are not all the same species.
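For code outputs, this kind of error bucketing can be automated. The sketch below is illustrative: the label names and the `solution` entry point are our assumptions, not the paper's taxonomy or harness.

```python
def classify_code_output(source, entry_point="solution"):
    """Run model-generated code and return (label, value). Labels are illustrative."""
    if not source.strip():
        return "empty_response", None
    namespace = {}
    try:
        exec(source, namespace)  # only ever run untrusted code inside a sandbox
    except SyntaxError:
        return "syntax_error", None
    except NameError:
        return "name_error", None
    except Exception:
        return "other_execution_error", None
    function = namespace.get(entry_point)
    if not callable(function):
        return "missing_function", None  # code ran but never defined the entry point
    try:
        return "ok", function()
    except NameError:
        return "name_error", None
    except Exception:
        return "other_execution_error", None

label, value = classify_code_output("def solution():\n    return 6 * 7")
```

Counting labels per model and per prompt format shows whether a drop comes from wrong answers or from code that never produced an answer at all.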

That distinction matters for readers. Appendix evidence often gets used badly. People either ignore it entirely or treat every appendix figure as another main result. Here, the better reading is: the appendix supports the diagnosis, not a separate thesis.

What the paper directly shows, and what Cognaptus infers for business use

The paper directly shows that GSM-Symbolic’s broad reasoning claim is statistically and mechanistically too coarse. The variant effect is not reliably present across all re-evaluated open-weight models. Some of the observed decline is plausibly explained by a shift toward larger numbers. The remaining failures vary by model and prompt format.

Cognaptus’s business inference is narrower but useful: benchmark deltas should be treated as diagnostic leads, not final verdicts. Before a company changes vendor strategy, product architecture, or internal AI policy based on a benchmark headline, it should ask what exactly the benchmark changed and whether the observed difference survives reasonable controls.

For enterprise evaluation, the paper suggests a simple governance checklist:

| Step | Question to ask | Reason |
| --- | --- | --- |
| Statistical reliability | Is the performance difference significant after accounting for item-level variability? | Avoids overreacting to noisy deltas |
| Confound search | Did the variant change hidden difficulty factors such as number size, length, ambiguity, or format burden? | Prevents mislabeling difficulty shifts as reasoning failures |
| Prompt sensitivity | Does the result persist across reasonable prompt formats? | Separates capability from interface fragility |
| Failure-mode diagnosis | Are errors arithmetic, binding, retrieval, formatting, or planning failures? | Turns evaluation into remediation |
| Workflow mapping | Does the diagnosed failure matter in the company’s actual use case? | Prevents benchmark theatre from replacing risk analysis |

This is especially relevant for AI procurement. A vendor can show a benchmark win; a critic can show a benchmark failure. Both may be true and both may be insufficient. The real question is whether the evaluation resembles the company’s task distribution, whether the result is statistically stable, and whether failure modes can be controlled through workflow design.

For example, if a model fails because of arithmetic execution, the fix may be tool use and verification. If it fails because of variable binding, the fix may be structured extraction and state tracking. If it fails because of dual-task interference, the fix may be to separate reasoning from output formatting. If it fails because of learned-pattern dependence, the fix may require broader data, fine-tuning, or model replacement.

These are different investments. Calling all of them “reasoning failure” is like calling every restaurant problem “food failure.” Technically possible. Not helpful.

The boundary: this is not a universal ranking of current models

The paper’s scope is deliberately limited. It evaluates 20 open-weight models, ranging from 2B to 27B parameters, and excludes proprietary frontier models. That means the result should not be used to rank today’s strongest commercial systems. It is not a procurement leaderboard.

There are also reproduction limitations. The authors note differences between their reproduced results and the original GSM-Symbolic study, with variant-delta differences reaching up to 24.1 percentage points in some cases. They name model versioning, checkpoint changes, and sensitivity to prompt formatting or decoding details as the likely causes. This is not a small administrative footnote. It is another reminder that benchmark results are artifacts of model version, prompt, decoding, and evaluation code, not tablets delivered from the GPU mountain.

The gemma-2-2b case also shows statistical caution in practice. The model has consistent small-to-moderate variant-effect estimates across prompt formats and optimizers, but some GLMM fits are degenerate because standard errors cannot be reliably computed. The authors report the issue and exclude the model from primary inferential analysis rather than rescuing one case with a different estimation framework. That is the correct kind of boring. Benchmark science could use more of it.

Finally, the paper’s mechanism labels are behavioral hypotheses. Terms such as variable binding, arithmetic limitation, learned-pattern reliance, and dual-task interference are interpretations supported by prompt behavior and error patterns. They are not mechanistic interpretability proofs. The authors are clear about this. Business readers should be clear too.

The better benchmark habit is diagnosis before drama

The strongest sentence to take from this paper is not “LLMs can reason.” It is also not “LLMs cannot reason.” Both are too broad for the evidence.

The stronger lesson is this: when a benchmark reveals a performance drop, the first task is not to write a philosophy essay about intelligence. The first task is to find out what changed, whether the change is statistically reliable, and which failure mechanism best explains the remaining signal.

That process is slower than a headline. It is also the only version useful for serious AI adoption.

A company choosing models for math-heavy workflows should not ask, “Did this benchmark prove reasoning?” It should ask:

Did the model fail because it misunderstood the problem?
Did it fail because it lost track of variables?
Did it fail because the numbers became harder?
Did it fail because the prompt format imposed too much extra work?
Did it fail only on a few unstable templates?
Can the failure be reduced by tools, scaffolds, or verification?

This is less glamorous than declaring the death or triumph of LLM reasoning. But it produces decisions that can survive contact with operations.

The benchmark drop is a signal. It is not the verdict.

Cognaptus: Automate the Present, Incubate the Future.

¹ Dominika Długosz, Arlindo Oliveira, and Natalia Díaz-Rodríguez, “The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic,” arXiv:2605.28700v2, 28 May 2026.
