When Solvers Become Judges (and Fail): Why LLMs Still Struggle to Critique Reasoning

Correction is the expensive part.

Answer generation is already the familiar magic trick. Give a model a problem, ask for a solution, and admire the fluent staircase of reasoning. Sometimes the staircase even reaches the right floor. That is nice. Investors clap. Product managers update the roadmap. Somewhere, a slide says “AI tutor,” “AI reviewer,” or “autonomous verification layer.”

Then comes the less glamorous question: can the same model inspect another person’s reasoning and identify the exact step where it first goes wrong?

That sounds like a small extension. It is not. It changes the job from solving to monitoring. A solver can walk its own path to an answer. A judge must follow someone else’s path, compare each step against the problem constraints, preserve the order of errors, and report the first failure rather than the most obvious later failure. In human terms, this is the difference between being a strong student and being a reliable teacher. The overlap is real. The equivalence is wishful thinking, which remains one of the industry’s cheaper renewable resources.

A recent paper by Liang Zhang, Yu Fu, and Xinyi Jin examines exactly this distinction using GPT-4 and GPT-5 on the GSM8K and MATH subsets of PROCESSBENCH.¹ The paper’s central result is not that stronger models are useless as assessors. The result is more interesting: math-solving success is strongly associated with better assessment performance, but assessment still breaks badly when the task requires localizing real errors.

That distinction matters for AI tutoring. It also matters for any business workflow where an LLM is expected to review reasoning, audit analysis, check calculations, or supervise another agent’s output. A model that can produce a good answer is not automatically a model that can diagnose a bad process.

Solving is object-level reasoning; assessment is process monitoring

The cleanest way to read the paper is through the mechanism gap.

Problem solving is an object-level task. The model receives a math problem and generates a final answer. It may reason step by step, use shortcuts, recover from a small detour, or land on the right answer through a route that is not especially teacher-friendly. The evaluation is simple: did the final answer match the gold answer?

Step-level assessment is a meta-level task. The model receives the original problem plus a provided solution trace. Its job is to identify the earliest erroneous step, and its prediction is scored against the benchmark’s human label. If the solution is fully correct, it should predict that no step is wrong.

The second task adds at least four burdens:

| Capability | What solving needs | What assessment adds |
| --- | --- | --- |
| Mathematical competence | Find a valid answer | Check whether each proposed step is valid |
| Step tracking | Use intermediate steps as needed | Preserve the sequence of someone else’s reasoning |
| Error judgment | Avoid mistakes in own solution | Detect mistakes in a given trace |
| Localization | Usually unnecessary after final answer | Identify the first step where reasoning fails |

This is why the paper’s mechanism-first framing is more useful than a scoreboard summary. A scoreboard says GPT-5 is better than GPT-4. Fine. That is not the expensive insight. The expensive insight is that the model’s assessment ability depends partly on its solving ability, but also on additional monitoring and localization skills that are not guaranteed by higher final-answer accuracy.

In business language: buying a better generator does not automatically buy a reliable reviewer.

What the experiment actually measures

The study uses the GSM8K and MATH subsets of PROCESSBENCH, a benchmark designed for identifying process errors in mathematical reasoning. Each item contains an original problem, a step-by-step solution trace, and a human label marking the earliest incorrect step. If all steps are correct, the label indicates no error.
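A minimal sketch of that item structure and of exact-match scoring, in Python. The names `TraceItem`, `first_error`, and `is_exact_match`, the 0-based step indexing, and the `-1` sentinel for a fully correct trace are assumptions made for illustration, and the toy problem is invented rather than taken from the benchmark.

```python
from dataclasses import dataclass
from typing import List

NO_ERROR = -1  # assumed sentinel meaning "every step is correct"


@dataclass
class TraceItem:
    """One assessment item: a problem, a segmented solution, and a label."""
    problem: str
    steps: List[str]
    first_error: int  # assumed 0-based index of the earliest wrong step, or NO_ERROR


def is_exact_match(predicted: int, item: TraceItem) -> bool:
    """True only when the predicted index equals the labeled index."""
    return predicted == item.first_error


# Invented toy item, not a benchmark trace.
item = TraceItem(
    problem="A shelf holds 3 boxes of 12 pens. How many pens are there?",
    steps=[
        "There are 3 boxes with 12 pens each.",
        "3 + 12 = 15 pens.",  # first wrong step: adds instead of multiplying
        "So the shelf holds 15 pens.",
    ],
    first_error=1,
)

print(is_exact_match(1, item))         # True: the earliest wrong step
print(is_exact_match(2, item))         # False: a downstream symptom
print(is_exact_match(NO_ERROR, item))  # False: approves a flawed trace
```

Under this scoring, pointing at step 2 earns nothing even though step 2 is also wrong, which is the strictness the rest of the article leans on.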

The authors use all 400 GSM8K items and a randomly sampled 400-item subset of MATH, giving 800 evaluation items. GPT-4 and GPT-5 are each placed in the same broad “expert math tutor” role and tested on two independent tasks over the same underlying problems:

Problem solving: answer the original problem.
Assessment: inspect a benchmark-provided solution trace and predict the earliest erroneous step.

Both tasks are repeated three times, and the paper reports mean results. Problem solving is evaluated by final-answer accuracy. Assessment is evaluated by exact-match accuracy on earliest-error prediction, reported separately for error-present and no-error cases and summarized as an assessment F1. The authors also compute a solve–assess gap:

$$ \text{Solve--assess gap} = \text{Problem-solving accuracy} - \text{Assessment F1} $$

That gap is useful because overall assessment accuracy can be deceptively comfortable. A model may look adequate if many traces are correct and it learns to say “all good.” The harder question is whether it can identify where an incorrect solution first becomes incorrect.
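The "all good" failure mode is easy to reproduce on paper. The sketch below splits exact-match accuracy by case type and applies it to a hypothetical reviewer that approves everything; the label mix (six clean traces, four flawed) and the `-1` sentinel are invented for illustration.

```python
from typing import Dict, List, Tuple

NO_ERROR = -1  # assumed sentinel meaning "every step is correct"


def split_accuracy(pairs: List[Tuple[int, int]]) -> Dict[str, float]:
    """pairs holds (predicted_index, labeled_index) for each trace."""
    def acc(rows: List[Tuple[int, int]]) -> float:
        return sum(p == y for p, y in rows) / len(rows) if rows else float("nan")

    flawed = [(p, y) for p, y in pairs if y != NO_ERROR]
    clean = [(p, y) for p, y in pairs if y == NO_ERROR]
    return {
        "overall": acc(pairs),
        "error_present": acc(flawed),
        "no_error": acc(clean),
    }


# Hypothetical label mix: 6 clean traces and 4 flawed ones.
labels = [NO_ERROR] * 6 + [0, 1, 2, 3]
always_approve = [(NO_ERROR, y) for y in labels]

print(split_accuracy(always_approve))
# {'overall': 0.6, 'error_present': 0.0, 'no_error': 1.0}
```

A blended 60% hides a reviewer that has never located a single error, which is why the split matters more than the headline number.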

The paper’s evidence is best read as follows:

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Figure 1: solving and assessment accuracy | Main performance baseline | GPT-5 performs above GPT-4, and assessment trails solving, especially on GSM8K | That better solving causally creates better assessment |
| Figure 2 and Table 1: assessment by solved-correct vs solved-incorrect items | Main association evidence | Within the same model, assessment is much higher on items it solved correctly | That solving and assessment are the same capability |
| Table 2: error-present vs no-error assessment | Error-sensitivity analysis | Models are far better at recognizing correct traces than localizing actual errors | That all forms of critique are equally weak |
| Table 3: solve–assess gap | Difficulty summary | Step-level assessment remains much harder than direct solving | That the gap is fixed or impossible to reduce |
| Qualitative mismatch cases | Mechanism explanation | Solving and assessing overlap but diverge in concrete cases | A broad taxonomy of all possible failure modes |

There are no elaborate ablation tables hiding a second thesis here. The paper is compact: it tests an association, shows the gap, and uses examples to explain why the gap exists.

The association is strong: models assess better when they can solve the item

The first major result is straightforward and important: when the same model solves a math item correctly, it is much more likely to assess the provided solution trace correctly.

On GSM8K, GPT-4’s assessment accuracy is 48.6% on items it solved correctly but only 6.6% on items it solved incorrectly. GPT-5 shows the same pattern: 50.2% versus 16.1%.

On MATH, the difference is also large. GPT-4 reaches 61.5% assessment accuracy on solved-correct items and 23.0% on solved-incorrect items. GPT-5 reaches 70.5% versus 24.6%.

The paper reports the effect sizes as differences in assessment accuracy between solved-correct and solved-incorrect groups:

| Model | Dataset | Assessment advantage when the model solved correctly |
| --- | --- | --- |
| GPT-4 | GSM8K | 42.0 percentage points |
| GPT-4 | MATH | 38.4 percentage points |
| GPT-5 | GSM8K | 34.1 percentage points |
| GPT-5 | MATH | 45.9 percentage points |
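These advantages can be recomputed from the solved-correct and solved-incorrect accuracies quoted above. The sketch below does only that subtraction. Three of the four rows match exactly; for GPT-4 on MATH the rounded inputs give 38.5 against a reported 38.4, a gap that would be consistent with the paper subtracting unrounded accuracies, though that explanation is a guess.

```python
# (solved-correct accuracy, solved-incorrect accuracy, reported advantage), all in percent
reported = {
    ("GPT-4", "GSM8K"): (48.6, 6.6, 42.0),
    ("GPT-4", "MATH"): (61.5, 23.0, 38.4),
    ("GPT-5", "GSM8K"): (50.2, 16.1, 34.1),
    ("GPT-5", "MATH"): (70.5, 24.6, 45.9),
}


def advantage(solved_correct: float, solved_incorrect: float) -> float:
    """Difference in assessment accuracy, in percentage points."""
    return round(solved_correct - solved_incorrect, 1)


recomputed = {key: advantage(a, b) for key, (a, b, _) in reported.items()}

for key, value in recomputed.items():
    print(key, "recomputed:", value, "reported:", reported[key][2])
```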

All reported confidence intervals exclude zero. The interpretation should be disciplined. This is a strong within-model association: the same model is better at assessment on items it can solve. That supports the practical intuition that mathematical understanding helps critique.

But it does not license the stronger claim that solving ability transfers cleanly into assessment ability. Shared item difficulty can affect both tasks. Some problems are simply easier for a model to understand, solve, and assess. The paper does not isolate a causal mechanism where training a solver necessarily produces a reliable assessor.
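The shared-difficulty point can be made concrete with a toy simulation. In the sketch below, each item has a latent difficulty that lowers both the chance of solving and the chance of assessing correctly, and the two outcomes are drawn independently given that difficulty. Every number here (the uniform difficulty, the `0.8` ceiling on assessment, the item count) is invented; nothing is fitted to the paper's data.

```python
import random
from typing import Tuple


def simulate(n_items: int = 20000, seed: int = 0) -> Tuple[float, float]:
    """Return assessment accuracy on solved items and on unsolved items."""
    rng = random.Random(seed)
    solved_hits = solved_n = unsolved_hits = unsolved_n = 0
    for _ in range(n_items):
        difficulty = rng.random()            # latent, shared by both tasks
        p_solve = 1.0 - difficulty
        p_assess = 0.8 * (1.0 - difficulty)
        solved = rng.random() < p_solve
        assessed = rng.random() < p_assess   # drawn independently of `solved`
        if solved:
            solved_n += 1
            solved_hits += int(assessed)
        else:
            unsolved_n += 1
            unsolved_hits += int(assessed)
    return solved_hits / solved_n, unsolved_hits / unsolved_n


on_solved, on_unsolved = simulate()
print(f"assessment accuracy on solved items:   {on_solved:.3f}")
print(f"assessment accuracy on unsolved items: {on_unsolved:.3f}")
```

Solving never influences assessment in this toy world, yet assessment accuracy still comes out clearly higher on solved items. A within-model association of the kind the paper reports is therefore compatible with, though not proof of, a common-cause story.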

For product teams, the correct takeaway is therefore narrow but useful: do not evaluate an AI tutor or reviewer only on aggregate math-solving benchmarks. If assessment is part of the product promise, test assessment directly on the same class of reasoning traces users will actually submit.

The uncomfortable part: real errors are where assessment collapses

The second result is the one that should bother anyone building “AI review” features.

The models are much better at recognizing no-error traces than at finding the earliest error in error-present traces. In Table 2, no-error accuracy ranges from 72.9% to 97.1%. Error-present accuracy ranges only from 4.7% to 10.0%.

That is not a rounding error. That is the product risk wearing a name tag.

| Model | Dataset | Error-present accuracy | No-error accuracy | Assessment F1 |
| --- | --- | --- | --- | --- |
| GPT-4 | GSM8K | 4.7% | 91.2% | 8.9% |
| GPT-4 | MATH | 10.0% | 72.9% | 17.5% |
| GPT-5 | GSM8K | 4.8% | 97.1% | 9.2% |
| GPT-5 | MATH | 7.4% | 87.4% | 13.6% |
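The article does not spell out how the F1 column is computed. One reading that fits the numbers is the harmonic mean of error-present accuracy and no-error accuracy; treat that as an assumption, not as the paper's stated definition. The sketch below recomputes the column under that assumption and compares it with the reported values.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative rates; 0.0 when both are zero."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


# (error-present accuracy, no-error accuracy, reported F1), all in percent
table_2 = {
    ("GPT-4", "GSM8K"): (4.7, 91.2, 8.9),
    ("GPT-4", "MATH"): (10.0, 72.9, 17.5),
    ("GPT-5", "GSM8K"): (4.8, 97.1, 9.2),
    ("GPT-5", "MATH"): (7.4, 87.4, 13.6),
}

differences = {}
for key, (err_acc, ok_acc, reported_f1) in table_2.items():
    recomputed = harmonic_mean(err_acc, ok_acc)
    differences[key] = abs(recomputed - reported_f1)
    print(key, "recomputed:", round(recomputed, 2), "reported:", reported_f1)
```

Under this assumption every row lands within about 0.1 points of the reported figure, which is what rounding of the inputs would produce. The harmonic mean also explains why a 97.1% no-error accuracy cannot rescue a 4.8% error-present accuracy: the weaker component dominates.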

The business interpretation is not “LLMs cannot assess.” That would be too broad. The sharper interpretation is that these models are much more comfortable confirming a clean solution than localizing the first failure in a dirty one.

That distinction matters because real assessment value usually comes from the dirty cases. A tutor that says a correct solution is correct is pleasant. A tutor that finds the first misconception in a wrong solution is useful. An auditor that confirms a clean spreadsheet is clean is fine. An auditor that identifies where a flawed assumption first contaminated the model is worth paying for.

This is also where overall accuracy can mislead. If a system performs well on no-error cases, a blended metric may look tolerable. But in formative assessment, the high-value task is not “approve correct work.” It is “diagnose incorrect work early enough to intervene.”

The paper’s F1 numbers make that asymmetry visible. On GSM8K, GPT-4 solves 94.9% of problems correctly, while its assessment F1 is 8.9%. GPT-5 solves 97.4%, while its assessment F1 is 9.2%. The solve–assess gaps are therefore 86.0 and 88.2 percentage points. That is not a small gap between cousins. That is a canyon with a nice UX demo built over it.

On MATH, the gaps are smaller—12.3 points for GPT-4 and 16.9 points for GPT-5—partly because solving accuracy itself is much lower, around 30%. But the assessment F1 remains low, between 13.6% and 17.5%. The harder dataset does not magically create a better assessor; it just makes both sides of the comparison less flattering.
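The arithmetic behind these gaps is a single subtraction, and running it backwards recovers the MATH solving accuracies that the text only describes as "around 30%". The sketch below uses only figures quoted in this article; the function name `solve_assess_gap` is illustrative.

```python
def solve_assess_gap(solving_accuracy: float, assessment_f1: float) -> float:
    """Solve-assess gap in percentage points."""
    return round(solving_accuracy - assessment_f1, 1)


# GSM8K: (solving accuracy, assessment F1), in percent
gsm8k = {"GPT-4": (94.9, 8.9), "GPT-5": (97.4, 9.2)}
gsm8k_gaps = {model: solve_assess_gap(s, f) for model, (s, f) in gsm8k.items()}
print(gsm8k_gaps)  # {'GPT-4': 86.0, 'GPT-5': 88.2}

# MATH: (reported gap, assessment F1); solving accuracy = gap + F1
math_reported = {"GPT-4": (12.3, 17.5), "GPT-5": (16.9, 13.6)}
math_implied_solving = {
    model: round(gap + f1, 1) for model, (gap, f1) in math_reported.items()
}
print(math_implied_solving)  # {'GPT-4': 29.8, 'GPT-5': 30.5}
```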

Why solving correctly still does not guarantee judging correctly

The qualitative examples are not decorative. They explain why the association is strong but incomplete.

In one GSM8K case, the model solves the original problem correctly but misidentifies the earliest error in a provided solution. It points to a later counting problem rather than the earlier structural mistake where the reasoning first goes off track. That is a revealing failure. The model has enough mathematical competence to reach the right answer, and even enough critique ability to notice a real error, but not enough process discipline to localize the first error.

In a MATH example involving quadrant signs, the model identifies the conceptual issue correctly—sine and cosine signs in the fourth quadrant—but assigns the issue to the wrong step index. Semantically, the critique is close. Under exact-match earliest-error evaluation, it is wrong.

This may look harsh, but the harshness is the point. In tutoring, the first wrong step matters because feedback should target the point of misunderstanding, not merely a downstream symptom. In auditing, the first invalid assumption matters because every later calculation may be arithmetically consistent and still contaminated. Late diagnosis is often just post-mortem with better formatting.
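Exact-match scoring collapses several different failures into one "wrong". A small sketch of how an evaluation could keep them apart follows; the category names, the `-1` sentinel, and the sample pairs are all illustrative and do not come from the paper.

```python
from collections import Counter

NO_ERROR = -1  # assumed sentinel meaning "every step is correct"


def classify(predicted: int, labeled: int) -> str:
    """Name the kind of outcome for one (prediction, label) pair."""
    if predicted == labeled:
        return "exact"
    if labeled == NO_ERROR:
        return "false_alarm"  # flagged a step in a clean trace
    if predicted == NO_ERROR:
        return "missed"       # approved a flawed trace
    # Both point at a step, but not the same one.
    return "late" if predicted > labeled else "early"


# Invented (predicted, labeled) pairs.
pairs = [(3, 1), (2, 2), (NO_ERROR, 0), (0, 4), (1, NO_ERROR), (NO_ERROR, NO_ERROR)]
tally = Counter(classify(p, y) for p, y in pairs)
print(dict(tally))
```

The GSM8K case above would land in `late`: a real error was found, but only after the reasoning had already gone wrong. Tracking that bucket separately shows whether a reviewer is blind to errors or merely slow to point at them.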

The reverse mismatch is also informative. The paper gives cases where GPT-4 fails to solve the original problem but correctly assesses a provided solution as error-free. In those cases, the model’s generative path fails, but its recognition of a coherent trace succeeds.

That tells us assessment is not merely solving repeated in another form. Sometimes generation fails while recognition works. Sometimes generation works while localization fails. The two abilities share mathematical understanding, but they route that understanding through different operational demands.

For system design, this means a single “reasoning score” is too crude. A useful evaluation suite should separate at least four capabilities:

| Capability | Example product question | Why it needs separate testing |
| --- | --- | --- |
| Solving | Can the model produce the right answer? | Final-answer accuracy can hide weak critique ability |
| Verification | Can the model decide whether a trace is valid? | No-error recognition is easier than error localization |
| Diagnosis | Can the model find the first wrong step? | This is where the paper shows severe weakness |
| Feedback | Can the model explain the issue in a way the learner can use? | The paper tests localization, not instructional quality |

The last row is especially important. This paper does not prove that models can or cannot generate good pedagogical feedback. It tests earliest-error prediction on curated benchmark traces. A deployed tutor must also explain the error, adapt to the learner, and avoid creating new confusion. As usual, the production system is where benchmark optimism goes to become support tickets.

The business implication: reviewer features need their own evaluation stack

The obvious market reading is education: AI tutors should not be certified by math-solving accuracy alone. But the result travels further because many enterprise AI systems follow the same pattern.

A company first uses an LLM to generate work: draft an analysis, write code, summarize a contract, produce a forecast note. Then it asks the model to review the work: check the logic, find inconsistencies, flag errors, validate reasoning. The second step feels like a cheap add-on because the same model interface can perform both roles.

The paper suggests that this convenience is architecturally suspicious.

What the paper directly shows is limited and specific: in math reasoning traces from PROCESSBENCH, GPT-4 and GPT-5 assess step-level reasoning more accurately when they solved the same item correctly, but they perform very poorly on error-present earliest-error localization.

What Cognaptus infers for business practice is broader but still bounded: systems that sell “AI review,” “AI tutor,” “AI analyst,” or “agent supervisor” should evaluate review ability as a separate product capability, not as a side effect of generation quality.

That implies three practical design moves.

First, build reviewer benchmarks around failure cases, not only clean examples. If most test cases are correct or easy to approve, the system may learn the most commercially convenient habit: nodding politely.

Second, measure localization, not only detection. “There is a problem somewhere” is often operationally weak. A useful reviewer should identify the earliest invalid assumption, the step where evidence no longer supports the conclusion, or the calculation where the result first diverges.

Third, separate roles when the workflow is high-stakes. A generator, verifier, and diagnoser may share the same foundation model, but they should not share the same evaluation logic. In more demanding systems, they may require different prompts, tools, retrieval contexts, symbolic checks, or even different model choices.

This is not a call for theatrical multi-agent architecture where three agents debate until the invoice looks intelligent. It is a call for capability decomposition. If the task has generation, verification, and diagnosis phases, the evaluation should reflect those phases.
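The second design move, measuring localization rather than only detection, is cheap to implement once traces carry a first-error label. The sketch below scores the same predictions both ways on flawed traces only; the function name, the `-1` sentinel, and the sample data are assumptions for illustration.

```python
from typing import List, Tuple

NO_ERROR = -1  # assumed sentinel meaning "every step is correct"


def detection_and_localization(pairs: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Score (predicted, labeled) pairs on flawed traces only.

    Detection: the reviewer flagged some step.
    Localization: the reviewer flagged the labeled first-error step.
    """
    flawed = [(p, y) for p, y in pairs if y != NO_ERROR]
    if not flawed:
        raise ValueError("need at least one error-present trace")
    detected = sum(p != NO_ERROR for p, _ in flawed)
    localized = sum(p == y for p, y in flawed)
    return detected / len(flawed), localized / len(flawed)


# Invented predictions: four flawed traces and one clean trace.
pairs = [(2, 1), (3, 3), (NO_ERROR, 0), (4, 2), (NO_ERROR, NO_ERROR)]
detection, localization = detection_and_localization(pairs)
print(detection, localization)  # 0.75 0.25
```

A reviewer can look competent on detection while locating the first failure only a quarter of the time. Reporting both numbers side by side keeps "there is a problem somewhere" from being mistaken for diagnosis.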

A simple deployment checklist for AI tutors and reviewers

For education platforms, assessment vendors, and enterprise workflow tools, the paper points to a practical checklist.

| Deployment question | Weak answer | Stronger answer |
| --- | --- | --- |
| Can the model solve the task? | Report final-answer accuracy only | Report final-answer accuracy by difficulty and topic |
| Can it assess reasoning? | Use general “judge” prompts on a few examples | Test step-level assessment on labeled traces |
| Can it handle wrong work? | Report blended accuracy | Split no-error and error-present cases |
| Can it localize the first error? | Ask whether the solution is correct | Require earliest-error identification or root-cause localization |
| Can it explain usefully? | Generate fluent feedback | Evaluate whether feedback targets the actual first misconception |
| Can it generalize to users? | Test curated benchmark traces only | Add authentic user traces, messy work, and partial reasoning |

The key pattern is simple: the evaluation should become more like the product promise. If the promise is “we solve,” test solving. If the promise is “we judge,” test judgment. If the promise is “we diagnose,” test diagnosis. This should not be controversial, but apparently the AI industry needs reminders in table form.

Boundaries: what this paper does not settle

The paper’s boundaries are important because they shape how far the business interpretation should travel.

First, the experiments use GPT-4 and GPT-5 on two PROCESSBENCH subsets. That is enough to show a meaningful pattern in the tested setting, not enough to rank all model families or all tutoring architectures.

Second, the solution traces are curated benchmark traces, not authentic student work. Real learners produce incomplete, ambiguous, handwritten, emotional, and occasionally chaotic reasoning. A benchmark trace is polite by comparison. It usually has the decency to be segmented into steps.

Third, exact-match earliest-error scoring is strict. The qualitative examples show that a model can identify the right conceptual issue but assign it to the wrong step. For benchmark evaluation, that is incorrect. For human tutoring, it may still be partially useful, depending on how the feedback is presented. A mature evaluation system would probably include both exact localization and graded critique quality.
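One hypothetical way to report exact localization alongside a more forgiving score is sketched below. The linear partial-credit rule and the `window` parameter are inventions for illustration; the paper uses exact match only, and index distance is at best a crude proxy for critique quality.

```python
NO_ERROR = -1  # assumed sentinel meaning "every step is correct"


def graded_score(predicted: int, labeled: int, window: int = 2) -> float:
    """Partial credit that decays linearly with distance from the labeled step.

    Confusing a clean trace with a flawed one always scores 0.0.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if predicted == NO_ERROR or labeled == NO_ERROR:
        return 1.0 if predicted == labeled else 0.0
    return max(0.0, 1.0 - abs(predicted - labeled) / window)


print(graded_score(3, 3))         # 1.0: exact localization
print(graded_score(4, 3))         # 0.5: right neighbourhood, wrong step
print(graded_score(6, 3))         # 0.0: too far from the first error
print(graded_score(NO_ERROR, 3))  # 0.0: approved a flawed trace
```

The quadrant-sign case would earn partial credit under a rule like this and zero under exact match. Keeping both scores avoids over-penalizing a near miss without relabeling it a success.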

Fourth, the within-model association between solving and assessment should not be interpreted as pure causal transfer. The same item may be easier or harder across both tasks. A model may solve and assess better on an item because the item itself is more transparent, not because solving success mechanically causes assessment success.

None of these boundaries weakens the practical message. They make it sharper. The paper does not say “LLMs cannot be assessors.” It says: do not assume assessor reliability from solver strength, and be especially careful with error-present localization.

The next moat is not answering; it is finding where the answer broke

The current generation of AI products is still heavily organized around output. Faster drafts. Better answers. Longer context. Cleaner summaries. More impressive demos. All useful, within limits.

But as AI systems move into tutoring, analysis, compliance, finance, operations, and agent supervision, the valuable question changes. It is no longer only “Can the model produce a plausible result?” It becomes “Can the model inspect a reasoning process and locate the first point of failure?”

That is a harder product. It requires models that can monitor, align, compare, and localize. It requires metrics that punish comfortable approval and reward precise diagnosis. It requires product teams to stop pretending that the reviewer role is a weekend prompt attached to the generator role.

The paper gives a compact empirical warning. GPT-4 and GPT-5 are much better assessors when they can solve the item, which confirms that mathematical understanding helps. But their performance on error-present traces remains extremely weak, which confirms that solving is not enough.

For AI tutors, the implication is direct: do not sell formative assessment unless you can test first-error diagnosis. For enterprise AI reviewers, the implication is broader: do not confuse fluent critique with reliable process supervision.

A solver can get to the right answer. A judge must know where the wrong answer began.

That second skill is where many AI systems will either become genuinely useful—or continue nodding confidently at mistakes, which is less intelligence than corporate meeting culture with a transformer attached.

Cognaptus: Automate the Present, Incubate the Future.

¹ Liang Zhang, Yu Fu, and Xinyi Jin, “Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?”, arXiv:2603.25633v1, 26 March 2026, https://arxiv.org/abs/2603.25633.
