Opening — Why this matters now
Everyone wants AI that doesn't just answer, but also explains, verifies, and corrects.
In education, finance, and operations, the next wave of value isn’t generation. It’s evaluation. Can your AI tell you why something is wrong—not just produce something that looks right?
A recent study on LLMs in math tutoring quietly exposes a problem most AI product teams would prefer to ignore: models that solve well do not necessarily assess well. And worse, they often fail exactly where businesses need them most—pinpointing errors.
If your roadmap includes “AI reviewer,” “AI auditor,” or “AI co-pilot with verification,” this gap is not academic. It’s operational risk.
Background — Context and prior art
Large Language Models have evolved from passive assistants to active reasoning agents. In math education alone, they are already used for:
- Solving problems
- Generating hints
- Grading solutions
- Diagnosing reasoning steps
This creates two distinct capability layers:
| Layer | Function | Analogy |
|---|---|---|
| Object-level | Solve the problem | Student answering a question |
| Meta-level | Evaluate reasoning | Teacher grading the solution |
Historically, research treated these separately. But in real systems—especially AI tutors, auditors, and decision assistants—these capabilities must coexist.
The paper reframes this relationship as a core question:
Does better problem-solving ability translate into better reasoning assessment?
Spoiler: Yes—but only partially. And the “partial” is where things get uncomfortable.
Analysis — What the paper actually tests
The study evaluates two models (GPT-4 and GPT-5) across two tasks on the same problems:
- Problem-solving task — generate the correct final answer
- Assessment task — identify the earliest incorrect step in a provided solution
This distinction is subtle but brutal.
Solving asks:
“Can you get the right answer?”
Assessment asks:
“Can you track someone else’s reasoning and identify exactly where it first breaks?”
That second task requires:
- Step tracking
- Logical consistency checks
- Temporal reasoning over steps
- Error localization precision
In other words: less creativity, more discipline.
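The assessment task has an unforgiving scoring rule: only the exact first incorrect step counts. A minimal sketch of that rule (the function name and step conventions are mine, not the paper's):

```python
from typing import Optional

def assessment_correct(pred: Optional[int], gold: Optional[int]) -> bool:
    """True only if the model names the exact first incorrect step,
    or correctly reports that the solution contains no error (None)."""
    return pred == gold

# 1-based step indices; None means "no error found".
print(assessment_correct(3, 3))        # True  — exact localization
print(assessment_correct(5, 3))        # False — flagging a later step still fails
print(assessment_correct(None, None))  # True  — correctly said "looks fine"
```

Note the all-or-nothing design: a reviewer that senses "something is off around step 4" when the flaw is at step 3 scores zero, which is exactly what makes this task harder than solving.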
Findings — The uncomfortable numbers
1. Solving and assessing are correlated—but not equivalent
| Model | Dataset | Solve Accuracy | Assessment Accuracy |
|---|---|---|---|
| GPT-4 | GSM8K | 94.9% | 46.4% |
| GPT-5 | GSM8K | 97.4% | 49.3% |
| GPT-4 | MATH | 29.8% | 34.5% |
| GPT-5 | MATH | 30.5% | 38.6% |
Yes, stronger models perform better at both tasks.
But notice the asymmetry:
- Solving accuracy is high (near-perfect on GSM8K)
- Assessment accuracy is mediocre at best
The real story appears when we split items by whether the model solved them correctly.
2. If the model solves correctly, it usually assesses better
| Scenario | Assessment Accuracy |
|---|---|
| Solved correctly | High (50–70%) |
| Solved incorrectly | Very low (6–25%) |
This confirms a key intuition:
Understanding the problem helps you critique it.
But it does not guarantee accurate critique.
3. The real bottleneck: detecting actual errors
| Metric | Range across models and datasets |
|---|---|
| No-error detection | 70%–97% |
| Error localization | 4%–10% |
| F1 score | ~9%–17% |
This is the critical asymmetry.
LLMs are very good at saying:
“Looks fine.”
They are very bad at saying:
“Step 3 is where everything breaks.”
If that sounds familiar, it should. Many human reviewers behave the same way.
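That "looks fine" bias is exactly what a detection F1 exposes. A sketch with hypothetical counts (the numbers below are illustrative, not from the paper), treating "solution contains an error" as the positive class:

```python
def detection_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over erroneous solutions: penalized both for missed errors (fn)
    and for crying wolf on clean solutions (fp)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Hypothetical reviewer biased toward "looks fine": of 50 flawed solutions
# it catches 5 (tp=5, fn=45) and wrongly flags 3 clean ones (fp=3).
f1 = detection_f1(tp=5, fp=3, fn=45)
print(round(f1, 3))  # 0.172 — high no-error accuracy, dismal F1
```

A reviewer can score well on no-error detection simply by defaulting to approval; F1 over the error class is what punishes that strategy.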
4. The “solve–assess gap” is massive
| Model | Dataset | Gap (Solve − Assess F1) |
|---|---|---|
| GPT-4 | GSM8K | 86.0 pts |
| GPT-5 | GSM8K | 88.2 pts |
| GPT-4 | MATH | 12.3 pts |
| GPT-5 | MATH | 16.9 pts |
Translation:
Being good at solving does not make you good at judging.
And the gap is not small—it’s structural.
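As a sanity check, the gap figures can be back-computed against the solve accuracies in the earlier table. The implied assessment F1 values (my back-calculation, not numbers the paper reports directly) land close to the ~9%–17% F1 range quoted above:

```python
# (model, dataset): (solve accuracy %, reported gap in points)
rows = {
    ("GPT-4", "GSM8K"): (94.9, 86.0),
    ("GPT-5", "GSM8K"): (97.4, 88.2),
    ("GPT-4", "MATH"):  (29.8, 12.3),
    ("GPT-5", "MATH"):  (30.5, 16.9),
}
for (model, ds), (solve, gap) in rows.items():
    # Gap = solve accuracy - assessment F1, so F1 = solve - gap.
    print(f"{model}/{ds}: implied assessment F1 ≈ {solve - gap:.1f}%")
```

The two numbers tell one consistent story: on GSM8K the gap is enormous because solving is nearly saturated while assessment F1 sits in single digits; on MATH the gap shrinks mainly because solving itself is weak.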
Implications — What this means for real systems
1. “AI reviewer” is harder than “AI generator”
Most AI products today are built on generation:
- Write code
- Summarize reports
- Answer questions
But enterprise value increasingly depends on:
- Checking correctness
- Detecting inconsistencies
- Auditing reasoning
This paper shows those are different capability stacks.
If you deploy an LLM as a reviewer without explicit design for verification, you are effectively trusting a system that:
- Can produce convincing outputs
- But struggles to diagnose subtle errors
That’s not automation. That’s delegation of risk.
2. Error localization is the real moat
The hardest part is not detecting that something is wrong.
It’s identifying:
Where it first went wrong
Why this matters:
- Education → targeted feedback
- Finance → root-cause analysis
- Operations → debugging workflows
Systems that master this will outperform those that merely flag issues.
3. Multi-agent systems aren’t optional—they’re structural
The findings implicitly justify a design shift:
| Role | Capability |
|---|---|
| Generator | Produces solution |
| Verifier | Checks correctness |
| Diagnoser | Localizes errors |
Trying to collapse all three into one model is convenient—but suboptimal.
This maps cleanly onto a layered agentic architecture:
- PG (Perception & Grounding)
- RWM (Reasoning)
- AE (Action)
What’s missing? A dedicated verification layer.
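A toy sketch of that separation, with a deterministic arithmetic checker standing in for both verifier and diagnoser (all names here are illustrative; the paper motivates the split but does not prescribe this design):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    ok: bool
    first_bad_step: Optional[int] = None  # 1-based; None when clean

def check(step: str) -> bool:
    """Toy step checker for lines like '5 * 4 = 21'."""
    lhs, rhs = step.split(" = ")
    return eval(lhs) == int(rhs)  # acceptable only for trusted toy input

def review(steps: list[str]) -> Verdict:
    # Verifier pass: is anything wrong at all? (the easy question)
    if all(check(s) for s in steps):
        return Verdict(ok=True)
    # Diagnoser pass: localize the *first* broken step (the hard question)
    bad = next(i for i, s in enumerate(steps, 1) if not check(s))
    return Verdict(ok=False, first_bad_step=bad)

print(review(["2 + 3 = 5", "5 * 4 = 21", "21 - 1 = 20"]))
# Verdict(ok=False, first_bad_step=2)
```

The design point: verification and diagnosis are separate interfaces with separate failure modes, so they can be staffed by separate components, each measured on its own metric, rather than hoping one generator model does all three jobs.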
4. Metrics need to evolve
Most benchmarks reward:
- Final answer correctness
But this paper shows that’s insufficient.
Better evaluation should include:
- Step-level accuracy
- Error localization precision
- Process consistency
Otherwise, we’re optimizing for output illusion, not reasoning reliability.
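A minimal contrast between the two scoring regimes (the step labels below are hypothetical; only the idea comes from the paper):

```python
def final_answer_score(answer_correct: bool) -> float:
    """The status quo: one bit, awarded at the end."""
    return 1.0 if answer_correct else 0.0

def process_score(step_valid: list[bool]) -> float:
    """Step-level signal: fraction of valid reasoning steps."""
    return sum(step_valid) / len(step_valid)

# A solution that lands on the right answer through a broken step:
steps = [True, False, True, True]
print(final_answer_score(True))  # 1.0  — looks perfect
print(process_score(steps))      # 0.75 — the flaw is visible
```

Final-answer scoring declares this solution flawless; a step-level metric surfaces the broken reasoning that an assessment-capable model would need to catch.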
Conclusion — The uncomfortable truth
The industry assumption is simple:
Smarter models → better everything
Reality is less convenient.
Stronger LLMs are indeed better at assessing—but only when they already understand the problem. And even then, they frequently fail at the one task that matters most: pinpointing the first mistake.
In other words:
LLMs can often tell you that something is wrong. They still struggle to tell you where it started.
That gap is where the next generation of AI systems will be built—or fail.
Cognaptus: Automate the Present, Incubate the Future.