Opening — Why this matters now

Everyone wants AI that doesn’t just answer—but explains, verifies, and corrects.

In education, finance, and operations, the next wave of value isn’t generation. It’s evaluation. Can your AI tell you why something is wrong—not just produce something that looks right?

A recent study on LLMs in math tutoring quietly exposes a problem most AI product teams would prefer to ignore: models that solve well do not necessarily assess well. And worse, they often fail exactly where businesses need them most—pinpointing errors.

If your roadmap includes “AI reviewer,” “AI auditor,” or “AI co-pilot with verification,” this gap is not academic. It’s operational risk.

Background — Context and prior art

Large Language Models have evolved from passive assistants to active reasoning agents. In math education alone, they are already used for:

  • Solving problems
  • Generating hints
  • Grading solutions
  • Diagnosing reasoning steps

This creates two distinct capability layers:

| Layer | Function | Analogy |
|---|---|---|
| Object-level | Solve the problem | Student answering a question |
| Meta-level | Evaluate reasoning | Teacher grading the solution |

Historically, research treated these separately. But in real systems—especially AI tutors, auditors, and decision assistants—these capabilities must coexist.

The paper reframes this relationship as a core question:

Does better problem-solving ability translate into better reasoning assessment?

Spoiler: Yes—but only partially. And the “partial” is where things get uncomfortable.

Analysis — What the paper actually tests

The study evaluates two models (GPT-4 and GPT-5) across two tasks on the same problems:

  1. Problem-solving task — generate the correct final answer
  2. Assessment task — identify the earliest incorrect step in a provided solution

This distinction is subtle but brutal.

Solving asks:

“Can you get the right answer?”

Assessment asks:

“Can you track someone else’s reasoning and identify exactly where it first breaks?”

That second task requires:

  • Step tracking
  • Logical consistency checks
  • Temporal reasoning over steps
  • Error localization precision

In other words: less creativity, more discipline.

Findings — The uncomfortable numbers

1. Solving and assessing are correlated—but not equivalent

| Model | Dataset | Solve Accuracy | Assessment Accuracy |
|---|---|---|---|
| GPT-4 | GSM8K | 94.9% | 46.4% |
| GPT-5 | GSM8K | 97.4% | 49.3% |
| GPT-4 | MATH | 29.8% | 34.5% |
| GPT-5 | MATH | 30.5% | 38.6% |

Yes, stronger models perform better in both tasks.

But notice the asymmetry:

  • Solving is high confidence (near-perfect on GSM8K)
  • Assessment is mediocre at best

The real story appears when we split items by whether the model solved them correctly.

2. If the model solves correctly, it usually assesses better

| Scenario | Assessment Accuracy |
|---|---|
| Solved correctly | High (50–70%) |
| Solved incorrectly | Very low (6–25%) |

This confirms a key intuition:

Understanding the problem helps you critique a solution to it.

But it does not guarantee accurate critique.

3. The real bottleneck: detecting actual errors

| Metric | Performance |
|---|---|
| No-error detection | 70%–97% |
| Error localization | 4%–10% |
| F1 score | ~9%–17% |
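To see why no-error detection and error localization can diverge so sharply, here is one plausible scoring scheme (our assumption, not necessarily the paper's exact protocol): a localization counts only on an exact match with the annotated first-error index, and "no error" verdicts are scored as their own class.

```python
# Hypothetical scoring sketch. preds/golds are lists of first-error
# indices, with None meaning "the solution contains no error".
# A localization is credited only on an exact index match (our assumption).

def score(preds, golds):
    tp = fp = fn = 0                      # over items that contain an error
    no_error_hits = no_error_total = 0    # over error-free items
    for p, g in zip(preds, golds):
        if g is None:
            no_error_total += 1
            if p is None:
                no_error_hits += 1
            else:
                fp += 1                   # flagged an error that isn't there
        elif p == g:
            tp += 1                       # found the exact first-error step
        elif p is None:
            fn += 1                       # missed a real error entirely
        else:
            fp += 1                       # flagged the wrong step...
            fn += 1                       # ...and missed the right one
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "no_error_detection": no_error_hits / no_error_total if no_error_total else 0.0,
        "localization_recall": recall,
        "f1": f1,
    }

print(score(preds=[None, 2, None, 5], golds=[None, 2, 3, 4]))
```

Under this scheme a model can post near-perfect no-error detection simply by saying "looks fine" often, while its F1 on real errors collapses; that is the same shape as the numbers in the table above.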

This is the critical asymmetry.

LLMs are very good at saying:

“Looks fine.”

They are very bad at saying:

“Step 3 is where everything breaks.”

If that sounds familiar, it should. Many human reviewers behave the same way.

4. The “solve–assess gap” is massive

| Model | Dataset | Gap (Solve − Assess F1) |
|---|---|---|
| GPT-4 | GSM8K | 86.0 pts |
| GPT-5 | GSM8K | 88.2 pts |
| GPT-4 | MATH | 12.3 pts |
| GPT-5 | MATH | 16.9 pts |

Translation:

Being good at solving does not make you good at judging.

And the gap is not small—it’s structural.

Implications — What this means for real systems

1. “AI reviewer” is harder than “AI generator”

Most AI products today are built on generation:

  • Write code
  • Summarize reports
  • Answer questions

But enterprise value increasingly depends on:

  • Checking correctness
  • Detecting inconsistencies
  • Auditing reasoning

This paper shows those are different capability stacks.

If you deploy an LLM as a reviewer without explicit design for verification, you are effectively trusting a system that:

  • Can produce convincing outputs
  • But struggles to diagnose subtle errors

That’s not automation. That’s delegation of risk.

2. Error localization is the real moat

The hardest part is not detecting that something is wrong.

It’s identifying:

Where it first went wrong

Why this matters:

  • Education → targeted feedback
  • Finance → root-cause analysis
  • Operations → debugging workflows

Systems that master this will outperform those that merely flag issues.

3. Multi-agent systems aren’t optional—they’re structural

The findings implicitly justify a design shift:

| Role | Capability |
|---|---|
| Generator | Produces solution |
| Verifier | Checks correctness |
| Diagnoser | Localizes errors |

Trying to collapse all three into one model is convenient—but suboptimal.

This aligns with the layered agentic architecture direction many teams are already pursuing:

  • PG (Perception & Grounding)
  • RWM (Reasoning)
  • AE (Action)

What’s missing? A dedicated verification layer.

4. Metrics need to evolve

Most benchmarks reward:

  • Final answer correctness

But this paper shows that’s insufficient.

Better evaluation should include:

  • Step-level accuracy
  • Error localization precision
  • Process consistency

Otherwise, we’re optimizing for output illusion, not reasoning reliability.
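One simple way to operationalize step-level accuracy (our formulation, not necessarily the paper's exact metric) is the fraction of steps whose predicted correct/incorrect label matches a gold annotation:

```python
# Sketch: step-level accuracy as the fraction of steps whose predicted
# correct/incorrect label matches the gold label (a hypothetical metric
# formulation, used here to illustrate why it differs from final-answer
# correctness).

def step_accuracy(pred_labels: list[bool], gold_labels: list[bool]) -> float:
    assert len(pred_labels) == len(gold_labels)
    hits = sum(p == g for p, g in zip(pred_labels, gold_labels))
    return hits / len(gold_labels)

# A grader that rubber-stamps every step still scores well on a
# mostly-correct solution, yet misses the one step that matters:
print(step_accuracy([True] * 5, [True, True, False, True, True]))  # → 0.8
```

This is exactly why step-level metrics must be paired with localization precision: a high step accuracy can coexist with zero ability to find the first mistake.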

Conclusion — The uncomfortable truth

The industry assumption is simple:

Smarter models → better everything

Reality is less convenient.

Stronger LLMs are indeed better at assessing—but only when they already understand the problem. And even then, they frequently fail at the one task that matters most: pinpointing the first mistake.

In other words:

LLMs can often tell you that something is wrong. They still struggle to tell you where it started.

That gap is where the next generation of AI systems will be built—or fail.

Cognaptus: Automate the Present, Incubate the Future.