Opening — Why this matters now
Everyone wants AI that doesn't just answer, but also explains, verifies, and corrects.
In education, finance, and operations, the next wave of value isn’t generation. It’s evaluation. Can your AI tell you why something is wrong—not just produce something that looks right?
A recent study on LLMs in math tutoring quietly exposes a problem most AI product teams would prefer to ignore: models that solve well do not necessarily assess well. And worse, they often fail exactly where businesses need them most—pinpointing errors.
If your roadmap includes “AI reviewer,” “AI auditor,” or “AI co-pilot with verification,” this gap is not academic. It’s operational risk.
Background — Context and prior art
Large Language Models have evolved from passive assistants to active reasoning agents. In math education alone, they are already used for:
- Solving problems
- Generating hints
- Grading solutions
- Diagnosing reasoning steps
This creates two distinct capability layers:
| Layer | Function | Analogy |
|---|---|---|
| Object-level | Solve the problem | Student answering a question |
| Meta-level | Evaluate reasoning | Teacher grading the solution |
Historically, research treated these separately. But in real systems—especially AI tutors, auditors, and decision assistants—these capabilities must coexist.
The paper reframes this relationship as a core question:
Does better problem-solving ability translate into better reasoning assessment?
Spoiler: Yes—but only partially. And the “partial” is where things get uncomfortable.
Analysis — What the paper actually tests
The study evaluates two models (GPT-4 and GPT-5) across two tasks on the same problems:
- Problem-solving task — generate the correct final answer
- Assessment task — identify the earliest incorrect step in a provided solution
This distinction is subtle but brutal.
Solving asks:
“Can you get the right answer?”
Assessment asks:
“Can you track someone else’s reasoning and identify exactly where it first breaks?”
That second task requires:
- Step tracking
- Logical consistency checks
- Temporal reasoning over steps
- Error localization precision
In other words: less creativity, more discipline.
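The assessment task has an unforgiving scoring rule: only the exact first incorrect step counts. A minimal sketch of that rule (the function name and step conventions are mine, not the paper's):

```python
from typing import Optional

def assessment_correct(pred: Optional[int], gold: Optional[int]) -> bool:
    """True only if the model names the exact first incorrect step,
    or correctly reports that the solution contains no error (None)."""
    return pred == gold

# 1-based step indices; None means "no error found".
print(assessment_correct(3, 3))        # True  — exact localization
print(assessment_correct(5, 3))        # False — flagging a later step still fails
print(assessment_correct(None, None))  # True  — correctly said "looks fine"
```

Note the all-or-nothing design: a reviewer that senses "something is off around step 4" when the flaw is at step 3 scores zero, which is exactly what makes this task harder than solving.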
Findings — The uncomfortable numbers
1. Solving and assessing are correlated—but not equivalent
| Model | Dataset | Solve Accuracy | Assessment Accuracy |
|---|---|---|---|
| GPT-4 | GSM8K | 94.9% | 46.4% |
| GPT-5 | GSM8K | 97.4% | 49.3% |
| GPT-4 | MATH | 29.8% | 34.5% |
| GPT-5 | MATH | 30.5% | 38.6% |
Yes, stronger models perform better at both tasks.
But notice the asymmetry:
- Solving accuracy is high (near-perfect on GSM8K)
- Assessment accuracy is mediocre at best
The real story appears when we split items by whether the model solved them correctly.
2. If the model solves correctly, it usually assesses better
| Scenario | Assessment Accuracy |
|---|---|
| Solved correctly | High (50–70%) |
| Solved incorrectly | Very low (6–25%) |
This confirms a key intuition:
Understanding the problem helps you critique it.
But it does not guarantee accurate critique.
3. The real bottleneck: detecting actual errors
| Metric | Range across models and datasets |
|---|---|
| No-error detection | 70%–97% |
| Error localization | 4%–10% |
| F1 score | ~9%–17% |
This is the critical asymmetry.
LLMs are very good at saying:
“Looks fine.”
They are very bad at saying:
“Step 3 is where everything breaks.”
If that sounds familiar, it should. Many human reviewers behave the same way.
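That "looks fine" bias is exactly what a detection F1 exposes. A sketch with hypothetical counts (the numbers below are illustrative, not from the paper), treating "solution contains an error" as the positive class:

```python
def detection_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over erroneous solutions: penalized both for missed errors (fn)
    and for crying wolf on clean solutions (fp)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Hypothetical reviewer biased toward "looks fine": of 50 flawed solutions
# it catches 5 (tp=5, fn=45) and wrongly flags 3 clean ones (fp=3).
f1 = detection_f1(tp=5, fp=3, fn=45)
print(round(f1, 3))  # 0.172 — high no-error accuracy, dismal F1
```

A reviewer can score well on no-error detection simply by defaulting to approval; F1 over the error class is what punishes that strategy.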
4. The “solve–assess gap” is massive
| Model | Dataset | Gap (Solve − Assess F1) |
|---|---|---|
| GPT-4 | GSM8K | 86.0 pts |
| GPT-5 | GSM8K | 88.2 pts |
| GPT-4 | MATH | 12.3 pts |
| GPT-5 | MATH | 16.9 pts |
Translation:
Being good at solving does not make you good at judging.
And the gap is not small—it’s structural.
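As a sanity check, the gap figures can be back-computed against the solve accuracies in the earlier table. The implied assessment F1 values (my back-calculation, not numbers the paper reports directly) land close to the ~9%–17% F1 range quoted above:

```python
# (model, dataset): (solve accuracy %, reported gap in points)
rows = {
    ("GPT-4", "GSM8K"): (94.9, 86.0),
    ("GPT-5", "GSM8K"): (97.4, 88.2),
    ("GPT-4", "MATH"):  (29.8, 12.3),
    ("GPT-5", "MATH"):  (30.5, 16.9),
}
for (model, ds), (solve, gap) in rows.items():
    # Gap = solve accuracy - assessment F1, so F1 = solve - gap.
    print(f"{model}/{ds}: implied assessment F1 ≈ {solve - gap:.1f}%")
```

The two numbers tell one consistent story: on GSM8K the gap is enormous because solving is nearly saturated while assessment F1 sits in single digits; on MATH the gap shrinks mainly because solving itself is weak.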
Implications — What this means for real systems
1. “AI reviewer” is harder than “AI generator”
Most AI products today are built on generation:
- Write code
- Summarize reports
- Answer questions
But enterprise value increasingly depends on:
- Checking correctness
- Detecting inconsistencies
- Auditing reasoning
This paper shows those are different capability stacks.
If you deploy an LLM as a reviewer without explicit design for verification, you are effectively trusting a system that:
- Can produce convincing outputs
- But struggles to diagnose subtle errors
That’s not automation. That’s delegation of risk.
2. Error localization is the real moat
The hardest part is not detecting that something is wrong.
It’s identifying:
Where it first went wrong
Why this matters:
- Education → targeted feedback
- Finance → root-cause analysis
- Operations → debugging workflows
Systems that master this will outperform those that merely flag issues.
3. Multi-agent systems aren’t optional—they’re structural
The findings implicitly justify a design shift:
| Role | Capability |
|---|---|
| Generator | Produces solution |
| Verifier | Checks correctness |
| Diagnoser | Localizes errors |
Trying to collapse all three into one model is convenient—but suboptimal.
This maps cleanly onto a layered agentic architecture:
- PG (Perception & Grounding)
- RWM (Reasoning)
- AE (Action)
What’s missing? A dedicated verification layer.
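A toy sketch of that separation, with a deterministic arithmetic checker standing in for both verifier and diagnoser (all names here are illustrative; the paper motivates the split but does not prescribe this design):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    ok: bool
    first_bad_step: Optional[int] = None  # 1-based; None when clean

def check(step: str) -> bool:
    """Toy step checker for lines like '5 * 4 = 21'."""
    lhs, rhs = step.split(" = ")
    return eval(lhs) == int(rhs)  # acceptable only for trusted toy input

def review(steps: list[str]) -> Verdict:
    # Verifier pass: is anything wrong at all? (the easy question)
    if all(check(s) for s in steps):
        return Verdict(ok=True)
    # Diagnoser pass: localize the *first* broken step (the hard question)
    bad = next(i for i, s in enumerate(steps, 1) if not check(s))
    return Verdict(ok=False, first_bad_step=bad)

print(review(["2 + 3 = 5", "5 * 4 = 21", "21 - 1 = 20"]))
# Verdict(ok=False, first_bad_step=2)
```

The design point: verification and diagnosis are separate interfaces with separate failure modes, so they can be staffed by separate components, each measured on its own metric, rather than hoping one generator model does all three jobs.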
4. Metrics need to evolve
Most benchmarks reward:
- Final answer correctness
But this paper shows that’s insufficient.
Better evaluation should include:
- Step-level accuracy
- Error localization precision
- Process consistency
Otherwise, we’re optimizing for output illusion, not reasoning reliability.
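A minimal contrast between the two scoring regimes (the step labels below are hypothetical; only the idea comes from the paper):

```python
def final_answer_score(answer_correct: bool) -> float:
    """The status quo: one bit, awarded at the end."""
    return 1.0 if answer_correct else 0.0

def process_score(step_valid: list[bool]) -> float:
    """Step-level signal: fraction of valid reasoning steps."""
    return sum(step_valid) / len(step_valid)

# A solution that lands on the right answer through a broken step:
steps = [True, False, True, True]
print(final_answer_score(True))  # 1.0  — looks perfect
print(process_score(steps))      # 0.75 — the flaw is visible
```

Final-answer scoring declares this solution flawless; a step-level metric surfaces the broken reasoning that an assessment-capable model would need to catch.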
Conclusion — The uncomfortable truth
The industry assumption is simple:
Smarter models → better everything
Reality is less convenient.
Stronger LLMs are indeed better at assessing—but only when they already understand the problem. And even then, they frequently fail at the one task that matters most: pinpointing the first mistake.
In other words:
LLMs can often tell you that something is wrong. They still struggle to tell you where it started.
That gap is where the next generation of AI systems will be built—or fail.
Cognaptus: Automate the Present, Incubate the Future.