Large language models can talk through a solution like a star pupil—and still get the answer wrong. A new study of four modern LLMs across arithmetic, algebra, and number theory shows where they stumble (mostly procedural slips), when they recover (with a second agent), and how teams should redesign AI tutors and graders to be trustworthy in the real world.
TL;DR for builders
- Single models still flub arithmetic. Even strong general models mis-add partial products or mis-handle carries.
- Reasoning-tuned models help—but not always. OpenAI o1 was consistently best; DeepSeek‑R1 “overthought” and missed basics.
- Two agents beat one. Peer‑review style “dual agents” dramatically raised accuracy, especially on Diophantine equations.
- Most errors are procedural, not conceptual. Think slips and symbolic manipulations—not deep misunderstandings.
- Step‑labeling works. A simple rubric (Correct / Procedural / Conceptual / Impasse) localizes faults and boosts formative feedback.
What the paper really tested (and why that matters)
Most benchmarks are vulnerable to data leakage and memorized patterns. Here, the authors build three item models—templates that generate many variants—to stress the models beyond memorization:
- Multiply two 5‑digit integers. Brutally exposes carry/column addition mistakes.
- Quadratic word problems. Solve for two two‑digit integers given their sum and product.
- Diophantine form $p_1 x^a = p_2 y^b$ with small primes and coprime exponents—requires constructive number‑theory reasoning.
They run four models—GPT‑4o, DeepSeek‑V3, OpenAI o1, DeepSeek‑R1—both solo and in a dual‑agent setup (two peers cross‑check and refine). Crucially, they label every step with a lightweight rubric: CC (conditionally correct), PE (procedural error), CE (conceptual error), IE (impasse). This is exactly the grain educators and assessment teams need.
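To make the setup concrete, here is a minimal sketch of what an item-model generator and the step rubric could look like in code. The function and class names (`make_arithmetic_item`, `StepLabel`, and so on) are our own illustration of the idea, not the paper's implementation.

```python
import random
from enum import Enum

class StepLabel(Enum):
    """The paper's lightweight step rubric."""
    CC = "conditionally correct"
    PE = "procedural error"
    CE = "conceptual error"
    IE = "impasse"

def make_arithmetic_item(rng: random.Random) -> dict:
    # Item model 1: multiply two 5-digit integers.
    a, b = rng.randint(10_000, 99_999), rng.randint(10_000, 99_999)
    return {"prompt": f"Compute {a} * {b} step by step.", "answer": a * b}

def make_quadratic_item(rng: random.Random) -> dict:
    # Item model 2: recover two two-digit integers from their sum and product.
    x, y = rng.randint(10, 99), rng.randint(10, 99)
    return {
        "prompt": f"Two two-digit integers have sum {x + y} and product {x * y}. Find them.",
        "answer": sorted((x, y)),
    }

def make_diophantine_item(rng: random.Random) -> dict:
    # Item model 3: p1 * x**a = p2 * y**b with small primes and coprime exponents.
    p1, p2 = rng.sample([2, 3, 5, 7], 2)
    a, b = rng.choice([(2, 3), (3, 4), (2, 5)])  # coprime exponent pairs
    return {
        "prompt": f"Find positive integers x, y with {p1} * x**{a} = {p2} * y**{b}.",
        "params": (p1, a, p2, b),
    }

rng = random.Random(0)
items = [make_arithmetic_item(rng), make_quadratic_item(rng), make_diophantine_item(rng)]
```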
The punchline results, at a glance
1) Final‑answer accuracy (single agent)
- Arithmetic (5‑digit × 5‑digit): o1 was flawless; DeepSeek‑V3 converged to perfect after an initial miss; GPT‑4o was weakest; DeepSeek‑R1 struggled despite being the “reasoning” sibling.
- Quadratic word problems: o1, DeepSeek‑V3, and DeepSeek‑R1 were perfect; GPT‑4o dipped in the third iteration.
- Diophantine: o1 > DeepSeek‑V3 ≈ DeepSeek‑R1 ≫ GPT‑4o.
2) Dual‑agent collaboration
- Arithmetic: GPT‑4o jumped from almost useless to capable; DeepSeek‑V3 hit perfect.
- Quadratics: Both peer setups reached perfect.
- Diophantine: Big gains—especially for base models—when two agents cross‑validate.
Why? Two agents create disagreement surfaces that flush out slips (e.g., mis‑summing partial products) and force explicit verification.
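A minimal version of that cross-check loop might look like the sketch below. It assumes a hypothetical `ask(model, prompt)` helper that wraps your LLM API; the three-round protocol is our illustration of the idea, not the paper's exact setup.

```python
def dual_agent_solve(ask, model_a: str, model_b: str, problem: str) -> str:
    """Two peers solve independently, then cross-examine each other's work."""
    # Round 1: independent solutions create the "disagreement surface".
    sol_a = ask(model_a, f"Solve step by step, numbering each step:\n{problem}")
    sol_b = ask(model_b, f"Solve step by step, numbering each step:\n{problem}")

    # Round 2: the peer must accept or challenge specific numbered steps.
    critique = ask(
        model_b,
        f"Problem:\n{problem}\n\nPeer solution:\n{sol_a}\n\n"
        "For each numbered step, reply ACCEPT or CHALLENGE with a one-line reason.",
    )

    # Round 3: the first agent reconciles, re-deriving every challenged step.
    return ask(
        model_a,
        f"Problem:\n{problem}\n\nYour solution:\n{sol_a}\n\n"
        f"Peer critique:\n{critique}\n\nPeer's own solution:\n{sol_b}\n\n"
        "Resolve every CHALLENGE explicitly, then state the final answer.",
    )
```

The design choice that matters is forcing the critique to reference numbered steps: a vague "looks right to me" review never creates the disagreement that catches a mis-summed partial product.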
3) What kinds of mistakes?
- Procedural errors (PE) dominate. Transcription, carry, sign, and symbolic‑algebra slips explain most misses.
- Conceptual errors (CE) are rarer but decisive on number‑theory items for some models.
- “Overthinking” is real. R1 sometimes spirals in reflection without executing the needed arithmetic.
4) Can LLMs grade steps reliably?
- Yes, if the model is actually good at math. o1’s step labels achieved substantial agreement with a human rater (κ≈0.74) versus GPT‑4o’s fair (κ≈0.37). That’s a quiet but massive result for automated formative assessment.
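If you want to run the same audit on your own system, agreement between an LLM grader and a human rater on the CC/PE/CE/IE labels is just Cohen's kappa over paired step labels. A minimal sketch, with hypothetical label lists standing in for a real audit set:

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two raters assigning CC/PE/CE/IE labels to the same steps."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in set(freq_a) | set(freq_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical audit: human labels vs. an LLM grader's labels for ten steps.
human = ["CC", "CC", "PE", "CC", "CE", "CC", "PE", "CC", "CC", "IE"]
llm   = ["CC", "CC", "PE", "CC", "CC", "CC", "PE", "CC", "CC", "IE"]
print(f"kappa = {cohens_kappa(human, llm):.2f}")  # one disagreement -> kappa ~ 0.81
```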
A simple table you can act on
| Scenario | Arithmetic (5‑digit × 5‑digit) | Quadratic Word Problems | Diophantine $p_1 x^a = p_2 y^b$ |
|---|---|---|---|
| Best single model | o1 (consistently perfect) | o1 / V3 / R1 (all perfect) | o1 (top) |
| Weak link | GPT‑4o (frequent slips) | GPT‑4o (late dip) | GPT‑4o (low) |
| Reasoning sibling vs. base | R1 < V3 (over‑reflection) | R1 = V3 (both perfect) | R1 ≈ V3 |
| Dual‑agent lift | Big (V3 → perfect; 4o large jump) | Both → perfect | Big (esp. base models) |
Directional, based on the study’s three‑iteration protocol and per‑task counts; patterns are robust even if you vary item instances.
Design patterns we’d ship tomorrow
- Talk‑then‑Tool for arithmetic. Force explicit column‑wise multiplication and route all numeric ops to a calculator or CAS. Require the model to include a checksum (e.g., multiply in two decompositions; see the sketch after this list) before finalizing.
- Dual‑Agent Peer Review (DAPeR). Spin up a second, diverse base model to cross‑examine derivations. Require a reconciliation turn where each agent must point to concrete lines (line numbers or equation references) it challenges or accepts.
- Rubric‑Aware Prompting. Make CC/PE/CE/IE labels first‑class:
  - At each step, the model self‑tags its step type.
  - The peer must either confirm the tag with evidence or downgrade it with a counterexample.
  - Log these tags for instructors and for automated hints.
- Arithmetic Guardrails.
  - Carry Checks: When partial sums are present, enforce "carry ledger" tables.
  - Sign Discipline: Require a final sign audit before simplification.
  - Symbolic Sanity: Before factoring/expanding, restate the exact identity being applied.
- Proof‑of‑Satisfaction for Diophantine. After proposing $(x, y)$, the model must substitute back and print the equality in prime‑factor exponents (see the sketch after this list). No print, no pass.
- Fail‑Closed UI for tutors. If PE/CE exceeds a threshold in any window of steps, the tutor halts, surfaces the suspect step, and asks the learner to co‑repair—turning an AI failure into a teachable moment.
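Two of these guardrails reduce to a few lines of code. Below is a minimal sketch, with our own function names, of the two-decomposition checksum behind Talk-then-Tool and the prime-factor-exponent check behind Proof-of-Satisfaction:

```python
def two_way_product_check(a: int, b: int, claimed: int) -> bool:
    """Checksum for Talk-then-Tool: recompute the product via two decompositions."""
    a_hi, a_lo = divmod(a, 1000)          # decomposition 1: split a into high/low parts
    path_1 = a_hi * 1000 * b + a_lo * b
    b_hi, b_lo = divmod(b, 1000)          # decomposition 2: split b instead
    path_2 = a * b_hi * 1000 + a * b_lo
    return path_1 == path_2 == claimed    # the model's claimed product must match both

def prime_exponents(n: int) -> dict:
    """Factor n into prime exponents (fine for the small numbers in these items)."""
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def proof_of_satisfaction(p1: int, a: int, p2: int, b: int, x: int, y: int) -> bool:
    """Substitute (x, y) back into p1 * x**a == p2 * y**b and compare prime exponents."""
    lhs, rhs = prime_exponents(p1 * x**a), prime_exponents(p2 * y**b)
    print(f"LHS exponents: {lhs}\nRHS exponents: {rhs}")  # no print, no pass
    return lhs == rhs

# Example: 2 * x**3 = 3 * y**2 is satisfied by x = 6, y = 12 (both sides equal 432).
assert proof_of_satisfaction(2, 3, 3, 2, 6, 12)
```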
What this means for edtech and enterprise
- Edtech tutoring: Use dual agents + rubric logging to transform shaky single‑agent tutors into reliable co‑pilots. Your biggest win is catching procedural slips before they mis-teach.
- Assessment: Reasoning‑strong models can grade steps with human‑level reliability on well‑scoped math. That unlocks scalable formative assessment and targeted hints.
- Finance/engineering copilots: Anywhere numbers matter, single‑agent LLMs are risky. Adopt Talk‑then‑Tool + DAPeR to keep narratives honest.
- Model shopping: Don’t assume the “reasoning” sibling is better. Validate on your item models; watch for “overthinking without execution.”
A tiny implementation sketch
Pipeline: Item Model Generator → Two Agents (Diverse) → Step‑by‑Step with self‑labels → Calculator/CAS calls for numeric ops → Reconciliation turn → Proof‑of‑Satisfaction checks → Teacher/Student UI with expandable step logs and PE/CE heatmaps.
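A skeleton of that wiring, with the stages passed in as callables so the sketch stays self-contained; the stage names and signatures here are our own assumptions:

```python
from typing import Callable, Iterable

def run_pipeline(
    generate_items: Callable[[], Iterable],   # item-model generator
    solve: Callable[[str], str],              # dual-agent solve + reconciliation turn
    route_tools: Callable[[str], str],        # calculator/CAS pass for numeric ops
    label_steps: Callable[[str], list],       # extract CC/PE/CE/IE self-tags
    verify: Callable[[dict, str], bool],      # proof-of-satisfaction checks
) -> list:
    """Wire the pipeline stages into one pass over the generated items."""
    logs = []
    for item in generate_items():
        transcript = route_tools(solve(item["prompt"]))
        logs.append({
            "item": item,
            "steps": label_steps(transcript),     # feeds the PE/CE heatmaps in the UI
            "verified": verify(item, transcript),
        })
    return logs
```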
Metrics to monitor:
- Final accuracy by item model and iteration
- PE rate per 100 steps (by error subtype)
- Label agreement (LLM vs. human) on a rolling audit
- Dual‑agent disagreement rate and resolution time
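The first and last of those metrics reduce to simple counts over the logged step labels and peer-review verdicts; a minimal sketch with hypothetical inputs:

```python
def pe_rate_per_100_steps(step_labels: list) -> float:
    """Procedural-error rate per 100 labeled steps."""
    return 100.0 * step_labels.count("PE") / len(step_labels) if step_labels else 0.0

def disagreement_rate(verdicts: list) -> float:
    """Share of peer-review verdicts that were CHALLENGE rather than ACCEPT."""
    return verdicts.count("CHALLENGE") / len(verdicts) if verdicts else 0.0

# Hypothetical run: labels from one transcript, verdicts from its peer review.
print(pe_rate_per_100_steps(["CC", "CC", "PE", "CC", "PE", "CC"]))     # ~33.3
print(disagreement_rate(["ACCEPT", "ACCEPT", "CHALLENGE", "ACCEPT"]))  # 0.25
```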
Open questions we should keep asking
- How much of the dual‑agent lift is diversity vs. simple redundancy?
- What’s the optimal cadence for tool calls without killing latency?
- Can we pre‑train error‑spotter heads that specialize in PE/CE detection and attach them to any model?
- How do we calibrate confidence when procedural slips remain undetected?
Cognaptus: Automate the Present, Incubate the Future