TL;DR for operators

Math tutoring is not a place where “sounds right” is a harmless product feature. The paper behind this article tests four LLMs (GPT-4o, OpenAI o1, DeepSeek-V3, and DeepSeek-R1) on generated arithmetic, algebra, and Diophantine tasks, then inspects not only final answers but the intermediate steps where mistakes appear. [1]

The useful lesson is not “LLMs are bad at math.” That is now almost a decorative sentence. The useful lesson is sharper: some models fail by calculation, some by concept, some by excessive reasoning, and some improve dramatically when another agent challenges the work. For builders of AI tutors, graders, and formative assessment systems, this means reliability should be engineered as a workflow, not purchased as a model label.

OpenAI o1 is the standout single model in the study. It is flawless on 5-digit multiplication, perfect on the quadratic word problems, and strongest on the Diophantine tasks. But the reasoning-model story does not generalise cleanly: DeepSeek-R1 performs poorly on multiplication and trails DeepSeek-V3 in the tested settings. Apparently, “thinking longer” is not the same as calculating correctly. A shocking development, if one has never met a committee.

The paper’s most operational result is dual-agent collaboration. GPT-4o improves from a very weak solo arithmetic result (2/30) to 14/30 correct in the dual-agent configuration. DeepSeek-V3, when paired in the collaborative setup, reaches near-perfect performance on multiplication (29/30, reading the paper’s figure) and perfect performance on the Diophantine tasks. The right product takeaway is not to let two chatbots talk forever. It is to use structured peer review: propose, challenge, verify, reconcile, then compute externally when the task is arithmetic or symbolic.

The limits matter. This is not a classroom efficacy study. It uses three item models, ten instances per category, three independent runs, and a broad step-label rubric. Human verification of step labels is partial and not fully independent. So the paper is best read as a design signal for AI assessment architecture, not as proof that dual-agent tutors are ready to replace human teachers. Sensible, but less exciting for pitch decks. Tragic.

The familiar failure: the tutor explains beautifully, then adds badly

Every education platform wants the same magic trick: a tutor that answers instantly, explains patiently, personalises feedback, grades steps, and never sighs. LLMs are disturbingly good at the surface of that trick. They can narrate a solution with the composure of a model student. They can also mis-add a column, lose a carry, or propose a number-theory solution that collapses when substituted back into the original equation.

That distinction is the heart of Zhang and Graf’s paper. It is not primarily a benchmark race. It is a reliability inspection. The authors ask two questions that matter for instructional systems: do the models get the answer right, and what kind of error appears in the solution path when they do not?

This matters because education products rarely need only final answers. A math tutor has to diagnose where the learner’s reasoning changed direction. A grading assistant has to distinguish a procedural slip from a conceptual misunderstanding. A formative assessment system has to decide whether the next hint should say “check your arithmetic” or “you set up the equation incorrectly.” Those are different interventions. Treating them as the same error is how automated feedback becomes politely useless.

The study therefore supports four comparisons that are more useful than leaderboard applause:

| Comparison | What it tests | Why operators should care |
|---|---|---|
| Generated item models vs standard benchmarks | Whether models handle structured variants rather than familiar public examples | Reduces the comfort of memorised benchmark competence |
| Base models vs reasoning-enhanced models | Whether explicit reasoning training improves actual correctness | Tests the common assumption that reasoning branding equals safer tutoring |
| Single agent vs dual agent | Whether peer critique improves final answers | Points toward workflow design, not just model selection |
| Final answer vs step label | Whether correctness can be localised | Enables targeted feedback and assessment, not just pass/fail scoring |

The paper is comparison-based because the business question is comparative. Which failure mode should a product team defend against? Which architecture buys reliability? Which model label is meaningful, and which one is just a very expensive adjective?

The test design avoids the usual benchmark comfort blanket

The authors do not use a standard benchmark as the main testing ground. They use item models: structured templates that generate parallel math problems. That choice is more important than it first appears.

A standard benchmark can tell us whether a model performs well on a known collection. An item model asks whether the model can handle controlled variations of a problem type. That is much closer to real educational use, where a tutor may need to generate or solve many variants around the same skill.

The three tested task categories are deliberately plain and revealing:

  1. Multiplying two 5-digit whole numbers.
  2. Solving algebraic word problems involving two distinct two-digit integers with a given sum and product.
  3. Finding positive integer solutions to Diophantine equations of the form:
$$ p_1x^a = p_2y^b $$

where $p_1$ and $p_2$ are distinct small primes, and the exponents $a$ and $b$ are relatively prime and at most 9.
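The three item models above could be generated along these lines. This is a minimal sketch, not the authors’ generator: the list of “small primes”, the lower bound of 2 on the exponents, the prompt wording, and every function name are assumptions made for illustration.

```python
import math
import random

# Assumption: the paper says "small primes" without listing them.
SMALL_PRIMES = [2, 3, 5, 7, 11, 13]


def multiplication_item(rng):
    """Two 5-digit whole numbers and their exact product as the key."""
    a, b = rng.randint(10_000, 99_999), rng.randint(10_000, 99_999)
    return {"prompt": f"Compute {a} x {b}.", "key": a * b}


def quadratic_item(rng):
    """Two distinct two-digit integers, described only by sum and product."""
    m, n = rng.sample(range(10, 100), 2)
    prompt = (f"Two distinct two-digit integers have sum {m + n} "
              f"and product {m * n}. Find them.")
    return {"prompt": prompt, "key": {m, n}}


def diophantine_item(rng):
    """p1 * x^a = p2 * y^b with distinct primes and coprime exponents <= 9."""
    p1, p2 = rng.sample(SMALL_PRIMES, 2)
    while True:
        # Assumption: exponents start at 2; the source only gives the upper bound.
        a, b = rng.randint(2, 9), rng.randint(2, 9)
        if math.gcd(a, b) == 1:
            break
    prompt = f"Find positive integers x, y with {p1}x^{a} = {p2}y^{b}."
    return {"prompt": prompt, "params": (p1, a, p2, b)}


def generate(seed, n=10):
    """Build n instances per category (the study reports ten per category)."""
    rng = random.Random(seed)
    makers = {
        "multiplication": multiplication_item,
        "quadratic": quadratic_item,
        "diophantine": diophantine_item,
    }
    return {name: [make(rng) for _ in range(n)] for name, make in makers.items()}
```

Seeding the generator keeps each batch reproducible, which matters when the same items are replayed across models and runs.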

The first task is computationally dull but diagnostically brutal. A model that cannot reliably multiply two 5-digit numbers is not failing because the problem is philosophically deep. It is failing because token prediction is not arithmetic.

The second task tests whether the model can translate a word problem into a quadratic structure. It is closer to the kind of algebraic reasoning found in school tutoring products.

The third task is the interesting one for reasoning. The Diophantine equation has infinitely many possible satisfying pairs, so correctness is checked by substitution. The model only needs one valid pair, but it must construct it coherently. This is where procedural competence and conceptual representation start pulling in different directions.
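Checking by substitution is cheap to automate. The sketch below pairs a substitution check with one way to build a valid pair: write $x$ and $y$ as products of powers of the two primes and match exponents on each side. That construction is a standard exponent-matching argument offered here for illustration; it is not claimed to be the method used by the paper or by any of the models, and the function names are hypothetical.

```python
import math


def satisfies(p1, a, p2, b, x, y):
    """Proof-of-satisfaction: substitute (x, y) into p1 * x**a == p2 * y**b."""
    return x > 0 and y > 0 and p1 * x ** a == p2 * y ** b


def construct_solution(p1, a, p2, b):
    """Build one positive solution, assuming distinct primes and gcd(a, b) == 1.

    Write x = p1**u * p2**v and y = p1**s * p2**t. Matching exponents gives
      for p1:  1 + a*u == b*s
      for p2:  a*v == 1 + b*t
    Both are solvable in non-negative integers because a and b are coprime.
    """
    if p1 == p2 or math.gcd(a, b) != 1:
        raise ValueError("need distinct primes and coprime exponents")
    # Exponents are small, so a direct search is enough.
    s = next(k for k in range(1, a + 1) if (b * k - 1) % a == 0)
    u = (b * s - 1) // a
    v = next(k for k in range(1, b + 1) if (a * k - 1) % b == 0)
    t = (a * v - 1) // b
    return p1 ** u * p2 ** v, p1 ** s * p2 ** t


x, y = construct_solution(2, 2, 3, 3)   # the equation 2x^2 = 3y^3
print(x, y, satisfies(2, 2, 3, 3, x, y))  # 18 6 True
```

For $2x^2 = 3y^3$ the pair $(18, 6)$ gives $2 \cdot 324 = 648 = 3 \cdot 216$. A tutor only needs `satisfies` to grade an answer; the constructor is there to show that a coherent route to a valid pair exists.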

The experiment runs four models in a single-agent setting: GPT-4o, OpenAI o1, DeepSeek-V3, and DeepSeek-R1. It then tests a dual-agent setting using GPT-4o and DeepSeek-V3 as peer agents that interact, cross-check, and refine their reasoning. Each setup is repeated across three independent runs.

That is enough to expose patterns. It is not enough to crown a universal winner across all math, all curricula, and all deployment contexts. One should resist the urge to turn a small diagnostic study into a procurement oracle. Procurement already has enough theatre.

Reasoning models help, except when they do not

The cleanest single-model result belongs to OpenAI o1. On 5-digit multiplication, o1 answers all problems correctly across all three iterations. On the quadratic word problems, it also reaches perfect performance. On Diophantine equations, it leads the tested models with 25/30 correct.

That makes o1 the obvious strongest single agent in the study. But the more interesting result is the broken symmetry between the OpenAI and DeepSeek pairs.

For OpenAI, the reasoning-enhanced model beats its base counterpart sharply. GPT-4o is weak on multiplication, with only two correct answers overall in the solo setting, and it performs poorly on Diophantine equations with 8/30 correct. o1 is dramatically better across the same task families.

For DeepSeek, the pattern does not hold. DeepSeek-V3 performs strongly on multiplication, starting at 8/10 and then reaching perfect accuracy in later iterations, for 28/30 overall. DeepSeek-R1, despite being the reasoning-oriented sibling, achieves only four correct multiplication answers across the three runs. On the Diophantine task, DeepSeek-V3 scores 21/30, slightly ahead of DeepSeek-R1 at 20/30.

The authors describe DeepSeek-R1’s multiplication failures as an “overthinking” phenomenon: excessive reflection around intermediate steps that interferes with the basic execution needed to produce the correct result. That phrasing is useful because it punctures a lazy assumption. More reasoning text is not necessarily better reasoning. Sometimes the extra trace is a fog machine.

For AI education vendors, the lesson is simple: do not certify a model for math instruction because it has a reasoning brand. Validate it on your own item models, with your own skill taxonomy, and with error types that map to your feedback product. The model that narrates most elaborately may not be the model that computes most reliably.

The accuracy story changes by problem type

The study’s problem categories behave differently, which is exactly why the paper is more useful than a single average score.

On multiplication, execution dominates. GPT-4o is weak. DeepSeek-V3 is strong. o1 is perfect. DeepSeek-R1 struggles. Here the challenge is not deep mathematical abstraction but reliable procedural computation. This is the category that most clearly says: route arithmetic to a calculator or structured computation tool. Do not ask a language model to be a spreadsheet with opinions.
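Delegating the arithmetic can be as plain as comparing the model’s claimed product with exact integer arithmetic. The sketch below assumes the final answer is the last number in the model’s reply, which is a simplification; the helper names are hypothetical.

```python
import re


def parse_last_integer(text):
    """Return the last integer in a reply, ignoring thousands separators."""
    matches = re.findall(r"\d[\d,]*", text)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def check_product(a, b, model_answer):
    """Compare the model's claimed product with exact integer arithmetic."""
    claimed = parse_last_integer(model_answer)
    exact = a * b  # Python integers are arbitrary precision, so this is exact
    return {"claimed": claimed, "exact": exact, "ok": claimed == exact}


print(check_product(12345, 67890, "The product is 838,102,050."))
# {'claimed': 838102050, 'exact': 838102050, 'ok': True}
```

A production parser would need a stricter answer format, for example a dedicated final-answer field, so that a trailing step number is not mistaken for the result.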

On quadratic word problems, most models perform well. GPT-4o gets 9/10 in the first two iterations and 7/10 in the third, while DeepSeek-V3, o1, and DeepSeek-R1 all achieve perfect accuracy across the three runs. This task appears more aligned with LLM strengths: translating a problem into a familiar symbolic pattern, then solving.

On Diophantine equations, the ranking becomes more differentiated: o1 at 25/30, DeepSeek-V3 at 21/30, DeepSeek-R1 at 20/30, and GPT-4o at 8/30. This is where conceptual construction and verification matter. A model must propose positive integers that actually satisfy the equation. Fluency does not help if the pair fails substitution.

A compact reading:

| Task | Strongest signal | Operational reading |
|---|---|---|
| 5-digit multiplication | o1 perfect; DeepSeek-V3 strong; GPT-4o and DeepSeek-R1 weak | Use tools for arithmetic; do not trust fluent manual calculation |
| Quadratic word problems | DeepSeek-V3, o1, and DeepSeek-R1 perfect; GPT-4o mostly strong but not perfect | Familiar algebraic patterns are relatively safe, but still need final checks |
| Diophantine equations | o1 leads; GPT-4o weak; DeepSeek-V3 slightly ahead of R1 | Constructive reasoning needs proof-of-satisfaction, not just explanation |

The business interpretation is not “choose o1 and go home.” The interpretation is that task type determines the guardrail. Arithmetic needs computation delegation. Algebra needs representation checks. Number theory needs substitution proof. Different errors, different controls.

Step labels turn failure into product information

The paper’s second contribution is step-level error localisation. The authors segment solutions into discrete steps and label each step using a four-part rubric:

| Label | Meaning | Product implication |
|---|---|---|
| CC | Conditionally correct step, accounting for earlier errors | The reasoning step is locally sound |
| PE | Procedural error, such as transcription, arithmetic, or symbolic manipulation mistake | Ask the learner or model to recompute; use calculator/CAS verification |
| CE | Conceptual error, such as wrong representation or invalid principle | Trigger conceptual remediation, not just recalculation |
| IE | Impasse error, where the solver cannot proceed | Escalate hinting, worked example, or human review |

This taxonomy is modest, which is a strength and a weakness. It is simple enough to integrate into an assessment workflow. It is too broad to support a rich theory of mathematical misconception without further refinement.

The important operational move is that the labels make errors actionable. A final wrong answer gives a product team almost no information. A procedural error tells the system to verify arithmetic. A conceptual error tells the system the model or learner may be using the wrong representation. An impasse tells the system the reasoning path has stalled.

The paper reports that conditionally correct steps are the most frequent across the labelled cases, while procedural errors are especially prominent for GPT-4o on multiplication and Diophantine tasks. Conceptual errors appear notably in Diophantine settings for GPT-4o, DeepSeek-V3, and DeepSeek-R1. That pattern makes sense: arithmetic invites slips; number theory invites representation mistakes.

One subtle observation in the paper deserves more attention. The authors note cases where incorrect final answers appear even without clearly identifiable procedural or conceptual errors in the labelled steps. Their explanation is that LLMs are token-prediction systems rather than explicit numerical computation engines, leaving them vulnerable to subtle numerical inaccuracies. That is not a philosophical complaint. It is an architecture warning.

For product teams, the step-label layer should not be treated as a cosmetic explanation panel. It should be logged, audited, and used to route the next action.

A practical tutoring workflow might look like this:

Student or model solution
→ Step segmentation
→ CC / PE / CE / IE labels
    → If PE: recompute with external tool
    → If CE: generate conceptual hint
    → If IE: provide scaffold or escalation
→ Final answer requires proof-of-satisfaction
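The routing in that workflow can be sketched as a small lookup. The action names are hypothetical, and routing on the earliest non-CC step is an assumption of this sketch rather than something the paper prescribes.

```python
from enum import Enum


class Label(Enum):
    CC = "conditionally correct"
    PE = "procedural error"
    CE = "conceptual error"
    IE = "impasse error"


# Hypothetical action names; a real product would call its own handlers here.
ACTIONS = {
    Label.PE: "recompute_with_tool",
    Label.CE: "conceptual_hint",
    Label.IE: "scaffold_or_escalate",
}


def route(step_labels):
    """Return (step index, action) for the earliest step that is not CC.

    If every step is CC, the answer still goes to proof-of-satisfaction,
    since a clean trace does not guarantee a correct final answer.
    """
    for index, label in enumerate(step_labels):
        if label is not Label.CC:
            return index, ACTIONS[label]
    return None, "verify_final_answer"


print(route([Label.CC, Label.PE, Label.CE]))  # (1, 'recompute_with_tool')
print(route([Label.CC, Label.CC]))            # (None, 'verify_final_answer')
```

Routing on the first error keeps feedback anchored to the point where the solution went wrong, since later steps are judged conditionally on it.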

The “teachable moment” is not the model making a mistake. That part is easy. The teachable moment is the system knowing what kind of mistake it made.

Step grading is only as good as the model doing the grading

The paper includes a case study on step-label reliability. An expert human coder verified automated labels produced by GPT-4o and o1 on 70 solution steps from first-iteration multiplication problems. The authors used verification rather than a fully independent coding process because of time constraints, so this should not be read as a complete inter-rater reliability study.

Still, the result is meaningful. GPT-4o reaches fair agreement with the human coder, with Cohen’s $\kappa = 0.366$. o1 reaches substantial agreement, with $\kappa = 0.737$. The paper also reports 91.5% exact-match under LLM labelling with human verification.
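For readers who want to run the same agreement check on their own labels, Cohen’s $\kappa$ is $(p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance from each rater’s label frequencies. The sketch below implements that formula directly; the example labels are made up and are not the paper’s data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for illustration only.
human = ["CC", "CC", "PE", "PE"]
model = ["CC", "CC", "PE", "CC"]
print(cohens_kappa(human, model))  # 0.5
```

The adjectives “fair” and “substantial” follow the commonly used Landis and Koch bands, where 0.21 to 0.40 is fair and 0.61 to 0.80 is substantial.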

This is not just a model-ranking detail. It changes how one should build automated formative assessment.

If a weak math model grades a student’s steps, the product risks compounding the original problem: incorrect diagnosis layered on top of incorrect solution. If a stronger math model performs the annotation, the system has a better chance of identifying whether a student made a procedural slip or misunderstood the concept.

The tempting cost-saving architecture is to let the same cheap model solve, grade, and explain. The paper quietly argues against that. Solver, checker, and pedagogical explainer may need to be separate roles, possibly using different models and tools.

That separation is boring. It is also how reliable systems tend to look after the demo has grown up.

Dual agents improve accuracy by creating disagreement surfaces

The dual-agent part of the paper is the most directly product-relevant result. It tests whether two peer LLM agents can collaborate through chat-based discussion, cross-validation, and refinement.

The gains are substantial in the tested settings.

For multiplication, GPT-4o improves from only two correct solo answers to 14/30 in the dual-agent configuration. DeepSeek-V3 reaches near-perfect performance in the dual-agent arithmetic setting: the figure shows 10, 10, and 9 correct across the three iterations, or 29/30. The paper’s surrounding discussion emphasises strong improvement, though the plotted bars are the safer source for the exact per-iteration reading.

For quadratic word problems, both dual-agent configurations reach perfect accuracy across the three iterations.

For Diophantine equations, GPT-4o improves from 8/30 to 15/30. DeepSeek-V3 improves from 21/30 to 30/30.

The likely mechanism is not mystical “emergent intelligence.” It is error exposure. A second agent creates a disagreement surface. It can challenge a carry, question a substitution, or force a reconciliation step before the final answer is accepted. This is especially valuable for procedural errors, because many arithmetic slips are locally visible if another process is instructed to inspect the line rather than continue the narrative.

But dual-agent design should not mean letting two models improvise until the token budget begs for mercy. The collaboration has to be structured.

A production-grade dual-agent pattern would require:

| Stage | Required behaviour | What it catches or enables |
|---|---|---|
| Proposal | Agent A solves with numbered steps | Produces inspectable trace |
| Challenge | Agent B checks each step against the problem | Finds arithmetic, transcription, or representation errors |
| Reconciliation | Agents must resolve disagreements explicitly | Prevents silent contradiction |
| Tool check | Calculator/CAS verifies numeric or symbolic operations | Prevents fluent miscalculation |
| Proof-of-satisfaction | Final answer is substituted back into the original problem | Catches plausible but false answers |
| Logging | PE/CE/IE tags are stored for audit | Supports product improvement and instructor review |

This is where the business value sits. Dual agents are not valuable because “two AIs are smarter than one.” They are valuable because a structured second pass changes the failure surface.
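A structured second pass of this kind can be sketched as a bounded loop with an external gate at the end. This is an illustrative skeleton under stated assumptions, not the paper’s protocol: the agents are passed in as plain callables, no real model API is called, and the round limit of three is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class Outcome:
    answer: object
    accepted: bool
    log: list = field(default_factory=list)


def dual_agent_review(problem, propose, challenge, tool_check, max_rounds=3):
    """Propose, challenge, reconcile, then gate on an external check.

    propose(problem, objection) -> answer (objection is None on the first call)
    challenge(problem, answer)  -> objection text, or None if nothing to raise
    tool_check(problem, answer) -> bool, from a calculator/CAS or substitution
    """
    log = []
    answer = propose(problem, None)
    for round_no in range(1, max_rounds + 1):
        objection = challenge(problem, answer)
        if objection is None:
            break
        log.append(f"round {round_no}: {objection}")
        answer = propose(problem, objection)  # reconciliation is explicit
    # Agreement between agents is not acceptance; the tool has the last word.
    accepted = tool_check(problem, answer)
    log.append("tool check passed" if accepted else "tool check failed")
    return Outcome(answer, accepted, log)
```

Capping the rounds keeps latency and cost bounded, and the stored log gives an auditor the disagreement history rather than only the final number.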

What belongs to evidence, what belongs to design signal

The paper combines direct evidence, diagnostic analysis, and exploratory extension. Operators should separate these carefully.

| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Generated item models across arithmetic, algebra, and Diophantine tasks | Main evidence | Models behave differently across controlled task families | Universal math competence across curricula |
| Single-agent final-answer accuracy | Main evidence | o1 is strongest in this setup; reasoning labels do not guarantee superiority | Stable rankings across all models, prompts, or domains |
| CC/PE/CE/IE step labels | Diagnostic evidence | Errors can be localised into useful categories | A complete taxonomy of math misconceptions |
| Human verification of 70 labelled steps | Validation check | o1 labels align more strongly with expert judgement than GPT-4o labels | Fully independent reliability across all tasks |
| Dual-agent collaboration | Exploratory extension with strong design relevance | Peer cross-checking can improve accuracy | That any multi-agent system will work, or that gains survive all latency/cost constraints |
| Suggestions for calculators, spreadsheets, and CAS | Future-work design direction | Tool delegation is a plausible next step | Tested tool-integrated performance in this paper |

This table matters because product teams often treat every sentence in a paper as equal evidence. They are not equal. Some parts are demonstrated. Some parts are measured but narrow. Some parts are reasonable design extrapolation. Confusing those categories is how “research-informed” becomes “research-flavoured.”

The business value is cheaper diagnosis, not just higher accuracy

The obvious commercial use case is AI tutoring. A model solves a problem, explains steps, and gives hints. The paper suggests that this architecture is incomplete unless the system can diagnose its own mathematical work.

The less obvious use case is assessment. Step-level labels can convert a raw solution into evidence about skill. A student who sets up the right equation but miscalculates needs different feedback from a student who sets up the wrong equation cleanly. The first may need procedural practice. The second needs conceptual remediation. A grader that cannot tell the difference is grading the surface.

This has implications for product design:

  1. Use item models for internal evaluation. Do not rely only on public benchmarks. Generate controlled variants of the specific skills your platform teaches.

  2. Separate explanation from computation. Let LLMs explain reasoning, but route arithmetic and symbolic manipulation to calculators, spreadsheets, or computer algebra systems where appropriate.

  3. Treat step labels as telemetry. CC/PE/CE/IE rates should be monitored like product metrics. A rising procedural-error rate is not just a model issue; it is a reliability incident.

  4. Use dual-agent review selectively. Deploy it for high-stakes grading, difficult problem types, or when the first agent has low confidence. Do not apply it blindly to every trivial hint unless latency and cost are irrelevant, which they rarely are outside investor decks.

  5. Require proof-of-satisfaction. For tasks with checkable final answers, no answer should be accepted until it is substituted back or otherwise verified.

This is the operational reframing: the goal is not an AI tutor that never errs. The goal is an AI tutoring system that catches, classifies, and repairs errors before they become instruction.
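Treating step labels as telemetry can start very small. The sketch below computes label rates over a batch of labelled steps and flags a batch whose procedural-error rate crosses a threshold; the 10% threshold and the function names are placeholders, not values taken from the paper.

```python
from collections import Counter

LABELS = ("CC", "PE", "CE", "IE")


def label_rates(step_labels):
    """Share of each rubric label in a batch of labelled steps."""
    counts = Counter(step_labels)
    total = sum(counts[name] for name in LABELS)
    if total == 0:
        return {name: 0.0 for name in LABELS}
    return {name: counts[name] / total for name in LABELS}


def procedural_incident(step_labels, threshold=0.10):
    """Flag a batch whose PE rate exceeds the (placeholder) threshold."""
    return label_rates(step_labels)["PE"] > threshold


batch = ["CC"] * 8 + ["PE"] * 2
print(label_rates(batch)["PE"], procedural_incident(batch))  # 0.2 True
```

Tracked per skill and per model version, the same rates would show whether a change in prompts or models shifted errors from one category to another.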

The limits are real, and they define the safe use

The study is useful because it is specific. It is also limited because it is specific.

The dataset uses only three item models and ten generated instances per category. That is enough to expose failure patterns, not enough to cover the full space of school mathematics. The rubric is intentionally broad; more fine-grained categories could reveal different error subtypes. The human verification process for step labels is partial and not independent in the strongest methodological sense. The model set excludes other important commercial models, including more recent OpenAI and Anthropic systems. The paper also does not test classroom learning outcomes, student trust, teacher workflow, or long-term pedagogical effects.

These limits do not invalidate the result. They locate it.

The safest interpretation is:

| Directly shown | Reasonable business inference | Still uncertain |
|---|---|---|
| LLMs differ sharply by task type and model family | Evaluation should use task-specific item models | Whether rankings persist under other curricula and prompts |
| Procedural errors are frequent in weak settings | External computation tools should guard arithmetic | Best trigger policy for tool delegation |
| o1 gives more reliable step labels than GPT-4o in the checked sample | Stronger models may be better graders than cheaper solvers | Whether this holds across broader step taxonomies |
| Dual-agent collaboration improves accuracy for GPT-4o and DeepSeek-V3 in tested cases | Peer-review workflows can improve tutoring reliability | Cost, latency, and robustness in production |

That boundary is the difference between good engineering and PowerPoint optimism.

The replacement belief: not “better model,” but “better checking loop”

The wrong lesson from the paper is that education companies should simply buy the strongest model and call the problem solved. The paper does show that o1 is strong in this setup. It does not show that single-model tutoring is safe enough for assessment-heavy use.

The better lesson is architectural. Math tutoring systems need checking loops.

A strong model can propose. A second agent can challenge. A tool can compute. A rubric can localise errors. A proof-of-satisfaction step can verify the final answer. A teacher or auditor can inspect the trace when stakes are high.

That architecture will look less magical than a single glowing chat box. Good. Magic is a weak reliability strategy.

The paper’s quiet contribution is to move the conversation from “Can LLMs do math?” to “Where do they fail, how do we identify the failure, and what workflow catches it before a learner sees it?” That is a much more useful question. It is also harder to market in five words, which is usually how one knows it might be worth taking seriously.

For operators, the practical takeaway is blunt: do not trust a math tutor because it reasons aloud. Trust it only after it has been checked by another process, labelled at the step level, and forced to prove that the answer actually satisfies the problem.

Count us in, but count twice.

Cognaptus: Automate the Present, Incubate the Future.


  1. Liang Zhang and Edith Aurora Graf, “Mathematical Computation and Reasoning Errors by Large Language Models,” arXiv:2508.09932, 2025. https://arxiv.org/abs/2508.09932