Dual Agents

TL;DR for operators Math tutoring is not a place where “sounds right” is a harmless product feature. The paper behind this article tests four LLMs—GPT-4o, OpenAI o1, DeepSeek-V3, and DeepSeek-R1—on generated arithmetic, algebra, and Diophantine tasks, then inspects not only final answers but the intermediate steps where mistakes appear.1 The useful lesson is not “LLMs are bad at math.” That is now almost a decorative sentence. The useful lesson is sharper: some models fail by calculation, some by concept, some by excessive reasoning, and some improve dramatically when another agent challenges the work. For builders of AI tutors, graders, and formative assessment systems, this means reliability should be engineered as a workflow, not purchased as a model label. ...