
When Solvers Become Judges (and Fail): Why LLMs Still Struggle to Critique Reasoning

Correction is the expensive part. Answer generation is already the familiar magic trick. Give a model a problem, ask for a solution, and admire the fluent staircase of reasoning. Sometimes the staircase even reaches the right floor. That is nice. Investors clap. Product managers update the roadmap. Somewhere, a slide says “AI tutor,” “AI reviewer,” or “autonomous verification layer.” ...

March 27, 2026 · 15 min · Zelina

Count Us In: How Dual‑Agent LLMs Turn Math Slips into Teachable Moments

TL;DR for operators: Math tutoring is not a place where “sounds right” is a harmless product feature. The paper behind this article tests four LLMs (GPT-4o, OpenAI o1, DeepSeek-V3, and DeepSeek-R1) on generated arithmetic, algebra, and Diophantine tasks, then inspects not only final answers but the intermediate steps where mistakes appear.¹ The useful lesson is not “LLMs are bad at math.” That is now almost a decorative sentence. The useful lesson is sharper: some models fail by calculation, some by concept, some by excessive reasoning, and some improve dramatically when another agent challenges the work. For builders of AI tutors, graders, and formative assessment systems, this means reliability should be engineered as a workflow, not purchased as a model label. ...

August 16, 2025 · 17 min · Zelina