Assessment

Judge, Jury, and Calibration: Why AI Evaluation Needs Anchors

TL;DR for operators: AI is becoming very good at producing judgement-shaped output. That is not the same thing as judgement. Two recent papers make the same operational point from different sides: one shows how AI can estimate educational item difficulty before response data are available; the other shows how LLM-generated peer reviews can look serious while diverging from human reviewing behaviour.[1][2] ...

When Solvers Become Judges (and Fail): Why LLMs Still Struggle to Critique Reasoning

Correction is the expensive part. Answer generation is already the familiar magic trick. Give a model a problem, ask for a solution, and admire the fluent staircase of reasoning. Sometimes the staircase even reaches the right floor. That is nice. Investors clap. Product managers update the roadmap. Somewhere, a slide says “AI tutor,” “AI reviewer,” or “autonomous verification layer.” ...

Count Us In: How Dual‑Agent LLMs Turn Math Slips into Teachable Moments

TL;DR for operators: Math tutoring is not a place where “sounds right” is a harmless product feature. The paper behind this article tests four LLMs (GPT-4o, OpenAI o1, DeepSeek-V3, and DeepSeek-R1) on generated arithmetic, algebra, and Diophantine tasks, then inspects not only final answers but the intermediate steps where mistakes appear.[1] The useful lesson is not “LLMs are bad at math.” That is now almost a decorative sentence. The useful lesson is sharper: some models fail by calculation, some by concept, some by excessive reasoning, and some improve dramatically when another agent challenges the work. For builders of AI tutors, graders, and formative assessment systems, this means reliability should be engineered as a workflow, not purchased as a model label. ...