
Count Us In: How Dual‑Agent LLMs Turn Math Slips into Teachable Moments

Large language models can talk through a solution like a star pupil and still get the answer wrong. A new study of four modern LLMs across arithmetic, algebra, and number theory shows where they stumble (mostly procedural slips), when they recover (with a second agent), and how teams should redesign AI tutors and graders to be trustworthy in the real world.

TL;DR for builders

Single models still flub arithmetic. Even strong general models mis-add partial products or mis-handle carries.
Reasoning-tuned models help, but not always. OpenAI o1 was consistently best; DeepSeek‑R1 “overthought” and missed basics.
Two agents beat one. Peer‑review style “dual agents” dramatically raised accuracy, especially on Diophantine equations.
Most errors are procedural, not conceptual. Think slips and symbolic manipulations, not deep misunderstandings.
Step‑labeling works. A simple rubric (Correct / Procedural / Conceptual / Impasse) localizes faults and boosts formative feedback.

What the paper really tested (and why that matters)

Most benchmarks hide easy leakage and memorized patterns. Here, the authors build three item models (templates that generate many variants) to stress the models beyond memorization: ...

August 16, 2025 · 5 min · Zelina

Learning to Struggle: Teaching LLMs to Code Like Real Students

What makes code feel like it was written by a student? Not just errors, but how they evolve. Not just style, but how it diverges from the polished norms. This week’s standout paper, ParaStudent, tackles a refreshingly underexplored challenge: teaching LLMs to write code the way a learning student does, in messy, iterative attempts full of hiccups and growth. Instead of building yet another high-performing code assistant, the authors fine-tune LLMs to mimic real students in an introductory CS class at UC Berkeley. The goal: replace idealized solutions with something plausibly human, an LLM that stumbles, recovers, and grows the way novices actually do when they write code. ...

July 19, 2025 · 3 min · Zelina

School of Thought: How Fine-Tuned Open LLMs Are Challenging the Giants in Education

Why rent a Ferrari when a fine-tuned e-bike can get you to class faster, cheaper, and on your own terms? That’s the question quietly reshaping AI in education, as shown by Solano et al. (2025) in their paper Narrowing the Gap. The authors demonstrate that with supervised fine-tuning (SFT), smaller open-source models like Llama-3.1-8B and Qwen3-4B can rival proprietary giants like GPT-4.1 when explaining C programming error messages to students. More strikingly, they achieve this with better privacy, lower cost, and pedagogical nuance that large models often overshoot. ...

July 9, 2025 · 3 min · Zelina

Divide and Conquer: How LLMs Learn to Teach

Designing effective lessons for training online tutors is no small feat. It demands pedagogical nuance, clarity, scenario realism, and learner empathy. A recent paper by Lin et al., presented at ECTEL 2025, offers a compelling answer to this challenge: use LLMs, but don’t ask too much at once. Their research reveals that breaking the task of lesson generation into smaller, well-defined parts significantly improves quality, suggesting a new collaborative model for scalable education design. ...

June 24, 2025 · 3 min · Zelina