Education

The Problem with Problems: Why LLMs Still Don’t Know What’s Interesting

Opening: Why this matters now
In an age when AI can outscore most humans at the International Mathematical Olympiad, a subtler question has emerged: can machines care about what they solve? The new study A Matter of Interest (Mishra et al., 2025) explores this psychological fault line between mechanical brilliance and genuine curiosity. If future AI partners are to co‑invent mathematics, not just compute it, they must first learn what humans deem worth inventing. ...

Smarter, Not Wiser: What Happens When AI Boosts Our Efficiency but Not Our Minds

Opening: Why this matters now
In a world obsessed with productivity hacks and digital assistants, a new study offers a sobering reminder: being faster is not the same as being smarter. As tools like ChatGPT quietly integrate into workplaces and classrooms, the question isn’t whether they make us more efficient (they clearly do) but whether they actually reshape the human mind. Recent findings from the Universidad de Palermo suggest they don’t. ...


Count Us In: How Dual‑Agent LLMs Turn Math Slips into Teachable Moments

Large language models can talk through a solution like a star pupil and still get the answer wrong. A new study of four modern LLMs across arithmetic, algebra, and number theory shows where they stumble (mostly procedural slips), when they recover (with a second agent), and how teams should redesign AI tutors and graders to be trustworthy in the real world.

TL;DR for builders
Single models still flub arithmetic. Even strong general models mis-add partial products or mis-handle carries.
Reasoning-tuned models help, but not always. OpenAI o1 was consistently best; DeepSeek‑R1 “overthought” and missed basics.
Two agents beat one. Peer‑review style “dual agents” dramatically raised accuracy, especially on Diophantine equations.
Most errors are procedural, not conceptual. Think slips and symbolic manipulations, not deep misunderstandings.
Step‑labeling works. A simple rubric (Correct / Procedural / Conceptual / Impasse) localizes faults and boosts formative feedback.

What the paper really tested (and why that matters)
Most benchmarks hide easy leakage and memorized patterns. Here, the authors build three item models (templates that generate many variants) to stress the models beyond memorization: ...

Learning to Struggle: Teaching LLMs to Code Like Real Students

What makes code feel like it was written by a student? Not just errors, but how they evolve. Not just style, but how it diverges from the polished norms. This week’s standout paper, ParaStudent, tackles a refreshingly underexplored challenge: teaching LLMs to generate code that learns like a student — messy, iterative, full of hiccups and growth. Instead of building yet another high-performing code assistant, the authors fine-tune LLMs to mimic real students in an introductory CS class at UC Berkeley. They call their framework ParaStudent. The goal: replace idealized solutions with something plausibly human — an LLM that stumbles, recovers, and grows in fidelity to how novices actually write code. ...

School of Thought: How Fine-Tuned Open LLMs Are Challenging the Giants in Education

Why rent a Ferrari when a fine-tuned e-bike can get you to class faster, cheaper, and on your own terms? That’s the question quietly reshaping AI in education, as shown by Solano et al. (2025) in their paper Narrowing the Gap. The authors demonstrate that with supervised fine-tuning (SFT), smaller open-source models like Llama-3.1-8B and Qwen3-4B can rival proprietary giants like GPT-4.1 when explaining C programming error messages to students. More strikingly, they achieve this with better privacy, lower cost, and pedagogical nuance that large models often overshoot. ...

Divide and Conquer: How LLMs Learn to Teach

Designing effective lessons for training online tutors is no small feat. It demands pedagogical nuance, clarity, scenario realism, and learner empathy. A recent paper by Lin et al., presented at ECTEL 2025, offers a compelling answer to this challenge: use LLMs, but don’t ask too much at once. Their research reveals that breaking the task of lesson generation into smaller, well-defined parts significantly improves quality, suggesting a new collaborative model for scalable education design. ...