Opening — Why this matters now

One‑to‑one tutoring is education’s gold standard—and its most stubborn bottleneck. Everyone agrees it works. Almost no one can afford it at scale. Into this gap steps generative AI, loudly promising democratized personalization and quietly raising fears about hallucinations, dependency, and cognitive atrophy.

Most debates about AI tutors stall at ideology. This paper does something rarer: it runs an in‑classroom randomized controlled trial and reports what actually happened. No synthetic benchmarks. No speculative productivity math. Just UK teenagers, real maths problems, and an AI model forced to earn its keep under human supervision.

Background — Context and prior art

Prior research already established two uncomfortable truths:

  1. Tutoring works extremely well—meta‑analyses routinely show gains equivalent to months of additional schooling.
  2. Unconstrained genAI in education can backfire, harming learning when it shortcuts thinking or fabricates explanations.

What has been missing is a middle ground: not AI replacing tutors, but AI embedded inside a pedagogically disciplined system, with humans retaining final authority. The LearnLM–Eedi collaboration tests exactly this configuration.

Analysis — What the paper actually does

The study ran an exploratory RCT with 165 UK secondary‑school students (ages 13–15) across five schools. Students were randomly assigned to one of three support modes when they made a mistake:

  • Static hints (pre‑written, misconception‑specific)
  • Human tutor via chat
  • LearnLM‑assisted tutoring, where the AI drafted responses and an expert tutor approved, edited, or rewrote every message

Crucially, students could not tell whether their tutor was human‑only or AI‑assisted. Pedagogically, LearnLM was constrained to a Socratic, non‑revealing style, guided by detailed prompts that included the student’s misconception, ability level, and curriculum context.

This is not “ChatGPT helping with homework.” It is AI operating inside a learning‑science‑driven scaffold, under explicit human control.
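To make that scaffold concrete, here is a minimal sketch of how such a constrained tutoring prompt could be assembled. The `StudentContext` fields, the template wording, and the `build_tutor_prompt` helper are illustrative assumptions; the paper describes the prompt's ingredients (misconception, ability level, curriculum context) but does not publish its actual prompt text.

```python
from dataclasses import dataclass

@dataclass
class StudentContext:
    """Context the study says was passed to the model (field names are assumed)."""
    misconception: str      # the diagnosed misconception behind the mistake
    ability_level: str      # e.g. "Year 9, mid attainment"
    curriculum_topic: str   # where the question sits in the scheme of work

def build_tutor_prompt(ctx: StudentContext, student_answer: str) -> str:
    """Assemble a Socratic, non-revealing tutoring prompt (illustrative template)."""
    return (
        "You are a maths tutor. Never state the final answer.\n"
        "Ask one short Socratic question at a time that targets the misconception.\n"
        f"Topic: {ctx.curriculum_topic}\n"
        f"Student ability: {ctx.ability_level}\n"
        f"Diagnosed misconception: {ctx.misconception}\n"
        f"Student's incorrect answer: {student_answer}\n"
        "Draft a reply for a human tutor to approve, edit, or rewrite."
    )

# Example: a student who adds numerators and denominators when adding fractions
draft_prompt = build_tutor_prompt(
    StudentContext(
        misconception="adds numerators and denominators when adding fractions",
        ability_level="Year 9, mid attainment",
        curriculum_topic="addition of fractions with unlike denominators",
    ),
    student_answer="1/2 + 1/3 = 2/5",
)
print(draft_prompt)
```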

Findings — Results that matter (with numbers)

The results are more interesting than either hype or skepticism would predict.

Immediate correction

| Intervention | Correct on retry |
| --- | --- |
| Static hints | ~65% |
| Human tutor | ~91% |
| LearnLM (supervised) | ~93% |

Interactive tutoring dominates static hints. No surprise there. More notable: AI‑assisted tutoring is statistically indistinguishable from human tutoring for immediate mistake correction.

Misconception resolution (within topic)

| Intervention | Resolved misconception |
| --- | --- |
| Static hints | ~87% |
| Human tutor | ~95% |
| LearnLM (supervised) | ~95% |

Again, parity. AI does not underperform when tightly constrained and supervised.

Knowledge transfer (the hard part)

This is where the paper becomes genuinely provocative.

| Intervention | Success on next topic |
| --- | --- |
| Static hints | ~56% |
| Human tutor | ~61% |
| LearnLM (supervised) | ~66% |

Students supported by LearnLM were 5.5 percentage points more likely to solve problems in a new topic than those tutored by humans alone. Bayesian analysis assigns a 93.6% probability that this advantage is real.

That is not a rounding error. It suggests something structural about how the AI conducts dialogue.
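To make the statistic concrete, here is a minimal Beta‑Binomial sketch of how a probability like that can be estimated by Monte Carlo. The arm sizes below are invented placeholders, not the study's, so the output is illustrative rather than a reproduction of the 93.6% figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_a_beats_b(successes_a, n_a, successes_b, n_b, draws=200_000):
    """P(rate_A > rate_B) under independent Beta(1, 1) priors (Beta-Binomial model)."""
    post_a = rng.beta(1 + successes_a, 1 + n_a - successes_a, draws)
    post_b = rng.beta(1 + successes_b, 1 + n_b - successes_b, draws)
    return (post_a > post_b).mean()

# Placeholder arm sizes; the paper reports ~66% vs ~61% transfer success.
n_ai, n_human = 55, 55   # assumed, not the real arm sizes
p = prob_a_beats_b(round(0.66 * n_ai), n_ai, round(0.61 * n_human), n_human)
print(f"P(LearnLM-assisted transfer rate > human-only rate) = {p:.3f}")
```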

Why might AI outperform humans here?

Tutor interviews offer a clue. LearnLM consistently generated clean, disciplined Socratic questioning—sometimes better than tutors’ habitual styles. Several tutors reported learning new pedagogical techniques themselves by supervising the model.

Humans, by contrast, are excellent at empathy and pacing—but sometimes shortcut explanation once a student appears to “get it.” The AI, lacking social intuition, stubbornly insists on conceptual grounding. Annoying? Occasionally. Effective for transfer? Apparently, yes.

The human role did not disappear

Importantly, 25% of AI drafts were edited or rewritten. Not because the maths was wrong (factual error rate: 0.1%), but because:

  • Students were getting frustrated
  • Tone felt artificial
  • The model pushed one question too far

Human tutors supplied judgment, emotional calibration, and permission to move on. This is not a failure of the system—it is the design principle.

Scalability — The quiet business implication

A post‑hoc operational simulation suggests that supervised AI raises tutor throughput from ~35 to ~41 sessions per hour and cuts cost per session by ~14%, even after accounting for token costs.

This is the real lever: not replacing tutors, but turning one expert into a higher‑bandwidth educational operator.
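The arithmetic behind that claim is easy to sketch. Only the ~35 and ~41 sessions-per-hour figures come from the paper; the tutor hourly rate and per-session token cost below are invented assumptions.

```python
def cost_per_session(tutor_hourly_rate: float, sessions_per_hour: float,
                     token_cost_per_session: float = 0.0) -> float:
    """Tutor labour cost per session, plus any model token cost."""
    return tutor_hourly_rate / sessions_per_hour + token_cost_per_session

# Assumed inputs: £30/hour tutor rate, £0.01 of model tokens per AI-assisted session.
human_only = cost_per_session(30.0, 35)
ai_assisted = cost_per_session(30.0, 41, token_cost_per_session=0.01)

saving = 1 - ai_assisted / human_only
print(f"human-only:  £{human_only:.3f} per session")
print(f"AI-assisted: £{ai_assisted:.3f} per session")
print(f"saving: {saving:.1%}")   # roughly 13-14% with these assumed inputs
```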

Implications — What this means beyond maths

Three implications stand out:

  1. Pedagogy beats raw model power. Fine‑tuning and scaffolding matter more than parameter count.
  2. Human‑in‑the‑loop is not a compromise—it’s a multiplier.
  3. Generalization is not guaranteed. Maths is structured; history and literature are not. Claims should stay domain‑specific until tested.

For schools, platforms, and policymakers, the message is clear: banning AI tutors outright is as lazy as deploying them recklessly.

Conclusion — A narrow but important signal

This paper does not prove that AI tutors will revolutionize education. It proves something quieter and more useful: under the right constraints, AI can safely amplify expert instruction—and sometimes even sharpen it.

The future of education is unlikely to be human or machine. It is human judgment, operating at scale, with machines doing the disciplined, repeatable cognitive labor in between.

Cognaptus: Automate the Present, Incubate the Future.