When the Tutor Is a Model: Learning Gains, Guardrails, and the Quiet Rise of AI Co‑Tutors

A tutor has three student chats open.

In the first, a student has confused a factor with a multiple. In the second, another has substituted a negative number incorrectly. In the third, the student has already found the answer but is rapidly losing patience with being asked to explain it.

The tutor must diagnose each problem, compose an appropriate question, maintain the students’ attention, and decide when further explanation becomes counterproductive. Doing this well requires mathematical knowledge, pedagogical discipline, emotional judgment, and enough spare attention to avoid replying to the wrong child.

The obvious AI proposal is to automate the tutor.

The more credible proposal is narrower: let a model draft the next pedagogically sound message, while a human decides whether that message should reach the student.

That is the system tested in an exploratory classroom trial conducted by the LearnLM Team and Eedi across five UK secondary schools.¹ Its results provide an encouraging signal for AI-supported tutoring. They also show why describing the intervention simply as an “AI tutor” removes most of what made it work.

The Model Was Only One Layer of the Tutoring System

During the seven-week trial, 165 students aged 13 to 15 worked through mathematics activities on Eedi, an established educational platform.

Eedi’s foundation matters. Its library contains more than 60,000 diagnostic multiple-choice questions. Each incorrect option is designed to correspond to a particular misconception. When a student selects the wrong answer, the platform can infer more than the unremarkable fact that the student is wrong. It can identify the likely reasoning error behind the answer.

LearnLM entered the process only after this diagnosis.

The platform supplied the model with the question, the student’s incorrect answer, the likely misconception, the correct explanation, the student’s year group, and an estimate of the student’s ability. A strict prompt instructed the model to use short, focused messages, ask one question at a time, follow a Socratic approach, avoid revealing the answer, and let the student leave when appropriate.

The model then drafted a response. An expert tutor reviewed that draft and could approve it, edit it, or replace it entirely before the student saw anything.

The operational loop looked like this:

Diagnostic question → misconception identification → constrained AI draft → human review → student message → measured learning outcome
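
The loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the trial's implementation: the names `TutoringContext`, `ReviewDecision`, `draft_message`, and `next_student_message` are hypothetical, and the drafting function is a stand-in for a model call that the paper does not describe at this level of detail.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TutoringContext:
    """Context the platform supplied to the model, per the article."""
    question: str
    student_answer: str
    likely_misconception: str
    correct_explanation: str
    year_group: int
    ability_estimate: float  # scale is an assumption of this sketch


@dataclass(frozen=True)
class ReviewDecision:
    """Hypothetical record of what the human reviewer decided."""
    action: str                 # "approve", "edit", or "replace"
    text: Optional[str] = None  # tutor's wording for "edit" or "replace"


def draft_message(ctx: TutoringContext) -> str:
    """Placeholder for the constrained model call (hypothetical)."""
    return (f"You chose {ctx.student_answer}. "
            "What does the question ask you to find first?")


def next_student_message(
    ctx: TutoringContext,
    drafter: Callable[[TutoringContext], str],
    reviewer: Callable[[TutoringContext, str], ReviewDecision],
) -> str:
    """Return the message the student sees; no draft bypasses review."""
    draft = drafter(ctx)
    decision = reviewer(ctx, draft)
    if decision.action == "approve":
        return draft
    if decision.action in ("edit", "replace"):
        if not decision.text:
            raise ValueError("edit/replace requires the tutor's text")
        return decision.text
    raise ValueError(f"unknown review action: {decision.action}")
```

The design point is that the reviewer sits on the only path to the student: the function has no branch that returns a draft without a decision.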

Each layer reduced a different uncertainty.

System layer	Function	Risk reduced
Diagnostic question library	Identifies the likely misconception behind an incorrect answer	Generic or irrelevant explanations
Pedagogically fine-tuned model	Drafts concise Socratic guidance	Answers that merely provide the solution
Strict prompt and student context	Constrains tone, pacing, content, and difficulty	Unstructured model behavior
Expert tutor review	Approves, modifies, or replaces every draft	Factual errors, poor judgment, and inappropriate interaction
Platform outcome tracking	Measures later student performance	Mistaking conversational fluency for learning

This architecture explains why the study should not be read as a trial of autonomous tutoring. It tested whether a specialized model could become a reliable component inside a supervised educational system.

That distinction is less dramatic than replacing teachers. It is also considerably more useful.

The Trial Randomized Support at Two Levels

The researchers used a two-level randomized design.

First, students were assigned either to continue receiving static, pre-written hints or to receive interactive tutoring after answering a question incorrectly.

Second, each interactive tutoring session was randomly assigned to one of two modes:

a human tutor working alone;
LearnLM drafting messages under the supervision of a human tutor.

The same expert tutors performed both roles during the trial. Students saw the same chat interface in either interactive condition and were not explicitly told which mode had been assigned to a particular session.

The trial evaluated three learning outcomes, each asking a progressively harder question.

Mistake remediation: Could the student answer the same question correctly after receiving support?
Misconception resolution: Could the student answer any subsequent question on the same topic correctly?
Knowledge transfer: Could the student answer the first question in the next sequential study unit correctly?
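
As a rough sketch, the three outcomes can be treated as per-session flags and summarized by condition. The record layout and the names `SessionOutcome`, `raw_rate`, and `outcome_rates` are assumptions made for illustration. The function returns raw proportions, whereas the trial reports model-estimated rates adjusted for baseline performance, so the two are not interchangeable.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class SessionOutcome:
    """Hypothetical per-session record; None means 'not attempted'."""
    condition: str                     # e.g. "static_hint", "human", "supervised_ai"
    retry_correct: bool                # mistake remediation
    topic_resolved: Optional[bool]     # misconception resolution
    next_unit_correct: Optional[bool]  # knowledge transfer


def raw_rate(values: Iterable[Optional[bool]]) -> Optional[float]:
    """Share of True among observed values; None if nothing was observed."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed) if observed else None


def outcome_rates(sessions: List[SessionOutcome],
                  condition: str) -> Dict[str, Optional[float]]:
    """Raw (unadjusted) success rates for one condition."""
    rows = [s for s in sessions if s.condition == condition]
    return {
        "mistake_remediation": raw_rate(s.retry_correct for s in rows),
        "misconception_resolution": raw_rate(s.topic_resolved for s in rows),
        "knowledge_transfer": raw_rate(s.next_unit_correct for s in rows),
    }
```

Keeping unattempted outcomes as `None` instead of `False` matters here: a student who never reached the next unit has not failed to transfer anything.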

The third outcome deserves special attention. Correcting the same question may show that the student followed the tutor’s immediate guidance. Success on a related but new unit offers a stronger indication that something transferable was learned.

It is still a near-term measure: the next unit had to be attempted on the same day, and the units belonged to a connected sequence. Calling it evidence of long-term retention would require an enthusiasm for adjectives unsupported by calendars.

Interactive Tutoring Produced Large Immediate Gains

For immediate mistake remediation and misconception resolution, both forms of interactive tutoring substantially outperformed static hints.

The model-estimated success rates, adjusted for students’ baseline performance, were:

Outcome	Static hint	Human tutor	Supervised LearnLM
Correct answer on retry	65.4%	91.2%	93.0%
Misconception resolved within the unit	86.8%	94.9%	95.4%
Correct answer on the next sequential unit	56.2%	60.7%	66.2%

The largest and most certain result is not that AI beat human tutors. It is that interactive tutoring—whether human-only or AI-supported—was far more effective than static feedback for immediate correction.

Relative to static hints, human tutoring increased the estimated probability of mistake remediation by approximately 25.9 percentage points, and supervised LearnLM increased it by approximately 27.7 percentage points. In both comparisons, the posterior probability that interactive tutoring outperformed static hints exceeded 99.9%.

The difference between the two interactive conditions was much smaller.

For mistake remediation, supervised LearnLM had an estimated 1.8-percentage-point advantage over human tutoring, with a 95% credible interval from –1.7 to +5.4 percentage points. For misconception resolution, the estimated difference was only 0.4 percentage points, with a credible interval from –2.5 to +3.3.

The practical reading is near-parity within the resolution of this trial. The point estimates slightly favor LearnLM, but the evidence permits both modest advantage and modest disadvantage.

For a supervised drafting system, parity is already consequential. Drafting support does not need to outperform an expert on every interaction to be operationally useful. It needs to preserve instructional quality while allowing the expert to allocate attention more effectively.

The Transfer Result Is Promising, but It Is Still a Signal

The most interesting result appeared when students progressed to the next study unit.

Students receiving only static hints had an estimated 56.2% probability of answering the next unit’s opening question correctly. The estimated rate rose to 60.7% after human tutoring and 66.2% after supervised LearnLM tutoring.

The direct comparison estimated a 5.5-percentage-point advantage for supervised LearnLM over human tutors.

That number is meaningful. The uncertainty around it is equally meaningful.

The 95% credible interval ranged from –1.4 to +12.4 percentage points. In other words, the posterior distribution mostly favors LearnLM but still includes the possibility of a small disadvantage. The researchers calculated a 93.6% posterior probability that LearnLM produced better transfer than human tutoring.

A 93.6% posterior probability is a strong reason to investigate further. It is not the same as a credible interval excluding zero, and it does not establish that AI-supported tutoring reliably improves transfer in other classrooms, subjects, age groups, or time horizons.
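
A back-of-envelope check shows how an interval that includes zero can coexist with a high posterior probability. The sketch below assumes the posterior for the difference is approximately normal and centred on the midpoint of the reported 95% interval. That is an assumption of this illustration, not the researchers' method, and the function name `prob_positive_from_interval` is hypothetical.

```python
from statistics import NormalDist


def prob_positive_from_interval(lower: float, upper: float,
                                level: float = 0.95) -> float:
    """Approximate P(difference > 0) from a symmetric credible interval.

    Assumes a normal posterior whose central `level` interval is
    [lower, upper]; real posteriors may be skewed.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    mean = (lower + upper) / 2
    sd = (upper - lower) / (2 * z)
    return 1 - NormalDist(mean, sd).cdf(0.0)


# Reported transfer interval, in percentage points.
p = prob_positive_from_interval(-1.4, 12.4)
print(f"approximate P(LearnLM > human) = {p:.3f}")
```

Under this approximation the probability comes out at roughly 0.94, close to the reported 93.6%. The small gap is expected, since the actual posterior need not be exactly normal.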

The study also included a useful benchmark: students who answered the preceding unit correctly and therefore needed no intervention. Their estimated success rate on the next unit was 69.0%. Supervised LearnLM brought students who had previously made a mistake relatively close to that benchmark, but it did not clearly eliminate the gap.

A plausible mechanism is available. Tutors consistently described LearnLM as disciplined in its use of Socratic questioning. Rather than quickly supplying an explanation or accepting a correct guess, the model often encouraged students to articulate the reasoning behind an answer. Several tutors reported learning new pedagogical approaches from supervising its drafts.

This consistency may support transfer. It may also irritate students.

Conveniently, the edit logs show exactly where the model’s pedagogical discipline stopped being helpful.

The Edit Log Reveals a Division of Labour

LearnLM produced 3,617 message drafts during the trial.

Human tutors approved 74.4% without any changes. When drafts requiring only one- or two-character edits are included, the approval-or-minimal-edit rate rises to 76.4%. Many tiny changes involved removing emojis or making minor stylistic adjustments.

This indicates that the model usually provided a usable first draft. The remaining interventions reveal what “usable” does not include.

Among edited or rewritten drafts, the most frequent reasons for intervention were:

Primary motivation for editing	Share of edits	What the tutor contributed
Adjusting pedagogical pacing	44.3%	Deciding when the student had understood enough and should move on
Improving factual or contextual clarity	33.6%	Making the response more precise or better aligned with the immediate situation
Adjusting tone or persona	19.5%	Maintaining rapport and matching the student’s communication style

The most common problem was not mathematical incompetence. It was excessive pedagogical persistence.

LearnLM would continue asking a student to explain why an answer was correct even when the student was ready to return to the lesson. From the model’s perspective, deeper reasoning remained educationally desirable. From the student’s perspective, the tutor had become the conversational equivalent of a door that keeps asking whether one truly understands leaving.

Human tutors recognized when productive struggle was becoming ordinary frustration. They shortened explanations, allowed students to proceed, acknowledged previous interactions, and removed language that felt artificial.

The model supplied repeatable pedagogical structure. The tutors supplied situational judgment.

This is the co-tutor model in its clearest form: human expertise moves from drafting every sentence toward supervising exceptions, pacing, and relationships.

Safety Came From the Workflow, Not Merely the Model

A retrospective audit of all 3,617 LearnLM drafts found zero instances categorized as harmful or risky and identified five factual inaccuracies, representing approximately 0.1% of drafts.

The five errors included a mathematical mistake, an unexpected language insertion, and several hallucinated or incorrect references to the student’s answer. The supervising tutors corrected them before the messages were delivered.

These are strong results for the tested environment. They are also inseparable from the environment.

The model operated on structured mathematics questions with validated answers and diagnosed misconceptions. Its behavior was tightly prompted. Every draft passed through an expert reviewer. The trial covered a limited number of schools and interactions.

Therefore, the audit supports a specific claim: within this supervised and highly structured deployment, LearnLM generated drafts that were usually accurate and that human tutors could review safely.

It does not establish that the same model could interact autonomously with students while maintaining an equivalent safety profile. Removing the reviewer would change the intervention, the risk distribution, and probably the meaning of the word “safe.”

The study’s evidence components should be kept in their proper categories:

Evidence component	Likely purpose	What it supports	What it does not prove
Two-level classroom RCT	Main evidence	Comparative learning outcomes under supervised deployment	Autonomous tutoring effectiveness
Full draft audit	Safety and quality assessment	Error frequency and tutor intervention patterns in the trial	Safety across unrestricted subjects and conversations
Tutor interviews and surveys	Mechanism and experience exploration	How tutors perceived the model and why they edited drafts	Causal learning effects
Supplementary operational simulation	Exploratory scalability estimate	Possible effects on concurrency, throughput, and session cost	Production-scale ROI

This separation matters because the paper contains both rigorous experimental results and useful exploratory findings. Combining them into one undifferentiated success story would make the article simpler and the evidence worse.

The Scalability Signal Comes From Concurrency

The main classroom trial was not designed to measure tutor productivity cleanly. Tutors moved fluidly between direct human tutoring and LearnLM-supervised sessions during the same working periods, making their time difficult to attribute.

The researchers therefore ran a supplementary operational simulation after the trial. Six tutors took the tutoring role while six others role-played students. New sessions were initiated at one-minute intervals until tutors could no longer maintain the workload.

The simulation produced an initially puzzling result: supervised LearnLM sessions lasted longer on average.

Operational metric	Human tutor alone	Supervised LearnLM
Average session duration	3.9 minutes	5.1 minutes
Average concurrent sessions	2.3	3.5
Estimated throughput	35.4 sessions/hour	41.2 sessions/hour
Estimated cost per session	£0.997	£0.861

The AI-supported sessions took longer because the model’s Socratic approach often extended the conversation. Yet tutors could manage more conversations simultaneously because they spent less time composing each individual response.

Using an assumed tutor labour rate of £35.29 per hour and an estimated inference cost of £0.0037 per supervised session, the researchers estimated a 13.6% reduction in cost per session.
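
The table's figures can be reproduced with simple arithmetic. The sketch below assumes throughput equals average concurrency divided by average session duration, and that the human-only condition carries no inference cost. Both are assumptions of this illustration that happen to match the reported numbers; the paper's own cost model may be specified differently. The function names are hypothetical.

```python
def sessions_per_hour(avg_concurrent: float, avg_duration_min: float) -> float:
    """Steady-state throughput: concurrent sessions completed per hour."""
    return avg_concurrent * 60.0 / avg_duration_min


def cost_per_session(hourly_rate: float, throughput: float,
                     inference_cost: float = 0.0) -> float:
    """Labour cost spread over hourly throughput, plus per-session inference."""
    return hourly_rate / throughput + inference_cost


HOURLY_RATE = 35.29      # GBP per hour, as assumed by the researchers
INFERENCE_COST = 0.0037  # GBP per supervised session, as reported

human_tp = sessions_per_hour(2.3, 3.9)
ai_tp = sessions_per_hour(3.5, 5.1)
human_cost = cost_per_session(HOURLY_RATE, human_tp)
ai_cost = cost_per_session(HOURLY_RATE, ai_tp, INFERENCE_COST)

# Reduction computed from costs rounded to three decimals, as in the table.
reduction = 1 - round(ai_cost, 3) / round(human_cost, 3)

print(f"throughput: {human_tp:.1f} vs {ai_tp:.1f} sessions/hour")
print(f"cost: £{human_cost:.3f} vs £{ai_cost:.3f} per session")
print(f"reduction: {reduction:.1%}")
```

Longer sessions lower throughput, but the rise in concurrency more than offsets it, which is why the supervised condition ends up cheaper per session despite taking more minutes.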

This is an indicative model, not a demonstrated production saving. The concurrency estimates came from a simulation with role-playing participants, and the model assumes sustained demand. Real operations would add scheduling gaps, escalation procedures, quality assurance, infrastructure, training, and compliance costs.

Even so, the simulation identifies the relevant economic mechanism. The business case does not primarily depend on cheap tokens. At £0.0037 per session, token costs are already a rounding error beside expert labour. The potential value comes from increasing the number of students an expert can supervise without degrading outcomes.

The Transferable Product Is the Control Loop

For education providers, the easiest mistake would be to copy only the visible component: connect a general-purpose model to a student-facing chat window and label the result personalized learning.

The paper suggests a more demanding implementation path.

1. Build diagnosis before dialogue

LearnLM received a likely misconception, not merely a transcript and a request to “help the student.” Eedi’s diagnostic question structure narrowed the problem before generation began.

For other education products, comparable diagnostic infrastructure may require carefully designed assessments, misconception taxonomies, curriculum maps, or retrieval systems containing validated explanations. Without that layer, the model must guess what the learner misunderstands.

A fluent guess remains a guess.

2. Constrain the pedagogical action

The prompt specified how the model should teach: concise messages, one question at a time, no direct answer, adaptation to predicted ability, and explicit permission to end the interaction.

This converts pedagogy from an aspiration into an operational policy. Education providers should define which actions the model may take, when it should escalate, and what constitutes a completed intervention.
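
One way to make such a policy operational is to check each draft against explicit rules before it reaches the reviewer. The sketch below is illustrative only: `DraftPolicy`, `policy_violations`, the character limit, and the string heuristics are all assumptions of this example, and nothing in the paper says the trial used automated checks like these.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DraftPolicy:
    """Hypothetical, checkable version of the prompt's teaching rules."""
    max_chars: int = 280    # assumed limit for a short, focused message
    max_questions: int = 1  # one question at a time


def policy_violations(draft: str, correct_answer: str,
                      policy: DraftPolicy = DraftPolicy()) -> List[str]:
    """Return the names of rules the draft appears to break.

    The checks are deliberately naive: counting '?' and substring
    matching will misfire on some drafts (for example, a short answer
    such as '4' also matches '14').
    """
    violations: List[str] = []
    if len(draft) > policy.max_chars:
        violations.append("too_long")
    if draft.count("?") > policy.max_questions:
        violations.append("multiple_questions")
    answer = correct_answer.strip().lower()
    if answer and answer in draft.lower():
        violations.append("reveals_answer")
    return violations
```

Flags like these would not replace the human reviewer. At most they could direct the reviewer's attention to drafts that are more likely to need an edit.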

3. Design review around exceptions

Human review becomes economically useful only when the interface makes intervention efficient. Tutors need to see the proposed response, edit it quickly, replace it when necessary, and understand why the system produced it.

The trial’s edit categories offer a practical starting point for routing and monitoring:

pacing interventions;
factual or contextual corrections;
tone and rapport adjustments;
safety escalations;
complete rewrites.

Over time, these interventions can become training data, quality metrics, and indicators of where automation remains weak.
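
A minimal sketch of how those categories might be logged and summarized follows. The enum mirrors the list above; the names `EditReason` and `reason_shares` are hypothetical, and the single-reason-per-edit layout is an assumption.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class EditReason(Enum):
    """Intervention categories, mirroring the article's list."""
    PACING = "pacing"
    FACTUAL_OR_CONTEXTUAL = "factual_or_contextual"
    TONE_OR_RAPPORT = "tone_or_rapport"
    SAFETY_ESCALATION = "safety_escalation"
    COMPLETE_REWRITE = "complete_rewrite"


def reason_shares(reasons: Iterable[EditReason]) -> Dict[EditReason, float]:
    """Share of logged interventions per category (0.0 when none logged)."""
    counts = Counter(reasons)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {reason: counts[reason] / total for reason in EditReason}
```

Tracked over time, a rising share in one category points to where the prompt, the model, or the review interface needs attention.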

4. Measure learning after the conversation

A satisfaction score can reveal whether students enjoyed the experience. It cannot establish whether they learned.

The Eedi platform measured immediate correction, same-topic resolution, and next-unit transfer. Products deploying AI tutors should similarly distinguish between conversational engagement and educational outcomes.

A student saying “thanks” is pleasant. A student solving the next problem independently is evidence.

What the Paper Directly Shows—and What Businesses Must Still Test

The paper supports several practical conclusions, but only within clear boundaries.

Question	What the paper shows	What remains uncertain
Can a specialized model generate useful tutoring drafts?	Expert tutors approved most drafts unchanged or with minimal edits	Performance in less structured domains and unrestricted dialogue
Can supervised AI preserve learning outcomes?	Immediate outcomes were comparable to human tutoring in this trial	Long-term retention and cumulative effects
Can supervised AI improve transfer?	The estimate favored LearnLM by 5.5 percentage points with a 93.6% posterior probability	Whether the advantage replicates and persists
Can the model operate safely?	Few factual errors and no harmful drafts were identified under expert supervision	Safety without message-level human review
Can AI reduce tutoring costs?	A supplementary simulation estimated higher throughput and lower session cost	Actual production ROI under real workloads

Several design features limit broader interpretation.

The trial focused on mathematics, where answers can be verified and common misconceptions can be mapped to specific distractors. Subjects such as history, literature, and social science demand interpretation, argument evaluation, and engagement with ambiguity. The same diagnostic structure may be difficult to reproduce.

The study lasted seven weeks and randomized the source of tutoring session by session. This design efficiently compared interventions within the same student population, but it could not isolate the cumulative effect of consistently receiving one tutoring method over several months. Tutors also reported learning from LearnLM, meaning that its pedagogical techniques may have influenced their later human-only sessions.

The participating schools varied in academic performance and socioeconomic background, but their proportions of students speaking English as an additional language were below national averages. Language diversity, accessibility needs, and culturally varied communication styles require separate testing.

Most importantly, every LearnLM message was supervised by a qualified tutor. Any organization considering less intensive review must treat that change as a new intervention requiring new evidence.

The Quiet Rise of the Co-Tutor

The trial began with a practical constraint: one-to-one tutoring works, but expert attention is expensive and scarce.

LearnLM did not remove that constraint by replacing the expert. It changed how the expert’s attention was used.

The model handled much of the repeatable work: generating a focused question, maintaining a Socratic structure, and producing a usable first draft. The tutor concentrated on the parts that remained difficult to formalize: deciding when to stop, recognizing frustration, preserving rapport, and correcting the occasional error.

That division of labour produced immediate learning outcomes comparable to human tutoring and a promising, though still uncertain, signal of stronger next-unit transfer. It also suggested that tutors might supervise more simultaneous sessions while inference costs stayed negligible next to expert labour.

The central lesson is architectural. Effective AI tutoring emerged from the combination of pedagogical specialization, structured diagnostic data, constrained generation, human authority, and outcome measurement.

Remove enough of those components and the product may still look like an AI tutor.

It will simply resemble the one tested in this paper much less than the marketing page suggests.

Cognaptus: Automate the Present, Incubate the Future.

¹ LearnLM Team. “AI Tutoring Can Safely and Effectively Support Students: An Exploratory RCT in UK Classrooms.” arXiv:2512.23633, 2025. https://arxiv.org/abs/2512.23633
