Opening — Why this matters now
Most of the conversation around AI tutors is about whether they improve outcomes. Far fewer people ask the more uncomfortable question: how do they decide what to teach next?
That distinction matters. A system that truly models a learner behaves very differently from one that simply follows surface-level heuristics or prompt instructions. One adapts. The other performs.
A recent study attempts to pry open this black box—and the results are, predictably, both impressive and slightly unsettling.
Background — Context and prior art
Most evaluations of AI tutoring focus on outcomes: test scores, engagement, or completion rates. That’s useful, but shallow: outcome metrics say little about the strategy that produced them.
Human teaching research offers a sharper lens. Teachers typically fall into two camps:
| Strategy Type | Description | Trade-off |
|---|---|---|
| Model-based (mentalizing) | Infers what the learner knows and teaches accordingly | High effectiveness, high cognitive cost |
| Heuristic-based | Uses simple rules (e.g., highlight high-reward content) | Lower effort, inconsistent results |
In humans, these strategies coexist. People switch depending on effort, context, and incentives.
The open question: where do LLMs land?
Analysis — What the paper actually tests
The study uses a controlled Graph Teaching Task (see Figure 1 in the paper) where:
- A “learner” navigates a reward-based graph
- The “teacher” (LLM) reveals one edge to improve the learner’s future decision
- The optimal move depends on what the learner does not know
This setup forces a choice:
- Infer the learner’s missing knowledge (mentalizing)
- Or rely on shortcuts (e.g., pick the highest-reward edge), as in the toy sketch below
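To make the structure concrete, here is a toy version of the task in Python. The graph, rewards, and the learner’s greedy planner are illustrative assumptions, not the paper’s stimuli; the point is that the best edge to reveal depends on what the learner is missing, not on edge rewards alone.

```python
# Toy Graph Teaching Task: illustrative structure only, not the paper's stimuli.
# The learner plans over the edges it knows; the teacher reveals ONE hidden
# edge and is judged by how much that reveal improves the learner's best path.

# Directed graph: node -> {neighbor: edge reward}. Hypothetical values.
GRAPH = {
    "start": {"a": 2, "b": 3},
    "a": {"goal": 4},
    "b": {"goal": 9},
}

def best_path_value(known_edges, node="start"):
    """Value of the best path the learner can plan using only known edges."""
    options = [
        reward + best_path_value(known_edges, nxt)
        for nxt, reward in GRAPH.get(node, {}).items()
        if (node, nxt) in known_edges
    ]
    return max(options, default=0)

def teach(learner_knows):
    """Reveal the hidden edge that most improves the learner's plan."""
    all_edges = {(u, v) for u, nbrs in GRAPH.items() for v in nbrs}
    hidden = all_edges - learner_knows
    # Simulate the learner after each possible reveal (the mentalizing step).
    gains = {edge: best_path_value(learner_knows | {edge}) for edge in hidden}
    return max(gains, key=gains.get)

# The learner knows start->a and b->goal, but not a->goal or start->b.
known = {("start", "a"), ("b", "goal")}
print(teach(known))  # ('start', 'b'): reward 3, but it unlocks the 12-value
                     # path. The reward heuristic would reveal ('a', 'goal')
                     # (reward 4) for a much smaller gain.
```

This is exactly where the heuristic and the mentalizing teacher diverge: the highest-reward hidden edge is not the one the learner actually needs.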
The Core Metric: Teaching Score
The paper defines performance using a normalized utility metric:
| Metric | Meaning |
|---|---|
| Teaching Score | How close the chosen teaching action is to the optimal teaching move |
A score of 1.0 means the model behaves like a Bayes Optimal Teacher—the gold standard that perfectly infers the learner’s knowledge state.
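The paper’s exact normalization isn’t reproduced here, but a standard reading of “normalized utility” is a min-max rescaling over the available teaching actions. A minimal sketch under that assumption:

```python
def teaching_score(utilities, chosen):
    """Min-max normalized utility of the chosen teaching action.

    `utilities` maps each candidate action to the learner's expected utility
    after that action is taught (assumed definition; the paper's exact
    normalization may differ). 1.0 means the Bayes-optimal choice.
    """
    lo, hi = min(utilities.values()), max(utilities.values())
    if hi == lo:          # all actions equally good: any choice is optimal
        return 1.0
    return (utilities[chosen] - lo) / (hi - lo)

print(teaching_score({"edge_ab": 2.0, "edge_bc": 6.0, "edge_cd": 12.0},
                     chosen="edge_cd"))  # 1.0: matches the optimal move
```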
Cognitive Models Tested
The researchers didn’t just measure performance—they reverse-engineered strategy using cognitive model fitting:
| Model Type | Key Idea |
|---|---|
| Bayes Optimal Teacher | Infers learner knowledge via inverse planning |
| Weak Bayesian variants | Partial or simplified inference |
| Heuristics | Reward-based or depth-based shortcuts |
| Non-mentalizing | Ignores learner entirely |
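Mechanically, this kind of strategy attribution is standard cognitive model comparison: each candidate strategy assigns a likelihood to the observed teaching choices, usually through a softmax over its action values, and models compete on penalized fit. A minimal sketch assuming a softmax choice rule and BIC; the paper’s exact estimator may differ:

```python
import math

def softmax_loglik(values_per_trial, choices, beta):
    """Log-likelihood of observed choices under a softmax over a strategy's
    action values. values_per_trial[t] maps each candidate action on trial t
    to the value that strategy assigns it; choices[t] is the action taken."""
    ll = 0.0
    for values, choice in zip(values_per_trial, choices):
        z = sum(math.exp(beta * v) for v in values.values())
        ll += beta * values[choice] - math.log(z)
    return ll

def bic(loglik, n_params, n_trials):
    # Bayesian information criterion: lower is better.
    return n_params * math.log(n_trials) - 2 * loglik

# Toy example: the BOT model values edge "ab" at 12 and "cd" at 2,
# and the LLM chose "ab". Fit the inverse temperature on a small grid.
values_per_trial = [{"ab": 12.0, "cd": 2.0}]
choices = ["ab"]
ll = max(softmax_loglik(values_per_trial, choices, b) for b in (0.1, 0.5, 1, 2))
print(bic(ll, n_params=1, n_trials=len(choices)))
```

Run this once per candidate strategy, and whichever model yields the lowest BIC is declared the best explanation of the behavior.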
This is where things get interesting.
Findings — Results with visualization
1. LLMs Perform Surprisingly Well (and Consistently)
Across 40 trials, performance is flat—no learning curve. That’s expected because there’s no feedback loop.
But the absolute level? High.
| Observation | Interpretation |
|---|---|
| High Teaching Scores across models | LLMs already operate near optimal teaching policies |
| Stability across trials | No adaptation—but also no degradation |
2. They Look… Almost Human (But Not Quite)
From the graph-level analysis (Figure 2):
- Most models strongly correlate with human difficulty patterns (r ≈ 0.76–0.89)
- They struggle on the same problems humans do
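That comparison is simple to replicate in spirit: compute each group’s mean teaching score per graph and correlate the two difficulty profiles. The numbers below are hypothetical, not the paper’s data:

```python
from statistics import correlation  # Pearson's r, Python 3.10+

# Hypothetical per-graph mean teaching scores (not the paper's data).
human = [0.91, 0.62, 0.88, 0.45, 0.73]
model = [0.95, 0.70, 0.90, 0.52, 0.80]

print(correlation(human, model))  # high positive r: shared difficulty profile
```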
Yet the distribution differs:
| Group | Strategy Distribution |
|---|---|
| Humans | Bimodal (mentalizing + heuristics) |
| LLMs | Skewed toward high-performing, model-based behavior |
Translation: LLMs act like consistently “rational” teachers—something humans rarely manage.
3. Cognitive Model Fits: The Quiet Bombshell
The majority of LLM behavior is best explained by the Bayes Optimal Teacher model.
| Model Fit Result | Implication |
|---|---|
| BOT dominates across LLMs | Models behave as if they infer learner knowledge |
| Minimal heuristic alignment | Less shortcut-driven than humans |
This suggests something counterintuitive:
LLMs appear to mentalize—even without explicit learner models.
4. Prompting Doesn’t Fix (or Improve) Teaching
The second experiment introduces two kinds of scaffolding prompts (illustrative wording below):
- Inference scaffolding: highlight what the learner might not know
- Reward scaffolding: highlight high-value content
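Concretely, both conditions amount to prepending a hint to the teaching prompt. The wording below is a paraphrase for illustration, not the paper’s verbatim prompts:

```python
# Hypothetical paraphrases of the two scaffolding conditions.
SCAFFOLDS = {
    "inference": (
        "Before choosing what to teach, consider which edges the learner "
        "has not yet seen and which gap most limits their plan."
    ),
    "reward": (
        "Pay attention to the rewards on each edge; high-reward edges are "
        "often the most valuable content to teach."
    ),
}

def build_prompt(task_description, condition=None):
    """Prepend the scaffolding hint (if any) to the base teaching prompt."""
    hint = SCAFFOLDS.get(condition, "")
    return f"{hint}\n\n{task_description}".strip()
```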
Humans benefit from these.
LLMs? Not really.
| Condition | Human Effect | LLM Effect |
|---|---|---|
| Inference scaffolding | Improves teaching | Little to no improvement |
| Reward scaffolding | Can hurt performance | Often reduces performance |
Even more telling:
- Models follow the scaffolding instructions correctly
- But their final teaching decisions don’t improve
Compliance ≠ competence.
Implications — What this means for business and AI systems
1. LLMs Are Already “Good Enough” Teachers—Structurally
If models naturally approximate optimal teaching policies, the bottleneck shifts:
- Not capability
- But alignment and control
2. Prompt Engineering Has Hard Limits
This paper quietly undermines a popular belief:
Better prompts → better outcomes
In reality:
- Prompts can change surface behavior
- But they don’t necessarily change decision quality
For operators, this means diminishing returns on prompt tinkering.
3. Cognitive Evaluation > Output Evaluation
Most benchmarks ask:
- “Did the student improve?”
This paper asks:
- “Did the AI choose the right teaching action?”
That shift is subtle—but foundational.
4. Resource Economics Matter More Than We Thought
Humans avoid mentalizing because it’s costly.
LLMs don’t face the same constraint.
Implication:
| System Constraint | Expected Behavior |
|---|---|
| Ample compute, little latency pressure | Model-based (mentalizing) |
| Tight constraints (tokens, latency) | Shift toward heuristics |
In other words, deployment conditions may shape teaching strategy more than prompts do.
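If that’s right, the operational lever is budget design rather than prompt design. A toy policy sketch (my illustration, not anything the paper implements): simulate the learner when the budget allows, and fall back to a reward heuristic when it doesn’t.

```python
def choose_teaching_action(candidates, learner_gain, token_budget,
                           mentalize_cost=500):
    """Toy budget-sensitive teaching policy (illustrative, not from the paper).

    candidates: dict mapping each teachable action to its edge reward.
    learner_gain: callable scoring an action by simulated learner improvement.
    """
    if token_budget >= mentalize_cost * len(candidates):
        # Enough budget: simulate the learner for every candidate (mentalizing).
        return max(candidates, key=learner_gain)
    # Tight budget: fall back to the cheap reward heuristic.
    return max(candidates, key=candidates.get)
```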
5. New Failure Mode: “Correct Reasoning, Wrong Intervention”
Perhaps the most uncomfortable takeaway:
- LLMs can understand the right strategy
- Execute intermediate steps correctly
- And still choose suboptimal teaching actions
That’s not hallucination.
That’s policy misalignment.
Conclusion — The uncomfortable clarity
The paper answers its own title with a shrug that feels more like a warning:
LLMs don’t just mimic teaching.
They often approximate optimal teaching strategies—without being explicitly designed to do so.
And yet, they remain oddly resistant to improvement via the very tools we rely on most: prompts and scaffolding.
Which leaves us with a slightly inconvenient truth:
We are no longer optimizing capability. We are negotiating with already-capable systems.
That’s a very different engineering problem.
Cognaptus: Automate the Present, Incubate the Future.