Opening — Why this matters now

Nearly everyone agrees AI tutors can improve learning outcomes. Fewer people are asking the more uncomfortable question: how do they decide what to teach next?

That distinction matters. A system that truly models a learner behaves very differently from one that simply follows surface-level heuristics or prompt instructions. One adapts. The other performs.

A recent study attempts to pry open this black box—and the results are, predictably, both impressive and slightly unsettling.

Background — Context and prior art

Most evaluations of AI tutoring focus on outcomes: test scores, engagement, or completion rates. That’s useful, but shallow.

Human teaching research offers a sharper lens. Teachers typically fall into two camps:

| Strategy Type | Description | Trade-off |
| --- | --- | --- |
| Model-based (mentalizing) | Infers what the learner knows and teaches accordingly | High effectiveness, high cognitive cost |
| Heuristic-based | Uses simple rules (e.g., highlight high-reward content) | Lower effort, inconsistent results |

In humans, these strategies coexist. People switch depending on effort, context, and incentives.

The open question: where do LLMs land?

Analysis — What the paper actually tests

The study uses a controlled Graph Teaching Task (see Figure 1 in the paper) where:

  • A “learner” navigates a reward-based graph
  • The “teacher” (LLM) reveals one edge to improve the learner’s future decision
  • The optimal move depends on what the learner does not know

This setup forces a choice:

  • Infer the learner’s missing knowledge (mentalizing)
  • Or rely on shortcuts (e.g., pick the highest reward edge)
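The choice between these two strategies can be made concrete with a toy version of the task. Everything below is illustrative (the graph, rewards, and function names are mine, not the paper's code):

```python
# Toy Graph Teaching Task: the learner greedily follows the best path it
# knows; the teacher reveals one edge to maximize the learner's improvement.
GRAPH = {
    "start": [("a", 2), ("b", 1)],
    "a": [("goal", 4)],
    "b": [("goal", 9)],
    "goal": [],
}
REWARD = {(u, v): r for u, edges in GRAPH.items() for (v, r) in edges}

def best_path_value(known_edges, node="start"):
    """Total reward the learner can collect using only edges it knows."""
    options = [(nxt, r) for (nxt, r) in GRAPH[node] if (node, nxt) in known_edges]
    if not options:
        return 0
    return max(r + best_path_value(known_edges, nxt) for nxt, r in options)

def optimal_teaching_edge(learner_knows):
    """Mentalizing teacher: reveal the edge that most improves the learner's
    achievable reward, given what the learner already knows."""
    baseline = best_path_value(learner_knows)
    gains = {e: best_path_value(learner_knows | {e}) - baseline
             for e in REWARD.keys() - learner_knows}
    return max(gains, key=gains.get)

def reward_heuristic_edge(learner_knows):
    """Shortcut teacher: reveal the highest-reward unknown edge, ignoring
    whether the learner can actually use it."""
    return max(REWARD.keys() - learner_knows, key=REWARD.get)

learner_knows = {("start", "a")}
print(optimal_teaching_edge(learner_knows))   # -> ('a', 'goal')
print(reward_heuristic_edge(learner_knows))   # -> ('b', 'goal')
```

Note how the two strategies diverge: the reward heuristic picks the flashiest edge, which the learner cannot reach, while the mentalizing teacher fills the actual knowledge gap.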

The Core Metric: Teaching Score

The paper defines performance using a normalized utility metric:

| Metric | Meaning |
| --- | --- |
| Teaching Score | How close the chosen teaching action is to the optimal teaching move |

A score of 1.0 means the model behaves like a Bayes Optimal Teacher—the gold standard that perfectly infers the learner’s knowledge state.
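One plausible form of such a normalized metric (this is my reconstruction of the idea, not the paper's exact formula) rescales the utility gain of the chosen action so that the best available action scores 1.0 and the worst scores 0.0:

```python
def teaching_score(chosen_gain, all_gains):
    """Normalize a teaching action's utility gain against the range of all
    available actions. A hypothetical form of the paper's metric: the
    Bayes-optimal action maps to 1.0, the least useful action to 0.0."""
    best, worst = max(all_gains), min(all_gains)
    if best == worst:          # every action is equally good
        return 1.0
    return (chosen_gain - worst) / (best - worst)

print(teaching_score(4, [0, 0, 4]))  # optimal choice -> 1.0
print(teaching_score(0, [0, 0, 4]))  # unhelpful choice -> 0.0
```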

Cognitive Models Tested

The researchers didn’t just measure performance—they reverse-engineered strategy using cognitive model fitting:

| Model Type | Key Idea |
| --- | --- |
| Bayes Optimal Teacher | Infers learner knowledge via inverse planning |
| Weak Bayesian variants | Partial or simplified inference |
| Heuristics | Reward-based or depth-based shortcuts |
| Non-mentalizing | Ignores learner entirely |
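The fitting logic behind this comparison is standard: score each candidate model by the likelihood it assigns to the teacher's observed choices, and attribute the teacher's strategy to the best-scoring model. A minimal sketch, with made-up softmax-style policies rather than the paper's actual fits:

```python
import math

def log_likelihood(observed_choices, policy):
    """Sum the log-probability each model assigns to the observed choices.
    `policy` maps a trial index to a dict of action -> probability."""
    return sum(math.log(policy(t)[a]) for t, a in enumerate(observed_choices))

# Two toy models over action "A" (optimal edge) vs. "B" (high-reward shortcut).
bot_policy       = lambda t: {"A": 0.9, "B": 0.1}   # Bayes Optimal Teacher
heuristic_policy = lambda t: {"A": 0.3, "B": 0.7}   # reward heuristic

observed = ["A", "A", "B", "A"]  # a teacher that mostly picks the optimal edge
fits = {
    "BOT": log_likelihood(observed, bot_policy),
    "heuristic": log_likelihood(observed, heuristic_policy),
}
print(max(fits, key=fits.get))  # -> BOT
```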

This is where things get interesting.

Findings — Results with visualization

1. LLMs Perform Surprisingly Well (and Consistently)

Across 40 trials, performance is flat—no learning curve. That's expected, since the models receive no feedback between trials.

But the absolute level? High.

| Observation | Interpretation |
| --- | --- |
| High Teaching Scores across models | LLMs already operate near optimal teaching policies |
| Stability across trials | No adaptation, but also no degradation |

2. They Look… Almost Human (But Not Quite)

From the graph-level analysis (Figure 2):

  • Most models strongly correlate with human difficulty patterns (r ≈ 0.76–0.89)
  • They struggle on the same problems humans do

Yet the distribution differs:

| Group | Strategy Distribution |
| --- | --- |
| Humans | Bimodal (mentalizing + heuristics) |
| LLMs | Skewed toward high-performing, model-based behavior |

Translation: LLMs act like consistently “rational” teachers—something humans rarely manage.

3. Cognitive Model Fits: The Quiet Bombshell

The majority of LLM behavior is best explained by the Bayes Optimal Teacher model.

| Model Fit Result | Implication |
| --- | --- |
| BOT dominates across LLMs | Models behave as if they infer learner knowledge |
| Minimal heuristic alignment | Less shortcut-driven than humans |

This suggests something counterintuitive:

LLMs appear to mentalize—even without explicit learner models.

4. Prompting Doesn’t Fix (or Improve) Teaching

The second experiment introduces scaffolding prompts:

  • Inference scaffolding: highlight what the learner might not know
  • Reward scaffolding: highlight high-value content
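The two conditions can be pictured as prompt templates. The wording below is my paraphrase of the idea, not the paper's actual prompts:

```python
# Illustrative prompt templates for the two scaffolding conditions.
# Wording and placeholder names are hypothetical, not taken from the paper.
BASE = (
    "You are teaching a learner navigating this reward graph: {graph}. "
    "Reveal exactly one edge."
)

INFERENCE_SCAFFOLD = BASE + (
    " First, infer which edges the learner does not yet know from their"
    " behavior so far: {learner_path}. Then reveal the edge that best"
    " fills that gap."
)

REWARD_SCAFFOLD = BASE + (
    " Focus on the edges with the highest rewards when deciding what"
    " to reveal."
)
```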

Humans benefit from these.

LLMs? Not really.

| Condition | Human Effect | LLM Effect |
| --- | --- | --- |
| Inference scaffolding | Improves teaching | Little to no improvement |
| Reward scaffolding | Can hurt performance | Often reduces performance |

Even more telling:

  • Models follow the scaffolding instructions correctly
  • But their final teaching decisions don’t improve

Compliance ≠ competence.

Implications — What this means for business and AI systems

1. LLMs Are Already “Good Enough” Teachers—Structurally

If models naturally approximate optimal teaching policies, the bottleneck shifts:

  • Not capability
  • But alignment and control

2. Prompt Engineering Has Hard Limits

This paper quietly undermines a popular belief:

Better prompts → better outcomes

In reality:

  • Prompts can change behavioral surface
  • But not necessarily decision quality

For operators, this means diminishing returns on prompt tinkering.

3. Cognitive Evaluation > Output Evaluation

Most benchmarks ask:

  • “Did the student improve?”

This paper asks:

  • “Did the AI choose the right teaching action?”

That shift is subtle—but foundational.

4. Resource Economics Matter More Than We Thought

Humans avoid mentalizing because it’s costly.

LLMs don’t face the same constraint.

Implication:

| System Constraint | Expected Behavior |
| --- | --- |
| Ample compute, low latency pressure | Model-based (mentalizing) |
| Tight constraints (tokens, latency) | Shift toward heuristics |

In other words, deployment conditions may shape teaching strategy more than prompts do.

5. New Failure Mode: “Correct Reasoning, Wrong Intervention”

Perhaps the most uncomfortable takeaway:

  • LLMs can understand the right strategy
  • Execute intermediate steps correctly
  • And still choose suboptimal teaching actions

That’s not hallucination.

That’s policy misalignment.

Conclusion — The uncomfortable clarity

The paper answers its own title with a shrug that feels more like a warning:

LLMs don’t just mimic teaching.

They often approximate optimal teaching strategies—without being explicitly designed to do so.

And yet, they remain oddly resistant to improvement via the very tools we rely on most: prompts and scaffolding.

Which leaves us with a slightly inconvenient truth:

We are no longer optimizing capability. We are negotiating with already-capable systems.

That’s a very different engineering problem.

Cognaptus: Automate the Present, Incubate the Future.