Opening — Why this matters now
Most of the conversation around AI tutors is about whether they improve outcomes. Far fewer people ask the more uncomfortable question: how do they decide what to teach next?
That distinction matters. A system that truly models a learner behaves very differently from one that simply follows surface-level heuristics or prompt instructions. One adapts. The other performs.
A recent study attempts to pry open this black box—and the results are, predictably, both impressive and slightly unsettling.
Background — Context and prior art
Most evaluations of AI tutoring focus on outcomes: test scores, engagement, or completion rates. That’s useful, but shallow: outcome metrics say little about the strategy that produced them.
Human teaching research offers a sharper lens. Teachers typically fall into two camps:
| Strategy Type | Description | Trade-off |
|---|---|---|
| Model-based (mentalizing) | Infers what the learner knows and teaches accordingly | High effectiveness, high cognitive cost |
| Heuristic-based | Uses simple rules (e.g., highlight high-reward content) | Lower effort, inconsistent results |
In humans, these strategies coexist. People switch depending on effort, context, and incentives.
The open question: where do LLMs land?
Analysis — What the paper actually tests
The study uses a controlled Graph Teaching Task (see Figure 1 in the paper) where:
- A “learner” navigates a reward-based graph
- The “teacher” (LLM) reveals one edge to improve the learner’s future decision
- The optimal move depends on what the learner does not know
This setup forces a choice:
- Infer the learner’s missing knowledge (mentalizing)
- Or rely on shortcuts (e.g., pick the highest-reward edge), as in the toy sketch below
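To make the structure concrete, here is a toy version of the task in Python. The graph, rewards, and the learner’s greedy planner are illustrative assumptions, not the paper’s stimuli; the point is that the best edge to reveal depends on what the learner is missing, not on edge rewards alone.

```python
# Toy Graph Teaching Task: illustrative structure only, not the paper's stimuli.
# The learner plans over the edges it knows; the teacher reveals ONE hidden
# edge and is judged by how much that reveal improves the learner's best path.

# Directed graph: node -> {neighbor: edge reward}. Hypothetical values.
GRAPH = {
    "start": {"a": 2, "b": 3},
    "a": {"goal": 4},
    "b": {"goal": 9},
}

def best_path_value(known_edges, node="start"):
    """Value of the best path the learner can plan using only known edges."""
    options = [
        reward + best_path_value(known_edges, nxt)
        for nxt, reward in GRAPH.get(node, {}).items()
        if (node, nxt) in known_edges
    ]
    return max(options, default=0)

def teach(learner_knows):
    """Reveal the hidden edge that most improves the learner's plan."""
    all_edges = {(u, v) for u, nbrs in GRAPH.items() for v in nbrs}
    hidden = all_edges - learner_knows
    # Simulate the learner after each possible reveal (the mentalizing step).
    gains = {edge: best_path_value(learner_knows | {edge}) for edge in hidden}
    return max(gains, key=gains.get)

# The learner knows start->a and b->goal, but not a->goal or start->b.
known = {("start", "a"), ("b", "goal")}
print(teach(known))  # ('start', 'b'): reward 3, but it unlocks the 12-value
                     # path. The reward heuristic would reveal ('a', 'goal')
                     # (reward 4) for a much smaller gain.
```

This is exactly where the heuristic and the mentalizing teacher diverge: the highest-reward hidden edge is not the one the learner actually needs.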
The Core Metric: Teaching Score
The paper defines performance using a normalized utility metric:
| Metric | Meaning |
|---|---|
| Teaching Score | How close the chosen teaching action is to the optimal teaching move |
A score of 1.0 means the model behaves like a Bayes Optimal Teacher—the gold standard that perfectly infers the learner’s knowledge state.
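The paper’s exact normalization isn’t reproduced here, but a standard reading of “normalized utility” is a min-max rescaling over the available teaching actions. A minimal sketch under that assumption:

```python
def teaching_score(utilities, chosen):
    """Min-max normalized utility of the chosen teaching action.

    `utilities` maps each candidate action to the learner's expected utility
    after that action is taught (assumed definition; the paper's exact
    normalization may differ). 1.0 means the Bayes-optimal choice.
    """
    lo, hi = min(utilities.values()), max(utilities.values())
    if hi == lo:          # all actions equally good: any choice is optimal
        return 1.0
    return (utilities[chosen] - lo) / (hi - lo)

print(teaching_score({"edge_ab": 2.0, "edge_bc": 6.0, "edge_cd": 12.0},
                     chosen="edge_cd"))  # 1.0: matches the optimal move
```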
Cognitive Models Tested
The researchers didn’t just measure performance—they reverse-engineered strategy using cognitive model fitting:
| Model Type | Key Idea |
|---|---|
| Bayes Optimal Teacher | Infers learner knowledge via inverse planning |
| Weak Bayesian variants | Partial or simplified inference |
| Heuristics | Reward-based or depth-based shortcuts |
| Non-mentalizing | Ignores learner entirely |
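Mechanically, this kind of strategy attribution is standard cognitive model comparison: each candidate strategy assigns a likelihood to the observed teaching choices, usually through a softmax over its action values, and models compete on penalized fit. A minimal sketch assuming a softmax choice rule and BIC; the paper’s exact estimator may differ:

```python
import math

def softmax_loglik(values_per_trial, choices, beta):
    """Log-likelihood of observed choices under a softmax over a strategy's
    action values. values_per_trial[t] maps each candidate action on trial t
    to the value that strategy assigns it; choices[t] is the action taken."""
    ll = 0.0
    for values, choice in zip(values_per_trial, choices):
        z = sum(math.exp(beta * v) for v in values.values())
        ll += beta * values[choice] - math.log(z)
    return ll

def bic(loglik, n_params, n_trials):
    # Bayesian information criterion: lower is better.
    return n_params * math.log(n_trials) - 2 * loglik

# Toy example: the BOT model values edge "ab" at 12 and "cd" at 2,
# and the LLM chose "ab". Fit the inverse temperature on a small grid.
values_per_trial = [{"ab": 12.0, "cd": 2.0}]
choices = ["ab"]
ll = max(softmax_loglik(values_per_trial, choices, b) for b in (0.1, 0.5, 1, 2))
print(bic(ll, n_params=1, n_trials=len(choices)))
```

Run this once per candidate strategy, and whichever model yields the lowest BIC is declared the best explanation of the behavior.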
This is where things get interesting.
Findings — Results with visualization
1. LLMs Perform Surprisingly Well (and Consistently)
Across 40 trials, performance is flat—no learning curve. That’s expected because there’s no feedback loop.
But the absolute level? High.
| Observation | Interpretation |
|---|---|
| High Teaching Scores across models | LLMs already operate near optimal teaching policies |
| Stability across trials | No adaptation—but also no degradation |
2. They Look… Almost Human (But Not Quite)
From the graph-level analysis (Figure 2):
- Most models strongly correlate with human difficulty patterns (r ≈ 0.76–0.89)
- They struggle on the same problems humans do
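That comparison is simple to replicate in spirit: compute each group’s mean teaching score per graph and correlate the two difficulty profiles. The numbers below are hypothetical, not the paper’s data:

```python
from statistics import correlation  # Pearson's r, Python 3.10+

# Hypothetical per-graph mean teaching scores (not the paper's data).
human = [0.91, 0.62, 0.88, 0.45, 0.73]
model = [0.95, 0.70, 0.90, 0.52, 0.80]

print(correlation(human, model))  # high positive r: shared difficulty profile
```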
Yet the distribution differs:
| Group | Strategy Distribution |
|---|---|
| Humans | Bimodal (mentalizing + heuristics) |
| LLMs | Skewed toward high-performing, model-based behavior |
Translation: LLMs act like consistently “rational” teachers—something humans rarely manage.
3. Cognitive Model Fits: The Quiet Bombshell
The majority of LLM behavior is best explained by the Bayes Optimal Teacher model.
| Model Fit Result | Implication |
|---|---|
| BOT dominates across LLMs | Models behave as if they infer learner knowledge |
| Minimal heuristic alignment | Less shortcut-driven than humans |
This suggests something counterintuitive:
LLMs appear to mentalize—even without explicit learner models.
4. Prompting Doesn’t Fix (or Improve) Teaching
The second experiment introduces two kinds of scaffolding prompts (illustrative wording below):
- Inference scaffolding: highlight what the learner might not know
- Reward scaffolding: highlight high-value content
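Concretely, both conditions amount to prepending a hint to the teaching prompt. The wording below is a paraphrase for illustration, not the paper’s verbatim prompts:

```python
# Hypothetical paraphrases of the two scaffolding conditions.
SCAFFOLDS = {
    "inference": (
        "Before choosing what to teach, consider which edges the learner "
        "has not yet seen and which gap most limits their plan."
    ),
    "reward": (
        "Pay attention to the rewards on each edge; high-reward edges are "
        "often the most valuable content to teach."
    ),
}

def build_prompt(task_description, condition=None):
    """Prepend the scaffolding hint (if any) to the base teaching prompt."""
    hint = SCAFFOLDS.get(condition, "")
    return f"{hint}\n\n{task_description}".strip()
```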
Humans benefit from these.
LLMs? Not really.
| Condition | Human Effect | LLM Effect |
|---|---|---|
| Inference scaffolding | Improves teaching | Little to no improvement |
| Reward scaffolding | Can hurt performance | Often reduces performance |
Even more telling:
- Models follow the scaffolding instructions correctly
- But their final teaching decisions don’t improve
Compliance ≠ competence.
Implications — What this means for business and AI systems
1. LLMs Are Already “Good Enough” Teachers—Structurally
If models naturally approximate optimal teaching policies, the bottleneck shifts:
- Not capability
- But alignment and control
2. Prompt Engineering Has Hard Limits
This paper quietly undermines a popular belief:
Better prompts → better outcomes
In reality:
- Prompts can change surface behavior
- But they don’t necessarily change decision quality
For operators, this means diminishing returns on prompt tinkering.
3. Cognitive Evaluation > Output Evaluation
Most benchmarks ask:
- “Did the student improve?”
This paper asks:
- “Did the AI choose the right teaching action?”
That shift is subtle—but foundational.
4. Resource Economics Matter More Than We Thought
Humans avoid mentalizing because it’s costly.
LLMs don’t face the same constraint.
Implication:
| System Constraint | Expected Behavior |
|---|---|
| Ample compute, little latency pressure | Model-based (mentalizing) |
| Tight constraints (tokens, latency) | Shift toward heuristics |
In other words, deployment conditions may shape teaching strategy more than prompts do.
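If that’s right, the operational lever is budget design rather than prompt design. A toy policy sketch (my illustration, not anything the paper implements): simulate the learner when the budget allows, and fall back to a reward heuristic when it doesn’t.

```python
def choose_teaching_action(candidates, learner_gain, token_budget,
                           mentalize_cost=500):
    """Toy budget-sensitive teaching policy (illustrative, not from the paper).

    candidates: dict mapping each teachable action to its edge reward.
    learner_gain: callable scoring an action by simulated learner improvement.
    """
    if token_budget >= mentalize_cost * len(candidates):
        # Enough budget: simulate the learner for every candidate (mentalizing).
        return max(candidates, key=learner_gain)
    # Tight budget: fall back to the cheap reward heuristic.
    return max(candidates, key=candidates.get)
```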
5. New Failure Mode: “Correct Reasoning, Wrong Intervention”
Perhaps the most uncomfortable takeaway:
- LLMs can understand the right strategy
- Execute intermediate steps correctly
- And still choose suboptimal teaching actions
That’s not hallucination.
That’s policy misalignment.
Conclusion — The uncomfortable clarity
The paper answers its own title with a shrug that feels more like a warning:
LLMs don’t just mimic teaching.
They often approximate optimal teaching strategies—without being explicitly designed to do so.
And yet, they remain oddly resistant to improvement via the very tools we rely on most: prompts and scaffolding.
Which leaves us with a slightly inconvenient truth:
We are no longer optimizing capability. We are negotiating with already-capable systems.
That’s a very different engineering problem.
Cognaptus: Automate the Present, Incubate the Future.