Teaching Minds or Just Mimicking? When LLMs Play Teacher

Tutoring looks simple when the answer is already known.

A student takes the wrong path. The teacher sees the better path. The teacher gives one piece of advice. Everyone nods, learning happens, and somewhere a product slide quietly adds “personalized AI tutor” beside a cheerful icon of a graduation cap.

Unfortunately, teaching is not just telling. A good teacher must infer what the learner does not know, decide which missing piece matters most, and choose an intervention that changes the learner’s next action. That middle step is where many education products become suspiciously glossy. They can explain. They can encourage. They can generate hints. But do they choose what to teach because they model the learner, or because they have learned a collection of plausible tutoring moves?

The paper “Do Large Language Models Mentalize When They Teach?” asks exactly that question using a controlled Graph Teaching task adapted from prior work on human teaching strategies.[1] The study is useful not because it proves that LLMs possess human-like theory of mind. It does not. Its value is sharper and more operational: it gives us a way to compare LLM teaching decisions against cognitive models of teaching strategy.

That comparison produces a slightly inconvenient result. Most tested LLMs behave more like a Bayes-Optimal Teacher than like simple reward-based heuristics in this task. Yet when the researchers add lightweight scaffolding prompts, the models often comply with the scaffolding step without reliably improving the final teaching decision. In other words, the model can perform the ritual and still fail to benefit from the ritual. Prompt engineering, once again, refuses to be magic with nicer typography.

The core lesson for business is not “LLM tutors are good” or “LLM tutors are fake.” The better lesson is this: AI tutoring systems need evaluation of instructional action policy, not just answer quality, reasoning traces, or prompt compliance.

The task separates teaching from merely pointing at high rewards

The Graph Teaching task is deliberately small, but that is the point. It strips teaching down to one decision.

A learner moves through a directed acyclic graph from a top node to a terminal node. Each node has a reward. The learner knows only some of the edges and chooses the best path available under that partial knowledge. The teacher sees the full graph, the rewards, and the learner’s observed trajectory. The teacher must reveal exactly one edge so that, if the learner replans, their expected score improves.

The teacher does not directly observe the learner’s hidden knowledge. The teacher must infer it.

That turns a toy graph into a diagnostic for tutoring strategy. If a learner skips a high-value route, perhaps they do not know an edge on that route. But which edge? And would revealing that edge actually change the learner’s best path? A shallow rule like “teach the edge connected to the biggest number” may work on some graphs and fail on others. A more model-based teacher asks a harder question: given the path the learner chose, what must the learner probably know or not know, and which new edge has the highest expected teaching utility?

The paper compares LLM choices against three broad classes of models:

| Strategy family | What it assumes the teacher is doing | Why it matters for AI tutoring |
| --- | --- | --- |
| Bayesian Teacher models | Infer the learner’s hidden transition knowledge from the observed trajectory, then choose the edge with high expected improvement | Closest to learner-aware teaching: not just useful content, but useful content for this learner |
| Heuristic models | Use simple cues such as endpoint reward or graph depth | Plausible shortcut behavior: cheap, often persuasive, sometimes wrong |
| Non-mentalizing utility models | Compute graph utility without representing the learner’s knowledge state | Can look intelligent while treating the task as a puzzle rather than a tutoring problem |

The key Bayesian model is the Bayes-Optimal Teacher, which performs inverse planning: it treats the learner’s path as evidence about the learner’s partial knowledge. The utility of teaching an edge is the expected gain in the learner’s optimal value after that edge is revealed.
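To make the inverse-planning idea concrete, here is a minimal Python sketch on a made-up six-node graph. It is not the paper's implementation. The graph, the uniform prior over edge subsets, the noise-free learner, the rule that a learner stops at a node with no known outgoing edge, and the function names `plan`, `posterior`, `expected_gain`, `bayes_teacher`, and `reward_heuristic` are all assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical toy graph (not from the paper): node rewards and true edges.
REWARDS = {"A": 0, "B": 1, "C": 2, "D": 8, "E": 3, "F": 6}
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "E"), ("C", "F")]


def plan(known, node="A"):
    """Best (value, path) from `node` using only the edges in `known`."""
    options = [plan(known, v) for (u, v) in sorted(known) if u == node]
    if not options:  # assumption: no known outgoing edge means the learner stops
        return REWARDS[node], (node,)
    value, path = max(options, key=lambda o: o[0])
    return REWARDS[node] + value, (node,) + path


def posterior(observed_path):
    """Edge subsets whose optimal plan reproduces the observed path.

    With a uniform prior and a deterministic learner, the posterior is
    uniform over these consistent subsets.
    """
    subsets = [frozenset(c) for r in range(len(EDGES) + 1)
               for c in combinations(EDGES, r)]
    return [k for k in subsets if plan(k)[1] == tuple(observed_path)]


def expected_gain(edge, hypotheses):
    """Average improvement in the learner's best value if `edge` is revealed."""
    gains = [plan(k | {edge})[0] - plan(k)[0] for k in hypotheses]
    return sum(gains) / len(gains)


def bayes_teacher(observed_path):
    """Pick the edge with the highest expected gain under the posterior."""
    hypotheses = posterior(observed_path)
    return max(EDGES, key=lambda e: expected_gain(e, hypotheses))


def reward_heuristic():
    """Shortcut: pick the edge that points at the largest reward."""
    return max(EDGES, key=lambda e: REWARDS[e[1]])
```

On this toy graph, a learner observed taking A, C, E cannot know the edge from C to F, since they would have used it. Revealing that edge therefore helps under every remaining hypothesis, and `bayes_teacher(["A", "C", "E"])` selects it. The edge from B to D only helps if the learner already knows how to reach B, yet `reward_heuristic()` still picks it because D carries the largest reward.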

The paper then fits these candidate models to each simulated LLM teacher’s trial-by-trial choices and compares them using the Bayesian Information Criterion (BIC). A lower BIC means the cognitive model explains that teacher’s observed choices better after a penalty for model complexity.
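For readers who have not used BIC, the standard definition is $k \ln n - 2 \ln \hat{L}$, where $k$ is the number of free parameters, $n$ the number of observations, and $\hat{L}$ the maximized likelihood. The sketch below applies that formula to invented per-trial probabilities. The model names, parameter counts, and numbers are placeholders, not values from the paper, and the paper's actual fitting procedure may differ.

```python
import math


def log_likelihood(choice_probs):
    """Sum of log-probabilities a model assigned to the edges actually chosen."""
    return sum(math.log(p) for p in choice_probs)


def bic(choice_probs, n_params):
    """BIC = k * ln(n) - 2 * ln(L); lower is better."""
    n_trials = len(choice_probs)
    return n_params * math.log(n_trials) - 2.0 * log_likelihood(choice_probs)


# Hypothetical probabilities each candidate model gave to one teacher's
# actual choices on four trials (illustrative numbers only).
candidates = {
    "bayes_optimal": {"probs": [0.7, 0.6, 0.8, 0.5], "n_params": 1},
    "reward_heuristic": {"probs": [0.4, 0.3, 0.5, 0.4], "n_params": 1},
}

scores = {name: bic(m["probs"], m["n_params"]) for name, m in candidates.items()}
best_fit = min(scores, key=scores.get)  # model with the lowest BIC
```

Because both placeholder models have the same number of parameters here, the comparison reduces to likelihood. The complexity penalty only starts to matter when candidate models differ in $k$.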

This is already more informative than a normal benchmark score. A benchmark might say the model selected good edges. Cognitive model fitting asks what kind of policy would have generated those selections.

Comparison one: LLMs look less heuristic than many human teachers in the baseline task

In the baseline experiment, each simulated teacher completed 40 trials: 20 graph configurations and their horizontally flipped versions. The researchers used the same graph stimuli as the earlier human experiment, reset conversation context between simulated teachers, preserved history within a teacher, and gave no feedback about whether the teaching choice worked.

The no-feedback detail matters. This was not a learning-over-time test. The task asked what strategy the model brought into the setting, not what it could learn after being corrected.

The first result is modest but important: teaching scores were essentially flat over trials, with correlations between trial number and teaching score satisfying $|r| < 0.1$. That means performance did not drift much across the 40 trials. The models were not gradually discovering the task.

The second result is stronger. Most LLMs showed graph-wise performance patterns similar to humans. The researchers ordered the 20 unique graphs by human difficulty and then compared each model’s average teaching score across those graphs with the human profile. Seven models had strong positive correlations with human graph-wise performance, around $r \approx 0.76$ to $0.89$, all with $p < 10^{-4}$. Two additional models showed moderate correlations, around $r \approx 0.46$ to $0.56$ with $p < 0.05$. GPT-3.5 and Llama 4 Maverick were not significantly correlated with humans.

That does not mean LLMs and humans used the same strategy. It means many models were sensitive to the same graph properties that made the task easier or harder for humans.

The distribution result is where the comparison becomes more interesting. Human teachers in the prior data were bimodal: some looked more like higher-performing mentalizing teachers, while others looked more heuristic. Most LLM simulated teachers clustered in the high-performing range, near the Bayes-Optimal Teacher benchmark and overlapping with the stronger human teachers. GPT-4o was reported as closest to human teachers in both performance level and variability, while GPT-o4-mini, Gemini 2.5 Flash, and Claude Sonnet 4.5 showed more variability.

The model-fitting result then tightens the interpretation. For most LLMs, the Bayes-Optimal Teacher gave the best overall account of trial-by-trial choices. Compared with humans, LLMs showed less alignment with heuristic and non-mentalizing utility strategies. The clearest heterogeneity appeared for GPT-o3-mini, Llama 4 Maverick, Gemini 2.5 Flash, and Claude Haiku 4.5, whose best-fitting model distributions were more mixed.

This is not a declaration that LLMs “understand students.” It is a behavioral result under a controlled task. The paper shows that, in this task, many contemporary LLMs choose edges in a way better described by learner-state inference than by simple reward salience.

That distinction is not academic decoration. In a tutoring product, the difference between “this content is generally valuable” and “this learner likely lacks this prerequisite right now” is the difference between content recommendation and teaching.

Comparison two: human strategy mixtures and LLM strategy concentration are not the same phenomenon

The comparison with human teachers could be read too quickly. A careless summary might say: “LLMs teach like humans.” The paper’s evidence says something narrower.

Humans and LLMs share some graph-wise difficulty patterns, but their strategy distributions differ. Human data show a visible mixture: some people rely on learner-model reasoning, while others fall back on heuristics. LLMs, by contrast, often concentrate near high-performing model-based behavior.

The authors interpret this through a resource-rational lens. For humans, mentalizing about a learner is useful but cognitively costly. When a simple cue seems good enough, people may rationally save effort and use the shortcut. Anyone who has taught a real class after lunch will recognize the phenomenon. The reward heuristic is not stupidity; it is cognitive economy.

LLMs face a different cost landscape. At inference time, a model does not experience cognitive effort in the human sense. A reasoning-oriented model may produce a computation that resembles inverse planning without the same subjective cost pressure that pushes a human toward shortcuts. That makes the LLM pattern less mysterious: the model may stay closer to a “model-based” policy because the human reason to economize is not present in the same way.

The business implication is subtle. If an AI tutor looks more consistently model-based than humans in a toy diagnostic, that does not automatically make it a better tutor in real settings. It may mean the diagnostic has isolated a type of structured decision where the model’s pattern-recognition and reasoning machinery align well with the Bayes-optimal solution.

But it does suggest a useful evaluation pathway. Instead of asking whether an AI tutor sounds supportive or gives correct explanations, we can ask whether its instructional decisions are better fit by learner-aware models or by shortcut policies.

That is a much better audit question.

Comparison three: scaffold compliance is not the same as better teaching

The second experiment is the part product teams should read twice.

The researchers tested whether lightweight scaffolding prompts could shift LLM teaching behavior, analogous to interventions previously used with human teachers. The design crossed two training environments with three scaffolding conditions.

The training environments were:

  • Heuristic-congruent training, where the reward heuristic tended to agree with the Bayes-optimal teaching choice.
  • Heuristic-incongruent training, where the reward heuristic was misleading, so good teaching required reasoning about learner knowledge.

The scaffolding conditions were:

  • No scaffolding, where the model simply chose an edge to teach.
  • Inference scaffolding, where the model first selected three edges it thought the learner did not know.
  • Reward scaffolding, where the model first selected three edges connected to the largest-value nodes.

After training, all groups faced the same heuristic-incongruent test graphs without the auxiliary scaffolding prompt. This detail is crucial. The question was not merely whether a scaffold could influence the immediate answer while present. The question was whether the scaffold changed the later teaching policy.

The paper first checks whether the models followed the scaffolded subtask. They did. Under reward scaffolding, models preferentially selected high-reward-ranked edges. Under inference scaffolding, they showed a systematic preference for edges ranked as more likely unknown to the learner under the Bayes-Optimal Teacher. Figure 5 of the paper is therefore best read as a compliance and manipulation check, not as the main teaching result.

Then comes the uncomfortable part: compliance did not reliably become better teaching.

On the test trials, inference scaffolding showed little evidence of a consistent improvement over no scaffolding. Reward scaffolding often reduced performance, especially for several models. The paper also reports that some of this degradation appeared even during training, where the teaching decision immediately followed the scaffolded step. Model fits on test trials still generally favored the Bayes-Optimal Teacher over the Reward Heuristic, although reward scaffolding made the reward heuristic more competitive for some models.

A useful way to read the experiments is this:

| Experiment component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Baseline 40-trial task | Main evidence on unscaffolded teaching strategy | Many LLM choices are better fit by Bayes-Optimal Teacher than by simple heuristics | That LLMs possess human-like mental-state representations |
| Graph-wise correlation with humans | Comparison with prior human data | Many LLMs are sensitive to graph difficulty patterns similar to humans | That humans and LLMs use identical internal mechanisms |
| Distribution of teaching scores | Strategy heterogeneity check | LLMs cluster more in high-performing ranges, while humans show more bimodality | That every LLM is reliably model-based in every teaching domain |
| Auxiliary scaffolded edge selections | Manipulation/compliance check | Models follow inference- and reward-focused scaffold instructions | That following the scaffold improves final teaching decisions |
| Test performance after scaffolding | Main evidence for intervention transfer | Lightweight scaffolds do not reliably improve later LLM teaching and can impair it | That all scaffolding is useless; only that these scaffolds did not reliably help here |
| Supplementary training-performance figure | Robustness/detail on when degradation appears | Performance reduction can appear even while scaffolds are present | A complete causal mechanism for why scaffolding hurts |

This is the paper’s most practical contribution. It separates two things that are often casually merged in LLM product work:

  1. Can the model produce the requested intermediate reasoning artifact?
  2. Does that artifact improve the final action policy?

The answer can be yes to the first and no to the second.
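One way to keep the two questions apart in an evaluation is to report them as separate numbers. The sketch below assumes a hypothetical log with one record per simulated teacher: how often it followed the scaffold during training, and how well it taught on unscaffolded test trials. The field names and all values are invented for illustration and are not results from the paper.

```python
from statistics import mean

# Hypothetical per-teacher records (illustrative numbers only).
teachers = [
    {"condition": "none", "train_compliance": None, "test_score": 0.80},
    {"condition": "none", "train_compliance": None, "test_score": 0.90},
    {"condition": "inference", "train_compliance": 0.95, "test_score": 0.85},
    {"condition": "inference", "train_compliance": 0.90, "test_score": 0.80},
    {"condition": "reward", "train_compliance": 0.90, "test_score": 0.60},
    {"condition": "reward", "train_compliance": 1.00, "test_score": 0.70},
]


def summarize(records, baseline="none"):
    """Report scaffold compliance and test-score change as separate metrics."""
    by_condition = {}
    for rec in records:
        by_condition.setdefault(rec["condition"], []).append(rec)

    baseline_score = mean(r["test_score"] for r in by_condition[baseline])
    report = {}
    for condition, rows in by_condition.items():
        if condition == baseline:
            continue
        report[condition] = {
            # Question 1: did the model produce the requested artifact?
            "compliance": mean(r["train_compliance"] for r in rows),
            # Question 2: did the final teaching choice get better?
            "score_delta": mean(r["test_score"] for r in rows) - baseline_score,
        }
    return report
```

In this made-up data both scaffolds show high compliance while neither improves the test score, which is the kind of pattern a single combined metric would hide.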

For AI tutoring, that distinction is not optional. A tutor may write a neat “student misconception analysis” before giving a hint. It may list prerequisites. It may identify likely gaps. It may perform all the visible ceremonies of pedagogy. But if the final hint selection is not actually conditioned on that analysis, the scaffold is theatre.

Useful theatre, perhaps. Nice theatre, even. Still theatre.

The paper’s strongest business signal is diagnostic, not decorative

For companies building AI tutors, course copilots, customer-training assistants, onboarding bots, or internal knowledge coaches, the paper points to a concrete evaluation upgrade.

Do not only test whether the model can explain the right answer. Test whether the model chooses the right instructional intervention under uncertainty about the learner.

That means creating small diagnostic environments where learner state is hidden but inferable from behavior. The Graph Teaching task is one example. A business version could use product onboarding paths, sales-training simulations, compliance-learning modules, or coding-tutor exercises. The common structure is the same:

  • the learner takes an observable action;
  • the system sees a richer task environment;
  • the learner’s hidden knowledge must be inferred;
  • the AI tutor chooses one intervention;
  • the choice is scored by expected improvement, not by textual elegance.
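The last bullet can be turned into a simple scoring rule. This is a sketch under assumptions: the `DiagnosticTrial` structure, the onboarding example, and the choice to normalize by the best available gain are all hypothetical, and the paper's own teaching score may be defined differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticTrial:
    """One hidden-state diagnostic case (all fields are hypothetical)."""
    observed_action: str
    # Expected learner improvement for each candidate intervention,
    # computed offline from the full task environment.
    expected_gain: dict[str, float]


def teaching_score(trial, chosen):
    """Fraction of the best achievable expected gain captured by the choice."""
    best = max(trial.expected_gain.values())
    if best <= 0:
        return 1.0  # no intervention can help, so no choice is penalized
    return max(trial.expected_gain.get(chosen, 0.0), 0.0) / best


# Made-up product-onboarding case.
trial = DiagnosticTrial(
    observed_action="skipped the import wizard",
    expected_gain={
        "explain_import_wizard": 3.0,
        "show_premium_dashboard": 1.0,
        "repeat_welcome_tour": 0.0,
    },
)
```

Here `teaching_score(trial, "show_premium_dashboard")` comes out at one third regardless of how polished the accompanying explanation is, because only the selected intervention enters the score.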

This shifts evaluation away from “does the model sound like a tutor?” toward “does the model select interventions as if it has a learner model?” That is a higher bar and a more useful one.

A practical evaluation framework could look like this:

| Product question | Diagnostic version | Failure mode it catches |
| --- | --- | --- |
| Does the tutor adapt to the learner? | Fit choices against learner-state-aware and heuristic models | Generic hints dressed up as personalization |
| Does chain-of-thought-style scaffolding help? | Compare final action quality with and without the scaffold, after removing the scaffold at test | Surface compliance without policy improvement |
| Does the model overuse salient content? | Create cases where high-value-looking content is not the best intervention | Reward-salience shortcuts |
| Does performance survive constraints? | Repeat under shorter context, lower latency, or token budgets | Strategy collapse into simpler heuristics |
| Is the system improving over versions? | Track teaching-policy fit, not only answer accuracy | Model upgrades that improve fluency but not instruction |

The ROI argument is also practical. A full classroom RCT is expensive and slow. A cognitive diagnostic task is not a substitute for real deployment evidence, but it can serve as an early filter. Before testing whether an AI tutor improves learning outcomes at scale, a company can test whether the tutor’s action policy is even plausibly learner-aware in controlled cases.

That is cheaper than discovering, after deployment, that the “adaptive tutor” mostly teaches whatever looks shiny.

Why generic prompting may fail even when the intermediate output looks right

The scaffolding result deserves a mechanism-level interpretation.

In humans, a scaffold can reduce cognitive cost. Asking a teacher to identify what the learner likely does not know may make the learner-state inference more available, which then improves the teaching choice. The scaffold changes the human’s effective decision process.

For an LLM, the same scaffold may function differently. It may become an additional instruction-following subtask inserted into the conversation history. The model can complete that subtask, then still choose the final teaching edge through a policy that only loosely uses the scaffolded output. Worse, the extra subtask may introduce irrelevant salience, especially when the scaffold points toward reward-heavy edges.

This creates a useful design rule: intermediate reasoning should not merely precede action; it should constrain action.

In a production AI tutor, that could mean:

  • forcing the final hint to reference a selected learner-state hypothesis;
  • scoring candidate interventions against the inferred gap before generation;
  • separating diagnosis from action selection in a structured pipeline;
  • using verifier models or rules to reject hints that do not follow from the diagnosis;
  • logging the inferred learner state and the chosen intervention as separate objects.
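A minimal sketch of what "diagnosis constrains action" could look like in code follows. Everything here is hypothetical: the `Diagnosis` and `Intervention` types, the exact-match rule between gap and hint, and the logging format are illustrative design choices, not something the paper proposes or tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    gap: str           # hypothesized missing knowledge
    confidence: float  # how strongly the evidence supports it


@dataclass(frozen=True)
class Intervention:
    hint: str
    addresses: str        # the gap this hint is meant to close
    expected_gain: float  # estimated benefit if the gap is real


def select_intervention(diagnosis, candidates):
    """Choose only among candidates that address the diagnosed gap."""
    eligible = [c for c in candidates if c.addresses == diagnosis.gap]
    if not eligible:
        return None  # reject rather than emit a hint unrelated to the diagnosis
    return max(eligible, key=lambda c: c.expected_gain)


def tutor_step(diagnosis, candidates, log):
    """Select an intervention and log diagnosis and choice as separate objects."""
    choice = select_intervention(diagnosis, candidates)
    log.append({"diagnosis": diagnosis, "intervention": choice})
    return choice


# Made-up example: the flashier hint is excluded because it does not
# follow from the diagnosis.
audit_log = []
diagnosis = Diagnosis(gap="edge C->F", confidence=0.9)
candidates = [
    Intervention("Show that D is reachable from B", "edge B->D", 4.0),
    Intervention("Show that F is reachable from C", "edge C->F", 3.0),
]
chosen = tutor_step(diagnosis, candidates, audit_log)
```

The point of the filter is that the higher-gain hint about B and D never reaches the final ranking, so the diagnosis acts as a gate on the action rather than as text that merely precedes it. A production system would likely need a softer matching rule than string equality.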

The paper does not test these product designs directly. They are Cognaptus-level inferences from the result. But the direction is clear: if the scaffold is only text in the context window, it may be too weak. A good tutoring architecture may need the diagnosis to become part of the decision mechanism, not just a paragraph above the decision.

The boundary: Bayes-optimal fit is not proof of a teaching mind

The word “mentalize” is useful, but dangerous if handled lazily.

The paper is careful here. Strong fits to a Bayes-Optimal Teacher model do not prove that LLMs perform human-like mental-state inference. A high-capacity model could learn a sophisticated mapping from trajectories and rewards to good edge choices. It could imitate Bayes-optimal behavior without explicitly representing the learner’s knowledge state in a human-like way.

This is not hair-splitting. It affects deployment.

If an LLM has learned an amortized shortcut that works on familiar task structures, it may fail when the learner policy changes, when the interface becomes noisy, when the graph-like structure is replaced by real student misconceptions, or when latency and token limits pressure the system toward cheaper rules. The paper itself suggests that future tests should manipulate constraints such as latency, context length, or token budget, and should use adversarial graph families where shortcuts fail.

The practical boundary is therefore:

| Directly shown by the paper | Reasonable business inference | Still uncertain |
| --- | --- | --- |
| In a controlled Graph Teaching task, many LLMs choose edges best explained by a Bayes-Optimal Teacher model | Cognitive-model diagnostics can reveal whether an AI tutor’s choices look learner-aware or heuristic | Whether the same pattern holds in messy classroom, enterprise, or customer-training settings |
| LLMs can comply with inference and reward scaffolds | Visible reasoning artifacts are measurable and auditable | Whether those artifacts causally improve final intervention selection |
| Lightweight scaffolds did not reliably improve later teaching and sometimes hurt | Prompt-only tutoring scaffolds should be tested against action outcomes, not trusted by appearance | Which architectural scaffolds can reliably improve LLM teaching policies |
| Human and LLM strategy distributions differ | AI tutors should not be evaluated by human analogy alone | Whether LLMs under cost constraints become more human-like in their heuristic shortcuts |

This is also where companies should resist a tempting but weak conclusion: “The model is already Bayes-optimal, so no scaffolding is needed.” The paper’s setting is structured, short-horizon, and symbolic. Real tutoring systems deal with ambiguous language, partial student histories, affect, curriculum constraints, safety policies, and occasionally the learner saying “I get it” while getting nothing. A controlled graph task is a microscope, not a classroom.

But microscopes are useful precisely because they make small mechanisms visible.

What this means for AI tutoring teams

A serious AI tutoring team should take three operational lessons from this paper.

First, evaluate the teaching move, not only the explanation. The model’s final instructional action should be scored against the learner’s likely state. A beautiful explanation of the wrong next step is still the wrong next step. It just has better lighting.

Second, treat scaffolds as hypotheses, not upgrades. If a prompt asks the model to infer the learner’s missing knowledge, do not assume the final answer uses that inference. Measure whether intervention quality improves after scaffolding, including on test cases where the scaffold is removed or where reward salience conflicts with learner-state inference.

Third, separate diagnosis and intervention selection in the architecture. For high-stakes or high-volume tutoring products, the system should store learner-state hypotheses, candidate interventions, expected utility estimates, and final choices as structured intermediate objects. That makes the tutor auditable. It also makes failures easier to debug than a long conversational trace that says many reasonable things before doing something mediocre.

The deeper point is not that every AI tutor needs a Bayes-Optimal Teacher model under the hood. The deeper point is that tutoring is a policy problem. It requires choosing what to do next under uncertainty about the learner. Once framed that way, generic chat quality becomes only one component of system quality.

Teaching minds, or just mimicking?

So, do LLMs mentalize when they teach?

The most honest answer is: in this task, many tested LLMs behave as if they are using a learner-model-based teaching policy, and that behavior is better captured by a Bayes-Optimal Teacher model than by simpler heuristics. But behaviorally matching a cognitive model is not the same as possessing a human teaching mind.

That answer may feel less cinematic than “AI understands students now.” It is also far more useful.

The paper’s real contribution is a diagnostic vocabulary. It lets us compare teaching-like behavior against alternative strategy models: learner-state inference, reward salience, graph utility, and other shortcuts. It also shows why visible scaffolding is insufficient. A model can follow the scaffold and still fail to improve the pedagogical action.

For businesses, that is the line worth remembering. The next generation of AI tutoring systems should not be judged by whether they sound patient, Socratic, or encouraging. Those qualities matter, but they are not the core mechanism. The core mechanism is whether the system can infer what the learner probably lacks and choose the next intervention accordingly.

A tutor that merely sounds thoughtful is a chatbot with manners. A tutor that chooses well under learner uncertainty is closer to an instructional system.

The market will contain plenty of the first. The second is where evaluation needs to go.

Cognaptus: Automate the Present, Incubate the Future.


  1. Sevan K. Harootonian, Mark K. Ho, Thomas L. Griffiths, Yael Niv, and Ilia Sucholutsky, “Do Large Language Models Mentalize When They Teach?”, arXiv:2604.01594, 2026. https://arxiv.org/abs/2604.01594