Mind Games for Machines: How Decrypto Reveals the Hidden Gaps in AI Reasoning

TL;DR for operators

Meetings are easy to automate until someone has to understand what everyone else thinks everyone else knows. That is the useful discomfort created by Decrypto, a new benchmark for multi-agent reasoning and theory of mind in language models.¹

The benchmark is built around a simple word game. Alice and Bob share four secret keywords. Alice receives a three-digit code and gives three public hints. Bob must recover the code. Eve sees the same hints but does not know the secret keywords and tries to intercept. Alice’s job is therefore not “give good clues.” It is “give clues calibrated to Bob’s knowledge while limiting Eve’s inference.” Welcome to enterprise communication, but with fewer calendar invites.

The paper’s central finding is not that LLMs cannot play games. They can. The more uncomfortable result is that frontier models struggle with the exact skills agentic deployments increasingly require: ad-hoc coordination, reasoning about partial information, and predicting what another actor will infer from the same message. In several settings, simple word-embedding baselines outperform LLMs. Human teams also remain much better at using and interpreting hints, especially when LLMs are asked to coordinate with human encoders.

The theory-of-mind results are sharper. Most tested models do reasonably well on weak representational-change and false-belief variants, where they only need to realise that an earlier belief could have been wrong. But they perform poorly on strong variants requiring self-consistency. In perspective taking, models often predict that Eve will intercept almost every turn, even though the actual interception rate is roughly 52%. Several models behave as if Eve knows the secret keywords, despite being told that she does not.

The business implication is plain: do not infer social competence from strong reasoning scores. A model that solves hard individual tasks may still fail when it has to coordinate with a user, anticipate a counterparty, or choose what to reveal under asymmetric information. For practical deployment, Decrypto argues for interactive evaluation: test not only whether an agent can answer, but whether it can communicate strategically with humans and other agents under incomplete shared context.

The boundary is equally plain. Decrypto is a stylised word-game benchmark. It evaluates selected theory-of-mind abilities: representational change, false belief, and perspective taking. It does not prove whether models understand intentions, emotions, trust, incentives, negotiation, or office politics. Sadly, office politics remains undefeated.

The game is simple; the reasoning is not

Decrypto looks harmless because it is made of words. No robotics simulator. No visual maze. No tool chain. No 200-step software ticket. Just three players, four secret keywords, a code, and a few hints.

That simplicity is the point.

Alice and Bob share four ordered secret keywords, such as:

Digit	Secret keyword
1	star
2	jazz
3	thunder
4	plane

On each turn, Alice receives a code such as 2-3-4. She must produce three public hints, one per digit. Bob sees the hints and tries to infer the code. Eve sees the same hints but does not know the keywords. If Bob guesses wrong, Alice and Bob receive a miscommunication token. If Eve guesses correctly, she receives an intercept token. The game ends as soon as two miscommunication tokens or two intercept tokens have accumulated, or once Alice and Bob survive eight turns.
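
For readers who think in code, the token rules can be written as a few lines of Python. This is a minimal sketch of the bookkeeping as described in this article, not the benchmark's implementation; `GameState`, `play_turn`, and `outcome` are illustrative names, and the sketch treats either pile of two tokens as a loss for Alice and Bob.

```python
from dataclasses import dataclass

MAX_TURNS = 8     # Alice and Bob survive if they reach this many turns
TOKEN_LIMIT = 2   # two tokens of either kind end the game


@dataclass
class GameState:
    miscommunications: int = 0
    intercepts: int = 0
    turns_played: int = 0


def play_turn(state, code, bob_guess, eve_guess):
    """Update the token counts after one turn of guesses."""
    state.turns_played += 1
    if bob_guess != code:
        state.miscommunications += 1   # Bob failed to decode
    if eve_guess == code:
        state.intercepts += 1          # Eve decoded the same hints
    return state


def outcome(state):
    """Return 'team_loses', 'team_survives', or None while play continues."""
    if state.miscommunications >= TOKEN_LIMIT or state.intercepts >= TOKEN_LIMIT:
        return "team_loses"
    if state.turns_played >= MAX_TURNS:
        return "team_survives"
    return None
```

Note that a single turn can hand out both tokens at once: Bob can miss while Eve hits. That is the trap in miniature.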

That creates a beautiful little trap. Literal hints help Bob but also help Eve. Obscure hints protect against Eve but risk confusing Bob. As the game continues, Eve accumulates hint history, making future interception easier. Alice must therefore reason not only about the meaning of words but about how much information each word leaks to each player.

This is where Decrypto becomes more than a parlour game. It turns communication into a controlled test of asymmetric information.

A weak model can map “thunder” to “storm.” A stronger model might map “AC/DC” to “Thunderstruck.” A socially competent model must ask a harder question: will Bob make that association, and will Eve make it too? That is the distinction Decrypto is designed to expose.

Decrypto tests calibration, not vocabulary size

Many language benchmarks reward directness. Decrypto punishes it.

In a normal benchmark, a model receives a question and produces the best answer. In Decrypto, “best” depends on who is listening. Alice’s hint is successful only if it lands in a narrow communicative channel: interpretable by Bob, not too interpretable by Eve.

This matters because many real business uses of agents have the same structure. A procurement agent negotiating with suppliers should reveal enough to move the process forward, but not enough to weaken bargaining position. A customer-support agent should infer what the user likely misunderstood without hallucinating private intent. A workflow agent coordinating with another system should send instructions that are actionable to the partner system but not overexpose internal state. A compliance assistant should explain a risk without encouraging a workaround. The boring word for all this is “context management.” The more accurate word is social reasoning.

Decrypto operationalises that problem through three observable failures:

Failure mode	In the game	In business deployment
Miscommunication	Bob cannot decode Alice’s hints	The agent gives instructions that the user or teammate cannot act on
Interception	Eve decodes the code	The agent reveals too much, frames a message badly, or exposes sensitive inference paths
Poor perspective taking	Alice predicts Eve incorrectly	The agent misunderstands what another actor can infer from available information

That is why the benchmark is useful. It does not ask whether a model can produce fluent language. It asks whether the model can tune language to different minds.

The strongest evidence is not a leaderboard; it is the pattern of failure

The paper evaluates both specialist baselines and general-purpose LLMs. The specialist baselines use GloVe and Word2Vec embeddings. Alice chooses hints from semantically similar nouns, Bob maps hints back to keywords using cosine similarity, and Eve uses hint histories to infer assignments.
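
A toy version of the decoder makes the mechanism concrete. The sketch below uses hand-made four-dimensional vectors in place of real GloVe or Word2Vec embeddings, and it scores all three hints jointly over distinct digits. Both choices are assumptions for illustration; the paper's baseline may differ in its details.

```python
import math
from itertools import permutations

# Hand-made toy vectors standing in for real word embeddings (assumption).
TOY_VECTORS = {
    "star": (0.9, 0.1, 0.0, 0.1),
    "jazz": (0.0, 0.9, 0.1, 0.0),
    "thunder": (0.1, 0.0, 0.9, 0.2),
    "plane": (0.2, 0.1, 0.1, 0.9),
    "saxophone": (0.1, 0.8, 0.0, 0.1),
    "storm": (0.0, 0.1, 0.8, 0.3),
    "airport": (0.1, 0.0, 0.2, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def decode(hints, keywords, vectors):
    """Bob: choose the code whose keywords best match the hints overall."""
    digits = range(1, len(keywords) + 1)

    def total_similarity(code):
        return sum(
            cosine(vectors[hint], vectors[keywords[digit - 1]])
            for hint, digit in zip(hints, code)
        )

    # Codes are orderings of distinct digits, one digit per hint.
    return max(permutations(digits, len(hints)), key=total_similarity)


keywords = ["star", "jazz", "thunder", "plane"]
guess = decode(["saxophone", "storm", "airport"], keywords, TOY_VECTORS)
print(guess)  # (2, 3, 4)
```

Nothing here models anyone's beliefs. Bob succeeds only because he measures similarity in the same space Alice used to pick the hints.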

A simple baseline of this kind should not be socially brilliant. It has no rich inner life, no corporate empathy module, no laminated theory-of-mind certificate. Yet when Alice and Bob share the same embedding strategy, the baseline coordinates extremely well. The reason is mechanical: Alice and Bob use the same representational geometry. They “think alike” in the only sense the game requires.

The catch appears in cross-play. When Alice uses one embedding model and Bob uses another, coordination can collapse. The paper gives a telling example: for the keyword “fire,” a GloVe-based Alice might choose “oil” as a reasonably similar hint, while Word2Vec-Bob may not rank “oil” among the top 1,000 related words. Same English. Different associative map. Miscommunication follows.
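
The same effect can be reproduced with two made-up associative maps. In the sketch below, `MODEL_A` and `MODEL_B` are invented two-dimensional vectors, not GloVe or Word2Vec values, and `rank_of` is an illustrative helper that reports how highly one model ranks a hint among a keyword's neighbours.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_of(hint, keyword, vectors):
    """1-based rank of `hint` among all other words, by similarity to `keyword`."""
    others = [word for word in vectors if word != keyword]
    ordered = sorted(
        others,
        key=lambda word: cosine(vectors[word], vectors[keyword]),
        reverse=True,
    )
    return ordered.index(hint) + 1


# Two invented "embedding models" that disagree about where "oil" lives.
MODEL_A = {
    "fire": (1.0, 0.2), "oil": (0.9, 0.3), "flame": (0.8, 0.5),
    "water": (0.1, 1.0), "ice": (0.0, 1.0),
}
MODEL_B = {
    "fire": (1.0, 0.2), "oil": (0.1, 1.0), "flame": (0.9, 0.3),
    "water": (0.6, 0.6), "ice": (0.4, 0.8),
}

print(rank_of("oil", "fire", MODEL_A))  # 1: an obvious hint for this Alice
print(rank_of("oil", "fire", MODEL_B))  # 4: last place for this Bob
```

With a five-word vocabulary the gap is small in absolute terms. Scale the vocabulary up and the same disagreement is how a hint ends up outside a partner's top 1,000.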

This is a useful warning for multi-agent systems. The problem is not always that an agent lacks intelligence. Sometimes two agents have incompatible latent conventions. Put less politely: your orchestration layer may be hosting a very expensive misunderstanding.

The LLM results follow the same theme. In cooperative cross-play, baseline-baseline and LLM-LLM teams generally miscommunicate less than baseline-LLM teams. Among LLMs, the decoder matters heavily: weaker decoders produce more miscommunications. The authors also do not observe a clear self-play advantage for LLMs, even though, in principle, a model playing with another instance of itself could model its partner more easily.

That absence is important. It suggests that current LLMs are not reliably exploiting the fact that they are paired with a similar mind. They are not failing merely because their teammate is unfamiliar. They are failing because they do not yet turn that familiarity into robust communicative strategy.

Eve wins too often because Alice is bad at hiding, not because the game is unfair

In competitive play, Eve dominates most LLM games. The paper reports that win rate is often uninformative because current LLMs are much weaker at producing calibrated hints than at intercepting them. As a result, games skew toward Eve.

This could be misread as a flaw in Decrypto. Perhaps the game is simply too hard for Alice and Bob. The human results argue otherwise.

The authors collected 10 games from human encoder-decoder teams playing against a fixed LLM interceptor, Llama 3.1-70B-Instruct, and let each game continue past its formal end condition to collect more data. When novice humans played, they achieved a 33% win rate against even the strongest Eve agents in the replay experiments. Human play was not perfect, but it showed that the game is not structurally unwinnable. The LLMs are the fragile part.

The human-AI cross-play table makes the practical gap clearer. When models replaced Eve against human Alice-Bob teams, stronger models such as DeepSeek-R1 and Claude 3.7 produced more interceptions and lower human win rates. But when models replaced Bob while humans supplied the hints, all tested decoders fell short of human decoders. Claude 3.7 with extended thinking came closest: as decoder, it recorded 12.67 miscommunications, 11 intercepts, a 16.67% win rate, and 6.57 average turns. Human original games had 11 miscommunications, 12 intercepts, a 40% win rate, and 6.90 average turns.

The qualitative interpretation is more useful than the exact score. LLMs can be dangerous enough as interceptors and still unreliable as teammates. That split matters for organisations using agents in collaborative workflows. A model may be good at extracting patterns from someone else’s communication while still being poor at understanding what a human colleague intended by a hint.

That is not reassuring. It is, however, diagnostic.

The theory-of-mind probes separate shallow awareness from self-consistent perspective taking

The paper then uses Decrypto as a platform for interactive theory-of-mind experiments. This is one of its more interesting contributions. Instead of treating theory of mind as a static question-answering task, the authors embed it inside an ongoing game where agents have different information at different times.

They adapt two classic cognitive-science ideas.

First, they build variants of the Smarties task. In the original developmental psychology setup, a child sees a familiar container that unexpectedly contains something else and is then asked about earlier or others’ beliefs. In Decrypto, the analogue is Eve’s belief about the secret keywords before and after the reveal.

The benchmark asks three kinds of questions: what Eve predicts the keywords are before reveal; what Eve says she previously thought after seeing the true keywords; and what Eve thinks a second interceptor would believe before reveal. These support tests of representational change and false belief.

Second, they adapt the Three Mountains problem into a perspective-taking task. After Alice gives hints, Alice is asked to predict Eve’s guess. This tests whether Alice can reason from Eve’s information state rather than from her own privileged access to the secret keywords.

The results are not comforting. Most models do well on weak representational-change and false-belief tasks. That means they can often recognise, at a broad level, that someone without the answer should not know the answer. Fine. The bar is low, but at least it is visible.

The strong variants are much harder. They require the model to reproduce its earlier mistaken belief, or predict the mistaken belief of another agent, rather than merely avoid giving the true keywords. On these strong tasks, all evaluated models score at or below 10% accuracy. The paper uses temperature 0 for these theory-of-mind experiments, so this is not easily dismissed as random sampling noise.
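
The gap between the weak and strong criteria is easy to see once both are written as checks. The function below is one reading of the description in this article, not the paper's scoring code: the weak check passes if the recalled belief differs from the true keywords, and the strong check passes only if it matches what the model actually predicted before the reveal. The keyword lists are invented for illustration.

```python
def score_representational_change(pre_reveal_guess, recalled_guess, true_keywords):
    """Score one 'what did you believe before the reveal?' answer."""
    recalled = [word.lower() for word in recalled_guess]
    truth = [word.lower() for word in true_keywords]
    earlier = [word.lower() for word in pre_reveal_guess]
    return {
        # Weak: the model does not simply repeat the revealed answer.
        "weak": recalled != truth,
        # Strong: the model reproduces its own earlier mistaken belief.
        "strong": recalled == earlier,
    }


true_keywords = ["star", "jazz", "thunder", "plane"]
pre_reveal = ["moon", "music", "storm", "bird"]

# A plausible-looking but invented "memory" passes weak and fails strong.
print(score_representational_change(
    pre_reveal, ["sun", "music", "rain", "bird"], true_keywords
))  # {'weak': True, 'strong': False}
```

A model can pass the weak check by producing any wrong answer. Passing the strong check requires consistency with its own earlier output, which is a much narrower target.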

The perspective-taking results are even more revealing. Models often predict Eve’s guess incorrectly. More importantly, most models except Llama 3.1-70B predict that Eve will intercept on nearly every turn, while the actual interception rate is around 52%. The authors inspect outputs and find that several models predict interception even on the first turn, when Eve has no hint history and should do no better than a random guess. This persists even when the prompt explicitly emphasises that Eve does not know the secret keywords.
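
How low is "no better than a random guess"? Assuming a code is three distinct digits drawn from 1 to 4, as in the 2-3-4 example and the tabletop game (the article does not spell the rule out, so treat it as an assumption), the code space can simply be enumerated.

```python
from fractions import Fraction
from itertools import permutations

# Every ordered choice of three distinct digits from 1..4.
codes = list(permutations(range(1, 5), 3))

# With no hint history, Eve can do no better than pick one uniformly.
chance = Fraction(1, len(codes))

print(len(codes))               # 24
print(chance)                   # 1/24
print(round(float(chance), 3))  # 0.042
```

Under that assumption, a first-turn interception should happen roughly one time in 24. A model that predicts it as near-certain is not slightly miscalibrated; it is reasoning from information Eve does not have.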

That is not a small wording bug. It is a failure to condition reasoning on another agent’s information boundary.

The model “knows” the answer. It cannot reliably un-know it on Eve’s behalf. Anyone who has managed confidential information will recognise the pattern. Some people cannot resist explaining the thing they were supposed to abstract away. Apparently, neither can some frontier models.

Reasoning models do not automatically become social reasoners

The paper’s most useful correction is aimed at a lazy assumption: better reasoning model, better theory of mind. Decrypto does not support that assumption.

In the theory-of-mind tests, Llama 3.1-70B outperforms newer reasoning models across the evaluated tasks. Claude 3.7 with extended thinking and o1 high do not simply dominate because they are more recent or more “reasoning-oriented.” DeepSeek-R1-Distill-Qwen-32B performs strongly in some inter-LLM game settings but does worse in human-AI coordination than GPT-4o and Llama 3.1-70B.

This does not mean older models are generally better. It means “reasoning” is not a single transferable substance that automatically fills every cognitive gap. Mathematical chain-of-thought, code repair, web navigation, strategic communication, and false-belief reasoning are not the same capability with different costumes.

For business users, this is the piece to tape above the procurement dashboard. A model can be upgraded and still regress on the social behaviour your workflow depends on. Evaluating agents only on task completion, tool use, or single-turn reasoning will miss that regression.

The next generation of enterprise evaluation should therefore include interaction tests with role asymmetry:

Evaluation question	Decrypto analogue	Deployment analogue
Can the agent communicate with a teammate under partial shared context?	Alice-Bob coordination	Human-agent workflow handoff
Can it avoid leaking too much to an observer?	Eve interception risk	Negotiation, compliance, privacy, security
Can it model what another actor knows?	Perspective-taking prompt	Customer support, sales, risk review
Can it reproduce a prior mistaken belief after learning the truth?	Strong representational change	Auditability and post-hoc explanation
Can it coordinate with unfamiliar partners?	Cross-play	Multi-vendor and multi-agent orchestration

This is not a call to make every enterprise agent play board games before deployment. Charming though that would be. It is a call to stop treating single-agent benchmark gains as evidence of multi-agent reliability.

The appendix tests robustness, not a second thesis

Several paper details matter because they protect the main interpretation from cheap objections.

First, the authors vary prompts extensively. They handwrite five system prompt variants and five user prompt variants for encoder and decoder roles, producing 625 prompt setups per model. They find that prompt variation does not significantly affect final performance measured by average turn length for Llama 3.1-8B and Llama 3.1-70B. The likely purpose of this appendix result is robustness testing. It supports the claim that poor performance is not merely an artefact of one awkward prompt. It does not prove that no prompt engineering could ever improve performance. That would be far too convenient.
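
The 625 figure is consistent with a full grid over both roles: five system prompts times five user prompts gives 25 prompt pairs per role, and 25 encoder pairs times 25 decoder pairs gives 625 setups. That decomposition is an inference from the numbers quoted here, sketched below with placeholder variant indices rather than the paper's actual prompts.

```python
from itertools import product

SYSTEM_VARIANTS = range(5)  # placeholder indices for five system prompts
USER_VARIANTS = range(5)    # placeholder indices for five user prompts

# One (system, user) pair fully specifies the prompt for a single role.
role_prompts = list(product(SYSTEM_VARIANTS, USER_VARIANTS))

# A setup fixes one prompt pair for the encoder and one for the decoder.
setups = list(product(role_prompts, role_prompts))

print(len(role_prompts))  # 25
print(len(setups))        # 625
```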

Second, the generation setup is designed to avoid trivial truncation failures. The authors use generous token limits, maintain the game as a multi-turn environment, and re-prompt models if formatting extraction fails, up to 10 times. For theory-of-mind tests, they use temperature 0 where possible. This makes the results more interpretable: failures are less likely to be caused by the model being cut off mid-answer or by a formatting mishap.
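
The re-prompting guard is a familiar pattern in any pipeline that parses model output. Below is a minimal sketch, assuming a hypothetical `query_model` callable that takes a prompt and returns text, and an invented answer format of three distinct digits joined by hyphens. The cap of 10 is read here as 10 attempts in total; whether the paper counts the first attempt is not stated in this article.

```python
import re

# Invented answer format for this sketch: e.g. "2-3-4".
CODE_PATTERN = re.compile(r"\b([1-4])-([1-4])-([1-4])\b")
MAX_ATTEMPTS = 10


def extract_code(text):
    """Return the code as a tuple of ints, or None if no valid code is found."""
    match = CODE_PATTERN.search(text)
    if match is None:
        return None
    digits = tuple(int(d) for d in match.groups())
    # Reject repeated digits such as 2-2-4.
    return digits if len(set(digits)) == 3 else None


def guess_with_retries(query_model, prompt, max_attempts=MAX_ATTEMPTS):
    """Ask again until a parseable code comes back or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        reply = query_model(prompt)   # hypothetical model call
        code = extract_code(reply)
        if code is not None:
            return code, attempt
    return None, max_attempts
```

A stubbed model that answers "no idea", then "2-2-4", then "I think 2-3-4." would return the code (2, 3, 4) on the third attempt. The point of the guard is interpretive: a failure that survives it is a reasoning failure, not a parsing one.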

Third, the paper distinguishes generalist and specialist agents. A generalist model plays out of the box, with the prompt treated as part of the environment. A specialist agent may use task-specific strategies, prompt engineering, reinforcement learning, or hand-designed rules. This distinction matters because a specialist Decrypto player could overfit to the game while saying little about general foundation-model ability. Conversely, a generalist model’s weak score says more about zero-shot multi-agent competence.

Finally, the Rational Speech Act formalisation in the appendix is not decorative mathematics. Its purpose is to show why optimal play requires nested belief modelling. Bob must interpret Alice’s hints partly by modelling how Alice expects Eve to interpret them. Alice, meanwhile, loses utility when her model of Eve is inaccurate, even if Bob would understand perfectly. The formalism gives the intuitive point a spine: Decrypto is hard because communication is strategic, not because the vocabulary is exotic.
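
For readers who have not met the framework, the generic Rational Speech Act recursion is shown below in its textbook form, followed by one illustrative way to penalise leakage to an eavesdropper. The last line is an assumption added here to make the trade-off visible; it is not the paper's notation or its exact objective. The symbols are: $u$ an utterance (the hints), $m$ a meaning (the code), $C(u)$ an utterance cost, $\alpha$ a rationality parameter, $K$ the secret keywords, $H$ the public hint history, and $\lambda$ a hypothetical weight on secrecy.

```latex
\begin{align*}
% Literal listener: condition the prior on the literal truth of u
L_0(m \mid u) &\propto \delta_u(m)\, P(m) \\
% Pragmatic speaker: prefer utterances that lead L_0 to the intended m
S_1(u \mid m) &\propto \exp\!\big(\alpha\,[\log L_0(m \mid u) - C(u)]\big) \\
% Pragmatic listener: invert the speaker model
L_1(m \mid u) &\propto S_1(u \mid m)\, P(m) \\[6pt]
% Illustrative Decrypto-style utility for Alice (assumption, not the paper's)
U_{\text{Alice}}(u; m) &= \log P_{\text{Bob}}(m \mid u, K)
  \;-\; \lambda \,\log \hat{P}_{\text{Eve}}(m \mid u, H)
\end{align*}
```

Here $\delta_u(m)$ is 1 when $u$ is literally true of $m$ and 0 otherwise, and $\hat{P}_{\text{Eve}}$ is Alice's model of Eve rather than Eve herself. If that model is wrong, Alice optimises against the wrong eavesdropper, which is the failure the perspective-taking probe measures.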

What Cognaptus infers for deployment

The paper directly shows that Decrypto exposes weaknesses in current LLMs on cooperative play, competitive interception, human-AI coordination, and selected theory-of-mind tasks. It also shows that simple embedding baselines can outperform LLMs in some game settings, especially when agents share conventions.

Cognaptus infers a wider operational lesson: agent deployment should include social-reasoning tests that resemble the target workflow’s information structure. Not the surface topic. The information structure.

For example, a legal review agent does not need a Decrypto clone. It needs tests where some actors know privileged facts, some see partial summaries, and the agent must choose what to say without contaminating another party’s inference. A sales-support agent needs tests where the customer’s likely belief differs from the internal CRM record. A project-management agent needs tests where another tool or teammate has stale context. A risk-monitoring agent needs tests where revealing the full rationale could create a gaming vector.

The operational move is to define who knows what, who should infer what, and what counts as over-disclosure. Then test agents under those conditions.
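
One lightweight way to start is to encode each scenario's information structure as data and check drafts against it. The sketch below is deliberately crude: `Scenario` and `review_message` are hypothetical names, the example facts are invented, and substring matching stands in for whatever leak detection a real pipeline would need.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    # Facts the agent knows but the recipient must not learn.
    privileged_facts: set = field(default_factory=set)
    # Facts the recipient needs in order to act on the message.
    required_facts: set = field(default_factory=set)


def review_message(message, scenario):
    """Flag over-disclosure and under-disclosure in one drafted message."""
    text = message.lower()
    leaked = {f for f in scenario.privileged_facts if f.lower() in text}
    missing = {f for f in scenario.required_facts if f.lower() not in text}
    return {
        "leaked": leaked,
        "missing": missing,
        "passed": not leaked and not missing,
    }


supplier_call = Scenario(
    privileged_facts={"budget ceiling", "backup vendor"},
    required_facts={"delivery date"},
)
draft = "We need a firm delivery date; our budget ceiling is fixed."
print(review_message(draft, supplier_call))
# {'leaked': {'budget ceiling'}, 'missing': set(), 'passed': False}
```

The useful part is not the string matching. It is that someone had to write down what counts as privileged and what counts as required before the agent was tested.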

A useful evaluation design might include:

Layer	What to test	Example failure
Shared-context calibration	Does the agent know what the user already knows?	Repeats irrelevant context or skips necessary setup
Asymmetric-information control	Does it avoid leaking privileged facts?	Gives a partner too much internal rationale
Counterparty modelling	Does it predict how another actor will interpret a message?	Assumes the supplier, customer, or regulator has internal knowledge
Human cross-play	Can it coordinate with actual human phrasing?	Misreads shorthand, metaphors, domain conventions
Regression tracking	Does a model upgrade preserve social behaviour?	New model improves task score but worsens coordination

That last layer is especially important. Decrypto’s finding that newer reasoning models can underperform older models on theory-of-mind tasks should make buyers cautious about blanket upgrades. A model change is not just a capability improvement. It is a behavioural migration.
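
Regression tracking can be as simple as comparing per-metric scores before and after an upgrade and refusing to average them away. The sketch below assumes higher scores are better; the metric names and numbers are made up for illustration and are not results from the paper.

```python
def find_regressions(old_scores, new_scores, tolerance=0.0):
    """Return metrics where the new model scores lower than the old one."""
    return {
        metric: (old_scores[metric], new_scores[metric])
        for metric in old_scores
        if metric in new_scores
        and new_scores[metric] < old_scores[metric] - tolerance
    }


# Invented scores for two hypothetical model versions.
old_model = {"task_completion": 0.71, "handoff_success": 0.64, "leak_avoidance": 0.80}
new_model = {"task_completion": 0.83, "handoff_success": 0.52, "leak_avoidance": 0.80}

print(find_regressions(old_model, new_model))
# {'handoff_success': (0.64, 0.52)}
```

In this invented example the headline metric improves while the coordination metric drops, which is exactly the pattern an aggregate score would hide.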

Where Decrypto should not be overread

The benchmark is strong because it is narrow. That is also its limit.

Decrypto evaluates word-based strategic communication under controlled asymmetry. It does not test whether a model understands emotions, motives, trust, deception in rich dialogue, organisational incentives, or social norms across cultures. It also depends on language associations, which means culture and world knowledge can strongly shape performance. A hint that is clever for one community is nonsense for another. Anyone who has watched a multinational team decode a local idiom already knows this; Decrypto just makes it measurable.

The human study is also small: 10 full games. That is enough to demonstrate a meaningful gap, not enough to map human performance across expertise levels, cultures, domains, or human interceptor roles. The computational cost of full model combinations is another practical constraint. Because Decrypto has three roles, exhaustive evaluation scales poorly as the number of models increases.
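
The scaling problem is plain arithmetic. If each of the three roles can be filled independently by any model, and the same model may take more than one role (both assumptions of this sketch), the number of distinct line-ups grows with the cube of the model count.

```python
def count_matchups(n_models, n_roles=3):
    """Number of ways to assign models to roles, with repeats allowed."""
    return n_models ** n_roles


for n in (2, 5, 10):
    print(n, count_matchups(n))
# 2 8
# 5 125
# 10 1000
```

Each line-up then needs many games to produce a stable estimate, so the practical cost grows faster than the table suggests.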

None of these limitations weaken the core lesson. They define where the lesson applies. Decrypto is not a universal social-intelligence meter. It is a sharp probe for a particular class of failure: models that can process language but cannot reliably manage who knows what.

For agentic AI, that class is not peripheral. It is the job.

The real benchmark is whether the agent knows who it is talking to

Decrypto’s contribution is not merely a new scorecard. It changes the question.

The familiar question is: can the model reason? Decrypto asks: can the model reason through another agent’s perspective while choosing what to communicate?

That is a much more business-relevant question for the next phase of AI deployment. Agents will not live in single-player exams. They will negotiate, coordinate, summarise, escalate, conceal, reveal, hand off, and recover from misunderstanding. Many failures will not look like hallucinations. They will look like bad calibration: too much information here, too little there, the wrong assumption about what the other side knows.

The uncomfortable result from Decrypto is that current models still stumble on this calibration, even when the world is reduced to a few keywords and hints. The useful result is that the stumble is observable.

That gives operators a better standard. Do not ask only whether an agent can answer correctly. Ask whether it knows what answer the other party is in a position to understand.

That is where the mind games begin.

Cognaptus: Automate the Present, Incubate the Future.

¹ Andrei Lupu, Timon Willi, and Jakob Foerster, “The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind,” arXiv:2506.20664, 2025, https://arxiv.org/abs/2506.20664.
