As large language models (LLMs) evolve from mere tools into interactive agents, they are increasingly expected to operate in multi-agent environments—collaborating, competing, and communicating not just with humans but with each other. But can they understand the beliefs, intentions, and misunderstandings of others? Welcome to the world of Theory of Mind (ToM)—and the cleverest AI benchmark you haven’t heard of: Decrypto.

Cracking the Code: What is Decrypto?

Inspired by the award-winning board game of the same name, Decrypto is a three-player game of secret codes and subtle hints, reimagined as a benchmark to test LLMs’ ability to coordinate and deceive. Each game features three roles (see the sketch after this list):

  • Alice, the Encoder, who sees a secret code (like “2-3-4”) and must craft three word-based hints, one pointing at the keyword in each code position.
  • Bob, the Decoder, who shares the four secret keywords with Alice and tries to deduce the code from her hints.
  • Eve, the Interceptor, who sees only the public hints and tries to intercept the code without access to the keywords.
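
A minimal sketch of a single Decrypto turn helps fix the information structure. The role functions below are hypothetical stubs standing in for model calls; none of these names come from the benchmark’s codebase:

```python
import random

KEYWORDS = ["anchor", "piano", "desert", "comet"]  # known to Alice and Bob, hidden from Eve

def play_turn(encoder, decoder, interceptor, history):
    # The secret code: 3 distinct digits indexing the 4 keyword positions.
    code = tuple(random.sample(range(1, 5), 3))
    # Alice sees keywords AND code; she emits one natural-language hint per digit.
    hints = encoder(KEYWORDS, code, history)
    # Bob sees the keywords and the public hints, but not the code.
    bob_guess = decoder(KEYWORDS, hints, history)
    # Eve sees only the public hints plus the history of earlier turns.
    eve_guess = interceptor(hints, history)
    history.append({"hints": hints, "code": code})
    return {
        "miscommunication": bob_guess != code,  # Bob failed to decode
        "interception": eve_guess == code,      # Eve cracked the code
    }
```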

The tension? Hints must be clear enough for Bob but vague enough to confuse Eve. It’s a communication minefield—exactly the kind of nuanced reasoning where humans excel and LLMs flounder.

Beyond Sally-Anne: Rethinking ToM Evaluation

Classic ToM benchmarks like the Sally-Anne task are static, limited, and easily overfit. Decrypto is different:

  • Interactive: The benchmark is multi-turn, with a growing history of public hints that changes the game dynamics each round.
  • Open-ended: LLMs generate their own hints in natural language, rather than choosing from constrained options.
  • Scalable: Over 8.8 billion keyword combinations and an effectively unbounded space of natural-language hints make it robust against memorization (see the quick sanity check below).
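
One way to sanity-check that scale figure: if a keyword set is 4 ordered words drawn without replacement from a wordlist of roughly 308 candidates (an illustrative assumption on our part; the benchmark’s actual wordlist may differ), the count lands just above 8.8 billion:

```python
from math import perm

# Hypothetical wordlist size, chosen only to illustrate the order of magnitude.
WORDLIST_SIZE = 308
print(perm(WORDLIST_SIZE, 4))  # 8824911480, i.e. ~8.8 billion ordered 4-keyword sets
```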

Decrypto challenges agents not just to associate words but to reason about what others know and don’t know, forcing them to engage in Bayesian pragmatic inference.

Multi-Agent Reasoning in Action

Decrypto supports several experimental setups:

  • Competition: Alice and Bob are controlled by one model and Eve by another. Win rate, miscommunication rate, and number of intercepted turns measure each side’s effectiveness.
  • Ad-hoc Coordination: Alice and Bob are controlled by different models (or a human and a model), testing how well agents collaborate without prior alignment.
  • Human-AI Cross-play: A human replaces Alice or Bob to test coordination quality. Results show that all current LLMs struggle to coordinate as smoothly as human teammates do.

The authors also propose average game length (the number of turns a team survives) as a more stable performance metric than win rate, since many models are systematically weaker as encoders than as interceptors.
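
A sketch of how these metrics might be aggregated from game logs (the record format below is our own illustration, not the benchmark’s actual log schema):

```python
from statistics import mean

def summarize(games):
    # `games` is a list of dicts; the schema here is illustrative only:
    # {"team_won": bool, "turns": [{"miscommunication": bool, "interception": bool}, ...]}
    turns = [t for g in games for t in g["turns"]]
    return {
        "win_rate": mean(g["team_won"] for g in games),
        "miscommunication_rate": mean(t["miscommunication"] for t in turns),
        "interception_rate": mean(t["interception"] for t in turns),
        # Average game length in turns: a denser, more stable signal
        # than the binary win/loss outcome.
        "avg_game_length": mean(len(g["turns"]) for g in games),
    }
```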

Theory of Mind Tasks, Reimagined

Decrypto provides a platform to port and adapt classic cognitive psychology experiments. Two tasks stand out:

  1. Smarties Task (Representational Change & False Belief):
    • Eve is prompted to guess the keywords both before and after they are revealed to her.
    • Accuracy is measured on both weak and strong criteria (e.g., consistency with her prior beliefs).
  2. Three Mountain Task (Perspective Taking):
    • Alice must predict what Eve will guess (a scoring sketch follows this list).
    • The gap between Eve’s actual guess and Alice’s prediction of it reveals failures of ToM.
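
Here is one way the perspective-taking gap could be scored. Both callables are hypothetical stubs wrapping LLM prompts; the names and signatures are ours, not the paper’s:

```python
def perspective_taking_gap(turns, predict_eve_guess, eve_guess):
    # Fraction of turns where Alice's prediction of Eve's guess
    # diverges from Eve's actual guess: higher means worse ToM.
    mismatches = 0
    for turn in turns:
        predicted = predict_eve_guess(turn["hints"], turn["history"])  # Alice models Eve
        actual = eve_guess(turn["hints"], turn["history"])             # Eve's real guess
        mismatches += predicted != actual
    return mismatches / len(turns)
```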

Even the best models, like Claude 3.7 and GPT-4o, fail the strong ToM criteria and often assume Eve has privileged information she does not actually have.

Surprising Findings: Bigger Isn’t Always Better

The benchmark reveals several unintuitive results:

  • LLaMA 3.1-70B outperforms newer models like DeepSeek-R1 and Claude 3.7 on ToM tasks.
  • Word embedding baselines (e.g., GloVe, Word2Vec) outperform LLMs in coordinated self-play and even match human performance in some cases (a minimal version is sketched after this list).
  • Cross-play performance between different models (or human and model) is significantly lower than self-play, exposing brittle internal representations.
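
For intuition about why embeddings do well at self-play, here is a minimal embedding-baseline encoder in that spirit, assuming gensim’s pretrained GloVe vectors (the paper’s exact baseline implementation may differ):

```python
import gensim.downloader

# ~130 MB download on first use; returns a KeyedVectors instance.
glove = gensim.downloader.load("glove-wiki-gigaword-100")

def embedding_hint(keyword, other_keywords, topn=50):
    # Pick a hint close to the target keyword but far from the other
    # keywords, so Bob (who knows all four) can map it to the right slot.
    # Assumes lowercase words present in the GloVe vocabulary.
    candidates = [w for w, _ in glove.most_similar(positive=[keyword], topn=topn)]
    def margin(word):
        closeness = glove.similarity(word, keyword)
        confusion = max(glove.similarity(word, k) for k in other_keywords)
        return closeness - confusion
    return max(candidates, key=margin)

# Example: hint for "piano" given the other shared keywords.
print(embedding_hint("piano", ["anchor", "desert", "comet"]))
```

Note the deterministic, self-consistent policy: using the same vectors on both ends of the channel is exactly what makes self-play easy and cross-play brittle.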

These results challenge the belief that scaling alone yields better reasoning and mind modeling.

Why Decrypto Works: A Formal View

The authors formalize Decrypto using the Rational Speech Act (RSA) model (the standard recursion is sketched after this list):

  • Alice optimizes her hint utility to communicate successfully with Bob while avoiding interception by Eve.
  • Bob must perform second-order ToM: model Alice, who models Eve.
  • Pragmatic failures arise when models anchor on their own private knowledge, ignore public information, or misestimate Eve’s strategy.
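
The standard RSA recursion makes this concrete. Notation is adapted to Decrypto, where m is a code and u a hint tuple; the λ-weighted Eve penalty in the speaker utility is our illustrative extension, not necessarily the paper’s exact objective:

```latex
% Literal listener: literal semantics of hint u, weighted by the prior over codes.
L_0(m \mid u) \;\propto\; [\![u]\!](m)\, P(m)

% Pragmatic speaker (Alice), rationality \alpha: choose hints Bob can decode
% while penalizing hints Eve could also decode (illustrative \lambda term).
S_1(u \mid m) \;\propto\; \exp\!\Big(\alpha \big[\log L_0^{\mathrm{Bob}}(m \mid u) \;-\; \lambda \log L_0^{\mathrm{Eve}}(m \mid u)\big]\Big)

% Pragmatic listener (Bob): Bayesian inversion of Alice's speaker model,
% i.e., second-order ToM (Bob models Alice, who models Eve).
L_1(m \mid u) \;\propto\; S_1(u \mid m)\, P(m)
```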

This makes Decrypto not just a benchmark, but a language game grounded in probabilistic reasoning theory, ideal for studying emergent communication.

Implications and Future Directions

Decrypto is both a challenge and a platform:

  • Future-Proof Benchmark: With its infinite replayability and cultural flexibility, it resists saturation.
  • RL Fine-Tuning Ground: Its short episodes and dense feedback make it ideal for multi-agent reinforcement learning with LLMs.
  • Social AI Evaluation Tool: Researchers can test persona-based behavior, cultural inference, and model consistency in real-world settings.

Ultimately, Decrypto doesn’t just expose current models’ limitations—it provides a roadmap for improving their social reasoning and interactive competence.


Cognaptus: Automate the Present, Incubate the Future