A negotiation bot walks into a pricing dispute.
That is not the start of a joke. It is the start of a procurement problem, a marketplace design problem, a customer-service escalation problem, and, sooner than executives would like to admit, a governance problem. Once AI systems begin making choices on behalf of organisations, their behaviour in social settings matters. Not just whether they answer correctly. Not just whether they sound polite. Whether they cooperate, defect, compromise, optimise, over-trust, or quietly behave like a very caffeinated economist.
A new paper by Andrea Cera Palatsi, Samuel Martin-Gutierrez, Ana S. Cardenal, and Max Pellert takes that question into the clean little laboratory of game theory: simple two-player games, controlled payoffs, and a direct comparison between human behaviour, Nash equilibrium, and three open-source language models (Llama, Mistral, and Qwen).[1]
The headline is tempting: LLMs can replicate human cooperation. Nice. Convenient. Slightly too neat.
The better reading is more interesting. The models do not become generically “human-like”. Llama most closely reproduces aggregate human cooperation patterns. Qwen behaves more like Nash equilibrium. Mistral follows a distinct payoff-sensitive pattern that looks neither fully human nor fully rational-choice. In other words, the paper is not a story about artificial humans. It is a story about behavioural model selection.
That distinction matters because businesses will not deploy “AI behaviour” in the abstract. They will deploy a specific model, prompted in a specific way, inside a specific incentive structure, with a specific failure mode waiting patiently in the corner.
The experiment asks a simple question with expensive implications
The paper builds a digital twin of an earlier human experiment in dyadic game theory. In the original study, more than 500 human participants played one-shot, simultaneous two-player games. Each player chose between two actions, conventionally understood as cooperation or defection, although the human experiment used neutral labels to avoid loading the choice with moral language.
The payoff matrix is the classic four-outcome setup:
| Situation | Player’s outcome |
|---|---|
| Both cooperate | reward for mutual cooperation |
| Player cooperates, other defects | sucker’s payoff |
| Player defects, other cooperates | temptation payoff |
| Both defect | punishment for mutual defection |
The original grid fixed mutual cooperation at 10 points and mutual defection at 5 points. The two moving parts were the temptation payoff and the sucker’s payoff. Those two values determine whether the game behaves like a Harmony Game, Snowdrift Game, Stag Hunt, or Prisoner’s Dilemma.
That gives the study a useful discipline. The authors are not asking a model to role-play a citizen, a buyer, a patient, or a CEO. They are asking it to make tightly specified choices under known payoffs. This strips away much of the usual fog around “AI behaviour”, though not all of it. Fog is resilient. Ask any strategy consultant.
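To make the four game types concrete, here is a minimal sketch of how the temptation and sucker's payoffs sort a game into one of them, using the fixed values of 10 and 5 quoted above. The function name and the handling of exact ties (a temptation equal to 10, or a sucker's payoff equal to 5) are assumptions for illustration, not something the paper specifies.

```python
def classify_game(temptation, sucker, reward=10, punishment=5):
    """Name the symmetric 2x2 game implied by the two free payoffs.

    Ties (temptation == reward or sucker == punishment) fall through to the
    more cooperative label. That tie rule is an assumption of this sketch.
    """
    tempted = temptation > reward   # defecting on a cooperator beats mutual cooperation
    exposed = sucker < punishment   # being exploited is worse than mutual defection
    if tempted and exposed:
        return "Prisoner's Dilemma"
    if tempted:
        return "Snowdrift Game"
    if exposed:
        return "Stag Hunt"
    return "Harmony Game"


for t, s in [(12, 2), (12, 8), (8, 2), (8, 8)]:
    print(t, s, classify_game(t, s))
```

The two booleans carry the whole intuition: "tempted" is the greed motive, "exposed" is the fear motive. A Prisoner's Dilemma has both, a Harmony Game has neither.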
The important business analogue is this: many organisational interactions can be simplified into cooperation problems. Suppliers decide whether to share information. Platforms decide how much to nudge users. Customers decide whether to trust an automated agent. Departments decide whether to optimise locally or coordinate globally. The real world is messier than a payoff matrix, but the matrix exposes something useful: how a decision system responds when individual incentives and collective outcomes diverge.
Humans were not Nash machines, which is awkward for Nash machines
The original human results already contained the key behavioural lesson. Human participants did not simply follow Nash equilibrium. The earlier study classified human strategies into a reduced set of behavioural phenotypes: optimists, pessimists, envious players, trustful players, and undefined players. These patterns deviated from classical rational-choice predictions.
This matters because “rational” is often used lazily in business AI discussions. A model that maximises expected payoff may sound superior until it is placed in a human-facing system where trust, fairness, reciprocity, and perceived legitimacy change the next interaction. Humans are not bad calculators who failed to become economists. They often use social heuristics because social environments punish narrow optimisation.
The paper’s comparison with Nash equilibrium therefore gives the study its bite. The question is not whether LLMs can solve a toy game. The question is whether they reproduce the empirical shape of human deviations from formal rationality.
That is where the three-model comparison becomes more useful than a single-model success story.
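For readers who want the Nash benchmark in concrete form, the sketch below lists the symmetric Nash equilibria of one of these games as cooperation probabilities. It rests on the standard indifference condition: a mixed equilibrium exists when cooperating and defecting earn the same expected payoff against an opponent who cooperates with probability q. The function name is hypothetical, and this article does not say how the paper turns several equilibria into a single predicted cooperation rate, so the sketch simply returns all of them.

```python
def symmetric_equilibria(temptation, sucker, reward=10, punishment=5):
    """Return cooperation probabilities q of the symmetric Nash equilibria."""
    equilibria = []
    # Everyone defects: stable if cooperating against a defector does not pay more.
    if punishment >= sucker:
        equilibria.append(0.0)
    # Mixed strategy: q solves q*R + (1-q)*S == q*T + (1-q)*P.
    gain_vs_defector = sucker - punishment      # edge of cooperating against a defector
    gain_vs_cooperator = temptation - reward    # edge of defecting against a cooperator
    if gain_vs_defector * gain_vs_cooperator > 0:
        equilibria.append(gain_vs_defector / (gain_vs_defector + gain_vs_cooperator))
    # Everyone cooperates: stable if defecting against a cooperator does not pay more.
    if reward >= temptation:
        equilibria.append(1.0)
    return equilibria


print(symmetric_equilibria(12, 2))   # Prisoner's Dilemma region
print(symmetric_equilibria(12, 8))   # Snowdrift region
print(symmetric_equilibria(8, 2))    # Stag Hunt region
print(symmetric_equilibria(8, 8))    # Harmony region
```

The Stag Hunt case returns three values, which is a useful reminder that "the Nash prediction" is not always a single number.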
Llama behaves most like the crowd, not like the textbook
The clearest result is that Llama is closest to the empirical human cooperation matrix. After the authors apply their full prompting and extraction pipeline, Llama reaches a mean squared displacement of 0.031 against the human data and a Pearson correlation of 0.89. Lower mean squared displacement is better; higher correlation is better. On both measures, Llama is the closest model to human aggregate behaviour.
For context, Nash equilibrium itself performs worse against human behaviour: mean squared displacement of 0.096 and correlation of 0.78. That is the paper’s most commercially useful result. A calibrated LLM can, in this narrow setting, fit observed human cooperation patterns better than the classical rational benchmark.
But the detail is less flattering to any simplistic “LLMs are cooperative” narrative. Human average cooperation in the original matrix was 0.480. Llama’s final verified cooperation rate in the original region was 0.402. Llama does not simply cooperate more; on average it cooperates somewhat less than the human participants did. What it does is cooperate in a pattern that resembles humans across payoff regions.
That difference is not cosmetic. Average cooperation answers “how much”. Pattern similarity answers “where and when”. For business design, the second question is usually more important.
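The gap between "how much" and "where and when" is easy to show with toy numbers. The sketch below assumes that mean squared displacement means the average squared cell-by-cell difference between two cooperation matrices, which is a reading of the term rather than a definition taken from the paper. The matrices are invented for illustration and are not data from the study.

```python
from math import sqrt


def flatten(matrix):
    return [cell for row in matrix for cell in row]


def mean_squared_displacement(a, b):
    """Average squared cell-wise gap between two matrices (assumed definition)."""
    xs, ys = flatten(a), flatten(b)
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def pearson(a, b):
    """Pearson correlation between the cells of two matrices."""
    xs, ys = flatten(a), flatten(b)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean(matrix):
    cells = flatten(matrix)
    return sum(cells) / len(cells)


# Toy 2x2 cooperation rates (made up, not from the paper).
human = [[0.8, 0.6], [0.4, 0.2]]
close_pattern = [[0.7, 0.6], [0.3, 0.1]]   # lower average, similar shape
same_average = [[0.2, 0.4], [0.6, 0.8]]    # identical average, inverted shape

for name, model in [("close_pattern", close_pattern), ("same_average", same_average)]:
    print(name, mean(model), mean_squared_displacement(human, model), pearson(human, model))
```

The second toy model matches the human average exactly and still gets the pattern backwards. An average alone would have called it a perfect fit.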
A customer-retention agent that cooperates too much in every condition becomes a discount machine with a chat interface. A procurement agent that defects too much becomes a relationship shredder. A useful behavioural simulator should help identify which contexts trigger trust, caution, opportunism, or coordination failure. Llama’s value in this paper is not that it is nicer. It is that its strategic texture is closer to human aggregate texture.
How poetic. The machine does not need a soul. It needs a covariance structure.
Qwen is the economist in the room
Qwen tells a different story. It has a mean squared displacement of 0.065 and a correlation of 0.79 against human data, so it is not wildly detached from human behaviour. But its strongest alignment is with Nash equilibrium: mean squared displacement of 0.036 and correlation of 0.93.
That makes Qwen the model that most resembles the formal rational benchmark.
This is valuable, but not for the same purpose. If the business question is “what would a strategically optimising agent do under this payoff structure?”, Qwen-like behaviour is useful. It can serve as a rational baseline, a stress test, or a way to identify where human users may diverge from incentive-theoretic expectations.
If the question is “what will real people probably do?”, Qwen becomes more dangerous. Not because it is wrong in a mathematical sense, but because it may be right in the wrong ontology. Humans systematically deviate from Nash equilibrium in many social dilemmas. A model that converges too cleanly toward game-theoretic rationality can make human behaviour look like noise. Unfortunately, humans tend to object when treated as noise. Often through churn.
The authors deliberately avoided explicit game-theory terminology such as “cooperate” and “defect” in the prompts. They used neutral labels and simplified instructions to reduce the chance that models merely retrieved memorised game-theory patterns. Qwen still gravitated toward Nash-like behaviour. The paper does not prove why. It may reflect training exposure, stronger logical reasoning under the prompt, or model-specific inductive tendencies. The practical takeaway is simpler: Qwen behaves like a rational-strategy comparator more than a human-behavioural proxy in this setup.
That is not a defect. It is a role.
Mistral is the awkward middle case, which is often where deployment lives
Mistral is less tidy. It has a mean squared displacement of 0.091 and correlation of 0.70 against human data; against Nash, its mean squared displacement is 0.182 and correlation is 0.60. In the authors’ interpretation, Mistral shows a vertical separation in the cooperation matrix: it becomes more cooperative in regions where certain payoff conditions make high rewards attractive, loosely resembling the human “optimist” phenotype.
That makes Mistral useful as a warning. Not every model will fall neatly into “human-like” or “rational”. Some will exhibit structured behaviour that is real, repeatable, and operationally consequential, while still not mapping cleanly onto the intended behavioural benchmark.
This is exactly the sort of model behaviour that organisations tend to discover after launch, when an agent starts making strangely consistent decisions that were never specified in the product requirements. The system is not broken. It is just following a behavioural geometry nobody bothered to inspect.
The paper’s model comparison can be compressed this way:
| Model or benchmark | Closest behavioural role in the paper | Evidence | Business reading | Boundary |
|---|---|---|---|---|
| Llama | Aggregate human-pattern proxy | Best match to human matrix: MSD 0.031, $r = 0.89$ | Useful for pre-testing cooperation scenarios where human-like aggregate response matters | Not an individual-level human simulator |
| Qwen | Rational-choice comparator | Best match to Nash: MSD 0.036, $r = 0.93$ | Useful as a strategic baseline or optimiser | May underpredict human social deviation |
| Mistral | Payoff-sensitive behavioural variant | Weaker match to both: MSD 0.091, $r = 0.70$ against humans; MSD 0.182, $r = 0.60$ against Nash | Useful as a reminder that models have distinct behavioural profiles | Harder to interpret as either human proxy or formal benchmark |
| Nash equilibrium | Classical rational benchmark | Worse than Llama against humans: MSD 0.096, $r = 0.78$ | Still useful as a theoretical reference | Often misses empirical cooperation patterns |
The point is not to crown a winner. The point is to stop pretending that model choice is only about benchmark scores, latency, and price per million tokens. In agentic systems, model choice is also behavioural governance.
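One way to make "behavioural governance" operational is to rank candidates against an explicit reference class instead of a general leaderboard. The sketch below does that with the mean squared displacement figures quoted in this article. The dictionary layout and function name are illustrative assumptions, and the Llama-versus-Nash figure is omitted because it is not quoted here.

```python
# Mean squared displacement figures as quoted in this article (lower is closer).
# Llama against Nash is not quoted, so it is left out rather than guessed.
REPORTED_MSD = {
    "human": {"Llama": 0.031, "Qwen": 0.065, "Mistral": 0.091, "Nash equilibrium": 0.096},
    "nash": {"Qwen": 0.036, "Mistral": 0.182},
}


def rank_by_fit(reference):
    """Order candidates from closest to furthest for the chosen reference class."""
    scores = REPORTED_MSD[reference]
    return sorted(scores, key=scores.get)


print(rank_by_fit("human"))
print(rank_by_fit("nash"))
```

The ranking changes with the reference class, which is the whole argument in two function calls.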
The extraction pipeline is not plumbing; it is part of the result
One of the paper’s most important sections looks methodological, which means it is exactly where many readers will skim. Bad idea.
The authors did not simply ask each model to choose A or B and record the answer. They tested four progressively more complex answer-extraction strategies.
First, simple extraction: ask for the choice directly. For Llama, this produced almost random-looking cooperation patterns.
Second, double extraction: ask the tested model for a longer answer, then use Qwen to extract the final A/B choice. This made the pattern clearer.
Third, multi-step prompting: guide the model through grouped payoff comparisons before asking for a choice. This further reduced inconsistency.
Fourth, logical verification: use Qwen as a verifier to filter responses with arithmetic errors, incorrect outcome descriptions, or reasoning-choice inconsistencies before extracting the final answer.
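The second and fourth stages above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `player`, `verifier`, and `extractor` are hypothetical callables standing in for model calls, `max_attempts` is an invented parameter, and the stubs at the bottom are canned strings, not real model output. The multi-step prompting stage is not shown.

```python
from typing import Callable, Optional


def elicit_choice(
    game_prompt: str,
    player: Callable[[str], str],      # hypothetical: tested model, returns long-form reasoning
    verifier: Callable[[str], bool],   # hypothetical: True when the reasoning passes the checks
    extractor: Callable[[str], str],   # hypothetical: maps reasoning to "A" or "B"
    max_attempts: int = 5,             # assumption: a cap on replays
) -> Optional[str]:
    """Replay a game until a verified, cleanly extracted choice is available."""
    for _ in range(max_attempts):
        reasoning = player(game_prompt)
        if not verifier(reasoning):
            continue                   # discard flawed reasoning and replay the game
        choice = extractor(reasoning).strip().upper()
        if choice in {"A", "B"}:
            return choice
    return None                        # no valid answer: record it, do not invent one


# Canned stand-ins for model calls, used only to exercise the control flow.
responses = iter(["10 + 5 = 20, so I pick A", "Option B gives me more. I choose B"])
player = lambda prompt: next(responses)
verifier = lambda text: "10 + 5 = 20" not in text   # toy arithmetic check
extractor = lambda text: text.rstrip(".")[-1]       # toy rule: last character

result = elicit_choice("Choose A or B.", player, verifier, extractor)
print(result)
```

Returning `None` rather than a default choice is a deliberate design decision in this sketch: invalid responses stay visible as invalid, so they can be counted and reported.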
The figure showing this progression is not decorative. It is evidence that behavioural simulation depends heavily on elicitation design. The “same model” can look noisy, structured, human-like, or logically broken depending on how its decision is elicited and validated.
For business users, this is the part that should sting a little. If an organisation says it has tested an AI agent’s behaviour, the immediate follow-up should be: tested under what extraction regime? With what validation layer? Against what failure cases? With what treatment of invalid responses?
The authors manually validated extractor performance on 100 sampled long-form answers. Qwen and Llama performed similarly as extractors, at 0.97 and 0.96 accuracy respectively, while Mistral lagged at 0.83. Qwen was then selected as the extractor and verifier. This is an implementation detail, but not a trivial one. The measurement system is itself an LLM component. When the judge is also a model, the judge’s behaviour becomes part of the experiment.
A neat little governance headache, wrapped in a table.
The paper’s evidence has three layers, and they should not be blended together
The study contains several kinds of evidence, each doing a different job. Mixing them together would make the result sound broader than it is.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Replication of original 121-game grid | Main evidence | Llama can approximate aggregate human cooperation patterns better than Nash in this setup | That Llama predicts individual human choices |
| Cross-model comparison | Main evidence and model-selection evidence | Llama, Qwen, and Mistral have distinct behavioural profiles | That one model is universally superior |
| Progressive extraction methods | Sensitivity and implementation evidence | Behavioural patterns depend on prompting, reasoning space, and verification | That raw model outputs are reliable behavioural data |
| Extended 441-game grid | Exploratory extension and hypothesis generation | Llama can generate predictions outside the human-tested parameter space | That those predictions are correct for real humans |
| Appendix comparison for Qwen and Mistral extraction | Robustness and sensitivity check | Other models also become more structured under better extraction, but at different stages | That extraction effects are identical across models |
| Problematic-games appendix | Quality boundary | Some high-parameter regions bypassed verification, but bypass rates stayed at or below 0.25 | That all simulated regions are equally reliable |
This is the right way to read the paper: the original-grid comparison is the validated replication; the extended grid is a hypothesis generator; the appendix tests reveal where the machinery is stable and where it creaks.
That last word matters. The authors report that some game configurations repeatedly produced logically flawed responses. To avoid infinite replay loops, they allowed the verifier to deactivate for problematic games when invalid responses stopped decreasing across iterations. Those games then proceeded directly to extraction. The bypass rate did not exceed 0.25 for any game, meaning most responses still passed verification even in difficult regions. Still, this is a real boundary: less-filtered games may contain more reasoning error.
This does not undermine the paper. It makes the paper more useful. Any serious AI simulation workflow needs to show where quality control was relaxed, not pretend the pipeline descended from heaven with perfect labels.
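The bypass rule described above can be written down in a few lines, which also makes it auditable. The stopping criterion below (switch the verifier off the first time the count of invalid responses fails to drop from one round to the next) is one plausible reading of "stopped decreasing", not the authors' exact rule. Function names and the example counts are invented for illustration.

```python
def verifier_deactivation_round(invalid_counts):
    """Index of the first round whose invalid count failed to drop, or None.

    Assumed criterion: a round with remaining invalid responses that is not
    strictly lower than the previous round triggers the bypass.
    """
    for i in range(1, len(invalid_counts)):
        if invalid_counts[i] > 0 and invalid_counts[i] >= invalid_counts[i - 1]:
            return i
    return None


def bypass_rate(n_bypassed, n_total):
    """Share of a game's responses that skipped verification."""
    return n_bypassed / n_total


stubborn_game = [40, 22, 15, 15]   # made-up counts: progress stalls in the last round
healthy_game = [40, 10, 0]         # made-up counts: replays clear every invalid response

print(verifier_deactivation_round(stubborn_game))
print(verifier_deactivation_round(healthy_game))
print(bypass_rate(15, 100))
```

Logging the deactivation round and the bypass rate per game is what turns "quality control was relaxed somewhere" into a map of where.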
The extended grid is a hypothesis machine, not a prophecy machine
After validating Llama on the original human-tested region, the authors extend the parameter grid. The original experiment varied the temptation and sucker’s payoffs across 121 games. The extension expands the space to 441 games, adding 320 parameter combinations not previously tested with humans.
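The arithmetic is worth seeing as sets, because the untested region is simply the extended grid minus the original one. The payoff ranges below are placeholders chosen only so that the grid sizes come out at 121 and 441. This article does not state the actual ranges, so treat them as assumptions.

```python
from itertools import product


def payoff_grid(temptations, suckers):
    """All (temptation, sucker) pairs for the given payoff ranges."""
    return set(product(temptations, suckers))


# Placeholder ranges (assumptions): 11 x 11 values, then 21 x 21 values.
original = payoff_grid(range(5, 16), range(0, 11))
extended = payoff_grid(range(0, 21), range(-5, 16))
untested = extended - original   # games the simulation covers but humans have not played

print(len(original), len(extended), len(untested))   # 121 441 320
```

Keeping the untested cells as an explicit set makes it easy to pick which of them to send to a human experiment first.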
This is where the “digital twin” claim becomes more ambitious. Llama is not only used to replicate known human data. It is used to generate predictions for untested games. The authors also compare those predictions with Nash equilibrium across the extended space and preregister a future human experiment to test which framework better anticipates real behaviour.
This is the proper scientific loop:
1. Validate the simulation against existing human data.
2. Use the simulation to generate predictions in untested regions.
3. Run targeted human experiments to check those predictions.
4. Keep or discard the simulator based on empirical performance.
For business, the analogue is obvious. Use LLM simulations to pre-screen candidate policy designs, negotiation rules, incentive structures, customer-offer mechanics, or multi-agent workflows. Then test the most informative cases with real users, employees, suppliers, or counterparties.
The mistake would be skipping step three because the simulated heatmap looked persuasive. Heatmaps are very good at looking persuasive. That is practically their job.
The extended grid should be treated as a cheap prioritisation engine, not as evidence that the model knows what humans will do everywhere. The authors are careful about this. They present the novel simulations as preregistered hypotheses for future human validation, not as settled behavioural law.
Businesses should copy that discipline. Simulate first. Commit later. Validate before scaling. Try not to call the simulation “customer truth” in a slide deck. Everyone has suffered enough.
The business value is cheaper behavioural diagnosis, not synthetic humanity
The immediate commercial temptation is to say: if LLMs can simulate people, market research gets cheaper. That is partly true and mostly under-specified.
The more defensible interpretation is that calibrated LLM simulations can reduce the cost of behavioural diagnosis. They can help teams map where cooperation breaks, where incentives produce defection, where a rational optimiser diverges from likely human responses, and where model choice changes the outcome.
That has practical uses across several domains:
| Business setting | Directly relevant paper insight | Plausible use | Uncertainty boundary |
|---|---|---|---|
| AI negotiation agents | Models differ in cooperation and rationality profiles | Compare agent behaviour under supplier, pricing, or settlement payoff structures | Real negotiations include reputation, repeated interaction, emotion, and legal constraints |
| Marketplace design | Human-like cooperation may diverge from payoff-maximising equilibrium | Pre-screen incentive rules before live A/B testing | Platform users are not anonymous one-shot game players |
| Customer policy design | Fairness and cooperation can matter even when narrow optimisation says otherwise | Test how automated concessions or penalties may be perceived behaviourally | LLM outputs are not substitutes for customer data |
| Multi-agent enterprise workflows | Agents may optimise locally or cooperate depending on payoff framing | Stress-test internal agent coordination rules | Organisational incentives are richer than two-action games |
| Governance and model selection | Llama, Qwen, and Mistral imply different behavioural regimes | Select models based on behavioural fit, not just task accuracy | Results are model-version and prompt-pipeline dependent |
The paper directly shows aggregate behavioural replication in formal games. Cognaptus infers that similar methods can help businesses explore cooperation-sensitive decision spaces before spending money on live trials. What remains uncertain is whether the same calibration holds in richer, repeated, identity-laden, legally constrained, emotionally charged environments. In other words: the real world.
Annoying place, the real world.
The misconception to kill: “human-like” is not a model property
The easiest bad takeaway is that LLMs are now human-like decision-makers. The paper does not show that.
It shows that, under a carefully engineered prompting and verification setup, one open-source model reproduced aggregate human cooperation patterns in a controlled class of one-shot games better than Nash equilibrium did. That is impressive. It is also narrower than the slogan.
Human-likeness here is:
- model-specific;
- prompt-sensitive;
- extraction-sensitive;
- aggregate-level;
- domain-constrained;
- still awaiting validation in the extended parameter space.
That list is not a disclaimer ritual. It defines the operating manual.
If a company wants to use LLMs as behavioural testbeds, it should treat calibration as the product, not as a footnote. A behavioural simulator should be validated against known human data in the relevant decision domain, compared with formal baselines, checked across model families, and stress-tested under prompt variations. Otherwise the organisation is not running a digital twin. It is running vibes with compute.
The real strategic lesson is comparative, not celebratory
The paper’s title claims LLMs replicate and predict human cooperation. Fair enough. But the stronger management lesson is comparative.
Llama is useful when the task is to approximate aggregate human cooperation patterns in this formal setting. Qwen is useful when the task is to represent rational strategic structure. Mistral is useful as a reminder that models can develop their own behavioural signatures, which may be neither human nor textbook-rational.
That gives organisations a more mature question to ask before deploying agentic AI:
Not “Is the model smart?”
Not even “Is the model aligned?”
But: “Aligned with which behavioural reference class?”
A rational optimiser, a human-like population proxy, a conservative rule follower, a trust-building counterpart, a hard-nosed negotiator, or something stranger that emerged from pretraining and prompt scaffolding?
The answer will not be visible in general-purpose benchmarks. It has to be tested in the decision environments where the model will act.
That is the useful discomfort in this paper. It suggests that AI agents may soon be evaluated not only by accuracy, speed, and safety compliance, but by behavioural phenotype. The spreadsheet will need a new column. Naturally.
Conclusion: cooperation is now a model-selection problem
The rational illusion is that there is one clean answer to strategic behaviour. Classical game theory gives one answer. Humans give another. LLMs, inconveniently, give several.
This paper shows that calibrated LLMs can become useful behavioural instruments, but only if we stop treating them as generic synthetic humans. Llama, Qwen, and Mistral behave differently enough that model choice becomes part of the experimental design. Prompting and verification matter enough that measurement becomes part of the behaviour. Extended simulations are promising enough to guide future tests, but not strong enough to replace them.
For business, that is a good trade. The point is not to outsource judgement to a simulated population. The point is to make behavioural uncertainty cheaper to explore before real people, real money, and real reputational damage enter the room.
AI agents are learning to play social games. The sensible response is not applause. It is instrumentation.
Cognaptus: Automate the Present, Incubate the Future.
[1] Andrea Cera Palatsi, Samuel Martin-Gutierrez, Ana S. Cardenal, and Max Pellert, “Large language models replicate and predict human cooperation across experiments in game theory,” arXiv:2511.04500, 2025. https://arxiv.org/abs/2511.04500