Human-AI collaboration is easy to romanticize in theory but hard to operationalize in practice. While reinforcement learning agents have dazzled us in games like Go and StarCraft, they often stumble when asked to cooperate with humans under real-world constraints: imperfect information, ambiguous signals, and no chance to train together beforehand. That’s the realm of ad-hoc teamwork, and the latest paper from Oxford’s FLAIR lab marks a critical step forward.

The Ad-Hoc Human-AI Coordination Challenge (AH2AC2) tackles this problem by leveraging Hanabi, a cooperative card game infamous among AI researchers for its subtle, communication-constrained dynamics. Unlike chess, Hanabi demands theory of mind—inferring what your teammate knows and intends based on sparse, indirect cues. It’s a Turing Test of collaboration.

Human Proxies, Not Human Subjects

Evaluating human-AI coordination typically requires labor-intensive user studies. AH2AC2 circumvents this with a clever solution: human proxy agents trained on a massive dataset of real human Hanabi gameplay (over 147,000 games). These proxies aren’t just bots that mimic humans; they’re trained via a two-step process:

  1. Behavioral Cloning (BC) to imitate human decisions,
  2. Regularized Reinforcement Learning (HDR-IPPO) to refine strategies while staying human-like.
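
The paper’s exact HDR-IPPO objective isn’t reproduced here, but the core idea of the second step, improving the policy while penalizing drift away from the behavior-cloned anchor, can be sketched as a PPO-style loss with a KL penalty. This is a minimal sketch assuming PyTorch; the names (`bc_logits`, `kl_coef`) are illustrative, not the paper’s.

```python
import torch
import torch.nn.functional as F

def regularized_policy_loss(policy_logits, bc_logits, actions, advantages,
                            old_log_probs, clip_eps=0.2, kl_coef=0.5):
    """PPO-style clipped policy loss plus a KL penalty toward a frozen
    behavioral-cloning policy (illustrative of KL-regularized RL, not the
    paper's actual HDR-IPPO code)."""
    log_probs = F.log_softmax(policy_logits, dim=-1)
    new_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # Standard PPO clipped surrogate objective.
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    ppo_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # KL(pi || pi_BC): penalize drifting away from human-like behavior.
    bc_log_probs = F.log_softmax(bc_logits, dim=-1)
    kl = (log_probs.exp() * (log_probs - bc_log_probs)).sum(-1).mean()

    return ppo_loss + kl_coef * kl
```

The KL coefficient is the knob that trades task score against human-likeness: too small and the agent can drift toward alien conventions, too large and it never improves on imitation.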

The result: cheap, scalable, and behaviorally faithful agents that can serve as stand-ins for human teammates. Importantly, the proxies are not publicly downloadable; they are hosted behind an evaluation API to prevent overfitting to them, establishing a gold standard for reproducibility.

Crucially, the authors validate these proxies in several ways: cross-play with BC agents, behavioral metrics such as Information Per Play and Communicativeness, and action-prediction accuracy on held-out human games. Their best human proxies follow human conventions with 88% success in a qualitative game analysis, making mistakes similar to those of actual players.
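
As a generic illustration of the last of those checks (not the paper’s evaluation code), action-prediction accuracy on held-out games reduces to asking how often the proxy’s most likely action matches what the human actually played. The `policy` interface and batch format below are assumptions.

```python
import torch

@torch.no_grad()
def action_prediction_accuracy(policy, held_out_batches):
    """Fraction of held-out human decisions whose action matches the
    policy's argmax prediction. `policy` maps observations to action logits;
    each batch is (observations, human_actions)."""
    correct, total = 0, 0
    for observations, human_actions in held_out_batches:
        logits = policy(observations)          # (batch, num_actions)
        predicted = logits.argmax(dim=-1)      # most likely action per state
        correct += (predicted == human_actions).sum().item()
        total += human_actions.numel()
    return correct / max(total, 1)
```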

Coordination Without Coordination

AH2AC2 asks a deceptively simple question: Can your AI agent play well with unfamiliar partners who behave like humans? To answer, it offers two challenges:

  • Coordination Challenge: Your agent plays 1,000 Hanabi games with human proxies. Performance is ranked on a public leaderboard (a sketch of this evaluation loop follows the list).
  • Prediction Challenge (optional): Your agent must predict the next action a human would take in unseen scenarios.
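
Conceptually, the Coordination Challenge boils down to the loop below: pair your agent with a proxy partner, play many games, and report the average score. This is a sketch against a hypothetical environment interface (`env`, `your_agent`, and `proxy_agent` are placeholders, not the challenge’s hosted API).

```python
def evaluate_coordination(env, your_agent, proxy_agent, num_games=1000):
    """Average Hanabi score when `your_agent` is paired with a human proxy.
    All objects here are hypothetical stand-ins for the challenge's
    evaluation interface."""
    agents = [your_agent, proxy_agent]
    total_score = 0.0
    for _ in range(num_games):
        obs = env.reset()
        done = False
        while not done:
            current = env.current_player()            # whose turn it is
            action = agents[current].act(obs[current])
            obs, _reward, done = env.step(action)
        total_score += env.score()                    # final fireworks score
    return total_score / num_games
```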

Baseline agents tested include classic self-play PPO, behavioral cloning, and zero-shot coordination (ZSC) methods like OBL (Off-Belief Learning). Strikingly, even advanced LLMs like DeepSeek-R1, when prompted in natural language with Hanabi rules and conventions, underperform specialized ZSC agents like OBL, highlighting the limits of generic LLMs in structured social tasks.
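
For intuition on how an LLM ends up in this loop: the observable game state and the legal moves are serialized into text, and the model is asked to pick one move. The prompt construction below is an illustrative sketch, not the prompt or API calls used in the paper.

```python
def build_hanabi_prompt(rules_summary, visible_state, legal_moves):
    """Illustrative prompt for an LLM acting as a Hanabi teammate
    (hypothetical, not the paper's prompt). The model is asked to
    return exactly one legal move."""
    moves = "\n".join(f"{i}: {m}" for i, m in enumerate(legal_moves))
    return (
        "You are playing the cooperative card game Hanabi with a human partner.\n"
        f"Rules and conventions:\n{rules_summary}\n\n"
        f"What you can observe right now:\n{visible_state}\n\n"
        f"Legal moves:\n{moves}\n\n"
        "Reply with the index of the single best move and a one-sentence reason."
    )
```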

The study also reveals that strong ZSC methods like OBL can outperform data-hungry imitation learners, and that population-based methods like FCP (Fictitious Co-Play) struggle with Hanabi’s partial observability. Importantly, HDR-IPPO with proper KL regularization achieves both human-like behavior and strong self-play performance, a rare combination.

Why This Matters

Hanabi may be a game, but the stakes are real. Ad-hoc coordination mirrors the challenges of integrating AI into workflows like autonomous driving, collaborative robotics, or even co-authoring documents. The AH2AC2 benchmark:

  • Makes human-AI evaluation reproducible,
  • Emphasizes data efficiency, releasing only 3,000 games for training,
  • Encourages robust generalization, not just memorization,
  • Provides a live API and leaderboard to track community progress,
  • Introduces held-out action prediction as a new test of theory-of-mind modeling,
  • Highlights regularization as a key to maintaining human-likeness during training.

It also pushes LLM research to grapple with cooperation, not just generation. Prompting LLMs to act as teammates in a complex, partially observable environment reveals where their reasoning and their empathy fall short.

Looking Ahead

Oxford’s team leaves open exciting frontiers: extending to four- and five-player games, incorporating rainbow cards for extra ambiguity, or ultimately validating the proxies through experiments with real human players. Moreover, AH2AC2 could become a touchstone for testing future LLM agents, whether trained end-to-end or augmented via symbolic scaffolding.

An especially intriguing direction is using Hanabi to test multimodal agents that process visual game states, or pairing the proxy agents with few-shot adaptation settings to explore hybrid RL/LLM models. The groundwork is set.

In a world where AI will increasingly be our partner, not just our tool, we need benchmarks that measure how well it plays with others. AH2AC2 is a promising, pragmatic, and playful start.


Cognaptus: Automate the Present, Incubate the Future