Opening — Why this matters now
The current generation of AI systems can summarize books, write code, and even simulate conversations that feel uncannily human. Yet place these same systems inside a real collaborative task, and the illusion quickly breaks.
Human collaboration depends on something subtle but powerful: common ground—the evolving set of shared beliefs and mutually recognized facts that allow teams to coordinate action. In workplaces, negotiations, and engineering teams, this shared understanding forms the invisible infrastructure of decision-making.
For AI systems aspiring to become autonomous agents or collaborative copilots, the ability to track common ground is not optional. It is foundational.
A recent study introduces a deceptively simple benchmark designed to test exactly this capability. The result: even state‑of‑the‑art language models struggle to follow how shared beliefs evolve when multiple people communicate with partial information across multiple modalities.
In other words, AI can talk fluently—but it still struggles to truly coordinate.
Background — The Missing Piece in AI Collaboration
Decades of research in linguistics and cognitive science describe common ground as the shared belief state enabling coordinated action.
Humans construct this shared state through multiple communication channels simultaneously:
| Modality | Role in Collaboration |
|---|---|
| Speech | Explicit instructions and reasoning |
| Gesture | Spatial and referential cues |
| Action | Evidence of intention |
In real collaborative environments, these signals overlap constantly. Someone may say “Put the green block on the red one” while pointing to a location and simultaneously adjusting another piece.
Modern AI benchmarks rarely capture this complexity. Most dialogue datasets feature:
- Two‑party interactions
- Text‑only conversations
- Fully observable environments
Real teamwork, unfortunately for AI systems, is rarely so polite.
The benchmark introduced in this study attempts to recreate the messy reality of collaborative reasoning.
The DPIP Task — Engineering Epistemic Asymmetry
The experiment centers on a task called the Distributed Partial Information Puzzle (DPIP).
The setup is simple but carefully designed:
| Role | Information Access | Responsibility |
|---|---|---|
| Director 1 | One side view of a structure | Provide instructions |
| Director 2 | Another side view | Provide instructions |
| Director 3 | Third side view | Provide instructions |
| Builder | No blueprint | Physically construct structure |
Each director sees only one projection of the final Lego structure. No single participant has enough information to solve the puzzle alone.
The builder must rely entirely on instructions and gestures from the three directors. Meanwhile, the directors must reconcile conflicting or incomplete perspectives.
This creates a controlled form of epistemic asymmetry—a situation where participants possess different pieces of knowledge.
In business terms, this resembles real project teams:
- Engineers understand systems
- Designers understand users
- Managers understand constraints
The final solution emerges only when these partial views converge.
Building the Dataset
To capture this process, researchers recorded 10 collaborative sessions, each involving four participants.
Multiple data streams were collected simultaneously:
| Data Type | Purpose |
|---|---|
| Audio transcripts | Capture verbal instructions |
| Gesture annotations | Record pointing and spatial cues |
| Action logs | Track block placement changes |
| Temporal alignment | Synchronize all modalities |
The annotation pipeline reconstructed each interaction at a surprisingly granular level.
Example proposition extracted from speech:
`nextto(BlueShortBlock, GreenShortBlock)`
Example action representation:
`put(red_block_1, on(base))`
Gesture annotations were encoded using Gesture Abstract Meaning Representation (GAMR) structures that capture semantic intent.
Once aligned, these modalities created a timeline of belief updates across the team.
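The alignment step can be sketched as merging timestamped events from each stream into one chronological record. This is a minimal illustration, not the paper's actual pipeline; the event fields and example timestamps are assumptions.

```python
from dataclasses import dataclass

# Illustrative event record; field names and values are assumptions,
# not the dataset's actual annotation schema.
@dataclass
class Event:
    t: float          # seconds from session start
    actor: str        # e.g. "Director1", "Builder"
    modality: str     # "speech" | "gesture" | "action"
    content: str      # proposition, gesture annotation, or action term

speech  = [Event(12.4, "Director1", "speech",  "nextto(BlueShortBlock, GreenShortBlock)")]
gesture = [Event(13.1, "Director1", "gesture", "deixis(target=table_left)")]
actions = [Event(15.0, "Builder",   "action",  "put(red_block_1, on(base))")]

# Temporal alignment: merge all streams into one chronological timeline.
timeline = sorted(speech + gesture + actions, key=lambda e: e.t)
for e in timeline:
    print(f"{e.t:6.1f}s  {e.actor:<9}  {e.modality:<7}  {e.content}")
```

Each entry in the merged timeline then becomes a candidate belief update for the tracking step described next.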
Modeling Shared Beliefs
To evaluate whether AI systems could track common ground, researchers compared two fundamentally different approaches.
1. Large Language Models
Several models were tested, including:
- Llama‑3.2
- Qwen3
- GPT‑5‑mini / GPT‑5
These models received multimodal annotations and attempted to infer:
- the evolving structure
- the group’s shared belief state
2. Axiomatic Belief Tracking
The second approach used Dynamic Epistemic Logic (DEL)—a formal reasoning system describing how beliefs change after events.
Three core axioms governed belief updates:
| Axiom | Meaning |
|---|---|
| Seeing is believing | Perception updates belief |
| Acting is believing | Actions reveal intention |
| Saying is believing | Speech signals belief |
Whenever two or more participants accepted a proposition, it entered the system’s Common Ground (CG) representation.
For example:
`CG{Director1, Director2, Builder}: on(gs3, rs2)`
The resulting model tracks shared beliefs as they emerge and dissolve during interaction.
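The core update rule can be sketched in a few lines. The two‑participant threshold and the "saying/acting/seeing is believing" axioms follow the text; the event format, observer sets, and function names are illustrative assumptions, not the paper's DEL implementation.

```python
from collections import defaultdict

def update_common_ground(events, cg=None):
    """events: (source, proposition, observers) triples.
    A proposition enters common ground once two or more participants accept it.
    Sketch only; the real system uses Dynamic Epistemic Logic, not set counting."""
    beliefs = defaultdict(set)   # proposition -> participants who accept it
    cg = dict(cg or {})
    for source, prop, observers in events:
        beliefs[prop].add(source)        # saying/acting is believing
        beliefs[prop].update(observers)  # seeing is believing
        if len(beliefs[prop]) >= 2:
            cg[prop] = frozenset(beliefs[prop])
    return cg

# Director1 states a proposition; Director2 and the Builder perceive it.
events = [("Director1", "on(gs3, rs2)", {"Director2", "Builder"})]
cg = update_common_ground(events)
print(sorted(cg["on(gs3, rs2)"]))  # ['Builder', 'Director1', 'Director2']
```

Running the sketch reproduces the example above: `on(gs3, rs2)` enters the common ground shared by Director1, Director2, and the Builder.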
Findings — Fluent Models, Fragile Coordination
The results reveal a fascinating gap between language competence and collaborative reasoning.
1. LLMs Struggle With Multi‑Turn Belief Tracking
When predicting the structure from actions alone, even the strongest model achieved only modest accuracy.
| Model | Avg Accuracy (DSC) |
|---|---|
| GPT‑5 | Highest |
| Qwen3 | Moderate |
| Llama‑3.2 | Lowest |
However, accuracy dropped significantly when evaluating the entire dialogue history, suggesting that long‑range reasoning remains difficult.
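Assuming "DSC" in the table denotes the Dice–Sørensen coefficient over predicted versus gold block relations (an interpretation, not stated in this summary), the metric is straightforward to compute:

```python
def dice_score(pred, gold):
    """Dice-Sorensen coefficient between two sets of block relations.
    Assumption: 'DSC' refers to this set-overlap metric; the relation
    strings below are made-up examples."""
    if not pred and not gold:
        return 1.0
    return 2 * len(pred & gold) / (len(pred) + len(gold))

pred = {"on(gs3, rs2)", "nextto(b1, g1)"}
gold = {"on(gs3, rs2)", "on(b1, base)"}
print(dice_score(pred, gold))  # 0.5
```

A score of 1.0 means the predicted structure matches the gold structure exactly; partial overlap yields proportionally lower scores.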
2. More Modalities Sometimes Hurt
Adding speech and gesture information produced an unexpected effect:
- Turn‑level accuracy decreased
- Dialogue‑level accuracy improved
This implies that multimodal signals provide valuable context but also introduce noise when interpreted incrementally.
3. Formal Logic Can Rival Large Models
The deterministic epistemic logic pipeline sometimes matched or exceeded the performance of large models in predicting the final structure.
This is notable because the logic system had no statistical learning—only explicit reasoning rules.
4. LLMs and Logic Disagree About Beliefs
Perhaps the most interesting finding: LLM predictions of common ground rarely matched the logic‑derived belief states.
Even when both systems correctly predicted the final structure, they often arrived there through completely different inferred belief dynamics.
In short: the models solved the puzzle without understanding how the team understood it.
The Strange Case of the Failed Team
One group in the dataset failed the puzzle entirely.
Surprisingly, AI systems performed better at predicting common ground in this scenario.
Why?
Because there was almost none.
Participants misunderstood instructions and never converged on shared beliefs. The models correctly inferred that the group lacked meaningful agreement.
Detecting absence of common ground turned out to be easier than reconstructing its evolution.
Implications — The Real Challenge for AI Agents
The study highlights a critical limitation in current AI systems.
Large language models excel at:
- generating fluent text
- answering questions
- summarizing documents
But real-world collaboration requires something more demanding:
- Belief tracking
- Perspective modeling
- Multimodal reasoning
- Group dynamics understanding
For AI copilots or autonomous agents operating in workplaces, robotics, or multiplayer environments, these capabilities are essential.
Without them, AI may appear intelligent while quietly misunderstanding the state of collaboration.
That gap—between conversation and coordination—may become one of the defining challenges of the next generation of AI systems.
Conclusion
The Distributed Partial Information Puzzle offers a rare window into how humans actually build shared understanding.
The benchmark exposes a simple truth: collaboration is not just communication.
It is the continuous construction of shared beliefs across people, signals, and actions.
Until AI systems learn to track those evolving belief states, they will remain impressive conversationalists—but unreliable teammates.
Cognaptus: Automate the Present, Incubate the Future.