Opening — Why this matters now
The current generation of AI systems can summarize books, write code, and even simulate conversations that feel uncannily human. Yet place these same systems inside a real collaborative task, and the illusion quickly breaks.
Human collaboration depends on something subtle but powerful: common ground—the evolving set of shared beliefs and mutually recognized facts that allow teams to coordinate action. In workplaces, negotiations, and engineering teams, this shared understanding forms the invisible infrastructure of decision-making.
For AI systems aspiring to become autonomous agents or collaborative copilots, the ability to track common ground is not optional. It is foundational.
A recent study introduces a deceptively simple benchmark designed to test exactly this capability. The result: even state‑of‑the‑art language models struggle to follow how shared beliefs evolve when multiple people communicate with partial information across multiple modalities.
In other words, AI can talk fluently—but it still struggles to truly coordinate.
Background — The Missing Piece in AI Collaboration
Decades of research in linguistics and cognitive science describe common ground as the shared belief state enabling coordinated action.
Humans construct this shared state through multiple communication channels simultaneously:
| Modality | Role in Collaboration |
|---|---|
| Speech | Explicit instructions and reasoning |
| Gesture | Spatial and referential cues |
| Action | Evidence of intention |
In real collaborative environments, these signals overlap constantly. Someone may say “Put the green block on the red one” while pointing to a location and simultaneously adjusting another piece.
Modern AI benchmarks rarely capture this complexity. Most dialogue datasets feature:
- Two‑party interactions
- Text‑only conversations
- Fully observable environments
Real teamwork, unfortunately for AI systems, is rarely so polite.
The benchmark introduced in this study attempts to recreate the messy reality of collaborative reasoning.
The DPIP Task — Engineering Epistemic Asymmetry
The experiment centers on a task called the Distributed Partial Information Puzzle (DPIP).
The setup is simple but carefully designed:
| Role | Information Access | Responsibility |
|---|---|---|
| Director 1 | One side view of a structure | Provide instructions |
| Director 2 | Another side view | Provide instructions |
| Director 3 | Third side view | Provide instructions |
| Builder | No blueprint | Physically construct structure |
Each director sees only one projection of the final Lego structure. No single participant has enough information to solve the puzzle alone.
The builder must rely entirely on instructions and gestures from the three directors. Meanwhile, the directors must reconcile conflicting or incomplete perspectives.
This creates a controlled form of epistemic asymmetry—a situation where participants possess different pieces of knowledge.
In business terms, this resembles real project teams:
- Engineers understand systems
- Designers understand users
- Managers understand constraints
The final solution emerges only when these partial views converge.
Building the Dataset
To capture this process, researchers recorded 10 collaborative sessions, each involving four participants.
Multiple data streams were collected simultaneously:
| Data Type | Purpose |
|---|---|
| Audio transcripts | Capture verbal instructions |
| Gesture annotations | Record pointing and spatial cues |
| Action logs | Track block placement changes |
| Temporal alignment | Synchronize all modalities |
The annotation pipeline reconstructed each interaction at a surprisingly granular level.
Example proposition extracted from speech:
`nextto(BlueShortBlock, GreenShortBlock)`
Example action representation:
`put(red_block_1, on(base))`
Gesture annotations were encoded using Gesture Abstract Meaning Representation (GAMR) structures that capture semantic intent.
Once aligned, these modalities created a timeline of belief updates across the team.
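The alignment step can be sketched as merging timestamped events from each stream into one chronological record. This is a minimal illustration, not the paper's actual pipeline; the event fields and example timestamps are assumptions.

```python
from dataclasses import dataclass

# Illustrative event record; field names and values are assumptions,
# not the dataset's actual annotation schema.
@dataclass
class Event:
    t: float          # seconds from session start
    actor: str        # e.g. "Director1", "Builder"
    modality: str     # "speech" | "gesture" | "action"
    content: str      # proposition, gesture annotation, or action term

speech  = [Event(12.4, "Director1", "speech",  "nextto(BlueShortBlock, GreenShortBlock)")]
gesture = [Event(13.1, "Director1", "gesture", "deixis(target=table_left)")]
actions = [Event(15.0, "Builder",   "action",  "put(red_block_1, on(base))")]

# Temporal alignment: merge all streams into one chronological timeline.
timeline = sorted(speech + gesture + actions, key=lambda e: e.t)
for e in timeline:
    print(f"{e.t:6.1f}s  {e.actor:<9}  {e.modality:<7}  {e.content}")
```

Each entry in the merged timeline then becomes a candidate belief update for the tracking step described next.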
Modeling Shared Beliefs
To evaluate whether AI systems could track common ground, researchers compared two fundamentally different approaches.
1. Large Language Models
Several models were tested, including:
- Llama‑3.2
- Qwen3
- GPT‑5‑mini / GPT‑5
These models received multimodal annotations and attempted to infer:
- the evolving structure
- the group’s shared belief state
2. Axiomatic Belief Tracking
The second approach used Dynamic Epistemic Logic (DEL)—a formal reasoning system describing how beliefs change after events.
Three core axioms governed belief updates:
| Axiom | Meaning |
|---|---|
| Seeing is believing | Perception updates belief |
| Acting is believing | Actions reveal intention |
| Saying is believing | Speech signals belief |
Whenever two or more participants accepted a proposition, it entered the system’s Common Ground (CG) representation.
For example:
`CG{Director1, Director2, Builder}: on(gs3, rs2)`
The resulting model tracks shared beliefs as they emerge and dissolve during interaction.
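The core update rule can be sketched in a few lines. The two‑participant threshold and the "saying/acting/seeing is believing" axioms follow the text; the event format, observer sets, and function names are illustrative assumptions, not the paper's DEL implementation.

```python
from collections import defaultdict

def update_common_ground(events, cg=None):
    """events: (source, proposition, observers) triples.
    A proposition enters common ground once two or more participants accept it.
    Sketch only; the real system uses Dynamic Epistemic Logic, not set counting."""
    beliefs = defaultdict(set)   # proposition -> participants who accept it
    cg = dict(cg or {})
    for source, prop, observers in events:
        beliefs[prop].add(source)        # saying/acting is believing
        beliefs[prop].update(observers)  # seeing is believing
        if len(beliefs[prop]) >= 2:
            cg[prop] = frozenset(beliefs[prop])
    return cg

# Director1 states a proposition; Director2 and the Builder perceive it.
events = [("Director1", "on(gs3, rs2)", {"Director2", "Builder"})]
cg = update_common_ground(events)
print(sorted(cg["on(gs3, rs2)"]))  # ['Builder', 'Director1', 'Director2']
```

Running the sketch reproduces the example above: `on(gs3, rs2)` enters the common ground shared by Director1, Director2, and the Builder.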
Findings — Fluent Models, Fragile Coordination
The results reveal a fascinating gap between language competence and collaborative reasoning.
1. LLMs Struggle With Multi‑Turn Belief Tracking
When predicting the structure from actions alone, even the strongest model achieved only modest accuracy.
| Model | Avg Accuracy (DSC) |
|---|---|
| GPT‑5 | Highest |
| Qwen3 | Moderate |
| Llama‑3.2 | Lowest |
However, accuracy dropped significantly when evaluating the entire dialogue history, suggesting that long‑range reasoning remains difficult.
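Assuming "DSC" in the table denotes the Dice–Sørensen coefficient over predicted versus gold block relations (an interpretation, not stated in this summary), the metric is straightforward to compute:

```python
def dice_score(pred, gold):
    """Dice-Sorensen coefficient between two sets of block relations.
    Assumption: 'DSC' refers to this set-overlap metric; the relation
    strings below are made-up examples."""
    if not pred and not gold:
        return 1.0
    return 2 * len(pred & gold) / (len(pred) + len(gold))

pred = {"on(gs3, rs2)", "nextto(b1, g1)"}
gold = {"on(gs3, rs2)", "on(b1, base)"}
print(dice_score(pred, gold))  # 0.5
```

A score of 1.0 means the predicted structure matches the gold structure exactly; partial overlap yields proportionally lower scores.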
2. More Modalities Sometimes Hurt
Adding speech and gesture information produced an unexpected effect:
- Turn‑level accuracy decreased
- Dialogue‑level accuracy improved
This implies that multimodal signals provide valuable context but also introduce noise when interpreted incrementally.
3. Formal Logic Can Rival Large Models
The deterministic epistemic logic pipeline sometimes matched or exceeded the performance of large models in predicting the final structure.
This is notable because the logic system had no statistical learning—only explicit reasoning rules.
4. LLMs and Logic Disagree About Beliefs
Perhaps the most interesting finding: LLM predictions of common ground rarely matched the logic‑derived belief states.
Even when both systems correctly predicted the final structure, they often arrived there through completely different inferred belief dynamics.
In short: the models solved the puzzle without understanding how the team understood it.
The Strange Case of the Failed Team
One group in the dataset failed the puzzle entirely.
Surprisingly, AI systems performed better at predicting common ground in this scenario.
Why?
Because there was almost none.
Participants misunderstood instructions and never converged on shared beliefs. The models correctly inferred that the group lacked meaningful agreement.
Detecting absence of common ground turned out to be easier than reconstructing its evolution.
Implications — The Real Challenge for AI Agents
The study highlights a critical limitation in current AI systems.
Large language models excel at:
- generating fluent text
- answering questions
- summarizing documents
But real-world collaboration requires something more demanding:
- Belief tracking
- Perspective modeling
- Multimodal reasoning
- Group dynamics understanding
For AI copilots or autonomous agents operating in workplaces, robotics, or multiplayer environments, these capabilities are essential.
Without them, AI may appear intelligent while quietly misunderstanding the state of collaboration.
That gap—between conversation and coordination—may become one of the defining challenges of the next generation of AI systems.
Conclusion
The Distributed Partial Information Puzzle offers a rare window into how humans actually build shared understanding.
The benchmark exposes a simple truth: collaboration is not just communication.
It is the continuous construction of shared beliefs across people, signals, and actions.
Until AI systems learn to track those evolving belief states, they will remain impressive conversationalists—but unreliable teammates.
Cognaptus: Automate the Present, Incubate the Future.