Four people sit around a table. Three of them can see only one side of a Lego structure. The fourth person, the builder, can touch the blocks but cannot see the target design. Nobody has the whole picture. Everyone must talk, gesture, infer, correct, and occasionally pretend that “left” is a stable concept in a room full of humans.

This is not a children’s game. Well, it is Lego, so technically it is. But in the paper “Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry,” the Lego table becomes a small laboratory for a large AI problem: whether modern models can track what a group actually knows together, rather than merely what has been said nearby.[1]

That distinction matters. In enterprise collaboration, a project rarely fails because nobody spoke. It fails because different people leave the same meeting believing different things. The engineer thinks the dependency was approved. The product manager thinks it was only discussed. The designer thinks the constraint was rejected. The AI meeting assistant, ever cheerful, summarizes the whole thing as “the team aligned on next steps.” Very helpful. In the same way a fog machine is helpful during surgery.

The paper introduces the Distributed Partial Information Puzzle, or DPIP, to make this problem observable. Its contribution is not just another benchmark where models score depressingly low and everyone nods with practiced sadness. The useful point is sharper: common ground is not the same as conversation, not the same as task progress, and not the same as multimodal input. It is a changing belief state distributed across people.

That is exactly the kind of thing AI agents will need to understand if they are expected to coordinate work rather than decorate workflows with plausible summaries.

The Lego table is a miniature enterprise team

The DPIP task is deliberately simple at the surface. Four participants work together to build a Lego structure. Three are “directors.” Each director receives a different two-dimensional side view of the target structure. The builder receives no blueprint but is the only person allowed to manipulate the blocks.

No director can solve the task alone. The builder cannot solve it by inspection. The group succeeds only if the final physical structure is consistent with all three partial views.

This is epistemic asymmetry in toy form. Each participant holds private information that is true but incomplete. The task forces them to externalize that information through speech, gesture, and correction. They must also infer what others know, what others believe, and whether a visible block placement reflects real agreement or merely a temporary misunderstanding with bricks attached.

That is why the benchmark is more interesting than a standard instruction-following test. A normal benchmark might ask: “Can the model follow the instruction?” DPIP asks something closer to: “Can the model infer which facts have become mutually accepted by a group whose members do not share the same view of the world?”

In business language, the three directors could be legal, engineering, and sales. The builder could be operations. Everyone is truthful. Everyone is incomplete. The final plan must satisfy all views. This is collaboration without the comforting fiction that one person has a master dashboard.

Common ground is built from speech, gesture, and action

The authors recorded 10 four-person DPIP sessions. The interactions were audiovisual, captured with three Microsoft Azure Kinect cameras and a tabletop microphone. The fully annotated dataset contains 3,112 utterances, 479 actions, 278 gestures, 307 propositions, and 13,027 tokens across the 10 groups. One group, Group 7, was an extreme outlier: it took 3,223.60 seconds and failed the task, so the authors treat it separately in the main evaluation.

The annotation pipeline is one of the paper’s main contributions. The researchers did not simply transcribe conversations and call it multimodal. They aligned three kinds of evidence:

| Modality | What it captures | Why it matters for common ground |
| --- | --- | --- |
| Speech | Verbal propositions such as spatial relations between blocks | People explicitly state what they believe or instruct |
| Action | Builder block operations: put, remove, move | Physical changes reveal interpretation and intention |
| Gesture | Pointing, iconic movement, agreement, disagreement, spatial reference | Some task-relevant meaning is communicated without words |

Speech annotations achieved high agreement, with an average Cohen’s Kappa of 0.971. Structure annotations were also strong on average, at 0.893. Gesture was harder: agreement over the union of annotated gestures averaged 0.495, while agreement over the intersection averaged 0.845. That split is important. Annotators largely agreed about gesture meaning when they agreed that a gesture occurred, but identifying all gesture occurrences was much noisier. Human bodies, regrettably, do not emit clean JSON.
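
To make the union/intersection split concrete, here is a minimal sketch of Cohen’s Kappa for two annotators. The labels and the `drop_unmatched` helper are invented for illustration, and treating a missed gesture as the label "none" is only one plausible reading of how the two scopes differ; the paper’s exact procedure may not match it.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Share of items where the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

def drop_unmatched(labels_a, labels_b, missing="none"):
    """Keep only items that BOTH annotators marked as a gesture (hypothetical helper)."""
    pairs = [(a, b) for a, b in zip(labels_a, labels_b)
             if a != missing and b != missing]
    return [a for a, _ in pairs], [b for _, b in pairs]

# Invented labels: each annotator misses one gesture the other one caught.
annotator_a = ["point", "nod", "none", "point", "iconic"]
annotator_b = ["point", "nod", "point", "none", "iconic"]

union_kappa = cohens_kappa(annotator_a, annotator_b)
intersection_kappa = cohens_kappa(*drop_unmatched(annotator_a, annotator_b))
print(round(union_kappa, 3), round(intersection_kappa, 3))  # 0.444 1.0
```

In this toy case the annotators never disagree about what a gesture means. All of the lost agreement comes from gestures that only one of them noticed, which is the same pattern the paper’s split points to.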

The paper then aligns these modalities temporally. If someone says “put the green on the red” while pointing and the builder acts shortly afterward, the pipeline can associate the verbal instruction, gesture reference, and physical placement. Color-shape terms can be grounded to concrete block identifiers. Gesture content can be linked to objects and layers. The result is a symbolic timeline of propositions and belief-relevant events.
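
A rough sketch of that kind of temporal alignment follows. It is not the authors’ pipeline: the `Event` record, the `align_to_actions` function, and the three-second window are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str      # "speech", "gesture", or "action"
    start: float   # seconds from the start of the session
    actor: str
    content: str

def align_to_actions(events, window=3.0):
    """Pair each builder action with the speech and gestures that began
    at most `window` seconds before it. The window size is a guess."""
    actions = sorted((e for e in events if e.kind == "action"),
                     key=lambda e: e.start)
    others = [e for e in events if e.kind != "action"]
    aligned = []
    for act in actions:
        context = [e for e in others if 0 <= act.start - e.start <= window]
        aligned.append((act, sorted(context, key=lambda e: e.start)))
    return aligned

# Invented timeline for one instruction and one later correction.
timeline = [
    Event("speech", 10.0, "Director 2", "put the green on the red"),
    Event("gesture", 10.4, "Director 2", "point: red block"),
    Event("action", 12.1, "Builder", "put(green, on, red)"),
    Event("speech", 30.0, "Director 1", "no, the other side"),
]

for action, context in align_to_actions(timeline):
    print(action.content, "<-", [e.content for e in context])
```

The late correction at 30 seconds is not attached to the placement, because it falls outside the window. A real pipeline would need a more careful notion of which utterances respond to which actions.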

This matters because the models are not being fed raw video. They are being tested on structured annotations derived from speech, gesture, and action. That makes the benchmark less about visual perception and more about belief tracking under multimodal evidence. It also makes the results more uncomfortable: even after humans have converted messy interaction into cleaner symbolic inputs, the models still struggle.

The benchmark tests four different questions, not one generic “AI reasoning” score

The evaluation is easy to misread if treated as a single scoreboard. The paper actually runs four experimental conditions, each asking a different question.

| Experiment | Input | Output being evaluated | Likely purpose |
| --- | --- | --- | --- |
| Action Structure | Builder actions only | Current structure state | Baseline for task-progress tracking from physical actions |
| Aligned Structure | Speech, gesture, and action annotations | Current structure state | Test whether additional modalities help reconstruct the structure |
| CGC Structure | Axiomatic common-ground calculation | Structure implied by inferred shared beliefs | Test whether formal belief rules can predict physical outcome |
| Aligned CGC | Aligned annotations to LLMs | Common ground compared with axiomatic CG | Intrinsic comparison between LLM belief inference and logic-based belief inference |

The metric is the Dice Similarity Coefficient, or DSC. It measures overlap between predicted and ground-truth sets: twice the size of their intersection divided by the sum of their sizes, so 0 means the sets are disjoint and 1 means they are identical. The paper reports two versions: an average turn-level DSC, which reflects local agreement after each belief-update episode, and a global dialogue-level DSC, which reflects agreement with the final state.
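
A minimal sketch of both scores on sets of propositions is shown here. The function names and the toy propositions are illustrative. Scoring two empty sets as 1.0 and reading the global score as Dice on the final turn are assumptions, not details confirmed by the paper.

```python
def dice(pred, truth):
    """Dice Similarity Coefficient between two sets."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # assumed convention: two empty sets agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))

def average_and_global_dsc(pred_turns, truth_turns):
    """Return (mean per-turn Dice, Dice on the final state)."""
    if len(pred_turns) != len(truth_turns) or not pred_turns:
        raise ValueError("need two equal-length, non-empty turn sequences")
    per_turn = [dice(p, t) for p, t in zip(pred_turns, truth_turns)]
    return sum(per_turn) / len(per_turn), dice(pred_turns[-1], truth_turns[-1])

# Invented example: the state after each of three update episodes.
truth_turns = [
    {"green on red"},
    {"green on red", "blue on green"},
    {"green on red", "blue on green", "yellow left of blue"},
]
pred_turns = [
    {"green on red"},
    {"green on red"},
    {"blue on green", "yellow left of blue", "red on yellow"},
]

average, final = average_and_global_dsc(pred_turns, truth_turns)
print(round(average, 3), round(final, 3))  # 0.778 0.667
```

Under the empty-set convention assumed above, a prediction of "nothing is shared" is scored as fully correct whenever nothing is in fact shared.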

That distinction becomes central. Models can look better locally and worse globally, or worse locally and better globally. The paper is not simply saying “more context is good” or “more context is bad.” It is saying that context changes the type of error.

More modalities help globally, but make turn-level reasoning noisier

In the Action Structure setting, GPT-5-mini/GPT-5 performs best among the LLMs on average turn-level structure prediction, with a mean Average DSC of 0.382. Qwen3-4B follows at 0.276, and Llama-3.2-3B at 0.219. But the global scores are much lower: GPT-5-mini/GPT-5 reaches only 0.159, Qwen3-4B 0.107, and Llama 0.049.

The interpretation is not subtle. Given action-only input, models can partially follow local board changes, but the full dialogue-level structure remains difficult. The block world is small, yet state accumulation across turns is already fragile. Scale that up to a three-month ERP migration and enjoy the confetti.

The Aligned Structure condition adds speech and gesture annotations. Here the result changes in an interesting way. Turn-level performance drops for all three LLMs. Llama falls to 0.118, Qwen to 0.139, and GPT-5-mini/GPT-5 collapses to 0.029. Yet global performance rises for Llama and Qwen: Llama reaches 0.471 and Qwen reaches 0.668. GPT-5-mini/GPT-5 remains weak globally at 0.250.

The paper’s interpretation is sensible: additional modalities may help when the whole dialogue is considered, but they add noise at the turn level. In other words, speech and gesture provide useful context for reconstructing the final structure, but they complicate incremental updates. This is exactly the trap in multimodal AI marketing. More channels do not automatically mean more understanding. Sometimes they mean the system now has three ways to be confused.

The strongest example is GPT-5-mini/GPT-5. In the action-only condition, it is the best model by turn-level DSC. With aligned multimodal annotations, it becomes the worst, including zero global DSC for Groups 4 and 5. The paper does not prove why this happens, and we should not invent a heroic explanation. But the operational lesson is clear: stronger general models may still be brittle when the input format requires structured state tracking across noisy multimodal evidence.

The logic pipeline is not smarter, but it is more disciplined

The paper also implements an axiomatic common-ground calculation inspired by Dynamic Epistemic Logic. The simplified rules are intuitive:

| Axiom | Meaning |
| --- | --- |
| Seeing is believing | Perceptual context can update belief |
| Acting is believing | Embodied action reveals intention |
| Saying is believing | Language communicates epistemic state |

The system tracks whether participants accept, doubt, or negate propositions. When at least two participants concurrently accept a relation, that relation can enter common ground for that set of participants. Doubt can remove a participant from the common-ground set. Negation can delete the shared belief.
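
Those update rules can be sketched in a few lines. This is a deliberately reduced illustration, not the paper’s Dynamic Epistemic Logic machinery: the `CommonGroundTracker` class and the stance strings are hypothetical, and treating a negation as deleting the belief for every participant is a simplifying assumption.

```python
from collections import defaultdict

class CommonGroundTracker:
    """Tracks which participants currently accept each proposition."""

    def __init__(self):
        self.accepting = defaultdict(set)  # proposition -> set of participants

    def update(self, participant, proposition, stance):
        if stance == "accept":
            self.accepting[proposition].add(participant)
        elif stance == "doubt":
            # Doubt removes only this participant from the accepting set.
            self.accepting[proposition].discard(participant)
        elif stance == "negate":
            # Assumption: a negation deletes the belief for everyone.
            self.accepting.pop(proposition, None)
        else:
            raise ValueError(f"unknown stance: {stance!r}")

    def common_ground(self):
        """Propositions accepted by at least two participants, with the holders."""
        return {prop: frozenset(who)
                for prop, who in self.accepting.items() if len(who) >= 2}

tracker = CommonGroundTracker()
tracker.update("Director 2", "green on red", "accept")  # said it
tracker.update("Builder", "green on red", "accept")     # acted on it
print(tracker.common_ground())  # shared by Director 2 and the Builder

tracker.update("Builder", "green on red", "doubt")
print(tracker.common_ground())  # {} : only one participant still accepts
```

The state is small and boring, but every entry can be inspected and traced back to the event that produced it.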

This is not a magical symbolic oracle. It depends on annotated propositions and simplified assumptions. But it has one advantage that matters in enterprise systems: it is explicit about the belief state it is updating.

In the CGC Structure experiment, the axiomatic pipeline achieves a mean Average DSC of 0.062 and a Global DSC of 0.369. Turn-level performance is low, but the global score is higher than GPT-5-mini/GPT-5 in the multimodal Aligned Structure condition. The difference is reported as a non-significant medium effect, so this should not be exaggerated into “logic beats LLMs.” That would be a nice headline and a lazy reading.

A better interpretation is this: explicit belief rules can remain competitive in a task where generic language models struggle to maintain structured shared state. The logic pipeline is not more flexible. It is less forgetful about what it is supposed to represent.

For business AI, that matters. A workflow agent does not merely need fluent reasoning. It needs inspectable state. Who accepted the requirement? Who objected? Which decision was observed, which was inferred, and which was merely proposed? Without such distinctions, “AI coordination” becomes a theater production in which the model confidently narrates consensus that nobody actually reached.

The most important result is disagreement about belief, not low structure scores

The Aligned CGC experiment compares LLM-predicted common ground with the axiomatic common-ground calculation. The overlap is low. Llama’s mean Average DSC is 0.140 and Global DSC is 0.221. Qwen’s are 0.115 and 0.138. GPT-5-mini/GPT-5 does better, with 0.288 and 0.363, but still far from reliable.

The paper notes that many cases have a DSC of 0, meaning the model and axiomatic system predict completely disjoint belief sets. Even Group 9, which appears relatively easy for both LLMs and the CGC pipeline in structure prediction, shows limited overlap between model-inferred and axiomatically calculated common ground. The model and the logic system may both approach the same visible structure while disagreeing about how the group got there.

This is the business-critical point. A model can be approximately right about the final artifact and wrong about the collaboration process. That is not a harmless error if the AI system is expected to manage handoffs, assign accountability, detect unresolved disagreement, or recommend next actions.

A project assistant that knows the final slide deck exists but does not know who accepted the pricing assumption is not a collaboration agent. It is a librarian with excellent posture.

The failed group reveals what models can detect

Group 7 failed the task and was treated separately. At first glance, one might expect it to be the hardest case. The group did not converge successfully, so common-ground tracking should be more difficult.

The results are stranger. For Aligned CGC, Qwen and GPT-5-mini/GPT-5 achieve perfect DSC: 1.000 for both Average and Global. Llama also improves, with 0.334 Average and 0.500 Global.

The explanation is not that failed collaboration is easy in general. The paper’s closer reading is that Group 7 had very little extractable common ground. The participants showed widespread confusion about the task, goals, and available information. When LLMs predicted small or empty belief sets, they matched the axiomatically calculated lack of shared belief.

This is an important boundary. Models may be better at detecting the absence of common ground than reconstructing its substantive content. In practical terms, an AI system might be able to flag “this meeting did not converge” before it can accurately say “these are the five propositions the team now mutually accepts.”

That is still useful. In fact, it may be one of the more realistic near-term applications. On this evidence, detecting collaboration breakdown looks easier than modeling collaboration success in full detail. The smoke alarm does not need to understand architecture. It just needs to know that something is burning.

What this means for AI copilots and enterprise agents

The paper directly shows that DPIP is challenging for current LLMs and that multimodal symbolic annotations do not guarantee reliable common-ground inference. It also shows that a simple axiomatic belief pipeline can be competitive in some structure-prediction settings and can expose systematic disagreement between model-inferred and rule-derived belief states.

Cognaptus’ business inference is broader but should stay disciplined. DPIP is not a direct benchmark for Slack, Jira, Zoom, factory robotics, or executive meetings. It is a controlled proxy for a recurring coordination problem: teams act under partial information, and AI systems must infer not only what happened, but what became mutually accepted.

The practical implication is that enterprise AI agents should separate at least four layers of state:

| Layer | Example | Why separation matters |
| --- | --- | --- |
| Observed event | “The builder placed the green block on the red block.” | The system records what happened |
| Communicated claim | “Director 2 said the green block belongs on red.” | The system records who asserted what |
| Accepted belief | “Director 1 and the builder accepted that relation.” | The system tracks local agreement |
| Common ground | “The group mutually treats the relation as settled.” | The system can safely use it for downstream coordination |

Most current AI products blur these layers. A meeting summary compresses observed statements, unresolved claims, and accepted decisions into one polished paragraph. A task agent may convert a suggestion into an action item. A workflow bot may treat silence as consent. This is not intelligence. It is administrative overconfidence.

For AI systems in project management, robotics, design review, compliance workflows, and multi-agent operations, the relevant design question is not “Can the model summarize the conversation?” It is “Can the system maintain an auditable belief ledger?”

That ledger does not have to be purely symbolic. It may combine LLM extraction, structured event logs, human confirmation, and formal update rules. The key is architectural: belief state should be represented as a first-class object, not left as a vibe inside a context window.
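
As a sketch of what “first-class” could mean in code, the four layers can be stored as explicit records rather than prose. Every name here (`Layer`, `LedgerEntry`, `BeliefLedger`, `safe_to_automate`) is hypothetical. The ledger is append-only and does not model retraction, and ranking the layers in a strict order is a simplification.

```python
from dataclasses import dataclass, field
from enum import IntEnum

class Layer(IntEnum):
    OBSERVED_EVENT = 1
    COMMUNICATED_CLAIM = 2
    ACCEPTED_BELIEF = 3
    COMMON_GROUND = 4

@dataclass(frozen=True)
class LedgerEntry:
    proposition: str
    layer: Layer
    holders: frozenset   # who the entry applies to
    evidence: str        # pointer back to the utterance, action, or log line

@dataclass
class BeliefLedger:
    entries: list = field(default_factory=list)

    def record(self, proposition, layer, holders, evidence):
        entry = LedgerEntry(proposition, layer, frozenset(holders), evidence)
        self.entries.append(entry)
        return entry

    def status(self, proposition):
        """Highest layer recorded for a proposition, or None if never seen."""
        layers = [e.layer for e in self.entries if e.proposition == proposition]
        return max(layers) if layers else None

    def safe_to_automate(self, proposition):
        # Only act on propositions the whole group treats as settled.
        return self.status(proposition) == Layer.COMMON_GROUND

ledger = BeliefLedger()
ledger.record("green on red", Layer.COMMUNICATED_CLAIM,
              {"Director 2"}, "utterance: 'put the green on the red'")
ledger.record("green on red", Layer.ACCEPTED_BELIEF,
              {"Director 1", "Builder"}, "nod + placement")
print(ledger.status("green on red").name)       # ACCEPTED_BELIEF
print(ledger.safe_to_automate("green on red"))  # False
```

The useful property is the last line: a suggestion that two people nodded at is visibly not the same thing as a decision, and downstream automation can check the difference.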

A useful product pattern: detect non-grounding before automating action

The most immediately useful pattern from DPIP is not full autonomous coordination. That is still ambitious. The safer pattern is non-grounding detection.

An enterprise assistant could flag situations such as:

| Signal | Possible interpretation | Safer system behavior |
| --- | --- | --- |
| Multiple participants assert incompatible constraints | The team has not reconciled private knowledge | Ask for explicit resolution |
| A task is assigned after only one stakeholder speaks | Acceptance may not be shared | Request confirmation from missing parties |
| Action occurs without verbal agreement | The actor may be experimenting, not executing a decision | Mark as provisional |
| Repeated corrections or reversals appear | Common ground is unstable | Delay automation and summarize uncertainty |
| Final artifact exists but decision trail is unclear | Outcome and agreement may diverge | Produce a belief-state audit, not a simple summary |
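
A few of these signals can be checked with plain rules before any model is involved. The sketch below is illustrative only: the `Decision` record, its fields, and the reversal threshold are assumptions, and a real system would have to extract these fields from messy inputs first.

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    summary: str
    stakeholders: set                               # everyone whose acceptance is required
    confirmed_by: set = field(default_factory=set)  # who explicitly accepted
    reversals: int = 0                              # times it was corrected or undone
    acted_on: bool = False                          # someone already executed it

def grounding_flags(decision, max_reversals=1):
    """Return suggested behaviors for signs that common ground is missing."""
    flags = []
    missing = decision.stakeholders - decision.confirmed_by
    if missing:
        flags.append("request confirmation from: " + ", ".join(sorted(missing)))
    if decision.acted_on and not decision.confirmed_by:
        flags.append("mark action as provisional")
    if decision.reversals > max_reversals:
        flags.append("delay automation and summarize uncertainty")
    return flags

# Invented example: one stakeholder spoke, and the plan has flipped twice.
decision = Decision(
    summary="ship with the new pricing assumption",
    stakeholders={"legal", "engineering", "sales"},
    confirmed_by={"engineering"},
    reversals=2,
)
for flag in grounding_flags(decision):
    print(flag)
```

An empty list does not prove the team agrees. It only means none of these particular alarms went off.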

This is where the paper becomes commercially relevant. The ROI is not “AI understands meetings like a human.” That claim should be placed carefully in the same drawer as “blockchain will fix dentistry.” The realistic value is cheaper diagnosis of coordination risk.

A system that can identify when common ground is missing can prevent premature automation. It can reduce rework, catch unresolved dependencies, and make meetings less dependent on the heroic memory of whoever took notes while also trying to participate.

Boundaries: what the paper does not prove

The evidence is valuable, but it has clear limits.

First, the fully annotated evaluation covers 10 groups, with 9 used in the main evaluation and one outlier analyzed separately. The authors collected more data overall, including 33 videos totaling nearly 20 hours, but the reported fully annotated subset is smaller. The benchmark is therefore best read as a carefully designed early resource, not as a final population-level estimate of model capability.

Second, the models are evaluated on structured annotations, not raw multimodal streams. This isolates belief tracking from low-level perception, which is methodologically useful. But it also means the real-world problem is harder. In actual enterprise settings, the system must first extract the relevant signals from messy audio, video, documents, chat, workflow tools, and human hesitation disguised as “sounds good.”

Third, gesture annotation remains difficult. The paper’s union/intersection agreement split shows that meaning can be reliable once gestures are identified, but identifying gesture occurrences is still noisy. That matters because real collaboration often depends on exactly these small embodied cues: a point, a nod, a hesitation, a hand hovering over the wrong block like destiny losing confidence.

Fourth, the axiomatic common-ground pipeline is simplified. Its rules are interpretable, but they depend on assumptions about perception, action, and speech. “Saying is believing” is useful as a modeling rule; it is not a universal law of human organizations. Anyone who has attended a budget meeting knows this.

Finally, DSC overlap is a useful metric, but it does not capture every dimension of practical usefulness. A system could miss some propositions but still identify the most operationally important unresolved issue. Conversely, it could score well on many small relations while missing the one constraint that breaks deployment.

The lesson is not “LLMs cannot collaborate.” It is “collaboration needs state.”

DPIP is useful because it makes a quiet failure visible. Modern AI systems often sound as if they understand collaboration because they can describe it. They can generate fluent meeting notes, infer plausible intentions, and produce confident next steps. But collaboration is not a transcript with better formatting. It is a process by which private information becomes mutually recognized enough to support action.

The paper’s most important warning is that multimodality alone does not solve this. Speech, gesture, and action provide richer evidence, but they also introduce noise, timing ambiguity, and competing interpretations. A model that cannot maintain structured belief state may become less reliable as the input becomes more realistic.

The constructive path is not to abandon LLMs or worship symbolic logic. That binary is tired, and frankly it needs a nap. The better architecture is hybrid: use language models to extract, translate, and reason over messy inputs, but maintain common ground through explicit state representations, update rules, and confirmation mechanisms.

A good AI teammate should know the difference between:

  • what someone said;
  • what someone did;
  • what someone appeared to accept;
  • what the group mutually treats as settled;
  • what remains private, disputed, or unresolved.

Until AI systems can maintain those distinctions, they will remain impressive conversationalists and unreliable coordinators. They will be able to talk about the Lego structure beautifully while still putting the green block in the wrong place.

And in business, the green block is usually a budget, a compliance assumption, or a production deadline. Adorable, until it falls on someone.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha, James Pustejovsky, and Nikhil Krishnaswamy, “Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry,” arXiv:2603.05450, 2026. https://arxiv.org/abs/2603.05450