Opening — Why this matters now

Large Language Models are very good at knowing. They are considerably worse at helping.

As AI systems move from chat interfaces into robots, copilots, and assistive agents, collaboration becomes unavoidable. And collaboration exposes a deeply human cognitive failure that LLMs inherit wholesale: the curse of knowledge. When one agent knows more than another, it tends to communicate as if that knowledge were shared.

In embodied environments—where perception is partial, viewpoints differ, and walls exist—this bias is no longer academic. It is the difference between success and collision.

The paper behind this article puts a number on that failure. And the number is uncomfortably large.

Background — From symbol grounding to social grounding

LLMs excel at symbolic reasoning but operate, as critics like to say, as brains in jars. They describe the world fluently without inhabiting it. Embodied AI research has tried to fix this by dropping models into simulators like AI2-THOR, Habitat, and VirtualHome.

But most embodied benchmarks quietly assume something convenient: shared perception. Agents see roughly the same world, so instructions map cleanly onto actions.

Reality does not work like that.

Humans, robots, copilots, and assistive systems operate with asymmetric information. Someone sees the obstacle; someone else does not. Someone knows where the target is; someone else faces a wall. This is not a perception problem. It is a communication problem.

The paper reframes symbol grounding as social grounding—the ability to reason about what your partner can and cannot perceive.

Analysis — The Leader–Follower experiment

The authors construct a deliberately asymmetric setup inside AI2-THOR:

  • Leader: full global vision, complete object map, acts as planner.
  • Follower: severely limited vision (2m range, 90° FOV), executes actions.

Both are powered by the same LLM, split into two personas with strictly separated knowledge. The task is simple: object-goal navigation (“Find the Apple”).
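
A minimal sketch of how such a knowledge split can be encoded as two personas over one model. The prompt wording and data structures below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of the asymmetric knowledge split between the two
# personas. The prompt wording and data structures are assumptions, not the
# paper's implementation.

LEADER_CONTEXT = {
    "role": "Leader",
    "vision": "global top-down map of the scene",
    "object_map": {"Apple": (3.2, 0.9, -1.5)},  # example world coordinate
}

FOLLOWER_CONTEXT = {
    "role": "Follower",
    "vision": {"range_m": 2.0, "fov_deg": 90},  # severely limited perception
    "object_map": {},  # knows only what it has personally observed
}

def build_prompt(context, task="Find the Apple"):
    """Expose only this persona's knowledge to the shared LLM."""
    return (
        f"You are the {context['role']}. Task: {task}.\n"
        f"Your perception: {context['vision']}.\n"
        f"Objects you know about: {context['object_map']}.\n"
    )
```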

Two communication protocols are tested:

  • Push: the Leader issues instructions unilaterally. Characteristic failure mode: egocentric bias.
  • Pull: the Follower verifies instructions and queries ambiguities. Characteristic failure mode: higher cognitive load.
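
To make the contrast concrete, here is a rough sketch of the two loops. The `llm` and `execute` callables and the message formats are placeholders of my own, not the paper's code; the point is only where the clarification step sits.

```python
# Rough sketch of the two protocols. The `llm` and `execute` callables and the
# message formats are placeholders, not the paper's code.

def push_episode(llm, execute, max_steps=20):
    """Push: the Leader broadcasts instructions; the Follower simply complies."""
    for _ in range(max_steps):
        instruction = llm("Leader: issue the next navigation instruction.")
        observation = execute(instruction)  # the Follower acts without pushback
        if "target_found" in observation:
            return True
    return False

def pull_episode(llm, execute, max_steps=20):
    """Pull: the Follower checks each instruction against its local view first."""
    for _ in range(max_steps):
        instruction = llm("Leader: issue the next navigation instruction.")
        check = llm(f"Follower: can you ground '{instruction}' in what you currently "
                    "see? If not, reply starting with 'QUESTION:'.")
        if check.startswith("QUESTION:"):
            instruction = llm(f"Leader: the Follower asks: {check} Rephrase the instruction.")
        observation = execute(instruction)
        if "target_found" in observation:
            return True
    return False
```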

The key difference is not intelligence, planning depth, or reasoning chains. It is whether the less-informed agent is allowed—encouraged, even—to say: this doesn’t make sense.

Findings — Quantifying the Success Gap

The results expose what the paper calls the Success Gap.

Success rates by condition:

  • Solo agent, full vision: 16%
  • Solo agent, limited vision: 11%
  • Leader alone (can see the target): 35%
  • Leader–Follower team: 17%

Read that again.

A Leader operating alone, with full vision of the scene and the target, succeeds in 35% of episodes. The Leader–Follower team succeeds in only 17%. Nearly half of the plans the Leader could execute on its own fail purely in communication.
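
A quick back-of-envelope check, treating the Leader's solo success rate as the pool of feasible plans (the assumption the article's framing rests on):

```python
# Back-of-envelope check of the Success Gap, using the reported rates and
# treating the Leader's solo success rate as the pool of feasible plans.
leader_alone = 0.35  # Leader alone, full vision
team = 0.17          # Leader-Follower team

lost_in_communication = (leader_alone - team) / leader_alone
print(f"{lost_in_communication:.0%} of feasible plans fail in translation")  # ~51%
```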

Why?

Because the Leader speaks from its own frame of reference:

“Move left.”

Left relative to whom? The Leader’s camera? The Follower’s orientation? The wall directly in front of the Follower has opinions.
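
The ambiguity is easy to make concrete. In the sketch below (the headings and the "left equals a 90-degree turn" convention are assumed for illustration), the same word resolves to opposite world directions depending on whose heading it is bound to.

```python
# Illustrative only: the headings and the "left = +90 degrees" convention are
# assumptions, not the paper's coordinate system.

def resolve_direction(relative_turn_deg, agent_heading_deg):
    """Bind a relative command (e.g. 'left') to a world-frame heading."""
    return (agent_heading_deg + relative_turn_deg) % 360

LEFT = 90  # treat "left" as a 90-degree turn relative to the speaker

leader_heading = 0      # the Leader's global map view faces heading 0
follower_heading = 180  # the Follower happens to face the opposite way

print(resolve_direction(LEFT, leader_heading))    # 90
print(resolve_direction(LEFT, follower_heading))  # 270
# The same instruction points at opposite walls depending on whose frame
# "left" is resolved in.
```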

The diagnostic metrics are brutal and clean:

  • Leader instructions per episode: ~25 in successful episodes vs. ~26 in failed ones
  • Follower queries per episode: 2.0 in successful episodes vs. 0.99 in failed ones

More talking does nothing. More questioning does: successful episodes contain roughly twice as many Follower queries as failed ones.

Implications — Why obedience is a bug

The paper’s most uncomfortable conclusion is this:

Failure is not caused by disobedience, but by obedience without grounding.

The Follower almost always complies. It just complies with instructions that are meaningless in its local reality. Walls are hit not because agents are stupid, but because they are polite.

This has immediate consequences for real-world systems:

  • Assistive robots following human instructions
  • Industrial cobots operating under occlusion
  • AI copilots executing partial-context commands

Instruction-following is not safety. Epistemic friction is safety.

The authors argue for what amounts to institutionalized doubt: agents must be rewarded for hesitation, clarification, and even refusal when instructions cannot be grounded.
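
One way to operationalize that institutionalized doubt is a grounding gate in front of the executor: the Follower commits to an action only if the instruction can be bound to its local observation, and otherwise returns a question. The sketch below is a minimal illustration; the observation format and keyword matching are assumptions, not the paper's mechanism.

```python
# Minimal sketch of a grounding gate: comply only when the instruction can be
# bound to the local observation, otherwise hesitate and ask. The observation
# format and the keyword matching below are assumptions for illustration.

def ground_or_ask(instruction: str, visible_objects: set, path_clear: bool):
    """Return ('act', instruction) when groundable, else ('ask', question)."""
    mentioned = {obj for obj in visible_objects if obj.lower() in instruction.lower()}

    if not path_clear:
        return ("ask", "There is a wall directly in front of me. Which way around it?")
    if not mentioned:
        return ("ask", f"I cannot see anything matching '{instruction}'. "
                       "Can you describe it relative to my current view?")
    return ("act", instruction)

# The Leader says "Move toward the Apple", but the Follower sees only a wall.
print(ground_or_ask("Move toward the Apple", visible_objects=set(), path_clear=False))
# -> ('ask', 'There is a wall directly in front of me. Which way around it?')
```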

Conclusion — Collaboration is negotiated, not broadcast

The study delivers three clean insights:

  1. Knowing is not enough — half of viable plans die in translation.
  2. Push-based instruction scales failure — verbosity does not fix grounding.
  3. Pull-based querying restores performance — uncertainty reduction is the mechanism, not intelligence.

True embodied intelligence does not emerge when agents agree faster. It emerges when they disagree productively.

Or, put differently: the safest robot is not the one that listens best—but the one that asks the most inconvenient questions.

Cognaptus: Automate the Present, Incubate the Future.