Directions are easy when both people see the same room.
“Move left.” “Go toward the table.” “The apple is beside the sofa.” These are perfectly reasonable instructions if speaker and listener share the same visual world. They become less reasonable when one of them is staring at a wall, cannot see the table, and has no reason to believe the sofa exists. At that point, the problem is no longer navigation. It is epistemology, with furniture.
That is the useful lesson in “Emergence: Overcoming Privileged Information Bias in Asymmetric Embodied Agents via Active Querying,” a paper by Shaun Baek, Sam Liu, and Joseph Ukpong.[1] The paper studies a simple but sharp failure mode in embodied LLM collaboration: one agent knows too much, the other sees too little, and the team fails because the knowledgeable agent explains the world from its own perspective.
The obvious interpretation would be: make the Leader smarter. Give better instructions. Push more detail. Add another paragraph of guidance, because apparently that is what our industry does when systems fail.
The paper’s more interesting result points in the opposite direction. The key mechanism is not more Leader speech. It is more Follower doubt.
The failure mechanism is not ignorance; it is ungrounded obedience
The paper sets up a Leader-Follower pair inside AI2-THOR, using ManipulaTHOR environments. Both agents share the same high-level goal, such as finding an object in a room. But they do not share the same sensory state.
The Leader has privileged access to the global scene: object positions, layout information, and target location. The Follower has a restricted local view, limited by an egocentric field of view and a 2-meter visibility handicap. The Leader acts as global planner. The Follower acts as local executor.
That asymmetry creates a small but poisonous gap:
Leader sees target and layout
↓
Leader gives instruction
↓
Follower checks local view
↓
If grounded: execute
If ungrounded: ask
If ungrounded but not questioned: collide, wander, or fail
This is why the paper’s chosen problem is better than another generic “LLMs are bad at robotics” demonstration. The issue is not that the model cannot talk about rooms. It can. The issue is that its talk can be correct in one agent’s world and useless in another’s.
A Leader instruction like “move left” can be semantically clear and operationally wrong. “Left” relative to whom? From which camera orientation? With which obstacle in front? The Leader’s instruction may be grounded in the Leader’s global state, while the Follower receives it as a command in a different local frame.
That is the curse of knowledge, translated into robot motion. The system does not fail because nobody knows where the apple is. It fails because one agent knows where the apple is and forgets that the other agent does not.
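To make the reference-frame problem concrete, here is a minimal sketch of how the same target can be "left" for one heading and "right" for another. It assumes a flat 2D world and a heading measured clockwise from north; the function name `to_follower_frame` and the 45-degree sector boundaries are illustrative choices, not something taken from the paper or from AI2-THOR.

```python
import math


def to_follower_frame(target_xy, follower_xy, follower_heading_deg):
    """Describe a world-frame target relative to the Follower's own heading.

    Assumption (hypothetical convention): heading is measured clockwise
    from the +y ("north") axis, so 0 = facing north and 90 = facing east.
    """
    dx = target_xy[0] - follower_xy[0]
    dy = target_xy[1] - follower_xy[1]
    # Bearing of the target in the world frame, clockwise from north.
    world_bearing = math.degrees(math.atan2(dx, dy))
    # Signed angle from the Follower's heading, normalized to [-180, 180).
    relative = (world_bearing - follower_heading_deg + 180) % 360 - 180
    if abs(relative) <= 45:
        side = "ahead"
    elif abs(relative) >= 135:
        side = "behind"
    elif relative > 0:
        side = "right"
    else:
        side = "left"
    return side, round(math.hypot(dx, dy), 2)


# The target sits 2 m to the west of the Follower.
facing_north = to_follower_frame((-2, 0), (0, 0), 0)
facing_south = to_follower_frame((-2, 0), (0, 0), 180)
print(facing_north, facing_south)
```

A Leader that says "move left" without knowing the Follower's heading is right in the first case and wrong in the second, even though nothing about the room has changed.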
What the experiment actually tests
The paper evaluates three conditions over a fixed benchmark of 100 object-navigation tasks sampled from 1,320 generated candidates. The task is deliberately simple: navigate to a target object from a random spawn point. The design filters out impossible or trivial episodes by checking reachability and excluding cases where the goal is too close.
The point of the setup is not to beat specialized navigation systems. The paper is explicit that these are zero-shot LLM-driven embodied agents, not agents trained through reinforcement learning or imitation learning over many environment interactions. The low absolute success rates are therefore part of the interpretation, not an embarrassing footnote to hide under the carpet.
The three evaluated conditions are:
| Condition | What it is meant to test | Reported success rate | Interpretation |
|---|---|---|---|
| Baseline Agent, solo full perception | Zero-shot performance with normal sensory access | 16.0% | The LLM agent has semantic competence but weak procedural navigation |
| Handicapped Agent, solo limited perception | The cost of restricted perception | 11.0% | Losing distant landmarks produces a measurable sensory tax |
| Two-Agent Dyad, Leader view | Whether the privileged Leader can locate the target | 35.0% | The Leader often knows enough to form a feasible plan |
| Two-Agent Dyad, Follower view | Whether the team can execute that plan through communication | 17.0% | Much of the Leader’s knowledge fails during transmission |
The important comparison is not only solo handicapped versus assisted Follower, although that matters. The assisted Follower reaches 17.0% success, improving over the solo handicapped agent’s 11.0%. So guidance helps.
The sharper comparison is Leader view versus Follower view inside the dyad: 35.0% versus 17.0%. The Leader can perceive or navigate to the target in 35% of episodes, but the collaborative team only succeeds in 17%. That 18-point gap is the paper’s central diagnostic object.
In plain language: in many cases, the team has enough information somewhere in the system, but the information is not converted into instructions the limited agent can verify and use.
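The two comparisons above are simple arithmetic on the reported rates, and a short sketch keeps them apart. The variable names are illustrative, and the final ratio is a derived figure, not one the paper reports; it compares aggregate rates and says nothing about which individual episodes overlap.

```python
# Success rates reported in the table above (percent of the 100-task benchmark).
rates = {
    "solo_full": 16.0,
    "solo_handicapped": 11.0,
    "dyad_leader": 35.0,
    "dyad_follower": 17.0,
}

# How much Leader guidance helps the sensor-limited agent (percentage points).
guidance_gain = rates["dyad_follower"] - rates["solo_handicapped"]

# How much of the Leader's success is lost between knowing and doing.
transmission_gap = rates["dyad_leader"] - rates["dyad_follower"]

# Team success as a fraction of Leader success (aggregate, derived figure).
transfer_ratio = rates["dyad_follower"] / rates["dyad_leader"]

print(guidance_gain, transmission_gap, round(transfer_ratio, 2))
```

Guidance buys 6 points; transmission loses 18. On aggregate, team success is a little under half of what the Leader achieves from its own view.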
This distinction matters. If the Leader never found the target, the problem would be planning. If the Follower simply ignored correct commands, the problem would be compliance. Instead, the paper isolates a more operationally relevant failure: the plan exists, the instruction is given, the Follower obeys, and the team still fails.
Very modern. Very automated. Very familiar.
More instructions did not solve the problem
The paper’s strongest mechanism-level evidence comes from the communication analysis. This is the main diagnostic evidence, not decorative logging.
Successful episodes and failed episodes had roughly the same amount of Leader “push” communication:
| Communication metric | Successful episodes | Failed episodes | What it suggests |
|---|---|---|---|
| Average Leader instructions | 24.41 | 25.99 | More Leader commands did not distinguish success from failure |
| Average Follower queries | 2.00 | 0.99 | Successful episodes involved about twice as much active querying |
This is the result that should make product teams uncomfortable. The Leader was not too quiet in failures. If anything, failed episodes had slightly more Leader instructions. The differentiator was the Follower’s willingness to ask.
The paper frames this as a push-versus-pull problem. In the push protocol, the Leader broadcasts instructions from its privileged state, and the Follower tries to execute. In the pull protocol, the Follower checks whether the instruction is grounded in its local observation. If the instruction references something unobservable or ambiguous, the Follower asks for clarification.
That small change has a large conceptual consequence. The Follower is no longer merely an actuator. It becomes a local verifier.
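The difference between the two protocols can be sketched in a few lines. This is an illustration under simplifying assumptions, not the paper's implementation: grounding is reduced to "every referenced object is currently visible," and the function names and the wording of the clarification question are hypothetical.

```python
def push_step(instruction, referenced_objects, visible_objects):
    """Push protocol: the Follower executes whatever it receives."""
    return {"action": "execute", "instruction": instruction}


def pull_step(instruction, referenced_objects, visible_objects):
    """Pull protocol: execute only if every referenced object is visible.

    Assumption: visibility of referenced objects stands in for grounding.
    """
    missing = sorted(set(referenced_objects) - set(visible_objects))
    if missing:
        question = (
            f"I cannot see {', '.join(missing)}. "
            "Where is it relative to something I can see?"
        )
        return {"action": "query", "question": question}
    return {"action": "execute", "instruction": instruction}


# The Follower is facing a wall; the Leader refers to a table it cannot see.
instruction = "move toward the table"
print(push_step(instruction, ["table"], ["wall"]))
print(pull_step(instruction, ["table"], ["wall"]))
```

In the push case the Follower walks into the wall with full confidence. In the pull case the same instruction produces a question, which is the cheaper failure.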
This is the part business readers should not miss. In many real AI deployments, the “follower” role is treated as a low-status execution layer: a robot arm, a warehouse picker, a field-service copilot, a junior operations agent, a workflow automation step. It receives a plan and is expected to carry it out. But in asymmetric environments, execution without verification is not discipline. It is a collision policy.
The paper’s mechanism suggests a different architecture: the downstream agent must be allowed, and sometimes required, to interrupt.
The qualitative failures explain why the numbers are believable
The paper includes qualitative error analysis and appendix interaction logs. Their purpose is not to establish the result statistically; the tables already do that, within their modest limits. The logs explain the anatomy of the failure.
In one representative failure, the Leader sees the target and issues a command like “move left” or “move forward.” The Follower, however, sees a wall or cabinet. Instead of challenging the instruction, it executes. The result is collision or repeated invalid movement.
The failure is not dramatic. No science-fiction rebellion. No evil robot. No philosophical uprising against human authority.
Just bad reference frames.
This is exactly why the study’s mechanism-first reading is more useful than a benchmark-first reading. If we only summarize the success rates, the paper sounds like another small embodied-AI benchmark with low performance. If we inspect the failure mechanism, it becomes a design warning: a system can contain correct information and still fail because the handoff format assumes shared perception.
The successful logs show the opposite pattern. The Follower says, in effect: I do not see the thing you are referencing. Which way is it relative to what I can see? That query forces the Leader to re-ground the instruction using a local landmark or relative orientation the Follower can verify.
So the real unit of collaboration is not “instruction.” It is “instruction plus belief-state check.”
The ablation tests horizon sensitivity, not a second thesis
The paper also includes a 60-step ablation study. This is best read as a robustness and sensitivity test: perhaps the original 30-step episode limit made agents fail simply because they ran out of time. If that were true, the communication story would be weaker.
The authors rerun 91 failed tasks with the horizon extended from 30 to 60 steps. Reported success improves:
| Agent | 30-step success rate | 60-step success rate | Relative improvement |
|---|---|---|---|
| Leader | 28.6% | 34.1% | +19.2% |
| Handicapped | 8.8% | 14.3% | +62.5% |
At first glance, that looks like a simple answer: give the agent more time. But the paper’s granular analysis complicates that interpretation. Among seven newly successful Handicapped episodes, only one actually required more than 30 steps. The other six succeeded within 30 steps during the rerun.
That means the extended horizon is not merely revealing “longer but valid” trajectories. It is also exposing stochastic instability in zero-shot LLM navigation. The same task can fail once and succeed later, not because the agent discovered a systematically better long-horizon strategy, but because the run unfolded differently.
This ablation matters because it limits how aggressively we should interpret the main results. Some failures are not purely communication failures. Some are instability, weak spatial memory, or the brittleness of post-hoc embodiment, where a language model receives flattened text descriptions of a 3D world and tries to act as if that is enough.
Still, the ablation does not erase the push-pull result. It mainly tells us not to pretend the communication mechanism is the only bottleneck.
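The relative-improvement column in the ablation table is worth recomputing, because it hides a detail. The sketch below uses only the rates reported in the table; the helper name `relative_improvement` is illustrative.

```python
def relative_improvement(before_pct, after_pct):
    """Relative change, in percent, between two success rates."""
    return (after_pct - before_pct) / before_pct * 100


ablation = {
    "Leader": (28.6, 34.1),
    "Handicapped": (8.8, 14.3),
}

for agent, (before, after) in ablation.items():
    absolute = after - before                      # percentage points
    relative = relative_improvement(before, after)  # percent of the baseline
    print(f"{agent}: +{absolute:.1f} points, +{relative:.1f}% relative")
```

Both agents gain the same 5.5 percentage points. The Handicapped agent's +62.5% looks far larger than the Leader's +19.2% only because it starts from a much lower base, which is one more reason to read the relative figures cautiously.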
The business lesson is verification rights, not nicer instructions
The practical value of the paper is not that Gemini 2.5 Flash can navigate AI2-THOR rooms. That would be an odd hill to monetize.
The business value is the design principle: in human-AI and robot-robot collaboration, the less-informed executor should not be optimized for obedience alone. It should be optimized for grounded compliance.
That phrase is less catchy than “agentic AI,” but more useful.
| What the paper directly shows | Cognaptus business inference | Boundary |
|---|---|---|
| A sensor-limited Follower performs better with Leader guidance than alone | Remote guidance can compensate for limited local perception | The improvement is modest and simulation-only |
| Leader knowledge does not fully transfer to Follower execution | Supervisory AI needs translation into the executor’s local context | The study does not validate this in real robots |
| Successful episodes have about twice as many Follower clarification queries | Querying should be treated as productive work, not hesitation | Query thresholds require domain-specific calibration |
| More Leader instructions do not distinguish success from failure | Adding more directives may increase noise without improving grounding | The result comes from a small 100-task benchmark |
| Horizon extension improves results but reveals stochasticity | Longer recovery windows may help, but do not replace better grounding | More time is not the same as better spatial reasoning |
For assistive robotics, the implication is obvious. A home robot helping an elderly user should not blindly execute “bring me the cup from the table” if the table is not visible, if the cup may refer to multiple objects, or if the path is blocked. It should ask a targeted question.
For warehouse automation, the same principle applies to pick-and-place tasks. “Move to rack B” is not enough if the robot’s local map has occlusion, aisle changes, temporary obstacles, or localization drift. The executor should not only receive the task. It should verify whether the task description matches its local state.
For field-service copilots, the Follower may be a human technician wearing AR glasses or using a mobile assistant. The remote expert, or AI supervisor, may know the schematic but not the technician’s exact viewpoint. If the copilot tells the technician “remove the panel on the left,” and the technician sees three panels, the correct behavior is not confident instruction repetition. It is disambiguation.
For workflow automation, the same pattern appears without robots. A central planning agent may know the business objective, while a downstream tool agent sees only a partial API response, a restricted database view, or a subset of user permissions. If the downstream agent cannot verify that the instruction is grounded in its accessible state, it should not improvise silently.
That is the hidden operational shift: active querying is not a UX inconvenience. It is a control layer.
How to design for productive doubt
The paper’s language around “Epistemic Anxiety” is useful because it names a design property that many AI systems currently lack. An agent should know when its instruction is under-grounded. More precisely, it should have a policy for detecting when an instruction references information outside its observable state.
A practical implementation does not need to be mystical. It can begin with three checks.
First, reference checks. Does the instruction mention an object, location, file, customer, API field, screen element, or physical landmark that the executor can currently observe? If not, ask.
Second, frame checks. Does the instruction depend on a direction, coordinate system, identity label, or relative relationship that may differ between Leader and Follower? If yes, translate or verify.
Third, risk checks. What is the cost of executing the instruction if the grounding is wrong? If the cost is low, the agent may proceed and recover. If the cost is high, the agent should pause before acting. A warehouse robot and a spreadsheet formatting bot should not share the same doubt threshold. One breaks inventory. The other breaks fonts. Admittedly, both can ruin a day, but not equally.
The design target is not to make agents annoyingly hesitant. The target is to make them selectively skeptical.
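The three checks can be sketched as a single decision function. Everything here is an assumption made for illustration: the `Instruction` fields, the 0-to-1 `error_cost` estimate, the 0.5 threshold, and the order in which the checks fire. A real system would need its own way to extract references and estimate cost.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    text: str
    referenced: frozenset      # objects, fields, or landmarks the text mentions
    uses_relative_frame: bool  # depends on "left", "behind", "next to", etc.
    error_cost: float          # hypothetical 0..1 estimate of harm if wrong


def grounding_decision(instr, observable, cost_threshold=0.5):
    """Return (decision, missing_references) for one instruction."""
    # 1. Reference check: is everything mentioned currently observable?
    missing = instr.referenced - observable
    if missing:
        return "ask", sorted(missing)
    # 2. Frame check: does the meaning depend on whose viewpoint is used?
    if instr.uses_relative_frame:
        # 3. Risk check: only interrupt when a wrong guess is expensive.
        if instr.error_cost >= cost_threshold:
            return "verify_frame", []
        return "proceed_and_monitor", []
    return "execute", []


fetch = Instruction(
    text="go to the apple beside the sofa",
    referenced=frozenset({"apple", "sofa"}),
    uses_relative_frame=True,
    error_cost=0.9,
)
print(grounding_decision(fetch, {"wall"}))
print(grounding_decision(fetch, {"apple", "sofa"}))
```

The threshold is where the warehouse robot and the spreadsheet bot part ways: same function, different `cost_threshold`.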
A simple policy could look like this:
| Situation | Bad behavior | Better behavior |
|---|---|---|
| Instruction references unseen object | Execute anyway | Ask for a visible landmark or alternative description |
| Direction depends on viewpoint | Treat “left” as universal | Confirm reference frame or convert to local orientation |
| Local observation contradicts instruction | Assume supervisor is right | Report contradiction before moving |
| Task is safety-critical | Minimize interruptions | Require explicit grounding before action |
| Repeated collision or invalid action | Retry same command | Escalate uncertainty and request re-planning |
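The last row of the table, escalating instead of retrying, can be sketched as a small loop. The `try_move` callback, the `max_failures` default of 2, and the escalation message are hypothetical; the paper does not prescribe a specific threshold.

```python
def run_with_escalation(commands, try_move, max_failures=2):
    """Execute commands in order; after `max_failures` consecutive invalid
    moves, stop and request re-planning instead of retrying again."""
    consecutive = 0
    log = []
    for cmd in commands:
        if try_move(cmd):
            consecutive = 0
            log.append(("ok", cmd))
            continue
        consecutive += 1
        log.append(("blocked", cmd))
        if consecutive >= max_failures:
            log.append(("escalate", f"'{cmd}' keeps failing; please re-plan from my view"))
            break
    return log


# Toy environment: a cabinet blocks every forward move.
def blocked_ahead(cmd):
    return cmd != "forward"


for entry in run_with_escalation(["left", "forward", "forward", "forward"], blocked_ahead):
    print(entry)
```

The third "forward" is never attempted. After two collisions the executor stops treating the command as valid and hands the problem back to the planner.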
This is not merely a robotics issue. It is an organizational design issue for AI systems. Many agentic workflows still reward completion, speed, and apparent confidence. The paper suggests another reward signal should be added: uncertainty reduction before irreversible execution.
Where the paper should not be overused
The study is useful, but its boundaries matter.
The benchmark contains 100 static navigation tasks in simulation. It does not establish real-world robot safety. AI2-THOR is valuable for controlled experiments, but real environments introduce moving humans, sensor noise, lighting variation, object deformation, hardware latency, and many other ways for elegant diagrams to meet the floor.
The architecture uses a single Gemini 2.5 Flash model in a dual-persona setup. That is experimentally convenient, but it also means the Leader-Follower separation is enforced through prompting and state construction rather than through physically separate agents with independently learned policies. This is a good testbed for information asymmetry, not a final architecture for deployment.
The absolute success rates are low. The Leader’s 35.0% success rate should not be treated as a ceiling for embodied navigation. The paper itself notes the lack of a human feasibility baseline and the difficulty of interpreting all remaining failures as reasoning failures. Some tasks may be hard within the step limit. Some failures may reflect weak spatial memory rather than communication.
The 60-step ablation also warns against a too-clean story. Extending the horizon improves outcomes, but not always because the agent simply needed more steps. Some of the improvement appears tied to run-to-run stochastic variation rather than to the longer horizon itself. That means future systems need better spatial memory and state synchronization, not just longer patience.
So the business interpretation should be disciplined. The paper does not prove that active querying alone solves embodied collaboration. It shows that, under controlled asymmetric perception, active querying is strongly associated with successful re-grounding, while more Leader instruction is not.
That is already enough to matter.
The robot should ask what it can verify
The paper’s title emphasizes privileged information bias. The simpler business phrase is: do not tell the robot only what you know. Tell it in a way that lets it check what it knows.
This is the mistake many AI systems will make as they move from chat windows into operational settings. A central model may have broad context, a supervisor may have a global dashboard, and a planning agent may generate a coherent route. But the executor lives in a smaller world: a camera frame, a local map, a database permission scope, a tool response, a user’s messy workspace.
The gap between those worlds is where failure hides.
Better agents will not be the ones that obey most beautifully. They will be the ones that can say: I cannot ground that instruction from here. Give me a landmark. Clarify the frame. Confirm the object. Re-plan from my view.
That may sound less heroic than autonomous intelligence. It is also much closer to how reliable work gets done.
Don’t tell the robot what you know. Teach it to ask what it can verify.
Cognaptus: Automate the Present, Incubate the Future.
[1] Shaun Baek, Sam Liu, and Joseph Ukpong, “Emergence: Overcoming Privileged Information Bias in Asymmetric Embodied Agents via Active Querying,” arXiv:2512.15776, 2025, https://arxiv.org/html/2512.15776.