Opening — Why this matters now
Embodied AI is having its deployment moment. Robots are promised for homes, agents for physical spaces, and multimodal models are marketed as finally “understanding” the real world. Yet most of these claims rest on benchmarks designed far away from kitchens, hallways, mirrors, and cluttered tables.
This paper makes an uncomfortable point: if you evaluate agents inside the environments they will actually operate in, much of that apparent intelligence collapses.
Background — The benchmark illusion
Public benchmarks suffer from three structural flaws.
First, contamination. With image–text overlap between training data and benchmarks exceeding 30%, high scores increasingly measure recall, not capability.
Second, abstraction. Existing task-generation methods thrive in code, GUIs, or games—domains that are tidy, symbolic, and forgiving. Real 3D environments are none of those things.
Third, scene violation. Many “embodied” benchmarks quietly inject external objects or rearrange rooms, breaking the very premise of in-situ evaluation. If you change the environment to fit the task, you are no longer testing adaptation—you are testing compliance.
Analysis — Task generation as cognition, not prompt engineering
The paper introduces TEA, a two-stage system for automatic in-situ task generation that operates directly inside unseen 3D environments.
The conceptual shift is subtle but important: tasks are no longer treated as natural-language prompts. They are formalized as graph structures consisting of:
- Vertices: objects, agents, scenes
- Edges: spatial or physical relationships
- Attributes: perceptual or semantic properties
This mirrors how humans reason about environments—entities first, relationships second, labels last.
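As a concrete illustration, here is a minimal sketch of what such a task graph could look like in code. The class and field names are hypothetical, chosen for readability; they are not TEA's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    kind: str                 # "object", "agent", or "scene"
    name: str                 # e.g. "mug", "robot", "kitchen"
    attributes: dict = field(default_factory=dict)  # perceptual or semantic properties

@dataclass
class Edge:
    source: str               # name of the source vertex
    target: str               # name of the target vertex
    relation: str             # spatial or physical relationship, e.g. "on_top_of"

@dataclass
class TaskGraph:
    vertices: list = field(default_factory=list)
    edges: list = field(default_factory=list)

# "Place the red mug on the cluttered table": entities first, relationships second, labels last
task = TaskGraph(
    vertices=[
        Vertex("agent", "robot"),
        Vertex("object", "mug", {"color": "red"}),
        Vertex("object", "table", {"state": "cluttered"}),
    ],
    edges=[Edge(source="mug", target="table", relation="on_top_of")],
)
```

Representing tasks this way makes the later evolution step mechanical: graphs can be compared, pruned, and recombined without detouring back through natural language.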
Stage 1: Agent–environment interaction
Instead of relying on pre-authored tasks, the agent explores the environment, collects multimodal data (RGB, depth, segmentation, contact), and generates tasks from its own experience. A small probability of random exploration prevents collapse into local minima.
Crucially, task execution and task generation form a closed loop. Tasks are born from interaction, not from datasets.
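A minimal sketch of that loop, assuming a hypothetical environment interface (`reset`, `step`, `random_action`) and a task-generation callback; none of these names come from the paper.

```python
import random

EPSILON = 0.1  # small probability of random exploration (illustrative value)

def stage_one(env, policy, generate_tasks, num_steps=1000):
    """Explore, observe, and generate tasks from the agent's own experience."""
    tasks = []
    obs = env.reset()                      # multimodal: RGB, depth, segmentation, contact
    for _ in range(num_steps):
        if random.random() < EPSILON:
            action = env.random_action()   # occasional random move avoids local minima
        else:
            action = policy(obs)           # otherwise act on the current observation
        obs = env.step(action)
        tasks.extend(generate_tasks(obs))  # tasks emerge from interaction, not datasets
    return tasks
```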
Stage 2: Task evolution
Once tasks exist, TEA evolves them without external assets:
- Reuse: simpler task structures inherit instances from more complex ones
- Recombination: graph components are swapped to create new task types
The result is combinatorial growth without contamination.
Across just two cycles in ten unseen scenes, the system generated 87,876 valid tasks—all physically grounded, all environment-specific.
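Continuing the hypothetical `TaskGraph` sketch above, reuse and recombination might look something like this; the functions are illustrative, not TEA's implementation.

```python
import copy

def reuse(complex_task, simple_template):
    """Let a simpler task structure inherit object instances from a richer one."""
    child = copy.deepcopy(simple_template)
    child.vertices = copy.deepcopy(complex_task.vertices)  # inherit instances, keep structure
    return child

def recombine(task_a, task_b):
    """Swap graph components between two tasks to create a new task type."""
    child = copy.deepcopy(task_a)
    child.edges = copy.deepcopy(task_b.edges)  # take the relationships from the other task
    return child
```

Both operations only rearrange material the agent has already observed, which is why the growth is combinatorial without importing outside assets.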
Findings — What the models actually fail at
The paper introduces the Maximum Independent Ratio (MIR) to quantify task diversity. TEA dramatically increases MIR compared to baseline VLM task generation, evidence that it reduces redundancy rather than amplifying it.
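The paper's exact formula for MIR is not reproduced here, but one plausible reading, assumed purely for illustration, is the fraction of tasks that survive once near-duplicates are removed: an (approximate) maximum independent set in a task-similarity graph divided by the total task count.

```python
def approx_mir(tasks, similar):
    """Greedy approximation: keep a task only if it is not similar to any kept task.

    `tasks` is any list of task representations; `similar(a, b)` is a user-supplied
    near-duplicate test. Both are assumptions for this sketch, not the paper's API.
    """
    kept = []
    for t in tasks:
        if all(not similar(t, k) for k in kept):
            kept.append(t)
    return len(kept) / len(tasks) if tasks else 0.0
```

Under this reading, a higher MIR means a larger share of the generated tasks are genuinely distinct.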
But the real shock comes from evaluation.
1. Basic perception is still broken
Object classification and localization show the largest gaps between humans and state-of-the-art models. These are not exotic reasoning tasks. They are foundational perceptual skills.
High-level reasoning progress has not repaired low-level vision.
2. 3D awareness is fragile
Navigation tasks expose systematic failures:
- Models chase mirror reflections
- Agents terminate searches when targets leave the initial field of view
- Movement often takes the agent farther from the target instead of closer
In short: models see images, not spaces.
3. Reasoning generalizes poorly
Models excel at relationship detection—a task well represented in training data—but collapse on novel embodied reasoning such as egocentric “object in view” checks. Reasoning ability is distribution-dependent, not general.
4. Humans fail differently
Interestingly, humans underperform models on mirror-counting tasks—not due to perception, but due to over-generalization. Humans ignore artificial constraints; models obey them literally. Intelligence cuts both ways.
Implications — Deployment before evaluation is malpractice
The takeaway is blunt: public benchmarks systematically overestimate real-world readiness.
If agents are evaluated only on curated datasets, failures will emerge after deployment—inside homes, hospitals, and factories. In-situ evaluation is not a luxury; it is a prerequisite.
For businesses and regulators, this suggests three shifts:
- Treat environment-specific evaluation as mandatory
- Value task diversity over leaderboard scores
- Audit perception before reasoning
Conclusion — Reality is the hardest benchmark
TEA demonstrates that meaningful evaluation does not require more data—it requires better grounding. When tasks emerge from the environment itself, illusions vanish quickly.
Embodied AI will not fail because it lacks scale. It will fail because it was never tested where it was meant to live.
Cognaptus: Automate the Present, Incubate the Future.