Benchmarks Lie, Rooms Don’t: Why Embodied AI Fails the Moment It Enters Your House

The room is not impressed by your leaderboard

A robot that performs well on a public benchmark has not necessarily learned how to operate in your house.

It may recognize a chair in a dataset. It may answer a visual question about a tidy image. It may even produce a confident paragraph explaining where the coffee mug should be. Then it enters a real room — with mirrors, partial views, cluttered corners, awkward sightlines, and objects that are not positioned for benchmark convenience — and suddenly the “general intelligence” starts behaving like a tourist holding the map upside down.

That is the useful irritation behind the paper “Automatic Cognitive Task Generation for In-Situ Evaluation of Embodied Agents.”¹ The paper introduces TEA, a two-stage framework for automatically generating cognitive tasks inside unseen 3D environments. The point is not merely to create more test questions. We already have enough artificial exams for machines to overfit, memorize, or flatter.

The point is sharper: if an embodied agent is meant to work in a specific home, warehouse, hospital room, office, or inspection site, then evaluation should begin inside that environment, not on a public benchmark assembled somewhere else.

This sounds obvious. Naturally, that is why the industry often skips it.

TEA treats a task as a graph, not a sentence

The paper’s central move is to stop treating embodied tasks as loose natural-language prompts.

A sentence like “find the red table” looks simple, but it hides several different cognitive requirements. The agent must identify objects, understand color attributes, reason over spatial position, connect its egocentric view to the room layout, and decide whether movement is needed. Natural language compresses all of this into a phrase. That compression is convenient for humans and dangerously vague for evaluation.

TEA formalizes a task as a graph structure:

Graph element	Cognitive meaning	Example in an embodied task
Vertex	Entity to process	object, agent, room, view
Edge	Relation among entities	left of, inside, reflected by, visible from
Attribute	Information attached to entities or relations	color image, label, position, depth, category

This matters because graph structure makes tasks decomposable and reusable. An object-detection task and an object-classification task are not unrelated prompts. One contains structure that can be reused by the other. A search task conditioned on a label can be transformed into a search task conditioned on an image. A visual relationship task can provide object instances for simpler perception tasks.

That is the mechanism behind TEA’s practical value. It does not just ask a model to invent questions. It creates a structured representation of what a task requires, then uses that structure to generate, filter, reuse, and recombine tasks inside the target environment.
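To make the vertex, edge, and attribute vocabulary concrete, here is a minimal sketch of how "find the red table" might look as a graph. The class names, relation names (`has_view`, `visible_from`), and task type string are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Vertex:
    """An entity the agent must process (object, agent, room, view)."""
    name: str
    kind: str
    attributes: tuple = ()  # e.g. (("color", "red"), ("category", "table"))


@dataclass(frozen=True)
class Edge:
    """A relation between two vertices, e.g. 'visible_from'."""
    source: str
    relation: str
    target: str


@dataclass
class TaskGraph:
    """Hypothetical container for one task; not the paper's data model."""
    task_type: str
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_vertex(self, vertex):
        self.vertices[vertex.name] = vertex

    def add_edge(self, source, relation, target):
        # Refuse relations that point at entities the task does not contain.
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(source, relation, target))

    def requirements(self):
        """List the distinct relations the agent has to reason about."""
        return sorted({e.relation for e in self.edges})


def build_find_red_table():
    g = TaskGraph("label_driven_navigation")
    g.add_vertex(Vertex("agent", "agent"))
    g.add_vertex(Vertex("view", "view"))
    g.add_vertex(Vertex("table", "object", (("category", "table"), ("color", "red"))))
    g.add_edge("agent", "has_view", "view")
    g.add_edge("table", "visible_from", "view")
    return g


graph = build_find_red_table()
print(graph.requirements())
```

Once the sentence is unpacked this way, the hidden requirements (an egocentric view, a visibility relation, a color attribute) become things a test harness can enumerate rather than guess.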

The business translation is simple: the evaluation unit is no longer “a benchmark item.” It is a local cognitive requirement tied to a real deployment space.

Stage one: let the agent walk before you test it

TEA’s first stage is agent–environment interaction.

Instead of starting from a prewritten task set, the agent interacts with the environment and collects data: RGB images, depth images, segmentation views, object labels, positions, and 3D bounding boxes. The system then uses this collected information to generate task instances.

The important detail is the loop. The agent does not merely observe the room once and freeze the data. It executes tasks, receives new scene information, generates additional tasks, filters them, and continues. A small amount of random walking is included to prevent the agent from repeatedly sampling the same obvious areas.
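The loop can be sketched in a few lines, under heavy simplification. Everything here is a stand-in: `SCENE` is a hypothetical toy room map, `observe` replaces real sensor capture, the only generated task type is classification, and filtering is reduced to dropping exact duplicates. The `walk_prob` parameter is an assumed knob for the "small amount of random walking."

```python
import random

# Hypothetical toy scene: each position exposes some object labels.
SCENE = {
    "hall": ["mirror", "coat rack"],
    "kitchen": ["mug", "table"],
    "bedroom": ["bed", "lamp", "mirror"],
    "study": ["desk", "chair"],
}


def observe(position):
    """Stand-in for collecting RGB, depth, segmentation and labels at a position."""
    return [(position, label) for label in SCENE[position]]


def generate_tasks(observations):
    """Turn each new observation into a simple 'classify the object' task."""
    return [("classify", position, label) for position, label in observations]


def interaction_loop(start, steps, walk_prob=0.2, seed=0):
    rng = random.Random(seed)
    position = start
    seen, tasks = set(), []
    for _ in range(steps):
        # Keep only scene information that has not been collected before.
        new_obs = [o for o in observe(position) if o not in seen]
        seen.update(new_obs)
        tasks.extend(generate_tasks(new_obs))
        unvisited = [p for p in SCENE if all(pos != p for pos, _ in seen)]
        if unvisited and rng.random() >= walk_prob:
            position = unvisited[0]  # task-driven move toward unseen data
        else:
            position = rng.choice(list(SCENE))  # occasional random walk
    return tasks


tasks = interaction_loop("hall", steps=50)
print(len(tasks), "tasks generated from exploration alone")
```

The point of the sketch is the shape of the loop: the task set is a by-product of exploration, not an input to it.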

This solves a quiet but serious problem in embodied evaluation: new environments often do not come with initial tasks.

A public benchmark gives you a clean test set. A real apartment does not. A warehouse does not. A hospital room does not wake up in the morning and politely hand your robot a balanced set of perception, navigation, and spatial-reasoning questions.

TEA’s interaction stage is designed for precisely that missing starting point. It can begin from exploration and build the task set from what the environment actually contains.

There is also a filtering mechanism. Without filtering, task generation can explode into redundant variants. TEA computes similarity using multimodal embeddings across image, command prompt, and label information, then uses clustering to keep representative tasks. This is not decorative engineering. It is the difference between “we generated many tasks” and “we generated tasks that are not mostly duplicates wearing different hats.”
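A rough sketch of the filtering idea follows. It assumes each task already has image, prompt, and label vectors from some embedding model (the toy vectors below are invented), concatenates them, and uses greedy threshold clustering as a simple stand-in for whatever clustering method the paper actually uses. The `threshold` value is an arbitrary assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def task_embedding(image_vec, prompt_vec, label_vec):
    """Concatenate per-modality vectors into one task embedding (an assumption)."""
    return list(image_vec) + list(prompt_vec) + list(label_vec)


def filter_tasks(tasks, threshold=0.95):
    """Greedy clustering: keep the first task of each similarity cluster."""
    representatives = []
    for name, emb in tasks:
        if all(cosine(emb, kept) < threshold for _, kept in representatives):
            representatives.append((name, emb))
    return [name for name, _ in representatives]


# Toy embeddings, invented for illustration only.
tasks = [
    ("classify mug (view 1)", task_embedding([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])),
    ("classify mug (view 2)", task_embedding([0.9, 0.1], [1.0, 0.0], [1.0, 0.0])),
    ("count chairs in mirror", task_embedding([0.0, 1.0], [0.0, 1.0], [0.0, 1.0])),
]
kept = filter_tasks(tasks)
print(kept)
```

The second mug task is nearly identical to the first in every modality, so it is dropped; the mirror task survives because it shares nothing with the others.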

Stage two: evolve tasks without changing the room

Many existing task-generation methods create variety by changing the environment: adding objects, rearranging layouts, importing new assets, or perturbing existing datasets.

For training, that can be useful. For in-situ evaluation, it is suspicious.

If the goal is to test whether an agent can operate in a specific room, then modifying the room to create evaluation variety partly defeats the purpose. You are no longer testing adaptation to the target environment. You are testing performance in a synthetic variant of it. That may be interesting, but it is not the same question.

TEA’s second stage, task evolution, tries to generate more tasks without importing external environment changes. It does this through two mechanisms:

Mechanism	What it does	Why it matters
Task reuse	A simpler task inherits valid instances from a more complex task when their graph structures overlap	Existing observations become reusable evaluation material
Task recombination	Graph components of compatible semantic types are exchanged to form new task structures	New task types can be created without new assets or external datasets

For example, if a complex relationship-detection task already contains a table instance, a simpler classification task can reuse that table. If a navigation task searches for an object by label, a recombined version might search using an image-conditioned object node instead.
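The two mechanisms can be illustrated with plain dictionaries. This is a sketch under assumptions: the task layout, field names such as `semantic_type` and `condition`, and the instance values are hypothetical, and the compatibility check is reduced to comparing one type string.

```python
from copy import deepcopy

# Hypothetical task records; field names are illustrative only.
relationship_task = {
    "type": "relationship_detection",
    "nodes": {
        "subject": {"semantic_type": "object", "label": "table", "instance_id": 17},
        "object": {"semantic_type": "object", "label": "lamp", "instance_id": 4},
    },
    "edges": [("subject", "left_of", "object")],
}

label_navigation = {
    "type": "navigation",
    "nodes": {"target": {"semantic_type": "object", "condition": "label", "value": "mug"}},
    "edges": [],
}

image_query = {"semantic_type": "object", "condition": "image", "value": "mug_crop.png"}


def reuse(complex_task, simple_type):
    """Derive one simpler task per object instance found in the complex task."""
    return [
        {"type": simple_type, "nodes": {"target": deepcopy(node)}, "edges": []}
        for node in complex_task["nodes"].values()
        if node["semantic_type"] == "object"
    ]


def recombine(task, role, replacement):
    """Swap one node for another of the same semantic type, leaving the rest intact."""
    if task["nodes"][role]["semantic_type"] != replacement["semantic_type"]:
        raise ValueError("only nodes of compatible semantic type can be exchanged")
    new_task = deepcopy(task)
    new_task["nodes"][role] = deepcopy(replacement)
    return new_task


classification_tasks = reuse(relationship_task, "object_classification")
image_navigation = recombine(label_navigation, "target", image_query)
print(len(classification_tasks), image_navigation["nodes"]["target"]["condition"])
```

Neither function touches the environment. Both only rearrange observations that exploration already produced, which is the property that keeps the evaluation in situ.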

This is the paper’s most business-relevant technical idea. It says local evaluation does not have to be manually written from scratch, nor does it need to rely on contaminated benchmark data. If the environment has been explored, the system can convert that exploration into a structured testing surface.

In other words: the room becomes the dataset.

The 87,876-task number is impressive, but it is not the main point

Across 10 unseen Unreal Engine scenes and two generation loops, TEA produced 87,876 tasks. The tasks covered perception, reasoning, spatial reasoning, and interaction: object classification, localization, depth estimation, mirror counting, embodied counting, pattern counting, relationship detection, object-in-view checking, and label- or picture-driven navigation.

That number is useful, but it should not be the headline.

Large generated datasets are easy to misunderstand. A system can generate 80,000 tasks and still produce mostly redundant or trivial material. We have all seen automated content farms. They are not a scientific achievement; they are a warning label.

The paper therefore introduces Maximum Independent Ratio, or MIR, to quantify diversity. MIR measures the proportion of non-redundant tasks in a task set, using a maximum independent subset under a redundancy threshold. Higher MIR means the task set contains more independent variation rather than near-duplicates.
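As a reading aid, here is one way the description above could be computed for a small task set. It assumes two tasks count as redundant when their pairwise similarity meets or exceeds a threshold, and it finds the maximum independent subset by brute force. The paper's exact similarity measure and threshold are not reproduced here; the matrix and the value 0.8 are invented.

```python
from itertools import combinations


def max_independent_ratio(similarity, threshold):
    """Exact MIR by brute force; only feasible for small task sets."""
    n = len(similarity)
    if n == 0:
        return 0.0
    # A pair is redundant when its similarity reaches the threshold.
    redundant = {
        (i, j) for i, j in combinations(range(n), 2) if similarity[i][j] >= threshold
    }
    for size in range(n, 0, -1):  # try the largest subsets first
        for subset in combinations(range(n), size):
            if not any(pair in redundant for pair in combinations(subset, 2)):
                return size / n
    return 0.0


# Tasks 0, 1 and 2 are near-duplicates of each other; task 3 is distinct.
similarity = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.2],
    [0.9, 0.9, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
mir = max_independent_ratio(similarity, threshold=0.8)
print(mir)  # 0.5: only two of the four tasks are mutually non-redundant
```

Finding a maximum independent set is expensive in general, so a real pipeline over tens of thousands of tasks would need an approximation or a different formulation.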

In the interaction-stage analysis, TEA improves MIR compared with a version without the method. For GPT-4o-generated task subsets, the reported mean MIR rises from 0.307 ± 0.159 without TEA to 0.536 ± 0.055 in the first TEA loop and 0.676 ± 0.156 in the second loop.

That is the evidence to care about. Not “the system made many tasks,” but “the system made tasks that became less redundant under a diversity metric.”

The evolution-stage result supports the same claim from another direction. The paper reports an average MIR-e of 0.75 across scenes and models, suggesting that most evolved tasks contributed new independent elements when integrated into the existing task set.

A useful way to read the experiments is this:

Test or result	Likely purpose	What it supports	What it does not prove
MIR comparison against task generation without TEA	Ablation / method comparison	TEA reduces redundancy and improves task diversity	It does not prove the generated tasks are sufficient for every deployment domain
MIR-e for evolved tasks	Validation of recombination and reuse	Graph-based evolution adds independent task variants	It does not prove all evolved tasks are equally valuable operationally
Spatial statistics	Diagnostic / exploratory analysis	Task distributions cover space differently depending on random walk versus task execution	It does not prove real-world robotic reliability
Human verification of sampled tasks	Quality check	Generated tasks are physically reasonable and relevant to daily cognition	It does not eliminate simulation-to-reality concerns
TEA-Test model evaluation	Main evidence on agent capability	Strong models still fail on basic in-situ embodied tasks	It does not isolate every cause of failure in model architecture

This is where the paper becomes more than another benchmark paper. It is trying to show that task generation itself can be evaluated: by diversity, spatial distribution, physical validity, and downstream diagnostic power.

The human check says the tasks are not just benchmark confetti

The authors validate task quality through human evaluation on a random 10% subset of the test set. Participants judged all sampled tasks as valid, with 90.8% considered meaningful for home assistance, 84.9% closely related to daily routines, and 94.4% requiring essential cognitive faculties.

These numbers should be interpreted carefully.

They do not mean TEA has solved household robotics. They do not mean every generated task is equally important. They do not mean the benchmark fully covers the messy long tail of home operation.

They do mean the generated tasks are not obviously artificial nonsense. That matters. Automatically generated evaluation can easily drift into tasks that satisfy formal rules but have little operational meaning. TEA’s human validation suggests that, at least in the sampled setting, the system is generating tasks that people recognize as relevant to daily embodied cognition.

For business users, this is the difference between a test suite and a trivia machine.

TEA-Test shows the failure is not one failure

The second experiment samples 848 tasks to form TEA-Test and compares state-of-the-art models against a human baseline from seven participants. The paper also evaluates 100 navigation tasks on GPT-4o and o1, focusing on only two models because interactive evaluation is computationally expensive.

The results are not a single story of “models bad, humans good.” They are more useful than that. They show different failure modes across task types.

Basic perception is still not solved

Object classification and object localization show the largest performance gaps.

In object classification, the human baseline reaches 0.7448, while the best model score reported is 0.4196. The gap between the highest and lowest scores is 0.6958. In object localization, the human baseline is 0.4913, while model scores range widely, and the reported gap is 0.4445.

These are not exotic tasks. They are the kind of abilities people casually assume modern vision-language models already possess. The paper’s result is a useful correction: public multimodal success does not imply reliable low-level perception in a local 3D scene.

That distinction matters for deployment. Many agent failures that look like “bad reasoning” may actually begin as perception failures. If the system does not correctly identify or localize objects, the later reasoning chain is just a beautifully formatted error.

Models handle familiar reasoning better than embodied reasoning

The paper reports strong performance on relationship detection. The best model score reaches 0.94, and the human baseline is 0.89. That is not surprising. Visual relationship tasks are well represented in training and evaluation culture.

But performance drops on the object-in-view check, a task that asks whether an object is visible from the agent’s egocentric view. This is more specifically embodied. It requires the model to reason about the agent’s viewpoint, not just object relations in an image.

The difference is the lesson. Models may look competent when the task resembles familiar image-based reasoning. They become less impressive when the task requires spatial awareness tied to an embodied agent’s position.

A leaderboard score can hide this because it averages away the distinction. A room cannot. A room asks exactly the wrong follow-up question.

Navigation exposes 3D interaction weakness

The navigation evaluation is narrower but revealing.

Model	Navigation gain	Success rate	Step number	Target neglect rate	Lack of 3D awareness
GPT-4o	0.48	0.62	4.04	0.08	0.14
o1	0.12	0.38	9.80	0.72	0.00

The authors interpret the failures differently for the two models. o1 often neglects the target, moving away even when it has spotted it. GPT-4o more often shows deficits in 3D spatial awareness, including prematurely stopping when a target is not initially visible or moving toward mirror reflections instead of the actual object.

This is exactly the kind of distinction businesses need before deployment. “The model failed navigation” is too vague. Did it miss the object? Did it see the object but choose the wrong action? Did it confuse a reflection for the target? Did it fail to turn? Did it terminate too early?

Different failures require different mitigations. Some are perception problems. Some are planning problems. Some are memory or state-tracking problems. Some are interface and action-space problems. Treating them all as one generic “agent reliability” issue is how pilot projects become expensive theater.
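A small sketch shows what separating failure modes looks like in practice. The episode log format and the failure labels are hypothetical, and the numbers are invented for illustration. The paper's navigation gain metric is deliberately left out because its definition is not given here.

```python
from collections import Counter

# Hypothetical navigation logs; values are invented, not taken from the paper.
episodes = [
    {"success": True, "steps": 3, "failure": None},
    {"success": False, "steps": 10, "failure": "target_neglect"},
    {"success": False, "steps": 2, "failure": "stopped_early"},
    {"success": True, "steps": 5, "failure": None},
    {"success": False, "steps": 6, "failure": "chased_reflection"},
]


def failure_profile(episodes):
    """Aggregate success, effort, and per-mode failure rates over all episodes."""
    n = len(episodes)
    counts = Counter(e["failure"] for e in episodes if e["failure"] is not None)
    return {
        "success_rate": sum(e["success"] for e in episodes) / n,
        "mean_steps": sum(e["steps"] for e in episodes) / n,
        "failure_rates": {mode: count / n for mode, count in counts.items()},
    }


profile = failure_profile(episodes)
print(profile)
```

Two models with the same success rate can have very different `failure_rates`, and that difference is what determines which mitigation is worth paying for.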

Humans fail differently, which is also useful

One of the more interesting results is that humans underperform models on mirror counting.

The paper explains this through human feedback: participants often ignored the “mirror-only” constraint and counted real objects instead. In cases without visible mirrors, some still reported the actual number of objects rather than zero. Models, being more literal, followed the instruction more rigidly and therefore scored better.

This is a useful reminder that a human baseline does not mean human perfection. Humans bring assumptions. Models bring rigidity. In evaluation design, both matter.

The practical lesson is not “models are better than humans at mirrors.” Please do not build a company around that sentence. The lesson is that task design can reveal different classes of cognitive error: human over-generalization versus model brittleness.

The business value is local diagnosis, not another leaderboard

For robotics, smart-home agents, warehouse automation, AR assistants, facility inspection, and healthcare support systems, TEA points toward a different evaluation workflow.

The conventional workflow is:

1. Choose a model with good public benchmark results.
2. Run a small demo in a controlled environment.
3. Declare it promising.
4. Discover the ugly details after deployment.

TEA suggests a more disciplined workflow:

Deployment step	TEA-inspired evaluation question	Business meaning
Scan or simulate the target environment	What objects, views, spatial relations, and interaction zones exist here?	The evaluation begins from the actual operating context
Generate local cognitive tasks	What perception, reasoning, and navigation tasks does this environment naturally require?	The test suite becomes environment-specific
Filter for diversity	Are we testing independent situations or repeating near-duplicates?	Evaluation budget is spent on coverage, not noise
Reuse and recombine task graphs	Can existing observations produce additional valid tests without altering the environment?	More diagnostic breadth with less manual authoring
Compare models and humans	Which failures are model-specific, task-specific, or environment-specific?	Procurement and deployment decisions become evidence-based
Gate deployment by failure mode	Which failures are acceptable, mitigable, or disqualifying?	Risk control happens before the agent enters operations

This is the actual ROI pathway. It is not that TEA magically makes embodied AI safe. It is that environment-specific task generation can make failures visible earlier and more cheaply.
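The last row of that workflow, gating by failure mode, might be sketched as a simple policy check. The failure mode names, thresholds, and the three-way verdict are assumptions chosen for illustration. A real policy would come from the deployment's own risk analysis, not from the paper.

```python
# Hypothetical policy: maximum tolerated failure rate per failure mode.
POLICY = {
    "misclassified_target": {"max_rate": 0.05, "on_breach": "mitigate"},
    "chased_reflection": {"max_rate": 0.00, "on_breach": "disqualify"},
    "stopped_early": {"max_rate": 0.10, "on_breach": "mitigate"},
}


def gate(observed_rates, policy=POLICY):
    """Return an overall verdict plus the decision for each failure mode."""
    decisions = {}
    for mode, rule in policy.items():
        rate = observed_rates.get(mode, 0.0)  # unobserved modes count as zero
        decisions[mode] = "accept" if rate <= rule["max_rate"] else rule["on_breach"]
    if "disqualify" in decisions.values():
        verdict = "disqualify"
    elif "mitigate" in decisions.values():
        verdict = "mitigate"
    else:
        verdict = "accept"
    return verdict, decisions


verdict, decisions = gate(
    {"misclassified_target": 0.02, "chased_reflection": 0.01, "stopped_early": 0.20}
)
print(verdict, decisions)
```

The design choice worth noting is that the verdict is driven by the worst failure mode, not by an average. An average is exactly what lets a leaderboard hide the one failure that matters in a given room.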

For a warehouse, the relevant question may be whether the agent distinguishes a target box from visually similar inventory across partial views. For a hospital room, it may be whether the agent can reason about visibility, occlusion, and safe navigation around beds and equipment. For a home assistant, it may be whether the system understands mirrors, cabinet positions, clutter, and objects outside the initial camera view.

Public benchmarks cannot answer these questions with enough specificity. They were not built for your room.

What the paper directly shows, and what Cognaptus infers

It is useful to separate the evidence from the business interpretation.

Layer	Claim
What the paper directly shows	TEA can generate a large number of physically grounded in-situ tasks across 10 unseen Unreal Engine (UE) scenes, improve task diversity under MIR, and expose weaknesses in strong models on TEA-Test tasks.
What Cognaptus infers	Similar mechanisms could support environment-specific QA for embodied AI deployments, especially where manual test writing is expensive or incomplete.
What remains uncertain	The results are based on simulated UE environments and focus on perception, reasoning, and decision-making rather than full physical robot execution in uncontrolled real spaces.

That boundary is important.

The paper is not a field trial of household robots over six months. It does not prove that TEA-generated tests cover every safety-critical edge case. It does not solve sim-to-real transfer. It does not evaluate physical manipulation reliability, hardware constraints, human-agent interaction under stress, or long-horizon household routines.

What it does provide is a credible mechanism for generating local cognitive evaluations before deployment. That is already valuable because many embodied-AI projects do not fail at the final grand challenge. They fail at the supposedly basic step: seeing, locating, checking visibility, and navigating in the actual place where they are expected to work.

The uncomfortable lesson: basic tasks are not beneath advanced models

The paper is especially useful because it attacks a lazy assumption: that basic perception tasks are too simple to worry about now.

In AI discussions, attention naturally drifts upward. Reasoning. Planning. Agents. Long-horizon autonomy. Tool use. Reflection. Self-improvement. Beautiful words, all of them. Very fundable.

But an embodied agent still needs to know whether the red table is visible from where it stands.

If it cannot do that reliably, the higher-level reasoning stack becomes an elegant structure built on wet cardboard. The TEA-Test results show that failures in classification, localization, egocentric view reasoning, and navigation remain serious even for strong models.

This should change how businesses evaluate embodied AI vendors. Do not begin with the most impressive demo. Begin with the dull local tests:

Can the agent classify the objects that matter in this environment?
Can it localize them from different viewpoints?
Can it tell whether an object is actually visible or merely nearby?
Can it avoid confusing reflections, occlusions, or partial views?
Can it move closer to the target over multiple steps?
Can it explain failure without inventing a confident fairy tale?

The boring tests are not beneath the product. They are the product’s foundation.

Boundaries: simulated rooms are still not homes

The limitations are not cosmetic.

First, the experiments are conducted in Unreal Engine scenes. UE can support high-fidelity simulation and real-world scene scanning, but simulation is still not the same as a physical home, warehouse, hospital, or factory. Lighting, sensor noise, object movement, human interference, and hardware constraints can change the failure profile.

Second, the paper focuses on agents’ perception, reasoning, and decision-making within UE. It does not fully evaluate robotic execution. A model that chooses the correct target may still fail to move a real robot safely toward it. Embodiment is not just vision plus words; motors have a nasty habit of existing.

Third, TEA’s task quality depends on the environment data collected, the task-generation functions, the filtering strategy, and the available modalities. A poorly scanned environment or weak object annotation pipeline would produce weaker tests.

Fourth, the paper’s generated tasks are validated by sampled human judgment, not by exhaustive operational risk analysis. For business deployment, TEA-style evaluation should be part of a broader QA system, not the entire safety case.

These boundaries do not weaken the paper’s main argument. They clarify where it should be used. TEA is best read as a framework for pre-deployment cognitive diagnosis in target environments, not as a final certification system for real-world robotics.

Reality is the benchmark that refuses to be gamed politely

The most important contribution of TEA is not the dataset size. It is not even the specific model ranking.

The important contribution is the shift in evaluation philosophy.

Public benchmarks ask: how does this model perform on tasks we prepared in advance?

In-situ evaluation asks: what tasks does this environment naturally demand, and how does the model behave when tested there?

That second question is harder, less convenient, and much more relevant. It also makes inflated confidence harder to maintain. A model may look broadly capable on public tests and still fail inside the room where it is supposed to help.

For businesses, the lesson is blunt: do not buy embodied intelligence by leaderboard. Buy it by failure profile.

Before deploying an agent into a home, warehouse, hospital room, or inspection site, generate tasks from that environment. Test perception before reasoning. Test egocentric visibility before navigation. Test mirrors, occlusion, and partial views before celebrating “world understanding.” Then decide whether the failures are acceptable, fixable, or disqualifying.

Benchmarks can be memorized, contaminated, averaged, and marketed.

Rooms are less cooperative.

Cognaptus: Automate the Present, Incubate the Future.

¹ Xinyi He et al., “Automatic Cognitive Task Generation for In-Situ Evaluation of Embodied Agents,” arXiv:2602.05249, 2026. https://arxiv.org/abs/2602.05249
