Context Is Not a Costume: Why Strong Agents Still Fail on Contact

The agent looks ready. Then reality answers back.

The current AI-agent story is conveniently simple. Take a powerful foundation model, wrap it in tools, give it a workflow, add a polite system prompt, and call the result “ready for deployment.”

Reality, as usual, has poor manners.

Two recent arXiv papers examine very different agent settings. One studies whether multimodal AI agents can align their behavior with the cognitive age of child users. The other studies whether behavior foundation models for imitation learning can remain robust when the physical dynamics of an environment shift after training. They do not share a benchmark, a model class, or even the same deployment domain. That is precisely why they are useful together.

Their shared message is sharper than either paper alone: agent readiness is not the same thing as generic competence. A capable agent becomes deployable only when its behavior is constrained at the point of contact with a specific user, task, or environment.¹ ²

This sounds obvious until one remembers how much of the current industry still evaluates agents as if “more intelligence” were a universal solvent. More capable. More fluent. More autonomous. More tool calls. More dashboards. Fine. But the moment an agent meets a child, a confused customer, a regulated workflow, a damaged robot joint, or a slightly abnormal operating condition, average benchmark performance becomes a rather expensive ornament.

The issue is not whether the model is smart. The issue is whether it is correctly fitted.

The shared problem: deployment mismatch

The two papers expose different versions of the same failure mode.

In the child-facing agent paper, the mismatch is between the agent’s default adult-like reasoning and the developmental stage of the user. A tutor, assistant, or pediatric-facing agent that always uses abstract reasoning may look competent on a normal benchmark while being poorly matched to a seven-year-old’s vocabulary, memory span, or reasoning strategy. In that setting, “better” does not always mean “more advanced.” Sometimes better means simpler, slower, more concrete, and deliberately constrained.

In the behavior-foundation-model paper, the mismatch is between the training environment and the deployment environment. A policy inferred from nominal offline data may perform well when the physics remain familiar. Then friction changes, gravity tilts, actuator strength weakens, or contact dynamics shift. The policy still has a pretrained representation. It still has demonstrations. It still looks elegant in the paper diagram. The robot, sadly, has to walk on the actual floor.

These are not identical technical problems. They are the same operational problem wearing different clothes.

| Dimension | Cognitive-age alignment paper | Robust BFM paper | Shared lesson |
| --- | --- | --- | --- |
| Deployment mismatch | Agent reasoning does not match the user’s developmental stage | Policy inference assumes nominal dynamics but deployment dynamics shift | Generic capability is not enough |
| False shortcut | “Act like a child” prompting | Standard task inference from nominal demonstrations | Surface adaptation is brittle |
| Intervention | Skill-guided cognitive filters for language, memory, reasoning, visual reliance, and social perspective | Robust minimax task inference over dynamics uncertainty | Adaptation must be explicit |
| Evaluation target | Age-ordered behavioral trajectories, not raw score maximization | Return under perturbation, not nominal performance only | Test the shifted case |
| Main caveat | Alignment is uneven across cognitive domains and model families | Robustness depends on sufficiently rich pretrained representations | Constraints cannot rescue weak foundations indefinitely |

The important point is not that a child-facing tutor and a locomotion policy are secretly the same technology. They are not. The important point is that both papers reject the same lazy assumption: if the base model is strong enough, context will take care of itself.

It will not. Context is work.

Paper role one: user alignment is not role-play

The cognitive-age paper introduces ChildAgentEval, a WISC-inspired interactive benchmark for evaluating whether multimodal agents can produce age-appropriate cognitive behavior. The authors are careful not to frame this as a simple “can the model answer child-level questions?” problem. Their concern is developmental calibration: whether the agent’s language, memory, reasoning strategy, perceptual behavior, and error patterns change systematically across target ages.

This distinction matters. A model can answer a child’s worksheet correctly while explaining it in a way no child would naturally process. That is not teaching. That is adult cognition wearing a small backpack.

The paper compares standard age prompting with a skill-guided distillation approach. Standard prompting means telling the agent the target age. The result, according to the authors, is weak: baseline agents tend to keep maximizing correctness and show flat or irregular age trajectories. In other words, “pretend to be younger” mostly changes the costume, not the cognition.

The alternative approach is more structured. The authors extract developmental markers from age-stratified child and adolescent data, then convert those markers into cognitive skill cards and filters. These filters constrain vocabulary abstraction, working-memory load, reasoning depth, visual reliance, and social perspective. The aim is not to make the agent stupid. The aim is to make it developmentally plausible.
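
To make the idea of a filter concrete, here is a minimal sketch of what a skill card and its check might look like. It is not the authors' implementation: the `SkillCard` class, the `check_response` function, and every threshold and banned word below are hypothetical, chosen only to show that constraints can be explicit and testable rather than implied by a prompt.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillCard:
    """Hypothetical per-age constraint profile; all limits are illustrative."""
    target_age: int
    max_sentence_words: int    # crude proxy for working-memory load
    max_reasoning_steps: int   # crude proxy for reasoning depth
    banned_words: frozenset    # crude proxy for vocabulary abstraction


def check_response(text: str, steps: int, card: SkillCard) -> list:
    """Return a list of constraint violations; an empty list means the draft passes."""
    violations = []
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    longest = max((len(s.split()) for s in sentences), default=0)
    if longest > card.max_sentence_words:
        violations.append(f"sentence too long: {longest} words")
    if steps > card.max_reasoning_steps:
        violations.append(f"too many reasoning steps: {steps}")
    words = set(re.findall(r"[a-z']+", text.lower()))
    used = sorted(words & card.banned_words)
    if used:
        violations.append("abstract vocabulary: " + ", ".join(used))
    return violations


# Assumed profile for a seven-year-old; the numbers are placeholders, not findings.
AGE_7 = SkillCard(
    target_age=7,
    max_sentence_words=10,
    max_reasoning_steps=2,
    banned_words=frozenset({"therefore", "hypothesis", "proportional"}),
)

draft = "The area is proportional to the side length squared, therefore it quadruples."
for problem in check_response(draft, steps=3, card=AGE_7):
    print(problem)
```

A real system would need far richer signals than word counts, but the design point survives the simplification: the constraint lives in a structure that can be logged, versioned, and tested, not in the tone of a system prompt.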

That difference is worth underlining:

Cognitive alignment is selective behavioral reconfiguration, not uniform capability reduction.

The paper’s strongest result is that skill-guided constraints induce clearer age-ordered differentiation in stronger proprietary models. The weaker result, and arguably the more useful one for practitioners, is that the alignment is uneven. Language-mediated behavior is easier to shape. Working memory, perceptual reasoning, and processing-speed behavior are harder to calibrate because current model architectures do not naturally share human limits on attention, memory decay, or visual processing.

That is the business lesson hiding inside the psychometric framing. Prompting can adjust tone. It cannot automatically impose the user’s cognitive constraints.

For child-facing products, healthcare intake tools, educational assistants, or customer-support agents serving low-literacy users, this matters directly. “Friendly explanation” is not enough. The agent needs a constraint model of the user, and the evaluation must check whether behavior changes in the right direction.

Otherwise, the product may be accurate, articulate, and misaligned. A very polished failure, which is still a failure.

Paper role two: robustness is not more pretraining by default

The second paper moves from users to environments. It revisits behavior foundation models for offline imitation learning. In this setting, a model is pretrained on task-agnostic exploratory data and later adapted to new tasks from demonstrations. The attraction is obvious: instead of retraining from scratch for every behavior, the system learns reusable representations and performs fast task inference.

The problem is also obvious to anyone who has ever watched a robotics demo behave perfectly until the lighting, floor, object weight, or motor response changes. Existing behavior foundation models often assume fixed dynamics. Deployment rarely returns the favor.

The authors formulate task inference as a robust minimax problem. The learner chooses a task vector and induced policy; an adversary selects transition dynamics within an uncertainty set that maximize imitation loss. The aim is to infer a policy that remains viable under worst-case dynamics perturbations, without modifying the pretraining stage and without requiring expert demonstrations from perturbed environments.
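
The minimax logic can be illustrated with a toy example over finite sets. This is a sketch of the general idea only, not the paper's algorithm: the candidate task vectors, the three dynamics settings, and every loss value in the table are invented for illustration.

```python
# Toy robust task inference over finite sets. All numbers are made up.
# loss[z][d] is the imitation loss of the policy induced by candidate task
# vector z when it is evaluated under dynamics d from the uncertainty set.
loss = {
    "z_nominal": {"nominal": 0.10, "low_friction": 0.90, "weak_actuator": 0.80},
    "z_robust":  {"nominal": 0.20, "low_friction": 0.35, "weak_actuator": 0.40},
    "z_timid":   {"nominal": 0.50, "low_friction": 0.55, "weak_actuator": 0.60},
}


def nominal_choice(table):
    """Standard inference: pick the task vector with the lowest nominal loss."""
    return min(table, key=lambda z: table[z]["nominal"])


def robust_choice(table):
    """Minimax inference: pick the task vector whose worst-case loss is lowest."""
    return min(table, key=lambda z: max(table[z].values()))


print("nominal pick:", nominal_choice(loss))  # z_nominal
print("robust pick: ", robust_choice(loss))   # z_robust
```

The robust pick gives up a little nominal performance to avoid the worst outcomes. In the continuous setting the paper addresses, neither the candidates nor the adversary's options can be enumerated like this, which is why the tractable relaxations matter.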

They propose two variants. RBFM-Light relaxes transition uncertainty into an occupancy-space uncertainty set, making the robust objective tractable. RBFM-Heavy adds structural constraints through Bellman flow conservation, reducing some conservatism at higher computational cost. The naming is not poetic, but it is mercifully honest.

The experiments evaluate Walker, Quadruped, and Cheetah tasks under perturbations such as gravity changes, mass changes, joint friction loss, actuator strength reduction, contact stiffness shifts, and range-of-motion limits. The reported pattern is consistent: robust BFM variants outperform standard task inference under dynamics shifts, with RBFM-Heavy usually strongest, RBFM-Light often intermediate, and the nominal FB-IL baseline more brittle.

But again, the caveat is the lesson. The robust approach depends on sufficiently rich pretrained representations. When pretraining data are insufficient, performance becomes unpredictable under perturbation. In business language: inference-time adaptation helps only if the underlying representation contains enough useful structure to adapt from.

This is not a magic patch. It is a disciplined way to spend the representational capital accumulated during pretraining.

The shared insight: fit beats raw strength

Put the two papers together and the pattern becomes clearer.

The child-agent paper says: a powerful agent may fail because it is too adult-like for the user.

The robust-BFM paper says: a pretrained policy may fail because it is too nominal for the environment.

Both failures are failures of fit.

That word is less glamorous than “general intelligence,” but much more useful. Deployment fit has at least three parts:

User fit: Does the agent match the user’s cognitive, linguistic, emotional, and procedural context?
Task fit: Does the agent follow the constraints that matter for the actual business workflow, not just the benchmark task?
Environment fit: Does the agent remain reliable when operating conditions shift?

A business agent does not live in the benchmark. It lives in the messy gap between the benchmark and the user. That gap is where ROI is either created or quietly composted.

A simple statement of deployment readiness is therefore:

$$ \text{Agent readiness} \neq \text{base capability} $$

A more useful version is:

$$ \text{Agent readiness} = \text{base capability} \times \text{context constraints} \times \text{shift testing} $$

This is not meant as a literal statistical model. It is a governance reminder. If any term is near zero, the deployment is weak. A capable base model with no contextual constraint is brittle. A carefully constrained agent with poor base capability may collapse. A system that passes only nominal tests may fail on first contact with abnormal reality.

Why average success is the wrong comfort metric

Many businesses still want a simple answer: “What is the agent’s success rate?”

That number is useful, but incomplete. The two papers show why.

In cognitive-age alignment, higher raw performance is not always the goal. A seven-year-old-aligned agent should not necessarily solve every task with adult abstraction. If it does, the benchmark score may rise while developmental fit worsens. A lower score can be a sign of correct constraint, provided the trajectory is age-ordered and domain-appropriate.
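
One way to check for an age-ordered trajectory, sketched under assumptions: compute a rank correlation between target age and score and look at the ordering, not the height, of the curve. The `kendall_tau` helper below is a plain tau-a implementation, and the two score series are invented for illustration; they are not results from the paper.

```python
from itertools import combinations


def kendall_tau(ages, scores):
    """Kendall rank correlation (tau-a) between target age and score."""
    concordant = discordant = 0
    for (a1, s1), (a2, s2) in combinations(zip(ages, scores), 2):
        direction = (a2 - a1) * (s2 - s1)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    pairs = len(ages) * (len(ages) - 1) // 2
    return (concordant - discordant) / pairs


ages = [6, 8, 10, 12, 14]
# Hypothetical scores: a prompted baseline that stays near ceiling at every age,
# and a constrained agent whose score rises with target age.
flat_scores = [0.91, 0.93, 0.90, 0.92, 0.91]
ordered_scores = [0.42, 0.55, 0.63, 0.74, 0.81]

print("baseline tau:   ", kendall_tau(ages, flat_scores))
print("constrained tau:", kendall_tau(ages, ordered_scores))
```

In this made-up example the baseline has the higher average score and the worse trajectory, which is exactly the case an average would hide.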

In robust imitation learning, high nominal performance is also insufficient. A policy can perform well in the original environment and degrade sharply under perturbation. What matters is not only the leftmost point on the curve. It is the slope of failure as conditions shift.
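
The slope of failure can be estimated with nothing more exotic than a least-squares fit of return against perturbation magnitude. The sketch below assumes a single scalar perturbation axis, and the return values are invented for illustration rather than taken from the paper.

```python
def degradation_slope(magnitudes, returns):
    """Least-squares slope of return versus perturbation magnitude."""
    n = len(magnitudes)
    mean_x = sum(magnitudes) / n
    mean_y = sum(returns) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(magnitudes, returns))
    var = sum((x - mean_x) ** 2 for x in magnitudes)
    return cov / var


# Hypothetical perturbation levels (e.g. fractional loss of actuator strength)
# and hypothetical episode returns for two policies.
magnitudes = [0.0, 0.1, 0.2, 0.3, 0.4]
brittle_returns = [900, 700, 450, 250, 100]
robust_returns = [850, 820, 780, 740, 700]

print("brittle slope:", degradation_slope(magnitudes, brittle_returns))
print("robust slope: ", degradation_slope(magnitudes, robust_returns))
```

The brittle policy wins at zero perturbation and loses everywhere that matters. A single slope is a blunt summary, since real degradation is often nonlinear, but it is already more informative than the nominal score alone.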

This distinction is central for business AI evaluation. The agent should not be judged only by its clean-case performance. It should be judged by its behavior under the mismatches that the business can reasonably expect.

For a customer-support agent, that means ambiguous requests, irritated users, incomplete records, multilingual input, policy exceptions, and partial tool failure.

For an AI tutor, it means different ages, different vocabulary levels, distracted students, wrong intermediate reasoning, and emotional frustration.

For a finance or compliance assistant, it means edge cases, missing documents, conflicting rules, stale data, and auditability.

For robotics or operations automation, it means wear, noise, latency, sensor drift, unusual inputs, and environment changes.

If the evaluation ignores those cases, the deployment plan is mostly theater. Polite theater, but still theater.

The practical framework: define the mismatch first

The two papers suggest a useful implementation rule:

Before asking how powerful the agent is, define what kind of mismatch it must survive.

That rule changes the deployment conversation. Instead of beginning with model selection, begin with context diagnosis.

| Business question | Bad version | Better version |
| --- | --- | --- |
| User alignment | “Can the model explain this?” | “Can it explain this at the right cognitive level for this user?” |
| Workflow reliability | “Can it complete the happy path?” | “What happens when records are incomplete or rules conflict?” |
| Environment robustness | “Does it work in the demo?” | “How does performance degrade under realistic perturbations?” |
| Evaluation | “What is the average score?” | “Which shifted cases expose brittle behavior?” |
| Governance | “Did we add a system prompt?” | “What constraints are enforced, logged, and tested?” |

The child-agent paper makes the case for user-context constraints. The robust-BFM paper makes the case for environment-context constraints. The business translation is the same: deployment should be designed around mismatch classes.

A useful checklist looks like this:

| Step | Question | Output |
| --- | --- | --- |
| 1. Identify the mismatch | What is different between benchmark/training and deployment? | User, task, tool, data, policy, or environment mismatch map |
| 2. Define the constraint | What behavior should change when the context changes? | Vocabulary limits, memory limits, escalation rules, robustness margins, policy boundaries |
| 3. Choose the intervention point | Where should adaptation happen? | Prompt layer, memory layer, planner, task inference, tool router, human review gate |
| 4. Test shifted cases | What failure modes are likely but absent from clean tests? | Stress-test suite with nominal and perturbed scenarios |
| 5. Measure degradation | Does the agent fail gracefully or abruptly? | Failure slope, exception rate, unsafe-action rate, escalation quality |
| 6. Keep the caveat visible | What cannot be fixed by constraints alone? | Base-model limits, data insufficiency, domain gaps, architecture limits |
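
Steps 4 and 5 of the checklist can be sketched as a small harness that reports pass rates separately for nominal and perturbed scenarios. Everything here is a hypothetical stand-in: the `Scenario` structure, the `toy_agent`, and the refund workflow are assumptions for illustration, not a recommended test framework.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    perturbed: bool        # False for clean cases, True for shifted cases
    request: dict
    expected_action: str


def toy_agent(request: dict) -> str:
    """Hypothetical agent: escalates on missing records, otherwise refunds."""
    if request.get("record") is None:
        return "escalate"
    return "refund"  # ignores conflicting rules: the brittleness we want exposed


def run_suite(agent, scenarios):
    """Return the pass rate per bucket so shifted cases cannot hide in an average."""
    counts = {"nominal": [0, 0], "perturbed": [0, 0]}  # [passed, total]
    for s in scenarios:
        bucket = "perturbed" if s.perturbed else "nominal"
        counts[bucket][1] += 1
        if agent(s.request) == s.expected_action:
            counts[bucket][0] += 1
    return {k: (p / t if t else None) for k, (p, t) in counts.items()}


scenarios = [
    Scenario("happy path", False, {"record": {"id": 1}}, "refund"),
    Scenario("missing record", True, {"record": None}, "escalate"),
    Scenario("conflicting rules", True,
             {"record": {"id": 2}, "conflicting_rules": True}, "escalate"),
]

report = run_suite(toy_agent, scenarios)
print(report)
```

Reporting the two buckets separately is the whole point. Pooled together, this agent passes two of three cases; split apart, it is perfect on the demo and fails half of the shifted cases.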

This is not glamorous. It is also where most deployment value sits.

What the papers show versus what business should infer

It is important not to overgeneralize. The child-agent paper does not prove that all business agents need psychometric benchmarking. The robust-BFM paper does not prove that every software agent should use minimax task inference. Both are research papers, not procurement manuals.

What they show is narrower and stronger.

The first paper shows that standard prompting is not enough to produce reliable developmental alignment, and that explicit cognitive filters can improve age differentiation in stronger models while leaving difficult gaps in memory, perception, and processing speed.

The second paper shows that robust task inference can improve behavior-foundation-model performance under dynamics perturbations, without changing pretraining, but that this robustness depends on the quality and richness of pretrained representations.

The business inference is broader:

Context adaptation should be treated as an explicit design layer, not as a vague expectation that the foundation model will “figure it out.”

That design layer can take many forms. It might be a cognitive profile for educational software. It might be a policy-bound workflow graph for back-office automation. It might be a robust inference procedure for robotics. It might be a stress-test suite for customer-support agents. It might be a human escalation rule when the agent detects mismatch beyond its safe operating range.

The form varies. The principle does not.

The uncomfortable caveat: constraints cannot create missing capacity

There is a tempting managerial interpretation of these papers: “Good, we can avoid retraining and just add better inference-time controls.”

Sometimes, yes. Not always.

Both papers carry a warning against cheap optimism. In the cognitive-age paper, skill-guided constraints work better in sufficiently capable proprietary models than in weaker open-weight models. If the base model cannot follow the constraints or lacks relevant cognitive controllability, constraints may produce failure rather than alignment.

In the robust-BFM paper, the method depends on learned representations that are rich enough to support robust task inference. If pretraining data are too thin, robustness becomes unstable. You cannot infer a robust policy from representations that never learned the relevant structure.

This matters for procurement. Businesses often want to know whether they can use a cheaper model and patch it with prompts, routing, or governance. The answer is: sometimes, but only after testing the constraint layer against realistic shifted cases. Controls are not a substitute for capability. Capability is not a substitute for controls.

They are complements, not excuses.

The real ROI question

For business leaders, the useful question is not “Which model is smartest?”

The useful question is:

What mismatch will cost us money, trust, compliance, or safety if the agent handles it poorly?

Once that is clear, the evaluation becomes more honest.

A child-facing AI tutor should not be measured only by answer accuracy. It should be measured by developmental appropriateness, confusion recovery, scaffolding quality, and whether explanations stay within the learner’s grasp.

A customer-support agent should not be measured only by first-contact resolution. It should be measured by escalation quality, hallucination containment, policy compliance, and performance under incomplete or conflicting records.

A robotics or operations agent should not be measured only by nominal task completion. It should be measured by degradation under plausible environmental shifts.

A financial or compliance assistant should not be measured only by whether it can summarize rules. It should be measured by what it does when rules conflict, records are missing, or the cost of a confident error is high.

This is where agent deployment becomes less like model shopping and more like systems engineering. An unfortunate development for slide decks, perhaps, but a useful one for reality.

The article’s spine in one sentence

The two papers are best read as one shared insight:

Strong agents do not become deployment-ready by being strong; they become deployment-ready when their strength is constrained, adapted, and tested against the specific mismatches they will meet in use.

That insight is more valuable than a serial summary of the papers because the domains are so different. Child-facing multimodal agents and robust imitation-learning policies do not need to be forced into one technical category. Their common value is conceptual and operational. Both show that the default behavior of a pretrained system is not the same as the right behavior for deployment.

The industry likes to talk about agents as if autonomy were the finish line. These papers suggest a less theatrical conclusion: autonomy without fit is just overconfident motion.

And overconfident motion, whether in a tutoring app or a quadruped robot, tends to find the furniture.

Footnotes

Cognaptus: Automate the Present, Incubate the Future.

1. Yifan Shen, Jiawen Zhang, Jian Xu, Junho Kim, Ismini Lourentzou, Xu Cao, and Meihuan Huang, “Evaluating Cognitive Age Alignment in Interactive AI Agents,” arXiv:2605.17894, 2026. https://arxiv.org/html/2605.17894
2. Rishabh Agrawal, Rahul Jain, and Ashutosh Nayyar, “When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited,” arXiv:2605.17017, 2026. https://arxiv.org/html/2605.17017
