When the Sandbox Thinks Back: Training AI Agents in Simulated Realities

Workflow software has a deeply unglamorous problem: reality keeps changing.

A customer support agent may know the refund policy, but then the customer changes their address, the order record has a missing field, the tool returns a cryptic error, and the next API call requires a schema nobody mentioned in the demo. A spreadsheet agent may know how to summarise a table, but the file path is wrong, the calendar has a conflicting event, and the “obvious” action fails because the world, in its charmingly vindictive way, is not a benchmark prompt.

This is the gap addressed by Simulating Environments with Reasoning Models for Agent Training, a paper from researchers at the University of Washington, Microsoft, and Carnegie Mellon University.¹ The paper’s central move is simple enough to sound suspicious: instead of building a real environment for every agent training task, use a strong reasoning model to simulate the environment’s responses.

Not to answer the user. Not merely to generate synthetic instructions. To play the role of the world.

That is the interesting bit. The paper is not just another “synthetic data improves smaller models” story, although it is partly that. Its more important claim is architectural: the training environment itself can become model-mediated. The sandbox does not just contain the agent. It talks back.

Agent training is really training on consequences

The easiest way to misunderstand this paper is to think it is about making agents “imagine” better. That sounds poetic, and therefore dangerous. The more precise version is this: agents learn from trajectories, and trajectories are made of consequences.

A tool-using agent does not only need to know that create_event exists. It needs to learn what happens after it calls create_event with the wrong time, missing fields, invalid app context, malformed JSON, or a plausible action that violates policy. In real software, these consequences come from APIs, databases, user simulators, task-specific validators, and reward functions. That means every new training domain usually requires its own mini-world: schemas, state, mock data, execution logic, error messages, scoring rules, and maintenance.

The Simia paper calls attention to the asymmetry. Many agent tasks are cognitively simple but environmentally broad. Booking, updating, querying, filtering, scheduling, retrieving, and correcting are not Olympiad mathematics. Yet they become difficult because the action space is messy and the agent must stay coherent across tools, states, and exceptions.

Traditional agent training tries to handle this by engineering environments. Simia asks whether a reasoning LLM can approximate enough of that environment to produce useful training data and reward feedback. That is not a small substitution. It changes the bottleneck from “build another testbed” to “specify the world well enough that a simulator model can act inside it.”

The paper presents two frameworks around that substitution:

Framework	What it replaces	What it uses instead	Training role
Simia-SFT	Real environment executions for supervised trajectories	LLM-generated multi-turn trajectories grounded in seed examples, tool specs, policies, and output formats	Supervised fine-tuning data
Simia-RL	Task-specific environment implementations and reward functions	LLM-simulated observations plus LLM-scored task completion	Reinforcement learning loop

The mechanism matters because the value does not come from synthetic text in isolation. It comes from synthetic interaction: user request, agent reasoning, tool call, environment response, correction, next action, final outcome. In agent training, the unit of learning is not a sentence. It is a controlled little disaster with a resolution.

Simia-SFT turns small seed sets into synthetic operating experience

Simia-SFT begins with seed trajectories. These are examples of agent-tool interaction from existing datasets or benchmarks. The authors use three sources: APIGen-MT for airline and retail tasks, AgentTuning for operating-system, WebShop, and Mind2Web tasks, and OfficeBench for office automation workflows.

The scaling is the point. The paper reports taking roughly 1.5k airline and 3.5k retail trajectories from APIGen-MT and generating 90k synthetic trajectories. It expands 668 AgentTuning samples into 15k trajectories across OS, WebShop, and Mind2Web. For OfficeBench, it starts from 76 one-app tasks with o4-mini trajectories and synthesises 30k samples targeting multi-app settings.

That is the paper’s practical recipe: do not hand-build the whole universe. Use a small number of plausible traces, then ask a strong simulator to produce more varied consequences inside the same formal action space.

The Simia-SFT pipeline has four main stages.

First, an LLM pre-filters seed trajectories for completeness, logic, and format. That step is not glamorous, but it is important. If the seeds are incoherent, the simulator will faithfully amplify incoherence. A garbage-in, garbage-out system with better punctuation is still garbage.

Second, the prompt anchors the simulator using tool specifications, policies, expected formats, and a reference trajectory. This is where “simulation” becomes less mystical. The model is not hallucinating an entire domain from vibes. It is constrained by action schemas and examples.

Third, the simulator generates complete multi-turn trajectories. In one generation pass, it produces user queries, assistant reasoning, tool calls, and simulated environment observations until the task is complete. Temperature and multiple passes are used to increase diversity.

Fourth, rule-based post-processing repairs and filters the output. Malformed JSON is fixed where possible. Invalid tool calls are discarded. Tool-call formats are normalised. The system prompt is adjusted to match the target deployment format.

That last stage is a useful reminder: the paper does not ask the LLM simulator to be magic. It surrounds the simulator with boring mechanical guardrails. As usual in applied AI, the “breakthrough” is accompanied by a broom and a validation script.

Simia-RL makes the environment part of the reinforcement loop

Simia-RL pushes the idea further. Instead of using simulated trajectories only for supervised fine-tuning, it uses an LLM simulator inside reinforcement learning.

In normal agent RL, the model acts in an environment. The environment returns observations. A reward function decides whether the task succeeded. For agent workflows, that usually means building custom environment logic and task-specific rewards.

Simia-RL replaces that with an o4-mini simulator. The prompt includes tool usage specifications, environment feedback formats, reference trajectories, and interaction history. The simulator has two jobs: produce environment feedback after the model acts, and compute a final binary reward for success or failure.

This creates a peculiar but powerful setup. The agent is learning from a world that is itself generated by a reasoning model. A weaker policy model is trained through interaction with a stronger simulator model. The teacher is not just giving answers; it is staging consequences.

The OfficeBench case study makes the mechanism concrete. In the real environment, an attempted calendar event creation fails with a short generic message. In the simulated environment, the feedback explains that the requested time conflicts with an existing lunch break. The model can then adjust the proposed time and complete the task. The paper uses this as a case study, not as the main proof. Its purpose is explanatory: it shows why simulated feedback can sometimes be more useful for learning than a brittle real environment that says, essentially, “no” and walks away.

That distinction matters. A real environment is not automatically a good teacher. It may be truthful but pedagogically useless. A simulated environment may be less grounded but more informative. The business question is not “which one is philosophically purer?” It is “which one produces agents that survive more realistic failure modes after proper validation?”

Annoyingly, the answer is: test it. Reality remains stubborn like that.

The main evidence is benchmark performance, but the ablations do the interpretive work

The paper evaluates across three benchmark families: τ²-Bench for airline and retail multi-turn tool use, OfficeBench for cross-application office workflows, and AgentBench for operating system, WebShop, and Mind2Web tasks.

The headline results are strong. Simia-Tau based on Qwen2.5-32B reaches an average of 58.9 on τ²-Bench, above GPT-4o’s 54.2 and xLAM-2-70B’s 56.3, while remaining below GPT-5 and o4-mini. Simia-Tau based on Qwen3-8B reaches 49.3, outperforming GPT-4’s 45.4 and xLAM-2-8B’s 44.7. After simulated-environment RL, the Qwen3-8B Simia-Tau-RL model rises slightly to 51.0.

On OfficeBench, the proprietary models remain ahead: GPT-5 averages 80.4, o4-mini 78.6, GPT-4.1 71.0, and GPT-4o 62.7. But the Simia-trained smaller models improve sharply over open baselines. Simia-OB-RL based on Qwen2.5-7B reaches 49.6 average. Simia-OB based on Qwen3-8B reaches 44.0, above GPT-4’s 31.1 and far above the base Qwen3-8B score reported in the table.

On AgentBench, Simia-AB based on Qwen3-8B and Qwen2.5-Coder-7B both average 42.6, close to GPT-4’s 44.2 and above GPT-4o’s reported 38.1. A Qwen2.5-7B Simia-AB variant reaches 41.6.

A compact reading:

Evidence item	Likely purpose in the paper	What it supports	What it does not prove
τ²-Bench table	Main evidence	Simulated trajectories can substantially improve open tool-use agents in airline and retail workflows	That simulation reliably transfers to live airline or retail systems
OfficeBench table	Main evidence	Simia helps smaller models handle multi-app office workflows better than seed-only or base models	That it closes the gap to the strongest proprietary models
AgentBench table	Main evidence and comparison with prior agent training	Simia improves across OS, WebShop, and Mind2Web-style tasks	That one simulator style covers all web and software environments
Pass^k test on τ²-Bench	Robustness/sensitivity test	Some gains persist under stricter repeated-success evaluation	Uniform dominance across all retry settings; Simia trails xLAM-2-70B on Retail at higher k
Real vs simulated trajectory comparison	Ablation	At equal data size, simulated data is comparable in several settings; at larger scale, synthetic simulation gives a data-volume advantage	That simulated data is intrinsically more truthful than real environment data
RL on simulated environments	Main RL evidence plus implementation comparison	Simulated RL can improve OfficeBench and slightly improve τ²-Bench	That LLM reward models are reliable enough for deployment scoring
Simulator comparison: GPT-5 vs o4-mini	Ablation	Strong simulator models produce broadly comparable synthetic data, with domain-dependent differences	That any cheaper model can be substituted without loss
Mixed-dataset fine-tuning	Exploratory extension	Joint training on simulated datasets may improve broad average performance	That multi-domain training is universally better on every benchmark

This is where the paper becomes more useful than the headline. The strongest evidence is not simply that a Simia-trained model beats GPT-4o somewhere. Benchmark comparisons are noisy, model versions differ, and agent benchmarks can punish small format mistakes in ways that look dramatic.

The more important evidence is the ablation logic. Figure 6 compares training on real-environment seed data with training on simulated trajectories. At the same dataset size, simulated trajectories are reported as comparable to real-environment trajectories on OfficeBench and AgentBench, and better on τ²-Bench. As the simulated dataset scales, performance improves further. That supports the paper’s core mechanism: the benefit is not just that synthetic data is cheap; it is that simulation lets the training distribution expand beyond the bottleneck of collected environment traces.

The simulator comparison in Appendix E adds another useful boundary. GPT-5 and o4-mini generate 15k synthetic trajectories with broadly similar downstream effects, but not identical ones. o4-mini is slightly better on OfficeBench 2-apps, OfficeBench 3-apps, and WebShop. GPT-5 is better on Mind2Web and much better on τ²-Bench Airline and Retail in that ablation. This suggests the approach is not tied to a single simulator, but it is absolutely tied to simulator quality. Replace the teacher with a weaker model and the classroom may become a theatre troupe.

The robustness story is promising, not perfectly clean

The pass^k evaluation is a useful antidote to over-reading the main table. Pass^k requires success across repeated attempts, making it stricter than a one-shot success rate. On τ²-Bench Airline, Simia-Tau Qwen2.5-32B posts 56.0, 48.0, and 46.0 for k = 1, 2, and 3, ahead of xLAM-2-70B’s 49.3, 40.0, and 34.0. That is meaningful because repeated success is closer to operational reliability than one lucky run.

Retail is more nuanced. Simia-Tau Qwen2.5-32B reaches 61.7, 47.7, and 38.6, while xLAM-2-70B reports 63.2, 51.5, and 46.5. So the Simia model does not dominate everywhere. It wins the neat headline average against xLAM-2-70B on τ²-Bench overall, but the stricter retry lens shows domain-specific weakness.

That is not a flaw in the paper. It is a useful constraint. Simulated environments can improve robustness, but robustness remains domain-shaped. Airline and retail are not interchangeable just because both have customers and APIs. Anyone who has integrated enterprise systems will now pause politely to avoid screaming.

The business value is cheaper rehearsal, not automatic truth

For companies building AI agents, the obvious temptation is to read this paper as permission to skip environment engineering. That is the wrong lesson. Simia does not remove the need for real systems. It changes where real systems are needed most.

The practical pathway looks like this:

Start with a small number of high-quality traces from real or benchmark environments.
Convert the domain into explicit tool specifications, policies, output formats, and failure rules.
Use a strong reasoning model to generate larger synthetic interaction datasets.
Fine-tune smaller open models on those trajectories.
Use simulated RL for additional learning where feedback richness matters.
Validate against real systems before deployment.

The ROI story is not “LLMs replace your backend.” Please do not put that sentence in a board deck unless you enjoy incident reviews. The more defensible story is that LLM simulators can reduce the marginal cost of training-world construction.

That matters because environment engineering is a hidden tax on agent development. For every new domain, teams must maintain mock APIs, tool wrappers, user simulators, validators, state machines, data fixtures, and reward criteria. Those assets are expensive, brittle, and rarely reusable. Simia reframes much of this work as prompt-and-schema design plus validation.

A useful business interpretation:

Technical contribution	Operational consequence	ROI relevance
LLM-simulated trajectories	More training cases from fewer seed traces	Lower data collection and environment construction cost
Tool-spec-grounded prompts	Synthetic data stays inside allowed action spaces	Better format adherence and fewer invalid tool calls
Rule-based post-processing	Structural failures are filtered before training	Less contamination from malformed trajectories
LLM-simulated RL feedback	Agents receive richer failure explanations during training	Faster iteration on multi-turn correction behaviour
Cross-benchmark gains in smaller models	Open models become more competitive after targeted training	Potentially lower inference cost and more controllable deployment stack

The strongest near-term use case is not highly regulated production autonomy. It is pre-production rehearsal. Simia-like methods are well suited to generating training and evaluation traces for support workflows, internal operations, office automation, CRM tasks, retail service agents, and API-heavy assistant systems where the organisation can describe tool behaviour and policy constraints with reasonable precision.

The uncertainty boundary is equally important. These experiments are limited to airline, retail, office, web, and operating-system-style benchmarks. They rely on strong simulator models such as GPT-5 and o4-mini. Their reward signals are simulated. The generated environments may inherit distributional bias from the simulator and the seeds. If the seed traces underrepresent ugly edge cases, the synthetic world may simply produce cleaner versions of your blind spots. Very efficient. Very dangerous. Very on brand for enterprise automation.

The simulated environment may be a better teacher than the real one

One of the paper’s more intriguing observations is that simulated feedback can outperform real environment feedback in RL for OfficeBench. The reported OfficeBench RL results show simulated-environment RL reaching 64.7 on 2-apps and 34.5 on 3-apps, compared with real-environment RL at 60.8 and 28.6. Relative to the SFT model, the simulated RL path adds 6.9 and 7.2 points.

This does not mean fake worlds are better than real worlds. It means a training environment has two jobs: represent the task and teach the policy. Real environments often represent the task but teach poorly. Their errors are fixed, sparse, and unhelpful. A simulator can produce feedback that explains why an action failed, which gives the policy more information during learning.

In the OfficeBench case study, the simulated environment does not merely say the calendar event failed. It explains the lunch-break conflict. That extra information helps the model revise its next action. For training, this is valuable. For deployment, it must be verified. The simulator can be pedagogically richer and epistemically weaker at the same time. Two things can be true. AI discourse will survive, somehow.

This distinction opens an interesting design space: simulated environments may be used not as substitutes for reality, but as curriculum builders. A real system can tell the agent what happens. A simulated teacher can tell the agent why the failure pattern matters. The eventual production model should still be tested against the real system, because the customer will not be impressed that the simulator found the refund elegant.

The appendix is not a second thesis; it is the boundary map

The appendix results are worth reading because they clarify what kind of claim the paper is making.

The GPT-5 versus o4-mini simulator ablation is a sensitivity test. It asks whether the synthetic-data pipeline depends on one specific simulator. The answer appears to be no, but with domain variation. This supports generality across strong simulators, not generality across arbitrary cheap ones.

The mixed-dataset experiment is an exploratory extension. Qwen3-8B fine-tuned jointly on the three simulated datasets improves strongly over the base model and achieves a higher average than GPT-4 across the reported benchmarks. But the individual benchmark pattern is uneven: it exceeds GPT-4 on some tasks and remains below on others. The useful inference is that multi-domain simulated training can broaden capability, not that one synthetic stew cures every benchmark ailment.

The RL training-step plots are robustness and dynamics checks. OfficeBench 2-apps improves over training and finishes around 64.7. OfficeBench 3-apps rises from 27.3 to 34.5 and remains above the real-environment RL line across the plotted steps. τ²-Bench RL improves only slightly and shows noisy movement. That should temper any simplistic “RL on simulated worlds scales smoothly” interpretation. The signal is positive, but not frictionless.

Even the prompt appendices matter. They show that much of the method’s effectiveness depends on strict format preservation, allowed-tool constraints, app-switching rules, path validation, action validity checks, and reward criteria that refuse to credit mere claims of task completion. In other words, the simulator is useful because it is boxed in. The sandbox thinks back, yes, but only after someone writes the sandbox rules.

Where this should change agent development practice

The paper’s strategic implication is that agent development may move from environment-first engineering to simulation-first curriculum design.

In the environment-first approach, teams build a realistic testbed, collect trajectories, fine-tune or evaluate models, then repeat the process for each domain. This is faithful but slow.

In the simulation-first approach, teams collect a smaller set of seed traces, formalise the tool and policy surface, generate synthetic trajectories, train open models, and reserve real environments for validation, calibration, and high-risk cases. This is less pure but potentially much faster.

That shift would change the economics of agent projects in three ways.

First, it could make smaller models more attractive. The paper repeatedly shows 7B and 8B models becoming far more competitive after Simia-style training. For companies concerned with inference cost, latency, data control, or on-premise deployment, this is not a footnote. It is the budget conversation.

Second, it could make workflow coverage less painful. Instead of handcrafting every exception path, teams could synthesise broader failure distributions from policy and tool specs. The resulting model may learn the shape of operational correction before it touches production.

Third, it could separate training realism from validation realism. During training, richer simulated feedback may be useful. During validation, real-system behaviour must remain the judge. This is a healthier division of labour than pretending every synthetic trace is either useless fantasy or perfect replacement. Reality, regrettably, is not binary.

The boundary: simulated reality is still synthetic reality

The paper is disciplined enough to acknowledge its limits. The experiments cover a finite set of domains: airline, retail, web navigation, operating-system tasks, and office workflows. These are useful domains, but they are not the full enterprise universe. Tool schemas differ. State dynamics differ. Risk profiles differ. A simulator that handles a calendar conflict does not automatically understand medical triage, procurement fraud, tax compliance, or industrial control systems. One hopes this did not need saying. It probably did.

The method also depends on strong LLM simulators. If GPT-5 or o4-mini is doing the world simulation, then the training pipeline inherits their biases, omissions, and priors. Prompt constraints reduce the action space, but they do not guarantee semantic correctness. A simulator can produce plausible feedback that is wrong in exactly the way a confident model is wrong: smoothly, consistently, and with excellent formatting.

There is also a governance issue. Once an LLM generates both the environment feedback and the reward, the training loop can optimise toward the simulator’s preferences. If the simulator rewards neat task completion while missing rare policy violations, the trained agent may become very good at the wrong version of the world. That is not an argument against the method. It is an argument for audit trails, simulator evaluation, adversarial validation, and real-environment testing before deployment.

The best business reading is therefore cautious but not timid: Simia-style simulation is a way to scale rehearsal, not to outsource reality.

The sandbox becomes part of the stack

Simia’s contribution is not that agents can be trained on synthetic data. That idea is already familiar. Its sharper contribution is that the environment itself can be simulated by a reasoning model well enough to improve agent training across multiple benchmark families.

That turns the sandbox into an active component of the model stack. It is no longer just a place where agents are tested. It becomes a generator of experience, a provider of feedback, and in reinforcement learning, a judge of success.

The result is a practical reframing. For many agent systems, the next bottleneck may not be more parameters or another heroic benchmark run. It may be the cost of constructing enough varied, consequential experience for the model to learn how tools push back.

The paper does not prove that simulated environments can replace production systems. It proves something more useful: for training agents, a well-constrained simulated world can be good enough to teach, cheap enough to scale, and structured enough to make smaller models surprisingly competent.

The sandbox thinks back. Sensible teams will still check what it says.

Cognaptus: Automate the Present, Incubate the Future.

Yuetai Li, Huseyin A. Inan, Xiang Yue, Wei-Ning Chen, Lukas Wutschitz, Janardhan Kulkarni, Radha Poovendran, Robert Sim, and Saravan Rajmohan, “Simulating Environments with Reasoning Models for Agent Training,” arXiv:2511.01824, 2025. arXiv. ↩︎

Agent training is really training on consequences#

Simia-SFT turns small seed sets into synthetic operating experience#

Simia-RL makes the environment part of the reinforcement loop#

The main evidence is benchmark performance, but the ablations do the interpretive work#

The robustness story is promising, not perfectly clean#

The business value is cheaper rehearsal, not automatic truth#

The simulated environment may be a better teacher than the real one#

The appendix is not a second thesis; it is the boundary map#

Where this should change agent development practice#

The boundary: simulated reality is still synthetic reality#

The sandbox becomes part of the stack#