Agents Need Worlds, Not Prompts: Inside ScaleEnv’s Synthetic Environment Revolution

Workflow automation has a bad habit of looking impressive right up to the moment it touches reality.

A demo agent can summarize a refund policy, draft a polite message, and call a refund_order() tool with great confidence. Then the real workflow asks a boring question: does this order exist, is it within the refund window, has it already been refunded, does the customer’s loyalty tier matter, and should the database state change after approval?

This is where many “agent” demos quietly become theatre. The model is not really operating in a world. It is performing around a script.

That is the useful reading of ScaleEnv: Scaling Environment Synthesis from Scratch for Generalist Interactive Tool-Use Agent Training.¹ The paper is not merely about generating more synthetic tool-call data. That would be too small, and frankly we already have enough synthetic examples teaching models to call APIs that may or may not survive contact with a database. ScaleEnv’s stronger claim is that generalist tool-use agents need executable worlds: tools, databases, user intents, hidden state, distractors, feedback, and verifiable final outcomes.

The difference matters. A prompt teaches a model how an action should sound. A world teaches it what the action does.

The problem is not tool calling; it is missing consequences

The paper begins from a familiar bottleneck in agent training. Language models can learn function-call syntax from demonstrations, but real agency requires interaction. The agent must act, observe consequences, repair mistakes, and learn which actions are merely plausible versus actually valid.

That requires an environment. Not a document. Not a policy paragraph. Not a simulator that invents convenient responses when the database is missing. An environment.

ScaleEnv frames the training task as a partially observable decision process: the agent sees the conversation and tool outputs, but the true external state lives in databases. A task succeeds only if the final environment state satisfies the user’s hidden intent. This framing is important because it moves evaluation away from “did the answer look good?” toward “did the world end up in the right state?”

For business readers, this is the whole game. In a company workflow, the costliest mistakes are often not linguistic. They are state mistakes: issuing the wrong refund, changing the wrong booking, approving a request without permission, updating one table but not the related table, or confidently telling the user a process is complete when the backend disagrees.

A customer may forgive a clumsy sentence. Databases are less sentimental.

ScaleEnv starts with a domain keyword, then builds a small operating universe

ScaleEnv’s first move is almost provocatively simple: start from a domain name, such as “Job Seeking,” then synthesize a domain foundation.

That foundation has two parts:

Component	What ScaleEnv generates	Why it matters
Tool schema	Tool names, descriptions, parameters, preconditions, postconditions, and output structures	Defines what the agent can do
Database schema	Tables, fields, constraints, and tool-database mappings	Defines what the world can remember and change

The paper’s job-seeking example is a useful sanity check. Tools include actions such as recording interview feedback, retrieving interviews for an application, searching applications by keyword, and updating salary expectations. The corresponding database schema includes job applications, application stages, interview schedules, and interview feedback.

This is not glamorous. Good. The future of agentic automation will be built from many deeply unglamorous state machines. The impressive part is not that an LLM can invent a function called add_salary_expectation. The impressive part is making that function operate against a coherent database where application_id, interview records, stages, and feedback actually line up.

The paper’s mechanism is top-down. First, an LLM conceptualizes the domain logic and proposes tool schemas. Then a database agent derives the required database structure from those tools. Finally, ScaleEnv maps tools to the tables they read or modify. In business terms, this resembles turning a process map into an executable workflow twin.

The important word is executable.
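As a rough illustration of what a domain foundation could look like once it is written down, the sketch below pairs tool schemas with a database schema and checks that every tool maps to tables that actually exist. The tool name add_salary_expectation comes from the paper's example as described above. Everything else (the second tool's name, all column names, and the unmapped_tables helper) is a hypothetical choice for illustration, not ScaleEnv's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class TableSchema:
    name: str
    fields: dict[str, str]  # column name -> type (hypothetical columns)
    primary_key: str


@dataclass
class ToolSchema:
    name: str
    parameters: dict[str, str]
    preconditions: list[str]
    postconditions: list[str]
    reads: set[str] = field(default_factory=set)   # tables the tool reads
    writes: set[str] = field(default_factory=set)  # tables the tool modifies


TABLES = {
    "job_applications": TableSchema(
        "job_applications",
        {"application_id": "str", "stage": "str", "salary_expectation": "int"},
        "application_id",
    ),
    "interview_feedback": TableSchema(
        "interview_feedback",
        {"feedback_id": "str", "application_id": "str", "note": "str"},
        "feedback_id",
    ),
}

TOOLS = [
    ToolSchema(
        name="add_salary_expectation",
        parameters={"application_id": "str", "amount": "int"},
        preconditions=["application_id exists in job_applications"],
        postconditions=["job_applications.salary_expectation == amount"],
        reads={"job_applications"},
        writes={"job_applications"},
    ),
    ToolSchema(
        name="record_interview_feedback",  # hypothetical name
        parameters={"application_id": "str", "note": "str"},
        preconditions=["application_id exists in job_applications"],
        postconditions=["a new row exists in interview_feedback"],
        reads={"job_applications"},
        writes={"interview_feedback"},
    ),
]


def unmapped_tables(tools, tables):
    """Return table names that some tool references but the schema lacks."""
    referenced = set()
    for tool in tools:
        referenced |= tool.reads | tool.writes
    return referenced - set(tables)
```

An empty result from unmapped_tables(TOOLS, TABLES) is the cheapest possible coherence check: it says nothing about whether the tools behave correctly, only that the tool-database mapping points at tables the world actually has.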

Procedural testing is the bridge from plausible APIs to reliable environments

Synthetic environments have an obvious failure mode: they can look coherent while being broken underneath.

ScaleEnv addresses this through procedural testing. After generating database code and tool code, the system generates tests, executes tools on matched database instances, and checks whether the resulting state transitions match expectation.

The paper distinguishes three outcomes:

Outcome	Meaning	Training implication
Success	The tool runs and updates state as expected	The environment can support valid exploration
Anticipated rejection	Invalid input is rejected by the specified exception logic	The agent can learn boundaries, not just happy paths
Unexpected failure	Runtime errors or inconsistent state appear	The environment must be debugged before training

This is a practical detail with strategic implications. Reinforcement learning depends on feedback. If the environment gives noisy feedback because a tool fails for accidental implementation reasons, the agent does not learn “this action is logically wrong.” It learns from garbage. A beautifully scaled pile of garbage is still garbage, just with a dashboard.
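The three outcomes can be made concrete with a small test harness. This is a minimal sketch under stated assumptions: the database is a plain dictionary, ToolRejection stands in for whatever exception logic a tool declares, and record_feedback and run_case are hypothetical names. It is not the paper's testing code.

```python
import copy


class ToolRejection(Exception):
    """Raised by a tool when input violates a declared precondition."""


def record_feedback(db, interview_id, note):
    # Hypothetical tool: reject unknown interviews, otherwise append feedback.
    if interview_id not in db["interviews"]:
        raise ToolRejection(f"unknown interview: {interview_id}")
    db["feedback"].append({"interview_id": interview_id, "note": note})


def run_case(tool, db, kwargs, expected_db=None, expect_rejection=False):
    """Run one test case on a copy of the database and classify the outcome."""
    state = copy.deepcopy(db)
    try:
        tool(state, **kwargs)
    except ToolRejection:
        # A rejection only counts if it was expected and left the state intact.
        if expect_rejection and state == db:
            return "anticipated_rejection"
        return "unexpected_failure"
    except Exception:
        return "unexpected_failure"  # runtime error inside the tool itself
    if expect_rejection:
        return "unexpected_failure"  # invalid input was silently accepted
    return "success" if state == expected_db else "unexpected_failure"
```

The design choice worth noticing is that a rejection and a crash are different results. Only the first is a usable training signal; the second is an environment bug.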

Procedural testing also changes the meaning of synthetic data. In many synthetic-data pipelines, the synthetic item is the training example. In ScaleEnv, the synthetic item is closer to an executable laboratory. The agent can try multiple paths, including imperfect paths, and still receive meaningful observations.

That is the paper’s core correction to the likely misconception: ScaleEnv is not mainly producing more tool-call transcripts. It is producing environments where transcripts can be generated through interaction.

Graph expansion makes the sandbox explorable, not just scripted

A minimal environment can solve one task. It cannot train a robust agent.

Suppose the reference solution is: search for a job, upload a resume, submit an application. If the database contains only the exact records required for that path, the agent can survive by following a narrow trail of breadcrumbs. That is not exploration. That is a guided museum tour with function calls.

ScaleEnv tries to avoid this by constructing a tool dependency graph. Edges represent relationships such as parameter flow, logical prerequisites, and shared database state. Then the system samples executable seed tool chains and expands them into richer subgraphs.

This is where the mechanism becomes more interesting. The paper identifies two requirements for RL-ready task environments:

Requirement	Explanation	Business analogue
Entity consistency	Entities must align across tables	A customer ID in one system must correspond to the same customer elsewhere
Interaction completeness	Valid alternative tool calls should still return meaningful observations	An employee taking a non-optimal but allowed workflow path should not crash the process

The second requirement is easy to underestimate. Many workflow demos validate only the golden path: the sequence the designer expected. Real agents will not stay on the golden path. They will ask extra questions, retrieve irrelevant records, try actions in inconvenient order, and occasionally poke the workflow with all the grace of a raccoon in a server room.

ScaleEnv’s graph expansion attempts to make the environment robust to those deviations. It adds tools only when their dependencies can be satisfied by the existing subgraph. It executes newly added tools and refines the environment when errors appear. It also uses an LLM-gated expansion policy based on structural complexity and feasibility, with an oracle-style agent estimating whether additional executable chains can still be found.

The result is a synthetic world that is not merely solvable. It is explorable.
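The dependency rule in the expansion step can be sketched in a few lines. The example assumes a heavily simplified setup: the dependency graph is a dictionary of prerequisite sets, the LLM-gated expansion policy is replaced by a plain size cap, and no tool is actually executed. The first three tool names follow the search, upload, submit example above; the rest are hypothetical.

```python
# Hypothetical dependency graph: tool -> tools that must already be present.
DEPENDENCIES = {
    "search_jobs": set(),
    "upload_resume": set(),
    "submit_application": {"search_jobs", "upload_resume"},
    "get_application_status": {"submit_application"},
    "schedule_interview": {"submit_application"},
    "record_interview_feedback": {"schedule_interview"},
    "approve_offer": {"negotiate_offer"},  # prerequisite is never available
}


def expand(seed_chain, candidates, dependencies, max_tools):
    """Grow a seed chain, adding a tool only once its prerequisites are present.

    The size cap is a stand-in for the LLM-gated policy described above.
    """
    selected = list(seed_chain)
    changed = True
    while changed and len(selected) < max_tools:
        changed = False
        for tool in candidates:
            if tool in selected:
                continue
            if dependencies.get(tool, set()) <= set(selected):
                selected.append(tool)
                changed = True
                if len(selected) >= max_tools:
                    break
    return selected


seed = ["search_jobs", "upload_resume", "submit_application"]
candidates = [
    "record_interview_feedback",
    "get_application_status",
    "schedule_interview",
    "approve_offer",
]
subgraph = expand(seed, candidates, DEPENDENCIES, max_tools=10)
```

In this toy run, record_interview_feedback is skipped on the first pass and admitted on the second, after schedule_interview has joined the subgraph, while approve_offer never enters because its prerequisite does not exist in the environment.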

Rule-based final-state rewards keep learning attached to the database

The reward design is one of the paper’s most business-relevant choices.

Instead of relying primarily on an LLM judge, ScaleEnv uses a rule-based evaluator that compares the final database state against a ground-truth state. The evaluator treats different fields differently:

Field type	Matching policy	Example
Exempt fields	Ignore fields that do not affect success	Generated IDs, optional metadata
Hard constraints	Require strict equality	Prices, quantities, timestamps
Semantic alignment	Allow fuzzy semantic matching	Descriptive notes or comments

This is more than an engineering preference. It is an anti-theatre mechanism.

LLM-as-judge evaluation is flexible, but it can reward answers that sound aligned while the underlying state is wrong. ScaleEnv’s reward says: the final database has to be right. For enterprise automation, that is the only interpretation that survives audit.
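A minimal sketch of such an evaluator, under loud assumptions: a record is a flat dictionary, each field is assigned one of the three policies from the table, and fuzzy semantic matching is approximated with a character-level similarity ratio from Python's difflib. That last part is a stand-in chosen so the example runs without a model. The function names, the threshold, and the refund record are all hypothetical.

```python
from difflib import SequenceMatcher

EXEMPT, HARD, SEMANTIC = "exempt", "hard", "semantic"


def field_matches(policy, actual, expected, threshold=0.8):
    """Compare one field under its matching policy."""
    if policy == EXEMPT:
        return True  # generated IDs and optional metadata never block success
    if policy == HARD:
        return actual == expected  # strict equality
    if policy == SEMANTIC:
        # Stand-in for semantic matching: normalized string similarity.
        a, b = str(actual).lower(), str(expected).lower()
        return SequenceMatcher(None, a, b).ratio() >= threshold
    raise ValueError(f"unknown policy: {policy}")


def final_state_reward(actual_row, expected_row, policies):
    """Return 1.0 only if every expected field matches under its policy."""
    ok = all(
        field_matches(policies[name], actual_row.get(name), expected_row[name])
        for name in expected_row
    )
    return 1.0 if ok else 0.0


policies = {"refund_id": EXEMPT, "amount": HARD, "note": SEMANTIC}
expected = {"refund_id": "r-001", "amount": 49.99, "note": "Customer requested refund"}
good = {"refund_id": "r-913", "amount": 49.99, "note": "customer requested refund."}
wrong_amount = dict(good, amount=59.99)
```

Here good earns the reward despite a different generated ID and a slightly different note, while wrong_amount earns nothing no matter how well the note reads.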

The paper’s reward ablation is modest but telling. Averaged across three τ²-Bench domains, the rule-based reward reaches 38.5 Avg@4 versus 36.5 for LLM-as-a-Judge, 62.9 Pass@4 versus 58.8, and 15.0 Pass^4 versus 14.6. The gains are not fireworks. They are exactly the kind of small, consistent improvement one expects when a reward signal becomes less negotiable.

The cost argument may be just as important: rule-based verification avoids repeatedly asking a large model to judge every rollout. In large-scale RL, that matters. Apparently “please let another expensive model decide if this expensive model did the thing” is not a business model forever.

What the experiments actually test

The experiments should be read as a chain of evidence, not as one giant victory lap.

ScaleEnv trains Qwen3-SE models from Qwen3 using GRPO over synthesized domains and tasks. The synthesis phase uses several strong models for different agent roles, including DeepSeek-V3.2, GLM-4.7, GPT-5.1, and Qwen3-32B. A Qwen2.5-72B-Instruct model acts as the user simulator. Across 16 synthesized domains, each environment contains roughly 50 tools and 5–20 database tables.

The evaluation then asks whether training on those synthetic environments improves zero-shot performance on unseen domains and formats, especially τ²-Bench and VitaBench. The paper emphasizes that the training domains are disjoint from the evaluation domains, and Appendix A uses tool-embedding visualization to support the out-of-distribution claim.

Here is the clean way to classify the paper’s evidence:

Test	Likely purpose	What it supports	What it does not prove
Main zero-shot benchmark results	Main evidence	ScaleEnv training improves Qwen3 variants on unseen tool-use benchmarks	Universal agent generalization across all business workflows
VitaBench Pass@4	Capability-ceiling analysis	Trained models are more likely to find at least one correct solution within four attempts	Reliable single-shot deployment behavior
Domain scaling analysis	Sensitivity / scaling test	More training domains help when task count is fixed	A precise scaling law for all environment synthesis
Executability verification ablation	Ablation	Verification contributes to stronger training data and better final performance	Verification alone explains all gains
Reward mechanism ablation	Ablation	Rule-based final-state rewards outperform LLM judging in this setting	LLM judges are always inferior
Domain stability analysis	Robustness check	Gains are not obviously from one lucky domain subset	Any random domain set will work equally well
Tool-embedding visualization	OOD verification support	Evaluation domains appear semantically separated from training domains	Full causal proof of out-of-distribution reasoning
Token cost statistics	Implementation and cost boundary	Environment synthesis is token-expensive	Commercial ROI without further engineering

This table matters because agent papers often blur evidence categories. A main result is not an ablation. A robustness check is not a second thesis. A figure showing semantic separation is useful support, not a mystical certificate of general intelligence.

One must keep the evidence in its proper container. Otherwise the usual AI discourse happens, and we all lose several IQ points before lunch.

The benchmark gains are broad, but the interpretation should stay sober

The main results show improvements from Qwen3 to Qwen3-SE on every reported evaluation domain except one: τ²-Bench Airline at 32B, where the score is unchanged.

Model comparison	τ²-Bench Retail	τ²-Bench Airline	τ²-Bench Telecom	VitaBench Cross	VitaBench Delivery	VitaBench Instore	VitaBench OTA
Qwen3-8B	38.4	30.5	21.5	1.5	18.3	14.8	4.5
Qwen3-SE-8B	50.9	37.5	27.2	3.0	26.3	23.8	7.0
Gain	+12.5	+7.0	+5.7	+1.5	+8.0	+9.0	+2.5
Qwen3-32B	59.5	48.0	27.2	5.3	27.0	22.5	4.5
Qwen3-SE-32B	63.6	48.0	30.9	10.8	31.3	34.5	12.5
Gain	+4.1	+0.0	+3.7	+5.5	+4.3	+12.0	+8.0

The pattern is more important than any single number. ScaleEnv improves both the 8B and 32B models, with particularly visible gains in several VitaBench categories. For Qwen3-SE-32B, the cross-domain VitaBench subset rises from 5.3 to 10.8. That is roughly a doubling, but from a low base. This is progress, not a reason to hand the model the corporate procurement system and go golfing.

Pass@4 gives a complementary view. On VitaBench, average Pass@4 rises from 27.0 for Qwen3-8B to 35.8 for Qwen3-SE-8B, and from 36.0 for Qwen3-32B to 46.8 for Qwen3-SE-32B. The cross-domain subset is especially notable: the 32B model rises from 15 to 29 under ScaleEnv training.

Pass@4 is best read as a search-capacity measure. It asks whether the model can find at least one correct trajectory across four attempts. For business use, that suggests improved recoverability and reasoning range. It does not mean the agent is production-ready in one attempt under strict latency and compliance constraints.
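The difference between the three metrics that appear in this article (Avg@4, Pass@4, and the stricter Pass^4 from the reward ablation) is easy to see in code. The sketch below uses the simplest empirical reading, assuming exactly four recorded attempts per task. Benchmarks may use different estimators over more samples, so treat this as an illustration of the definitions rather than the benchmarks' scoring code; the sample outcomes are made up.

```python
def summarize(results):
    """Summarize per-task attempt outcomes.

    results: one list of booleans per task, one boolean per attempt.
    """
    n = len(results)
    return {
        # Avg@k: mean per-attempt success rate.
        "avg_at_k": sum(sum(r) / len(r) for r in results) / n,
        # Pass@k: share of tasks solved in at least one attempt.
        "pass_at_k": sum(any(r) for r in results) / n,
        # Pass^k: share of tasks solved in every attempt.
        "pass_hat_k": sum(all(r) for r in results) / n,
    }


# Made-up outcomes for four tasks, four attempts each.
outcomes = [
    [True, True, True, True],
    [False, True, False, False],
    [False, False, False, False],
    [True, False, True, False],
]
summary = summarize(outcomes)
```

On these made-up outcomes, Pass@4 is 0.75 while Pass^4 is only 0.25. The gap between the two is the distance between "can find a solution" and "finds it every time".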

The domain scaling analysis adds another important point. The authors train models with $N \in \{2,4,8,16\}$ unique domains while keeping the task count fixed at 1024. Performance improves monotonically across the plotted VitaBench and τ²-Bench settings, and the curve has not obviously plateaued at 16 domains. The interpretation is straightforward: diversity of worlds matters, not just quantity of tasks.

This is a useful correction for companies building internal agent training data. One thousand examples from the same workflow corner may teach the model to behave well in that corner. It may not teach the model how to handle new state structures, permission patterns, and multi-tool dependencies. Variety is not decoration. It is part of the training signal.

The ablations say verification matters, but not in a cartoonishly simple way

The executability verification ablation is one of the strongest parts of the paper because it tests a specific mechanism rather than just admiring the final model.

When ScaleEnv removes execution-based verification, performance falls below full Qwen3-SE-8B across the three τ²-Bench domains in Table 3:

Setting	Retail	Airline	Telecom
Qwen3-8B	38.4	30.5	21.5
Without executability verification	42.3	30.0	25.2
Qwen3-SE-8B	50.9	37.5	27.2

This table should be read carefully. Without verification, the model still beats the base model on Retail and Telecom, but it does not reach the full ScaleEnv model. So the evidence does not say “verification is the only ingredient.” It says “verification is a meaningful ingredient in the full system.”

That distinction matters. The framework contains multiple useful mechanisms: synthetic domain diversity, tool/database schemas, graph expansion, distractors, executable tool chains, and rule-based rewards. Verification is not magic dust. It is the part that prevents plausible-but-broken worlds from poisoning the learning process.

The authors’ explanation is convincing: without execution verification, generated tool calls can look semantically reasonable while failing at runtime because of unsatisfied preconditions or mismatched database states. In RL, that creates conflicting reward signals. The agent is punished or rewarded for artifacts of the environment rather than for the logic of its decisions.

For enterprise agent training, this maps directly to workflow QA. If a synthetic workflow twin cannot survive test execution, it should not train an agent. Otherwise the company is not training autonomy. It is training confusion at scale.

The appendix is mostly about boundaries: OOD, diversity, and cost

Appendix A supports the out-of-distribution claim through tool-embedding visualization. The 16 synthesized training domains form separated clusters, while τ²-Bench and VitaBench evaluation domains occupy distinct regions. This is useful because it reduces the suspicion that the benchmark gains come from simple domain overlap.

Still, embedding separation is not a proof of deep reasoning. It is supporting evidence. The stronger evidence remains behavioral: performance improves on evaluation domains with different tools, task patterns, and formats.

Appendix B is more operationally interesting than it first appears. The 16 synthesized domains cover different action-space sizes, state-space complexities, and dependency densities. The number of tools per domain ranges from roughly 25 to over 70, while database table counts range from 5 to 22. The paper also reports 2,560 total tasks distributed unevenly across domains, from 512 tasks for wedding planning and knowledge management down to 64 tasks for several smaller domains.

The cost numbers deserve attention. A complete domain foundation requires about 546k tokens, and a single verifiable task requires about 93.2k tokens. That is not a footnote; it is an economic boundary.
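A back-of-envelope calculation shows the scale. It assumes the two per-unit figures above apply uniformly to all 16 domains and all 2,560 tasks, and it ignores retries, discarded candidates, and training compute. The resulting total is our extrapolation, not a number taken from the paper.

```python
DOMAIN_FOUNDATION_TOKENS = 546_000  # approximate tokens per domain foundation
TASK_TOKENS = 93_200                # approximate tokens per verifiable task


def synthesis_tokens(domains, tasks):
    """Estimate total synthesis tokens, assuming uniform per-unit costs."""
    return domains * DOMAIN_FOUNDATION_TOKENS + tasks * TASK_TOKENS


total = synthesis_tokens(domains=16, tasks=2_560)
task_share = (2_560 * TASK_TOKENS) / total
print(f"{total:,} tokens, {task_share:.1%} spent on tasks")
```

Under these assumptions the total comes to roughly 247 million tokens, and almost all of it goes to task synthesis rather than domain foundations. That is why reuse of a foundation across many tasks matters less to the bill than the per-task cost itself.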

For research, this is acceptable evidence that environment synthesis can scale. For business, it states the quiet part plainly: building reliable synthetic worlds is cheaper than breaking production, but not free. The ROI case depends on how many workflows reuse the same domain foundation, how costly errors are, how often processes change, and whether the synthetic environment can be maintained as the real workflow evolves.

In other words, the sandbox is an asset only if it stays synchronized with the business. Otherwise it becomes enterprise fan fiction with unit tests.

The business value is workflow twins, not synthetic data for its own sake

The most practical lesson from ScaleEnv is that businesses should think less about “agent prompts” and more about “workflow twins.”

A workflow twin is a safe, executable copy of a business process: tools, permissions, database state, constraints, user roles, exceptions, and expected final outcomes. Agents can practice inside it. Evaluators can test them inside it. Engineers can inspect failures inside it. Compliance teams can define what success means before the model touches production.

This leads to a useful implementation hierarchy:

Layer	Business question	ScaleEnv-inspired design principle
Process schema	What actions exist?	Define tools with preconditions, postconditions, parameters, and outputs
State schema	What must be remembered?	Build database structures and integrity constraints first
Execution testing	Do the tools actually work?	Run procedural tests before training or evaluation
Task generation	What user goals should the agent satisfy?	Ground instructions in executable seed chains
Exploration space	Can the agent take alternative valid paths?	Expand environments beyond the golden path
Reward design	What counts as success?	Check final state, not just language quality
Deployment boundary	What remains unsafe?	Keep production APIs behind validation, permissions, and rollback

This is where ScaleEnv becomes relevant to Cognaptus-style automation. Many companies want agents that can coordinate CRM updates, invoice handling, customer support actions, HR requests, compliance documentation, and operational scheduling. The temptation is to attach tools to a model and write a better system prompt.

That may work for read-only assistance. It is weak medicine for state-changing workflows.

A serious automation pipeline should first build the sandbox: define the workflow, generate representative database states, include distractors and edge cases, verify tools, and evaluate final-state correctness. Only then does agent training or tuning become meaningful.
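The deployment-boundary row of the hierarchy (validation, permissions, rollback) can be sketched with Python's built-in sqlite3 module. Everything here is hypothetical: the orders and refunds tables, the support_agent role, and the refund_order function are illustrative, not part of ScaleEnv. The point is only that a state-changing tool should check before it writes and should never leave one table updated without its related table.

```python
import sqlite3


class Rejected(Exception):
    """Raised when a request fails a permission or validation check."""


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id TEXT PRIMARY KEY, amount REAL, refunded INTEGER)"
    )
    conn.execute(
        "CREATE TABLE refunds (order_id TEXT, amount REAL CHECK (amount > 0))"
    )
    conn.execute("INSERT INTO orders VALUES ('o1', 49.99, 0)")
    conn.execute("INSERT INTO orders VALUES ('o2', 0.0, 0)")
    conn.commit()
    return conn


def refund_order(conn, role, order_id):
    # Permission check before touching any state.
    if role != "support_agent":
        raise Rejected("permission denied")
    row = conn.execute(
        "SELECT amount, refunded FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    # Validation against current database state.
    if row is None:
        raise Rejected("unknown order")
    amount, refunded = row
    if refunded:
        raise Rejected("already refunded")
    # Both writes commit together; any error rolls both back.
    with conn:
        conn.execute("UPDATE orders SET refunded = 1 WHERE id = ?", (order_id,))
        conn.execute("INSERT INTO refunds VALUES (?, ?)", (order_id, amount))
```

If the second write fails, as it does for the zero-amount order o2 because of the CHECK constraint, the connection's context manager rolls back the first write as well, so the order is not left marked as refunded without a matching refund record.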

The agent should learn inside a world before being allowed near the real one. Radical concept, apparently.

What the paper directly shows, what we can infer, and what remains uncertain

ScaleEnv directly shows that Qwen3 models trained on synthesized, executable, verifiable environments improve zero-shot performance on unseen τ²-Bench and VitaBench domains. It also shows that increasing domain diversity helps under the tested setup, that execution verification improves the full system, and that rule-based final-state rewards perform better than LLM-as-judge rewards in the reported ablation.

Cognaptus can reasonably infer that sandbox-first agent development is a stronger architecture for business automation than prompt-first development. The practical path is to build verified workflow twins, use them for training and evaluation, and reserve real API access for controlled deployment. This is especially relevant where workflows involve multi-step state changes, permissions, hidden user goals, and costly mistakes.

What remains uncertain is equally important.

First, the evidence is benchmark-based. τ²-Bench and VitaBench are useful, but they are not your company’s ERP, claims system, procurement portal, or delightfully ancient spreadsheet macro from 2014 that everyone fears but nobody removes.

Second, ScaleEnv depends on strong LLMs for synthesis, testing, debugging, user simulation, and expansion decisions. The paper reports token costs, but not a complete commercial operating model for maintaining these environments over time.

Third, the generated environments are only as valuable as their fidelity. A workflow twin that omits a rare compliance exception may train an agent to fail exactly where failure is most expensive.

Fourth, ScaleEnv is domain-agnostic, which is both powerful and risky. The authors’ impact statement notes that arbitrary environment synthesis could be misused to model harmful behavior. In business settings, the more immediate risk is less cinematic: someone generates a workflow world that looks valid but encodes the wrong policy.

Finally, Pass@4 improvements should not be mistaken for guaranteed production reliability. A model that can find one correct solution in four attempts may still be unacceptable for workflows requiring deterministic first-attempt correctness.

Prompts teach manners; worlds teach consequences

ScaleEnv’s contribution is not that it discovered agents need tools. That part has been obvious for a while. The contribution is a more operational claim: agents need worlds where tool use has consequences, where consequences are verifiable, and where exploration does not collapse because the environment only supports the designer’s favorite path.

For business automation, this shifts the design question.

The old question was: “How do we prompt the agent to follow the workflow?”

The better question is: “Can we build a safe, executable version of the workflow where the agent can learn, fail, recover, and be judged by final state?”

That is a harder question. It requires schemas, tests, database logic, synthetic states, reward design, and maintenance. It is also much closer to how real automation survives.

ScaleEnv is not a deployment recipe by itself. It is a research framework with benchmark evidence, synthetic-domain assumptions, and real cost boundaries. But its central lesson is sturdy: if an agent is expected to operate in a business process, do not train it only on descriptions of that process.

Give it a world.

Then see what it does when the world pushes back.

Cognaptus: Automate the Present, Incubate the Future.

¹ Dunwei Tu et al., “ScaleEnv: Scaling Environment Synthesis from Scratch for Generalist Interactive Tool-Use Agent Training,” arXiv:2602.06820, 2026. https://arxiv.org/abs/2602.06820
