World Models Meet the Office From Hell

Office software has a special talent: it says “success” at the exact moment something has gone wrong somewhere else.

A ticket is updated. A role is assigned. An asset is transferred. The API returns a cheerful confirmation. The agent, bless its silicon heart, declares victory.

Then a background workflow fires. A user’s clearance changes. Another workflow reacts to that clearance change. A different record is silently updated. A constraint is now violated. The agent does not notice, because the agent saw the office equivalent of a green checkmark and mistook it for reality.

That is the trap at the center of “World of Workflows: A Benchmark for Bringing World Models to Enterprise Systems,” which introduces WoW, a ServiceNow-based enterprise environment, and WoW-bench, a benchmark designed to test whether frontier LLM agents can reason about hidden workflow dynamics rather than merely call tools with acceptable grammar.¹

The paper’s point is not that enterprise agents are bad at using APIs. That would be too easy, and frankly less interesting. The sharper finding is that modern agents can execute locally valid actions while failing to model the invisible state transitions those actions trigger. In real enterprise systems, that is not a minor evaluation artifact. That is the job.

The failure begins after the tool succeeds

The cleanest way to understand WoW is to start with the paper’s asset-clearance example.

A user has clearance level 4 and already owns three assets. The agent is asked to assign two more assets, both requiring clearance level 4. Looking only at the visible state before the first action, assigning the first asset appears valid. The user has sufficient clearance. The tool call succeeds.

But the system contains a hidden workflow: if a user has more than three assets, their clearance is decremented by one. That first successful assignment pushes the user over the threshold, so the workflow lowers the user’s clearance from 4 to 3. That clearance change then triggers a second workflow, which removes assets whose required clearance now exceeds the user’s clearance.

The agent’s local action was valid. The system-level outcome was not.
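The cascade is easier to see when it is written down. What follows is a minimal sketch, not the paper's environment: the `User` class, the `assign_asset` function, the asset names, and the clearance levels of the three existing assets are all invented for illustration. Only the two workflow rules and the 4-to-3 clearance drop come from the example above.

```python
from dataclasses import dataclass, field

ASSET_LIMIT = 3  # from the example: more than three assets triggers the decrement


@dataclass
class User:
    clearance: int
    assets: dict[str, int] = field(default_factory=dict)  # asset name -> required clearance


def assign_asset(user: User, name: str, required: int) -> str:
    """Assign an asset, then let the two hidden workflows run."""
    if required > user.clearance:
        return "error: insufficient clearance"
    user.assets[name] = required

    # Hidden workflow A: owning too many assets lowers clearance by one.
    if len(user.assets) > ASSET_LIMIT:
        user.clearance -= 1
        # Hidden workflow B: the clearance change removes assets the user
        # no longer qualifies for.
        user.assets = {n: r for n, r in user.assets.items() if r <= user.clearance}

    return "success"  # the tool response says nothing about either workflow


# Hypothetical starting state: clearance 4, three assets already owned.
user = User(clearance=4, assets={"a1": 2, "a2": 3, "a3": 1})
response = assign_asset(user, "a4", required=4)
print(response, user.clearance, sorted(user.assets))
```

In this toy version the first call returns "success" even though the new asset is already gone by the time the function returns, and a second level-4 assignment would now be rejected outright.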

This is the mechanism the paper wants readers to notice. Enterprise software is not a flat spreadsheet with a chatbot attached. It is a partially observable, workflow-governed state machine. A single action does not simply produce a response; it induces a transition:

$$ s_t \xrightarrow{a_t} s_{t+1} $$

The problem is that the agent usually sees only a small observation of that transition. In the standard setting, it sees the tool response: success, failure, fetched record, error message, and perhaps some visible fields. It does not see the full relational database diff. It does not see every triggered workflow. It does not see the hidden causal chain.

So the benchmark asks a more demanding question than “Can the model choose the right tool?” It asks: can the model mentally simulate the enterprise system well enough to avoid damage when the system does not explain itself?

That is a much nastier question. Naturally, it is the useful one.

WoW turns enterprise software into a world-model test

WoW is built on a ServiceNow developer instance. The environment includes IT-style business domains such as users, incidents, assets, knowledge bases, catalog items, and expenses. Agents interact through MCP tools rather than a visual UI. That is a sensible design choice: in many serious enterprise automation deployments, APIs are the natural interface, and UI navigation errors are an expensive distraction from the deeper problem.

The environment contains more than 4,000 business rules, reported in the paper’s environment section as 4.8K, and 55 active workflows. The appendix also notes that the authors extended an existing ServiceNow MCP server to 108 tools covering create, read, update, and delete operations. This matters because it prevents the benchmark from becoming a toy tool-selection exercise. The action space still contains free-form parameters, IDs, timestamps, categorical choices, and text fields. One wrong identifier can turn a good plan into enterprise confetti.

WoW-bench contains 234 tasks:

| Task category | Count | What it tests | Why it matters |
| --- | --- | --- | --- |
| Constraint understanding | 50 | Whether the model can identify violations caused by hidden workflows | Tests whether “policy compliance” survives system dynamics |
| Agentic task completion | 50 | Whether the model can complete long-horizon tasks under constraints | Tests operational reliability, not just surface completion |
| Action prediction | 67 | Whether the model can infer the action that caused observed state changes | Tests inverse dynamics: from state diff back to cause |
| Audit prediction | 67 | Whether the model can predict database diffs caused by actions | Tests forward dynamics: from action to hidden consequences |

The important split is between two observation modes.

The first is ordinary tool feedback, $O_{\text{tool}}$. This is what most agents effectively get: the output of the API call.

The second is table-audit feedback, $O_{\text{audit}}$. This augments tool feedback with structured database changes. Each audit item is a tuple containing table name, column name, old value, and new value. In other words, the agent sees a tractable version of the state diff.

This second mode is not a magic production feature. The authors treat it as an ablation: a way to test whether agents fail because they lack information, or because they cannot reason even when the relevant state changes are exposed. That distinction is the spine of the paper.
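For readers who think in data structures, the audit tuple described above might look like the sketch below. The `AuditItem` type, the `diff_snapshots` helper, and the one-row-per-table snapshot format are assumptions made for brevity; the paper specifies only the four fields of the tuple, not how the diff is produced.

```python
from typing import Any, NamedTuple


class AuditItem(NamedTuple):
    table: str
    column: str
    old: Any
    new: Any


# Simplified snapshot: table name -> {column: value}, i.e. one row per table.
Snapshot = dict[str, dict[str, Any]]


def diff_snapshots(before: Snapshot, after: Snapshot) -> list[AuditItem]:
    """Return one audit item per column whose value changed."""
    items = []
    for table in sorted(before.keys() | after.keys()):
        old_row, new_row = before.get(table, {}), after.get(table, {})
        for column in sorted(old_row.keys() | new_row.keys()):
            old, new = old_row.get(column), new_row.get(column)
            if old != new:
                items.append(AuditItem(table, column, old, new))
    return items


# Illustrative table and column names, not taken from the benchmark.
before = {"user": {"clearance": 4}, "asset": {"assigned_to": None}}
after = {"user": {"clearance": 3}, "asset": {"assigned_to": None}}
audit = diff_snapshots(before, after)
print(audit)
```

The point of the structure is that a change the tool response never mentioned, such as the clearance drop here, becomes an explicit item the agent can read.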

The audit-log ablation is the paper’s diagnostic lever

A weak summary of WoW would say: “Agents do better with audit logs.” True, but too shallow.

The more useful reading is this: audit logs separate observability failure from reasoning failure.

If agents perform badly with normal tool responses but much better with audit logs, then the deployment problem is partly about visibility. They need better state-diff access, monitoring, memory, or instrumentation.

If agents still perform badly with audit logs, then the problem is not merely visibility. It is symbolic tracking, causal rollout, and transition modeling.

WoW finds both.

In agentic task completion, the paper reports ordinary task success rate (TSR) and task success rate under constraint (TSRUC). TSR asks whether the task goal was achieved. TSRUC asks whether it was achieved without violating hidden constraints. That second metric is the one enterprise buyers should care about, unless their compliance strategy is “the demo looked fine.”
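The difference between the two metrics fits in a few lines. This is a sketch under assumptions: the `Episode` record and the rule that a violation is counted whenever any hidden constraint is broken are simplifications, and the paper's exact scoring procedure may differ. The four episodes are made-up data.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    goal_achieved: bool
    violations: int  # hidden-constraint violations recorded for this episode


def tsr(episodes):
    """Task success rate: share of episodes where the goal was achieved."""
    return sum(e.goal_achieved for e in episodes) / len(episodes)


def tsruc(episodes):
    """Task success rate under constraint: goal achieved with zero violations."""
    return sum(e.goal_achieved and e.violations == 0 for e in episodes) / len(episodes)


# Made-up episodes: three reach the goal, but only one does so cleanly.
episodes = [Episode(True, 0), Episode(True, 2), Episode(True, 1), Episode(False, 0)]
print(tsr(episodes), tsruc(episodes))  # 0.75 0.25
```

By construction TSRUC can never exceed TSR, so the interesting number is how far below it falls.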

| Model | TSR with audit | TSRUC with audit | TSR with tool response | TSRUC with tool response |
| --- | --- | --- | --- | --- |
| GPT-5.1 | 32% | 14% | 22% | 2% |
| Gemini-3-Pro | 42% | 16% | 38% | 6% |
| Sonnet-4.5 | 58% | 30% | 32% | 4% |
| Opus-4.5 | 36% | 14% | 26% | 8% |

The pattern is blunt. Standard tool feedback allows agents to achieve some task success, but reliability under constraints collapses. GPT-5.1 reaches 22% task success with tool responses, but only 2% task success under constraint. Gemini-3-Pro reaches 38% TSR, but only 6% TSRUC. Sonnet-4.5 reaches 32% TSR, but only 4% TSRUC.

The gap between “completed” and “completed safely” is not a rounding error. It is the paper.

Audit visibility improves the picture. Sonnet-4.5 rises to 58% TSR and 30% TSRUC with audit logs. GPT-5.1 rises from 2% to 14% TSRUC. Gemini-3-Pro rises from 6% to 16%.

But even with audit logs, the strongest reported TSRUC is 30%. That is the interesting discomfort. More visibility helps a lot. It does not make the agent reliable.

The benchmark therefore avoids two lazy conclusions. It does not say, “Just expose logs and the agent will be fine.” It also does not say, “Models are hopeless.” It says the system must be designed around dynamics: explicit state, causal monitoring, transition prediction, and rollback-aware control.

Apparently, “please be careful” was not an architecture after all.

The main results are about dynamics, not model ranking

Model leaderboards are tempting. They are also the least interesting part of this paper.

The main evidence is distributed across several tests, each serving a different purpose:

| Test or figure | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Figure 1 cascading workflow example | Mechanism illustration | Shows how locally valid actions can trigger hidden downstream violations | Does not quantify model performance |
| Agentic task table | Main evidence | Shows the gap between task completion and constrained reliability | Does not isolate whether every failure is from observability |
| Constraint understanding with tool vs audit observations | Main evidence plus ablation | Shows that state-diff visibility sharply improves violation detection | Does not imply audit access is cheap or always available |
| Action prediction | Dynamics diagnostic | Shows models often identify tool type but fail exact action reconstruction | Does not directly measure end-to-end task safety |
| Audit prediction IoU and exact match | Dynamics diagnostic | Shows weak forward modeling of state transitions | Does not mean models know nothing; partial IoU captures some partial structure |
| Single-trajectory constraint understanding appendix table | Additional evaluation | Shows simpler violation cases become much easier with audits | Does not solve compositional, long-context constraint reasoning |
| Tool-dependency graph sampling | Benchmark construction detail | Improves trajectory realism by sampling connected tool chains | Does not itself prove agents are reliable or unreliable |

This table matters because the paper’s experiments are not all making the same claim.

The agentic-task results are the operational warning. The audit-log comparison is the observability ablation. The action and audit prediction tasks isolate whether LLMs can behave like enterprise world models. The appendix tables sharpen the failure analysis: single violations are easier than compositional ones; exact state prediction remains weak; parameter grounding is a recurring source of error.

In other words, the paper is not merely saying “frontier models score low.” It is decomposing the failure.

That decomposition is exactly what enterprise teams need before they start pouring budget into longer prompts, prettier dashboards, and yet another agent orchestration layer with the word “copilot” taped onto it.

Tool names are not enough when parameters carry the world

The action prediction results are especially revealing.

The models do reasonably well at predicting the tool name. GPT-5.1 reaches 77.8% tool-name accuracy, Gemini-3-Pro 79.0%, and Sonnet-4.5 81.7%. If the benchmark stopped there, one might conclude that these agents understand the workflow reasonably well.

But full action accuracy, which requires the correct tool and correct parameters, falls to 27.3% for GPT-5.1, 18.3% for Gemini-3-Pro, and 28.1% for Sonnet-4.5.

That gap is the difference between “I know this is an update incident operation” and “I know exactly which incident record, user ID, or system object must be updated.” Enterprise systems live in that gap. A tool name is a verb. The parameter is the thing being acted upon. If the agent confuses a username with a sys_id, it is not making a cosmetic formatting mistake. It is losing the entity.
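A small scoring sketch shows how one wrong identifier opens that gap. The action format, the tool names `update_incident` and `create_incident`, the parameter names, and both identifiers are invented for illustration; the paper's own matching rules may be stricter or looser than plain equality.

```python
def _score(predicted, actual, match):
    """Fraction of aligned (predicted, actual) pairs accepted by `match`."""
    return sum(match(p, a) for p, a in zip(predicted, actual)) / len(actual)


def tool_name_accuracy(predicted, actual):
    return _score(predicted, actual, lambda p, a: p["tool"] == a["tool"])


def full_action_accuracy(predicted, actual):
    # Tool name and every parameter must match exactly.
    return _score(predicted, actual, lambda p, a: p == a)


# Hypothetical ground truth: the first action targets a record by sys_id.
actual = [
    {"tool": "update_incident",
     "params": {"incident_id": "a1b2c3d4e5f60718293a4b5c6d7e8f90", "state": "2"}},
    {"tool": "create_incident", "params": {"short_description": "VPN down"}},
]
# The prediction picks the right tools but uses the human-readable number.
predicted = [
    {"tool": "update_incident",
     "params": {"incident_id": "INC0010001", "state": "2"}},
    {"tool": "create_incident", "params": {"short_description": "VPN down"}},
]

print(tool_name_accuracy(predicted, actual), full_action_accuracy(predicted, actual))  # 1.0 0.5
```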

The paper’s error analysis gives concrete examples. Models confuse human-readable names with symbolic identifiers. For user IDs, the appendix reports that 73.5% of user_id failures involve predicting the username instead of the sys_id; for incident IDs, 91.2% of incident_id failures involve predicting the incident number instead of the sys_id.

This is why “semantic understanding” is not enough. In a relational system, “John Smith,” jsmith, and 6816f79cc0a8016401c5a33be04be441 may refer to the same human-facing entity, but they are not interchangeable operational objects. A reliable agent needs an entity model, not a vibes-based relationship with labels.

The business implication is direct: any enterprise agent that touches production systems needs an explicit entity-resolution layer. The model should not be trusted to preserve identity through raw text alone. It needs durable mappings, validated IDs, reference checks, and a state object that survives beyond the chat transcript.
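One possible shape for such a layer is sketched below. The `EntityRegistry` class and its methods are hypothetical, and the validation assumes a sys_id is a 32-character lowercase hexadecimal string, as in the example identifier above. A production version would resolve against the system of record rather than an in-memory dictionary.

```python
import re

# Assumption: a sys_id is a 32-character lowercase hex string.
SYS_ID = re.compile(r"[0-9a-f]{32}")


class EntityRegistry:
    """Maps every known label for an entity to its single canonical sys_id."""

    def __init__(self):
        self._by_alias = {}

    def register(self, sys_id, *aliases):
        if not SYS_ID.fullmatch(sys_id):
            raise ValueError(f"not a sys_id: {sys_id!r}")
        for alias in (sys_id, *aliases):
            self._by_alias[alias.lower()] = sys_id

    def resolve(self, reference):
        """Return the sys_id for a label, or fail loudly instead of guessing."""
        try:
            return self._by_alias[reference.lower()]
        except KeyError:
            raise LookupError(f"unknown entity: {reference!r}") from None


registry = EntityRegistry()
registry.register("6816f79cc0a8016401c5a33be04be441", "jsmith", "John Smith")
print(registry.resolve("John Smith"))  # 6816f79cc0a8016401c5a33be04be441
```

The design choice that matters is the failure mode: an unknown label raises an error before any tool call is made, rather than being passed through as if it were an identifier.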

Forward dynamics remain weak even when the task is isolated

Audit prediction asks the model to predict the state changes caused by actions. This is the forward-modeling task: given an action sequence, what database diffs should occur?

The results are not flattering. Average audit-prediction IoU is 18% for GPT-5.1 and 22% for both Gemini-3-Pro and Sonnet-4.5. Exact full audit prediction is harsher: in the appendix, all three models score 0% exact accuracy at $K=1$, and average exact accuracy across rollout steps remains in single digits: 5.4% for GPT-5.1, 8.6% for Gemini-3-Pro, and 8.1% for Sonnet-4.5.

The distinction between IoU and exact match is important. IoU gives credit for partially matching the set of audit changes. Exact full matching requires getting all fields and values right. In enterprise workflows, partial understanding is useful for diagnosis, but exactness is often what separates safe automation from a postmortem.
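The two scores can be sketched over sets of audit tuples. This assumes IoU is computed over whole (table, column, old, new) tuples; the paper may match at a different granularity. The table names come from the article, while the column names and values are invented.

```python
def audit_iou(predicted, actual):
    """Intersection over union of two collections of audit tuples."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        return 1.0  # nothing predicted, nothing happened: perfect agreement
    return len(predicted & actual) / len(predicted | actual)


def exact_match(predicted, actual):
    """True only if every audit tuple matches and none is extra or missing."""
    return set(predicted) == set(actual)


# What actually changed after a hypothetical "create incident" action.
actual = {
    ("incident", "state", None, "1"),
    ("task_sla", "stage", None, "in_progress"),
    ("metric_instance", "value", None, "1"),
}
# A plausible-sounding prediction: one hit, one false positive, two misses.
predicted = {
    ("incident", "state", None, "1"),
    ("incident", "priority", None, "3"),
}

print(audit_iou(predicted, actual), exact_match(predicted, actual))  # 0.25 False
```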

The paper also notes characteristic audit-prediction errors: wrong predictions, false positives, and false negatives. A model may predict side effects that never occur, miss side effects that do occur, or get values wrong in ways that look trivial to a human but matter to software. Confusing "0" with False may not impress a database trigger.

This is where the phrase “world model” earns its keep. A world model is not just a memory of what usually happens. It is a predictive model of transitions. In this setting, it means knowing that creating an incident may update incident, but also may produce records in metric_instance, task_sla, or other tables depending on system workflows.

The paper’s diagnosis is that models rely on plausible semantic priors. “Create incident” sounds like it should change the incident table. True. But the real enterprise system has physics, and those physics are written in workflows and business rules. Plausibility is not simulation.

The most dangerous agent is the greedy one that looks productive

The agentic-task failures are not only about missing logs or wrong IDs. They also reveal a planning pathology: agents optimize for immediate action success.

The paper calls out cascading failures of the form:

$$ \text{Action} \rightarrow \text{Workflow}_A \rightarrow \text{Workflow}_B \rightarrow \text{Violation} $$

That pattern is familiar to anyone who has worked near ERP, ITSM, HRIS, CRM, or finance systems. A local update changes a status. The status triggers a process. The process updates another record. The updated record changes eligibility, ownership, approval, SLA, billing, inventory, or access. Somewhere downstream, an invariant breaks.

The agent sees the first green checkmark and keeps going.

The appendix adds two details that should make enterprise teams pause. First, even when audit logs are available, models struggle to keep symbolic state updated over long trajectories. Second, when tasks are inherently impossible because of constraints, none of the models correctly identify that impossibility, despite being prompted to report such cases. The agents also generally do not attempt to reason about or revert constraint violations.

That is not merely a benchmark failure. It is an operating pattern.

A reliable enterprise agent should be able to say: “This task cannot be completed safely under the current constraints.” Today’s agents are more likely to proceed until something breaks invisibly. Very efficient. Very modern. Very much the sort of productivity that creates a governance meeting.

What this changes for enterprise agent design

The paper directly shows that frontier LLM agents perform poorly in WoW’s workflow-heavy ServiceNow environment, especially when judged by constrained reliability and dynamics prediction. Cognaptus’ business inference is broader: similar failure modes should be expected in any enterprise system where actions trigger hidden state changes across relational records.

That includes ERP, ITSM, CRM, HR, procurement, compliance, finance operations, and internal approval systems. The specific percentages should not be copied into a board deck as universal failure rates. The mechanism, however, travels well.

A practical architecture for enterprise agents should therefore treat tool execution as only one layer:

| Layer | What it must do | Why WoW makes it hard to ignore |
| --- | --- | --- |
| Tool interface | Execute validated CRUD or business actions | Tool success does not imply task safety |
| Entity memory | Track users, assets, incidents, roles, and IDs as persistent objects | Models confuse names, numbers, and system IDs |
| State-diff monitor | Capture before/after changes when possible | Audit visibility sharply improves constraint detection |
| Constraint engine | Evaluate explicit and derived policy rules after actions | TSR without constraints overstates reliability |
| Workflow simulator | Predict likely downstream effects before execution | Audit prediction results show weak internal dynamics |
| Rollback and repair planner | Revert or compensate when constraints are violated | Agents rarely reason about reverting violations |
| Probe policy | Test boundary conditions in safe sandboxes | Zero-shot assumptions are brittle in opaque systems |

This architecture is heavier than a simple tool-calling agent. That is the point.
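To make three of those layers concrete, here is a minimal sketch that combines a constraint check with a rollback decision around a single action. Everything in it is an assumption for illustration: `guarded_execute`, the dictionary-shaped state, and the idea that state can be copied at all. A live enterprise system would need a sandbox or compensating actions instead of `copy.deepcopy`.

```python
import copy


def guarded_execute(state, action, constraints):
    """Run `action` on a copy of `state`; commit only if all constraints hold.

    Returns (resulting_state, names_of_broken_constraints).
    """
    candidate = copy.deepcopy(state)
    action(candidate)  # the tool call plus whatever workflows it triggers
    broken = [name for name, holds in constraints.items() if not holds(candidate)]
    if broken:
        return state, broken  # roll back by keeping the untouched original
    return candidate, []


def assign_a4(state):
    """Hypothetical action: assign a level-4 asset, with one hidden side effect."""
    state["assets"]["a4"] = 4
    if len(state["assets"]) > 3:
        state["clearance"] -= 1  # hidden workflow lowers clearance


constraints = {
    "assets_within_clearance": lambda s: all(
        required <= s["clearance"] for required in s["assets"].values()
    ),
}

state = {"clearance": 4, "assets": {"a1": 2, "a2": 3, "a3": 1}}
state, broken = guarded_execute(state, assign_a4, constraints)
print(broken)  # ['assets_within_clearance']
```

The action itself would have reported success. The guard is what notices that the resulting state is not one the policy allows.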

If a vendor promises production-grade enterprise autonomy with “LLM + tools + prompt,” WoW gives the buyer a better question: where is the state model? Where is the workflow simulator? Where is the constraint monitor? Where is the audit-diff loop? Where is the rollback plan?

If the answer is “the model reasons about it,” ask to see the logs. Preferably before audit season.

The ROI is cheaper diagnosis before expensive autonomy

The business value of this paper is not that WoW gives companies an immediate plug-and-play deployment recipe. It does not. The benchmark is an evaluation environment, not a finished enterprise agent platform.

Its value is diagnostic. It helps separate four questions that are often confused:

Can the agent choose the right tool?
Can it bind the right parameters and entities?
Can it observe the state changes that matter?
Can it predict hidden transitions before acting?

Most demos answer the first question. Some answer the second. Few seriously test the third and fourth.

This matters for ROI because failed enterprise automation is expensive in boring ways. Not cinematic “AI destroys the company” ways. More like: exception handling expands, compliance review slows deployment, engineers build custom patches, domain experts babysit workflows, and the automation team quietly reduces scope until the agent becomes a glorified search box.

WoW suggests a better investment sequence.

Start with constrained domains where workflows are mapped, audit logs are available, and rollback is safe. Instrument the system so state diffs are visible. Build entity-level memory and validation before adding broader autonomy. Use sandbox probing to learn workflow triggers. Evaluate agents by constrained success, not surface completion.

Only then expand autonomy.

This is less glamorous than announcing an agentic transformation initiative. It is also less likely to produce a spreadsheet full of invisible violations. One must choose one’s aesthetic.

The boundary: ServiceNow is not the whole enterprise universe

The paper’s limitations are material and should not be polished away.

WoW is based on a ServiceNow-centered environment. Its tasks, constraints, and workflows are curated by experts. The benchmark covers a subset of possible workflow dynamics inside a single enterprise platform. The authors also state that WoW is designed primarily as an evaluation benchmark rather than a training environment, though its interactive design may support future training-time work.

So the correct conclusion is not: “All enterprise agents will fail at exactly these rates.”

The correct conclusion is: “Workflow-governed enterprise systems create a class of hidden-dynamics problems that current frontier agents handle poorly under this benchmark.”

That boundary is enough. The benchmark does not need to represent every enterprise stack to be useful. It identifies a failure mode that is structurally common: partial observability plus hidden workflows plus relational state plus constraints. ServiceNow is one office from hell. There are many others, lovingly customized by operations teams over the last twenty years.

The appendix quietly reinforces the main story

The appendix is not a second thesis. It mostly sharpens the diagnosis.

The constraint list shows that the benchmark is built around realistic policy-like rules: flagged articles should not be published, users should not be assigned the same role as their managers, assets should not move across countries without approval, high-priority requests should not be rejected for certain reasons, and clearance constraints should govern asset transfers.

The annotation process also matters. Constraint tasks are derived from expert-designed trajectories that intentionally trigger hidden workflow violations. Agentic tasks are built from templates and permutations of those constraints. Action and audit prediction trajectories come from a mix of manual creation, tool-dependency graph sampling, and random sampling. The tool-dependency graph is especially important because random tool calls often fail to reach meaningful workflow states. Real side effects emerge from connected action chains over the same entities.
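The idea behind dependency-graph sampling can be sketched as a random walk over a graph of tools. This is an illustration of the general technique, not the authors' procedure: the graph, the tool names, and the `sample_chain` function are all hypothetical.

```python
import random

# Hypothetical dependency graph: an edge A -> B means tool B can act on an
# entity that tool A created or modified.
TOOL_GRAPH = {
    "create_user": ["assign_role", "assign_asset", "create_incident"],
    "assign_role": ["assign_asset"],
    "assign_asset": ["transfer_asset"],
    "create_incident": ["update_incident"],
    "transfer_asset": [],
    "update_incident": [],
}


def sample_chain(graph, start, max_len, rng):
    """Random walk along dependency edges, stopping at a dead end or max_len."""
    chain = [start]
    while len(chain) < max_len and graph[chain[-1]]:
        chain.append(rng.choice(graph[chain[-1]]))
    return chain


chain = sample_chain(TOOL_GRAPH, "create_user", max_len=4, rng=random.Random(0))
print(chain)
```

Every step in a sampled chain can operate on something an earlier step produced, which is what uniformly random tool calls tend to miss.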

The appendix tables then reinforce three points.

First, audit visibility helps dramatically on simpler constraint-understanding cases. In the single-trajectory constraint table, models score 70–80% with audit observations and only 10–20% with ordinary tool observations.

Second, compositional cases remain hard. The paper notes that compositional constraint tasks can reach much longer contexts, with average trajectory lengths around 109K tokens versus 53K for single-violation cases. More logs create more evidence, but also more state-tracking burden.

Third, exact audit prediction remains weak. This is the dynamics problem again: models can sometimes guess partial changes, but they do not reliably reconstruct the full state transition.

The appendix therefore supports the main argument rather than distracting from it. Visibility helps. State tracking remains hard. Hidden workflows are not solved by enthusiasm.

From reactive agents to dynamics-aware systems

The paper’s proposed direction is dynamics-aware enterprise agents.

That phrase can sound abstract, so let’s make it operational. A dynamics-aware agent should not merely ask, “What tool should I call next?” It should ask:

What entities are currently involved?
What state variables matter for the task and constraints?
What hidden workflows might trigger if I act?
What state diff do I expect after the action?
What did the actual state diff show?
Did the action make future steps unsafe?
Should I continue, repair, roll back, or stop?

This is closer to model-based control than chat automation. The agent needs a predictive layer that simulates state transitions before execution and updates its beliefs after execution. It needs persistent symbolic grounding, not just a long context window. It needs active probing in safe environments because zero-shot behavior inside a customized enterprise stack is not strategy; it is gambling with better typography.
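The middle of that checklist, comparing the expected diff with the actual one and deciding what to do next, can be sketched as a single review function. The decision policy here is an assumption chosen for illustration: an unpredicted change to a protected field triggers a rollback, any other mismatch triggers a replan, and a perfect match continues. The `Decision` enum, `review_step`, and the field names are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    REPLAN = "replan"
    ROLL_BACK = "roll back"


def review_step(expected, actual, protected_fields):
    """Compare predicted and observed audit tuples after one action.

    `expected` and `actual` are sets of (table, column, old, new) tuples;
    `protected_fields` is a set of (table, column) pairs tied to constraints.
    """
    surprises = actual - expected  # side effects the agent did not predict
    missing = expected - actual    # predicted effects that never happened
    if any((table, column) in protected_fields for table, column, _, _ in surprises):
        return Decision.ROLL_BACK, surprises
    if surprises or missing:
        return Decision.REPLAN, surprises | missing
    return Decision.CONTINUE, set()


expected = {("asset", "assigned_to", None, "u1")}
actual = expected | {("user", "clearance", 4, 3)}
protected = {("user", "clearance")}

decision, evidence = review_step(expected, actual, protected)
print(decision, evidence)
```

A reactive agent has no `expected` set to compare against, which is why it cannot be surprised, and therefore cannot stop.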

The paper does not prove that model-based reinforcement learning is the only answer. It does, however, make a strong case that purely reactive tool chains are underpowered for workflow-heavy enterprise systems.

The real benchmark is not whether the agent acts

WoW-bench is useful because it attacks a comfortable assumption: that enterprise autonomy is mainly about better instruction following, broader tool access, and cleaner integration.

Those things matter. They are not enough.

The hard part is the office after the click. The hidden rule. The downstream table. The invisible workflow. The constraint that becomes violated only after a successful local action changes the world.

That is where current agents struggle. They can act. They can often act plausibly. Sometimes they can even act cheaply. But reliability requires something more specific: the ability to represent symbolic entities, predict state transitions, monitor hidden consequences, and reason across causal chains.

World models meet the office from hell. The office wins, for now.

Cognaptus: Automate the Present, Incubate the Future.

¹ Lakshya Gupta, Litao Li, Yizhe Liu, Sriram Ganapathi Subramanian, Kaheer Suleman, Zichen Zhang, Haoye Lu, and Sumit Pasupalak, “World of Workflows: A Benchmark for Bringing World Models to Enterprise Systems,” arXiv:2601.22130, 2026, https://arxiv.org/abs/2601.22130.
