Opening — Why this matters now

Enterprise AI has entered an awkward phase. On paper, frontier LLMs can reason, plan, call tools, and even complete multi-step tasks. In practice, they quietly break things.

Not loudly. Not catastrophically. Just enough to violate a policy, invalidate a downstream record, or trigger a workflow no one notices until audit season.

This gap—between apparent task success and actual system correctness—is exactly what World of Workflows (WoW) sets out to expose. And it does so with uncomfortable precision.

Background — The illusion of “enterprise-ready” agents

Most existing enterprise benchmarks look enterprise-themed but behave like consumer tasks. They test:

  • UI navigation
  • Tool calling accuracy
  • Instruction following

What they don’t test is the defining feature of real enterprise systems: hidden workflows with cascading side effects.

In platforms like ServiceNow, SAP, or internal ERP stacks, a single API call rarely means a single state change. It means:

  • Business rules firing
  • Workflows triggering other workflows
  • Silent database mutations across unrelated tables

Previous benchmarks flatten this reality. WoW does not.
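
To make the cascade concrete, here is a toy sketch in Python. The rules, tables, and field names are invented for illustration, not taken from the paper; the point is that one write fires business rules that fire further writes, and the record the agent "changed" can end up somewhere else entirely:

```python
# Toy simulation of cascading side effects behind a single API call.
# All rule names, tables, and fields here are hypothetical.

from collections import defaultdict

# In-memory "database": table name -> record id -> field dict.
db = defaultdict(dict)
db["incident"]["INC001"] = {"state": "open", "priority": 3, "assigned_to": None}

audit_log = []  # (table, record_id, field, old_value, new_value)

def write(table, record_id, field, value):
    """Apply one field change, record it, then fire every matching rule."""
    old = db[table][record_id].get(field)
    if old == value:
        return
    db[table][record_id][field] = value
    audit_log.append((table, record_id, field, old, value))
    for rule in BUSINESS_RULES:
        rule(table, record_id, field, value)

def rule_close_requires_assignee(table, record_id, field, value):
    """Business rule: an unassigned incident cannot stay closed."""
    if table == "incident" and field == "state" and value == "closed":
        if db[table][record_id]["assigned_to"] is None:
            write(table, record_id, "state", "open")  # silently reverted!

def rule_escalate_creates_task(table, record_id, field, value):
    """Workflow: escalating to priority 1 spawns a record in another table."""
    if table == "incident" and field == "priority" and value == 1:
        db["task"][f"TASK-{record_id}"] = {"parent": record_id, "state": "new"}
        audit_log.append(("task", f"TASK-{record_id}", "state", None, "new"))

BUSINESS_RULES = [rule_close_requires_assignee, rule_escalate_creates_task]

# The agent makes TWO calls; the system mutates records across two tables,
# and the final state contradicts what the tool responses implied.
write("incident", "INC001", "state", "closed")
write("incident", "INC001", "priority", 1)
print(db["incident"]["INC001"]["state"])  # "open" -- the close was reverted
print(audit_log)                          # every hidden mutation, in order
```

The agent's tool calls both "succeed", yet the incident is still open and an extra task now exists. That is the flattened reality previous benchmarks ignore.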

Analysis — What WoW actually builds

WoW is a ServiceNow-based enterprise environment with:

  • 4,000+ business rules
  • 55 active workflows
  • A large, partially observable relational database

On top of this, the authors introduce WoW-bench, a 234-task benchmark designed to test not whether an agent can act, but whether it can model enterprise dynamics.

The tasks fall into four categories:

| Category | What it tests |
| --- | --- |
| Constraint Understanding | Can agents detect violations caused by hidden workflows? |
| Agentic Task Completion | Can agents finish long tasks without triggering silent failures? |
| Audit Prediction | Can agents predict downstream database state changes? |
| Action Prediction | Can agents infer which action caused an observed state change? |

Crucially, WoW introduces two observation modes:

  • Tool Response: Standard API feedback (what most agents see today)
  • Audit Logs (Oracle): Explicit table-level state diffs after each action

This split allows a clean diagnosis: are agents failing because they reason poorly—or because they see too little?
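
In code terms, the two modes might look like this. The payload shapes below are hypothetical sketches; the paper's exact schemas may differ:

```python
# Sketch of the two observation modes. Field names are assumptions,
# chosen to mirror the toy simulation above.

from dataclasses import dataclass
from typing import Any

@dataclass
class ToolResponse:
    """What most agents see today: the API said yes, and little else."""
    status: int           # e.g. 200
    body: dict[str, Any]  # the record the agent asked about

@dataclass
class AuditLogEntry:
    """Oracle mode: one table-level diff per mutation the action caused."""
    table: str
    record_id: str
    field_name: str
    old_value: Any
    new_value: Any

# The same close-incident action, seen two ways.
tool_view = ToolResponse(status=200, body={"sys_id": "INC001", "state": "closed"})

audit_view = [
    AuditLogEntry("incident", "INC001", "state", "open", "closed"),
    AuditLogEntry("incident", "INC001", "state", "closed", "open"),  # rule reverted it
    AuditLogEntry("task", "TASK-INC001", "state", None, "new"),      # workflow side effect
]

# Under Tool Response, the agent believes it succeeded.
# Under Audit Logs, the revert and the spawned task are visible.
```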

Findings — The numbers are not kind

The headline result is blunt: frontier LLMs are dynamics-blind.

1. Task success collapses under real constraints

Even when agents complete tasks, they often violate constraints invisibly.

| Model | TSR (Tool) | TSRUC (Tool) | TSR (Audit) | TSRUC (Audit) |
| --- | --- | --- | --- | --- |
| GPT‑5.1 | 22% | 2% | 32% | 14% |
| Gemini‑3‑Pro | 38% | 6% | 42% | 16% |
| Sonnet‑4.5 | 32% | 4% | 58% | 30% |

Translation: without explicit state visibility, agents almost always break rules while thinking they succeeded.
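
Reading TSRUC the way the framing above suggests, i.e. a task counts as successful only if it finishes and violates no constraints, the gap between the two metrics is easy to reproduce. A minimal sketch, with illustrative field names:

```python
# Assumption: TSRUC counts a task as successful only when the goal is
# reached AND no constraint was violated along the way.

def tsr(results):
    """Plain task success rate: did the agent reach the goal state?"""
    return sum(r["goal_reached"] for r in results) / len(results)

def tsr_uc(results):
    """Success under constraints: goal reached and no hidden violations."""
    ok = sum(r["goal_reached"] and not r["violations"] for r in results)
    return ok / len(results)

results = [
    {"goal_reached": True,  "violations": ["closed_without_assignee"]},
    {"goal_reached": True,  "violations": []},
    {"goal_reached": False, "violations": []},
    {"goal_reached": True,  "violations": ["orphaned_task"]},
]

print(f"TSR   = {tsr(results):.0%}")    # 75% -- looks respectable
print(f"TSRUC = {tsr_uc(results):.0%}") # 25% -- most "successes" broke a rule
```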

2. World modeling accuracy is near zero

When asked to predict what actually changes in the database after an action:

  • Full audit prediction accuracy hovers between 5% and 9%
  • Multi-step rollouts degrade rapidly

Agents guess plausible effects. They do not simulate the system.
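
A strict exact-match scorer makes it obvious why plausible guessing scores so low. A sketch, using a hypothetical diff representation rather than the paper's:

```python
# Full audit prediction is unforgiving: the agent must predict the COMPLETE
# set of table-level diffs an action causes; anything missing or extra fails.

def exact_audit_match(predicted, actual):
    """Full credit only if the predicted diff set equals the actual one."""
    return set(predicted) == set(actual)

actual = {
    ("incident", "INC001", "state", "open", "closed"),
    ("incident", "INC001", "state", "closed", "open"),  # rule reverted it
    ("task", "TASK-INC001", "state", None, "new"),      # workflow side effect
}

# A plausible-sounding prediction that ignores the hidden cascade:
predicted = {("incident", "INC001", "state", "open", "closed")}

print(exact_audit_match(predicted, actual))  # False: plausible != simulated
```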

3. The real failure modes

The paper identifies three structural gaps:

  1. Representation Gap — Agents confuse names with IDs, text with entities.
  2. Dynamics Gap — No internal transition model of workflows.
  3. Causal Gap — No multi-hop rollout to foresee delayed violations.

In short: agents optimize for immediate tool success, not long-term system validity.

Implications — Why prompt engineering won’t save this

WoW delivers an uncomfortable conclusion for enterprise AI builders:

Reliability is not a prompting problem. It is a modeling problem.

If your agent cannot internally simulate:

  • How workflows propagate
  • How state mutates invisibly
  • How constraints emerge after success

Then more context, longer prompts, or better tool schemas won’t help.

The paper points toward a different future:

  • Explicit state abstractions, not raw text memory
  • Predictive world models, not reactive tool chains
  • Active probing, not blind execution

This is closer to model-based reinforcement learning than to today’s chat-driven agents.
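
What could that loop look like? A minimal sketch, with a toy environment and a tabular stand-in for a learned world model (all interfaces here are assumptions, not the paper's design): predict the state diff before acting, execute, read the audit log, and treat any mismatch as a training signal.

```python
# Model-based step: predict -> act -> compare against the audit log.

class TabularWorldModel:
    """Remembers which diffs each (state, action) pair actually produced."""
    def __init__(self):
        self.transitions = {}  # (state, action) -> frozenset of diffs

    def predict(self, state, action):
        return self.transitions.get((state, action), frozenset())

    def update(self, state, action, diffs):
        self.transitions[(state, action)] = frozenset(diffs)

def model_based_step(model, execute, state, action):
    predicted = model.predict(state, action)    # internal rollout first
    actual = frozenset(execute(state, action))  # act, then read the audit log
    surprise = predicted ^ actual               # symmetric difference
    if surprise:                                # the dynamics model was wrong
        model.update(state, action, actual)
    return actual, surprise

# Toy environment: closing an unassigned incident gets silently reverted.
def execute(state, action):
    if action == "close" and state == "unassigned":
        return [("incident", "state", "closed"), ("incident", "state", "open")]
    return [("incident", "state", "closed")]

model = TabularWorldModel()
_, surprise = model_based_step(model, execute, "unassigned", "close")
print(bool(surprise))  # True: the first attempt surprises the model
_, surprise = model_based_step(model, execute, "unassigned", "close")
print(bool(surprise))  # False: the model now anticipates the revert
```

The second attempt is no longer surprised: the agent has learned a transition, not just a tool call. That is the shift the paper is pointing at.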

Conclusion — Acting is easy, understanding is rare

World of Workflows does not argue that LLMs are useless in enterprises. It argues something sharper:

Today’s agents act convincingly while understanding almost nothing.

By forcing models to confront hidden workflows, WoW exposes the missing layer in enterprise AI: dynamics awareness. Until agents can reason about invisible state transitions, autonomy will remain fragile, expensive, and quietly dangerous.

WoW doesn’t just benchmark agents. It benchmarks our assumptions.

Cognaptus: Automate the Present, Incubate the Future.