TL;DR for operators
Most LLM agents still behave like overconfident interns with a browser: observe, guess the next action, click, apologise, repeat. SiRA proposes a more serious pattern. Before acting, the agent writes down a belief state, proposes several high-level candidate actions, simulates likely future states with an LLM-based world model, scores those futures against the goal, and only then converts the selected intent into an executable browser action.[1]
That distinction matters. The paper is not claiming that “more chain-of-thought” magically fixes agents. Its sharper claim is that planning needs an explicit counterfactual loop: what would happen if I did this? In the constrained flight-search benchmark introduced by the authors, OpenHands BrowsingAgent scored 0% correct, a matched reactive SiRA variant scored 14.4%, and full SiRA with simulative reasoning reached 32.2%. Strong reasoning models used as unstructured chain-of-thought planners, including o1 and o3-mini, still produced near-zero success in this setting. Apparently, thinking harder in a straight line is not the same as checking alternate futures. Who knew.
For enterprise use, the lesson is architectural. If the task involves many constraints, partial observability, brittle interfaces, long workflows, or multi-source synthesis, a single prompt loop is the wrong abstraction. Separate perception, state, memory, planning, simulation, critique, and execution. That does not make agents reliable by decree, but it gives failure somewhere to live.
The boundary is also practical. SiRA is still a research prototype. It uses text-heavy browser observations, LLM-based evaluation in FlightQA, limited samples in some experiments, and inference-time search over candidate plans. Captchas, browser crashes, anti-scraping defences, multimodal gaps, and model-version sensitivity remain very real. The business value is not “deploy this paper tomorrow.” It is: stop treating agent reliability as a prompt-writing problem.
The familiar failure: agents that act before they understand
A travel request is a good test because it looks trivial until it is not. “Find me a one-way flight from New York to Los Angeles after 6 PM, under a budget, with a preferred airline, no long layover, and arriving before a certain time” is not one instruction. It is a bundle of constraints that must survive interface changes, partial search results, scrolling, filters, delayed page updates, and the agent’s own memory lapses.
The usual LLM-agent pattern handles this badly. It observes the page, generates a thought, chooses a concrete action, observes again, and repeats. This can work when the next action is obvious. It becomes fragile when the agent must evaluate trade-offs, avoid premature commitment, remember what it has already checked, and distinguish “not visible yet” from “not available.”
SiRA’s key contribution is to make the agent pause at the right level of abstraction. It does not merely ask the model to produce a longer rationale. It asks the system to simulate candidate futures before taking the next real action.
That sounds subtle. It is not. It changes the unit of planning.
A reactive agent asks:
Given what I see now, what should I do next?
SiRA asks:
Given what I believe the state is, what high-level action could I take, what state would likely follow, and which future gets me closer to the goal?
This is the difference between clicking because the button looks relevant and mentally testing whether clicking the button will expose the missing constraint, commit to the wrong option, or waste the remaining action budget. One is reaction wearing a lab coat. The other is planning.
SiRA’s mechanism is a loop of controlled imagination
The paper frames SiRA as a model-agnostic architecture. In the experiments, the modules are implemented with LLMs, but the architectural claim is broader: any system that can encode observations, simulate state transitions, evaluate progress, and execute actions could instantiate the pattern.
The working loop has six useful parts:
| Module | What it does | Operational meaning |
|---|---|---|
| Encoder | Converts the current browser observation into a natural-language belief state | Compresses messy interface evidence into a state the planner can reason over |
| Memory | Stores selected summaries and previous simulated actions | Prevents the agent from treating every page as its first day at work |
| Policy / proposer | Generates candidate high-level simulated actions | Creates options before commitment |
| World model | Predicts the next belief state after each candidate intent | Allows counterfactual evaluation without touching the real page |
| Critic | Scores simulated terminal states against the goal | Selects futures based on goal progress, not vibe |
| Actor | Converts the chosen high-level intent into a concrete API/browser action | Keeps planning abstract while execution remains grounded |
The important design move is the separation between simulated actions and executable actions. SiRA does not simulate at the level of “click element 638” or “fill textbox 576.” It simulates intentions such as selecting a filter, exploring a result, changing a departure field, or searching within a site. Only after the planner chooses the best intent does the actor ground that intent into a concrete browser command.
This avoids a common trap in agent design: forcing the planning layer to reason over low-level interface mechanics. Atomic browser actions are noisy, site-specific, and brittle. Higher-level intents are more stable. A flight-search page may change its element IDs, layout, and widget structure; the intent “set the origin city” remains meaningful.
The paper’s mechanism therefore has two forms of compression. First, the raw observation becomes a natural-language belief state. Second, concrete actions become abstract simulated intentions. Both compressions are lossy in the ordinary engineering sense, but they make planning tractable. The agent does not need to imagine every pixel-level future. It needs to imagine enough of the next useful state.
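The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `plan_step` and `Memory` and all the toy stand-in functions are hypothetical, and the sketch looks only one step ahead, whereas the paper's planner scores simulated terminal states.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Memory:
    """Belief summaries and chosen intents carried across steps."""
    beliefs: List[str] = field(default_factory=list)
    intents: List[str] = field(default_factory=list)


def plan_step(
    observation: str,
    goal: str,
    memory: Memory,
    encode: Callable[[str], str],
    propose: Callable[[str, Memory], List[str]],
    simulate: Callable[[str, str], str],
    score: Callable[[str, str], float],
    ground: Callable[[str, str], str],
) -> str:
    """One planning step: encode, propose, simulate, score, then ground."""
    belief = encode(observation)              # raw page -> belief state
    candidates = propose(belief, memory)      # high-level intents only
    if not candidates:
        raise ValueError("the proposer returned no candidate intents")
    # Counterfactual step: predict a future belief for each intent.
    # Nothing has touched the real page yet.
    futures = {intent: simulate(belief, intent) for intent in candidates}
    best = max(candidates, key=lambda intent: score(futures[intent], goal))
    memory.beliefs.append(belief)
    memory.intents.append(best)
    return ground(best, observation)          # intent -> concrete command


# Toy stand-ins (assumptions). In the paper each module is an LLM call.
def encode(obs: str) -> str:
    return f"belief: {obs}"

def propose(belief: str, memory: Memory) -> List[str]:
    return ["open first result", "apply evening departure filter"]

def simulate(belief: str, intent: str) -> str:
    if "filter" in intent:
        return "results limited to departures after 6 PM"
    return "details page for one unfiltered flight"

def score(future: str, goal: str) -> float:
    return 1.0 if "after 6 PM" in future else 0.2

def ground(intent: str, obs: str) -> str:
    # Element ids are invented; only this layer needs to know them.
    commands = {
        "open first result": "click('result-0')",
        "apply evening departure filter": "click('filter-evening')",
    }
    return commands[intent]


memory = Memory()
action = plan_step(
    "flight results page, no filters applied", "depart after 6 PM",
    memory, encode, propose, simulate, score, ground,
)
print(action)
```

Note that element ids appear only inside `ground`. If the site changes its layout, the planner's vocabulary of intents does not have to change with it.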
Why longer chain-of-thought is the wrong comparison
The most tempting misunderstanding is to file SiRA under “chain-of-thought, but more organised.” That misses the point.
Chain-of-thought expands the current reasoning trace. SiRA branches over possible futures. A long monologue can still be reactive if it ultimately commits to the first plausible next action without modelling alternatives. The paper explicitly tests this distinction in constrained navigation by replacing the planner with strong reasoning LLMs using unstructured chain-of-thought. The results are unflattering: o1 and o3-mini variants achieve only 1.1% and 3.3% correctness on FlightQA, respectively.
This is the paper’s cleanest conceptual result. Internal compute is not planning. More tokens can help a model explain itself into a decision, but planning requires a representation of possible state transitions. A system must be able to say, in effect: if I choose action A, this state may follow; if I choose action B, another state may follow; one of those futures better satisfies the goal.
For business readers, this distinction should sound familiar. A junior analyst can produce a long explanation for a recommendation without having compared real alternatives. The problem is not lack of prose. The problem is lack of option evaluation.
SiRA turns that evaluation into a system component.
FlightQA is the stress test, not just the showcase
The authors introduce FlightQA because existing web-agent benchmarks are often broad but hard to control. Many tasks are generated through self-instruction and human verification, which provides variety but makes it difficult to isolate how performance changes as constraint complexity rises.
FlightQA is narrower and more controlled. It contains 90 flight-search questions organised into 15 sequences, where each sequence grows from 3 to 8 constraints. This lets the authors test whether an agent can handle incremental compositional difficulty rather than merely succeed on a random set of web tasks.
The evaluation itself reflects the awkwardness of live web tasks. Flight prices and availability change, so there is no stable ground-truth answer. The authors evaluate responses on two dimensions: groundedness, meaning whether the answer is supported by the interaction history, and relevance, meaning whether it satisfies the user’s constraints as far as available results allow. A response counts as correct only if it is both grounded and relevant.
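The scoring rule is simple enough to state as code. The sketch below assumes each response has already been judged on both dimensions; the `Judgement` class, the `rates` helper, and the sample data are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Judgement:
    grounded: bool   # supported by the interaction history
    relevant: bool   # satisfies the constraints as far as results allow

    @property
    def correct(self) -> bool:
        # A response counts as correct only if it is both.
        return self.grounded and self.relevant


def rates(judgements: Iterable[Judgement]) -> Dict[str, float]:
    """Share of responses that are grounded, relevant, and correct."""
    items = list(judgements)
    if not items:
        raise ValueError("need at least one judgement")
    n = len(items)
    return {
        "grounded": sum(j.grounded for j in items) / n,
        "relevant": sum(j.relevant for j in items) / n,
        "correct": sum(j.correct for j in items) / n,
    }


# Invented sample, for illustration only.
sample = [
    Judgement(grounded=True, relevant=True),
    Judgement(grounded=True, relevant=False),
    Judgement(grounded=False, relevant=False),
    Judgement(grounded=True, relevant=True),
]
summary = rates(sample)
print(summary)
```

Because correctness is a conjunction, the correct rate can never exceed either the grounded rate or the relevant rate.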
That design choice is reasonable, but it matters. FlightQA is not a clean static benchmark where every answer can be deterministically checked. It is closer to the operational mess of real agents, which is exactly why it is valuable — and why the evaluator itself becomes part of the uncertainty.
The results:
| Method | Correct | Grounded | Relevant | Repetitive actions | Action errors |
|---|---|---|---|---|---|
| OpenHands BrowsingAgent | 0.0% | 0.0% | 0.0% | 0.0% | 93.3% |
| SiRA with unstructured CoT, o1 / o3-mini | 1.1% / 3.3% | 1.1% / 4.4% | 1.1% / 3.3% | 37.8% / 32.2% | 10.0% / 8.9% |
| SiRA reactive policy | 14.4% | 15.6% | 14.4% | 44.4% | 1.1% |
| SiRA simulative reasoning | 32.2% | 36.7% | 32.2% | 18.9% | 1.1% |
The comparison that matters most is not SiRA versus OpenHands, because the architectures differ in more than one way. The tighter comparison is full SiRA versus the matched reactive SiRA baseline. There, simulative reasoning raises correctness from 14.4% to 32.2%, a 124% relative improvement, significant at the 0.01 level in the authors’ pairwise test.
The repetition metric is also telling. The reactive SiRA variant repeats actions in 44.4% of runs, while full SiRA reduces that to 18.9%. That is not merely a nicer number. Repetition is a classic symptom of agents lacking a stable model of progress. They do not know whether they are stuck, whether the last action changed anything, or whether the current state invalidates an earlier assumption. SiRA’s explicit memory and simulated state transitions give the agent a stronger grip on whether the workflow is moving.
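A repetition monitor is cheap to add to any agent log. The definition used here, the same action more than `max_repeats` times in a row, is an assumption for illustration; the paper's exact criterion may differ, and the action strings are invented.

```python
from typing import List, Optional, Sequence


def has_repetition(actions: Sequence[str], max_repeats: int = 2) -> bool:
    """True if any action occurs more than max_repeats times in a row."""
    streak = 0
    previous: Optional[str] = None
    for action in actions:
        streak = streak + 1 if action == previous else 1
        if streak > max_repeats:
            return True
        previous = action
    return False


def repetition_rate(runs: Sequence[Sequence[str]]) -> float:
    """Fraction of runs flagged as repetitive."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(has_repetition(run) for run in runs) / len(runs)


# Invented trajectories, for illustration only.
runs: List[List[str]] = [
    ["search", "filter", "open"],
    ["scroll", "scroll", "scroll", "open"],   # three identical steps in a row
    ["click", "click", "back"],               # two in a row is tolerated
]
rate = repetition_rate(runs)
print(f"{rate:.2f}")
```

The same check can run online as a recovery trigger rather than only as an after-the-fact metric.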
There is a caveat hiding in plain sight: full SiRA still reaches only 32.2% correctness. This is not “solved travel booking.” It is evidence that simulation helps under hard constraints, not evidence that a production assistant should be trusted with your boss’s last-minute business-class itinerary. Let us keep the champagne in the fridge.
The second test checks breadth rather than interface depth
FlightQA tests constraint-heavy navigation inside a live web interface. FanOutQA tests something different: multi-hop information aggregation across sources. The agent must gather partial information from multiple locations and compile a final answer. This resembles many enterprise tasks: checking vendor records, compiling competitor data, comparing policy documents, or gathering compliance evidence across systems that were never designed to cooperate.
The authors evaluate on the first 100 examples of the FanOutQA development set. Here the evidence is less dramatic than FlightQA but still directionally important:
| Method | Accuracy | Strict accuracy | Response returned | Browser crashed | Repetitive actions | Action errors |
|---|---|---|---|---|---|---|
| OpenHands BrowsingAgent | 17.0% | 4.0% | 32.0% | 17.0% | 0.0% | 43.0% |
| SiRA reactive policy | 20.2% | 3.0% | 37.0% | 24.0% | 18.0% | 10.0% |
| SiRA simulative reasoning | 29.8% | 4.0% | 55.0% | 24.0% | 8.0% | 1.0% |
Simulative reasoning improves accuracy over the reactive policy by a reported 48.6% in relative terms (the rounded table values, 20.2% to 29.8%, work out to roughly 47.5%), with a reported p-value of 0.011. It also improves response rate from 37.0% to 55.0% and reduces action errors from 10.0% to 1.0%.
The strict accuracy column is a useful brake on overinterpretation. Exact-match performance remains low at 4.0%, equal to BrowsingAgent. In multi-source tasks, the agent may gather partially correct information, return a useful but non-exact answer, or fail to match a rigid ground truth. That makes fact-level accuracy more forgiving and arguably more realistic, but also less final.
The paper also notes that some FanOutQA questions can be answered from a single source, such as a Wikipedia page. That means the benchmark is not a pure test of multi-source planning in every case. Still, SiRA’s advantage over its matched reactive baseline suggests the mechanism transfers beyond flight search. It is not just learning how to poke Google Flights with a stick.
WebArena is the broad sanity check
The third evaluation uses a random 100-example subset of WebArena, a benchmark built around simulated websites such as a shopping site, a forum, a GitLab-like environment, a map, and a Wikipedia-like site. This is the broadest test category: general instruction following across varied interfaces.
The results are modest:
| Method | Success rate |
|---|---|
| OpenHands BrowsingAgent | 12.0% |
| SiRA reactive policy | 19.0% |
| SiRA simulative reasoning | 23.0% |
Full SiRA improves over the matched reactive policy by 21.1% and over BrowsingAgent by 91.7%, both in relative terms. The authors are careful to note that these rates are not directly comparable to standard WebArena results because their setup uses BrowserGym through the OpenHands platform and a modified answer format.
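All the headline percentages in this article are relative gains, not percentage-point differences. The sketch below recomputes them from the rounded table values; `relative_gain` is a hypothetical helper name. The FanOutQA figure comes out near 47.5% rather than the quoted 48.6%, so the quoted number presumably rests on data not shown in the table.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# (reactive SiRA, full SiRA) pairs as rounded in the tables above.
reported = {
    "FlightQA correct": (14.4, 32.2),
    "FanOutQA accuracy": (20.2, 29.8),
    "WebArena success": (19.0, 23.0),
}

for name, (reactive, simulative) in reported.items():
    points = simulative - reactive
    gain = relative_gain(reactive, simulative)
    print(f"{name}: +{points:.1f} points, {gain:.1f}% relative gain")
```

The distinction matters when reading vendor material: on WebArena, a 21% relative gain is four percentage points.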
That caveat is not a footnote; it changes the claim. WebArena supports the argument that the relative advantage of simulation persists in a third task family. It does not establish a new state-of-the-art benchmark result. The paper is stronger when read as a mechanism study than as a leaderboard announcement.
What each experiment actually supports
The paper’s evidence is best read as a portfolio, not a single knockout punch.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| FlightQA constrained navigation | Main evidence plus controlled complexity test | Simulation helps with multi-constraint live-web planning; unstructured CoT is not a substitute | Production reliability, deterministic correctness, or broad autonomy |
| Constraint-count analysis in FlightQA | Robustness / compositional sensitivity | SiRA maintains a gap over reactive policy as constraints vary | Smooth scaling with every added constraint; the curve itself is noisy |
| FanOutQA | Generalisation to multi-hop aggregation | Simulation improves fact-level accuracy and response rate over the matched reactive variant | Exact-answer mastery or flawless multi-source research |
| WebArena subset | Broad instruction-following sanity check | The advantage persists in a more varied benchmark family | Direct comparability to standard WebArena leaderboards |
| Appendices and prompts | Implementation detail | The architecture is concrete enough to inspect and reproduce as a research artefact | That the exact prompts are stable across model versions |
This is also the right way to translate the paper into business relevance. SiRA’s value is not the absolute score on any one task. The value is the repeated pattern: when agents must reason over possible future states, explicit simulation improves behaviour relative to a structurally similar reactive system.
The enterprise implication: build agents as planners, not prompt chains
For operators, the actionable idea is not “use SiRA.” It is “stop hiding the planning problem inside one giant model call.”
A production-grade workflow agent should usually have separate places for:
- Perception: What did the system observe?
- Belief state: What does the agent currently believe is true?
- Memory: What must persist across steps?
- Candidate intentions: What are the plausible next high-level moves?
- Simulation: What is likely to happen after each move?
- Critique: Which future best satisfies the user’s goal and constraints?
- Execution: Which concrete tool call or interface action implements the chosen intent?
- Recovery: What changed, what failed, and what should be revised?
That structure is not aesthetic. It creates operational handles. If an agent fails, the team can inspect whether the observation was incomplete, the belief state was wrong, the action candidates were poor, the world model predicted nonsense, the critic rewarded the wrong thing, or the actor clicked the wrong element.
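One way to make those handles concrete is a per-step trace that records what each module produced and whether a reviewer or validator accepted it. This is a sketch under assumptions: the stage names, the `StepTrace` record, and the example contents are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Stages in the order they run within one step.
STAGES = ("perception", "belief", "proposal", "simulation", "critique", "execution")


@dataclass
class StepTrace:
    step: int
    outputs: Dict[str, str]   # stage -> what the module produced
    checks: Dict[str, bool]   # stage -> did a validator or reviewer accept it?


def first_failed_stage(trace: StepTrace) -> Optional[str]:
    """Return the earliest stage whose check failed, or None if all passed.

    Stages with no recorded check are treated as not flagged.
    """
    for stage in STAGES:
        if not trace.checks.get(stage, True):
            return stage
    return None


# Invented example: the world model predicted a state the page never reached.
trace = StepTrace(
    step=4,
    outputs={
        "belief": "results page, 12 flights, no filters",
        "simulation": "filter panel opens with airline options",
        "execution": "click('sort-price')",
    },
    checks={"perception": True, "belief": True, "proposal": True,
            "simulation": False, "critique": True, "execution": True},
)
culprit = first_failed_stage(trace)
print(culprit)
```

Reporting the earliest failing stage is a deliberate choice: errors downstream of a bad prediction are usually symptoms, not causes.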
In a single reactive prompt loop, those failures collapse into one unpleasant category: “the model messed up.” Technically true. Operationally useless.
The pattern is especially relevant for workflows with high constraint density:
| Workflow type | Why reactive agents struggle | How simulative planning helps |
|---|---|---|
| Travel booking | Many filters, live prices, hidden trade-offs, dynamic availability | Simulates whether a search/filter action will expose valid candidates before committing |
| Procurement | Vendor constraints, budget rules, compliance checks, multi-step approvals | Compares paths through search, validation, and documentation |
| Compliance research | Multiple documents, exceptions, changing rules, audit trail requirements | Maintains belief state and evaluates whether evidence actually satisfies the query |
| CRM / ERP operations | State changes are consequential and sometimes irreversible | Tests intended state transitions before execution |
| Market and competitor research | Sources are partial, contradictory, and distributed | Plans source coverage rather than grabbing the first plausible answer |
| Internal IT automation | Interface details vary across systems and sessions | Separates abstract intent from concrete UI commands |
The ROI pathway is not “agents become smarter” in the abstract. It is cheaper diagnosis, fewer repeated actions, better constraint satisfaction, and less reliance on bespoke control-flow engineering for every new workflow. That is a sober business case. It may even survive a procurement meeting.
The natural-language belief state is a feature and a liability
SiRA’s belief states are written in natural language. This is one of the paper’s most practical design choices because LLMs are already good at summarising observations, predicting text-described state changes, and evaluating goal progress. Natural language also makes the system inspectable. Engineers and domain experts can read the belief state and ask whether it matches reality.
That said, natural language is not magic state management. It can omit details, over-compress interface structure, or preserve the wrong facts. It also depends on prompts and model behaviour. The paper notes that simulative reasoning deteriorated on FanOutQA when using newer versions of GPT-4o, possibly because model response patterns changed under the same prompts. That is a serious deployment warning. If your planning architecture depends on stable prompt-shaped module behaviour, then model upgrades are not routine maintenance. They are regression events.
The right operational response is not to abandon natural-language state. It is to treat it as an auditable intermediate representation, ideally paired with structured fields, test suites, version pinning, and task-specific validators.
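As a rough illustration of pairing free text with typed fields, the sketch below checks a belief state against a few simple rules. The `FlightSearchBelief` class, its fields, and the rules are assumptions for this example, not structures from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FlightSearchBelief:
    summary: str                               # natural-language belief state
    origin: Optional[str] = None               # typed mirror of key facts
    destination: Optional[str] = None
    results_visible: Optional[int] = None
    filters_applied: List[str] = field(default_factory=list)
    model_version: str = "unpinned"            # which model wrote the summary

    def problems(self) -> List[str]:
        """Cheap consistency checks between the text and the typed fields."""
        issues: List[str] = []
        if not self.summary.strip():
            issues.append("summary is empty")
        for name in ("origin", "destination"):
            value = getattr(self, name)
            if value and value.lower() not in self.summary.lower():
                issues.append(f"{name} '{value}' is not mentioned in the summary")
        if self.results_visible is not None and self.results_visible < 0:
            issues.append("results_visible is negative")
        if self.model_version == "unpinned":
            issues.append("model version is not pinned")
        return issues


# Invented example: the typed destination disagrees with the text.
belief = FlightSearchBelief(
    summary="Results page for New York to Los Angeles, 12 flights listed.",
    origin="New York",
    destination="Boston",
    results_visible=12,
    model_version="example-model-2025-01",
)
issues = belief.problems()
print(issues)
```

Substring matching is crude, but even crude checks turn a silent drift between text and fields into a visible test failure.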
In other words: language is a useful planning substrate. It is not a database, a contract, or a memory palace. Please do not build one out of vibes.
Where the paper is strongest
The strongest part of the paper is the controlled comparison between simulative reasoning and a matched reactive baseline inside the same broad architecture. This matters because many agent papers compare a new system against older baselines with many confounding differences: better prompts, different tools, newer models, extra memory, more forgiving evaluators, or a larger action budget.
Here, the reactive policy baseline replaces SiRA’s planning module with direct action selection from the policy, committing to the first sample without simulation. That isolates the value of world-model-based planning more cleanly than a simple comparison against an external browser agent.
The second strength is that the three task families stress different failure modes:
- FlightQA stresses deep constraint satisfaction in a partially observable live interface.
- FanOutQA stresses breadth across sources and long-horizon aggregation.
- WebArena stresses varied instruction following across simulated sites.
The performance improvements are not identical, and they should not be. A mechanism that helps equally everywhere is often a mechanism that was evaluated too vaguely. SiRA helps most where the need for counterfactual planning is clearest.
Where the evidence should not be overread
The limitations are not ceremonial. They shape how far the paper can travel.
First, inference cost is higher. SiRA explores multiple candidate plans, simulates future belief states, and scores them. That is more expensive than a reactive next-action policy. For enterprise deployments, the obvious question becomes: when is simulation worth the cost? A refund-processing agent may not need tree search for every click. A compliance agent deciding whether a document satisfies a regulated requirement probably does.
Second, the browser environment is a brittle substrate. The paper reports browser crashes as a sizable contributor to failures in FanOutQA, with a 24% crash rate for both SiRA variants. Captchas and anti-scraping tools can also block open-web agents. This is not a minor engineering inconvenience. If the tool layer fails, the planning layer can only produce beautifully reasoned frustration.
Third, the experiments rely primarily on text representations of webpages. The authors acknowledge that images, layout, occlusion, and other visual signals may be missed. Many real interfaces communicate meaning visually. A text-only accessibility tree can be informative, but it is not the full page.
Fourth, FlightQA evaluation uses LLM judgment for groundedness and relevance because live flight data lacks stable ground truth. This is understandable, but it introduces evaluator risk. The paper itself notes that hallucinated answers from some strong reasoning models can fool the evaluator at significant rates. That is precisely the kind of detail operators should remember before turning benchmark scores into a vendor slide.
Fifth, the sample sizes are research-scale. FanOutQA and WebArena each use 100 examples. That is enough to show meaningful experimental signals, but not enough to map the full reliability envelope of deployed agents across domains, websites, account states, permissions, geographies, and user preferences.
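To see what 100 examples can and cannot pin down, the sketch below computes a 95% Wilson score interval for the two SiRA success rates on the WebArena subset (19 and 23 successes out of 100). The interval method is a choice made here, not something the paper reports.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


for label, wins in [("reactive", 19), ("simulative", 23)]:
    low, high = wilson_interval(wins, 100)
    print(f"{label}: {wins}/100, 95% interval {low:.1%} to {high:.1%}")
```

The intervals come out at roughly 13% to 28% and 16% to 32%, so they overlap heavily. That is an unpaired view and does not replace the authors' own tests, but it shows how wide the plausible range remains at this sample size.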
What to build next if you take SiRA seriously
A useful production translation of SiRA would not copy the paper verbatim. It would harden the pattern.
Start with a constrained domain where the state can be checked. Travel search, procurement comparison, internal knowledge lookup, or CRM updates are plausible candidates. Then give the agent explicit modules:
- a structured observation parser;
- a belief state with both natural-language and typed fields;
- a short-term memory policy;
- a candidate-intent generator;
- a world-model simulator;
- a critic with task-specific validators;
- an actor grounded in safe tool calls;
- a recovery loop for contradiction, no-change states, and repeated actions.
The critic is especially important. In the paper, the critic evaluates goal progress in simulated terminal states. In business settings, the critic should not be a generic “does this look successful?” prompt unless the task is low-stakes. It should include deterministic checks where possible: all constraints satisfied, price below threshold, required fields completed, source links present, approval status unchanged until confirmation, and so on.
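A deterministic layer of that kind might look like the following sketch. The `FlightOption` fields, the thresholds, and the check names are invented for illustration, and the time comparisons ignore overnight arrivals for brevity.

```python
from dataclasses import dataclass
from datetime import time
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FlightOption:
    airline: str
    price_usd: float
    departs: time
    arrives: time          # same-day arrival assumed; dates are ignored here
    source_url: str


Check = Callable[[FlightOption], bool]


def failed_checks(option: FlightOption, checks: Dict[str, Check]) -> List[str]:
    """Names of the checks this option fails; an empty list means it passes."""
    return [name for name, check in checks.items() if not check(option)]


# Hypothetical constraints for one user request.
checks: Dict[str, Check] = {
    "price at or below 400 USD": lambda f: f.price_usd <= 400,
    "departs at or after 18:00": lambda f: f.departs >= time(18, 0),
    "arrives by 23:30": lambda f: f.arrives <= time(23, 30),
    "source link present": lambda f: f.source_url.startswith("https://"),
}

good = FlightOption("ExampleAir", 389.0, time(18, 45), time(22, 10),
                    "https://example.com/flights/123")
bad = FlightOption("ExampleAir", 520.0, time(16, 0), time(19, 30),
                   "https://example.com/flights/456")

print(failed_checks(good, checks))
print(failed_checks(bad, checks))
```

An LLM critic can still rank the options that survive. The point is that it should not be the only thing standing between a hallucinated price and the user.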
The simulation budget should also be dynamic. Not every step deserves search. An agent can act reactively for low-risk obvious moves and invoke simulation when constraints conflict, state changes are irreversible, confidence falls, or the workflow approaches a decision boundary. The future of agents is probably not “simulate everything.” It is “know when guessing is cheap and when guessing is negligence.”
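A dynamic budget can start as a plain rule before it becomes anything learned. In this sketch the signals, the confidence floor, and the budget sizes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSignals:
    irreversible: bool          # would the action commit or submit something?
    constraints_conflict: bool  # do known constraints pull in different directions?
    confidence: float           # the agent's own estimate, 0.0 to 1.0
    near_decision: bool         # is the workflow at a decision boundary?


def simulation_budget(
    signals: StepSignals,
    max_candidates: int = 5,
    confidence_floor: float = 0.7,
) -> int:
    """How many candidate intents to simulate; 0 means act reactively."""
    if signals.irreversible:
        return max_candidates            # never guess on one-way doors
    triggers = sum([
        signals.constraints_conflict,
        signals.near_decision,
        signals.confidence < confidence_floor,
    ])
    if triggers == 0:
        return 0                         # obvious, low-risk move
    return min(max_candidates, 1 + 2 * triggers)


routine = StepSignals(False, False, 0.95, False)
unsure = StepSignals(False, False, 0.50, False)
submit = StepSignals(True, False, 0.95, False)

print(simulation_budget(routine), simulation_budget(unsure), simulation_budget(submit))
```

The rule is deliberately asymmetric: irreversibility alone buys the full budget, while softer signals have to accumulate.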
The real lesson is architectural humility
SiRA is an argument against a lazy assumption: that sufficiently large reasoning models will automatically become competent planners if we let them think longer. The paper’s evidence points in a different direction. Planning is not just more thinking. Planning is structured comparison among possible futures.
That is a useful correction for enterprise AI. Many failed agent pilots do not fail because the model lacks eloquence. They fail because the system has no durable state, no explicit representation of progress, no way to test candidate actions before executing them, and no disciplined separation between intention and tool use.
SiRA does not solve those problems completely. Its absolute success rates are still modest, its inference cost is higher, its evaluation has rough edges, and its browser substrate remains fragile. But it gives the field a clearer design pattern: agents should not merely narrate their way through tasks. They should simulate, evaluate, and then act.
For operators, that is the takeaway worth keeping. The next reliable agent stack will not be the one with the most theatrical chain-of-thought. It will be the one that can say, before it clicks: “I have considered what happens next.”
Cognaptus: Automate the Present, Incubate the Future.
[1] Mingkai Deng, Jinyu Hou, Zhiting Hu, and Eric P. Xing, “General Agentic Planning Through Simulative Reasoning with World Models,” arXiv:2507.23773v3, 2026.