A procurement workflow looks boring until an AI agent touches it.

Before that moment, the process is usually wrapped in the comforting machinery of enterprise software: approval rules, validation checks, role permissions, exception paths, and enough audit trails to make everyone feel governed. Then someone inserts an agent and asks it to “handle the workflow.” The agent may know the words. It may call the right tools. It may even produce a next step that looks plausible.

That is not the same as being safe to run.

The paper “The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence” gives this problem a sharper name and a useful accounting structure [1]. Its central move is not to ask whether an agent can generate a reasonable action. That question is too small. The paper asks whether an entire sequence of decisions remains statistically supported, locally unambiguous, and economically governable.

This is where many enterprise AI projects quietly begin to rot. They treat reliability as a model-quality issue, then treat human review as an operations issue, then treat ROI as a finance issue. Convenient, tidy, and mostly wrong. In a workflow agent, those three are tied to the same underlying structure: which states the business has actually seen, which actions were historically taken from those states, how uncertain those choices are, and how often humans must step back in.

The paper calls the mismatch the stochastic gap. A deterministic enterprise system moves through a controlled process. An agentic system samples or selects actions under uncertainty. The failure is not only that the agent might pick a bad next step. The deeper failure is that local uncertainty compounds along the path.

The real unit of reliability is the trajectory, not the next answer

A chatbot mistake is easy to understand. A workflow-agent mistake is harder, because it may not look like a mistake at the moment it happens.

Consider a purchase-to-pay process. A case may pass through purchase requisition, purchase order creation, goods receipt, invoice receipt, payment blocking, approval changes, cancellations, corrections, and other unpleasantly normal enterprise events. At any individual point, an agent might choose a next action that appears reasonable. But enterprise workflows are not isolated trivia questions. They are trajectories.

The paper formalizes this using a Markov view of business processes. A case generates a sequence of workflow states and actions. A state is not merely the current activity. In the paper’s refined version for the BPI 2019 purchase-to-pay log, the state can include the current activity, item type, goods-receipt-based invoice verification flag, value bin, and actor class. Actions are the next recorded workflow activities.

That refinement matters. If the only state is “Record Invoice Receipt,” the process may look well covered. If the state is “Record Invoice Receipt for this item type, under this verification condition, in this value band, handled by this actor class,” the comfort level changes. Reality becomes more useful and less flattering. Enterprise data has a gift for doing that.
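To make the refinement concrete, here is a minimal sketch of how a coarse activity label becomes a refined state. The field names, the value-band edges, and the two sample events are hypothetical illustrations, not the paper's actual schema or binning.

```python
from typing import NamedTuple


class State(NamedTuple):
    """Refined workflow state: activity plus control-relevant context."""
    activity: str
    item_type: str
    gr_based_iv: bool   # goods-receipt-based invoice verification flag
    value_bin: str
    actor_class: str


def value_bin(amount: float) -> str:
    # Hypothetical value bands; the paper's bin edges are not stated here.
    if amount < 100:
        return "low"
    if amount < 10_000:
        return "mid"
    return "high"


def refine(event: dict) -> State:
    """Map one raw event record (hypothetical field names) to a refined state."""
    return State(
        activity=event["activity"],
        item_type=event["item_type"],
        gr_based_iv=event["gr_based_iv"],
        value_bin=value_bin(event["amount"]),
        actor_class=event["actor_class"],
    )


# Two toy events that share an activity but differ in every other respect.
events = [
    {"activity": "Record Invoice Receipt", "item_type": "Standard",
     "gr_based_iv": True, "amount": 50.0, "actor_class": "human"},
    {"activity": "Record Invoice Receipt", "item_type": "Consignment",
     "gr_based_iv": False, "amount": 25_000.0, "actor_class": "batch"},
]

coarse_states = {e["activity"] for e in events}
refined_states = {refine(e) for e in events}
```

Under the activity-only view these two events collapse into one state. Under the refined view they are two different decision environments, each needing its own historical support.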

The paper’s mechanism can be reduced to one practical claim:

An enterprise agent is reliable only where its state-action trajectory remains inside a historically supported, low-ambiguity, risk-acceptable region of the workflow.

This is not a slogan about caution. It is a measurement problem.

State coverage is the wrong comfort blanket

The first trap is believing that a large event log solves the support problem.

The BPI 2019 log used in the paper is not tiny. It contains 251,734 cases, 1,595,923 events, and 42 distinct workflow actions. The mean case length is 6.34 events, the median is 5, the 99th percentile is 24, and the longest observed case contains 990 events. The log also contains recurrent structure: 15.7% of transitions are self-loops, and 6.3% of cases contain at least one self-loop.

In ordinary business language, that means the process has plenty of routine backbone activity, but also enough loops, rework, and exceptions to punish any lazy story about “standard automation.”

The paper defines two support measures:

| Quantity | What it asks | Why it matters for agents |
| --- | --- | --- |
| State blind-spot mass | How much workflow mass lies in low-support states? | Tells us where the agent is standing on weak historical ground. |
| State-action blind mass | How much workflow mass lies in low-support state-action pairs? | Tells us where the agent must choose an action without enough historical precedent. |

The distinction is the article’s first important business lesson. A workflow may have good state coverage while still having poor decision coverage. The agent does not merely recognize a state. It must act from that state.
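A small sketch shows how the two measures can diverge. It assumes blind mass means the share of observed transitions that fall in states (or state-action pairs) seen fewer than $\tau$ times, which is this article's reading of the definition. The toy transitions and the tiny threshold are made up for illustration.

```python
from collections import Counter


def blind_mass(counts: Counter, tau: int) -> float:
    """Share of total observed mass carried by keys with fewer than tau observations."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    weak = sum(n for n in counts.values() if n < tau)
    return weak / total


# Toy (state, action) transitions: state "A" is frequent, but one of its actions is rare.
transitions = [("A", "x")] * 6 + [("A", "y")] * 1 + [("B", "x")] * 3

state_counts = Counter(state for state, _ in transitions)
pair_counts = Counter(transitions)

state_blind = blind_mass(state_counts, tau=3)  # both states have at least 3 observations
pair_blind = blind_mass(pair_counts, tau=3)    # the pair ("A", "y") has only 1
```

Every state clears the threshold, so state blind mass is zero. One state-action pair does not, so a tenth of the transition mass is still blind at the decision level.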

The full-log audit makes the point numerically.

| State abstraction | Observed states | Observed state-action pairs | State blind mass at $\tau=1000$ | State-action blind mass at $\tau=1000$ |
| --- | --- | --- | --- | --- |
| Activity only | 42 | 498 | 0.0021 | 0.0324 |
| Activity + item type + GR flag | 190 | 1,217 | 0.0212 | 0.0681 |
| Activity + item type + GR flag + value + actor | 668 | 3,262 | 0.0460 | 0.1253 |

The refined state-action blind mass reaches 0.1253 at the support threshold used in the paper. In plain English: once the process is represented with more operationally meaningful context, more than 12% of observed transition mass falls into state-action pairs with fewer than one thousand historical examples.

This is not a small nuisance. It is the difference between “the workflow is well represented in our logs” and “the agent’s next decision is often under-supported.” Those are not the same statement. The second one is the one that matters.

Context improves realism and worsens statistical comfort

There is a tempting but wrong response to the previous result: “Then we should keep the state simple.”

That would make the audit prettier and the deployment more dangerous.

A coarse state abstraction hides operational distinctions. It may treat a low-value routine action and a high-value exception-sensitive action as variants of the same state. That is analytically convenient in the same way that removing smoke alarms is convenient. The dashboard becomes calmer because it knows less.

The paper’s refinement adds context because the agent’s decision environment actually contains context. Value matters. Actor class matters. Verification conditions matter. Exception-sensitive activities matter. Once these are included, the state space expands from 42 activity-only states to 668 refined states, and state-action pairs expand from 498 to 3,262.

This is the second business lesson: better process representation can reveal weaker statistical support.

That does not mean richer context is bad. It means richer context exposes where autonomy is not justified. If a company’s automation plan depends on pretending that context does not matter, the plan is not efficient. It is merely under-described.

A useful audit should make the business case harder before it makes the deployment safer. Annoying, yes. Also cheaper than discovering the same thing after the agent has already touched real invoices.

Entropy identifies the places where “many reasonable actions” is the problem

Support is only one part of the stochastic gap. Even when a state is well represented in the historical log, the next action may be ambiguous.

The paper measures this ambiguity using Shannon entropy over the empirical next-action distribution. A low-entropy state has a dominant next step. A high-entropy state has several plausible next steps. For a human process analyst, that is familiar: some workflow points are routine pipes; others are decision junctions.
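As a minimal sketch, the entropy of one state's empirical next-action distribution can be computed from its action counts. The action names and counts below are toy values, not figures from the BPI log.

```python
import math
from collections import Counter


def next_action_entropy(action_counts: Counter) -> float:
    """Shannon entropy, in bits, of the empirical next-action distribution."""
    total = sum(action_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in action_counts.values():
        if n > 0:
            p = n / total
            entropy -= p * math.log2(p)
    return entropy


# A routine pipe: one next step dominates.
pipe = Counter({"Record Goods Receipt": 98, "Change Quantity": 2})

# A decision junction: four next steps are equally likely.
junction = Counter({"a": 25, "b": 25, "c": 25, "d": 25})

pipe_bits = next_action_entropy(pipe)
junction_bits = next_action_entropy(junction)
```

Four equally likely next actions give exactly 2 bits, which is a handy way to read the thresholds discussed below: a 2-bit state is as ambiguous as a fair four-way choice.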

The highest-entropy areas in the refined BPI analysis are not random. They concentrate in approval changes and exception-management contexts. At the activity level, the paper reports large next-step entropies for approval changes, goods-receipt cancellations, and delivery-indicator changes. These are precisely the areas where businesses least want an agent to improvise with confidence and a pleasant tone.

This is where the paper’s human-in-the-loop gate becomes important. The gate escalates decisions when support is inadequate, entropy is too high, or risk is too elevated. The exact thresholds can vary, but the principle is stable: autonomy should be scoped by the local decision environment, not granted uniformly across the workflow.
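The gate logic described above can be sketched as a single predicate. The threshold defaults, the 0-to-1 risk scale, and the choice of strict inequalities are assumptions made for illustration; the paper's exact gate parameters may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    """Hypothetical gate thresholds; tune these per workflow."""
    min_support: int = 1000         # minimum historical observations of the state
    max_entropy_bits: float = 2.0   # maximum tolerated next-action entropy
    max_risk: float = 0.5           # assumed risk score on a 0-to-1 scale


def should_escalate(support: int, entropy_bits: float, risk: float,
                    cfg: GateConfig) -> bool:
    """Escalate to a human if any one of the three conditions fails."""
    return (
        support < cfg.min_support
        or entropy_bits > cfg.max_entropy_bits
        or risk > cfg.max_risk
    )


cfg = GateConfig()
routine = should_escalate(support=5000, entropy_bits=0.4, risk=0.1, cfg=cfg)
thin_history = should_escalate(support=200, entropy_bits=0.4, risk=0.1, cfg=cfg)
ambiguous = should_escalate(support=5000, entropy_bits=2.3, risk=0.1, cfg=cfg)
high_stakes = should_escalate(support=5000, entropy_bits=0.4, risk=0.9, cfg=cfg)
```

The conditions are joined with `or` on purpose: a state has to pass all three tests to stay autonomous, so a well-supported but high-risk state still goes to a human.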

That produces a useful contrast between event-level autonomy and case-level autonomy.

| Gate setting in the full-log audit | Autonomous nonterminal events | Fully autonomous complete cases | Interpretation |
| --- | --- | --- | --- |
| Entropy threshold $h_0=2.0$ bits | 72.2% | 49.6% | Many individual steps pass, but only about half of full cases avoid escalation. |
| Entropy threshold $h_0=1.5$ bits | 53.3% | 7.1% | A stricter local gate leaves many steps autonomous, but end-to-end autonomy collapses. |

This is the path effect. A workflow can appear mostly automatable at the step level while remaining only partially automatable at the case level. Every additional decision point is another chance to hit a high-entropy, low-support, or high-risk region.
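A toy calculation makes the path effect visible. It assumes each case is recorded as a list of per-step flags saying whether the gate allowed autonomy at that step. The four cases below are invented for illustration.

```python
def autonomy_rates(cases: list[list[bool]]) -> tuple[float, float]:
    """Return (event-level autonomy rate, case-level autonomy rate)."""
    steps = [flag for case in cases for flag in case]
    if not cases or not steps:
        raise ValueError("need at least one case with at least one step")
    event_rate = sum(steps) / len(steps)
    # A case counts as fully autonomous only if every step passed the gate.
    case_rate = sum(all(case) for case in cases) / len(cases)
    return event_rate, case_rate


# Each inner list is one case; True means the step stayed autonomous.
cases = [
    [True, True, True, True],
    [True, True, False, True],
    [True, True, True, False],
    [True, False, True, True],
]

event_rate, case_rate = autonomy_rates(cases)
```

Here 13 of 16 steps are autonomous, yet only one case in four gets through without an escalation. A single gated step is enough to pull the whole case out of the zero-touch bucket.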

For enterprise leaders, this is a useful cure for the demo illusion. A demo often shows a successful step. A deployment requires a successful path. The difference is not theatrical. It is multiplicative.

The held-out agent test is validation, not magic

The paper then moves from descriptive audit to held-out validation.

The authors split the BPI process chronologically: the first 80% of cases by completion time are used for training, and the final 20% are held out. The training segment contains 201,387 cases and 1,267,250 events; the held-out segment contains 50,347 cases and 328,673 events.
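A chronological split of this kind can be sketched in a few lines. The sketch assumes each case has a single completion timestamp and that the split is by case rather than by event; the case ids and dates are hypothetical.

```python
from datetime import datetime


def chronological_split(case_end_times: dict[str, datetime],
                        train_frac: float = 0.8) -> tuple[list[str], list[str]]:
    """Order case ids by completion time and cut at train_frac."""
    ordered = sorted(case_end_times, key=case_end_times.get)
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


# Hypothetical case ids with completion timestamps.
ends = {
    "c1": datetime(2018, 1, 5),
    "c2": datetime(2018, 3, 1),
    "c3": datetime(2018, 2, 10),
    "c4": datetime(2018, 6, 30),
    "c5": datetime(2018, 4, 15),
}

train_ids, heldout_ids = chronological_split(ends)
```

Splitting by time rather than at random keeps the test honest: the held-out cases come strictly after everything the audit was allowed to see.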

The simulated agent is deliberately simple. If a held-out state falls outside the autonomy gate, it escalates. Otherwise, it chooses the historically most probable next action from the training log. Correctness is measured against the observed next activity in the held-out trace.
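The evaluation loop described above can be sketched as follows. For brevity the gate is reduced to a precomputed set of autonomous states, which is a simplification, and the transitions are toy data rather than BPI records.

```python
from collections import Counter, defaultdict


def fit_policy(train_transitions):
    """Count the next actions observed from each state in the training log."""
    counts = defaultdict(Counter)
    for state, action in train_transitions:
        counts[state][action] += 1
    return counts


def evaluate(counts, heldout_transitions, autonomous_states):
    """Escalate outside the gate; otherwise predict the most frequent training action."""
    escalated = acted = correct = 0
    for state, observed in heldout_transitions:
        # States outside the gate, or never seen in training, go to a human.
        if state not in autonomous_states or state not in counts:
            escalated += 1
            continue
        acted += 1
        predicted = counts[state].most_common(1)[0][0]
        correct += int(predicted == observed)
    accuracy = correct / acted if acted else float("nan")
    return {"escalated": escalated, "acted": acted, "accuracy": accuracy}


train = ([("PO", "GR")] * 8 + [("PO", "Change")] * 2
         + [("Inv", "Pay")] * 5 + [("Inv", "Block")] * 5)
heldout = [("PO", "GR"), ("PO", "GR"), ("PO", "Change"), ("Inv", "Pay"), ("New", "X")]

policy = fit_policy(train)
result = evaluate(policy, heldout, autonomous_states={"PO"})
```

Accuracy is computed only over the steps where the agent acted. That is why tightening the gate can raise step accuracy while also raising the number of human touches.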

This design matters because it defines what the result does and does not prove.

| Component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Full-log support audit | Main descriptive evidence | Large logs can hide state-action blind spots, especially after meaningful state refinement. | It does not prove a particular deployed LLM agent would fail in exactly those places. |
| Entropy-based autonomy gate | Main mechanism demonstration | Local ambiguity and risk can be converted into a scoped autonomy envelope. | It does not identify the universally optimal threshold for every company. |
| Chronological 80/20 split | Held-out validation | The audit quantities track future held-out process behavior under an imitation-style agent. | It is not a counterfactual online deployment test. |
| Step-accuracy surrogate comparison | Validation of theoretical proxy | The surrogate tracks realized autonomous step accuracy within 3.4 percentage points on average. | It does not guarantee case-level success under richer agent policies. |
| Reliability-cost frontier | Business interpretation of the same gate | Safer autonomy and lower human cost are jointly determined by the gate. | It does not calculate a company-specific ROI without real cost and penalty values. |

This is a good example of a paper result that is useful because it is modest. The simulated agent is not a heroic foundation model wrapped in a spectacular orchestration layer. It is a log-driven greedy policy. That makes the test cleaner: the question is whether support, entropy, and gating statistics have predictive value before fancy agent engineering begins.

They do.

Across entropy thresholds from 1.0 to 2.5, the theoretical autonomous step-accuracy surrogate tracks realized autonomous step accuracy with a mean absolute gap of 3.4 percentage points and a maximum gap of 4.0 percentage points. At $h_0=2.0$, for example, the surrogate predicts 67.5% autonomous step accuracy, while the held-out agent realizes 63.7%.
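As a quick check on the reported figures at $h_0=2.0$, the gap between the surrogate and the realized accuracy is:

$$
67.5\% - 63.7\% = 3.8 \text{ percentage points}
$$

That sits above the 3.4-point mean gap and below the 4.0-point maximum. At this threshold the surrogate is optimistic, and by slightly more than its average error.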

That is not a claim that the agent is good. It is a claim that the audit is informative. The paper is not saying, “Deploy this simple greedy agent.” Please do not. The point is that pre-deployment log structure can already forecast where autonomous behavior is likely to be weak.

The same gate prices reliability and labor

Now the paper becomes commercially interesting.

The human-in-the-loop gate does two things at once. It improves reliability by escalating uncertain or risky decisions. It also creates labor cost by requiring human touches. These are not separate knobs.

The held-out results show the trade-off:

| Entropy threshold $h_0$ (bits) | Realized autonomous step accuracy | Realized safe case completion | Realized zero-touch completion | Mean human touches per case |
| --- | --- | --- | --- | --- |
| 1.50 | 68.4% | 55.6% | 3.3% | 3.02 |
| 2.00 | 63.7% | 49.6% | 16.1% | 2.26 |
| 2.25 | 61.5% | 45.2% | 23.1% | 1.90 |

A stricter threshold raises review burden. A looser threshold reduces human touches but allows more autonomous decisions in uncertain regions. Nothing shocking there. The useful part is that the paper ties this trade-off to measurable workflow quantities before deployment.
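To see how the same gate prices both sides, here is a rough sketch that turns the three reported rows into a per-case cost. The cost model is an assumption of this article, not the paper's: it charges a flat fee per human touch and applies a flat penalty to every case that does not complete safely. Both cost parameters are hypothetical.

```python
# (entropy threshold, safe case completion, zero-touch completion, mean touches per case)
# Values are copied from the held-out table above.
ROWS = [
    (1.50, 0.556, 0.033, 3.02),
    (2.00, 0.496, 0.161, 2.26),
    (2.25, 0.452, 0.231, 1.90),
]


def expected_cost_per_case(safe_completion: float, mean_touches: float,
                           review_cost: float, failure_penalty: float) -> float:
    """Assumed cost model: review labor plus a penalty on non-safe cases."""
    return mean_touches * review_cost + (1.0 - safe_completion) * failure_penalty


def cheapest_threshold(review_cost: float, failure_penalty: float) -> float:
    """Return the entropy threshold with the lowest assumed per-case cost."""
    best = min(
        ROWS,
        key=lambda r: expected_cost_per_case(r[1], r[3], review_cost, failure_penalty),
    )
    return best[0]


# Expensive reviewers, cheap failures: the looser gate wins.
labor_heavy = cheapest_threshold(review_cost=5.0, failure_penalty=10.0)

# Cheap reviewers, expensive failures: the stricter gate wins.
penalty_heavy = cheapest_threshold(review_cost=1.0, failure_penalty=100.0)
```

The preferred threshold flips as the ratio of penalty to review cost changes. That is the sense in which these are not separate knobs: one gate setting fixes both numbers at once.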

At $h_0=2.0$, 42.3% of held-out cases remain inside the autonomous envelope, but only 16.1% achieve zero-touch completion with no autonomous mismatch. That gap is the cost of allowing a workflow to be “autonomous enough” in coverage terms while still being wrong along the path.

The paper also explains why case-level behavior can jump when certain gateway states are admitted into the autonomous envelope. In the temporal split, 33 gateway states in the entropy band from 1.5 to 2.0 satisfy the support and risk conditions. They account for 13.4% of held-out nonterminal decisions but are visited by 69.1% of held-out cases. Many are procurement hubs around Create Purchase Order Item, Record Goods Receipt, and Record Invoice Receipt.

That is operationally important. Some states are not just frequent. They are bottleneck gateways. Moving them into or out of the autonomy envelope can change case-level automation disproportionately. The ROI implication is obvious enough to be dangerous: companies should not merely rank tasks by volume. They should rank workflow states by their effect on end-to-end autonomy, reliability, and review load.
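One way to find such gateways is to compare each state's share of decisions with its share of cases touched. The sketch below uses four invented cases, and its activity labels stand in for full refined states.

```python
from collections import Counter


def state_reach(cases: list[list[str]]) -> dict[str, tuple[float, float]]:
    """For each state, return (share of all decisions, share of cases that visit it)."""
    decision_counts = Counter(s for case in cases for s in case)
    case_counts = Counter(s for case in cases for s in set(case))
    total_decisions = sum(decision_counts.values())
    return {
        s: (decision_counts[s] / total_decisions, case_counts[s] / len(cases))
        for s in decision_counts
    }


# Toy cases; each list is the sequence of states one case passes through.
cases = [
    ["Create Purchase Order Item", "Record Goods Receipt",
     "Record Invoice Receipt", "Clear Invoice"],
    ["Create Purchase Order Item", "Change Price", "Change Price", "Change Price"],
    ["Create Purchase Order Item", "Record Goods Receipt"],
    ["Create Purchase Order Item", "Record Invoice Receipt", "Clear Invoice"],
]

reach = state_reach(cases)
gateway = reach["Create Purchase Order Item"]   # visited by every case
loop_state = reach["Change Price"]              # many decisions, one case
```

Both states carry a similar share of decisions, but one sits on every case's path and the other on a single case. Ranking by decision volume alone would treat them as roughly equal. Ranking by case reach would not.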

What Cognaptus would infer for deployment design

The paper directly shows that support, entropy, risk, and human review can be audited from event logs, and that these quantities track held-out behavior in a procurement process. The business inference is broader but still disciplined: companies should scope agentic autonomy before building the agent interface.

A practical pre-deployment audit would look like this:

| Audit step | Direct measurement | Business decision it informs |
| --- | --- | --- |
| Define operational states | Activity plus control-relevant context such as value, actor, document status, tool state, or policy flags | Prevents the audit from hiding risk behind overly coarse process labels. |
| Estimate state and state-action support | Blind-spot mass under chosen support thresholds | Identifies where historical evidence is too thin for autonomous action. |
| Measure local entropy | Distribution of next actions from each state | Separates routine pipes from ambiguous decision junctions. |
| Add risk weighting | Value intensity, compliance criticality, exception-sensitive activities, downstream error cost | Prevents high-consequence rare actions from being treated as harmless edge cases. |
| Define the HITL gate | Escalation rule based on support, entropy, and risk | Creates a scoped autonomy envelope instead of a vague “human oversight” promise. |
| Estimate review burden | Expected human touches per case under the gate | Converts governance design into a cost line. |
| Compare frontier options | Safe completion, zero-touch completion, review load, and error exposure | Decides whether to automate, assist, redesign, or leave the workflow alone. |

This is not anti-agent. It is anti-theater.

For many workflows, the rational architecture will not be a single fully autonomous agent. It will be a selective autonomy system: full automation for high-support, low-entropy states; assisted automation for moderate ambiguity; and human control for high-risk or poorly supported branches.
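A selective autonomy system of this kind might be sketched as a three-way router. All thresholds here are hypothetical, and sending very high-entropy states to a human is an added assumption; the paper does not prescribe this exact tiering.

```python
from enum import Enum


class Mode(Enum):
    AUTOMATE = "automate"   # agent acts alone
    ASSIST = "assist"       # agent proposes, human confirms
    HUMAN = "human"         # human decides


def route(support: int, entropy_bits: float, risk: float, *,
          min_support: int = 1000, low_entropy: float = 1.5,
          high_entropy: float = 2.0, max_risk: float = 0.5) -> Mode:
    """Route one decision to a tier; all threshold defaults are illustrative."""
    # High-risk or poorly supported branches always stay with a human.
    if risk > max_risk or support < min_support:
        return Mode.HUMAN
    if entropy_bits <= low_entropy:
        return Mode.AUTOMATE
    if entropy_bits <= high_entropy:
        return Mode.ASSIST
    # Assumption: ambiguity beyond the upper band also goes to a human.
    return Mode.HUMAN


routine_step = route(support=8000, entropy_bits=0.6, risk=0.1)
junction_step = route(support=8000, entropy_bits=1.8, risk=0.1)
rare_branch = route(support=40, entropy_bits=0.6, risk=0.1)
```

Risk and support are checked before entropy, so a low-ambiguity step never slips into automation on a branch with thin history or high stakes.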

That architecture may sound less exciting than “end-to-end autonomous enterprise agent.” It also has the advantage of being less likely to turn accounts payable into a small casino.

The paper’s boundaries are part of its value

The empirical evidence is useful, but it should not be overread.

First, the BPI 2019 log is observational. The held-out agent is evaluated against historical next actions. That is appropriate for a pre-deployment audit, because the company is asking what its historical process can support. But it is not the same as testing a deployed agent online, where counterfactual actions can change future states.

Second, the Markov representation depends on state design. The paper uses a first-order approximation after state abstraction. In a richer agentic system, the effective state might include longer memory, retrieved documents, tool outputs, conversation history, and latent business context. Adding meaningful context often expands the state space and can make support harder, not easier. But irrelevant fragmentation can also distort the audit. The state representation must be chosen by people who understand the process, not by someone trying to win a dashboard beauty contest.

Third, the paper’s risk weights are reproducible from public BPI fields, so they emphasize value intensity and exception-sensitive activities. A real deployment should replace that proxy with a domain-specific risk function: compliance severity, customer harm, fraud exposure, contractual penalties, audit requirements, and downstream operational cost.

Fourth, the validation is based on one procurement workflow. The method is designed for event-log-rich enterprise processes, not for every possible agent environment. The lesson travels best where cases, states, actions, timestamps, actors, and outcomes are already logged.

These limitations do not weaken the core business message. They clarify it. The paper is not a universal certificate for agent deployment. It is a way to stop pretending that deployment readiness begins after the prototype works.

The uncomfortable replacement for “deploy and monitor”

The common enterprise instinct is still: build the agent, test a few flows, deploy to a bounded user group, monitor failures, then tune.

The paper suggests a better sequence:

  1. Audit the workflow log.
  2. Measure support, entropy, and risk.
  3. Estimate the human-review burden implied by different autonomy gates.
  4. Decide which parts of the workflow deserve autonomy.
  5. Only then build the agent around that scoped design.

This reverses the usual emotional order of AI adoption. Instead of starting with what the model can do, it starts with what the business process can statistically justify.

That is less glamorous. It is also more serious.

The stochastic gap is not a mysterious property of LLMs. It is what appears when a probabilistic policy is placed inside a process that was never designed to support probabilistic autonomy at every branch. Some regions will be routine. Some will be ambiguous. Some will be rare but expensive. Some will require human judgment no matter how elegantly the agent explains itself.

The point is to know which is which before the invoice, ticket, claim, customer, or regulator becomes the test case.

Enterprise AI does not fail only because models hallucinate. It fails because businesses grant autonomy to trajectories they have not measured. The paper’s contribution is to make that failure mode auditable.

A good agent strategy, then, does not begin with a prompt library. It begins with a map of where the process can bear stochastic action, where it cannot, and how much human review the difference will cost.

Slightly less magical. Considerably more useful.

Cognaptus: Automate the Present, Incubate the Future.


  1. Biplab Pal and Santanu Bhattacharya, “The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence,” arXiv:2603.24582v1, March 25, 2026, https://arxiv.org/abs/2603.24582