Opening — Why this matters now
Enterprise AI has entered its most awkward phase: impressive demos, disappointing deployments.
The industry is discovering—quietly, and expensively—that building an agent that can act is not the same as building one that should act. The difference is not philosophical. It is statistical, operational, and ultimately financial.
The paper “The Stochastic Gap” formalizes this discomfort. It reframes agentic AI not as a prompt-engineering problem, but as a trajectory reliability problem under uncertainty. In other words, your agent isn’t failing because it picked a wrong answer—it’s failing because it walked down a path your business has never statistically justified.
That is a far more uncomfortable diagnosis.
Background — From deterministic workflows to stochastic agents
Traditional enterprise systems—ERP, BPM, rule engines—are designed to behave almost deterministically. Every step is governed by approvals, constraints, and validation rules.
Then we insert an LLM-based agent.
Suddenly:
- Actions are sampled from probability distributions
- Workflows become stochastic trajectories
- Rare branches and loops accumulate uncertainty
The paper calls this mismatch the stochastic gap.
| System Type | Decision Logic | Risk Profile |
|---|---|---|
| Rule-based ERP | Deterministic | Predictable, auditable |
| LLM Agent | Probabilistic policy | Path-dependent, compounding uncertainty |
The key shift is subtle but critical:
Reliability is no longer about a single correct step—it is about the entire path remaining statistically supported.
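The path-level view can be made concrete with a toy calculation (my illustration, not the paper's model): if steps succeed independently, per-step reliability compounds multiplicatively along the trajectory.

```python
# Illustration: per-step reliability compounds multiplicatively along a
# trajectory, so high step accuracy can still mean low end-to-end reliability.
def trajectory_reliability(step_reliability: float, n_steps: int) -> float:
    """Probability that every step in an n-step path succeeds,
    assuming (simplistically) independent steps."""
    return step_reliability ** n_steps

# A 95%-accurate step looks safe in isolation...
print(trajectory_reliability(0.95, 1))
# ...but a 20-step workflow of such steps succeeds only about a third of the time.
print(round(trajectory_reliability(0.95, 20), 2))  # ≈ 0.36
```

The independence assumption is generous; correlated errors along a path can make the picture worse, not better.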
Analysis — The Markov audit nobody is doing
The authors propose a Markov framework built directly from enterprise event logs. Instead of asking “Can the agent do this?”, they ask:
“Has this sequence of decisions ever been sufficiently observed before?”
1. The hidden variable: Blind-spot mass
Two core metrics emerge:
- State blind-spot mass: the share of workflow events that occur in rarely observed states
- State-action blind-spot mass: the share of decisions (state-action pairs) that lack sufficient historical support
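Both metrics reduce to a counting exercise over the event log. A minimal sketch (the function and toy data are mine; the paper's exact state encoding will differ):

```python
from collections import Counter

def blind_spot_mass(events, tau=1000):
    """Fraction of events occurring in states (or state-action pairs)
    observed fewer than tau times in the log.
    `events` is a list of hashable keys: states, or (state, action) pairs."""
    counts = Counter(events)
    blind = sum(c for c in counts.values() if c < tau)
    return blind / len(events)

# Toy log: state-level support can look fine while state-action support is thin.
states = ["approve"] * 1500 + ["exception"] * 400
state_actions = [("approve", "sign")] * 900 + [("approve", "reject")] * 600 \
              + [("exception", "manual")] * 400
print(blind_spot_mass(states))         # only the rare 'exception' state is blind
print(blind_spot_mass(state_actions))  # every pair falls below tau here
```

The same log, keyed two ways, gives very different blind mass, which is exactly the state-versus-decision gap the paper measures.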
Here’s the uncomfortable result from the BPI 2019 dataset:
| Metric | Value (τ = 1000) |
|---|---|
| State blind-spot mass | 4.6% |
| State-action blind-spot mass | 12.5% |
The implication is brutal:
Your process may look well-covered—until the moment a decision is required.
And agents, unfortunately, specialize in making decisions.
2. Entropy: Where autonomy quietly collapses
Even in well-supported states, the next action may be ambiguous.
The paper measures this ambiguity with the Shannon entropy of the next-action distribution: high entropy means many plausible next steps.
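Computing that entropy from a log is straightforward. A sketch, with invented action names:

```python
import math
from collections import Counter

def next_action_entropy(actions):
    """Shannon entropy (bits) of the empirical next-action distribution
    observed in one state. Higher entropy = more ambiguity."""
    counts = Counter(actions)
    n = len(actions)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A routine state with one dominant action: low entropy.
print(next_action_entropy(["post"] * 95 + ["hold"] * 5))                      # ≈ 0.29 bits
# An exception-handling state with four equally plausible actions: high entropy.
print(next_action_entropy(["escalate", "rework", "cancel", "approve"] * 25))  # 2.0 bits
```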
From the empirical findings:
- Approval changes and exception handling exhibit the highest entropy (~3 bits)
- These are precisely the steps businesses least want automated blindly
Now the interesting part.
| Entropy Threshold | Autonomous Events | Fully Autonomous Cases |
|---|---|---|
| h₀ = 2.0 | 72.2% | 49.6% |
| h₀ = 1.5 | 53.3% | 7.1% |
This is the “illusion of autonomy.”
At the step level, everything looks fine. At the workflow level, autonomy collapses.
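The step-versus-workflow gap can be reproduced with a toy example (the cases and gate below are invented for illustration): a case is fully autonomous only if every one of its events clears the gate.

```python
def case_autonomy_rate(cases, is_autonomous):
    """Share of cases in which *every* event clears the autonomy gate.
    Shows why a high event-level rate can halve at the case level."""
    auto = sum(all(is_autonomous(e) for e in case) for case in cases)
    return auto / len(cases)

# Toy data: one ambiguous event ("X") costs a case its full autonomy,
# even though most events are fine.
cases = [["a", "a", "a"], ["a", "X", "a"], ["a", "a"], ["X", "a", "a"]]
ok = lambda e: e != "X"
events = [e for c in cases for e in c]
print(sum(map(ok, events)) / len(events))  # event-level: ≈ 0.82
print(case_autonomy_rate(cases, ok))       # case-level: 0.5
```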
3. The gate: Where humans re-enter the system
The framework defines a simple but powerful control mechanism:
Escalate if:
- Low support (rare state)
- High entropy (ambiguous decision)
- High risk (economic or exception-sensitive)
This creates a Human-in-the-Loop (HITL) gate.
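The gate itself is a three-way disjunction. A sketch, where the threshold values are illustrative knobs rather than numbers prescribed by the paper:

```python
def should_escalate(support: int, entropy_bits: float, risk: float,
                    tau: int = 1000, h0: float = 2.0, r0: float = 0.5) -> bool:
    """HITL gate: escalate when any of the three conditions fires.
    tau = minimum observation count, h0 = entropy cap, r0 = risk cap
    (all threshold values here are illustrative)."""
    return support < tau or entropy_bits > h0 or risk > r0

# Well-supported, unambiguous, low-risk step: the agent may act alone.
assert should_escalate(support=5000, entropy_bits=0.3, risk=0.1) is False
# Ambiguous approval change: a human re-enters the loop.
assert should_escalate(support=5000, entropy_bits=3.0, risk=0.1) is True
```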
And here’s the twist:
The same gate that improves reliability also defines your cost structure.
4. Reliability and cost are the same equation
The paper derives an elegant identity:
| Component | Effect |
|---|---|
| More autonomy | Lower human cost, higher error risk |
| More escalation | Higher cost, lower error risk |
This produces a reliability-cost frontier:
| Entropy Threshold | Safe Completion | Human Touches per Case |
|---|---|---|
| 1.5 | 55.6% | 3.02 |
| 2.0 | 49.6% | 2.26 |
| 2.25 | 45.2% | 1.90 |
There is no free lunch. Only trade-offs.
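The trade-off can be sketched with a back-of-envelope cost model (all symbols and numbers below are mine, not the paper's): escalated steps buy down error risk at a fixed human price.

```python
def expected_cost_per_case(p_escalate: float, n_steps: int,
                           c_human: float, p_err: float, c_err: float) -> float:
    """Toy cost model: escalated steps incur human review cost;
    autonomous steps carry residual error risk."""
    human_cost = p_escalate * n_steps * c_human
    error_cost = (1 - p_escalate) * n_steps * p_err * c_err
    return human_cost + error_cost

# Tightening the gate (more escalation) trades error cost for human cost.
loose = expected_cost_per_case(p_escalate=0.1, n_steps=10, c_human=5, p_err=0.02, c_err=200)
tight = expected_cost_per_case(p_escalate=0.5, n_steps=10, c_human=5, p_err=0.02, c_err=200)
print(loose, tight)
```

Whichever threshold wins depends entirely on the ratio of human cost to error cost, which is why the frontier, not any single point, is the deliverable.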
Findings — What actually predicts agent performance
The paper validates its framework with a simulated agent trained on 80% of the data and tested on 20%.
Two findings matter:
1. Simple probabilities beat intuition
The metric m(s) = max π(a|s), the empirical probability of a state's most likely next action, tracks real-world accuracy within 3.4 percentage points.
Translation:
You don’t need a better model. You need to respect the probability distribution you already have.
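That distribution is already sitting in the event log. A sketch of extracting m(s) empirically (state and action names invented):

```python
from collections import Counter, defaultdict

def top_action_probability(log):
    """m(s) = max_a pi(a|s): the empirical probability of the most
    common next action in each state, computed from (state, action) pairs."""
    by_state = defaultdict(Counter)
    for state, action in log:
        by_state[state][action] += 1
    return {s: max(c.values()) / sum(c.values()) for s, c in by_state.items()}

log = [("approve", "sign")] * 8 + [("approve", "reject")] * 2 \
    + [("exception", "manual")] * 3 + [("exception", "cancel")] * 3
m = top_action_probability(log)
print(m["approve"])    # 0.8 -> confident; a reasonable autonomy candidate
print(m["exception"])  # 0.5 -> a coin flip; escalate
```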
2. Path dependence dominates outcomes
Even when individual steps are accurate, small uncertainties compound.
| Metric | Value at h₀ = 2.0 |
|---|---|
| Cases staying in autonomous zone | 42.3% |
| Fully correct zero-touch completion | 16.1% |
That gap is the cost of stochasticity.
Implications — What businesses should actually do
The paper quietly dismantles a common enterprise instinct:
“Let’s deploy first, then monitor.”
No.
You already have the data to audit your agent—before it exists.
1. Pre-deployment audits should become standard
Before building an agent:
- Measure state-action support
- Identify high-entropy decision points
- Quantify blind-spot mass
This tells you where autonomy is justified.
2. Most workflows are partially automatable—not fully
The data suggests a layered architecture:
| Zone | Strategy |
|---|---|
| High support, low entropy | Full autonomy |
| Medium uncertainty | Assisted automation |
| High entropy / high risk | Human control |
In practice, this resembles a selective autonomy system, not a fully autonomous agent.
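The three-zone table reduces to a small classifier over the support and entropy metrics already computed. A sketch, with illustrative thresholds (the paper tunes these on event-log data):

```python
def autonomy_zone(support: int, entropy_bits: float,
                  tau: int = 1000, h_low: float = 1.0, h_high: float = 2.0) -> str:
    """Map a decision point to one of the three zones:
    full autonomy, assisted automation, or human control.
    Threshold values are illustrative, not prescribed by the paper."""
    if support >= tau and entropy_bits <= h_low:
        return "full autonomy"
    if support >= tau and entropy_bits <= h_high:
        return "assisted automation"
    return "human control"

assert autonomy_zone(5000, 0.4) == "full autonomy"
assert autonomy_zone(5000, 1.7) == "assisted automation"
assert autonomy_zone(200, 0.4) == "human control"
```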
3. ROI is constrained by oversight, not model quality
This is the paper’s most commercially relevant insight.
If your agent requires frequent escalation:
- Human cost dominates
- ROI collapses
Which explains why so many AI projects fail—not technically, but economically.
4. More context can make things worse
Refining the state representation, e.g. adding value, actor, or context attributes:
- Improves realism
- Expands state space
- Increases blind-spot mass
In short:
The more “intelligent” your system becomes, the less statistically supported it may be.
A slightly ironic outcome.
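The refinement effect is easy to demonstrate: splitting well-supported states into finer slices pushes each slice below the support threshold. A toy example (tau is deliberately tiny here for illustration):

```python
from collections import Counter

def blind_mass(events, tau=3):
    """Fraction of events in states observed fewer than tau times."""
    counts = Counter(events)
    return sum(c for c in counts.values() if c < tau) / len(events)

# Coarse states: every state is well supported.
coarse = ["approve"] * 6 + ["pay"] * 6
# Refine each state with an actor attribute: same data, thinner slices.
refined = [("approve", a) for a in "AABBCC"] + [("pay", a) for a in "AABBCC"]
print(blind_mass(coarse))   # 0.0 -- no blind mass at the coarse level
print(blind_mass(refined))  # 1.0 -- every refined state is now rare
```

Same log, same events; only the state definition changed, and statistical support evaporated.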
Conclusion — The uncomfortable truth about agentic AI
The paper does not argue that agents are unreliable.
It argues something more precise—and more inconvenient:
Most enterprise workflows were never statistically designed to support autonomy in the first place.
The implication is clear.
Agent deployment is not a modeling problem. It is a data support problem disguised as intelligence.
And until businesses start auditing that support, they will continue to deploy agents that look competent in isolation—and fail spectacularly in sequence.
Subtle, predictable, and entirely avoidable.
Cognaptus: Automate the Present, Incubate the Future.