The Reliability Gap: Why Smarter AI Agents Still Fail When It Matters

A customer service agent gets the refund policy right on Monday, wrong on Tuesday, and confidently wrong on Wednesday. A coding agent passes the benchmark, then casually rewrites the wrong file in production. A workflow agent behaves perfectly in a demo, then becomes confused when the API returns the same fields in a different order.

This is not the usual “AI is powerful but risky” sermon. That sermon is already old enough to need a walking stick.

The sharper point is simpler: an AI agent can become more accurate without becoming reliably deployable.

That is the central argument of “Towards a Science of AI Agent Reliability,” a paper by Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, and Arvind Narayanan.¹ The paper does not merely complain that benchmarks are incomplete. It proposes a concrete reliability profile for AI agents, built around four dimensions: consistency, robustness, predictability, and safety. Across those dimensions, the authors define twelve metrics and evaluate 15 agent-model combinations on GAIA and τ-bench.

Their main finding is uncomfortable for anyone selling “agentic automation” as if benchmark progress were a boarding pass to production: recent model generations have improved in raw task success, but reliability has improved only modestly. Some dimensions, such as calibration and safety, show visible progress. Others, especially consistency and failure discrimination, remain stubbornly weak.

The business implication is not that firms should stop deploying agents. That would be lazy caution, the intellectual equivalent of putting a traffic cone on everything. The implication is that firms need to stop treating one-shot accuracy as the main deployment signal. Accuracy asks, “How often does the agent succeed?” Reliability asks the more expensive question: “When it succeeds or fails, can we understand, predict, bound, and govern that behavior?”

That second question is where production systems live.

Accuracy measures the answer; reliability measures the behavior around the answer

The paper’s useful move is to separate capability from reliability.

Capability is about whether an agent can complete a task. Reliability is about how the agent behaves across repeated runs, altered inputs, uncertain conditions, and failure cases. Those are not the same property.

An agent that solves 80% of tasks may fail in two very different ways. It may fail on the same hard 20% every time. That is annoying but diagnosable. Or it may succeed and fail randomly across repeated attempts at the same task. That is operationally worse, because the same request can produce different outcomes depending on hidden randomness, infrastructure conditions, or minor phrasing differences. Congratulations: the system is now a slot machine with an enterprise UI.
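
To make the contrast concrete, here is a minimal sketch with made-up data: two hypothetical agents, five tasks, five runs each. The `outcome_consistency` function is a simplified stand-in (the share of tasks whose result never changes across runs), not the paper's metric.

```python
from statistics import mean

def accuracy(runs):
    """Mean success rate, averaged over tasks."""
    return mean(mean(task_runs) for task_runs in runs.values())

def outcome_consistency(runs):
    """Share of tasks whose outcome is identical in every run.

    Simplified illustrative measure, not the paper's definition.
    """
    return mean(1.0 if len(set(r)) == 1 else 0.0 for r in runs.values())

# Hypothetical results: 1 = success, 0 = failure.
stable_agent = {
    "t1": [1, 1, 1, 1, 1],
    "t2": [1, 1, 1, 1, 1],
    "t3": [1, 1, 1, 1, 1],
    "t4": [1, 1, 1, 1, 1],
    "t5": [0, 0, 0, 0, 0],  # always fails the same hard task
}
flaky_agent = {
    "t1": [1, 1, 1, 1, 0],
    "t2": [1, 1, 1, 0, 1],
    "t3": [1, 1, 0, 1, 1],
    "t4": [1, 0, 1, 1, 1],
    "t5": [0, 1, 1, 1, 1],  # every task fails once, somewhere
}
```

Both agents score 0.8 on `accuracy`. On `outcome_consistency`, the stable agent scores 1.0 and the flaky agent 0.0. The leaderboard cannot tell them apart; the on-call engineer can.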

The authors borrow from safety-critical engineering, where reliability is not reduced to average success. Aviation, nuclear power, automotive systems, and industrial control all care about repeatability, graceful degradation, known failure modes, and bounded harm. A system that performs well on average but fails unpredictably is not reliable simply because the average looks pleasant.

For AI agents, this matters more than it did for earlier chatbots because agents act. They call tools, change records, process refunds, modify files, write code, browse the web, and coordinate workflows. When a chatbot gives a bad answer, someone may roll their eyes. When an agent takes a bad action, someone may roll back a database, explain a compliance breach, or apologize to a customer with legal counsel copied in.

The paper formalizes reliability into four mechanisms:

Reliability mechanism	What it asks	Business risk exposed
Consistency	Does the agent behave similarly under identical conditions?	Unstable outputs, unfair treatment, audit difficulty, unpredictable cost
Robustness	Does performance survive prompt, tool, or environment perturbations?	Fragility under real user language, API changes, tool failures
Predictability	Does the agent know when it is likely to fail?	Bad escalation, misplaced trust, useless confidence scores
Safety	Are violations rare, and are failure consequences bounded?	Unauthorized actions, privacy exposure, financial loss, irreversible damage

The table is almost embarrassingly practical. It says: before asking whether the agent is impressive, ask whether it is governable.

Consistency is the boring metric that becomes exciting after the first lawsuit

The first reliability mechanism is consistency. This sounds dull until one imagines an airline refund agent approving a refund in three runs and denying it in two others under identical conditions.

The paper measures consistency at several levels. Outcome consistency asks whether the final success or failure repeats across runs. Trajectory consistency asks whether the agent follows similar action patterns and action sequences. Resource consistency asks whether the agent consumes similar amounts of time, tokens, API calls, or other resources across repeated executions.

This distinction is important. Two agents may both answer correctly, but one follows a stable path while the other wanders through different tools, files, and intermediate assumptions each time. In a creative writing assistant, that diversity may be fine. In a CI/CD coding agent, financial operations agent, or compliance-sensitive customer service agent, it becomes a problem. Auditors do not enjoy being told, “It usually gets there somehow.”

The paper finds that outcome consistency remains modest across models. Even frontier systems do not reliably improve consistency across both GAIA and τ-bench. The authors also identify a “what but not when” pattern: agents often choose similar categories of actions across runs, but vary the order and sequence of those actions. In plain terms, they may know the ingredients but still improvise the recipe.
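
The "what but not when" pattern can be illustrated by comparing two traces twice: once ignoring order, once respecting it. The sketch below uses Jaccard overlap and the standard library's `difflib.SequenceMatcher` as convenient stand-ins; the paper's trajectory metrics may be defined differently, and the tool names are hypothetical.

```python
from difflib import SequenceMatcher

def action_set_similarity(a, b):
    """Jaccard overlap of the kinds of actions used (order ignored)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def action_order_similarity(a, b):
    """Order-sensitive similarity of the two action sequences."""
    return SequenceMatcher(None, a, b).ratio()

# Two hypothetical runs of the same refund task.
run_1 = ["lookup_user", "get_booking", "check_policy", "issue_refund"]
run_2 = ["check_policy", "get_booking", "lookup_user", "issue_refund"]

same_ingredients = action_set_similarity(run_1, run_2)
same_recipe = action_order_similarity(run_1, run_2)
```

Here `same_ingredients` is 1.0 because both runs use exactly the same tools, while `same_recipe` falls below 1.0 because the order differs. A monitoring rule that only checks which tools were called would miss the difference entirely.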

For business use, this is not a small detail. A stable action path is easier to test, monitor, explain, and roll back. A variable path creates more possible failure surfaces. It also makes post-incident review harder: if the agent can solve the same problem through several routes, then one successful replay does not necessarily reproduce the failed production trace.

Resource consistency deserves special attention. In demos, cost variance is invisible because the demo is short and the budget is imaginary. In production, one identical request consuming 1,000 tokens and another consuming 50,000 tokens is not just a technical curiosity. It affects latency, cloud cost, rate limits, and service-level commitments. A capable agent that unpredictably burns resources is not cheap automation. It is variable-cost consulting with no calendar invite.
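
One simple way to put a number on this is the coefficient of variation of token usage across identical requests. This is an assumed, generic statistic chosen for illustration, not necessarily the paper's resource metric, and the token counts are invented.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation relative to the mean; 0 means perfectly steady."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0

# Hypothetical token counts for five runs of the same request.
steady_tokens = [1000, 1100, 950, 1050, 900]
spiky_tokens = [1000, 1200, 50000, 900, 1100]

steady_cv = coefficient_of_variation(steady_tokens)
spiky_cv = coefficient_of_variation(spiky_tokens)
```

The steady agent lands well under 0.1. The spiky one lands above 1, which means the spread is larger than the average itself. That is the kind of number a finance team would like to see before the invoice, not after.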

Robustness is where “same meaning” stops meaning “same result”

The second mechanism is robustness. The paper breaks robustness into fault robustness, environment robustness, and prompt robustness.

Fault robustness tests whether agents survive tool failures such as timeouts, server errors, rate limits, invalid responses, partial responses, or empty results. Environment robustness tests whether agents can handle semantic-preserving changes in data formats, tool interfaces, or response structures. Prompt robustness tests whether agents respond consistently when the user instruction is rephrased without changing its meaning.
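
Fault robustness testing can be approximated in-house with a thin wrapper that injects failures into a tool call. The sketch below is a hypothetical harness, not the paper's: `FlakyTool`, `call_with_retry`, and `get_booking` are illustrative names, and the injected fault is a plain `TimeoutError` on chosen call numbers so the test stays deterministic.

```python
class FlakyTool:
    """Wraps a tool and raises TimeoutError on the chosen call numbers."""

    def __init__(self, tool, fail_on_calls):
        self.tool = tool
        self.fail_on_calls = set(fail_on_calls)
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise TimeoutError(f"injected timeout on call {self.calls}")
        return self.tool(*args, **kwargs)

def call_with_retry(tool, *args, max_attempts=3, **kwargs):
    """Retry on timeouts; give up loudly after max_attempts."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return tool(*args, **kwargs)
        except TimeoutError as err:
            last_error = err
    raise RuntimeError("tool unavailable after retries") from last_error

def get_booking(booking_id):
    """Hypothetical tool standing in for a real API call."""
    return {"booking_id": booking_id, "status": "confirmed"}

flaky = FlakyTool(get_booking, fail_on_calls={1, 2})
result = call_with_retry(flaky, "B1")
```

The first two calls time out and the third succeeds, so `result` holds the booking and `flaky.calls` is 3. The interesting test cases are the ones where the agent, rather than a retry helper, has to decide what to do with the failure.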

The results are asymmetric. Fault and environment robustness show ceiling effects across many models under the perturbations the authors tested. That sounds reassuring, but only within a narrow boundary: the perturbations are controlled and represent only part of what production environments throw at agents. Schema migrations, tool version changes, shifting document layouts, authentication edge cases, and messy enterprise data can be much less polite.

Prompt robustness is more revealing. Many models remain sensitive to surface-level instruction changes. In τ-bench’s structured customer-service environment, rephrased instructions often still map to a narrow set of valid tool actions. The environment itself acts like guardrails. In GAIA’s open-ended web and tool-use tasks, the same kind of rephrasing can send the agent down a different search path, where small interpretation differences compound across steps.

This is the practical lesson: controlled environments hide some forms of fragility.

A customer may say “cancel my booking,” “end this reservation,” “I don’t want to fly anymore,” or “can you get me out of this trip?” A human sees the same intent. An unreliable agent may not. Worse, it may partially understand the intent, choose the wrong policy branch, and perform a formally valid but substantively wrong action.

For firms, prompt robustness should not be treated as a nice-to-have benchmark extension. It is the closest thing to real user traffic in miniature. Real users do not write benchmark prompts. They type fragments, abbreviations, typos, emotional complaints, and contradictory updates. Some even write like managers on a train with 7% battery. Enterprise systems should be tested against that reality, not against the clean grammar of a benchmark curator.
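
A paraphrase test can be as small as the sketch below. The `toy_agent` is a deliberately brittle keyword matcher standing in for a real model call, and `paraphrase_agreement` is an assumed helper, not a metric from the paper. The point is the harness shape: one intent, several phrasings, one expected action.

```python
def toy_agent(utterance):
    """Hypothetical brittle agent: keyword matching stands in for a model."""
    text = utterance.lower()
    if "cancel" in text or "reservation" in text:
        return "cancel_booking"
    return "ask_clarification"

def paraphrase_agreement(agent, paraphrases, expected_action):
    """Share of paraphrases for which the agent picks the expected action."""
    hits = [agent(p) == expected_action for p in paraphrases]
    return sum(hits) / len(hits)

# The four phrasings from the text, all carrying the same intent.
paraphrases = [
    "cancel my booking",
    "end this reservation",
    "I don't want to fly anymore",
    "can you get me out of this trip?",
]

agreement = paraphrase_agreement(toy_agent, paraphrases, "cancel_booking")
```

The toy agent handles two of the four phrasings, so `agreement` is 0.5. Swapping `toy_agent` for a real agent call turns this into a regression test that can run on every prompt or model change.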

Predictability is not confidence theater

The third mechanism is predictability. This is where many AI products quietly perform a small magic trick: they display confidence without proving that confidence means anything.

The paper measures predictability through calibration, discrimination, and Brier score. Calibration asks whether stated confidence matches empirical success rates. If an agent says it is 80% confident, it should be right roughly 80% of the time. Discrimination asks whether the agent gives higher confidence to tasks it will solve than to tasks it will fail. Brier score combines aspects of both.

The paper’s key distinction is that calibration and discrimination are not interchangeable.

A model can be well calibrated in aggregate while still failing to identify which specific tasks are dangerous. For example, if an agent says 70% on every task and is correct 70% of the time overall, calibration may look acceptable. But the confidence score is useless for routing. It does not tell the system which tasks to accept, defer, or escalate.
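
The difference shows up clearly in a small worked example. The sketch below uses invented data and textbook formulas: expected calibration error binned by distinct confidence value, AUROC computed by pairwise comparison, and the Brier score. The paper's exact metric definitions may differ.

```python
from collections import defaultdict

def calibration_error(confidences, outcomes):
    """Expected calibration error, one bin per distinct confidence value."""
    bins = defaultdict(list)
    for c, y in zip(confidences, outcomes):
        bins[c].append(y)
    n = len(outcomes)
    return sum(len(ys) / n * abs(c - sum(ys) / len(ys)) for c, ys in bins.items())

def auroc(confidences, outcomes):
    """Chance that a random success outranks a random failure (ties = 0.5)."""
    wins = [c for c, y in zip(confidences, outcomes) if y == 1]
    losses = [c for c, y in zip(confidences, outcomes) if y == 0]
    score = sum(1.0 if w > l else 0.5 if w == l else 0.0
                for w in wins for l in losses)
    return score / (len(wins) * len(losses))

def brier(confidences, outcomes):
    """Mean squared gap between confidence and outcome (lower is better)."""
    return sum((c - y) ** 2 for c, y in zip(confidences, outcomes)) / len(outcomes)

# Hypothetical: 20 tasks, 14 successes overall (70% accuracy).
outcomes = [1] * 9 + [0] + [1] * 5 + [0] * 5
flat_conf = [0.7] * 20                       # says 70% on everything
informative_conf = [0.9] * 10 + [0.5] * 10   # high on easy tasks, low on hard
```

Both confidence signals are perfectly calibrated on this data: the calibration error is zero for each. But `auroc(flat_conf, outcomes)` is 0.5, which is coin-flip routing, while the informative signal scores roughly 0.74. The Brier score moves from 0.21 to 0.17. Only the second signal tells an operator where to send the reviewers.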

This distinction is especially important for business workflows. Most firms do not need an agent to confess, in aggregate, that life is uncertain. They need operational signals: “This invoice reconciliation should go through automatically,” “This refund request should be reviewed,” “This database update should be blocked,” or “This legal-policy answer needs escalation.”

The paper finds that calibration has improved in recent models, with Claude models standing out in several settings. That is real progress. But discrimination is inconsistent. On τ-bench, newer models show some improvement. On GAIA, discrimination has not clearly improved and may worsen for some recent models. The authors interpret this as evidence that models may become better at estimating their average success rate without becoming better at identifying their own likely failures on individual tasks.

That is a very business-relevant failure mode. An agent with decent average confidence but poor task-level discrimination is like a weather forecast that says “somewhere in the country it may rain.” Technically probabilistic. Operationally useless if you are deciding whether to bring the delivery fleet indoors.

The paper’s confidence protocol is also worth noting. The authors use post-hoc self-assessment: after task completion, the agent is prompted to rate its confidence based on its execution trace. This is not the most sophisticated possible uncertainty estimator. But it is the one many practitioners can actually use through standard frontier-model APIs, where logits and internal states are usually unavailable. That makes the result more relevant, not less. The paper is testing the confidence signal that normal deployments can realistically obtain.

Safety is separated because averaging tail risk is how dashboards become fiction

The fourth mechanism is safety, but the paper uses the term narrowly. It does not mean full AI alignment, fairness, privacy, or broad social trustworthiness. It means bounded operational severity when agents fail.

The authors measure compliance and harm severity. Compliance asks whether an agent violates predefined constraints, such as exposing unrelated personal information, bypassing authorization, making unauthorized modifications, or circumventing policies. Harm severity asks how bad the violations are when they occur.

The aggregation choice here is important. The authors do not simply average safety into the overall reliability score. They treat safety separately because safety failures are tail phenomena. A catastrophic event occurring 1% of the time should not disappear inside a pleasant average. Anyone who has ever seen an executive dashboard knows how easily one red number can be diluted into twelve green ones. The spreadsheet says “mostly fine”; the incident report says “production database deleted.”
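
A few lines of arithmetic show how the dilution happens. The severity scale (0 to 10) and the blocking threshold below are assumptions made for illustration, not values from the paper.

```python
def mean_severity(severities):
    """The dashboard-friendly number that hides the tail."""
    return sum(severities) / len(severities)

def safety_summary(severities, blocking_severity=8):
    """Report safety separately: how often, how bad, and whether to block.

    blocking_severity is a hypothetical threshold on a 0-10 scale.
    """
    violations = [s for s in severities if s > 0]
    worst = max(severities)
    return {
        "violation_rate": len(violations) / len(severities),
        "worst_severity": worst,
        "blocks_release": worst >= blocking_severity,
    }

# Hypothetical: 99 clean runs and one catastrophic one.
severities = [0] * 99 + [10]

average = mean_severity(severities)
summary = safety_summary(severities)
```

The average severity is 0.1 out of 10, which would look green on almost any dashboard. The separate summary reports a 1% violation rate, a worst case of 10, and a blocked release. Same data, very different meeting.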

On τ-bench, the paper examines safety in a customer-service setting with domain-specific constraints such as blocking unauthorized modifications, ensuring correct transaction amounts, requiring identity verification, and resisting policy circumvention. Recent frontier models show lower violation rates, and high-severity violations are relatively rare. But financial accuracy remains a common failure mode, especially around incorrect charges or refunds.

That result should travel directly into deployment design. In transaction-heavy workflows, “the agent completed the task” is not enough. The firm must know whether the amount was correct, whether the user was authorized, whether the action followed policy, and whether the final database state hides intermediate violations.

The paper’s method uses LLM-based judging for safety analysis. That is scalable, but it introduces its own reliability issue. A safety metric judged by another model should not be mistaken for legal certainty. For early-stage evaluation, it is useful. For regulated deployment, it should be paired with deterministic checks, human review, or domain-specific validators where possible.
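
A deterministic check of this kind does not need to be elaborate. The sketch below is a hypothetical validator for a refund action; the `RefundAction` fields and the three rules are assumptions chosen to mirror the constraints mentioned above (identity verification, authorization, correct amounts), not code from the paper or from τ-bench.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class RefundAction:
    """Hypothetical record of what the agent is about to do."""
    order_id: str
    amount: Decimal
    requested_by: str
    identity_verified: bool

def validate_refund(action, order_total, order_owner):
    """Return a list of rule violations; an empty list means the action passes."""
    violations = []
    if not action.identity_verified:
        violations.append("identity_not_verified")
    if action.requested_by != order_owner:
        violations.append("unauthorized_requester")
    if action.amount <= 0 or action.amount > order_total:
        violations.append("amount_out_of_bounds")
    return violations

good = RefundAction("o-1", Decimal("40.00"), "alice", True)
bad = RefundAction("o-1", Decimal("400.00"), "mallory", False)

good_result = validate_refund(good, Decimal("40.00"), "alice")
bad_result = validate_refund(bad, Decimal("40.00"), "alice")
```

The first action passes with an empty list. The second trips all three rules. Unlike an LLM judge, this check gives the same verdict every time it runs, which is rather the point.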

The experiments are evidence for a pattern, not a universal ranking table

The experimental design matters because it defines how far we should generalize the findings.

The authors evaluate 15 models across two benchmarks. GAIA tests general assistant behavior involving web browsing, file manipulation, code execution, and multi-step reasoning. τ-bench tests tool-using customer-service agents in simulated airline and retail environments, where agents interact with users and databases to complete tasks.

These benchmarks are complementary. GAIA is open-ended and messy. τ-bench is structured and transactional. That contrast is useful because reliability behaves differently across them. τ-bench shows moderate reliability gains, likely because structured environments constrain possible action paths. GAIA shows weaker reliability progress, especially where open-ended search, tool selection, and multi-step execution create more opportunities for divergence.

The paper also restricts τ-bench analysis to a verified 26-task subset because prior work identified errors in the original benchmark. This is not a minor footnote. The authors show that benchmark quality affects reliability measurements, especially predictability. If a correct answer is marked wrong because the benchmark label is flawed, a confident agent will look overconfident. In other words, dirty evaluation data can make a reliability profile lie.
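
The mechanism is plain arithmetic. In the sketch below, a hypothetical agent states 90% confidence and really does get nine of ten tasks right. The numbers are invented; only the direction of the distortion is the point.

```python
def measured_overconfidence(confidence, true_outcomes, mislabeled):
    """Gap between stated confidence and accuracy as scored by the benchmark.

    mislabeled holds the indices of tasks the benchmark scores as failures
    regardless of what the agent actually did.
    """
    scored = [0 if i in mislabeled else y for i, y in enumerate(true_outcomes)]
    return confidence - sum(scored) / len(scored)

# Hypothetical: nine real successes, one real failure.
true_outcomes = [1] * 9 + [0]

clean_gap = measured_overconfidence(0.9, true_outcomes, mislabeled=set())
noisy_gap = measured_overconfidence(0.9, true_outcomes, mislabeled={0, 1})
```

With clean labels the gap is zero. With two correct answers marked wrong, the same agent appears overconfident by about 0.2. Nothing about the agent changed; only the answer key did.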

A useful way to read the experimental evidence is this:

Evidence component	Likely purpose	What it supports	What it does not prove
Main GAIA and τ-bench evaluation	Main evidence	Reliability gains lag behind accuracy gains across tested agents	All future agents will show the same pattern
Multi-run consistency testing	Main evidence	Same task can produce unstable outcomes and trajectories	Every deployment will have the same variance level
Prompt perturbation tests	Robustness/sensitivity test	Surface-level rephrasing can change performance, especially in open-ended tasks	Full adversarial robustness
Fault and environment perturbations	Robustness/sensitivity test	Agents handle some controlled disruptions well	Robustness to all production schema, API, and data drift
Confidence self-assessment	Implementation-relevant predictability test	Available confidence signals can be calibrated yet weak at task-level discrimination	Best possible uncertainty estimation
τ-bench clean subset comparison	Benchmark-quality check	Label errors can distort reliability profiles	Benchmark cleaning solves all evaluation problems
Safety analysis with LLM judge	Exploratory operational safety measurement	Constraint violations and harm severity can be measured from traces	Legally complete safety assurance

This framing avoids two mistakes. The first is overclaiming the paper as a final certification method for agents. It is not. The second is underclaiming it as “just another benchmark paper.” It is more useful than that because it turns vague reliability concerns into measurable deployment questions.

The business shift is from model selection to release governance

For firms, the paper’s main value is not the ranking of model providers. Provider rankings age quickly. Governance mechanisms age more slowly.

The more durable lesson is that agent deployment should include reliability gates. A gate is not a vibe check, a demo, or a senior manager saying, “Looks good.” It is a defined threshold or review process that determines whether an agent can move from sandbox to pilot, from pilot to production, or from human-supervised mode to autonomous mode.

A practical reliability gate could include:

Gate	Test	Deployment decision it informs
Repeated-run stability	Run the same task multiple times under identical conditions	Is the agent deterministic enough for this workflow?
Prompt variation	Test naturalistic rephrasings, typos, fragments, and style changes	Can real users express intent safely?
Tool and API perturbation	Change response formats, field names, date formats, and tool outputs	Will minor environment drift break execution?
Confidence calibration	Compare confidence with actual success rate	Can confidence be shown to users or operators?
Failure discrimination	Test whether low confidence identifies likely failures	Can the system route tasks to human review?
Safety constraints	Check authorization, data exposure, policy compliance, and destructive actions	Which actions require hard blocking or approval?
Severity-weighted incident review	Classify failures by operational harm, not just task failure	Which failures are acceptable, recoverable, or deployment-blocking?

The point is not to copy the paper’s metrics mechanically. Different businesses should weight dimensions differently. A brainstorming assistant may benefit from low trajectory consistency. A payment-processing agent should not. A code assistant used by a developer can tolerate more uncertainty than an unattended deployment agent with write access to production systems.

Reliability requirements should scale with autonomy. When a human reviews every output before action, imperfect reliability is manageable. When the agent’s output directly changes records, sends messages, executes trades, issues refunds, or modifies infrastructure, the reliability bar rises sharply.
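
In practice a gate of this kind is a small piece of configuration plus a function that refuses to be charmed. The sketch below is one possible shape. The metric names, the two autonomy levels, and every threshold are hypothetical placeholders that a team would need to set for its own workflow.

```python
# Hypothetical minimum scores per autonomy level; not values from the paper.
THRESHOLDS = {
    "human_reviewed": {
        "outcome_consistency": 0.6,
        "prompt_robustness": 0.7,
        "failure_discrimination": 0.6,
    },
    "autonomous": {
        "outcome_consistency": 0.9,
        "prompt_robustness": 0.9,
        "failure_discrimination": 0.8,
    },
}

def evaluate_gate(metrics, autonomy, safety_blocked):
    """Return a release decision and the list of gates that failed.

    Safety is checked separately and is never averaged with other metrics.
    A metric that was not measured counts as a failure.
    """
    failures = [
        name
        for name, minimum in THRESHOLDS[autonomy].items()
        if metrics.get(name, 0.0) < minimum
    ]
    if safety_blocked:
        failures.append("safety")
    return {"release": not failures, "failures": failures}

# Hypothetical measurements for one agent.
metrics = {
    "outcome_consistency": 0.75,
    "prompt_robustness": 0.8,
    "failure_discrimination": 0.7,
}

supervised = evaluate_gate(metrics, "human_reviewed", safety_blocked=False)
unattended = evaluate_gate(metrics, "autonomous", safety_blocked=False)
```

The same agent passes the human-reviewed gate and fails all three autonomous thresholds. That is the argument of this section expressed as a return value: the agent did not get worse, the bar got higher.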

This distinction is one reason coding assistants spread quickly despite imperfect reliability. They often remain augmentation tools: the human developer reviews, edits, and approves. But the moment an agent moves from “suggest a patch” to “merge and deploy,” the same reliability profile becomes much less charming.

The boundary: this is a measurement starting point, not a production certificate

The paper is careful about limitations, and those limitations matter for business interpretation.

First, the empirical scope is narrow. GAIA and τ-bench are useful but cannot represent all enterprise workflows. A supply-chain planning agent, a clinical documentation agent, a compliance review agent, and a crypto trading agent will each have different failure modes.

Second, each benchmark uses one scaffold. Agent reliability is not only a property of the base model. It is also shaped by the system prompt, tool interface, memory design, retry logic, validators, permissions, and orchestration layer. A better scaffold may improve reliability; a worse one may quietly sabotage a strong model.

Third, the safety analysis relies on LLM judging. That is reasonable for scalable research, but organizations should not outsource high-stakes compliance judgment entirely to another probabilistic model. Where possible, safety checks should be hardened with rule-based constraints, permission systems, transaction limits, approval workflows, and logs that humans can audit.

Fourth, the paper sets temperature to zero where applicable. That isolates non-sampling sources of stochasticity, but it may overestimate reliability in workflows where nonzero temperature improves performance or creativity. In other words, the real deployment may be noisier than the experiment.

Finally, the metrics themselves are design choices. The four-dimensional decomposition is useful because it is concrete, not because it is metaphysically complete. Reliability for AI agents is still an immature discipline. The paper’s contribution is not the final constitution of agent reliability. It is a working measurement system that makes previously invisible failure modes visible enough to argue about.

That is progress. Not glamorous progress, perhaps, but production systems are mostly built from unglamorous progress. Glamour does not pass audits.

The reliability gap is a management problem disguised as a benchmark problem

The paper’s most important message is that agent reliability is not a property firms can infer from model capability alone. It has to be measured, managed, and governed directly.

For AI builders, this means model evaluation should report reliability profiles, not only task success. For product teams, it means user-facing confidence should be treated as an empirically tested signal, not a decorative number. For operations teams, it means prompts, tools, schemas, and model versions need change-control processes. For executives, it means agent deployment should be tied to autonomy level: the more the agent can do without a human, the more reliability evidence it must provide.

The misconception to discard is simple: smarter agents are not automatically safer, steadier, or more production-ready. They may be more capable, but capability is only one piece of the deployment puzzle.

The replacement view is more useful: agent readiness is a profile. A firm should ask whether the agent is consistent enough for repeated use, robust enough for messy inputs and shifting environments, predictable enough to support escalation, and safe enough that failures remain bounded.

That is less exciting than announcing that the new model scored higher on a benchmark. It is also more likely to prevent the Friday evening incident where everyone discovers that the agent was “accurate” in the same way a confident intern is accurate: impressive in interviews, dangerous near write permissions.

Cognaptus: Automate the Present, Incubate the Future.

¹ Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, and Arvind Narayanan, “Towards a Science of AI Agent Reliability,” arXiv:2602.16666, 2026. https://arxiv.org/abs/2602.16666
