Opening — Why this matters now

AI agents are no longer experimental toys. They browse the web, execute code, manage workflows, interact with databases, and increasingly operate without human supervision. Their raw task accuracy is climbing steadily.

Yet something uncomfortable is emerging: higher accuracy does not mean dependable behavior.

An agent that succeeds 80% of the time but fails unpredictably—or catastrophically—does not behave like software. It behaves like a probabilistic intern with admin privileges.

A recent research effort proposes a shift in perspective: stop asking “How often does the agent succeed?” and start asking “How predictably, consistently, robustly, and safely does it behave?”

That shift sounds semantic. It is not.


Background — Accuracy Was Never Enough

Traditional AI benchmarks optimize for task success rates. But safety‑critical industries—aviation, nuclear power, automotive systems—have long known that average performance is only one dimension of reliability.

They decompose reliability into structured pillars:

| Dimension | Engineering Question | Why Accuracy Fails Here |
|---|---|---|
| Consistency | Does the system behave the same way under identical conditions? | Accuracy ignores run‑to‑run variance |
| Robustness | Does performance degrade gracefully under perturbation? | Accuracy measures nominal conditions only |
| Predictability | Does the system know when it is likely to fail? | Accuracy says nothing about calibration |
| Safety | How severe are failures when they occur? | Accuracy treats benign and catastrophic errors equally |

AI evaluation rarely mirrors this structure. Agents are typically ranked by pass@k or aggregate accuracy—metrics that measure capability, not operational dependability.

The paper reframes the discussion: reliability must be disentangled from capability.


Operationalizing Reliability — From Philosophy to Metrics

The authors introduce a four‑pillar framework with twelve concrete sub‑metrics designed to be computable, reproducible, and independent of raw accuracy.

1. Consistency (RCon)

Agents are stochastic. Re‑running the same task may yield different outcomes, action sequences, and resource usage.

Consistency is decomposed into:

  • Outcome consistency (Cout) – Do repeated runs yield the same success/failure outcome?
  • Trajectory consistency (Ctraj) – Do agents follow similar action paths?
  • Resource consistency (Cres) – Do token and compute costs remain stable?

Formally:

$$ R_{Con} = \frac{1}{3}(C_{out} + C_{traj} + C_{res}) $$

Crucially, outcome variance is normalized by $p(1-p)$ to isolate reliability from success rate. A 90% accurate model that fails randomly is less reliable than one that fails on a fixed, diagnosable subset of tasks.
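To make this concrete, here is a minimal sketch of how outcome consistency could be computed from repeated runs, assuming a (tasks × runs) matrix of binary outcomes and the $p(1-p)$ normalization described above; the function name and the variance estimator are illustrative, not the authors' reference implementation.

```python
import numpy as np

def outcome_consistency(outcomes: np.ndarray) -> float:
    """Outcome consistency (C_out) from repeated runs.

    outcomes: array of shape (n_tasks, n_runs), 1 = success, 0 = failure.
    Run-to-run variance is normalized by p(1 - p), so the score reflects
    how the variance compares to a coin flip at the same success rate,
    rather than the success rate itself.
    """
    p = outcomes.mean()                      # overall success rate
    if p in (0.0, 1.0):                      # all-success or all-failure: no variance
        return 1.0
    per_task_var = outcomes.var(axis=1)      # variance across runs, per task
    normalized = per_task_var.mean() / (p * (1 - p))
    return float(1.0 - min(normalized, 1.0)) # 1.0 = fully consistent

# Two hypothetical agents with identical 75% accuracy:
fixed_failures = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])
random_failures = np.array([[1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 1], [1, 1, 1, 0]])
print(outcome_consistency(fixed_failures))   # 1.0 -> fails on a fixed, diagnosable subset
print(outcome_consistency(random_failures))  # 0.0 -> same accuracy, fails unpredictably
```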


2. Robustness (RRob)

Robustness measures relative degradation under perturbation:

$$ R_{env} = \frac{Acc_{perturbed}}{Acc_{baseline}} $$

Rather than penalizing weaker models for lower base accuracy, the ratio isolates sensitivity to environmental or structural shifts.

Robustness is further split into fault, structural, and prompt perturbations:

$$ R_{Rob} = \frac{1}{3}(R_{fault} + R_{struct} + R_{prompt}) $$
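For intuition, here is a hedged sketch of how the degradation ratio and the three-way split could be combined; the data layout, the cap at 1.0, and the function name are illustrative assumptions rather than the paper's implementation.

```python
def robustness_score(acc_baseline: float, acc_perturbed: dict[str, float]) -> dict[str, float]:
    """Relative degradation under perturbation, following the ratio above.

    acc_perturbed maps each perturbation family ('fault', 'struct', 'prompt')
    to accuracy measured under that perturbation. Ratios are capped at 1.0
    (an assumption) so noise-driven improvements do not inflate the score.
    """
    if acc_baseline <= 0:
        return {name: 0.0 for name in acc_perturbed} | {"R_Rob": 0.0}
    components = {name: min(acc / acc_baseline, 1.0) for name, acc in acc_perturbed.items()}
    components["R_Rob"] = sum(components.values()) / len(acc_perturbed)
    return components

# A weaker model is not penalized twice for its lower baseline:
print(robustness_score(0.60, {"fault": 0.54, "struct": 0.57, "prompt": 0.48}))
# fault ≈ 0.90, struct ≈ 0.95, prompt ≈ 0.80 -> R_Rob ≈ 0.88
```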


3. Predictability (RPred)

Does the agent know when it is likely to fail?

The framework uses post‑hoc self‑confidence elicitation. After task completion, the agent assigns a 0–100 confidence score. Calibration and discrimination are evaluated using metrics such as Brier score and AUROC.

$$ R_{Pred} = P_{brier} $$

Calibration has improved in newer models. Discrimination—the ability to distinguish solvable from unsolvable tasks—lags.
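As a concrete illustration of the calibration-versus-discrimination split, the sketch below scores post-hoc confidence reports with a Brier score (calibration) and AUROC (discrimination). The sample confidences and outcomes are invented, and the elicitation format is not the paper's exact protocol.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

# Post-hoc self-reported confidence (0-100, rescaled) and actual outcomes (1 = success)
confidence = np.array([95, 90, 85, 88, 92, 60, 55, 80]) / 100.0
outcomes   = np.array([ 1,  1,  0,  1,  0,  0,  1,  1])

# Calibration: how far stated confidence sits from realized outcomes (lower is better)
brier = brier_score_loss(outcomes, confidence)

# Discrimination: does the agent rank tasks it will solve above tasks it will fail?
auroc = roc_auc_score(outcomes, confidence)

print(f"Brier score: {brier:.3f}  (calibration)")
print(f"AUROC:       {auroc:.3f}  (discrimination)")
# Reasonable average calibration can coexist with AUROC near 0.5, i.e. little
# ability to tell *which* tasks will fail -- the gap the findings highlight.
```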


4. Safety (RSaf)

Safety follows classical risk decomposition:

$$ R_{Saf} = 1 - P(\text{violation}) \cdot E[\text{severity} \mid \text{violation}] $$

Failures are not equal. A formatting mistake and a destructive database action cannot share the same penalty.

Notably, safety is reported separately rather than averaged into overall reliability, preventing tail risk from being diluted by benign averages.
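A minimal sketch of the severity-weighted safety term, assuming violations are logged with severities on a 0 to 1 scale; the scale and the example numbers are placeholders rather than the paper's rubric.

```python
def safety_score(violation_severities: list[float], n_episodes: int) -> float:
    """R_Saf = 1 - P(violation) * E[severity | violation].

    violation_severities: severity in [0, 1] for each episode containing a
    violation (e.g. 0.1 for a formatting slip, 1.0 for a destructive action).
    n_episodes: total number of evaluated episodes.
    """
    if not violation_severities:
        return 1.0
    p_violation = len(violation_severities) / n_episodes
    mean_severity = sum(violation_severities) / len(violation_severities)
    return 1.0 - p_violation * mean_severity

# Same violation *rate*, very different tail risk:
print(safety_score([0.1, 0.1, 0.1], n_episodes=100))  # benign slips        -> ~0.997
print(safety_score([1.0, 0.1, 0.1], n_episodes=100))  # one destructive act -> ~0.988
```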


Overall Reliability

$$ R = \frac{1}{3}(R_{Con} + R_{Pred} + R_{Rob}) $$

Safety acts as a constraint, not a trade‑off variable.

This design choice alone is refreshingly non‑naïve.
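One way that choice could look in practice is a hard deployment gate, sketched below. The aggregation follows the formula above, while the threshold values and gating logic are my assumptions about how the reported scores might be consumed, not the paper's prescription.

```python
def overall_reliability(r_con: float, r_pred: float, r_rob: float) -> float:
    """R = (R_Con + R_Pred + R_Rob) / 3; safety is reported separately."""
    return (r_con + r_pred + r_rob) / 3.0

def deployment_gate(r: float, r_saf: float,
                    min_reliability: float = 0.70,
                    min_safety: float = 0.99) -> bool:
    """Safety as a hard constraint: high reliability cannot buy back tail risk.
    Threshold values here are illustrative, not taken from the paper."""
    return r >= min_reliability and r_saf >= min_safety

r = overall_reliability(r_con=0.62, r_pred=0.71, r_rob=0.85)
print(round(r, 3), deployment_gate(r, r_saf=0.95))  # 0.727 False -> blocked by safety alone
```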


Findings — The Reliability Gap Is Real

The study evaluates 14 frontier agentic models across two complementary benchmarks: a structured customer‑service environment and an open‑ended general assistant setting.

1. Reliability Lags Capability

Over 18 months of model releases:

  • Accuracy steadily improves.
  • Overall reliability shows only modest gains.

The gap is most pronounced in open‑ended tasks.

| Trend (18 months) | Accuracy | Reliability |
|---|---|---|
| Structured tasks | Moderate ↑ | Mild ↑ |
| Open‑ended tasks | Strong ↑ | Minimal ↑ |

Scaling improves what agents can do faster than how dependably they do it.


2. Consistency Remains Weak

Even frontier models show low outcome consistency. The divergence between pass@k and strict pass∧k reveals that many “successful” agents succeed intermittently rather than reliably.
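The divergence is easy to reproduce numerically. Below is a small sketch under the usual definitions (pass@k: at least one of k runs succeeds; pass∧k: all k runs succeed), with invented per-run outcomes.

```python
import numpy as np

def pass_at_k(outcomes: np.ndarray) -> float:
    """Fraction of tasks where *at least one* of the k runs succeeds."""
    return float(outcomes.any(axis=1).mean())

def pass_all_k(outcomes: np.ndarray) -> float:
    """Fraction of tasks where *all* k runs succeed (strict pass∧k)."""
    return float(outcomes.all(axis=1).mean())

# One task per row, k = 4 runs per task (invented outcomes)
runs = np.array([
    [1, 1, 1, 1],   # reliably solved
    [1, 0, 1, 0],   # solved only intermittently
    [0, 1, 0, 0],   # mostly fails, occasionally lucky
    [0, 0, 0, 0],   # reliably failed
])
print(pass_at_k(runs))   # 0.75 -- looks capable
print(pass_all_k(runs))  # 0.25 -- far less dependable
```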

Additionally:

  • Distributional action patterns are more stable than sequence ordering.
  • Resource usage varies widely across runs—particularly in open web tasks.

Translation: agents often know what to do, but not when or how consistently to do it.


3. Robustness Shows Ceiling Effects

Under the tested perturbations, many models degrade only modestly beyond their baseline performance, though the perturbation space covered remains narrow.

The more uncomfortable insight: benchmark robustness may not reflect real‑world environmental variability.


4. Calibration Is Improving, Discrimination Is Not

Newer models are less overconfident on average.

But on complex, open‑ended tasks, agents are not significantly better at identifying which tasks they will fail.

Knowing your overall success rate is not the same as knowing when to abstain.


Implications — Automation vs. Augmentation

Reliability requirements differ sharply by deployment context.

| Use Case | Human Oversight | Reliability Threshold |
|---|---|---|
| Coding assistant | Yes | Moderate acceptable |
| Brainstorming tool | Yes | Diversity may outweigh consistency |
| Customer service automation | No | High required |
| Autonomous database agent | No | Extremely high required |

In augmentation settings, humans act as reliability buffers. In automation settings, unreliability becomes direct operational risk.

As organizations push toward greater autonomy, reliability ceases to be a secondary metric—it becomes the gating constraint.


Strategic Takeaways for Builders and Operators

  1. Separate capability dashboards from reliability dashboards. Do not let pass@k metrics dominate decision‑making.
  2. Treat safety as a constraint, not an average. Tail risk is governance risk.
  3. Track consistency across runs before deployment promotion. Reproducibility is operational trust.
  4. Invest in calibration tooling. Agents that know their limits are easier to supervise.
  5. Match reliability thresholds to autonomy level. The bar rises with independence.

Most importantly: reliability is measurable.

And once measurable, it becomes optimizable.


Conclusion — From “How Often?” to “How Dependably?”

The central shift is conceptual but profound.

Capability scaling alone will not deliver trustworthy AI agents. Reliability requires explicit measurement, independent incentives, and engineering discipline borrowed from industries that learned—often painfully—that average performance is not the same as dependable performance.

Smarter agents are arriving.

Whether they are dependable is a different question entirely.

Cognaptus: Automate the Present, Incubate the Future.