Opening — Why this matters now
AI agents are no longer toy demos. They write production code, refactor legacy systems, navigate websites, and increasingly make decisions that matter. Yet one deceptively simple question remains unresolved: can an AI agent reliably tell whether it will succeed?
This paper delivers an uncomfortable answer. Across frontier models and evaluation regimes, agents are systematically overconfident about their own success—often dramatically so. As organizations push toward longer-horizon autonomy, this blind spot becomes not just an academic curiosity, but a deployment risk.
Background — From token confidence to agentic uncertainty
Traditional uncertainty estimation in machine learning focuses on predictions: probabilities over tokens, labels, or answers. But autonomous agents operate on a different plane. Their success depends on long, multi-step trajectories involving planning, tool use, intermediate decisions, and error recovery.
The authors formalize this gap through agentic uncertainty—the probability an agent assigns to its own eventual task success. This extends earlier ideas like P(IK) (“probability that I know”) into a richer setting: P(IS), the probability that I succeed.
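In symbols (this notation is ours, sketching the idea rather than quoting the paper's exact definition), the quantity being elicited at assessment time t is:

```latex
% Illustrative notation, not taken verbatim from the paper.
% \tau is the agent's full trajectory; c_t is the context visible at assessment time t.
P(\mathrm{IS}) \;=\; \hat{p}_t \;\approx\; \Pr\!\left(\mathrm{success}(\tau) = 1 \mid c_t\right)
```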
Crucially, the same underlying model plays two roles:
- a task agent that attempts the solution, and
- an uncertainty agent that estimates success at different stages.
This design isolates overconfidence as a property of self-assessment rather than of the model's underlying task capability.
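As a rough sketch of that dual-role setup (the function and prompt wording below are ours, assuming a generic chat-completion callable rather than the paper's actual harness):

```python
# Minimal sketch of the dual-role design: one model, two prompts.
# `llm` stands in for any chat-completion call; prompts and names are illustrative.

def run_task_agent(llm, task, repo):
    """Role 1: attempt the task and return a candidate patch."""
    return llm(f"Resolve the following issue in this repository.\n\nIssue:\n{task}\n\nRepo:\n{repo}")

def estimate_success(llm, task, evidence):
    """Role 2: the same model estimates P(IS), its own probability of eventual success."""
    reply = llm(
        "On a scale from 0 to 100, how likely is it that the attempt below resolves "
        f"the issue? Answer with a single number.\n\nIssue:\n{task}\n\nEvidence:\n{evidence}"
    )
    return float(reply.strip()) / 100.0  # assumes the model answers with a bare number
```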
Analysis — Three moments of self-judgment
The study probes agentic uncertainty at three points in the lifecycle, plus one twist:
| Regime | Information Available | Oversight Question |
|---|---|---|
| Pre-execution | Task + repo (read-only) | Should we attempt this? |
| Mid-execution | Partial trajectory | Are we failing already? |
| Post-execution | Finished patch | Did we succeed? |
| Adversarial post-exec | Patch + bug-finding prompt | What could be wrong? |
All evaluations are run on SWE-bench Pro, a demanding benchmark where real success rates remain low (22–35%), even for frontier models.
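Continuing the sketch above, the four regimes differ only in what evidence the uncertainty agent is shown (the field names below are illustrative, not the paper's schema):

```python
# Evidence handed to the uncertainty agent in each regime (illustrative field names).
REGIME_EVIDENCE = {
    "pre_execution":    lambda run: run["task"] + run["repo_readonly"],
    "mid_execution":    lambda run: run["partial_trajectory"],
    "post_execution":   lambda run: run["final_patch"],
    "adversarial_post": lambda run: run["final_patch"],  # paired with a bug-hunting prompt
}
```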
Findings — Confidence without calibration
1. Pervasive overconfidence
Post-execution agents routinely predict success rates two to three times higher than reality. One model forecasts 77% success on tasks it completes only 22% of the time. Confidence distributions for successful and failed attempts are nearly indistinguishable.
The implication is stark: high confidence conveys almost no information.
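To make that concrete, here is a toy calculation (the 77% and 22% figures come from the finding above; the synthetic confidence samples are ours) showing how a large overconfidence gap can coexist with near-chance discrimination:

```python
# Toy illustration: large overconfidence gap, near-chance discrimination.
# 0.77 and 0.22 mirror the article's figures; the confidence samples are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
labels = rng.random(n) < 0.22                      # ~22% of attempts actually succeed
conf = np.clip(rng.normal(0.77, 0.10, n), 0, 1)    # ~77% predicted success, same for both groups

gap = conf.mean() - labels.mean()                  # roughly 0.77 - 0.22 = 0.55
auroc = roc_auc_score(labels, conf)                # ~0.5: confidence barely separates pass from fail
print(f"overconfidence gap ~ {gap:.2f}, AUROC ~ {auroc:.2f}")
```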
2. More context, worse judgment
Counterintuitively, pre-execution estimates discriminate success better than post-execution review, despite having less information. Seeing a plausible-looking solution appears to anchor agents into believing it works.
This undermines a common assumption in AI deployment: that verification is easier than generation.
3. Mid-execution “cold feet” don’t help
As agents progress, their confidence often declines—but this doubt is largely uninformative. Both successful and failing trajectories exhibit similar confidence drops. The agent feels less sure, but not in a way that predicts failure.
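One way to check whether mid-run doubt carries any signal is to test whether the size of the confidence drop itself discriminates outcomes. A minimal sketch, assuming you have logged per-step confidence estimates and a success flag for each run:

```python
# Does the size of the mid-run confidence drop predict failure? (sketch, assumed run schema)
from sklearn.metrics import roc_auc_score

def confidence_drop_auroc(runs):
    """runs: list of dicts with a per-step "confidence" list and a boolean "success" flag."""
    drops = [r["confidence"][0] - r["confidence"][-1] for r in runs]   # early minus latest estimate
    failed = [0 if r["success"] else 1 for r in runs]
    # An AUROC near 0.5 means the "cold feet" drop is similar for passing and failing runs.
    return roc_auc_score(failed, drops)
```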
4. Adversarial framing actually works
Prompting agents to actively search for bugs, rather than simply asking whether a solution is correct, substantially improves calibration. Overconfidence drops by up to 15 percentage points, and in some models discrimination improves as well.
This reframing shifts the agent from confirmation to falsification—a small prompt change with outsized effects.
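The two framings might look something like this (paraphrased prompts of our own, not the paper's exact wording):

```python
# Paraphrased prompt framings (illustrative, not the paper's exact prompts).
VERIFY_PROMPT = (
    "Here is a patch for the issue below. Is the patch correct? "
    "Estimate the probability (0-100) that it resolves the issue."
)

ADVERSARIAL_PROMPT = (
    "Here is a patch for the issue below. Actively search for bugs: list concrete ways it "
    "could fail (edge cases, unhandled paths, broken tests). "
    "Then estimate the probability (0-100) that it resolves the issue."
)
```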
Visualization — Calibration beats intuition
| Method | Discrimination (AUROC) | Calibration (ECE) | Overconfidence |
|---|---|---|---|
| Pre-execution | Higher | Moderate | High |
| Post-execution | Lower | Worst | Extreme |
| Adversarial post-exec | Competitive | Best | Reduced |
Notably, simple post-hoc recalibration can fix some models—but for others, adversarial prompting introduces genuinely new signal, not just a downward shift in confidence.
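A minimal sketch of the kind of post-hoc recalibration meant here, using Platt scaling on held-out confidences (the details are ours, not the paper's):

```python
# Platt scaling: fit a logistic map from raw self-reported confidences to calibrated probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_recalibrator(raw_conf, outcomes):
    """raw_conf: held-out confidences in [0, 1]; outcomes: 0/1 task success flags."""
    model = LogisticRegression()
    model.fit(np.asarray(raw_conf).reshape(-1, 1), np.asarray(outcomes))
    # Returns a function mapping new raw confidences to recalibrated success probabilities.
    return lambda c: model.predict_proba(np.asarray(c, dtype=float).reshape(-1, 1))[:, 1]
```

Because this is a monotone rescaling of the raw score, it can shrink the overconfidence gap but cannot improve AUROC; better discrimination has to come from a different signal, which is what the adversarial framing appears to supply.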
Implications — Designing safer autonomy
Three practical lessons emerge:
- Never trust agent self-confidence at face value. High confidence is not evidence of correctness.
- Front-load uncertainty checks. Pre-execution assessment is surprisingly valuable for task routing.
- Build adversarial review into the loop. Even lightweight bug-hunting prompts materially improve reliability.
For high-stakes systems, a hybrid strategy—pre-execution filtering plus adversarial post-execution review—appears far safer than naive self-verification.
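A sketch of that hybrid gate, reusing the helpers from the earlier sketches (the thresholds below are placeholders, not values reported in the paper):

```python
# Hybrid oversight sketch: pre-execution filtering plus adversarial post-execution review.
PRE_THRESHOLD = 0.30   # below this, don't attempt the task autonomously
POST_THRESHOLD = 0.60  # below this after adversarial review, hold for humans

def run_with_oversight(llm, task, repo):
    # 1. Front-loaded check: route away tasks the agent itself rates as unlikely to succeed.
    if estimate_success(llm, task, evidence=repo) < PRE_THRESHOLD:
        return {"action": "escalate", "reason": "low pre-execution confidence"}

    patch = run_task_agent(llm, task, repo)

    # 2. Adversarial review: ask the model to hunt for bugs before re-estimating success.
    post_conf = estimate_success(llm, task, evidence=ADVERSARIAL_PROMPT + "\n\n" + patch)
    if post_conf < POST_THRESHOLD:
        return {"action": "hold_for_review", "patch": patch}

    return {"action": "ship", "patch": patch}
```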
Conclusion — Confidence is cheap, calibration is not
As AI agents take on longer and riskier tasks, their inability to accurately judge their own success becomes a systemic weakness. This paper shows that overconfidence is not a bug in one model, but a structural feature of current agentic systems.
The fix is neither blind optimism nor blind trust in self-critique, but designed skepticism, embedded directly into how agents are evaluated.
Cognaptus: Automate the Present, Incubate the Future.