Opening — Why this matters now
AI agents are no longer toy demos. They write production code, refactor legacy systems, navigate websites, and increasingly make decisions that matter. Yet one deceptively simple question remains unresolved: can an AI agent reliably tell whether it will succeed?
This paper delivers an uncomfortable answer. Across frontier models and evaluation regimes, agents are systematically overconfident about their own success—often dramatically so. As organizations push toward longer-horizon autonomy, this blind spot becomes not just an academic curiosity, but a deployment risk.
Background — From token confidence to agentic uncertainty
Traditional uncertainty estimation in machine learning focuses on predictions: probabilities over tokens, labels, or answers. But autonomous agents operate on a different plane. Their success depends on long, multi-step trajectories involving planning, tool use, intermediate decisions, and error recovery.
The authors formalize this gap through agentic uncertainty—the probability an agent assigns to its own eventual task success. This extends earlier ideas like P(IK) (“probability that I know”) into a richer setting: P(IS), the probability that I succeed.
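In symbols (this notation is ours, sketching the idea rather than quoting the paper's exact definition), the quantity being elicited at assessment time t is:

```latex
% Illustrative notation, not taken verbatim from the paper.
% \tau is the agent's full trajectory; c_t is the context visible at assessment time t.
P(\mathrm{IS}) \;=\; \hat{p}_t \;\approx\; \Pr\!\left(\mathrm{success}(\tau) = 1 \mid c_t\right)
```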
Crucially, the same underlying model plays two roles:
- a task agent that attempts the solution, and
- an uncertainty agent that estimates success at different stages.
This design isolates overconfidence as a property of self-assessment rather than of the model's underlying task capability.
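As a rough sketch of that dual-role setup (the function and prompt wording below are ours, assuming a generic chat-completion callable rather than the paper's actual harness):

```python
# Minimal sketch of the dual-role design: one model, two prompts.
# `llm` stands in for any chat-completion call; prompts and names are illustrative.

def run_task_agent(llm, task, repo):
    """Role 1: attempt the task and return a candidate patch."""
    return llm(f"Resolve the following issue in this repository.\n\nIssue:\n{task}\n\nRepo:\n{repo}")

def estimate_success(llm, task, evidence):
    """Role 2: the same model estimates P(IS), its own probability of eventual success."""
    reply = llm(
        "On a scale from 0 to 100, how likely is it that the attempt below resolves "
        f"the issue? Answer with a single number.\n\nIssue:\n{task}\n\nEvidence:\n{evidence}"
    )
    return float(reply.strip()) / 100.0  # assumes the model answers with a bare number
```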
Analysis — Three moments of self-judgment
The study probes agentic uncertainty at three points in the lifecycle, plus one twist:
| Regime | Information Available | Oversight Question |
|---|---|---|
| Pre-execution | Task + repo (read-only) | Should we attempt this? |
| Mid-execution | Partial trajectory | Are we failing already? |
| Post-execution | Finished patch | Did we succeed? |
| Adversarial post-exec | Patch + bug-finding prompt | What could be wrong? |
All evaluations are run on SWE-bench Pro, a demanding benchmark where real success rates remain low (22–35%), even for frontier models.
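Continuing the sketch above, the four regimes differ only in what evidence the uncertainty agent is shown (the field names below are illustrative, not the paper's schema):

```python
# Evidence handed to the uncertainty agent in each regime (illustrative field names).
REGIME_EVIDENCE = {
    "pre_execution":    lambda run: run["task"] + run["repo_readonly"],
    "mid_execution":    lambda run: run["partial_trajectory"],
    "post_execution":   lambda run: run["final_patch"],
    "adversarial_post": lambda run: run["final_patch"],  # paired with a bug-hunting prompt
}
```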
Findings — Confidence without calibration
1. Pervasive overconfidence
Post-execution agents routinely predict success rates two to three times higher than reality. One model forecasts 77% success on tasks it completes only 22% of the time. Confidence distributions for successful and failed attempts are nearly indistinguishable.
The implication is stark: high confidence conveys almost no information.
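To make that concrete, here is a toy calculation (the 77% and 22% figures come from the finding above; the synthetic confidence samples are ours) showing how a large overconfidence gap can coexist with near-chance discrimination:

```python
# Toy illustration: large overconfidence gap, near-chance discrimination.
# 0.77 and 0.22 mirror the article's figures; the confidence samples are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
labels = rng.random(n) < 0.22                      # ~22% of attempts actually succeed
conf = np.clip(rng.normal(0.77, 0.10, n), 0, 1)    # ~77% predicted success, same for both groups

gap = conf.mean() - labels.mean()                  # roughly 0.77 - 0.22 = 0.55
auroc = roc_auc_score(labels, conf)                # ~0.5: confidence barely separates pass from fail
print(f"overconfidence gap ~ {gap:.2f}, AUROC ~ {auroc:.2f}")
```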
2. More context, worse judgment
Counterintuitively, pre-execution estimates discriminate success better than post-execution review, despite having less information. Seeing a plausible-looking solution appears to anchor agents into believing it works.
This undermines a common assumption in AI deployment: that verification is easier than generation.
3. Mid-execution “cold feet” don’t help
As agents progress, their confidence often declines—but this doubt is largely uninformative. Both successful and failing trajectories exhibit similar confidence drops. The agent feels less sure, but not in a way that predicts failure.
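One way to check whether mid-run doubt carries any signal is to test whether the size of the confidence drop itself discriminates outcomes. A minimal sketch, assuming you have logged per-step confidence estimates and a success flag for each run:

```python
# Does the size of the mid-run confidence drop predict failure? (sketch, assumed run schema)
from sklearn.metrics import roc_auc_score

def confidence_drop_auroc(runs):
    """runs: list of dicts with a per-step "confidence" list and a boolean "success" flag."""
    drops = [r["confidence"][0] - r["confidence"][-1] for r in runs]   # early minus latest estimate
    failed = [0 if r["success"] else 1 for r in runs]
    # An AUROC near 0.5 means the "cold feet" drop is similar for passing and failing runs.
    return roc_auc_score(failed, drops)
```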
4. Adversarial framing actually works
Prompting agents to actively search for bugs, rather than simply asking whether a solution is correct, substantially improves calibration. Overconfidence drops by up to 15 percentage points, and in some models discrimination improves as well.
This reframing shifts the agent from confirmation to falsification—a small prompt change with outsized effects.
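The two framings might look something like this (paraphrased prompts of our own, not the paper's exact wording):

```python
# Paraphrased prompt framings (illustrative, not the paper's exact prompts).
VERIFY_PROMPT = (
    "Here is a patch for the issue below. Is the patch correct? "
    "Estimate the probability (0-100) that it resolves the issue."
)

ADVERSARIAL_PROMPT = (
    "Here is a patch for the issue below. Actively search for bugs: list concrete ways it "
    "could fail (edge cases, unhandled paths, broken tests). "
    "Then estimate the probability (0-100) that it resolves the issue."
)
```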
Visualization — Calibration beats intuition
| Method | Discrimination (AUROC) | Calibration (ECE) | Overconfidence |
|---|---|---|---|
| Pre-execution | Higher | Moderate | High |
| Post-execution | Lower | Worst | Extreme |
| Adversarial post-exec | Competitive | Best | Reduced |
Notably, simple post-hoc recalibration can fix some models—but for others, adversarial prompting introduces genuinely new signal, not just a downward shift in confidence.
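A minimal sketch of the kind of post-hoc recalibration meant here, using Platt scaling on held-out confidences (the details are ours, not the paper's):

```python
# Platt scaling: fit a logistic map from raw self-reported confidences to calibrated probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_recalibrator(raw_conf, outcomes):
    """raw_conf: held-out confidences in [0, 1]; outcomes: 0/1 task success flags."""
    model = LogisticRegression()
    model.fit(np.asarray(raw_conf).reshape(-1, 1), np.asarray(outcomes))
    # Returns a function mapping new raw confidences to recalibrated success probabilities.
    return lambda c: model.predict_proba(np.asarray(c, dtype=float).reshape(-1, 1))[:, 1]
```

Because this is a monotone rescaling of the raw score, it can shrink the overconfidence gap but cannot improve AUROC; better discrimination has to come from a different signal, which is what the adversarial framing appears to supply.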
Implications — Designing safer autonomy
Three practical lessons emerge:
- Never trust agent self-confidence at face value. High confidence is not evidence of correctness.
- Front-load uncertainty checks. Pre-execution assessment is surprisingly valuable for task routing.
- Build adversarial review into the loop. Even lightweight bug-hunting prompts materially improve reliability.
For high-stakes systems, a hybrid strategy—pre-execution filtering plus adversarial post-execution review—appears far safer than naive self-verification.
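A sketch of that hybrid gate, reusing the helpers from the earlier sketches (the thresholds below are placeholders, not values reported in the paper):

```python
# Hybrid oversight sketch: pre-execution filtering plus adversarial post-execution review.
PRE_THRESHOLD = 0.30   # below this, don't attempt the task autonomously
POST_THRESHOLD = 0.60  # below this after adversarial review, hold for humans

def run_with_oversight(llm, task, repo):
    # 1. Front-loaded check: route away tasks the agent itself rates as unlikely to succeed.
    if estimate_success(llm, task, evidence=repo) < PRE_THRESHOLD:
        return {"action": "escalate", "reason": "low pre-execution confidence"}

    patch = run_task_agent(llm, task, repo)

    # 2. Adversarial review: ask the model to hunt for bugs before re-estimating success.
    post_conf = estimate_success(llm, task, evidence=ADVERSARIAL_PROMPT + "\n\n" + patch)
    if post_conf < POST_THRESHOLD:
        return {"action": "hold_for_review", "patch": patch}

    return {"action": "ship", "patch": patch}
```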
Conclusion — Confidence is cheap, calibration is not
As AI agents take on longer and riskier tasks, their inability to accurately judge their own success becomes a systemic weakness. This paper shows that overconfidence is not a bug in one model, but a structural feature of current agentic systems.
The fix is neither blind optimism nor blind trust in self-critique, but designed skepticism, embedded directly into how agents are evaluated.
Cognaptus: Automate the Present, Incubate the Future.