Code review has a comforting ritual.
A developer submits a patch. A reviewer inspects it. The reviewer says it looks good. Everyone feels slightly better, because at least someone checked. In AI-agent workflows, this ritual becomes even more tempting: let one agent write the patch, let another agent review it, then ask the reviewer how confident it is.
Neat. Scalable. Cheap. Also, according to a new paper on agentic uncertainty, very capable of fooling itself with excellent posture.[1]
The paper, Agentic Uncertainty Reveals Agentic Overconfidence, asks a simple question that will matter more as companies delegate longer workflows to AI agents: can an agent estimate whether another agent built on the same underlying model will succeed?
This is not ordinary confidence in a single answer. The authors call the setting agentic uncertainty, or P(IS): the probability that “I succeed.” The object being evaluated is not one token, one classification, or one factual answer. It is a multi-step trajectory: reading files, reasoning about a bug, editing code, possibly running tools, and producing a final patch.
That distinction matters. A model can sound calibrated when answering trivia and still be a terrible judge of whether its own agentic workflow has actually worked. The agent may understand the task, generate a plausible plan, and still miss the exact failure condition. Software is rude that way. Tests do not care whether the patch had a confident narrative arc.
The paper’s most useful result is not merely “agents are overconfident.” That is now almost a genre. The sharper result is comparative: where and how you ask for uncertainty changes what the confidence score means.
The authors compare several oversight regimes:
| Oversight regime | What the uncertainty agent sees | Operational question | Role in the paper |
|---|---|---|---|
| Pre-execution | Task description and repository access before any solution attempt | Should we attempt this task, route it, or escalate early? | Main evidence for task-difficulty discrimination |
| Mid-execution | Partial trajectory at 25%, 50%, and 75% of execution | Can the agent detect that work is going badly before completion? | Main evidence on early-warning signals |
| Standard post-execution | Task, repository, and proposed patch | Can an agent verify a finished solution? | Main evidence for post-hoc overconfidence |
| Adversarial post-execution | Same as post-execution, but prompted to find bugs and failure modes | Does falsification improve calibration? | Main mitigation test |
| Ensemble variants | Combinations of pre- and post-execution estimates | Can multiple weak signals be combined into a better routing signal? | Exploratory extension |
| Self-preference test | Own-model versus cross-model patches | Is overconfidence just model self-favoritism? | Ablation / alternative-explanation test |
That table is the article’s spine. The paper is not best read as a linear story from “before” to “during” to “after.” It is better read as a comparison of governance choices.
The uncomfortable lesson is this: more information does not automatically create better oversight. Seeing a finished patch may give the reviewer more context, but it can also give the reviewer something seductive to believe in.
The paper measures confidence in whole workflows, not isolated answers
The experiment uses 100 randomly sampled tasks from SWE-bench Pro, a benchmark designed around long-horizon software engineering tasks. These are not toy problems where the correct answer is a word, a number, or a multiple-choice option. The tasks require substantial code modifications: the paper reports a mean of 107 changed lines across 4.1 files.
The authors generate task-solving trajectories with three frontier models:
| Model used for task-solving and uncertainty estimation | True success rate on the 100 tasks |
|---|---|
| GPT-5.2 Codex | 35% |
| Gemini 3 Pro | 22% |
| Claude Opus 4.5 | 27% |
The uncertainty agent uses the same underlying model as the task-solving agent. That design choice is important. The paper is not asking whether a much stronger model can audit a weaker one. It is asking whether a model family can estimate the success of its own agentic work under different information conditions.
The uncertainty agents are given tool access to inspect the repository, but the setup blocks direct peeking at test outcomes. Otherwise the task would collapse into “run the tests and report what happened,” which is not uncertainty estimation. That would be measurement. Useful, yes. But not the problem businesses face when ground truth is unavailable, delayed, expensive, or ambiguous.
The paper evaluates uncertainty estimates using several metrics:
| Metric | What it asks | Business interpretation |
|---|---|---|
| AUROC | Can the estimate distinguish successful from failed tasks? | Useful for routing, prioritization, and escalation |
| ECE | Are predicted probabilities calibrated against observed outcomes? | Useful for risk thresholds and acceptance policies |
| Brier score | How accurate are probability estimates overall? | Useful when probability quality matters, not just ranking |
| Overconfidence | Mean predicted success minus true success rate | Useful for detecting dangerous optimism |
The distinction between discrimination and calibration is not academic housekeeping. A confidence score can rank tasks reasonably while being numerically inflated. In that case, it may help decide which tasks deserve human review, but it should not be interpreted as “85% chance this is correct.” The paper repeatedly forces that distinction. Good. Someone had to.
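The four metrics in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's evaluation code: the function names, the toy data, and the choice of 10 equal-width ECE bins are all assumptions. The demo applies a monotone inflation to a set of estimates, which leaves the ranking (AUROC) untouched while making calibration visibly worse.

```python
from typing import Sequence


def auroc(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Chance that a random success gets a higher estimate than a random failure (ties count half)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one success and one failure")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def overconfidence(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean predicted success minus the true success rate."""
    return sum(probs) / len(probs) - sum(labels) / len(labels)


def ece(probs: Sequence[float], labels: Sequence[int], n_bins: int = 10) -> float:
    """Expected calibration error with equal-width bins (binning scheme is an assumption)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    error = 0.0
    for b in bins:
        if b:
            mean_conf = sum(p for p, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            error += len(b) / len(probs) * abs(mean_conf - accuracy)
    return error


# Toy data, invented for illustration only.
labels = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
honest = [0.6, 0.2, 0.7, 0.3, 0.1, 0.65, 0.5, 0.2, 0.3, 0.1]
inflated = [0.5 + 0.5 * p for p in honest]  # same ranking, shifted toward optimism

for name, probs in (("honest", honest), ("inflated", inflated)):
    print(name, round(auroc(probs, labels), 3), round(ece(probs, labels), 3),
          round(brier(probs, labels), 3), round(overconfidence(probs, labels), 3))
```

The inflated scores would still be usable for ranking tasks, but reading any one of them as a literal probability would be a mistake.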
Standard post-execution review is the governance trap
The most business-relevant failure mode is the standard post-execution reviewer.
This is the workflow many companies will naturally adopt: an AI coding agent produces a patch, then another AI agent reviews the patch and estimates whether it is correct. It feels like a scalable version of human review. The problem is that the reviewer can become anchored on the patch’s surface plausibility.
The headline numbers are not subtle:
| Model | True success rate | Standard post-execution mean estimate | Overconfidence gap |
|---|---|---|---|
| GPT-5.2 Codex | 35% | 73% | +39 pp |
| Gemini 3 Pro | 22% | 77% | +55 pp |
| Claude Opus 4.5 | 27% | 61% | +34 pp |
Gemini is the most extreme case. The paper reports that its estimates cluster near 100% in some settings, leaving almost no room to distinguish easy tasks from hard ones. That is not merely bad calibration. It is a kind of probability theater: everything is nearly certain, including the failures.
The paper also reports an asymmetric error pattern. Across models and methods, 62% of predictions on failing instances are overconfident, while only 11% of predictions on passing instances are underconfident. In plainer language: agents are much more likely to confidently bless a bad outcome than to unfairly doubt a good one.
That asymmetry is exactly what makes naive agent review dangerous in production. A conservative reviewer that sometimes blocks good work is annoying. An optimistic reviewer that waves through broken work is operationally expensive. In code, it means regressions. In finance, it means bad signals with polished explanations. In compliance, it means an audit trail that looks sensible until someone asks whether it was true.
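The asymmetry can be measured with two simple rates. The sketch below assumes that "overconfident on a failure" means an estimate above 0.5 and "underconfident on a success" means an estimate below 0.5; the paper's exact definition may differ, and both the function name and the toy data are invented for illustration.

```python
def asymmetric_error_rates(probs, labels, threshold=0.5):
    """Return (share of failures rated above threshold, share of successes rated below it).

    The 0.5 threshold is an assumption, not necessarily the paper's definition.
    """
    failures = [p for p, y in zip(probs, labels) if y == 0]
    successes = [p for p, y in zip(probs, labels) if y == 1]
    if not failures or not successes:
        raise ValueError("need at least one failure and one success")
    over_on_fail = sum(p > threshold for p in failures) / len(failures)
    under_on_pass = sum(p < threshold for p in successes) / len(successes)
    return over_on_fail, under_on_pass


# Toy data, invented for illustration only.
probs = [0.9, 0.8, 0.7, 0.3, 0.85, 0.6, 0.4, 0.95]
labels = [1, 0, 0, 0, 1, 0, 0, 1]

over, under = asymmetric_error_rates(probs, labels)
print(f"overconfident on failures: {over:.0%}, underconfident on successes: {under:.0%}")
```

Tracking the two rates separately matters because a single averaged error figure would hide which direction the reviewer fails in.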
The paper’s post-execution result therefore has a clear business meaning:
| What the paper directly shows | Cognaptus interpretation | Boundary |
|---|---|---|
| Standard post-execution agents overestimate success across all three models | A same-family AI reviewer should not be treated as an acceptance authority | Tested on coding tasks with objective pass/fail labels |
| Confidence distributions overlap heavily between successes and failures | High confidence alone may carry little signal about correctness | Signal quality varies by model and elicitation regime |
| Overconfidence is stronger on failures than underconfidence on successes | The risk is not merely inefficiency; it is false clearance | Human or external verification remains necessary for high-stakes work |
The seductive part of post-execution review is that it sees the patch. The dangerous part is also that it sees the patch.
A plausible patch gives the reviewer a story: the change is focused, the implementation resembles nearby code, the explanation sounds coherent. The reviewer then evaluates the story rather than the hidden failure condition. This is how a governance layer becomes a rubber stamp with better grammar.
Pre-execution has less information but often better discrimination
The paper’s counterintuitive finding is that pre-execution assessment often separates likely successes from failures better than standard post-execution review.
Pre-execution agents see the task description and repository, but not a generated patch. They cannot execute code or modify files. They must judge difficulty abstractly: how complex is the codebase, how specific is the issue, how likely is the fix to depend on obscure framework behavior, and whether the change requires touching fragile areas.
Despite having strictly less information, pre-execution achieves better AUROC than standard post-execution for all three models:
| Model | Pre-execution AUROC | Standard post-execution AUROC |
|---|---|---|
| GPT-5.2 Codex | 0.62 | 0.58 |
| Gemini 3 Pro | 0.53 | 0.51 |
| Claude Opus 4.5 | 0.64 | 0.55 |
The paper is careful here. The bootstrap confidence intervals are wide. For GPT, pre-execution AUROC is 0.62 with a 95% interval of [0.50, 0.75], while post-execution is 0.58 with [0.46, 0.70]. For Claude, pre-execution is 0.64 [0.49, 0.78], while post-execution is 0.55 [0.42, 0.67]. The individual pairwise differences are not statistically significant.
So the right reading is not “pre-execution definitely beats post-execution.” The better reading is: the consistent direction across all three models is suggestive, and it undermines the lazy assumption that a finished patch must make review more reliable.
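To see why intervals this wide are expected with 100 tasks, here is a minimal percentile-bootstrap sketch. It is not the paper's procedure: the resampling scheme, the number of resamples, and the synthetic data (35 successes out of 100, with a weak signal) are assumptions chosen only to show the mechanics.

```python
import random


def auroc(probs, labels):
    """Chance that a random success outranks a random failure (ties count half)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(probs, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC, resampling tasks with replacement."""
    rng = random.Random(seed)
    n = len(probs)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc([probs[i] for i in idx], ys))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic stand-in data: 100 tasks, 35 successes, estimates only weakly related to outcome.
data_rng = random.Random(42)
labels = [1] * 35 + [0] * 65
probs = [min(1.0, max(0.0, data_rng.gauss(0.60 + 0.08 * y, 0.15))) for y in labels]

point = auroc(probs, labels)
lo, hi = bootstrap_auroc_ci(probs, labels)
print(f"AUROC {point:.2f}, 95% interval [{lo:.2f}, {hi:.2f}]")
```

With so few successes in the sample, small AUROC differences between regimes can easily sit inside each other's intervals, which is why the direction of the effect is more informative here than any single pairwise gap.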
The mechanism is plausible. Before the patch exists, the uncertainty agent must reason about the task itself. After the patch exists, the agent can be pulled toward the patch’s apparent coherence. That is not more knowledge in the useful sense. It is more material for motivated agreement.
The paper gives an example involving a qutebrowser task requiring exposure of a new QtWebEngine dark-mode policy. The pre-execution agent assigned a lower success probability because it recognized that the task depended on knowing an exact Chromium key or enum. After seeing a plausible patch, the post-execution agent became much more confident. The patch failed.
This is the core misconception worth retiring: verification is not automatically easier than generation when the verifier is vulnerable to the same representational blind spots as the generator.
For business use, pre-execution confidence is most valuable as a routing signal:
| Use case | How pre-execution estimates help | What not to do |
|---|---|---|
| Task triage | Identify tasks likely to waste agent budget or require human context | Treat the score as a guarantee of failure or success |
| Workflow selection | Route low-risk tasks to automation and high-risk tasks to expert review | Use one universal threshold across domains |
| Budget control | Avoid spending long agent runs on tasks that look structurally fragile | Assume the estimate is well-calibrated numerically |
| SLA planning | Flag tasks likely to need longer review cycles | Replace final verification |
Pre-execution is not magic. Its AUROC values are modest. Gemini’s pre-execution behavior is especially weak because the estimates are too compressed at the optimistic end. But pre-execution has one virtue that post-execution often lacks: it is less exposed to the “this patch looks reasonable” trap.
Sometimes ignorance is not wisdom. Sometimes it is just fewer opportunities to be impressed by nonsense.
Mid-execution doubt is not yet a reliable early-warning system
Mid-execution monitoring sounds like the most operationally attractive regime. If an agent is going off track, why wait until the end? Ask another agent to monitor progress at 25%, 50%, and 75% of the trajectory. If confidence drops for doomed tasks, stop early or escalate.
The paper tests exactly this idea. The result is more awkward than the product slide would prefer.
| Model | Checkpoint | AUROC | Mean estimate | Overconfidence | ECE |
|---|---|---|---|---|---|
| GPT-5.2 Codex | 25% | 0.53 | 0.67 | +0.32 | 0.32 |
| GPT-5.2 Codex | 50% | 0.51 | 0.63 | +0.28 | 0.32 |
| GPT-5.2 Codex | 75% | 0.53 | 0.47 | +0.12 | 0.19 |
| Gemini 3 Pro | 25% | 0.49 | 0.87 | +0.65 | 0.65 |
| Gemini 3 Pro | 50% | 0.64 | 0.80 | +0.58 | 0.58 |
| Gemini 3 Pro | 75% | 0.64 | 0.67 | +0.45 | 0.54 |
| Claude Opus 4.5 | 25% | 0.62 | 0.58 | +0.31 | 0.31 |
| Claude Opus 4.5 | 50% | 0.52 | 0.37 | +0.10 | 0.19 |
| Claude Opus 4.5 | 75% | 0.52 | 0.17 | -0.10 | 0.21 |
The authors call the central pattern “cold feet.” Confidence decreases as execution progresses for 71% of GPT instances and 97% of Claude instances. That sounds promising until the important part arrives: the doubt is often uninformative. Confidence drops for both successful and failed trajectories, and the success and failure curves track closely.
This matters because a mid-execution monitor needs more than emotional movement. It needs differential movement. If confidence falls for everything, the system has detected fatigue, complexity, or late-stage anxiety, not necessarily impending failure.
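One way to check for differential movement is to compare the average confidence drop on trajectories that eventually succeed against the drop on those that fail. The sketch below is illustrative only: the function name, the tuple layout, and the toy trajectories are assumptions, not the paper's analysis.

```python
from statistics import mean


def confidence_drop_by_outcome(trajectories):
    """trajectories: iterable of (confidence_at_25pct, confidence_at_75pct, succeeded).

    Returns (mean drop on successes, mean drop on failures).
    """
    drops = {True: [], False: []}
    for early, late, succeeded in trajectories:
        drops[bool(succeeded)].append(early - late)
    return mean(drops[True]), mean(drops[False])


# Toy trajectories, invented for illustration only.
trajectories = [
    (0.70, 0.50, True),
    (0.60, 0.40, True),
    (0.70, 0.45, False),
    (0.65, 0.45, False),
    (0.60, 0.40, False),
]

drop_pass, drop_fail = confidence_drop_by_outcome(trajectories)
differential = drop_fail - drop_pass
print(f"drop on successes {drop_pass:.2f}, drop on failures {drop_fail:.2f}, "
      f"differential {differential:.2f}")
```

In this toy case confidence falls by roughly the same amount either way, so the drop says almost nothing about which runs will fail. A usable early-warning signal would need the differential to be clearly positive.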
There is one partial exception. The paper reports that Claude’s confidence drops correlate weakly with outcome, while GPT and Gemini do not show significant relationships. That is useful, but not enough to justify a general early-stopping policy.
A business should therefore treat mid-execution uncertainty as diagnostic telemetry, not as an autonomous kill switch.
| What mid-execution monitoring can support | What the paper does not yet support |
|---|---|
| Collecting traces for later analysis | Automatically stopping tasks based only on falling confidence |
| Identifying moments where agents become uncertain | Assuming a decline in confidence means failure |
| Triggering softer escalation, such as requesting more tests or a second reviewer | Treating mid-run confidence as a calibrated probability |
| Studying process-level failure modes | Claiming process monitoring is already solved |
The practical lesson is that process supervision needs better signals than “the agent seems less confident now.” Anyone who has watched a model debug its own mess already knows this feeling. The agent becomes solemn, says the issue is subtle, and continues directly into a wall.
Adversarial review helps because it changes the job, not because it adds new evidence
The strongest mitigation in the paper is adversarial post-execution review.
The adversarial reviewer sees the same kind of information as the standard post-execution reviewer: task, repository, and patch. The difference is the prompt. Instead of asking whether the patch is correct, the agent is instructed to find bugs, edge cases, failure modes, and gaps before estimating success.
That shift matters. Standard review invites confirmation. Adversarial review invites falsification.
The calibration improvements are meaningful:
| Model | Standard post-execution ECE | Adversarial post-execution ECE | Interpretation |
|---|---|---|---|
| GPT-5.2 Codex | 0.42 | 0.30 | Better calibration, but mostly from lowering estimates |
| Gemini 3 Pro | 0.66 | 0.53 | Better calibration and some discrimination improvement |
| Claude Opus 4.5 | 0.37 | 0.24 | Best calibration and better discrimination |
The paper also reports that adversarial post-execution reduces the overconfident-failure rate from 72% under standard review to 45%. That is still not safe enough for blind acceptance, but it is a real improvement.
Discrimination is more mixed:
| Model | Standard post-execution AUROC | Adversarial post-execution AUROC |
|---|---|---|
| GPT-5.2 Codex | 0.58 | 0.55 |
| Gemini 3 Pro | 0.51 | 0.57 |
| Claude Opus 4.5 | 0.55 | 0.64 |
This is where interpretation needs discipline. Adversarial prompting is not a universal signal amplifier. For GPT, the paper finds that the adversarial shift is nearly identical for passing and failing instances: confidence drops, but not in a way that better separates success from failure. In fact, post-hoc Platt scaling of standard post-execution predictions achieves better calibration for GPT than adversarial prompting, suggesting that adversarial review added little new discriminative signal in that case.
For Gemini and Claude, the shift is larger on failing instances than on passing ones. That means adversarial review does more than lower confidence across the board; it surfaces some genuine failure signal. The paper notes that these pairwise differences are not individually significant, so the result should be read as suggestive rather than final.
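The distinction between a uniform shift and a selective shift can be made concrete by splitting the confidence change by outcome. This is a hedged sketch: the function name and both toy scenarios are invented, and the numbers are not taken from the paper.

```python
from statistics import mean


def shift_by_outcome(standard, adversarial, labels):
    """Mean drop in estimate (standard minus adversarial) on passing and on failing instances."""
    pass_shifts = [s - a for s, a, y in zip(standard, adversarial, labels) if y == 1]
    fail_shifts = [s - a for s, a, y in zip(standard, adversarial, labels) if y == 0]
    return mean(pass_shifts), mean(fail_shifts)


# Toy data, invented for illustration only.
labels = [1, 1, 0, 0, 0]
standard = [0.8, 0.7, 0.8, 0.7, 0.6]

# Scenario A: every estimate drops by the same amount (better calibration, no new signal).
uniform_adv = [0.6, 0.5, 0.6, 0.5, 0.4]
# Scenario B: failures drop more than successes (the review found real problems).
selective_adv = [0.7, 0.6, 0.4, 0.3, 0.3]

uniform_pass, uniform_fail = shift_by_outcome(standard, uniform_adv, labels)
selective_pass, selective_fail = shift_by_outcome(standard, selective_adv, labels)
print(f"uniform:   pass {uniform_pass:.2f}, fail {uniform_fail:.2f}")
print(f"selective: pass {selective_pass:.2f}, fail {selective_fail:.2f}")
```

Scenario A is the pattern a simple post-hoc recalibration could reproduce without any extra review steps. Scenario B is the one that justifies paying for the adversarial pass.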
The OpenLibrary example in the paper is a useful miniature. A standard reviewer sees a small, plausible addition and assigns high confidence. The adversarial reviewer inspects the surrounding output-shaping logic and notices that the added field still may not appear in the final response. The patch fails. The difference is not mystical reasoning ability. It is attention allocation.
In a production setting, this distinction matters:
| Review framing | Likely reviewer behavior | Operational risk |
|---|---|---|
| “Is this patch correct?” | Looks for coherence, style, and local consistency | High risk of rubber-stamping plausible patches |
| “Find bugs and failure modes.” | Searches for missing paths, edge cases, and integration gaps | Lower risk, but still overconfident |
| “Estimate success after adversarial search.” | Produces a more skeptical probability | Useful for escalation, not automatic acceptance |
Adversarial review is also more expensive. The paper reports 23.4 steps and $0.52 per instance versus 12.7 steps and $0.23 for standard post-execution review. This is not a scandal. Better review costs more. Shocking discovery. Someone notify procurement.
The real question is where to spend that cost. The answer is not “run adversarial review on everything.” The answer is to use cheaper pre-execution routing to identify tasks that deserve more expensive falsification.
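A back-of-the-envelope calculation shows what selective escalation saves. The per-instance costs below are the figures reported above; everything else is an assumption, including the function name, the 30% escalation rate, and the simplification that non-escalated tasks still receive standard review. The cost of the pre-execution pass itself is not included.

```python
STANDARD_COST_USD = 0.23      # per instance, as reported for standard post-execution review
ADVERSARIAL_COST_USD = 0.52   # per instance, as reported for adversarial review


def review_cost(n_tasks: int, adversarial_fraction: float) -> float:
    """Total review cost when a fraction of tasks gets adversarial review and the rest standard.

    Assumes every task is reviewed exactly once; pre-execution routing cost is excluded.
    """
    if not 0.0 <= adversarial_fraction <= 1.0:
        raise ValueError("adversarial_fraction must be between 0 and 1")
    per_task = (adversarial_fraction * ADVERSARIAL_COST_USD
                + (1.0 - adversarial_fraction) * STANDARD_COST_USD)
    return round(n_tasks * per_task, 2)


print(review_cost(100, 1.0))  # adversarial review on everything: 52.0
print(review_cost(100, 0.0))  # standard review on everything: 23.0
print(review_cost(100, 0.3))  # hypothetical 30% escalation rate: 31.7
```

The absolute amounts are small at this scale. The point is the ratio, which carries over to larger volumes and pricier models.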
Ensembles help only when they match the decision problem
The paper also evaluates simple ensembles of pre- and post-execution estimates: averaging, conservative selection, and aggressive selection.
The results are unsurprising in a useful way. A conservative ensemble, which takes the lower of the pre- and post-execution estimates, improves calibration over standard post-execution review for GPT and Claude. The aggressive ensemble can improve AUROC for GPT and Claude, but at the cost of worse overconfidence and calibration.
| Ensemble choice | What it trusts | When it may be useful | Main risk |
|---|---|---|---|
| Average | Compromise between difficulty estimate and patch review | General routing when no single signal dominates | Can preserve shared optimism |
| Conservative minimum | Trust the more skeptical estimate | Risk control, escalation, compliance-sensitive workflows | May over-escalate good work |
| Aggressive maximum | Trust the more optimistic estimate | Finding candidates worth attempting when false positives are cheap | Dangerous for acceptance decisions |
For businesses, this is the governance version of choosing a loss function. If the cost of false acceptance is high, the conservative estimate is more rational. If the cost of false rejection is high and human review is abundant, a more aggressive signal might be acceptable. There is no neutral confidence aggregation rule. There is only a rule whose errors you have decided to tolerate.
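The three aggregation rules are simple enough to write down directly. This sketch uses hypothetical names and toy estimates; it shows the mechanics, not the paper's implementation.

```python
def ensemble(pre, post, mode):
    """Combine per-task pre- and post-execution estimates.

    mode: "average", "conservative" (take the minimum), or "aggressive" (take the maximum).
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    if mode == "average":
        return [(a + b) / 2 for a, b in zip(pre, post)]
    if mode == "conservative":
        return [min(a, b) for a, b in zip(pre, post)]
    if mode == "aggressive":
        return [max(a, b) for a, b in zip(pre, post)]
    raise ValueError(f"unknown mode: {mode}")


# Toy estimates, invented for illustration only.
pre_estimates = [0.4, 0.8]
post_estimates = [0.9, 0.6]

for mode in ("average", "conservative", "aggressive"):
    print(mode, ensemble(pre_estimates, post_estimates, mode))
```

On the first toy task, the conservative rule reports 0.4 and the aggressive rule reports 0.9 from the same two inputs. Which number reaches the decision-maker is a policy choice, not a statistical one.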
The paper’s ensemble section is best read as an exploratory extension, not as a deployment recipe. The broader message is that pre-execution and post-execution estimates contain different kinds of information and fail in different ways. Combining them can help, but only if the business knows whether it is optimizing cost, speed, quality, risk, or auditability.
“Use an ensemble” is not a strategy. It is a way to postpone saying what you are afraid of.
Self-preference is not the main explanation
One possible explanation for agentic overconfidence is self-preference. Perhaps models favor their own patches because the style, reasoning pattern, or implementation approach feels familiar. The paper tests this with a smaller ablation: judges estimate success on own-model patches and cross-family patches.
The result does not rescue the simple story.
| Judge | Patch source | Mean estimate | AUROC | ECE | Brier |
|---|---|---|---|---|---|
| GPT | GPT self | 0.74 | 0.55 | 0.38 | 0.36 |
| GPT | Gemini cross | 0.51 | 0.39 | 0.44 | 0.37 |
| Gemini | Gemini self | 0.72 | 0.57 | 0.56 | 0.59 |
| Gemini | GPT cross | 0.91 | 0.55 | 0.56 | 0.55 |
GPT does show self-preference: it assigns higher estimates to its own patches. Gemini shows the opposite pattern, assigning even higher estimates to GPT patches than to its own. Yet overconfidence appears regardless of direction.
So self-preference may be part of the story for some models, but it cannot explain the main phenomenon. The deeper issue is broader: agentic reviewers are bad at mapping plausible-looking work to actual task success.
That is a useful ablation because it prevents a comfortable but narrow conclusion. The fix is not simply “use a different model as the judge.” Cross-model review may help in some cases, especially when model errors are diverse, but the paper does not show that model diversity alone solves overconfidence.
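The direction of the self-preference effect can be read straight off the mean estimates in the ablation table. The snippet below only restates that arithmetic; the dictionary layout and function name are illustrative.

```python
# Mean estimates from the self-preference ablation table above.
mean_estimates = {
    ("GPT", "self"): 0.74,
    ("GPT", "cross"): 0.51,
    ("Gemini", "self"): 0.72,
    ("Gemini", "cross"): 0.91,
}


def self_preference_gap(judge: str) -> float:
    """Mean estimate on own patches minus mean estimate on the other model's patches."""
    return round(mean_estimates[(judge, "self")] - mean_estimates[(judge, "cross")], 2)


for judge in ("GPT", "Gemini"):
    print(judge, self_preference_gap(judge))
```

A positive gap for GPT and a negative gap for Gemini means the two judges lean in opposite directions, so favoritism toward one's own output cannot be the common cause.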
A practical governance design: route early, falsify late, escalate disagreement
The paper’s business value is not that it gives companies a ready-made confidence score. It does almost the opposite. It shows why confidence scores must be embedded inside a decision protocol.
A reasonable agentic workflow inspired by the paper would look like this:
Task arrives
↓
Pre-execution uncertainty estimate
↓
Route:
- low-risk / routine → agent attempt
- high-risk / unclear → human-assisted setup or narrower task decomposition
↓
Agent produces output
↓
Adversarial post-execution review
↓
Decision:
- strong evidence + low stakes → proceed with monitoring
- disagreement or low adversarial confidence → human review
- high stakes → external verification regardless of confidence
The key is to separate the jobs.
| Governance job | Better-suited signal | Why |
|---|---|---|
| Decide whether to attempt a task | Pre-execution estimate | Less exposed to patch anchoring |
| Decide whether to spend more review budget | Disagreement between pre- and post-execution estimates | Disagreement identifies uncertainty worth investigating |
| Decide whether to accept an output | Adversarial post-execution plus external checks | Falsification is better than confirmatory review |
| Decide whether to remove humans | Not supported by this paper | Overconfidence remains substantial |
This architecture is not glamorous. It does not say, “The agent knows when it is right.” It says, “The agent sometimes knows enough to tell us where the next review dollar should go.” That is less magical and more useful.
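The protocol above can be sketched as two small decision functions. Every threshold here is a hypothetical placeholder that a team would need to tune against its own outcome data, and the names are invented for illustration; the paper does not prescribe any of these values.

```python
from enum import Enum


class Route(Enum):
    AGENT_ATTEMPT = "agent attempt"
    HUMAN_ASSISTED_SETUP = "human-assisted setup or narrower decomposition"


class Decision(Enum):
    PROCEED_WITH_MONITORING = "proceed with monitoring"
    HUMAN_REVIEW = "human review"
    EXTERNAL_VERIFICATION = "external verification"


def route_before_execution(pre_estimate: float, attempt_threshold: float = 0.5) -> Route:
    """Use the pre-execution estimate for triage only (threshold is hypothetical)."""
    if pre_estimate >= attempt_threshold:
        return Route.AGENT_ATTEMPT
    return Route.HUMAN_ASSISTED_SETUP


def decide_after_execution(
    pre_estimate: float,
    adversarial_estimate: float,
    high_stakes: bool,
    accept_threshold: float = 0.7,        # hypothetical
    disagreement_threshold: float = 0.3,  # hypothetical
) -> Decision:
    """Combine stakes, adversarial confidence, and pre/post disagreement into one decision."""
    if high_stakes:
        # High-stakes work is verified externally regardless of confidence.
        return Decision.EXTERNAL_VERIFICATION
    disagreement = abs(pre_estimate - adversarial_estimate)
    if disagreement >= disagreement_threshold or adversarial_estimate < accept_threshold:
        return Decision.HUMAN_REVIEW
    return Decision.PROCEED_WITH_MONITORING


print(route_before_execution(0.35).value)
print(decide_after_execution(0.80, 0.85, high_stakes=False).value)
print(decide_after_execution(0.30, 0.90, high_stakes=False).value)
print(decide_after_execution(0.80, 0.90, high_stakes=True).value)
```

Note that no branch returns "accept without oversight". Even the most permissive outcome keeps monitoring in place, which matches the evidence that overconfidence remains substantial.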
For AI coding tools, the immediate implication is clear: do not expose a confidence percentage as if it were a quality guarantee. Use it as a workflow feature: task triage, review prioritization, escalation, and audit logging.
For business-process automation, the analogy is broader. Many agentic workflows do not produce testable code patches. They produce reports, reconciliations, emails, compliance summaries, CRM updates, or procurement recommendations. In those settings, ground truth may be delayed or subjective. That makes overconfidence more dangerous, not less, because failure is harder to detect.
The paper’s direct evidence is coding-specific. The business inference is that any agentic workflow with multi-step hidden failure modes should avoid naive self-certification.
| Domain | Similar failure mode | Safer interpretation of agent confidence |
|---|---|---|
| Software engineering | Plausible patch fails tests | Use confidence for review routing, not merge approval |
| Finance research | Polished thesis rests on bad data join | Use confidence to trigger data lineage checks |
| Operations automation | Agent completes process but misses exception case | Use adversarial review on exception handling |
| Compliance summaries | Report sounds complete but omits material clause | Require source-grounded verification |
| Customer support automation | Response is fluent but policy-inconsistent | Escalate when confidence conflicts with policy complexity |
The general principle is simple: confidence is an input to governance, not a substitute for governance.
The boundaries are narrow enough to matter
The paper’s limitations are not decorative. They affect how the result should be used.
First, the experiments are on software engineering tasks. This is a strength because tests provide objective success labels. It is also a boundary because other domains have fuzzier definitions of success. In creative work, strategy, or legal interpretation, calibration itself may be harder to define.
Second, the sample size is 100 tasks. That is enough to reveal a broad overconfidence pattern, but some per-model comparisons have wide confidence intervals. The paper is careful about this, especially when discussing pre-execution versus post-execution discrimination and adversarial shift decomposition.
Third, the uncertainty agents are prompt-based, not trained verifiers. A verifier trained specifically to detect agentic failure patterns might perform better. That remains an open path rather than a result.
Fourth, the study uses same-family model assessment. It does not establish what happens when a stronger model reviews a weaker model, when specialized tools are added, or when independent static analysis and test generation are integrated into the review loop.
Finally, the cost numbers are specific to the experimental setup. The relative point is still useful: adversarial review uses more steps and costs more than standard review. But production economics will depend on model pricing, task complexity, and the cost of failure.
These boundaries do not weaken the article’s main business lesson. They define it. The result is not “never use agent confidence.” The result is: do not confuse a confidence estimate with a control system.
The real cost of overconfidence is false delegation
Agentic overconfidence is not just a calibration problem. It is a delegation problem.
When a company asks an AI agent, “Are you likely to succeed?” it is often asking a hidden second question: “Can I stop paying attention?” The paper’s answer is mostly no. At least, not from the agent’s own confidence estimate, not yet.
The useful path is more modest and more operational. Ask before execution to route work. Ask adversarially after execution to find failure modes. Use disagreement as a reason to escalate. Keep external verification for high-stakes decisions. Treat ordinary post-execution approval with suspicion, especially when the patch looks beautifully boring.
That last part is where many failures hide. The obviously broken output is easy to catch. The dangerous one is clean, local, idiomatic, and wrong.
AI agents are getting better at doing work. This paper reminds us that they are also getting better at sounding certain that the work is done. Those are not the same capability. A serious automation strategy must price the difference.
Cognaptus: Automate the Present, Incubate the Future.
[1] Jean Kaddour, Srijan Patel, Gbètondji Dovonon, Leo Richter, Pasquale Minervini, and Matt J. Kusner, “Agentic Uncertainty Reveals Agentic Overconfidence,” arXiv:2602.06948, 2026, https://arxiv.org/abs/2602.06948.