Code review has a comforting ritual.

A developer submits a patch. A reviewer inspects it. The reviewer says it looks good. Everyone feels slightly better, because at least someone checked. In AI-agent workflows, this ritual becomes even more tempting: let one agent write the patch, let another agent review it, then ask the reviewer how confident it is.

Neat. Scalable. Cheap. Also, according to a new paper on agentic uncertainty, very capable of fooling itself with excellent posture.[1]

The paper, “Agentic Uncertainty Reveals Agentic Overconfidence,” asks a simple question that will matter more as companies delegate longer workflows to AI agents: can an agent estimate whether another agent built on the same underlying model will succeed?

This is not ordinary confidence in a single answer. The authors call the setting agentic uncertainty, or P(IS): the probability that “I succeed.” The object being evaluated is not one token, one classification, or one factual answer. It is a multi-step trajectory: reading files, reasoning about a bug, editing code, possibly running tools, and producing a final patch.

That distinction matters. A model can sound calibrated when answering trivia and still be a terrible judge of whether its own agentic workflow has actually worked. The agent may understand the task, generate a plausible plan, and still miss the exact failure condition. Software is rude that way. Tests do not care whether the patch had a confident narrative arc.

The paper’s most useful result is not merely “agents are overconfident.” That is now almost a genre. The sharper result is comparative: where and how you ask for uncertainty changes what the confidence score means.

The authors compare several oversight regimes:

| Oversight regime | What the uncertainty agent sees | Operational question | Likely role in the paper |
|---|---|---|---|
| Pre-execution | Task description and repository access before any solution attempt | Should we attempt this task, route it, or escalate early? | Main evidence for task-difficulty discrimination |
| Mid-execution | Partial trajectory at 25%, 50%, and 75% of execution | Can the agent detect that work is going badly before completion? | Main evidence on early-warning signals |
| Standard post-execution | Task, repository, and proposed patch | Can an agent verify a finished solution? | Main evidence for post-hoc overconfidence |
| Adversarial post-execution | Same as post-execution, but prompted to find bugs and failure modes | Does falsification improve calibration? | Main mitigation test |
| Ensemble variants | Combinations of pre- and post-execution estimates | Can multiple weak signals be combined into a better routing signal? | Exploratory extension |
| Self-preference test | Own-model versus cross-model patches | Is overconfidence just model self-favoritism? | Ablation / alternative-explanation test |

That table is the article’s spine. The paper is not best read as a linear story from “before” to “during” to “after.” It is better read as a comparison of governance choices.

The uncomfortable lesson is this: more information does not automatically create better oversight. Seeing a finished patch may give the reviewer more context, but it can also give the reviewer something seductive to believe in.

The paper measures confidence in whole workflows, not isolated answers

The experiment uses 100 randomly sampled tasks from SWE-bench Pro, a benchmark designed around long-horizon software engineering tasks. These are not toy problems where the correct answer is a word, a number, or a multiple-choice option. The tasks require substantial code modifications: the paper reports a mean of 107 changed lines across 4.1 files.

The authors generate task-solving trajectories with three frontier models:

| Model used for task-solving and uncertainty estimation | True success rate on the 100 tasks |
|---|---|
| GPT-5.2 Codex | 35% |
| Gemini 3 Pro | 22% |
| Claude Opus 4.5 | 27% |

The uncertainty agent uses the same underlying model as the task-solving agent. That design choice is important. The paper is not asking whether a much stronger model can audit a weaker one. It is asking whether a model family can estimate the success of its own agentic work under different information conditions.

The uncertainty agents are given tool access to inspect the repository, but the setup blocks direct peeking at test outcomes. Otherwise the task would collapse into “run the tests and report what happened,” which is not uncertainty estimation. That would be measurement. Useful, yes. But not the problem businesses face when ground truth is unavailable, delayed, expensive, or ambiguous.

The paper evaluates uncertainty estimates using several metrics:

| Metric | What it asks | Business interpretation |
|---|---|---|
| AUROC | Can the estimate distinguish successful from failed tasks? | Useful for routing, prioritization, and escalation |
| ECE | Are predicted probabilities calibrated against observed outcomes? | Useful for risk thresholds and acceptance policies |
| Brier score | How accurate are probability estimates overall? | Useful when probability quality matters, not just ranking |
| Overconfidence | Mean predicted success minus true success rate | Useful for detecting dangerous optimism |

The distinction between discrimination and calibration is not academic housekeeping. A confidence score can rank tasks reasonably while being numerically inflated. In that case, it may help decide which tasks deserve human review, but it should not be interpreted as “85% chance this is correct.” The paper repeatedly forces that distinction. Good. Someone had to.
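To make the four metrics concrete, here is a minimal pure-Python sketch. The function names, the equal-width 10-bin ECE, and the toy data are illustrative assumptions; this is not the paper's evaluation code, and the article does not say how the paper bins its ECE.

```python
from typing import Sequence


def auroc(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Chance that a random success is scored above a random failure (ties count half)."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one success and one failure")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def overconfidence(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean predicted success minus observed success rate."""
    return sum(probs) / len(probs) - sum(outcomes) / len(outcomes)


def ece(probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10) -> float:
    """Expected calibration error with equal-width bins (binning is an assumption)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    error = 0.0
    for b in bins:
        if b:
            mean_conf = sum(p for p, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            error += len(b) / len(probs) * abs(mean_conf - accuracy)
    return error


# Hypothetical toy data: the ranking is perfect, but every score is inflated.
toy_probs = [0.95, 0.92, 0.85, 0.82]
toy_outcomes = [1, 1, 0, 0]
print(auroc(toy_probs, toy_outcomes), overconfidence(toy_probs, toy_outcomes))
```

On this toy input the scores rank perfectly (AUROC of 1.0) while the mean estimate still exceeds the success rate by 0.385. That is the discrimination-versus-calibration split in miniature: fine for ordering a review queue, useless as a literal probability.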

Standard post-execution review is the governance trap

The most business-relevant failure mode is the standard post-execution reviewer.

This is the workflow many companies will naturally adopt: an AI coding agent produces a patch, then another AI agent reviews the patch and estimates whether it is correct. It feels like a scalable version of human review. The problem is that the reviewer can become anchored on the patch’s surface plausibility.

The headline numbers are not subtle:

| Model | True success rate | Standard post-execution mean estimate | Overconfidence gap |
|---|---|---|---|
| GPT-5.2 Codex | 35% | 73% | +39 pp |
| Gemini 3 Pro | 22% | 77% | +55 pp |
| Claude Opus 4.5 | 27% | 61% | +34 pp |

Gemini is the most extreme case. The paper reports that its estimates cluster near 100% in some settings, leaving almost no room to distinguish easy tasks from hard ones. That is not merely bad calibration. It is a kind of probability theater: everything is nearly certain, including the failures.

The paper also reports an asymmetric error pattern. Across models and methods, 62% of predictions on failing instances are overconfident, while only 11% of predictions on passing instances are underconfident. In plainer language: agents are much more likely to confidently bless a bad outcome than to unfairly doubt a good one.
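The asymmetry can be measured on any set of labeled predictions. The sketch below is a rough illustration only: the article does not state the cutoff the paper uses to call a prediction "overconfident" or "underconfident," so the 0.5 threshold and the function name are assumptions.

```python
from typing import Sequence, Tuple


def asymmetric_error_rates(
    probs: Sequence[float],
    outcomes: Sequence[int],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Tuple[float, float]:
    """Return (overconfident share of failures, underconfident share of successes)."""
    on_failures = [p for p, y in zip(probs, outcomes) if y == 0]
    on_successes = [p for p, y in zip(probs, outcomes) if y == 1]
    if not on_failures or not on_successes:
        raise ValueError("need at least one failure and one success")
    # A failure predicted above the cutoff is a false clearance.
    overconfident = sum(p > threshold for p in on_failures) / len(on_failures)
    # A success predicted below the cutoff is unfair doubt.
    underconfident = sum(p < threshold for p in on_successes) / len(on_successes)
    return overconfident, underconfident
```

Tracking the two rates separately matters because they carry different costs: the first one ships broken work, the second one only wastes a reviewer's afternoon.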

That asymmetry is exactly what makes naive agent review dangerous in production. A conservative reviewer that sometimes blocks good work is annoying. An optimistic reviewer that waves through broken work is operationally expensive. In code, it means regressions. In finance, it means bad signals with polished explanations. In compliance, it means an audit trail that looks sensible until someone asks whether it was true.

The paper’s post-execution result therefore has a clear business meaning:

| What the paper directly shows | Cognaptus interpretation | Boundary |
|---|---|---|
| Standard post-execution agents overestimate success across all three models | A same-family AI reviewer should not be treated as an acceptance authority | Tested on coding tasks with objective pass/fail labels |
| Confidence distributions overlap heavily between successes and failures | High confidence alone may carry little signal about correctness | Signal quality varies by model and elicitation regime |
| Overconfidence is stronger on failures than underconfidence on successes | The risk is not merely inefficiency; it is false clearance | Human or external verification remains necessary for high-stakes work |

The seductive part of post-execution review is that it sees the patch. The dangerous part is also that it sees the patch.

A plausible patch gives the reviewer a story: the change is focused, the implementation resembles nearby code, the explanation sounds coherent. The reviewer then evaluates the story rather than the hidden failure condition. This is how a governance layer becomes a rubber stamp with better grammar.

Pre-execution has less information but often better discrimination

The paper’s counterintuitive finding is that pre-execution assessment often separates likely successes from failures better than standard post-execution review.

Pre-execution agents see the task description and repository, but not a generated patch. They cannot execute code or modify files. They must judge difficulty abstractly: how complex is the codebase, how specific is the issue, how likely is the fix to depend on obscure framework behavior, and whether the change requires touching fragile areas.

Despite having strictly less information, pre-execution achieves a higher point-estimate AUROC than standard post-execution for all three models:

| Model | Pre-execution AUROC | Standard post-execution AUROC |
|---|---|---|
| GPT-5.2 Codex | 0.62 | 0.58 |
| Gemini 3 Pro | 0.53 | 0.51 |
| Claude Opus 4.5 | 0.64 | 0.55 |

The paper is careful here. The bootstrap confidence intervals are wide. For GPT, pre-execution AUROC is 0.62 with a 95% interval of [0.50, 0.75], while post-execution is 0.58 with [0.46, 0.70]. For Claude, pre-execution is 0.64 [0.49, 0.78], while post-execution is 0.55 [0.42, 0.67]. The individual pairwise differences are not statistically significant.
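For readers who want to see why 100 tasks produce intervals this wide, here is a minimal percentile-bootstrap sketch for AUROC. It is an assumption-laden illustration: the article does not describe the paper's exact bootstrap procedure, and the function names, resample count, and seed are hypothetical choices.

```python
import random
from typing import Sequence, Tuple


def auroc(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Chance that a random success is scored above a random failure (ties count half)."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(
    probs: Sequence[float],
    outcomes: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for AUROC, resampling tasks with replacement."""
    n = len(probs)
    if not 0 < sum(outcomes) < n:
        raise ValueError("need both successes and failures")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [outcomes[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc([probs[i] for i in idx], ys))
    stats.sort()
    lower = stats[int(alpha / 2 * n_resamples)]
    upper = stats[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lower, upper
```

With only a few dozen successes per model, each resample reshuffles a small positive class, so the AUROC swings a lot between resamples. That is the mechanical reason a 0.04 gap between regimes cannot be called a win on its own.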

So the right reading is not “pre-execution definitely beats post-execution.” The better reading is: the consistent direction across all three models is suggestive, and it undermines the lazy assumption that a finished patch must make review more reliable.

The mechanism is plausible. Before the patch exists, the uncertainty agent must reason about the task itself. After the patch exists, the agent can be pulled toward the patch’s apparent coherence. That is not more knowledge in the useful sense. It is more material for motivated agreement.

The paper gives an example involving a qutebrowser task requiring exposure of a new QtWebEngine dark-mode policy. The pre-execution agent assigned a lower success probability because it recognized that the task depended on knowing an exact Chromium key or enum. After seeing a plausible patch, the post-execution agent became much more confident. The patch failed.

This is the core misconception to retire: verification is not automatically easier than generation when the verifier is vulnerable to the same representational blind spots as the generator.

For business use, pre-execution confidence is most valuable as a routing signal:

| Use case | How pre-execution estimates help | What not to do |
|---|---|---|
| Task triage | Identify tasks likely to waste agent budget or require human context | Treat the score as a guarantee of failure or success |
| Workflow selection | Route low-risk tasks to automation and high-risk tasks to expert review | Use one universal threshold across domains |
| Budget control | Avoid spending long agent runs on tasks that look structurally fragile | Assume the estimate is well-calibrated numerically |
| SLA planning | Flag tasks likely to need longer review cycles | Replace final verification |

Pre-execution is not magic. Its AUROC values are modest. Gemini’s pre-execution behavior is especially weak because the estimates are too compressed at the optimistic end. But pre-execution has one virtue that post-execution often lacks: it is less exposed to the “this patch looks reasonable” trap.

Sometimes ignorance is not wisdom. Sometimes it is just fewer opportunities to be impressed by nonsense.

Mid-execution doubt is not yet a reliable early-warning system

Mid-execution monitoring sounds like the most operationally attractive regime. If an agent is going off track, why wait until the end? Ask another agent to monitor progress at 25%, 50%, and 75% of the trajectory. If confidence drops for doomed tasks, stop early or escalate.

The paper tests exactly this idea. The result is more awkward than the product slide would prefer.

| Model | Checkpoint | AUROC | Mean estimate | Overconfidence | ECE |
|---|---|---|---|---|---|
| GPT-5.2 Codex | 25% | 0.53 | 0.67 | +0.32 | 0.32 |
| GPT-5.2 Codex | 50% | 0.51 | 0.63 | +0.28 | 0.32 |
| GPT-5.2 Codex | 75% | 0.53 | 0.47 | +0.12 | 0.19 |
| Gemini 3 Pro | 25% | 0.49 | 0.87 | +0.65 | 0.65 |
| Gemini 3 Pro | 50% | 0.64 | 0.80 | +0.58 | 0.58 |
| Gemini 3 Pro | 75% | 0.64 | 0.67 | +0.45 | 0.54 |
| Claude Opus 4.5 | 25% | 0.62 | 0.58 | +0.31 | 0.31 |
| Claude Opus 4.5 | 50% | 0.52 | 0.37 | +0.10 | 0.19 |
| Claude Opus 4.5 | 75% | 0.52 | 0.17 | -0.10 | 0.21 |

The authors call the central pattern “cold feet.” Confidence decreases as execution progresses for 71% of GPT instances and 97% of Claude instances. That sounds promising until the important part arrives: the doubt is often uninformative. Confidence drops for both successful and failed trajectories, and the success and failure curves track closely.

This matters because a mid-execution monitor needs more than emotional movement. It needs differential movement. If confidence falls for everything, the system has detected fatigue, complexity, or late-stage anxiety—not necessarily impending failure.
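A rough way to check for differential movement is to compare the average confidence drop on trajectories that eventually failed against the drop on those that succeeded. The sketch below assumes a hypothetical record format of (early confidence, late confidence, succeeded); it is not the paper's analysis.

```python
from statistics import mean
from typing import Dict, Iterable, Tuple

Trajectory = Tuple[float, float, bool]  # (early_conf, late_conf, succeeded), hypothetical format


def confidence_drops(trajectories: Iterable[Trajectory]) -> Dict[str, float]:
    """Mean early-to-late confidence drop, split by eventual outcome."""
    records = list(trajectories)
    drops_pass = [early - late for early, late, ok in records if ok]
    drops_fail = [early - late for early, late, ok in records if not ok]
    if not drops_pass or not drops_fail:
        raise ValueError("need at least one passing and one failing trajectory")
    return {
        "mean_drop_pass": mean(drops_pass),
        "mean_drop_fail": mean(drops_fail),
        # Near zero means the doubt is uniform, so it carries no failure signal.
        "differential": mean(drops_fail) - mean(drops_pass),
    }
```

A monitor is only worth wiring to an escalation rule if the differential is clearly positive on held-out runs. If both groups drop by the same amount, the system has measured mood, not risk.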

There is one partial exception. The paper reports that Claude’s confidence drops correlate weakly with outcome, while GPT and Gemini do not show significant relationships. That is useful, but not enough to justify a general early-stopping policy.

A business should therefore treat mid-execution uncertainty as diagnostic telemetry, not as an autonomous kill switch.

| What mid-execution monitoring can support | What the paper does not yet support |
|---|---|
| Collecting traces for later analysis | Automatically stopping tasks based only on falling confidence |
| Identifying moments where agents become uncertain | Assuming a confidence decline means failure |
| Triggering softer escalation, such as requesting more tests or a second reviewer | Treating mid-run confidence as a calibrated probability |
| Studying process-level failure modes | Claiming process monitoring is already solved |

The practical lesson is that process supervision needs better signals than “the agent seems less confident now.” Anyone who has watched a model debug its own mess already knows this feeling. The agent becomes solemn, says the issue is subtle, and continues directly into a wall.

Adversarial review helps because it changes the job, not because it adds new evidence

The strongest mitigation in the paper is adversarial post-execution review.

The adversarial reviewer sees the same kind of information as the standard post-execution reviewer: task, repository, and patch. The difference is the prompt. Instead of asking whether the patch is correct, the agent is instructed to find bugs, edge cases, failure modes, and gaps before estimating success.

That shift matters. Standard review invites confirmation. Adversarial review invites falsification.

The calibration improvements are meaningful:

| Model | Standard post-execution ECE | Adversarial post-execution ECE | Interpretation |
|---|---|---|---|
| GPT-5.2 Codex | 0.42 | 0.30 | Better calibration, but mostly from lowering estimates |
| Gemini 3 Pro | 0.66 | 0.53 | Better calibration and some discrimination improvement |
| Claude Opus 4.5 | 0.37 | 0.24 | Best calibration and better discrimination |

The paper also reports that adversarial post-execution reduces the overconfident-failure rate from 72% under standard review to 45%. That is still not safe enough for blind acceptance, but it is a real improvement.

Discrimination is more mixed:

| Model | Standard post-execution AUROC | Adversarial post-execution AUROC |
|---|---|---|
| GPT-5.2 Codex | 0.58 | 0.55 |
| Gemini 3 Pro | 0.51 | 0.57 |
| Claude Opus 4.5 | 0.55 | 0.64 |

This is where interpretation needs discipline. Adversarial prompting is not a universal signal amplifier. For GPT, the paper finds that the adversarial shift is nearly identical for passing and failing instances: confidence drops, but not in a way that better separates success from failure. In fact, post-hoc Platt scaling of standard post-execution predictions achieves better calibration for GPT than adversarial prompting, suggesting that adversarial review added little new discriminatory signal in that case.
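Platt scaling is worth seeing in code because it makes the limits of recalibration obvious. The following is a simplified Platt-style sketch, assuming a plain logistic fit on the logit of the reported probability using gradient descent. The function names, learning rate, and step count are illustrative assumptions, and the article does not describe how the paper fits or validates its version.

```python
import math
from typing import Sequence, Tuple


def _logit(p: float, eps: float = 1e-6) -> float:
    p = min(max(p, eps), 1 - eps)  # clip so 0% and 100% estimates stay finite
    return math.log(p / (1 - p))


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(
    probs: Sequence[float], outcomes: Sequence[int], lr: float = 0.1, steps: int = 5000
) -> Tuple[float, float]:
    """Fit calibrated = sigmoid(a * logit(p) + b) by minimizing log loss."""
    xs = [_logit(p) for p in probs]
    a, b = 1.0, 0.0  # start from the identity mapping
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, outcomes):
            err = _sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(p: float, a: float, b: float) -> float:
    """Map a raw reported probability to its recalibrated value."""
    return _sigmoid(a * _logit(p) + b)
```

When the fitted slope is positive, the mapping is monotone: it rescales the numbers without reordering them, so AUROC is unchanged. That is why a cheap post-hoc fix matching or beating adversarial prompting on calibration suggests the prompt mainly lowered the scores rather than adding signal. In practice the parameters should be fit on held-out data, not on the tasks being scored.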

For Gemini and Claude, the shift is larger on failing instances than on passing ones. That means adversarial review does more than lower confidence across the board; it surfaces some genuine failure signal. The paper notes that these pairwise differences are not individually significant, so the result should be read as suggestive rather than final.

The OpenLibrary example in the paper is a useful miniature. A standard reviewer sees a small, plausible addition and assigns high confidence. The adversarial reviewer inspects the surrounding output-shaping logic and notices that the added field still may not appear in the final response. The patch fails. The difference is not mystical reasoning ability. It is attention allocation.

In a production setting, this distinction matters:

| Review framing | Likely reviewer behavior | Operational risk |
|---|---|---|
| “Is this patch correct?” | Looks for coherence, style, and local consistency | High risk of rubber-stamping plausible patches |
| “Find bugs and failure modes.” | Searches for missing paths, edge cases, and integration gaps | Lower risk, but still overconfident |
| “Estimate success after adversarial search.” | Produces a more skeptical probability | Useful for escalation, not automatic acceptance |

Adversarial review is also more expensive. The paper reports 23.4 steps and $0.52 per instance versus 12.7 steps and $0.23 for standard post-execution review. This is not a scandal. Better review costs more. Shocking discovery. Someone notify procurement.

The real question is where to spend that cost. The answer is not “run adversarial review on everything.” The answer is to use cheaper pre-execution routing to identify tasks that deserve more expensive falsification.

Ensembles help only when they match the decision problem

The paper also evaluates simple ensembles of pre- and post-execution estimates: averaging, conservative selection, and aggressive selection.

The results are unsurprising in the useful way. A conservative ensemble, taking the lower of pre- and post-execution estimates, improves calibration over standard post-execution for GPT and Claude. The aggressive ensemble can improve AUROC for GPT and Claude, but at the cost of worse overconfidence and calibration.

| Ensemble choice | What it optimizes psychologically | When it may be useful | Main risk |
|---|---|---|---|
| Average | Compromise between difficulty estimate and patch review | General routing when no single signal dominates | Can preserve shared optimism |
| Conservative minimum | Trust the more skeptical estimate | Risk control, escalation, compliance-sensitive workflows | May over-escalate good work |
| Aggressive maximum | Trust the more optimistic estimate | Finding candidates worth attempting when false positives are cheap | Dangerous for acceptance decisions |
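The three rules are simple enough to state in a few lines. This is a minimal sketch with a hypothetical function name and rule labels; it follows the averaging, minimum, and maximum definitions described above and nothing more.

```python
def ensemble_estimate(pre: float, post: float, rule: str = "conservative") -> float:
    """Combine a pre-execution and a post-execution success estimate."""
    if rule == "average":
        return (pre + post) / 2
    if rule == "conservative":
        return min(pre, post)  # trust the more skeptical estimate
    if rule == "aggressive":
        return max(pre, post)  # trust the more optimistic estimate
    raise ValueError(f"unknown rule: {rule!r}")


# Hypothetical case: the task looked hard up front, but the patch looks plausible.
for rule in ("average", "conservative", "aggressive"):
    print(rule, ensemble_estimate(pre=0.25, post=0.75, rule=rule))
```

The code is trivial on purpose. The hard part is the `rule` argument, which is a policy decision about which error the business would rather make.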

For businesses, this is the governance version of choosing a loss function. If the cost of false acceptance is high, the conservative estimate is more rational. If the cost of false rejection is high and human review is abundant, a more aggressive signal might be acceptable. There is no neutral confidence aggregation rule. There is only a rule whose errors you have decided to tolerate.

The paper’s ensemble section is best read as an exploratory extension, not as a deployment recipe. The broader message is that pre-execution and post-execution estimates contain different kinds of information and fail in different ways. Combining them can help, but only if the business knows whether it is optimizing cost, speed, quality, risk, or auditability.

“Use an ensemble” is not a strategy. It is a way to postpone saying what you are afraid of.

Self-preference is not the main explanation

One possible explanation for agentic overconfidence is self-preference. Perhaps models favor their own patches because the style, reasoning pattern, or implementation approach feels familiar. The paper tests this with a smaller ablation: judges estimate success on own-model patches and cross-family patches.

The result does not rescue the simple story.

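| Judge | Patch source | Mean estimate | AUROC | ECE | Brier |
|---|---|---|---|---|---|
| GPT | GPT (self) | 0.74 | 0.55 | 0.38 | 0.36 |
| GPT | Gemini (cross) | 0.51 | 0.39 | 0.44 | 0.37 |
| Gemini | Gemini (self) | 0.72 | 0.57 | 0.56 | 0.59 |
| Gemini | GPT (cross) | 0.91 | 0.55 | 0.56 | 0.55 |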

GPT does show self-preference: it assigns higher estimates to its own patches. Gemini shows the opposite pattern, assigning even higher estimates to GPT patches than to its own. Yet overconfidence appears regardless of direction.

So self-preference may be part of the story for some models, but it cannot explain the main phenomenon. The deeper issue is broader: agentic reviewers are bad at mapping plausible-looking work to actual task success.

That is a useful ablation because it prevents a comfortable but narrow conclusion. The fix is not simply “use a different model as the judge.” Cross-model review may help in some cases, especially when model errors are diverse, but the paper does not show that model diversity alone solves overconfidence.

A practical governance design: route early, falsify late, escalate disagreement

The paper’s business value is not that it gives companies a ready-made confidence score. It does almost the opposite. It shows why confidence scores must be embedded inside a decision protocol.

A reasonable agentic workflow inspired by the paper would look like this:

Task arrives
Pre-execution uncertainty estimate
Route:
  - low-risk / routine → agent attempt
  - high-risk / unclear → human-assisted setup or narrower task decomposition
Agent produces output
Adversarial post-execution review
Decision:
  - strong evidence + low stakes → proceed with monitoring
  - disagreement or low adversarial confidence → human review
  - high stakes → external verification regardless of confidence

The key is to separate the jobs.

| Governance job | Better-suited signal | Why |
|---|---|---|
| Decide whether to attempt a task | Pre-execution estimate | Less exposed to patch anchoring |
| Decide whether to spend more review budget | Disagreement between pre- and post-execution estimates | Disagreement identifies uncertainty worth investigating |
| Decide whether to accept an output | Adversarial post-execution plus external checks | Falsification is better than confirmatory review |
| Decide whether to remove humans | Not supported by this paper | Overconfidence remains substantial |

This architecture is not glamorous. It does not say, “The agent knows when it is right.” It says, “The agent sometimes knows enough to tell us where the next review dollar should go.” That is less magical and more useful.

For AI coding tools, the immediate implication is clear: do not expose a confidence percentage as if it were a quality guarantee. Use it as a workflow feature: task triage, review prioritization, escalation, and audit logging.

For business-process automation, the analogy is broader. Many agentic workflows do not produce testable code patches. They produce reports, reconciliations, emails, compliance summaries, CRM updates, or procurement recommendations. In those settings, ground truth may be delayed or subjective. That makes overconfidence more dangerous, not less, because failure is harder to detect.

The paper’s direct evidence is coding-specific. The business inference is that any agentic workflow with multi-step hidden failure modes should avoid naive self-certification.

| Domain | Similar failure mode | Safer interpretation of agent confidence |
|---|---|---|
| Software engineering | Plausible patch fails tests | Use confidence for review routing, not merge approval |
| Finance research | Polished thesis rests on bad data join | Use confidence to trigger data lineage checks |
| Operations automation | Agent completes process but misses exception case | Use adversarial review on exception handling |
| Compliance summaries | Report sounds complete but omits material clause | Require source-grounded verification |
| Customer support automation | Response is fluent but policy-inconsistent | Escalate when confidence conflicts with policy complexity |

The general principle is simple: confidence is an input to governance, not a substitute for governance.

The boundaries are narrow enough to matter

The paper’s limitations are not decorative. They affect how the result should be used.

First, the experiments are on software engineering tasks. This is a strength because tests provide objective success labels. It is also a boundary because other domains have fuzzier definitions of success. In creative work, strategy, or legal interpretation, calibration itself may be harder to define.

Second, the sample size is 100 tasks. That is enough to reveal a broad overconfidence pattern, but some per-model comparisons have wide confidence intervals. The paper is careful about this, especially when discussing pre-execution versus post-execution discrimination and adversarial shift decomposition.

Third, the uncertainty agents are prompt-based, not trained verifiers. A verifier trained specifically to detect agentic failure patterns might perform better. That remains an open path rather than a result.

Fourth, the study uses same-family model assessment. It does not establish what happens when a stronger model reviews a weaker model, when specialized tools are added, or when independent static analysis and test generation are integrated into the review loop.

Finally, the cost numbers are specific to the experimental setup. The relative point is still useful: adversarial review uses more steps and costs more than standard review. But production economics will depend on model pricing, task complexity, and the cost of failure.

These boundaries do not weaken the article’s main business lesson. They define it. The result is not “never use agent confidence.” The result is: do not confuse a confidence estimate with a control system.

The real cost of overconfidence is false delegation

Agentic overconfidence is not just a calibration problem. It is a delegation problem.

When a company asks an AI agent, “Are you likely to succeed?” it is often asking a hidden second question: “Can I stop paying attention?” The paper’s answer is mostly no. At least, not from the agent’s own confidence estimate, not yet.

The useful path is more modest and more operational. Ask before execution to route work. Ask adversarially after execution to find failure modes. Use disagreement as a reason to escalate. Keep external verification for high-stakes decisions. Treat ordinary post-execution approval with suspicion, especially when the patch looks beautifully boring.

That last part is where many failures hide. The obviously broken output is easy to catch. The dangerous one is clean, local, idiomatic, and wrong.

AI agents are getting better at doing work. This paper reminds us that they are also getting better at sounding certain that the work is done. Those are not the same capability. A serious automation strategy must price the difference.

Cognaptus: Automate the Present, Incubate the Future.


  1. Jean Kaddour, Srijan Patel, Gbètondji Dovonon, Leo Richter, Pasquale Minervini, and Matt J. Kusner, “Agentic Uncertainty Reveals Agentic Overconfidence,” arXiv:2602.06948, 2026, https://arxiv.org/abs/2602.06948