When Accuracy Lies: From Smart Models to Ready Teams

A dashboard says the model is accurate. The pilot team says the interface is clear. The post-training survey says users trust the system. Everyone nods, because this is the part of AI deployment where organizations prefer numbers that look clean and verbs that sound finished: validated, launched, adopted.

Then the system enters a real workflow.

A manager accepts an AI recommendation because the deadline is ugly. A clinician overrides the right alert because the last alert was noisy. A loan officer follows an explanation that feels plausible but points in the wrong direction. The model did not necessarily become worse. The human did not necessarily become careless. The problem is more awkward: the organization measured a model, then deployed a team.

Min Hun Lee’s paper, From Accuracy to Readiness: Metrics and Benchmarks for Human–AI Decision-Making, argues that human–AI evaluation needs to move from model-centric performance to team readiness: whether people can recognize failure modes, calibrate reliance, recover from errors, and use governance mechanisms during real decisions.¹ That sounds like a small measurement adjustment. It is not. It changes what “ready to deploy” means.

The central misconception is familiar: if the model is accurate, explanations are available, and users report trust, the AI system must be ready. The paper dismantles that comfort. Accuracy can be true and still operationally misleading. Trust can be positive and still behaviorally irrelevant. Explanations can be informative and still encourage overconfidence. A lovely little trio of false reassurance, dressed in enterprise vocabulary.

The failure starts when prediction becomes behavior

Offline evaluation treats AI output as the end of the story. Human–AI deployment makes it the middle of the story.

In a decision-support workflow, a case often has at least four observable points:

the human’s initial judgment;
the AI recommendation;
the human’s final decision after seeing the AI;
any later governance action, such as escalation, rollback, or audit review.

The paper’s mechanism-first insight is that failure can enter at any transition between these points. The model may be wrong and the user may accept it. The model may be right and the user may reject it. The user may begin with the right answer, see a wrong AI recommendation, and change to the wrong answer. Or the team may avoid the immediate error but fail to escalate a case that policy says should receive review.

A standard accuracy score sees very little of this. It knows whether the model prediction matched ground truth. It does not know whether the AI improved the final decision, corrupted an initially correct judgment, delayed intervention, or created a near-miss that only escaped harm because a human happened to catch it.

That is why “high model accuracy” is not the same as “safe team performance.” The paper is not saying accuracy is useless. Accuracy remains necessary. But it is the wrong unit of analysis once AI advice passes through human judgment, workflow pressure, institutional incentives, and governance rules.

A simpler way to see the mechanism is this:

Evaluation lens	What it sees	What it misses
Model accuracy	Whether AI predictions match ground truth	Whether humans use those predictions appropriately
Trust survey	What users say they feel about AI	What users actually do under time pressure or accountability pressure
Explanation quality	Whether the system gives an intelligible reason	Whether the reason improves decisions or merely legitimizes bad reliance
Team readiness	How human judgment, AI advice, and governance interact over time	Still needs domain validation and benchmark design

The shift is subtle but brutal. A model can be “good” in isolation while the human–AI system is immature in use.
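The gap between a good model and an immature team can be made concrete with a toy calculation. The sketch below uses made-up data and assumes a binary task, so that rejecting a wrong recommendation produces the right answer; it is an illustration of the argument, not a computation from the paper.

```python
# Toy illustration with made-up data; not taken from the paper.
# Assumption: a binary task, so rejecting a wrong AI recommendation yields
# the correct answer and rejecting a correct one yields the wrong answer.

ai_correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]      # model is right on 8 of 10 cases
human_followed = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]  # did the user accept the advice?


def accuracy(flags):
    """Share of cases flagged 1 (correct)."""
    return sum(flags) / len(flags)


# The final decision is correct when the user follows a correct AI
# or rejects a wrong one.
team_correct = [int(a == f) for a, f in zip(ai_correct, human_followed)]

model_accuracy = accuracy(ai_correct)
team_accuracy = accuracy(team_correct)
print(f"model accuracy: {model_accuracy:.2f}, team accuracy: {team_accuracy:.2f}")
```

In this toy run the model scores 0.80 while the team scores 0.50, because the user rejects three correct recommendations and accepts both wrong ones. The model did not change; only the reliance pattern did.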

Readiness is learned behavior, not a launch checklist

The paper reframes onboarding as a measurable learning intervention rather than a one-time training session, documentation package, or cheerful demo. In its framing, users become ready by developing four practical competencies.

First, they learn reliability boundaries. They need to understand where the AI tends to work, where it tends to fail, and which cases deserve suspicion. A user who treats model quality as uniform across all cases is not calibrated; they are merely polite to software.

Second, they learn reliance calibration. Appropriate use means accepting AI advice more often when it is correct and rejecting it more often when it is wrong. That sounds obvious until one remembers that many enterprise workflows reward speed, conformity, and procedural safety more than independent judgment.

Third, users learn safe control and contestability. They need to know how to override, escalate, inspect, or roll back AI-influenced decisions when something looks wrong. A policy that says “humans remain accountable” means very little if the interface makes contesting the model slow, socially awkward, or invisible to the audit trail.

Fourth, users learn delegation and autonomy boundaries. Decision support, selective deferral, and automated action are not the same operating mode. Each redistributes responsibility differently. Readiness depends on whether users and organizations understand those boundaries before the first incident report politely calls them “lessons learned.”

This is where the paper’s Understand–Control–Improve lifecycle matters. It gives the readiness idea a time structure:

Lifecycle stage	What users are learning	What measurement should reveal
Understand	Model behavior, limits, failure patterns	Whether users can identify when AI is likely reliable or unreliable
Control	When to accept, reject, override, or escalate	Whether reliance changes appropriately with AI correctness and risk
Improve	How collaboration changes after failures and updates	Whether calibration, retention, transfer, and governance improve over time

This is not just UX language. It is an operational claim: readiness should be observable in traces of behavior. If users cannot detect unreliable cases, cannot reject wrong advice, cannot recover from AI-induced errors, or cannot use governance controls, the team is not ready—regardless of the model’s benchmark score.

The metric taxonomy turns a vague concern into an audit trail

The paper’s main contribution is a four-part taxonomy of metrics for human–AI decision-making. The taxonomy is useful because it avoids a common trap in AI evaluation: replacing one magic number with another. Instead, it asks four different questions.

Metric family	Core question	Example signals	Business interpretation
Outcome metrics	What happened?	Team accuracy, team gain, oracle gap, error recovery, error amplification	Did AI-assisted work improve final decisions, or merely move errors around?
Reliance & interaction metrics	How was AI used?	Accept-on-correct, accept-on-wrong, reject-on-wrong, override timing, reliance slope	Are users sensitive to when AI is right or wrong?
Safety & harm metrics	What went wrong?	AI-harm, AI-help, missed-help, near-misses, rollback, escalation, rule–behavior contradiction	Which failures are caused or amplified by AI influence and weak governance?
Learning & readiness metrics	What changed over time?	Calibration gap, retention, transfer, time-to-calibration	Does onboarding create durable capability, or just short-term compliance theater?

The business value is not that every firm should immediately compute every metric in the paper's appendix. That would be a fine way to create a dashboard nobody reads. The point is to separate several problems that are usually blended together.

A poor final decision could come from a weak model. It could come from a strong model that users ignore. It could come from a wrong model that users over-trust. It could come from an interface that makes override costly. It could come from a governance rule that exists on paper but not in behavior. These are different problems. They require different fixes.

That distinction is the paper’s practical strength.

Outcome metrics ask whether collaboration created value

Outcome metrics begin with a deceptively simple question: did the human–AI team perform better than the human alone or the AI alone?

The paper discusses measures such as human accuracy, AI accuracy, team accuracy, and team gain. These are familiar enough. The more interesting concept is the oracle upper bound, or “best possible” performance if the system could choose the correct agent—human or AI—on each case. Regret relative to that oracle captures avoidable collaboration failure: cases where either the human or the AI had the right answer, but the final team decision still went wrong.

This matters because it separates model limitations from coordination failure.

If both human and AI are wrong, the problem may be task difficulty, data quality, model capability, or missing information. If one of them is right and the final decision is wrong, the problem is not simply intelligence. It is collaboration design.

The paper's appendix also distinguishes error recovery from error amplification. Error recovery occurs when AI helps correct an initially wrong human decision. Error amplification occurs when AI causes a correct human decision to become wrong. A team can show decent aggregate accuracy while hiding too many harmful flips. Aggregate performance is polite that way: it averages away the embarrassing parts.

For enterprise AI, this suggests a more useful deployment review. Do not only ask whether the AI improved average accuracy. Ask where it changed decisions, whether those changes were beneficial or harmful, and which workflow conditions produced avoidable errors.
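The outcome measures in this section can be sketched from three correctness flags per case. The definitions below are one plausible operationalization, not the paper's formulas: gains are reported against each solo baseline, the oracle counts a case as solvable if either agent was right, and flips are counted without proving that the AI caused them. The `Case` class and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Case:
    human_initial_correct: bool  # human judgment before seeing the AI
    ai_correct: bool             # AI recommendation matches ground truth
    final_correct: bool          # team decision after seeing the AI


def outcome_metrics(cases):
    """Compute simple outcome metrics from per-case correctness flags."""
    n = len(cases)
    human_acc = sum(c.human_initial_correct for c in cases) / n
    ai_acc = sum(c.ai_correct for c in cases) / n
    team_acc = sum(c.final_correct for c in cases) / n
    # Oracle upper bound: a case counts if at least one agent was right.
    oracle_acc = sum(c.human_initial_correct or c.ai_correct for c in cases) / n
    return {
        "human_accuracy": human_acc,
        "ai_accuracy": ai_acc,
        "team_accuracy": team_acc,
        "gain_vs_human": team_acc - human_acc,
        "gain_vs_ai": team_acc - ai_acc,
        "oracle_accuracy": oracle_acc,
        "oracle_regret": oracle_acc - team_acc,
        # Flip counts: wrong-to-right and right-to-wrong decision changes.
        "recovered": sum((not c.human_initial_correct) and c.final_correct for c in cases),
        "amplified": sum(c.human_initial_correct and not c.final_correct for c in cases),
    }


# Made-up sample: eight cases with one helpful flip and one harmful flip.
cases = [
    Case(True, True, True),
    Case(True, True, True),
    Case(False, True, True),    # error recovery
    Case(True, False, False),   # error amplification
    Case(False, True, False),   # AI was right, team still wrong
    Case(False, False, False),  # both agents wrong
    Case(True, False, True),
    Case(True, True, True),
]
metrics = outcome_metrics(cases)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

In this sample, human, AI, and team accuracy are all 0.625, so the averages suggest that nothing happened. The flip counts and the oracle regret of 0.25 show otherwise: one decision was rescued, one was damaged, and two errors were avoidable.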

Reliance metrics expose the dangerous middle between trust and use

Trust is psychologically interesting. Reliance is operationally expensive.

The paper emphasizes that self-reported trust does not reliably predict behavior. Users may claim low trust and still follow AI under pressure. They may claim high trust and still ignore AI when a case feels risky or unfamiliar. The relevant question is not “Do users trust the system?” but “When the AI is right or wrong, how does user behavior change?”

The taxonomy therefore includes metrics conditioned on AI correctness:

Behavior	Interpretation
Accept-on-correct	The user benefits from correct AI advice
Reject-on-wrong	The user resists incorrect AI advice
Reject-on-correct	The user misses useful AI support
Accept-on-wrong	The user over-relies on incorrect AI advice

A team with good readiness should not blindly accept or blindly reject. It should discriminate. That discriminating behavior can be summarized through reliance slope: how strongly agreement with AI changes depending on whether AI is correct.
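A minimal sketch of these conditional rates follows. It assumes each logged event records whether the AI was correct and whether the user accepted its advice, and it treats reliance slope as the difference between the two acceptance rates. That is one simple reading of the idea, not necessarily the paper's exact definition.

```python
def reliance_metrics(events):
    """events: iterable of (ai_correct, accepted) boolean pairs."""
    events = list(events)
    on_correct = [accepted for ai_ok, accepted in events if ai_ok]
    on_wrong = [accepted for ai_ok, accepted in events if not ai_ok]
    if not on_correct or not on_wrong:
        raise ValueError("need at least one correct and one wrong AI case")

    accept_on_correct = sum(on_correct) / len(on_correct)
    accept_on_wrong = sum(on_wrong) / len(on_wrong)
    return {
        "accept_on_correct": accept_on_correct,
        "reject_on_correct": 1 - accept_on_correct,
        "accept_on_wrong": accept_on_wrong,
        "reject_on_wrong": 1 - accept_on_wrong,
        # Assumed definition: how much more often advice is accepted
        # when it is right than when it is wrong.
        "reliance_slope": accept_on_correct - accept_on_wrong,
    }


# Made-up log: 8 correct recommendations (6 accepted), 4 wrong (1 accepted).
log = [(True, True)] * 6 + [(True, False)] * 2 + [(False, True)] + [(False, False)] * 3
reliance = reliance_metrics(log)
print(reliance)
```

Under this definition, a slope near zero means the user accepts or rejects at the same rate regardless of AI correctness, which is the blind behavior described above. A slope near one means the user discriminates almost perfectly.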

The paper also highlights decision-change behavior: whether users change their initial decision after seeing AI output, and whether that change moves them toward or away from correctness. This is especially important because many AI systems influence judgment without formally owning the final decision. On paper, the human decides. In practice, the AI may have quietly rewritten the human’s confidence.

Timing matters too. Override frequency tells only part of the story. Intervention latency can reveal hesitation, friction, or uncertainty. A user who eventually overrides a bad recommendation after a long struggle is not in the same operational state as a user who recognizes the failure quickly and acts confidently.

The most interesting part is local versus global updating. A user may reject one visibly bad recommendation but learn nothing durable from it. The paper suggests measuring whether behavior changes after observed AI failures and whether that change persists across later cases. That is the difference between a local reaction and a real mental-model update.
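One rough way to probe local versus global updating is to compare how often wrong advice is accepted before and after a failure the user visibly caught. The sketch below assumes a chronological log of (ai_correct, accepted) pairs and a known index for the caught failure; the function name and the data are hypothetical, and a real analysis would need far more cases and a control for case difficulty.

```python
def accept_on_wrong_before_after(events, failure_index):
    """Compare acceptance of wrong AI advice before and after one caught failure.

    events: chronological list of (ai_correct, accepted) boolean pairs.
    failure_index: position of the case where the user caught a wrong recommendation.
    Returns (rate_before, rate_after); a rate is None if no wrong cases exist there.
    """
    def rate(window):
        wrong = [accepted for ai_ok, accepted in window if not ai_ok]
        return sum(wrong) / len(wrong) if wrong else None

    return rate(events[:failure_index]), rate(events[failure_index + 1:])


# Made-up sequence: the user catches a wrong recommendation at index 3.
sequence = [
    (False, True),   # accepted wrong advice
    (True, True),
    (False, True),   # accepted wrong advice
    (False, False),  # caught the failure
    (True, True),
    (False, False),
    (False, True),
    (False, False),
]
before, after = accept_on_wrong_before_after(sequence, failure_index=3)
print(f"accept-on-wrong before: {before}, after: {after}")
```

A drop that holds across later cases hints at a durable mental-model update. A drop that vanishes after a few cases looks more like a local reaction.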

This is a useful idea for AI onboarding. Training should not merely teach users to pass a tutorial. It should change future reliance behavior.

Safety metrics treat harm as AI-influenced behavior, not just wrong prediction

Safety is where the framework becomes less comfortable for organizations that prefer governance as documentation.

The paper’s safety and harm metrics include AI-help, AI-harm, missed-help, correct-ignore, near-miss rate, rollback rate, escalation rate, and rule–behavior contradiction. These metrics matter because they attribute risk to interaction patterns rather than merely counting final errors.

Consider three cases.

In the first, the AI is wrong, the human notices, and the case is escalated. That is not success in the simple sense—the AI was wrong—but the governance system behaved well.

In the second, the AI is wrong, the human accepts it, and the final decision is wrong. That is AI-harm.

In the third, policy requires escalation for a high-risk disagreement, but the user simply proceeds. The final answer might even be correct. Still, the governance system failed in use.

This is the paper’s sharpest business implication: governance is not what the organization wrote down. Governance is what users actually do when the AI recommendation collides with uncertainty, workload, incentives, and accountability.

A model card can describe limitations. A policy can require escalation. A compliance deck can be impressively beige. None of that proves contestability exists in the workflow. Contestability becomes real only when users can challenge AI outputs, record the reason, escalate when needed, and reverse decisions when later review reveals problems.

The paper calls this governance-in-use. For businesses, it suggests that AI governance should be instrumented like an operational process, not archived like a PDF.
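The cases described earlier in this section can be separated mechanically once the workflow logs a few extra fields. The sketch below is an assumption-laden illustration: the `ReviewedCase` fields, the label names, and the idea of an `escalation_required` policy flag are hypothetical stand-ins, not definitions from the paper.

```python
from dataclasses import dataclass


@dataclass
class ReviewedCase:
    ai_correct: bool
    accepted_ai: bool
    final_correct: bool
    escalation_required: bool  # set by a hypothetical policy trigger
    escalated: bool


def safety_flags(case):
    """Return the set of safety-relevant labels that apply to one case."""
    flags = set()
    if not case.ai_correct and case.accepted_ai and not case.final_correct:
        flags.add("ai_harm")
    if not case.ai_correct and not case.accepted_ai and case.escalated:
        flags.add("caught_and_escalated")
    if case.escalation_required and not case.escalated:
        flags.add("rule_behavior_contradiction")
    return flags


# AI wrong, human notices, case escalated: governance worked.
first = ReviewedCase(ai_correct=False, accepted_ai=False, final_correct=True,
                     escalation_required=True, escalated=True)
# AI wrong, human accepts, final decision wrong: AI-harm.
second = ReviewedCase(ai_correct=False, accepted_ai=True, final_correct=False,
                      escalation_required=False, escalated=False)
# Escalation required but skipped; the final answer happens to be correct.
third = ReviewedCase(ai_correct=False, accepted_ai=False, final_correct=True,
                     escalation_required=True, escalated=False)

for name, case in [("first", first), ("second", second), ("third", third)]:
    print(name, sorted(safety_flags(case)))
```

Note that the third case would pass any check based on final correctness alone. Only the comparison between the policy flag and the logged behavior exposes it.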

Learning metrics decide whether readiness survives Monday morning

Many AI pilots look good because the test is short, the users are attentive, and everyone knows they are being observed. The awkward question is what happens after the pilot glow fades.

The paper’s learning and readiness metrics ask whether calibration improves over time, whether the improvement is retained, and whether it transfers across tasks, datasets, or model versions. This is essential because AI systems do not stay still. Models are updated. Interfaces change. Data shifts. Users develop habits, shortcuts, and folk theories about when the system is “usually right.” Some of those folk theories are useful. Some are expensive superstition.

A readiness framework therefore needs measures such as:

Readiness signal	What it checks	Practical use
Calibration gap	Whether user confidence matches actual correctness	Identify overconfidence or underconfidence in AI-assisted decisions
Retention	Whether calibration persists across sessions	Decide whether onboarding produces durable learning
Transfer	Whether skills carry across tasks or model versions	Test whether users understand principles or only memorized cases
Time-to-calibration	How long users need before reliance stabilizes	Personalize onboarding length and identify slow-to-calibrate groups

This is where the paper connects evaluation to organizational learning. If a team needs repeated exposure before it stops accepting wrong AI advice, then onboarding length is not a fixed HR module. It is an empirical variable. If users lose calibration after a model update, then release management needs readiness regression tests, not just software regression tests.

That is the more mature interpretation of AI adoption: every deployment teaches the organization something about its own ability to collaborate with automation.
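Two of these readiness signals can be sketched in a few lines. The code below assumes calibration gap is mean stated confidence minus observed accuracy within a session, and that a user counts as calibrated from the first session after which the gap stays within a tolerance. Both definitions, the tolerance of 0.05, and the sample numbers are illustrative assumptions rather than the paper's specification.

```python
def calibration_gap(confidences, correct):
    """Mean stated confidence minus observed accuracy for one session.

    Positive values suggest overconfidence, negative values underconfidence.
    """
    return sum(confidences) / len(confidences) - sum(correct) / len(correct)


def time_to_calibration(session_gaps, tolerance=0.05):
    """1-based index of the first session from which |gap| stays within tolerance.

    Returns None if the gaps never settle inside the tolerance band.
    """
    for i in range(len(session_gaps)):
        if all(abs(gap) <= tolerance for gap in session_gaps[i:]):
            return i + 1
    return None


# One session: the user reports high confidence but is right half the time.
gap = calibration_gap([0.9, 0.9, 0.8, 0.8], [1, 0, 1, 0])

# Made-up per-session gaps for two users.
settles = time_to_calibration([0.30, 0.15, 0.04, 0.02])
relapses = time_to_calibration([0.30, 0.04, 0.20])
print(f"gap: {gap:.2f}, settles at session: {settles}, relapsing user: {relapses}")
```

The second user illustrates the retention problem: a gap that dips inside the band for one session and then widens again never counts as calibrated under this rule. The same check could be rerun after a model update as a simple readiness regression test.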

What the paper directly shows—and what Cognaptus infers

The paper is a conceptual and measurement framework. It does not present a new experiment proving that a specific onboarding program increases ROI. It does not claim that one universal readiness score is already validated across healthcare, finance, public services, and internal enterprise workflows. That boundary matters.

What the paper directly contributes is a structured way to define and compute readiness-related metrics from interaction traces. It synthesizes prior human–AI interaction research into a taxonomy and maps those metrics to the Understand–Control–Improve lifecycle.

What Cognaptus infers for business use is more applied: firms deploying decision-support AI should treat readiness measurement as part of deployment infrastructure. That means logging the right events, designing workflows that make escalation and rollback observable, and evaluating whether users improve in their ability to accept helpful AI and reject harmful AI.

A practical implementation could begin with a minimal trace schema:

Trace element	Why it matters
Initial human decision	Establishes the baseline before AI influence
AI recommendation and confidence cue	Records what advice shaped the user
Final human decision	Shows whether the recommendation changed the outcome
Decision change direction	Distinguishes beneficial from harmful AI influence
Accept / reject / override event	Measures reliance behavior
Time to accept, reject, or override	Reveals friction, hesitation, or automatic deference
Escalation / rollback / audit event	Makes governance visible in practice
Case risk label or policy trigger	Identifies where governance should have been activated

This does not require philosophical agreement on whether AI is a “teammate.” It requires enough instrumentation to answer a concrete operational question: did the AI system make the human decision process better, worse, or merely more confident?
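The trace schema above could be captured in a single record type. The sketch below is one possible shape, with hypothetical field names and a `ground_truth` field added as an assumption, since the direction of a decision change can only be labeled once the correct answer is known.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DecisionTrace:
    case_id: str
    initial_decision: str            # human judgment before AI influence
    ai_recommendation: str
    ai_confidence: Optional[float]   # confidence cue shown to the user, if any
    final_decision: str
    action: str                      # "accept", "reject", or "override"
    seconds_to_action: float
    governance_events: List[str] = field(default_factory=list)  # e.g. "escalation"
    policy_trigger: Optional[str] = None   # risk label that should activate governance
    ground_truth: Optional[str] = None     # often known only after later review

    def change_direction(self) -> Optional[str]:
        """Label how the decision moved relative to ground truth, if known."""
        if self.ground_truth is None:
            return None
        if self.initial_decision == self.final_decision:
            return "unchanged"
        if self.final_decision == self.ground_truth:
            return "to_correct"
        if self.initial_decision == self.ground_truth:
            return "to_wrong"
        return "wrong_to_wrong"


# Made-up example: the user abandons a correct judgment after seeing the AI.
trace = DecisionTrace(
    case_id="case-001",
    initial_decision="approve",
    ai_recommendation="deny",
    ai_confidence=0.91,
    final_decision="deny",
    action="accept",
    seconds_to_action=4.2,
    policy_trigger="high_risk_disagreement",
    ground_truth="approve",
)
print(trace.change_direction(), trace.governance_events)
```

The example record shows why each field earns its place: the decision changed to the wrong answer, a policy trigger was present, and the empty governance list shows that nothing was escalated.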

The framework is strongest as diagnosis, not certification

The natural temptation is to turn readiness into a certification badge: AI-ready team, green checkmark, procurement smiles, everyone goes home. That would be the least useful reading of the paper.

Readiness is more valuable as a diagnostic system.

If accept-on-wrong is high, users may need better failure examples, uncertainty cues, or regions-of-no-use. If reject-on-correct is high, the organization may be wasting useful AI capability because users do not know when to rely on it. If changed-to-wrong is frequent, the interface may be giving AI advice too much persuasive authority. If rule–behavior contradiction is high, governance exists in policy but not in workflow. If calibration fades after two weeks, onboarding was performance theater.
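This diagnostic reading can be expressed as a simple rule table. The sketch below maps metric values to the responses named above; the thresholds are invented placeholders, since the paper does not supply cut-offs and any real values would need domain-specific validation.

```python
# Hypothetical thresholds: placeholders only, not values from the paper.
THRESHOLDS = {
    "accept_on_wrong": 0.30,
    "reject_on_correct": 0.30,
    "changed_to_wrong": 0.10,
    "rule_behavior_contradiction": 0.05,
}

# Suggested responses, paraphrasing the diagnostic reading in the text.
HINTS = {
    "accept_on_wrong": "Over-reliance: add failure examples or uncertainty cues.",
    "reject_on_correct": "Under-reliance: show users where the AI is dependable.",
    "changed_to_wrong": "Harmful flips: reduce the persuasive weight of AI advice.",
    "rule_behavior_contradiction": "Governance gap: policy is not enacted in the workflow.",
}


def diagnose(metrics, thresholds=THRESHOLDS):
    """Return hints for every metric that exceeds its threshold.

    Metrics missing from the input are treated as 0.0, so they raise no hint.
    """
    return [
        HINTS[name]
        for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    ]


# Made-up team profile: heavy over-reliance, everything else in range.
findings = diagnose({"accept_on_wrong": 0.50, "reject_on_correct": 0.10})
for hint in findings:
    print(hint)
```

The output is a list of things to investigate, not a pass or fail verdict, which keeps the tool on the diagnostic side of the line this section draws.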

The paper’s tables are best read in this diagnostic spirit. They map metrics to data sources, lifecycle stages, and design responses. They are not experimental results, ablations, or proof that every proposed metric generalizes. They are implementation scaffolding: a way to convert vague governance language into observable signals and corrective actions.

This also explains why the mechanism-first reading is better than a plain summary. A list of metrics sounds administrative. The mechanism shows why those metrics matter: AI systems fail through interaction, so the evidence must come from interaction.

Boundaries: useful framework, unfinished standard

The paper is careful about its own status. The taxonomy is a starting point, not a finalized standard. It will need domain-specific validation, community refinement, and benchmark design before organizations can compare readiness across settings with confidence.

Several boundaries follow.

First, trace-based measurement depends on workflow instrumentation. Many organizations do not currently log initial human judgments, final decisions, override timing, or escalation behavior in a clean way. Without those traces, readiness metrics become aspiration.

Second, domains differ. A useful reliance pattern in customer support may be unacceptable in clinical triage or credit approval. “Appropriate” reliance is not a universal constant; it depends on stakes, expertise, regulation, and the cost of different errors.

Third, measurement can distort behavior. If employees know that overrides, escalation, or latency are being evaluated, they may optimize for the metric rather than the decision. Yes, people gaming KPIs is still undefeated.

Fourth, readiness is not a substitute for model quality. A well-trained human–AI team cannot fully compensate for a model that is unreliable, biased, or deployed outside its valid context. Readiness evaluation complements model evaluation; it does not replace it.

These limitations do not weaken the framework. They prevent it from becoming another overconfident AI management slogan.

Accuracy was the easy part

The paper’s most useful message is not that accuracy is bad. It is that accuracy is early.

Accuracy tells us whether a model can predict. Readiness tells us whether people can work with that prediction under real constraints. That includes knowing when to trust it, when to challenge it, when to escalate it, and when to learn from the mess afterward.

For businesses, the implication is straightforward: AI deployment should not end at model validation. It should include readiness validation. The organization should be able to answer questions such as:

Are users more likely to accept AI when it is correct than when it is wrong?
Does AI correct human errors more often than it induces harmful flips?
Do users detect boundary cases and escalate high-risk disagreements?
Does onboarding improve calibration beyond the first session?
Do governance policies appear in behavior, not only in documents?

These are less glamorous than leaderboard scores. They are also closer to where deployment succeeds or fails.

The next phase of enterprise AI will not be won only by firms with smarter models. It will be won by firms that can build ready teams: people, interfaces, policies, and feedback loops that turn prediction into accountable action.

Because AI rarely fails alone.

It fails in a workflow.

Cognaptus: Automate the Present, Incubate the Future.

¹ Min Hun Lee, “From Accuracy to Readiness: Metrics and Benchmarks for Human–AI Decision-Making,” arXiv:2603.18895, 2026, https://arxiv.org/abs/2603.18895.