Order in the Court: Why XIL Doesn’t Panic Over Human Bias

Review queue.

That is where many enterprise AI governance dreams quietly become manual work. A model makes a decision. An explanation highlights the evidence. A human reviewer approves it, rejects it, or corrects it. The system then learns from that feedback. In theory, this is how explainable AI becomes operational governance rather than a dashboard for admiring colorful heatmaps.

The fear is equally simple: humans are not neutral instruments. Show reviewers a stream of good model behavior first, and they may become too trusting. Show them failures first, and they may become too skeptical. Either way, the feedback signal becomes contaminated. And if the AI system is learning from that signal, congratulations: the company has just automated cognitive bias, which is at least efficient, if not wise.

A recent paper by Dario Pesenti, Alessandro Bogani, Katya Tentori, and Stefano Teso asks whether that fear actually breaks explanatory interactive learning, or XIL.¹ The answer is refreshingly inconvenient for anyone selling panic as strategy: order effects exist, but in this study they are small, narrow, and mostly not damaging.

That does not mean human bias disappears. It means the specific bias tested here—whether the sequence of correct and incorrect explanations changes user feedback—looks less catastrophic in a realistic XIL-style task than earlier worries suggested.

The result is not “humans are unbiased.” It is “the damage was limited.”

The paper studies explanatory interactive learning, where users do not merely see model outputs. They interact with the model’s explanations and provide corrective feedback. In a typical XIL loop, a model selects items, shows predictions and explanations, receives human corrections, and is then updated.
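
The loop described above can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: `ToyModel`, `xil_round`, and `ask_reviewer` are hypothetical names, and the "update" step simply memorizes corrections where a real system would retrain.

```python
from dataclasses import dataclass, field
from typing import Callable

Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class ToyModel:
    """Hypothetical stand-in for a model whose explanation is a box."""
    default_box: Box = (0.0, 0.0, 1.0, 1.0)
    corrections: dict[str, Box] = field(default_factory=dict)

    def explain(self, item_id: str) -> Box:
        # Show the stored correction if this item was reviewed before.
        return self.corrections.get(item_id, self.default_box)

    def update(self, feedback: dict[str, Box]) -> None:
        # Assumption: a real XIL system would retrain here.
        self.corrections.update(feedback)


def xil_round(
    model: ToyModel,
    item_ids: list[str],
    ask_reviewer: Callable[[str, Box], Box],
) -> dict[str, Box]:
    """One pass of the loop: show explanations, collect corrections, update."""
    feedback: dict[str, Box] = {}
    for item_id in item_ids:
        shown = model.explain(item_id)
        # The reviewer either returns the shown box or a corrected one.
        feedback[item_id] = ask_reviewer(item_id, shown)
    model.update(feedback)
    return feedback
```

The order of `item_ids` is exactly the knob the paper studies: the same items, shown in a different sequence, could in principle produce different feedback.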

That sequence naturally creates a psychological question. If the model shows failures early, does the user become anchored on distrust? If it shows successes early, does the user start approving weak explanations too easily? This is the order-effect problem.

The authors test it through two controlled user studies with 713 total participants. Participants were shown blurred images containing human faces. A fictional AI model placed a bounding box around the face; the bounding box functioned as the model’s explanation. Participants had six seconds per image either to confirm the box or move it to the correct face location.

The experiment deliberately avoids an overly abstract “Do you trust this AI?” survey game. It measures behavior. Did participants correct the explanation accurately? Did they agree with the model’s box? Did their self-reported trust move after exposure to different sequences of model success and failure?

That distinction matters. Trust surveys are cheap. Behavioral evidence is expensive. Naturally, governance teams often prefer the cheap one. The paper gently punishes that habit.

The task turns trust into geometry

The study’s design is unusually clean because the explanation is spatial. The model’s explanation is a box. The ground truth is another box. The participant’s feedback is a final box. This allows the authors to measure two behavioral outcomes using overlap, alongside a self-reported questionnaire:

Measure	What it captures	Practical interpretation
Feedback accuracy	Overlap between the participant’s final box and the ground-truth face box	Whether the human provided useful corrective feedback
Agreement with the model	Overlap between the model’s box and the participant’s final box	Whether the human behaviorally accepted the model’s explanation
Trust/perceived accuracy questionnaire	Four 7-point Likert items after the task	Whether the user consciously reported confidence in the model
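
To make the two overlap measures concrete, here is a small sketch. The article only says "overlap"; using intersection-over-union (IoU) is an assumption on our part, as are the function names and the box format.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def overlap(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes (assumed metric)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def feedback_accuracy(final_box: Box, truth_box: Box) -> float:
    # How close the participant's final box is to the ground truth.
    return overlap(final_box, truth_box)


def agreement(model_box: Box, final_box: Box) -> float:
    # How much of the model's explanation the participant left in place.
    return overlap(model_box, final_box)
```

Under this definition, a participant who confirms the model's box without moving it has an agreement of 1.0, whatever the ground truth says. That is why the two measures have to be read together.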

The study also separates image difficulty. Some blurred faces were easy to locate; others were more ambiguous. This is not a minor detail. Order effects are more likely to matter when the user is uncertain. Nobody needs a cognitive-bias seminar to reject a box placed on the wall instead of the face. Ambiguity is where trust sneaks in wearing a fake badge.

The model explanations had three placement types: correct, partially wrong, and wrong. In the main statistical analysis, partially wrong and wrong explanations were grouped as “incorrect,” while the appendix breaks them out in supplementary plots. That appendix is not a second thesis. It mainly helps identify where the small observed effects come from: the fuzzier middle cases, especially difficult images with partially wrong boxes.

Experiment 1: early failure made users slightly more cautious, not useless

The first experiment tests within-session order effects. All participants saw one debugging session of 40 images. The model’s overall accuracy was always 60%, but the sequence differed.

In the Increasing condition, the model began worse and improved across the session, moving from 40% to 80% accuracy. In the Constant condition, it stayed at 60%. In the Decreasing condition, it began better and worsened from 80% to 40%.
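
The arithmetic behind "always 60%" is easy to check. The sketch below assumes the simplest possible schedule, a first half at the starting accuracy and a second half at the ending accuracy. The source gives only the endpoints, so the two-halves split, `build_schedule`, and `CONDITIONS` are illustrative assumptions and not the paper's actual procedure.

```python
import random

# Start and end accuracy per condition, taken from the description above.
CONDITIONS = {
    "Increasing": (0.4, 0.8),
    "Constant": (0.6, 0.6),
    "Decreasing": (0.8, 0.4),
}


def build_schedule(
    n_items: int, start_acc: float, end_acc: float, seed: int = 0
) -> list[bool]:
    """Return correct/incorrect flags for one session.

    Assumption: the first half runs at start_acc, the second at end_acc.
    """
    rng = random.Random(seed)
    half = n_items // 2
    schedule: list[bool] = []
    for size, acc in ((half, start_acc), (n_items - half, end_acc)):
        n_correct = round(size * acc)
        block = [True] * n_correct + [False] * (size - n_correct)
        rng.shuffle(block)  # randomize order inside each half
        schedule.extend(block)
    return schedule
```

With 40 items, every condition ends up with 24 correct explanations, which is 60% overall. Only the placement of the failures differs.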

The central question: does the distribution of early successes and failures change how users behave later in the same session?

The headline result is not dramatic. Participants’ feedback accuracy stayed high and nearly identical across order conditions: 0.76 in Increasing, 0.75 in Constant, and 0.76 in Decreasing. The expected stimulus effects appeared: users were more accurate on easy images than difficult ones, and more accurate when the model’s box was correct than incorrect. That is not a scandal; that is Tuesday.

The order effect on feedback accuracy was statistically detectable only as an interaction with model placement. Participants in the Decreasing condition showed a slightly smaller accuracy gap between correct and incorrect placements than those in the Constant condition. The authors interpret this cautiously: perhaps users improved at the task by the time more errors appeared later, or perhaps the difference is noise.

The clearer order effect appears in agreement with the model. When the model’s box was incorrect, users were generally more likely to agree with it on difficult images than easy ones. That is sensible: ambiguity makes bad explanations harder to reject. But this tendency was weaker in the Increasing condition, where users saw more model failures early. In plain English, early exposure to mistakes made participants slightly less willing to follow the model later, especially when the image was ambiguous.

That is a primacy effect, but not a meltdown. Early failures nudged behavior toward skepticism. They did not destroy feedback quality.

Self-reported trust did not notice what behavior revealed

After the within-session task, participants rated the model’s accuracy and trustworthiness. The three conditions were statistically indistinguishable. The average trust/perceived accuracy index was 3.06 in Increasing, 3.11 in Constant, and 3.13 in Decreasing on a 7-point scale.

This is one of the more useful business lessons in the paper. Users may adjust their behavior without reporting a corresponding change in trust. In governance workflows, this means “How much do you trust the AI?” is a weak substitute for observing what reviewers actually do.

A reviewer may say the model is no more or less trustworthy while quietly becoming less likely to approve borderline explanations. That behavioral adjustment is precisely where operational risk lives.

So the practical takeaway is not “never survey users.” It is: do not confuse survey sentiment with feedback reliability. The former is a mood ring. The latter is the thing that trains or validates the system.

Experiment 2: the model-update boundary seems to reset expectations

The second experiment tests between-session order effects. Participants completed two debugging sessions of 40 images each. In the first session, model accuracy varied by condition: 40% in Increasing, 60% in Constant, and 80% in Decreasing. Then participants waited five seconds and were told the model was being updated based on feedback from various users. The second session was identical across all conditions: same images, same order, same 60% model accuracy.

This design tests whether first-session experience carries over after a model-update boundary. In enterprise terms: if reviewers see a weak model today, will they remain biased against a supposedly improved model tomorrow? Or if they see a strong model first, will they overtrust the next version?

The manipulation worked. In the first session, agreement with the model was much lower in the 40% condition than the 60% condition, and lower in the 60% condition than the 80% condition. People noticed, behaviorally, whether the model was performing well.

But in the second session, when everyone saw the same model behavior, agreement converged. Overall agreement was 0.63 in Increasing, 0.62 in Constant, and 0.61 in Decreasing. There was no significant order-condition effect on agreement.

Feedback accuracy also converged in broad terms: 0.78 in all three conditions. There was a statistically significant interaction between order condition and placement, but its magnitude was small. The Increasing group showed a slightly larger drop from correct to incorrect placements than the other groups. The authors suggest one possible interpretation: participants who saw the model improve may have become a little more reliant on it. But they also warn that the effect is small enough that noise remains a plausible explanation.

Self-reported trust again stayed unmoved. After the second session, the trust/perceived accuracy index was 3.38, 3.43, and 3.34 across the three conditions, with no meaningful difference.

The paper’s most business-relevant finding sits here: once users are told the model has been updated, earlier exposure does not appear to strongly bias later agreement or trust in this task. The update boundary acts like a cognitive reset. Not a magical reset, but enough to prevent obvious carryover.

What the evidence supports—and what it does not

The paper is strongest when read as a controlled correction to an overbroad fear. It does not prove that all human-in-the-loop systems are safe from cognitive bias. It shows that, in this specific XIL-style debugging task, order effects did not seriously undermine behavioral feedback.

A useful way to read the evidence is by purpose:

Paper component	Likely purpose	What it supports	What it does not prove
Experiment 1 within-session design	Main evidence	Presentation order can slightly affect agreement during one debugging session	That order effects meaningfully damage feedback quality
Experiment 1 questionnaire	Main evidence, but weaker behavioral relevance	Conscious trust ratings do not change across order conditions	That users’ behavior is unaffected
Experiment 2 manipulation check	Implementation validation	Participants behaviorally registered first-session model quality	That first-session experience causes lasting bias
Experiment 2 second session	Main evidence	Between-session order effects are limited after an announced update	That all model updates reset user expectations in real deployments
Appendix split by correct / partially wrong / wrong placements	Exploratory breakdown / sensitivity support	The clearest within-session effect is concentrated in ambiguous partially wrong cases	A separate general theory of partial-error perception

The “not prove” column is not decorative caution. It is where implementation decisions live. Businesses should not take this paper as permission to ignore user-interface design, reviewer training, or audit logging. They should take it as permission to stop treating order effects as an automatic death sentence for XIL.

The business interpretation: randomize queues, mark update boundaries, monitor ambiguity

For enterprise teams building AI review systems, model-debugging interfaces, or human-in-the-loop feedback platforms, the paper suggests a balanced operating model.

First, sequential review is not inherently broken. If users inspect explanations one by one, the order of good and bad outputs may influence behavior a little, especially within a session. But this study does not support the claim that such ordering broadly corrupts feedback.

Second, randomization remains cheap insurance. Within-session effects were small, but they appeared where one would expect them: difficult and ambiguous cases. If a review queue can randomize item order without harming operational priority, it should. This is not because the paper screams danger. It is because low-cost controls do not require existential threats to be worthwhile.

Third, model-update boundaries should be explicit. In Experiment 2, participants were told the model had been updated before the second session. That boundary may have helped them reset expectations. In business workflows, version changes should be visible: “This is model version 3.2, updated after last week’s feedback.” Hidden updates invite old mental models to haunt new systems. Ghosts are bad governance artifacts.
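
One way to make the boundary visible is to announce a version change inside the review queue itself. The sketch below is purely illustrative: `ReviewItem`, `with_version_banners`, and the banner wording are hypothetical, and nothing in the paper prescribes this mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewItem:
    item_id: str
    model_version: str  # hypothetical field, e.g. "3.2"


def with_version_banners(items: list[ReviewItem]) -> list[str]:
    """Interleave a banner line whenever the model version changes."""
    lines: list[str] = []
    previous = None
    for item in items:
        if item.model_version != previous:
            lines.append(f"--- Now reviewing model version {item.model_version} ---")
            previous = item.model_version
        lines.append(f"review {item.item_id}")
    return lines
```

The point is modest: the reviewer should never have to infer that the model changed.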

Fourth, monitor borderline explanations separately. The strongest within-session signal came from difficult, partially wrong cases. These are the cases where reviewers are neither confidently accepting nor confidently rejecting the model. In production, that suggests specific metrics: track disagreement, correction distance, and response latency by ambiguity level. The risky zone is not obvious failure. It is the plausible-but-wrong explanation.
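
A minimal sketch of that kind of monitoring, assuming review events are already logged with an ambiguity label. The `ReviewEvent` fields, the label values, and `summarize_by_ambiguity` are hypothetical; a real pipeline would define ambiguity and correction distance for its own domain.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ReviewEvent:
    ambiguity: str              # assumed label, e.g. "easy" or "difficult"
    agreed: bool                # reviewer kept the model's explanation
    correction_distance: float  # how far the explanation was moved
    latency_s: float            # seconds taken to respond


def summarize_by_ambiguity(events: list[ReviewEvent]) -> dict[str, dict[str, float]]:
    """Aggregate the three metrics separately for each ambiguity level."""
    groups: dict[str, list[ReviewEvent]] = defaultdict(list)
    for event in events:
        groups[event.ambiguity].append(event)
    return {
        level: {
            "n": len(group),
            "disagreement_rate": mean(0.0 if e.agreed else 1.0 for e in group),
            "mean_correction_distance": mean(e.correction_distance for e in group),
            "mean_latency_s": mean(e.latency_s for e in group),
        }
        for level, group in groups.items()
    }
```

Reporting per level rather than a single average keeps the easy cases from drowning out the borderline ones, which is where the small effects in this study showed up.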

The operational framework

A governance team can translate the paper into a simple XIL review protocol:

Design choice	Paper-informed rationale	Practical implementation
Randomize non-urgent review items	Reduces within-session ordering artifacts	Shuffle items within risk buckets rather than across all priorities
Separate model versions clearly	Users may reset expectations after announced updates	Show model version, update time, and change summary
Track behavior, not only trust surveys	Self-reported trust did not move even when agreement behavior did	Log approvals, corrections, correction distance, and skipped items
Flag ambiguous cases	Order effects were most visible where users were uncertain	Add uncertainty labels, second review, or expert escalation
Avoid overcorrecting for small effects	Evidence does not support expensive between-session bias controls	Do not recruit fresh reviewers for every model iteration unless domain risk demands it
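
The first row of the table, shuffling within risk buckets rather than across all priorities, can be sketched as follows. The bucket names and `shuffle_within_buckets` are hypothetical, and the sketch assumes each item already carries a risk label.

```python
import random
from collections import defaultdict
from typing import Optional


def shuffle_within_buckets(
    items: list[tuple[str, str]],
    bucket_order: list[str],
    seed: Optional[int] = None,
) -> list[str]:
    """Randomize item order inside each risk bucket, keeping bucket priority.

    items: (item_id, bucket) pairs; bucket_order: highest priority first.
    """
    rng = random.Random(seed)
    buckets: dict[str, list[str]] = defaultdict(list)
    for item_id, bucket in items:
        if bucket not in bucket_order:
            raise ValueError(f"unknown bucket: {bucket!r}")
        buckets[bucket].append(item_id)
    queue: list[str] = []
    for bucket in bucket_order:
        group = buckets.get(bucket, [])
        rng.shuffle(group)  # order within a bucket is randomized
        queue.extend(group)
    return queue
```

Urgent items still reach reviewers first. Only the sequence inside each priority level is randomized, which is the cheap part of the insurance.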

This is the sort of conclusion that sounds boring until one prices the alternatives. If between-session order effects were large, organizations might need separate reviewer pools for each model iteration, strict counterbalancing designs, or costly review resets. This paper suggests those heavy controls may be unnecessary for similar XIL tasks, provided the workflow is transparent and the cases are not high-stakes edge conditions disguised as routine review.

The boundary: this is not a universal human-bias vaccine

The study uses Prolific participants, blurred face images, bounding-box explanations, and a fictional model. The task is accessible by design. Participants need visual judgment, not radiology expertise, legal knowledge, or fraud-investigation experience.

That matters. In medical diagnostics, credit underwriting, insurance claims, or regulatory compliance, explanation review may involve domain concepts, asymmetric risk, organizational pressure, and professional liability. A reviewer correcting a bounding box in six seconds is not the same as a clinician deciding whether a saliency map supports a cancer diagnosis. One hopes this sentence is unnecessary. Experience suggests otherwise.

The explanation type also matters. Bounding boxes make feedback concrete. Other explanation families—concept-level explanations, example-based explanations, counterfactuals, long natural-language rationales—may trigger different cognitive dynamics. A reviewer can move a box. Correcting a vague concept attribution or a persuasive textual rationale is a different sport.

Finally, the paper studies order effects arising from the distribution of model errors. It does not exhaust all sequence effects. The order of easy versus difficult items, the order of familiar versus unfamiliar cases, or the order of emotionally charged cases may produce different outcomes.

So the correct interpretation is bounded: for explanation-based debugging tasks resembling this study, order effects appear manageable. For high-stakes expert workflows, the paper is useful evidence, not a compliance shield.

The quiet lesson: human oversight is less fragile when the task is well-shaped

The paper’s deeper message is not just that order effects are small. It is that the structure of the task matters.

Participants were not asked to form vague impressions of the model. They had a concrete job: confirm or move a box. The feedback target was visible. The correction action was constrained. The evaluation metric was behaviorally measurable. Those design choices probably helped limit the damage of cognitive bias.

This is the part enterprise AI teams should remember. Human-in-the-loop governance fails when “oversight” means asking people to stare at model outputs and feel responsible. It works better when the system gives users a precise correction task, captures the correction in machine-readable form, and separates behavioral evidence from trust theater.

Order effects deserve attention. They do not deserve panic. Panic is what happens when governance teams discover psychology after product launch.

Pesenti and colleagues give us a more useful posture: test the human factors, measure behavior, randomize where cheap, mark update boundaries, and watch the ambiguous cases. The court is not dismissing cognitive bias. It is merely refusing to treat every sequence effect as a capital offense.

Cognaptus: Automate the Present, Incubate the Future.

¹ Dario Pesenti, Alessandro Bogani, Katya Tentori, and Stefano Teso, “Human Cognitive Biases in Explanation-Based Interaction: The Case of Within and Between Session Order Effect,” arXiv:2512.04764, 2025. https://arxiv.org/abs/2512.04764
