Scorecards look objective until a user reads the explanation
Scorecards are comforting. They turn a messy judgment into a neat row of numbers: sparsity, proximity, plausibility, trust score, completeness. The model team can rank explanation methods. The governance team can file the validation report. The product team can say the system is explainable. Everyone gets to leave the meeting before dinner.
Then a user looks at the explanation and thinks: What exactly am I supposed to do with this?
That gap is the subject of Felix Liedeker, Basil Ell, Philipp Cimiano, and Christoph Düsing’s paper, “Do Metrics for Counterfactual Explanations Align with User Perception?”[1] The paper asks a blunt question that explainable AI teams often avoid: when we score counterfactual explanations with standard automated metrics, do those scores actually match how humans judge explanation quality?
The answer is not a tidy “no.” It is more operationally annoying than that. Some metrics show weak signal in some datasets. Some directions flip across contexts. Combining metrics does not rescue the evaluation. Nonlinear models do a little better in one representative case, then run out of road. The useful lesson is not that metrics are worthless. The useful lesson is that a metric scorecard is not a substitute for human-centered validation. It is a technical diagnostic pretending to be a user study. Naturally, the spreadsheet looks more confident than it should.
The paper compares two worlds that XAI often keeps separate
Counterfactual explanations are popular because they sound close to how people reason. Instead of saying “feature X contributed 0.27 to the decision,” a counterfactual says something like: if these input values had changed, the model would have produced a different outcome. That format is attractive in credit, healthcare, insurance, hiring, compliance, and decision support because it gives the user a contrast: the current case versus a nearby alternative.
Research and engineering practice then evaluate these explanations using automated metrics. The paper studies seven of them:
| Metric | What it tries to capture | The business-friendly interpretation people are tempted to make |
|---|---|---|
| Sparsity | How many features changed | Fewer changes should be easier to understand |
| Proximity | How far the counterfactual is from the original input | Smaller changes should feel more realistic |
| Closeness | Distance to nearby training examples | Explanations near observed data should seem plausible |
| Diversity | Independence among changed features | More varied changes may offer richer alternatives |
| Oracle Score | Agreement between the base model and an oracle model | Cross-model agreement may suggest a more credible target class |
| Trust Score | Distance to the predicted class versus other classes | The counterfactual should sit safely inside the target class region |
| Completeness | Whether changed features overlap with important model features | Explanations should modify what the model actually cares about |
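To make the first two rows concrete, here is a minimal sketch of sparsity and proximity for a single counterfactual. The definitions are assumptions chosen for illustration (a count of changed features and an L1 distance over already-scaled numeric features); the paper's exact formulas may differ, and the feature values are hypothetical.

```python
def sparsity(original, counterfactual, tol=1e-9):
    """Number of features whose value differs between the two instances."""
    return sum(1 for a, b in zip(original, counterfactual) if abs(a - b) > tol)


def proximity(original, counterfactual):
    """L1 distance between the instances (assumes features are already scaled)."""
    return sum(abs(a - b) for a, b in zip(original, counterfactual))


# Hypothetical instance and counterfactual with four scaled numeric features.
original = [0.20, 0.50, 0.90, 0.10]
counterfactual = [0.20, 0.80, 0.90, 0.00]

changed = sparsity(original, counterfactual)    # two features differ
distance = proximity(original, counterfactual)  # 0.3 + 0.1
print(changed, round(distance, 2))
```

Both numbers describe the geometry of the change. Neither says whether a reader would find the change sensible, which is the gap the rest of the article is about.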
These are not silly metrics. They encode reasonable engineering instincts. A counterfactual that changes twenty variables is probably hard to use. A counterfactual far away from the training distribution may be nonsense. A counterfactual that changes irrelevant features is not much of an explanation; it is a magic trick with tabular data.
But the paper’s point is precisely that reasonable engineering instincts are not the same as user perception. Users do not experience an explanation as a geometry problem. They judge whether it feels accurate, understandable, plausible, sufficiently detailed, and satisfying. Those are cognitive and contextual judgments, not just distances in feature space.
The study is therefore a comparison between two evaluation worlds: machine-facing metrics and human-facing ratings. That comparison is more valuable than another leaderboard of counterfactual methods because businesses do not deploy explanations into metric dashboards. They deploy them into workflows where people must interpret, trust, challenge, or act on them.
The experiment is small enough to read carefully, and large enough to embarrass simple proxies
The authors use three tabular classification datasets from the UCI repository: Mushroom, Obesity Levels, and Heart Disease. The choice matters. These are not random benchmark names thrown into a table for decoration. They create three different explanation contexts:
| Dataset | Task type | Valid counterfactuals generated | Why it matters for interpretation |
|---|---|---|---|
| Mushroom | Binary classification, intuitive observable attributes | 755 | Users can reason about visible attribute changes without specialist knowledge |
| Obesity Levels | Seven-class classification, lifestyle and physical-condition variables | 211 | Multi-class outcomes may reward richer explanations because the target space is less binary |
| Heart Disease | Binary classification, clinical measurements | 25 | Medical context can be harder for lay users, and the small number of explanations limits detection of subtler effects |
The model behind the explanations is XGBoost, selected after the authors found it performed best among evaluated classifiers, with F1 scores of 1.00 for Mushroom, 0.95 for Obesity, and 0.85 for Heart Disease. Counterfactuals are generated using Counterfactuals Guided by Prototypes, implemented through Alibi Explain. This is important because the paper is not comparing many counterfactual-generation methods. It is asking whether standard metrics computed on one set of generated counterfactuals line up with human judgments.
From the valid explanations, the authors select 85 counterfactuals for the user study: 30 Mushroom, 30 Obesity, and 25 Heart Disease. The sampling procedure is not a side quest. They cluster explanations using the seven automated metrics and sample in a way that preserves the structure of the metric space. In practical terms, they are trying not to show participants only one narrow kind of explanation. This is an implementation detail that supports the main evidence; it is not itself a second thesis.
Participants on Prolific then rate the explanations. Each explanation appears as a table comparing the original instance with the counterfactual, with changed features highlighted and accompanied by a brief text description. Participants rate five dimensions on a four-point Likert scale: perceived accuracy, understandability, plausibility, sufficiency of detail, and user satisfaction. After excluding one participant who failed attention checks, the analysis uses 167 participants and 2,004 individual ratings, averaging 23.58 complete rating sets per explanation.
The paper then aggregates the five dimensions into a Combined Quality Score, or CQS. That aggregation is justified by a high internal consistency among the five dimensions, with Cronbach’s $\alpha = 0.88$, and a principal component analysis where the first component explains 74.1% of the variance. This does not mean human judgment is simple. It means the measured dimensions move together enough to support a combined perception-quality score for analysis.
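For readers who want to see what the reliability figure measures, the following is a minimal sketch of Cronbach's alpha using the standard textbook formula. The ratings matrix is invented for illustration and is not the study's data; the way the authors arranged their ratings before computing $\alpha$ is not described here, so treat the layout (one row per rated unit, one column per dimension) as an assumption.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for rows of k item scores (one row per rated unit)."""
    k = len(rows[0])
    # Sample variance of each item (column) across rows.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of the per-row total score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: accuracy, understandability, plausibility,
# sufficiency of detail, satisfaction (four-point scale).
ratings = [
    [4, 4, 3, 4, 4],
    [2, 2, 2, 1, 2],
    [3, 3, 3, 3, 2],
    [1, 2, 1, 1, 1],
    [3, 4, 3, 3, 3],
    [2, 1, 2, 2, 2],
]

alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

A value near 1 means the five columns rise and fall together, which is what licenses averaging them into one score. It says nothing about whether that shared score tracks any automated metric.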
A useful way to read the empirical design is this:
| Analysis component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Metric-rating correlations | Main evidence | Whether individual automated metrics align with human ratings | That a metric has causal influence on user judgment |
| Predictive modeling over all metric subsets | Main evidence plus stress test of the “combine more metrics” assumption | Whether combinations of metrics can predict human perception out of sample | That all possible human-aligned metrics would fail |
| Power analysis | Design justification | The study can detect large effects likely to be practically useful as proxies | That small or medium effects do not exist |
| CQS reliability analysis | Measurement validation | The five human rating dimensions form a coherent aggregate | That every user or professional role values the same explanation qualities |
| Figures 2 and 3 on $R^2$ and metric count | Sensitivity and model-complexity evidence | More metrics do not automatically improve prediction | That every nonlinear model class has been exhausted |
That distinction matters because the paper’s results are easy to overstate. It does not prove that automated explanation metrics can never be useful. It shows that these widely used counterfactual metrics, under this experimental setup, do not behave like reliable proxies for perceived explanation quality.
For business use, that is already enough damage.
One metric weakly works overall, but “overall” is where bad governance hides
The first result is the correlation analysis. The authors compute Pearson correlations between each of the seven automated metrics and the five human rating dimensions, plus the CQS, both across all explanations and separately by dataset.
Across all explanations, only Trust Score has a statistically significant association with CQS: $r = 0.307$, $p = 0.004$. All other metrics have negligible overall correlations, with $|r| < 0.1$.
At first glance, this sounds like a small victory for Trust Score. It is not much of one. A correlation around 0.3 is a weak-to-moderate signal, not a reliable substitute for asking users. More importantly, the aggregate result hides dataset-specific behavior. This is exactly the kind of “overall metric” that looks respectable in a validation report and then quietly collapses when the product enters a different domain.
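A rough worked figure helps here. It relies on the standard reading of a squared Pearson correlation as the share of variance accounted for by a simple linear fit; it is a back-of-envelope number, not one reported in the paper:

$$r^2 = 0.307^2 \approx 0.094$$

On that reading, Trust Score accounts for roughly 9% of the variance in CQS and leaves about 91% unexplained.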
The paper’s dataset comparison is the core of the story.
In Mushroom, several metrics — including sparsity, diversity, proximity, and closeness — show moderate to strong negative correlations with sufficiency of detail, satisfaction, and CQS, in the range of $r = -0.38$ to $-0.64$. The authors interpret this as evidence that, in this domain, users prefer counterfactuals involving fewer and smaller changes. That is intuitive: if a mushroom is classified as poisonous because of a small number of visible properties, a compact explanation is likely easier to digest.
In Obesity, the pattern changes. Diversity, Trust Score, and Completeness correlate positively with several rating dimensions, including plausibility, satisfaction, and CQS, with correlations in the range of $r = 0.37$ to $0.52$. Here, users appear to favor more comprehensive or information-rich explanations. Again, that is plausible. Obesity-level classification is multi-class and tied to lifestyle or physical-condition variables. A one-feature “fix” may feel less credible than an explanation that touches a richer cluster of relevant factors.
In Heart Disease, the metrics mostly stop cooperating. Correlations are weak, non-significant, and mixed in direction. No consistent relationship emerges.
This is the paper’s most useful business lesson: the same metric can behave differently depending on the decision context. In one domain, compactness may feel clear. In another, compactness may feel under-explained. In a medically flavored dataset, lay users may lack the domain knowledge needed to translate metric-friendly counterfactuals into perceived quality.
That does not make users irrational. It means explanation quality is contextual. A metric that ignores context is not objective; it is incomplete with impressive formatting.
The authors summarize this instability with a mean cross-dataset standard deviation of correlation coefficients of 0.31. In plain language: the relationship between automated metric values and human ratings moves around enough that one universal scorecard is a bad bet.
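One plausible way to compute such an instability figure is sketched below: take each metric's correlation with CQS in each dataset, compute the standard deviation across datasets, then average over metrics. The correlation values are invented for illustration and are not the paper's, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

# Hypothetical per-dataset correlations with CQS (illustrative values only).
correlations = {
    "sparsity": {"mushroom": -0.50, "obesity": 0.10, "heart_disease": 0.05},
    "trust_score": {"mushroom": 0.10, "obesity": 0.45, "heart_disease": -0.05},
}

# Spread of each metric's correlation across the three datasets.
per_metric_sd = {
    metric: stdev(list(by_dataset.values()))
    for metric, by_dataset in correlations.items()
}

# Average spread across metrics: the "instability" summary.
instability = mean(list(per_metric_sd.values()))

for metric, sd in per_metric_sd.items():
    print(f"{metric}: sd = {sd:.3f}")
print(f"mean cross-dataset sd = {instability:.3f}")
```

A metric whose correlation is the same everywhere would contribute a spread of zero. A spread around 0.3 means the sign and size of the relationship can change materially from one domain to the next.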
The “just combine the metrics” escape route also fails
The natural defense of automated evaluation is aggregation. One metric may be weak, but a portfolio of metrics should capture different parts of the user experience. Sparsity covers simplicity. Closeness covers plausibility. Completeness covers model relevance. Trust Score covers classification reliability. Put them together and, surely, the human judgment should emerge.
This is the spreadsheet version of hope.
The paper tests it directly. The authors examine all 127 non-empty subsets of the seven metrics. For each subset, they train five model classes — linear regression, k-nearest neighbors, Random Forest, XGBoost, and generalized additive models — across three datasets and six target variables: the five rating dimensions plus CQS. Performance is evaluated using five-fold cross-validated $R^2$, where $R^2$ is judged against a mean baseline.
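The size of that search is easy to verify. The sketch below enumerates the non-empty subsets of seven metrics and multiplies out the grid implied by the description above. The identifier names are placeholders, and the grid total is only the nominal number of configurations; it does not account for runs that failed to converge.

```python
from itertools import combinations

METRICS = [
    "sparsity", "proximity", "closeness", "diversity",
    "oracle_score", "trust_score", "completeness",
]
MODEL_CLASSES = ["linear", "knn", "random_forest", "xgboost", "gam"]
DATASETS = ["mushroom", "obesity", "heart_disease"]
TARGETS = [
    "accuracy", "understandability", "plausibility",
    "sufficiency", "satisfaction", "cqs",
]

# Every non-empty subset of the seven metrics: sizes 1 through 7.
subsets = [
    combo
    for size in range(1, len(METRICS) + 1)
    for combo in combinations(METRICS, size)
]

# Nominal number of (subset, model, dataset, target) configurations.
n_settings = len(subsets) * len(MODEL_CLASSES) * len(DATASETS) * len(TARGETS)

print(len(subsets))  # 2**7 - 1 = 127
print(n_settings)    # 127 * 5 * 3 * 6 = 11430
```

With that many configurations and only 25 to 30 explanations per dataset, a handful of good-looking fits would be expected by chance alone, which is why the averages matter more than the best case.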
The results are not kind.
| Model class | Mean cross-validated $R^2$ reported across settings | Interpretation |
|---|---|---|
| Linear regression | $-1.253$ | Linear metric combinations are worse than the mean baseline |
| XGBoost | $-1.874$ | Performs even worse, likely overfitting in the low-sample regime |
| kNN | $-0.887$ | Better than linear regression and XGBoost, still below baseline |
| Random Forest | $-0.474$ | Best among the tested classes, but still below the mean baseline on average |
| GAMs | Frequent convergence failure | Not reliable enough here to rescue the metric-proxy story |
Negative $R^2$ is worth pausing over. It means the model predicts worse than simply using the average target value. When the goal is to replace human evaluation with automated proxies, “worse than the average” is not a small calibration problem. It is the polite statistical version of “please stop.”
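A small sketch shows how the number goes negative. It uses the usual definition $R^2 = 1 - SS_{res} / SS_{tot}$ with made-up ratings; how the paper computes the baseline mean inside each cross-validation fold is not specified here, so this illustrates the idea rather than the authors' exact procedure.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical mean user ratings for four explanations.
ratings = [2.0, 3.0, 3.0, 4.0]

# Baseline: always predict the average rating.
baseline_pred = [3.0, 3.0, 3.0, 3.0]

# A model whose predictions point the wrong way for the extreme cases.
model_pred = [4.0, 2.0, 4.0, 2.0]

baseline_r2 = r2_score(ratings, baseline_pred)
model_r2 = r2_score(ratings, model_pred)
print(baseline_r2)  # 0.0: the baseline scores zero by construction
print(model_r2)     # -4.0: five times the baseline's squared error
```

Zero is the score for knowing nothing but the average. Anything below it means the metrics are actively steering the prediction away from what users said.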
The paper then gives a representative case: predicting user satisfaction for the Heart Disease dataset. Linear regression models remain negative, with mean $R^2 = -0.972$. Random Forest performs better, producing positive $R^2$ values in 95 of 127 metric combinations, with a range from $-0.209$ to $0.331$ and a mean of $0.067$.
That sounds better until you interpret the magnitude. A mean $R^2$ of 0.067 explains only a small fraction of the variance. Even the best observed value, about 0.33, comes from one metric subset in one dataset-and-target setting chosen as a representative case, not from a general solution. Nonlinearity finds some signal, but not enough to justify treating the metric bundle as a user-perception machine.
The model-complexity analysis is even more useful. For both linear regression and Random Forest, adding more metrics does not monotonically improve performance. In the Random Forest case, the best performance peaks around three to four metrics, with a maximum $R^2 = 0.33$ for three metrics, and then declines as more metrics are added. The paper’s interpretation is straightforward: more metrics may add noise rather than complementary information.
This matters for enterprise XAI because internal model governance often rewards metric abundance. A longer metric table looks more rigorous. It may be less rigorous if the added metrics are not validated against the human judgment they are supposed to approximate. Seven weakly grounded numbers do not become one human-centered assessment by standing near each other.
The mismatch is not between “math” and “feelings”; it is between proxy and task
It is tempting to read this paper as another argument that human judgment is messy and metrics are too cold. That framing is too lazy. The issue is not that humans have feelings and metrics have formulas. The issue is that the formulas optimize properties that only sometimes overlap with the user’s task.
A counterfactual explanation has several possible jobs. It can help a user understand why a decision happened. It can show a path to recourse. It can support trust calibration. It can help an auditor inspect whether the model behaves reasonably. It can provide evidence in a dispute. These jobs are related, but they are not the same.
Sparsity may help when the user needs a compact reason. It may hurt when the user needs enough detail to believe the explanation. Proximity may help when the counterfactual should feel realistic. It may be irrelevant when the changed variables are hard to interpret or not actionable. Completeness may align with model internals, but a user does not necessarily care that the changed features match SHAP importance if the explanation still feels impractical.
This is why the Mushroom-versus-Obesity contrast is so revealing. In Mushroom, fewer and smaller changes appear more aligned with user ratings. In Obesity, richer explanations appear more aligned. The metric did not suddenly become good or bad. The task changed the meaning of “good.”
For business teams, the replacement belief should be:
| Common belief | Better replacement |
|---|---|
| “A good explanation has high scores on standard XAI metrics.” | “A good explanation must satisfy the user’s decision task in its domain.” |
| “Multiple metrics approximate human judgment.” | “Multiple metrics are useful diagnostics only if validated against user judgments.” |
| “Plausibility is distance to training data.” | “Plausibility also depends on domain knowledge, actionability, and presentation.” |
| “Completeness means touching important model features.” | “Completeness for a user means receiving enough relevant information to act or judge.” |
| “The same scorecard can govern every deployment.” | “Metric interpretation must be calibrated by domain, user role, and explanation purpose.” |
That last row is the governance point. A generic XAI scorecard may be acceptable as a development diagnostic. It should not be sold internally as evidence that explanations are meaningful to users.
What this means for business XAI evaluation
The paper directly shows three things. First, seven common counterfactual metrics have weak and dataset-dependent alignment with human ratings. Second, metric combinations do not reliably predict perceived quality. Third, adding more metrics can degrade predictive performance rather than improve it.
Cognaptus would infer three operational consequences from that.
First, separate engineering diagnostics from user validation. Metrics such as sparsity, proximity, closeness, and completeness still help developers inspect counterfactual generation. They can catch absurd explanations, compare algorithms, and identify failure modes. But they should be labeled as engineering diagnostics, not user-perception evidence.
Second, design validation around explanation purpose. A credit applicant, a claims adjuster, a clinician, and an internal auditor do not need the same explanation. Before choosing metrics, define what the explanation is supposed to help the user do:
| Explanation purpose | What to validate with users | Automated metrics that may help, but cannot substitute |
|---|---|---|
| Understanding a decision | Can users correctly restate the reason and identify changed variables? | Sparsity, proximity, completeness |
| Acting on recourse | Can users identify feasible next steps? | Proximity, actionability-oriented constraints, plausibility checks |
| Trust calibration | Do users trust appropriate outputs and challenge suspicious ones? | Trust Score, model uncertainty, disagreement signals |
| Audit and compliance | Can reviewers trace the logic and identify unacceptable dependencies? | Completeness, feature attribution alignment, validity checks |
| Safety-critical override | Can users decide when to intervene? | Confidence, counterfactual stability, domain-specific risk flags |
Third, run lightweight human-centered testing before treating explanations as production-ready. That does not always require a giant academic user study. A practical enterprise protocol could start with 20 to 40 target users or domain reviewers, task-based comprehension checks, satisfaction and plausibility ratings, and qualitative comments on confusing or unactionable explanations. The key is to test the explanation as used in the workflow, not as an isolated artifact admired in a notebook.
This is not anti-metric. It is pro-labeling. Metrics should say what they measure. Humans should judge what humans experience. Confusing those two is how governance decks acquire beautiful tables and very little evidence.
Boundaries: what the paper does not settle
The study is useful, but its limits matter.
The evidence comes from 167 Prolific participants, not domain professionals making real decisions under organizational constraints. That means the findings most directly apply to lay user perception. The pattern for experts may differ, especially in clinical, legal, financial, or compliance settings where background knowledge changes what “plausible” and “sufficient” mean.
The study uses three datasets and one counterfactual-generation method. The authors note that a valid counterfactual should in principle be evaluated independently of the generation method, but generalization would be stronger if alternative generators were tested. The Heart Disease condition also has only 25 explanations, which limits the ability to detect small or medium effects.
The power analysis is explicit about this boundary. With 25 to 30 explanations per dataset, the study is powered to detect large effects, such as correlations of $|r| \geq 0.50$ or predictive models with $R^2 \geq 0.40$. It is not designed to confidently rule out smaller effects. The authors argue, reasonably, that weak effects, below about $|r| = 0.30$, are of limited practical utility if the goal is to use metrics as meaningful human-judgment proxies.
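The sample-size logic can be approximated with the textbook Fisher z formula for detecting a nonzero correlation. This is a sketch, not the authors' power analysis: the two-sided $\alpha = 0.05$ and 80% power are conventional defaults assumed here, and the paper may have used a different method or thresholds.

```python
from math import atanh, ceil
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size to detect correlation r (Fisher z method)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)


n_large = n_for_correlation(0.50)
n_medium = n_for_correlation(0.30)
print(n_large, n_medium)
```

Under these assumptions, a correlation of 0.50 needs about 30 observations and a correlation of 0.30 needs about 85. That is consistent with the article's reading: per-dataset samples of 25 to 30 explanations can catch large effects and will routinely miss medium ones.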
Finally, the paper studies perceived explanation quality, not downstream business outcomes. It does not measure whether explanations improve loan repayment behavior, medical decision quality, claim-resolution speed, or audit accuracy. Those are the outcomes companies eventually care about. The paper is one step earlier in the chain: do the standard scores line up with what users say is good? Mostly, no.
The scorecard should become a conversation starter, not the verdict
The cleanest takeaway is not “stop using XAI metrics.” That would be theatrical and unhelpful. Use them. But use them honestly.
Automated counterfactual metrics are useful for diagnosing explanation generators. They are weak evidence for user-perceived quality unless validated in context. Reporting more of them does not automatically make the evaluation more human-aligned. In this paper, it often just adds noise with better column labels.
For businesses deploying explainable AI, the practical rule is simple: every explanation metric should have a user-facing hypothesis attached to it.
If you report sparsity, say what you believe fewer changes will help the user do. If you report proximity, say why a smaller feature-space move should feel realistic in that domain. If you report completeness, say why alignment with model-important features will improve user understanding or trust. Then test that hypothesis with the people who actually have to read the explanation.
A scorecard without that loop is not governance. It is numerology with corporate formatting.
The deeper lesson is that explainability is not achieved when a model emits an explanation-shaped object. It is achieved when the right user, in the right context, can use that explanation for the right task. Metrics can help build that object. They cannot certify the experience by themselves.
That is the uncomfortable part. Also the useful part.
Cognaptus: Automate the Present, Incubate the Future.
[1] Felix Liedeker, Basil Ell, Philipp Cimiano, and Christoph Düsing, “Do Metrics for Counterfactual Explanations Align with User Perception?”, arXiv:2603.15607v1, 16 March 2026, https://arxiv.org/abs/2603.15607.