## Opening — Why this matters now
Explainable AI (XAI) has quietly become a compliance requirement rather than a research curiosity. If your model touches finance, healthcare, or hiring, “explainability” is no longer optional—it is audited.
And yet, most teams still evaluate explanations using automated metrics that look mathematically clean but are rarely questioned.
This paper does something mildly uncomfortable: it asks whether those metrics actually align with how humans judge explanations.
Short answer: they don’t.
Long answer: not only do they fail—they fail in ways that are inconsistent, dataset-dependent, and structurally flawed.
## Background — The illusion of measurable explainability
Counterfactual explanations (CFs) have become the darling of XAI. They answer questions like:
“What would need to change for this decision to be different?”
This aligns nicely with human reasoning. It also fits neatly into optimization frameworks.
So naturally, researchers built metrics:
| Metric Type | What it Measures | Hidden Assumption |
|---|---|---|
| Sparsity | Fewer feature changes | Simpler = better |
| Proximity | Small numerical change | Closer = more realistic |
| Closeness | Distance to training data | Data manifold = plausibility |
| Diversity | Independence of changed features | Variety = informativeness |
| Oracle / Trust | Model agreement | Consensus = correctness |
| Completeness | Alignment with feature importance | Importance = relevance |
Each metric encodes a design belief about what “good explanation” means.
The problem is obvious in hindsight: humans were not consulted.
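To make those design beliefs concrete, here is a minimal sketch of how two of these metrics are typically computed for a counterfactual. These are generic textbook definitions, not the paper's exact formulas; the tolerance and the sample vectors are illustrative.

```python
import numpy as np

def sparsity(x, x_cf, tol=1e-8):
    """Fraction of features left unchanged; higher = sparser explanation."""
    changed = np.abs(x - x_cf) > tol
    return 1.0 - changed.mean()

def proximity(x, x_cf):
    """Negative L1 distance to the counterfactual; closer to 0 = nearer."""
    return -np.abs(x - x_cf).sum()

x = np.array([0.2, 1.0, 3.5, 0.0])      # original instance (made up)
x_cf = np.array([0.2, 1.0, 2.1, 0.0])   # counterfactual: one feature changed
print(sparsity(x, x_cf))    # → 0.75 (three of four features untouched)
print(proximity(x, x_cf))   # ≈ -1.4
```

Both numbers are pure geometry on feature vectors; nothing in them knows whether the change is meaningful to a person.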
## Analysis — What the paper actually tests
The authors run a controlled experiment across three datasets:
- Mushroom (binary, intuitive features)
- Obesity (multi-class, lifestyle variables)
- Heart disease (clinical data)
They generate counterfactual explanations using a standard method and then:
1. Ask 167 human participants to rate explanations across five dimensions:
   - Accuracy
   - Understandability
   - Plausibility
   - Sufficiency
   - Satisfaction
2. Compute seven standard metrics for the same explanations
3. Compare:
   - Metric ↔ human correlation
   - Metric combinations → predictive models
To simplify interpretation, they aggregate human ratings into a Combined Quality Score (CQS).
This is where things begin to unravel.
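The paper's exact aggregation formula isn't reproduced here, but a plausible version of such a combined score is simply the mean of standardized ratings across the five dimensions. A sketch with made-up ratings, assuming equal weights per dimension:

```python
import numpy as np

# Hypothetical ratings: rows = explanations, columns = the five human
# dimensions (accuracy, understandability, plausibility, sufficiency,
# satisfaction), each on a 1-5 scale.
ratings = np.array([
    [4, 5, 4, 3, 4],
    [2, 2, 3, 2, 1],
    [5, 4, 5, 5, 4],
])

# z-score each dimension, then average across dimensions
# → one combined quality score per explanation
z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0)
cqs = z.mean(axis=1)
print(cqs)
```

Standardizing first keeps a single generous dimension (say, satisfaction) from dominating the aggregate.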
## Findings — The numbers refuse to cooperate

### 1. Correlation is weak (and inconsistent)
Across datasets, most metrics show near-zero correlation with human perception.
| Metric | Correlation with CQS (overall) |
|---|---|
| Trust Score | ~0.30 (only meaningful signal) |
| Others | < 0.10 (negligible) |
Even worse, direction flips depending on context:
- In the Mushroom dataset, users prefer simpler explanations (fewer changes)
- In the Obesity dataset, users prefer richer explanations (more features, more detail)
- In the Heart dataset, nothing consistently works
Same metric. Opposite preference.
That is not noise. That is structural mismatch.
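Checking your own pipeline for this mismatch is cheap: rank-correlate each metric against human scores per dataset and watch for sign flips. A sketch on synthetic data (the arrays and the alignment strengths are fabricated for illustration):

```python
import numpy as np

def spearman(a, b):
    """Rank correlation (no tie handling; fine for continuous scores)."""
    ranks = lambda v: np.argsort(np.argsort(v)).astype(float)
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

rng = np.random.default_rng(0)
human_cqs = rng.normal(size=50)                    # synthetic human scores
metric_a = 0.3 * human_cqs + rng.normal(size=50)   # weakly aligned metric
metric_b = -0.3 * human_cqs + rng.normal(size=50)  # anti-aligned metric

for name, m in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(f"{name}: rho = {spearman(m, human_cqs):+.2f}")
```

Run this per dataset, not pooled: the paper's core finding is precisely that the sign and strength of these correlations change with context.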
### 2. Combining metrics does not fix it
A reasonable assumption:
“If one metric is weak, combining several should approximate human judgment.”
The paper tests this exhaustively (127 metric combinations, multiple models).
Results:
| Model Type | Mean R² (Predicting Human Ratings) |
|---|---|
| Linear Regression | ~ -1.25 (worse than predicting the mean) |
| kNN | ~ -0.89 |
| Random Forest | ~ 0.07 (barely useful) |
| XGBoost | ~ -1.87 (overfits, fails) |
Even the best model explains only a small fraction of variance.
Adding more metrics actually degrades performance after ~3–4 features.
Translation: the metrics are not complementary—they are collectively misaligned.
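The exhaustive search is easy to replicate in spirit: 127 is exactly the 2⁷ − 1 non-empty subsets of seven metrics. A sketch with synthetic data, using plain-NumPy OLS and manual k-fold cross-validation in place of the paper's model suite (the metric matrix and human scores below are random, so the R² values mean nothing; the scaffold is the point):

```python
import itertools
import numpy as np

def cv_r2(X, y, k=5):
    """Mean out-of-fold R^2 of OLS; negative when worse than predicting the mean."""
    idx = np.arange(len(y))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        A = np.c_[np.ones(len(train)), X[train]]           # add intercept column
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.c_[np.ones(len(fold)), X[fold]] @ coef
        ss_res = ((y[fold] - pred) ** 2).sum()
        ss_tot = ((y[fold] - y[fold].mean()) ** 2).sum()
        scores.append(1 - ss_res / ss_tot)
    return float(np.mean(scores))

rng = np.random.default_rng(1)
metrics = rng.normal(size=(120, 7))   # 7 computed metrics per explanation (synthetic)
cqs = rng.normal(size=120)            # human combined quality score (synthetic)

results = {c: cv_r2(metrics[:, c], cqs)
           for k in range(1, 8)
           for c in itertools.combinations(range(7), k)}
print(len(results))   # → 127 non-empty metric subsets
```

Cross-validation matters here: it is exactly what exposes the overfitting that drags XGBoost's R² below zero in the paper's results.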
### 3. Human perception is multi-dimensional (metrics are not)
The study shows strong internal consistency in human ratings (Cronbach’s α = 0.88), meaning:
Humans do have a coherent notion of explanation quality.
But it is:
- Context-dependent
- Task-dependent
- Psychologically grounded
Metrics, by contrast, are:
- Static
- Optimization-driven
- Detached from user cognition
You are measuring geometry. Users are judging meaning.
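Cronbach's α is worth computing on your own user-study data before trusting any aggregate score. The standard formula is α = k/(k−1) · (1 − Σσ²ᵢ / σ²ₜₒₜₐₗ), where the σ²ᵢ are per-item variances and σ²ₜₒₜₐₗ is the variance of the summed score. A sketch with hypothetical ratings (the numbers are made up; only the formula is standard):

```python
import numpy as np

def cronbach_alpha(ratings):
    """ratings: (n_respondents, k_items). Internal consistency of the k items."""
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1)        # variance of each item
    total_var = ratings.sum(axis=1).var(ddof=1)    # variance of the summed score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-dimension ratings from five participants
ratings = np.array([
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
    [5, 4, 5, 5, 4],
    [3, 3, 3, 4, 3],
    [1, 2, 1, 1, 2],
])
print(round(cronbach_alpha(ratings), 2))   # → 0.97
```

A high α (the paper reports 0.88) says the human dimensions move together, which is what licenses collapsing them into a single quality score in the first place.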
## Implications — This is not a minor calibration issue
This is where most teams underestimate the problem.
### 1. Evaluation pipelines are fundamentally misaligned
If your XAI evaluation relies on:
- Sparsity
- Proximity
- Plausibility (as defined by distance)
Then you are optimizing for:
“What is easy to compute”
Not:
“What users actually understand or trust”
### 2. Compliance risk is quietly increasing
Regulators increasingly require:
- Transparency
- Justifiability
- User-understandable explanations
If your internal metrics do not reflect human perception, then:
Your system may pass internal validation while failing external scrutiny.
That is not a technical bug. That is governance failure.
### 3. Agentic AI systems will amplify this gap
In static models, bad explanations are tolerable.
In agentic systems (decision loops, autonomous workflows), explanations become:
- Feedback signals
- Decision justifications
- Human override triggers
If those explanations are misaligned with human perception, you get:
- Misplaced trust
- Delayed intervention
- Systemic risk accumulation
In other words: explanation quality becomes a control problem.
### 4. The next frontier is not better metrics—it is human-grounded metrics
The paper subtly suggests a shift:
| Old Paradigm | Emerging Direction |
|---|---|
| Metric-driven evaluation | Human-centered evaluation |
| Proxy optimization | Perception-aligned validation |
| Static scoring | Context-aware explanation quality |
Future metrics will likely incorporate:
- Actionability (can I do something with this?)
- Cognitive load (can I understand it quickly?)
- Trust calibration (does it feel reliable?)
Notice how none of these are purely mathematical.
That is the point.
## Conclusion — Stop trusting your explanation metrics
This paper does not argue that metrics are useless.
It argues something more uncomfortable:
Metrics are currently answering the wrong question.
They measure:
- Distance
- Sparsity
- Agreement
Humans care about:
- Meaning
- Plausibility
- Actionability
Until those two converge, “explainable AI” will remain—ironically—misunderstood.
And if your system relies on explanations for trust, compliance, or control, that mismatch is not academic.
It is operational risk.
Cognaptus: Automate the Present, Incubate the Future.