## Opening — Why this matters now
Explainable AI (XAI) has quietly become a compliance requirement rather than a research curiosity. If your model touches finance, healthcare, or hiring, “explainability” is no longer optional—it is audited.
And yet, most teams still evaluate explanations using automated metrics that look mathematically clean but are rarely questioned.
This paper does something mildly uncomfortable: it asks whether those metrics actually align with how humans judge explanations.
Short answer: they don’t.
Long answer: not only do they fail—they fail in ways that are inconsistent, dataset-dependent, and structurally flawed.
## Background — The illusion of measurable explainability
Counterfactual explanations (CFs) have become the darling of XAI. They answer questions like:
“What would need to change for this decision to be different?”
This aligns nicely with human reasoning. It also fits neatly into optimization frameworks.
So naturally, researchers built metrics:
| Metric Type | What it Measures | Hidden Assumption |
|---|---|---|
| Sparsity | Fewer feature changes | Simpler = better |
| Proximity | Small numerical change | Closer = more realistic |
| Closeness | Distance to training data | Data manifold = plausibility |
| Diversity | Independence of changed features | Variety = informativeness |
| Oracle / Trust | Model agreement | Consensus = correctness |
| Completeness | Alignment with feature importance | Importance = relevance |
Each metric encodes a design belief about what “good explanation” means.
The problem is obvious in hindsight: humans were not consulted.
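To make those design beliefs concrete, here is a minimal sketch of how two of these metrics are typically computed for a counterfactual. These are generic textbook definitions, not the paper's exact formulas; the tolerance and the sample vectors are illustrative.

```python
import numpy as np

def sparsity(x, x_cf, tol=1e-8):
    """Fraction of features left unchanged; higher = sparser explanation."""
    changed = np.abs(x - x_cf) > tol
    return 1.0 - changed.mean()

def proximity(x, x_cf):
    """Negative L1 distance to the counterfactual; closer to 0 = nearer."""
    return -np.abs(x - x_cf).sum()

x = np.array([0.2, 1.0, 3.5, 0.0])      # original instance (made up)
x_cf = np.array([0.2, 1.0, 2.1, 0.0])   # counterfactual: one feature changed
print(sparsity(x, x_cf))    # → 0.75 (three of four features untouched)
print(proximity(x, x_cf))   # ≈ -1.4
```

Both numbers are pure geometry on feature vectors; nothing in them knows whether the change is meaningful to a person.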
## Analysis — What the paper actually tests
The authors run a controlled experiment across three datasets:
- Mushroom (binary, intuitive features)
- Obesity (multi-class, lifestyle variables)
- Heart disease (clinical data)
They generate counterfactual explanations using a standard method and then:
1. Ask 167 human participants to rate explanations across five dimensions:
   - Accuracy
   - Understandability
   - Plausibility
   - Sufficiency
   - Satisfaction
2. Compute seven standard metrics for the same explanations
3. Compare:
   - Metric ↔ human correlation
   - Metric combinations → predictive models
To simplify interpretation, they aggregate human ratings into a Combined Quality Score (CQS).
This is where things begin to unravel.
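The paper's exact aggregation formula isn't reproduced here, but a plausible version of such a combined score is simply the mean of standardized ratings across the five dimensions. A sketch with made-up ratings, assuming equal weights per dimension:

```python
import numpy as np

# Hypothetical ratings: rows = explanations, columns = the five human
# dimensions (accuracy, understandability, plausibility, sufficiency,
# satisfaction), each on a 1-5 scale.
ratings = np.array([
    [4, 5, 4, 3, 4],
    [2, 2, 3, 2, 1],
    [5, 4, 5, 5, 4],
])

# z-score each dimension, then average across dimensions
# → one combined quality score per explanation
z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0)
cqs = z.mean(axis=1)
print(cqs)
```

Standardizing first keeps a single generous dimension (say, satisfaction) from dominating the aggregate.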
## Findings — The numbers refuse to cooperate

### 1. Correlation is weak (and inconsistent)
Across datasets, most metrics show near-zero correlation with human perception.
| Metric | Correlation with CQS (overall) |
|---|---|
| Trust Score | ~0.30 (only meaningful signal) |
| Others | < 0.10 (negligible) |
Even worse, direction flips depending on context:
- In the Mushroom dataset, users prefer simpler explanations (fewer changes)
- In the Obesity dataset, users prefer richer explanations (more features, more detail)
- In the Heart dataset, nothing consistently works
Same metric. Opposite preference.
That is not noise. That is structural mismatch.
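Checking your own pipeline for this mismatch is cheap: rank-correlate each metric against human scores per dataset and watch for sign flips. A sketch on synthetic data (the arrays and the alignment strengths are fabricated for illustration):

```python
import numpy as np

def spearman(a, b):
    """Rank correlation (no tie handling; fine for continuous scores)."""
    ranks = lambda v: np.argsort(np.argsort(v)).astype(float)
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

rng = np.random.default_rng(0)
human_cqs = rng.normal(size=50)                    # synthetic human scores
metric_a = 0.3 * human_cqs + rng.normal(size=50)   # weakly aligned metric
metric_b = -0.3 * human_cqs + rng.normal(size=50)  # anti-aligned metric

for name, m in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(f"{name}: rho = {spearman(m, human_cqs):+.2f}")
```

Run this per dataset, not pooled: the paper's core finding is precisely that the sign and strength of these correlations change with context.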
### 2. Combining metrics does not fix it
A reasonable assumption:
“If one metric is weak, combining several should approximate human judgment.”
The paper tests this exhaustively (127 metric combinations, multiple models).
Results:
| Model Type | Mean R² (Predicting Human Ratings) |
|---|---|
| Linear Regression | ~ -1.25 (worse than predicting the mean) |
| kNN | ~ -0.89 |
| Random Forest | ~ 0.07 (barely useful) |
| XGBoost | ~ -1.87 (overfits, fails) |
Even the best model explains only a small fraction of variance.
Adding more metrics actually degrades performance after ~3–4 features.
Translation: the metrics are not complementary—they are collectively misaligned.
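The exhaustive search is easy to replicate in spirit: 127 is exactly the 2⁷ − 1 non-empty subsets of seven metrics. A sketch with synthetic data, using plain-NumPy OLS and manual k-fold cross-validation in place of the paper's model suite (the metric matrix and human scores below are random, so the R² values mean nothing; the scaffold is the point):

```python
import itertools
import numpy as np

def cv_r2(X, y, k=5):
    """Mean out-of-fold R^2 of OLS; negative when worse than predicting the mean."""
    idx = np.arange(len(y))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        A = np.c_[np.ones(len(train)), X[train]]           # add intercept column
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.c_[np.ones(len(fold)), X[fold]] @ coef
        ss_res = ((y[fold] - pred) ** 2).sum()
        ss_tot = ((y[fold] - y[fold].mean()) ** 2).sum()
        scores.append(1 - ss_res / ss_tot)
    return float(np.mean(scores))

rng = np.random.default_rng(1)
metrics = rng.normal(size=(120, 7))   # 7 computed metrics per explanation (synthetic)
cqs = rng.normal(size=120)            # human combined quality score (synthetic)

results = {c: cv_r2(metrics[:, c], cqs)
           for k in range(1, 8)
           for c in itertools.combinations(range(7), k)}
print(len(results))   # → 127 non-empty metric subsets
```

Cross-validation matters here: it is exactly what exposes the overfitting that drags XGBoost's R² below zero in the paper's results.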
### 3. Human perception is multi-dimensional (metrics are not)
The study shows strong internal consistency in human ratings (Cronbach’s α = 0.88), meaning:
Humans do have a coherent notion of explanation quality.
But it is:
- Context-dependent
- Task-dependent
- Psychologically grounded
Metrics, by contrast, are:
- Static
- Optimization-driven
- Detached from user cognition
You are measuring geometry. Users are judging meaning.
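Cronbach's α is worth computing on your own user-study data before trusting any aggregate score. The standard formula is α = k/(k−1) · (1 − Σσ²ᵢ / σ²ₜₒₜₐₗ), where the σ²ᵢ are per-item variances and σ²ₜₒₜₐₗ is the variance of the summed score. A sketch with hypothetical ratings (the numbers are made up; only the formula is standard):

```python
import numpy as np

def cronbach_alpha(ratings):
    """ratings: (n_respondents, k_items). Internal consistency of the k items."""
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1)        # variance of each item
    total_var = ratings.sum(axis=1).var(ddof=1)    # variance of the summed score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-dimension ratings from five participants
ratings = np.array([
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
    [5, 4, 5, 5, 4],
    [3, 3, 3, 4, 3],
    [1, 2, 1, 1, 2],
])
print(round(cronbach_alpha(ratings), 2))   # → 0.97
```

A high α (the paper reports 0.88) says the human dimensions move together, which is what licenses collapsing them into a single quality score in the first place.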
## Implications — This is not a minor calibration issue
This is where most teams underestimate the problem.
### 1. Evaluation pipelines are fundamentally misaligned
If your XAI evaluation relies on:
- Sparsity
- Proximity
- Plausibility (as defined by distance)
Then you are optimizing for:
“What is easy to compute”
Not:
“What users actually understand or trust”
### 2. Compliance risk is quietly increasing
Regulators increasingly require:
- Transparency
- Justifiability
- User-understandable explanations
If your internal metrics do not reflect human perception, then:
Your system may pass internal validation while failing external scrutiny.
That is not a technical bug. That is governance failure.
### 3. Agentic AI systems will amplify this gap
In static models, bad explanations are tolerable.
In agentic systems (decision loops, autonomous workflows), explanations become:
- Feedback signals
- Decision justifications
- Human override triggers
If those explanations are misaligned with human perception, you get:
- Misplaced trust
- Delayed intervention
- Systemic risk accumulation
In other words: explanation quality becomes a control problem.
### 4. The next frontier is not better metrics—it is human-grounded metrics
The paper subtly suggests a shift:
| Old Paradigm | Emerging Direction |
|---|---|
| Metric-driven evaluation | Human-centered evaluation |
| Proxy optimization | Perception-aligned validation |
| Static scoring | Context-aware explanation quality |
Future metrics will likely incorporate:
- Actionability (can I do something with this?)
- Cognitive load (can I understand it quickly?)
- Trust calibration (does it feel reliable?)
Notice how none of these are purely mathematical.
That is the point.
## Conclusion — Stop trusting your explanation metrics
This paper does not argue that metrics are useless.
It argues something more uncomfortable:
Metrics are currently answering the wrong question.
They measure:
- Distance
- Sparsity
- Agreement
Humans care about:
- Meaning
- Plausibility
- Actionability
Until those two converge, “explainable AI” will remain—ironically—misunderstood.
And if your system relies on explanations for trust, compliance, or control, that mismatch is not academic.
It is operational risk.
Cognaptus: Automate the Present, Incubate the Future.