Opening — Why this matters now

Explainable AI (XAI) has quietly become a compliance requirement rather than a research curiosity. If your model touches finance, healthcare, or hiring, “explainability” is no longer optional—it is audited.

And yet, most teams still evaluate explanations using automated metrics that look mathematically clean but are rarely questioned.

This paper (fileciteturn0file0) does something mildly uncomfortable: it asks whether those metrics actually align with how humans judge explanations.

Short answer: they don’t.

Long answer: not only do they fail—they fail in ways that are inconsistent, dataset-dependent, and structurally flawed.

Background — The illusion of measurable explainability

Counterfactual explanations (CFs) have become the darling of XAI. They answer questions like:

“What would need to change for this decision to be different?”

This aligns nicely with human reasoning. It also fits neatly into optimization frameworks.

So naturally, researchers built metrics:

| Metric Type | What It Measures | Hidden Assumption |
|---|---|---|
| Sparsity | Fewer feature changes | Simpler = better |
| Proximity | Small numerical change | Closer = more realistic |
| Closeness | Distance to training data | Data manifold = plausibility |
| Diversity | Independence of changed features | Variety = informativeness |
| Oracle / Trust | Model agreement | Consensus = correctness |
| Completeness | Alignment with feature importance | Importance = relevance |

Each metric encodes a design belief about what “good explanation” means.

The problem is obvious in hindsight: humans were not consulted.
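Part of these metrics' appeal is how trivially computable they are. A minimal sketch of the two most common ones, sparsity and proximity (the function names and the L1 distance choice here are illustrative conventions, not the paper's exact definitions):

```python
import numpy as np

def sparsity(x, x_cf):
    """Fraction of features the counterfactual leaves unchanged."""
    return float(np.mean(np.isclose(x, x_cf)))

def proximity(x, x_cf):
    """L1 distance between instance and counterfactual (smaller = 'closer')."""
    return float(np.sum(np.abs(x - x_cf)))

x = np.array([1.0, 0.0, 3.5, 2.0])     # original instance
x_cf = np.array([1.0, 1.0, 3.5, 2.0])  # counterfactual: one feature changed
print(sparsity(x, x_cf))   # 0.75
print(proximity(x, x_cf))  # 1.0
```

Both reduce an explanation to geometry over feature vectors, which is exactly the design belief the paper challenges.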

Analysis — What the paper actually tests

The authors run a controlled experiment across three datasets:

  • Mushroom (binary, intuitive features)
  • Obesity (multi-class, lifestyle variables)
  • Heart disease (clinical data)

They generate counterfactual explanations using a standard method and then:

  1. Ask 167 human participants to rate explanations across five dimensions:

    • Accuracy
    • Understandability
    • Plausibility
    • Sufficiency
    • Satisfaction
  2. Compute seven standard metrics for the same explanations

  3. Compare:

    • Metric ↔ human correlation
    • Metric combinations → predictive models

To simplify interpretation, they aggregate human ratings into a Combined Quality Score (CQS).
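The aggregation-and-comparison step is simple in outline. A sketch of how one could compute a CQS and correlate an automated metric against it, using synthetic ratings and Spearman correlation as one reasonable choice (the paper's exact statistics and data shapes may differ):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)

# 50 explanations, each rated on the five dimensions (1-5 Likert scale)
ratings = rng.integers(1, 6, size=(50, 5)).astype(float)
cqs = ratings.mean(axis=1)  # Combined Quality Score: per-explanation average

# A hypothetical automated metric score for the same 50 explanations
metric = rng.uniform(size=50)

rho, p_value = spearmanr(metric, cqs)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

With real data, `rho` near zero for most metrics is the headline result below.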

This is where things begin to unravel.

Findings — The numbers refuse to cooperate

1. Correlation is weak (and inconsistent)

Across datasets, most metrics show near-zero correlation with human perception.

| Metric | Correlation with CQS (overall) |
|---|---|
| Trust Score | ~0.30 (the only meaningful signal) |
| All others | < 0.10 (negligible) |

Even worse, direction flips depending on context:

  • In the Mushroom dataset, users prefer simpler explanations (fewer changes)
  • In the Obesity dataset, users prefer richer explanations (more features, more detail)
  • In the Heart dataset, nothing consistently works

Same metric. Opposite preference.

That is not noise. That is structural mismatch.


2. Combining metrics does not fix it

A reasonable assumption:

“If one metric is weak, combining several should approximate human judgment.”

The paper tests this exhaustively (127 metric combinations, multiple models).

Results:

| Model Type | Mean R² (Predicting Human Ratings) |
|---|---|
| Linear Regression | ~ -1.25 (worse than always predicting the mean) |
| kNN | ~ -0.89 |
| Random Forest | ~ 0.07 (barely useful) |
| XGBoost | ~ -1.87 (overfits, fails) |

Even the best model explains only a small fraction of variance.

Adding more metrics actually degrades performance after ~3–4 features.

Translation: the metrics are not complementary—they are collectively misaligned.
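The exhaustive search over metric subsets is easy to reproduce in outline. A sketch with synthetic data, using plain linear regression with cross-validated R² to stand in for the paper's full model comparison (sizes, seeds, and model choice here are illustrative assumptions):

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_metrics = 7
X = rng.normal(size=(120, n_metrics))  # automated metric values per explanation
y = rng.normal(size=120)               # human CQS target (pure noise in this sketch)

# All 2^7 - 1 = 127 non-empty metric combinations
scores = {}
for k in range(1, n_metrics + 1):
    for combo in itertools.combinations(range(n_metrics), k):
        r2 = cross_val_score(LinearRegression(), X[:, list(combo)], y,
                             cv=5, scoring="r2").mean()
        scores[combo] = r2

best_combo = max(scores, key=scores.get)
print(len(scores), best_combo)
```

When the features genuinely carry no signal about the target, cross-validated R² goes negative, which is exactly the pattern the paper reports for most model-metric combinations.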


3. Human perception is multi-dimensional (metrics are not)

The study shows strong internal consistency in human ratings (Cronbach’s α = 0.88), meaning:

Humans do have a coherent notion of explanation quality.

But it is:

  • Context-dependent
  • Task-dependent
  • Psychologically grounded

Metrics, by contrast, are:

  • Static
  • Optimization-driven
  • Detached from user cognition

You are measuring geometry. Users are judging meaning.
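Cronbach's α is the standard internal-consistency check behind that claim. A minimal sketch of the textbook formula (the standard definition, not code from the paper):

```python
import numpy as np

def cronbach_alpha(ratings):
    """ratings: (n_respondents, n_items) matrix of Likert scores."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    sum_item_var = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - sum_item_var / total_var)

# Perfectly consistent raters: every item moves together, so alpha = 1.0
consistent = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])
print(cronbach_alpha(consistent))  # 1.0
```

An α of 0.88 across five rating dimensions means participants' judgments hang together: the disagreement is between humans and metrics, not among humans.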

Implications — This is not a minor calibration issue

This is where most teams underestimate the problem.

1. Evaluation pipelines are fundamentally misaligned

If your XAI evaluation relies on:

  • Sparsity
  • Proximity
  • Plausibility (as defined by distance)

Then you are optimizing for:

“What is easy to compute”

Not:

“What users actually understand or trust”


2. Compliance risk is quietly increasing

Regulators increasingly require:

  • Transparency
  • Justifiability
  • User-understandable explanations

If your internal metrics do not reflect human perception, then:

Your system may pass internal validation while failing external scrutiny.

That is not a technical bug. That is governance failure.


3. Agentic AI systems will amplify this gap

In static models, bad explanations are tolerable.

In agentic systems (decision loops, autonomous workflows), explanations become:

  • Feedback signals
  • Decision justifications
  • Human override triggers

If those explanations are misaligned with human perception, you get:

  • Misplaced trust
  • Delayed intervention
  • Systemic risk accumulation

In other words: explanation quality becomes a control problem.


4. The next frontier is not better metrics—it is human-grounded metrics

The paper subtly suggests a shift:

| Old Paradigm | Emerging Direction |
|---|---|
| Metric-driven evaluation | Human-centered evaluation |
| Proxy optimization | Perception-aligned validation |
| Static scoring | Context-aware explanation quality |

Future metrics will likely incorporate:

  • Actionability (can I do something with this?)
  • Cognitive load (can I understand it quickly?)
  • Trust calibration (does it feel reliable?)

Notice how none of these are purely mathematical.

That is the point.

Conclusion — Stop trusting your explanation metrics

This paper does not argue that metrics are useless.

It argues something more uncomfortable:

Metrics are currently answering the wrong question.

They measure:

  • Distance
  • Sparsity
  • Agreement

Humans care about:

  • Meaning
  • Plausibility
  • Actionability

Until those two converge, “explainable AI” will remain—ironically—misunderstood.

And if your system relies on explanations for trust, compliance, or control, that mismatch is not academic.

It is operational risk.


Cognaptus: Automate the Present, Incubate the Future.