Counterfactuals, Concepts, and Causality: XAI Finally Gets Its Act Together

Explanations should answer the question people actually ask

Audit meeting. A model has made a decision. Someone projects a heatmap.

The highlighted pixels are around a chin, an eye, a forehead, or some other facial region that looks important because the model says it is important. Everyone nods carefully. Nobody is much wiser. The model has technically been “explained,” in the same way a smoke alarm explains fire by making noise.

The problem is not that heatmaps are useless. The problem is that many explanation tools answer a weaker question than the human question. They say, “This part of the input was associated with the prediction.” The person asking for the explanation usually wants something closer to: “Would the prediction have changed if this meaningful thing had been different?”

That is a causal question. More precisely, it is often a counterfactual causal question. And it cannot be answered just by naming a concept, drawing a colored mask, or discovering a direction in a neural network layer.

The paper “A Framework for Causal Concept-based Model Explanations” by Anna Rodum Bjøru, Jacob Lysnæs-Larsen, Oskar Jørgensen, Inga Strümke, and Helge Langseth gives that distinction a formal backbone.¹ Its central move is simple, and inconvenient in exactly the right way: if we want explanations in human concepts, and we want those explanations to carry causal meaning, then the explanation system must explicitly model how concepts cause data and how concepts relate to one another.

In other words, “concept-based” is not automatically “causal.” A model explanation that says “gray hair matters” may still be only a polished correlation. To say “changing gray hair would have been sufficient to change this model’s prediction,” the explanation method needs a causal story that survives more than a slide deck.

The mechanism has three moving parts, not one magic explainer

The paper’s framework is post-hoc: it explains an already trained black-box model. It does not require redesigning the predictor into an interpretable architecture. That matters commercially, because most real systems are not rebuilt from scratch just because the governance team has discovered principles.

The framework has three components:

Component	Role in the explanation system	Why it matters
Black-box model $h$	Takes input features $x$ and outputs a prediction $\hat{y}$	This is the model being explained; the framework does not assume it is correct or fair
Concept-to-data mapping $\alpha$	Maps human-meaningful concepts $z$ plus unexplained variation $w$ into model input $x$	This lets the explainer ask what the input would look like if a concept changed
Causal model $M$ over concepts	Specifies causal relationships among concepts in $z$	This determines what else should change when one concept is intervened on

That third component is where many casual versions of XAI quietly fail. Human concepts are not independent tokens in a spreadsheet. In the CelebA examples used by the paper, changing the concept Young can affect the probabilities of Gray Hair and Glasses. Changing Makeup can affect appearance-related concepts such as Pointy Nose or Arched Eyebrows in the paper’s chosen causal graph. Whether one agrees with every edge in that graph is not the point. The point is that the graph makes the assumptions visible enough to criticize. Very rude of it, academically speaking.

The framework treats concepts as indirect causes of the prediction. The causal flow is:

$$ u \rightarrow z,\quad (z,w) \rightarrow x,\quad x \rightarrow \hat{y} $$

Here, $z$ is the explanation vocabulary, $w$ captures information in the input not represented by the chosen concepts, $x$ is the model input, and $\hat{y}$ is the model prediction. The explainer intervenes on concepts, not on raw pixels and not on the prediction itself. That is important because intervening directly on pixels usually destroys the human vocabulary, while intervening directly on the prediction is not an explanation. It is just vandalism with notation.
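To make the causal flow concrete, here is a minimal Python sketch of the three components wired together. Everything in it is a stand-in invented for illustration: the two-concept vocabulary, the probabilities inside `concept_scm`, the three-number "image" returned by `alpha`, and the linear rule inside `h`. The paper's proof-of-concept uses a StarGAN generator and a ResNet-50 classifier, not these toy functions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Concepts:
    young: int
    gray_hair: int


def concept_scm(u, do=None):
    """Toy causal model M: exogenous noise u -> concepts z.

    `do` maps a concept name to a forced value, replacing its equation.
    """
    do = do or {}
    young = do.get("young", int(u["young"] < 0.7))
    p_gray = 0.05 if young else 0.60  # hypothetical mechanism: Young -> Gray Hair
    gray_hair = do.get("gray_hair", int(u["gray_hair"] < p_gray))
    return Concepts(young, gray_hair)


def alpha(z, w):
    """Toy concept-to-data mapping: (z, w) -> x, a three-number 'image'."""
    return (1 - z.young, z.gray_hair, w)


def h(x):
    """Toy black box: returns 1 for 'young', 0 for 'old'."""
    old_evidence = 1.5 * x[0] + 1.0 * x[1] + 0.2 * x[2]
    return int(old_evidence < 1.0)


def explain(u, w, do):
    """Factual and counterfactual predictions for one instance, same u and w."""
    factual = h(alpha(concept_scm(u), w))
    counterfactual = h(alpha(concept_scm(u, do), w))
    return factual, counterfactual


rng = random.Random(0)
u = {"young": rng.random(), "gray_hair": rng.random()}
w = rng.random()
print(explain(u, w, do={"young": 0}))
```

The design point is that `explain` never edits `x` or the prediction directly. It only changes a concept, lets the causal model and the mapping regenerate the input, and then asks the unchanged black box again.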

The real explanation unit is probability of sufficiency

The paper uses probability of sufficiency as the core explanatory quantity.

For XAI, the practical question becomes:

Given this model decision, what is the probability that changing a concept would have been sufficient to change the model’s output?

For a local explanation, the paper writes this in the form:

$$ p(\hat{y}_{do(\bar{z}=\bar{z}')} = \hat{y}' \mid x, \hat{y}) $$

In plain language: for this specific instance $x$, where the model predicted $\hat{y}$, how likely is it that the model would have predicted $\hat{y}'$ if we had intervened to set the concept subset $\bar{z}$ to the new value $\bar{z}'$?

That gives a much more useful explanation sentence:

“For this image, the model would have classified the person as old with probability $p$ if the relevant age concept had been intervened to old.”

This is not the same as saying “the model looked at the eye region.” It is also not the same as saying “the concept Young is correlated with the prediction.” It is a counterfactual claim about whether a change in a human-meaningful concept would have been sufficient to move the model across its decision boundary.
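One way to see what such a probability involves is to estimate it by simulation on a toy model. The sketch below assumes a fully specified toy SCM in which Young is a parent of Gray Hair and Glasses, with threshold mechanisms and made-up probabilities (`P_GRAY`, `P_GLASSES`), plus a made-up linear classifier `h`. It follows the standard abduction, action, prediction recipe for counterfactuals. It is not the paper's implementation, and the numbers are not the paper's.

```python
import random

# Hypothetical mechanisms: P(concept = 1 | young), keyed by the value of young.
P_GRAY = {1: 0.05, 0: 0.60}
P_GLASSES = {1: 0.10, 0: 0.40}


def downstream(young, u_gray, u_glasses):
    """Structural equations for the children of Young (threshold on noise)."""
    return int(u_gray < P_GRAY[young]), int(u_glasses < P_GLASSES[young])


def h(young, gray, glasses, w):
    """Toy black box: returns 1 for 'young', 0 for 'old'."""
    score = 0.5 * (1 - young) + 0.5 * gray + 0.3 * glasses + 0.2 * w
    return int(score < 1.0)


def local_sufficiency(obs, w, n=20000, seed=0):
    """Estimate p(y_hat under do(young=0) = old | observed instance)."""
    rng = random.Random(seed)
    flips = kept = 0
    while kept < n:
        u_gray, u_glasses = rng.random(), rng.random()
        # Abduction by rejection: keep only noise consistent with the observation.
        if downstream(obs["young"], u_gray, u_glasses) != (obs["gray"], obs["glasses"]):
            continue
        kept += 1
        # Action: force young = 0 and recompute its descendants with the same noise.
        gray_cf, glasses_cf = downstream(0, u_gray, u_glasses)
        # Prediction: ask the unchanged black box about the counterfactual input.
        if h(0, gray_cf, glasses_cf, w) == 0:
            flips += 1
    return flips / kept


observed = {"young": 1, "gray": 0, "glasses": 0}
print(round(local_sufficiency(observed, w=0.5), 3))
```

In this toy setup the classifier flips to old only when the counterfactual gains gray hair, so the estimate should land near 0.55 / 0.95, which is the chance that the retained noise falls below the old-age gray-hair threshold.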

That distinction becomes especially useful for business-facing AI governance. A compliance, risk, or model validation team rarely wants a poetic tour of latent space. It wants to know whether a decision is sensitive to concepts that are relevant, irrelevant, prohibited, unstable, or suspiciously proxy-like.

The paper also uses probability of sufficiency for two explanation families:

Explanation type	How the framework uses probability of sufficiency	Operational interpretation
Concept attribution	Treats sufficiency as an importance score for an intervention	“How strongly would changing this concept move the model toward another output?”
Contrastive explanation	Produces a counterfactual-style explanation with an explicit probability	“The decision would have been different with probability $p$ if these concepts had been different.”

This is cleaner than many concept attribution methods because the score refers to crossing a decision boundary under a modeled intervention. It is not merely a directional sensitivity score computed from observed correlations.

Local examples show why counterfactuals beat decorative heatmaps

The paper tests the framework on two CelebA image classification tasks: whether a face is classified as Young, and whether a face is classified as Attractive. Each classifier is a ResNet-50, and the authors report accuracies above 80%. The explanation system uses a selected subset of CelebA binary attributes as the concept vocabulary, a causal model over those concepts, and a StarGAN-based generator to create concept-level counterfactual images.

The local examples are not the main “benchmark result.” They are proof-of-concept demonstrations. Their likely purpose is to show the mechanism in action: how a concept intervention becomes a generated counterfactual image, how that image is passed through the black-box classifier, and how the result becomes a probability of sufficiency.

In one example, adding glasses is not sufficient to change the classifier’s “young” prediction, while changing hair color to gray is sufficient for one male subject to be classified as old. When the intervention is broader—setting Young = 0—the causal model implies changes in descendant concepts such as Gray Hair and Glasses, and the framework marginalizes over those downstream uncertainties. For two example subjects, the paper reports:

$$ p(\hat{y}_{do(Young=0)} = 0 \mid x, \hat{y}=1) = 1 $$

That means the model would classify both subjects as old if they were counterfactually made old under the explanation model. This is exactly the sort of sentence a non-technical reviewer can understand, provided the assumptions behind “made old” are documented.

A second local example is more interesting because it exposes the cost of causal assumptions. For one image classified as young, the paper generates several counterfactual versions consistent with Young = 0. One counterfactual version remains incorrectly classified as young, while others are classified as old. Different causal model choices produce different sufficiency estimates:

Model representation	Reported value for changing prediction from young to old under $do(Young=0)$	Likely purpose of test
Fully specified SCM	0.121	Main causal calculation
Canonical partially specified SCM	[0.121, 0.138]	Uncertainty interval when the full mechanism is not fixed
Independent concepts	0.000	Sensitivity/comparison against a simplifying assumption
CBN interventional query	0.137	Comparison with an intervention-only approximation, not the same counterfactual query

This table is doing more than reporting numbers. It shows why causal concept explanation cannot be reduced to “change one concept and observe.” If we pretend concepts are independent, the estimate can collapse to zero. If we use a Causal Bayesian Network without a full SCM, we can compute an interventional query, but not the same counterfactual query. In this local case, the CBN value falls inside the PSCM interval, but toward the opposite end from the FSCM value. The paper’s interpretation is appropriately restrained: interventional approximations are not always reliable substitutes for counterfactual explanation.
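The gap between those rows can be reproduced in miniature. The sketch below assumes a single downstream concept (gray hair) generated by a threshold mechanism on uniform noise, an observed subject who is young without gray hair, and a classifier that flips to old exactly when the counterfactual has gray hair. The probabilities are hypothetical and chosen only to show how the three calculations differ.

```python
from fractions import Fraction


def sufficiency_three_ways(p_gray_young, p_gray_old):
    """Compare three answers to 'would do(young=0) flip the prediction?'

    Assumed mechanism: gray = 1 if u < p(young), with u uniform on [0, 1).
    Observed instance: young = 1, gray = 0, so u lies in [p_gray_young, 1).
    """
    # Fully specified SCM: condition on what the observation says about u.
    fscm = max(p_gray_old - p_gray_young, 0) / (1 - p_gray_young)
    # CBN interventional query: ignore the instance and draw fresh noise.
    cbn = p_gray_old
    # Independent concepts: gray hair keeps its observed value, so no flip.
    independent = Fraction(0)
    return fscm, cbn, independent


fscm, cbn, independent = sufficiency_three_ways(Fraction(1, 20), Fraction(3, 5))
print(fscm, cbn, independent)
```

The interventional value is close to the counterfactual one here, but it answers a population question rather than a question about this subject, and the independence shortcut returns zero for the same reason it does in the table: it never lets the downstream concept move.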

That is a useful governance lesson. A simplified explainer may still produce a number. The number may even look satisfyingly precise. The question is whether it answers the same question.

Grad-CAM enters the paper as a comparison with a familiar attribution method, not as an enemy to be ceremonially defeated. The paper shows Grad-CAM highlighting regions such as the chin, the eyes, or the forehead in the relevant images. The authors’ point is not that saliency is always bad. It is that saliency is non-causal and pixel-based. It can indicate where the model found evidence, but it does not say what meaningful concept would need to change for the decision to change.

That is the gap this framework is built to fill.

Global explanations turn one-off counterfactuals into subgroup diagnostics

Local explanations are useful when a single decision must be reviewed. Global and subgroup explanations are more useful when a business wants to audit model behavior before deployment.

The paper’s global examples ask whether changing a single concept would be sufficient to change the model’s prediction from young to old. For singleton interventions in the age classifier, the reported sufficiency values include:

Intervention	Probability of sufficiency for changing prediction from young to old
Gray Hair = 1	0.318
Glasses = 1	0.002
Makeup = 0	0.000
Young = 0	0.972

The interpretation is not subtle. In this setup, changing Young itself almost always moves the model from young to old. Gray Hair also has a meaningful effect. Glasses and Makeup do not, at least under the tested conditions.

The paper then breaks down the Gray Hair intervention by gender and glasses status. This is where the business relevance becomes obvious. The sufficiency effect is much stronger for males than females: 0.580 for males overall versus 0.181 for females overall. Among those wearing glasses, the estimate is 0.814 for males and 0.400 for females, though the paper correctly notes the limited data for some subgroups.
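The slicing step itself is mechanically simple once per-instance results exist. The sketch below assumes each record holds subgroup attributes and a `flipped` flag saying whether the intervention changed the prediction. The field names and the seven records are hypothetical, not data from the paper. The function returns the subgroup size next to each estimate, so that thin slices are visible rather than hidden.

```python
from collections import defaultdict


def sufficiency_by_group(records, keys):
    """Return {group: (share of flipped predictions, group size)}."""
    totals = defaultdict(lambda: [0, 0])  # group -> [flips, count]
    for record in records:
        group = tuple(record[k] for k in keys)
        totals[group][0] += record["flipped"]
        totals[group][1] += 1
    return {group: (flips / n, n) for group, (flips, n) in totals.items()}


# Hypothetical per-instance outcomes of one intervention.
records = [
    {"male": 1, "glasses": 0, "flipped": 1},
    {"male": 1, "glasses": 0, "flipped": 0},
    {"male": 1, "glasses": 1, "flipped": 1},
    {"male": 0, "glasses": 0, "flipped": 0},
    {"male": 0, "glasses": 0, "flipped": 0},
    {"male": 0, "glasses": 0, "flipped": 1},
    {"male": 0, "glasses": 1, "flipped": 0},
]

print(sufficiency_by_group(records, ["male"]))
print(sufficiency_by_group(records, ["male", "glasses"]))
```

A slice with one or two members can produce a dramatic score from almost no evidence, which is the caveat the paper raises about its smaller subgroups.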

This is not merely an explanation. It is a bias investigation workflow.

A model validation team could use the same structure to ask:

Governance question	Causal explanation version
Is the model sensitive to a protected or proxy concept?	“Would changing this concept be sufficient to alter the decision?”
Is sensitivity concentrated in one subgroup?	“Does the sufficiency probability differ by subgroup?”
Is the model using a concept that domain experts consider irrelevant?	“Does an irrelevant concept produce a high sufficiency score?”
Is a mitigation attempt actually changing behavior?	“Do sufficiency scores fall after retraining or data correction?”

This is where the paper’s contribution becomes more than academic tidying. A heatmap can be inspected. A concept sensitivity score can be ranked. But a probability-of-sufficiency table can be turned into a model risk control: threshold, subgroup slice, investigate, remediate, retest.

No, it does not make governance easy. It makes the hard part finally visible.

The attractiveness classifier shows why TCAV and causal sufficiency can disagree

The paper’s attractiveness examples compare causal sufficiency attributions with TCAV-style concept attribution scores. This comparison is important because TCAV is a well-known concept-based XAI method, and it already sounds more human-friendly than pixels. It uses concept directions in a model’s internal representation and measures directional sensitivity toward an output class.

But concept direction is not causal intervention.

For young people classified as unattractive, the paper estimates which singleton concept interventions would be sufficient to change the prediction to attractive, split by gender. Some examples:

Concept intervention	Gender = 0 (female)	Gender = 1 (male)
Add 5 o’Clock Shadow	0.139	0.117
Remove 5 o’Clock Shadow	—	0.000
Add Arched Eyebrows	0.034	0.003
Remove Arched Eyebrows	0.085	0.214
Add Bushy Eyebrows	0.177	0.109
Remove Bushy Eyebrows	0.000	0.000
Add Big Nose	0.000	0.004
Remove Big Nose	0.574	0.147
Add Pointy Nose	0.350	0.306
Remove Pointy Nose	0.000	0.000
Add Smiling	0.242	0.181
Remove Smiling	0.000	0.009

The authors interpret this as showing, for example, that women in the tested setup improve their chance of being classified as attractive most by removing Big Nose, while the strongest listed move for men is adding Pointy Nose if they do not already have one.
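That reading can be checked directly against the table. The short Python sketch below transcribes the values listed above and picks the highest-scoring intervention per column. The dictionary layout and the `strongest` helper are illustrative, and treating column 0 as female and column 1 as male is an assumption that follows the interpretation in the text.

```python
# Values transcribed from the table above: (Gender = 0, Gender = 1).
# None marks the cell shown as a dash.
SUFFICIENCY = {
    "Add 5 o'Clock Shadow": (0.139, 0.117),
    "Remove 5 o'Clock Shadow": (None, 0.000),
    "Add Arched Eyebrows": (0.034, 0.003),
    "Remove Arched Eyebrows": (0.085, 0.214),
    "Add Bushy Eyebrows": (0.177, 0.109),
    "Remove Bushy Eyebrows": (0.000, 0.000),
    "Add Big Nose": (0.000, 0.004),
    "Remove Big Nose": (0.574, 0.147),
    "Add Pointy Nose": (0.350, 0.306),
    "Remove Pointy Nose": (0.000, 0.000),
    "Add Smiling": (0.242, 0.181),
    "Remove Smiling": (0.000, 0.009),
}


def strongest(column):
    """Name of the intervention with the highest sufficiency in one column."""
    scored = [
        (values[column], name)
        for name, values in SUFFICIENCY.items()
        if values[column] is not None
    ]
    return max(scored)[1]


print(strongest(0))  # assumed female column
print(strongest(1))  # assumed male column
```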

That sentence is awkward. It should be awkward. We are talking about a classifier trained on a celebrity face dataset and concepts like attractiveness. If a business system produced comparable patterns about customers, borrowers, employees, or patients, the right reaction would not be “cool demo.” It would be “who approved this pipeline?”

The TCAV comparison sharpens the point. The TCAV scores identify Arched Eyebrows as highly beneficial for both genders, followed by removing Bushy Eyebrows. The causal sufficiency scores tell a different story. The paper explains why: TCAV concept directions are learned from labeled data without causal considerations, so they are more vulnerable to correlated concepts. Sufficiency scores are based on controlled concept interventions and the probability of crossing the model’s decision boundary.

The important business reading is not “TCAV is wrong.” The better reading is:

Method	What it is good at	What it does not guarantee
TCAV-style concept attribution	Finds representation-level sensitivity associated with human concepts	Does not isolate causal effects or guarantee decision-boundary crossing under intervention
Causal probability of sufficiency	Estimates whether intervening on a concept would be sufficient to change the model decision	Depends on concept vocabulary, causal graph, structural equations, and generator validity
Grad-CAM-style saliency	Shows spatial evidence associated with a prediction	Does not explain the concept-level change needed to alter the decision

This is the paper’s most practical lesson: different explanation tools answer different questions. Treating them as interchangeable because they all produce “interpretability” is how dashboards become expensive wallpaper.

Non-leaf concepts reveal why direct effects can mislead

One of the paper’s more subtle points appears in the Makeup example for the attractiveness classifier. The intervention is not on a leaf concept. In the chosen causal graph, Makeup can influence other appearance-related concepts such as Pointy Nose and Arched Eyebrows.

For young females without makeup classified as unattractive, the paper reports:

Model representation	Probability of sufficiency for $do(Makeup=1)$ changing prediction to attractive
Fully specified SCM	0.410
Canonical PSCM	[0.401, 0.442]
Independent concepts	0.353

The independent intervention underestimates the effect because it treats Makeup as isolated. The FSCM-based calculation accounts for downstream causal influence. In the paper’s example, this includes the way Makeup affects appearance-related concepts that themselves affect predicted attractiveness.
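The same underestimation shows up in a toy population small enough to enumerate by hand. The sketch below assumes a two-concept graph (Makeup is a parent of Arched Eyebrows), integer noise levels `u` and `w`, and an invented scoring rule for the classifier. None of these numbers come from the paper. It counts how often `do(makeup=1)` flips an unattractive prediction, with and without letting the downstream concept respond.

```python
from fractions import Fraction
from itertools import product


def arched(makeup, u):
    """Hypothetical mechanism: makeup raises the chance of arched eyebrows."""
    return int(u < (5 if makeup else 2))


def attractive(makeup, arched_eyebrows, w):
    """Toy black box on integer scores: 1 = attractive."""
    return int(4 * makeup + 4 * arched_eyebrows + 3 * w >= 8)


def sufficiency_do_makeup(propagate):
    """Share of makeup-free, unattractive-classified units flipped by do(makeup=1)."""
    flips = total = 0
    for u, w in product(range(10), range(3)):  # every unit in the toy population
        a = arched(0, u)
        if attractive(0, a, w):
            continue  # keep only units classified unattractive without makeup
        total += 1
        a_cf = arched(1, u) if propagate else a  # let the child respond, or freeze it
        flips += attractive(1, a_cf, w)
    return Fraction(flips, total)


print(sufficiency_do_makeup(propagate=False))  # independent-concepts shortcut
print(sufficiency_do_makeup(propagate=True))   # intervention propagated through the graph
```

Freezing the child concept discards every flip that happens because makeup changed the eyebrows, which is the mediated part of the effect.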

This is not a side detail. It is the core reason mechanism-first interpretation matters.

If a model governance team only asks, “What is the direct effect of changing this concept while holding everything else fixed?” it may miss the total causal effect under the chosen abstraction. The paper argues that total causal influence is more stable across abstraction levels than direct influence. If the vocabulary includes only Young, the effect from Young to image is direct. If the vocabulary includes Gray Hair and Glasses, some of the same effect is mediated. If the vocabulary includes all physical features, the direct effect of Young could shrink further while the total effect remains conceptually the same.

So the business question should not always be “what happens if we freeze the universe except one variable?” That question is tidy, spreadsheet-friendly, and often wrong. The better question is: “Under the causal story we are willing to defend, what changes downstream when this concept is intervened on?”

What the evidence supports, and what it does not

The examples in the paper are best read as a proof of concept plus methodological stress test, not as a claim that CelebA attractiveness classification has been made safe, fair, or socially meaningful. The paper is careful about that, and this article should be too.

Evidence item	Likely purpose	What it supports	What it does not prove
Local counterfactual images for age classification	Main mechanism demonstration	Shows how concept interventions can generate instance-level causal explanations	Does not prove generator quality is sufficient for real deployment
FSCM vs PSCM vs independence comparisons	Robustness/sensitivity to causal assumptions	Shows causal modeling choice can materially affect explanation values	Does not identify the true causal model
Grad-CAM comparison	Comparison with prior XAI style	Shows pixel saliency lacks counterfactual concept-level meaning	Does not imply saliency methods are useless for all diagnostic purposes
Global singleton intervention tables	Main subgroup-level evidence	Shows sufficiency scores can reveal concept sensitivity and subgroup differences	Does not prove those sensitivities are socially valid or acceptable
TCAV comparison	Comparison with concept attribution	Shows causal sufficiency and representation sensitivity can rank concepts differently	Does not prove TCAV is “bad”; it answers a different question
StarGAN and CelebA assumption discussion	Implementation boundary	Shows where the proof-of-concept can fail if assumptions break	Does not solve concept isolation or dataset bias

The magnitude interpretation matters. A sufficiency value of 0.318 for Gray Hair = 1 in the age classifier is not a vague “gray hair is important.” It says that, in the tested subgroup and under the explanation model, setting gray hair to true is estimated to be sufficient to flip the model from young to old about 31.8% of the time. The gender breakdown then shows that this effect is not evenly distributed.

A value of 0.972 for Young = 0 is closer to a sanity check that changing the core age concept almost always flips the age prediction. But even there, the comparison across FSCM, PSCM, and independence assumptions matters because it shows when simplifications are harmless and when they are not.

The paper’s strongest evidence is therefore not a leaderboard number. It is the way the framework forces explanations to carry a declared causal interpretation, then shows where the interpretation bends under weaker assumptions.

The business value is diagnosis, not comforting explanations

For business use, the obvious temptation is to turn this into a better explanation UI. That would be the least interesting version.

The stronger application is model governance and diagnosis. The framework suggests an audit pattern:

Define a concept vocabulary that domain experts, regulators, and users can actually understand.
Build or select a causal model over those concepts.
Generate counterfactual inputs using a concept-to-data mapping.
Estimate whether concept interventions are sufficient to change predictions.
Slice the results by subgroup.
Investigate high-sufficiency concepts that are prohibited, suspicious, unstable, or inconsistent with the intended decision logic.
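The last step of that pattern can be sketched as a simple flagging rule. The scores below are the global age-classifier values reported earlier in this article. The watchlist, its reasons, and the 0.05 threshold are hypothetical policy choices invented for illustration, not recommendations from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    intervention: str
    sufficiency: float
    reason: str


def flag(scores, watchlist, threshold):
    """Return watchlisted interventions whose sufficiency meets the threshold."""
    findings = []
    for intervention, ps in sorted(scores.items(), key=lambda kv: -kv[1]):
        concept = intervention.split(" = ")[0]
        if concept in watchlist and ps >= threshold:
            findings.append(Finding(intervention, ps, watchlist[concept]))
    return findings


# Global sufficiency values for changing the prediction from young to old.
scores = {
    "Gray Hair = 1": 0.318,
    "Glasses = 1": 0.002,
    "Makeup = 0": 0.000,
    "Young = 0": 0.972,
}

# Hypothetical policy: concepts the review team wants to scrutinize, and why.
watchlist = {
    "Gray Hair": "possible proxy, check subgroups",
    "Glasses": "expected to be irrelevant",
    "Makeup": "expected to be irrelevant",
}

for finding in flag(scores, watchlist, threshold=0.05):
    print(finding)
```

Young = 0 is left off the watchlist on purpose, since a high score there is what the classifier is supposed to produce. Deciding which concepts belong on the list is the governance work; the code only enforces the decision.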

This creates a different operating model for XAI. Instead of asking analysts to stare at explanations and “see whether they look reasonable,” the organization can ask targeted causal questions.

For example:

Business setting	Bad explanation habit	Better causal-concept question
Credit scoring	“Which features had high importance?”	“Would changing a protected proxy concept be sufficient to alter approval?”
Hiring model	“Which résumé tokens were highlighted?”	“Would changing gender-coded or school-prestige concepts flip the shortlist decision?”
Medical triage	“Which pixels or terms were salient?”	“Would changing clinically irrelevant artifacts alter the risk category?”
Fraud detection	“Which variables contributed to the score?”	“Does the model depend on operational artifacts that only exist for one customer segment?”
Customer targeting	“Which traits correlate with conversion?”	“Which concept interventions actually move the decision boundary, and for whom?”

The ROI is not “more explainability.” That phrase has been overused enough to need adult supervision. The ROI is cheaper failure discovery before deployment, more disciplined remediation after deployment, and more defensible communication when a model decision is challenged.

Still, this is not a plug-and-play compliance machine. The method requires concept labels or concept discovery, a credible concept-to-data generator, and expert causal knowledge. In many business domains, those are non-trivial assets. If the concepts are badly chosen, the causal graph is lazy, or the generator entangles the very concepts it is supposed to isolate, the explanation becomes beautifully formatted uncertainty.

A bad causal model is not magically redeemed by using the word “causal.”

The assumptions are the product

The paper is unusually useful because it does not hide the assumptions in a polite appendix and hope nobody reads them. The assumption section is where much of the practical intelligence lives.

First, the concept vocabulary must be comprehensive enough to expose the model’s decision logic. In the CelebA example, the dataset has 40 binary attributes, but only a subset is used for the explanation vocabulary. The remaining information is pushed into $w$, the unexplained variation. That is acceptable only if the chosen concepts are independent of what is left in $w$, and if the selected concepts include those that matter near the decision boundary.

This has a governance consequence. You should not choose concepts only because they are the concepts the model is supposed to use. That can blind the audit to unwanted behavior. If a model is secretly using a proxy, and the proxy concept is excluded because it is “not part of the intended logic,” congratulations: you have built an explanation method that protects your assumptions from evidence. Very efficient. Very dangerous.

Second, the causal model matters. The paper discusses edges such as Gender → Smiling and Gender → Makeup, which may represent simplified causal abstractions rather than direct physical causation. If the dependence comes from an omitted mediator, the simplified edge may be acceptable at the chosen abstraction level. If it comes from data selection bias or an unobserved confounder, the model should be adjusted.

That is exactly the kind of argument businesses will need to document. A causal explanation system is not just software. It is a governance artifact containing domain judgments.

Third, the generator matters. The proof-of-concept uses StarGAN to perform concept interventions. But the implementation sets $w=x$, which violates the framework’s independence assumption $w \perp z$. The authors rely on the generator design and visual inspection to argue that concept information is replaced appropriately. They also note the risks: generated gray hair may be poor quality because of dataset imbalance; binary concepts such as Young can compress complex age variation into crude categories; and a generator may accidentally change age while supposedly changing another concept.

These are not minor engineering details. They define the validity of the explanation. If a generated counterfactual crosses the model’s decision boundary because the generator accidentally changed an unmentioned concept, the explanation will attribute the flip to the wrong cause.

Fourth, training the predictor, generator, and parts of the explanation model from the same dataset can import the same biases into all three. The paper notes that expert causal knowledge can help detect and remove spurious correlations, such as the age-gender correlation in CelebA that the example treats as spurious and models as independent. But if data drives the explanation model too heavily, the explainer can inherit the model’s blind spots.

For business readers, the boundary is clear:

Requirement	Why it matters
Defensible concept vocabulary	Determines what can and cannot be explained
Valid concept-to-data interventions	Determines whether counterfactual inputs mean what they claim
Explicit causal graph	Determines what downstream changes follow an intervention
Structural assumptions or interval bounds	Determines whether counterfactual probabilities are point estimates or uncertainty ranges
Subgroup evaluation	Determines whether explanations reveal uneven model behavior
Documentation of abstraction level	Determines whether stakeholders interpret the explanation correctly

The phrase “the assumptions are the product” is not rhetorical flourish here. It is operational truth.

Where Cognaptus reads this paper for enterprise AI

What the paper directly shows: a formal framework can combine concept vocabularies, causal models, concept-to-data generators, and black-box predictors into a post-hoc explanation method. Using probability of sufficiency, it can generate local, global, attribution-style, and contrastive explanations. In CelebA examples, it produces explanations that differ meaningfully from Grad-CAM, TCAV, independence assumptions, and CBN-style interventional approximations.

What we infer for business use: this framework points toward a more serious model governance stack. Explanation should become a structured causal diagnostic process, not a gallery of feature importance plots. The most valuable outputs are not the prettiest counterfactual images, but tables that reveal whether model decisions are sensitive to concepts the business can defend.

What remains uncertain: scaling this approach beyond controlled image settings will be difficult. Many enterprise data domains lack clean concept labels, validated causal graphs, and reliable generators. Text, multimodal workflows, customer behavior models, and agentic systems all raise harder concept-boundary questions. The paper itself leaves optimal contrastive explanation search for future work, and notes that the number of counterfactuals grows exponentially with the number of binary concepts. Concept attribution methods may help guide search, but they do not remove the computational burden.
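The growth is easy to quantify. A minimal sketch, assuming each concept is binary and a candidate counterfactual is one joint assignment of all concepts (the four concept names are just examples):

```python
from itertools import product


def joint_assignments(concepts):
    """Lazily enumerate every 0/1 assignment of the given binary concepts."""
    return product((0, 1), repeat=len(concepts))


small = ["Young", "Gray Hair", "Glasses", "Makeup"]
print(sum(1 for _ in joint_assignments(small)))  # 2 ** 4 assignments

# Counting only, without enumerating: all 40 CelebA binary attributes.
print(2 ** 40)
```

Four concepts give 16 assignments to test. Forty give more than a trillion, which is why exhaustive search stops being an option long before the vocabulary becomes comprehensive.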

There is also a cultural issue. Businesses like explanations that are cheap, stable, and flattering. Causal explanations are not always any of those. They may reveal that the model uses an embarrassing proxy, that a subgroup is affected differently, or that the organization cannot justify its causal assumptions. This is not a bug. It is the point.

XAI grows up when it stops pretending correlation is enough

The paper is valuable because it pushes XAI toward a more adult standard. It does not say that every explanation must be causal in the same way. It does not claim that a proof-of-concept on CelebA solves enterprise explainability. It says something narrower and more useful: if an explanation is meant to be understood causally, then the machinery generating it must support that interpretation.

That requires a concept vocabulary. It requires a mapping from concepts to data. It requires a causal model over concepts. It requires a probability statement that distinguishes “this was associated with the decision” from “this would have been sufficient to change the decision.”

This is why the framework matters. It gives model governance teams a way to ask sharper questions:

Which concepts actually move decisions?
Which concepts move decisions only for certain subgroups?
Which explanation values depend heavily on structural assumptions?
Which simplifications are harmless, and which change the answer?
Which concepts are missing from the explanation vocabulary?

The answer will not always be comfortable. Good. Comfort was never the purpose of XAI. Understanding was.

Cognaptus: Automate the Present, Incubate the Future.

¹ Anna Rodum Bjøru, Jacob Lysnæs-Larsen, Oskar Jørgensen, Inga Strümke, and Helge Langseth, “A Framework for Causal Concept-based Model Explanations,” arXiv:2512.02735, 2025, https://arxiv.org/abs/2512.02735.
