Training Models to Explain Themselves: Counterfactuals as a First-Class Objective

Rejected.

That is where counterfactual explanations usually enter the story. A loan applicant is declined by an automated system. A hiring candidate is filtered out. An insurance customer is priced into an unfavorable category. The counterfactual explanation is supposed to answer a practical question: what would need to change for the model to give me the desired outcome?

In the clean textbook version, this sounds almost humane. “Increase income by X, reduce debt by Y, and the application would be approved.” Lovely. A small procedural kindness from the machine.

Then reality arrives, wearing compliance shoes.

A counterfactual can be valid but ridiculous. It can technically flip the model’s decision while placing the person in a region of feature space that looks nothing like real approved applicants. It can ask the user to change something immutable, such as age. It can be cheap only because the model is vulnerable to strange, adversarial perturbations. In other words, the explanation can satisfy the model while failing the human. Quite an achievement, if the goal was to automate disappointment.

The paper behind today’s article, Counterfactual Training: Teaching Models Plausible and Actionable Explanations, makes a useful move: it stops treating poor counterfactual explanations as merely a post-hoc explanation-generator problem.¹ The authors argue that the model itself should be trained so that meaningful counterfactuals become easier to produce. That shift is the important part. The paper is not just proposing a new counterfactual search trick. It is proposing that counterfactual quality belongs inside the training objective.

This matters because many business deployments still organize explainability as an afterthought: first train the predictive model, then attach an explanation method, then hope the output looks acceptable in front of users, auditors, and legal teams. Counterfactual training challenges that workflow. If the model has learned representations that make plausible, actionable recourse difficult, the explanation layer can only do so much polishing. And there is only so much lipstick one should apply to a decision boundary.

The usual workflow asks the explainer to clean up after the model

Counterfactual explanations are normally generated after the model has already been trained. A classifier maps an input $x$ to an output class. A counterfactual generator searches for a modified input $x'$ that changes the model prediction to a target class $y^+$. The generic optimization shape is:

$$ \min_{x'} \; \text{yloss}(M_\theta(x'), y^+) + \lambda \cdot \text{reg}(x') $$

The first term pushes the counterfactual toward the desired model output. The regularization term usually discourages large or unrealistic changes. Different counterfactual methods add different constraints or penalties: keep the change small, stay near the data distribution, respect immutable features, traverse a latent space, and so on.
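To make the shape of that objective concrete, here is a minimal sketch of a gradient-based search against a toy logistic model. Everything in it is an assumption for illustration: the weights, the two feature names, the negative log-likelihood standing in for yloss, the squared distance to the original input standing in for reg, and the stopping threshold `tau`. It is not the paper's generator.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Target-class probability under a toy logistic model (hypothetical)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def counterfactual_search(w, b, x, lam=0.1, lr=0.1, tau=0.5, max_steps=200):
    """Gradient descent on -log p(y+ | x') + lam * ||x' - x||^2.

    The squared-distance penalty is an assumed stand-in for reg(x').
    """
    x_cf = list(x)
    for _ in range(max_steps):
        p = predict(w, b, x_cf)
        if p >= tau:  # target class reached, stop searching
            break
        # Gradient of -log p is -(1 - p) * w; gradient of the penalty is 2 * lam * (x' - x).
        grad = [-(1.0 - p) * wi + 2.0 * lam * (xc - xo)
                for wi, xc, xo in zip(w, x_cf, x)]
        x_cf = [xc - lr * g for xc, g in zip(x_cf, grad)]
    return x_cf

# Hypothetical features: [income, debt], both scaled to roughly [0, 1].
w, b = [1.5, -2.0], -1.0
x = [0.2, 0.8]                      # a rejected applicant
x_cf = counterfactual_search(w, b, x)
```

On this toy model the search raises income and lowers debt until the predicted probability crosses `tau`. Nothing in the loop knows whether the resulting point resembles a real approved applicant, which is exactly the gap the rest of the article is about.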

That framework has created a productive literature. But it also creates a structural weakness. The explanation method is asked to find good recourse inside a landscape the model has already created. If that landscape encodes fragile, strange, or socially unusable directions, the generator may have to choose between validity, plausibility, actionability, and cost.

The paper’s replacement idea is simple enough to state: during training, generate counterfactuals on the fly and use them to shape the model. The model is no longer only optimized for classification accuracy. It is also optimized so that the counterfactual explanations faithful to the model are more plausible with respect to the data and more actionable under feature constraints.

This is the mechanism-first reading of the paper: the model is trained not only to predict, but to make its own useful explanations possible.

Counterfactual training uses two kinds of counterfactuals

The paper distinguishes between mature and nascent counterfactuals.

A mature counterfactual is one that has successfully reached the target class according to a probability threshold. It has crossed the decision boundary far enough to count as a valid explanation. These mature counterfactuals are used to align the model’s learned representation with real target-class data.

A nascent counterfactual is an intermediate point during the gradient-based search. It has not yet converged. It is still on the way. The paper repurposes these interim counterfactuals as adversarial examples when their perturbations are small enough.

That design is neat because it makes the training loop do double duty. The mature counterfactuals teach the model what plausible recourse should resemble. The nascent counterfactuals teach the model not to be easily flipped by small, fragile perturbations.

Counterfactual type	Where it appears	Training role	Business interpretation
Mature counterfactual	End of a successful counterfactual search	Align explanations with target-class data	Recourse should look like a realistic approved case, not a mathematical hallucination
Nascent counterfactual	Intermediate search step	Used like an adversarial example	The model should not be easily manipulated by tiny unnatural changes
Feature-protected counterfactual	Generated under mutability constraints	Push sensitivity toward mutable features	Explanations should not quietly depend on age, protected attributes, or other non-actionable variables
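The mature/nascent split can be sketched as a small sorting step over one search path. This is an illustration under stated assumptions, not the paper's implementation: the function name, the use of the largest per-feature change as the perturbation size, and the values of `tau` and `eps` are all hypothetical.

```python
def max_change(x_a, x_b):
    """Largest absolute per-feature difference (an assumed perturbation measure)."""
    return max(abs(a - b) for a, b in zip(x_a, x_b))

def split_search_path(x, path, final_prob, tau=0.5, eps=0.1):
    """Split one counterfactual search path into training signals.

    x          : the original input
    path       : points visited by the search, in order; path[-1] is the end point
    final_prob : target-class probability at the end point
    Returns (mature, nascent): the end point if it reached the threshold
    (else None), and the intermediate points whose perturbation stays within eps.
    """
    end_point = path[-1]
    mature = end_point if final_prob >= tau else None
    nascent = [p for p in path[:-1] if max_change(p, x) <= eps]
    return mature, nascent

# Made-up path for illustration only.
x = [0.0, 0.0]
path = [[0.05, 0.00], [0.08, 0.05], [0.30, 0.20], [0.60, 0.40]]
mature, nascent = split_search_path(x, path, final_prob=0.7)
```

In this made-up path the end point qualifies as mature, the first two steps are small enough to serve as adversarial examples, and the third step is used for neither.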

The paper’s full objective combines ordinary classification loss with three additional pressures: a contrastive divergence term, an adversarial loss term, and a ridge-style energy regularizer. In plain language, the model is trained to classify correctly, pull real target-class samples into favorable energy regions, push implausible counterfactual samples away, and become less vulnerable to adversarially small changes.
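One schematic way to write that combination is shown here. The notation is an assumption made for this article and may not match the paper's symbols: $x^+$ is a real target-class sample, $x'_{m}$ a mature counterfactual, $x'_{n}$ a nascent counterfactual treated as an adversarial example for the original label $y$, and the $\lambda$ terms are penalty strengths.

$$ \mathcal{L}(\theta) = \text{yloss}(M_\theta(x), y) + \lambda_{div} \cdot \text{div}(x^+, x'_{m}, y^+) + \lambda_{adv} \cdot \text{advloss}(M_\theta(x'_{n}), y) + \lambda_{reg} \cdot \text{ridge}(x^+, x'_{m}, y^+) $$

Setting all three penalty strengths to zero recovers ordinary classifier training, which is a convenient way to read the extra terms: each one is a pressure layered on top of prediction, not a substitute for it.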

The exact implementation is more technical, but the business-level mechanism is clear: counterfactual quality is no longer outsourced to the explanation method. It becomes part of what the model is optimized to support.

Plausibility is trained by contrasting explanations with real target-class data

The first major component of counterfactual training is contrastive divergence.

The authors reinterpret a classifier as something close to a joint energy-based model. In this view, the model assigns energy to input-output combinations. Lower energy corresponds to more compatible configurations. Counterfactual training compares two things for a given target class:

real training samples from the target class;
mature counterfactuals that claim to belong to that target class.

The training objective lowers the energy of real target-class samples and raises the energy of counterfactuals that are not yet aligned with the target distribution. Over time, the intended effect is that counterfactuals generated for the model become more like genuine target-class examples.
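A minimal sketch of that contrast, assuming the convention common in joint energy-based models that the energy of an input-class pair is the negative logit for that class. The function names and the logit values are hypothetical, and the paper's exact term may differ in detail.

```python
def energy(logits, y):
    """Energy of the pair (x, y): negative logit of class y (assumed convention)."""
    return -logits[y]

def contrastive_divergence(real_logits, cf_logits, target):
    """Mean energy of real target-class samples minus mean energy of
    mature counterfactuals, both evaluated at the target class.

    Minimizing this value lowers the energy of real samples and raises
    the energy of the counterfactuals.
    """
    e_real = sum(energy(l, target) for l in real_logits) / len(real_logits)
    e_cf = sum(energy(l, target) for l in cf_logits) / len(cf_logits)
    return e_real - e_cf

# Made-up logits for a two-class model; class 1 is the target class.
real_logits = [[0.2, 2.0], [0.1, 1.5]]   # real target-class samples
cf_logits = [[0.5, 0.4], [0.3, 0.8]]     # mature counterfactuals
div = contrastive_divergence(real_logits, cf_logits, target=1)
```

Here the real samples already sit at lower energy than the counterfactuals, so the term is negative (about -1.15). Left unchecked, a difference like this can be driven down without limit, which is one plausible reading of why the objective also carries an energy regularizer.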

This is the paper’s answer to a common failure mode. A counterfactual may say, “change these numbers and the decision flips,” while still landing in a region where real-world examples are rare or nonsensical. In lending, that could mean a rejected applicant receives advice that technically satisfies the classifier but does not resemble the financial profile of real approved borrowers. In hiring, it could mean a candidate is told to alter a feature combination that has no coherent path in real labor-market data.

The paper evaluates plausibility using two measures. The first, IP, is distance-based: it measures how far counterfactuals lie from observed samples in the target domain. The second, IP*, adapts maximum mean discrepancy to compare the distribution of counterfactuals with the distribution of target-class data. Both are trying to answer the same practical question: do the generated explanations look more like realistic members of the desired class?
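Both measures can be approximated in a few lines. The sketch below is an assumption-laden stand-in, not the paper's evaluation code: it uses the average Euclidean distance to target-class samples for the distance-based measure, and a biased squared maximum mean discrepancy estimate with a Gaussian kernel for the distributional one. The kernel, bandwidth, and data points are all hypothetical.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def implausibility(cfs, targets):
    """Average distance between counterfactuals and observed target-class samples."""
    total = sum(euclid(c, t) for c in cfs for t in targets)
    return total / (len(cfs) * len(targets))

def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-euclid(a, b) ** 2 / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared maximum mean discrepancy between two samples."""
    def mean_k(us, vs):
        return sum(gaussian_kernel(u, v, bandwidth)
                   for u in us for v in vs) / (len(us) * len(vs))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

# Made-up data: approved applicants and two sets of counterfactuals.
targets = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
near_cfs = [[1.1, 1.0], [1.0, 0.95]]     # look like the target class
far_cfs = [[3.0, -1.0], [2.8, -1.2]]     # valid perhaps, but far from real data

ip_near, ip_far = implausibility(near_cfs, targets), implausibility(far_cfs, targets)
mmd_near, mmd_far = mmd_squared(near_cfs, targets), mmd_squared(far_cfs, targets)
```

Lower is better on both. The first number punishes individual counterfactuals that sit far from real cases; the second punishes a whole batch of counterfactuals whose distribution does not match the target class.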

The results are mostly favorable. With the strongest generator configuration discussed in the main results, counterfactual training reduces implausibility by about 15.6% on average under IP and about 25.3% on average under IP*. The strongest gains are on the synthetic Circles dataset, where reductions reach roughly 58.9% under IP and 93.8% under IP*. On real-world tabular datasets, the gains are more modest but still often meaningful: California Housing, Credit, and GMSC show statistically significant IP reductions of about 10%, while GMSC shows about 24.8% reduction under IP*. MNIST shows a smaller IP improvement of about 6.4%, while its IP* result is too uncertain to support a strong claim.

The exceptions matter. Adult and Overlapping do not show significant IP improvement in the main table. The authors suggest that Adult’s large proportion of categorical features may inhibit the generation of large numbers of valid counterfactuals during training. Overlapping data is simply harder because class separation is weak; if counterfactuals rarely mature, the model receives fewer useful examples from them.

That is the correct interpretation: counterfactual training is not magic dust. It needs enough mature counterfactuals to learn from. A training regime cannot learn much from examples that fail to become examples.

Actionability is not just “make the explanation smaller”

Actionability is the second core mechanism, and it is easy to misunderstand.

In counterfactual explanation work, cost is often approximated by distance: how far does $x'$ move from $x$? Smaller movement looks cheaper. But a small change to an immutable feature is not actionable. A tiny change to age is still impossible. A small change to a proxy for a protected attribute may be legally or ethically useless. A cheap adversarial tweak may be computationally impressive and humanly absurd, which is becoming something of an AI tradition.

The paper handles actionability by applying domain and mutability constraints during the training-time counterfactual search. Domain constraints keep features within plausible bounds. Mutability constraints prevent or restrict changes to features that users cannot practically alter.
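In a gradient-based search, both kinds of constraint can be enforced inside each update. The following is a minimal sketch assuming a simple mask-and-clip scheme; the function name, the feature names, and the bounds are hypothetical, and the paper's generators may apply constraints differently.

```python
def constrained_step(x_cf, grad, lr, mutable, bounds):
    """One gradient step that respects mutability and domain constraints.

    mutable : per-feature flags; False freezes the feature entirely
    bounds  : per-feature (low, high) limits; values are clipped into range
    """
    new_x = []
    for xi, gi, can_change, (low, high) in zip(x_cf, grad, mutable, bounds):
        step = lr * gi if can_change else 0.0   # immutable features do not move
        new_x.append(min(max(xi - step, low), high))
    return new_x

# Hypothetical features: [age, income, debt_ratio], all scaled to [0, 1].
x = [0.35, 0.20, 0.80]
grad = [-1.0, -2.0, 3.0]            # made-up gradient of the search objective
mutable = [False, True, True]       # age is protected
bounds = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]

x_next = constrained_step(x, grad, lr=0.5, mutable=mutable, bounds=bounds)
# -> [0.35, 1.0, 0.0]
```

The gradient would happily move age, since it does not know any better. The mask ignores that request, and the clip stops income and debt ratio from leaving their valid range.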

The distinctive feature is how the method handles immutable features inside the contrastive divergence term. The authors protect immutable features so that the objective does not penalize counterfactuals merely because those features cannot move toward the target-class sample. Instead, the model is pushed to seek plausibility through mutable features.

This is important. If age helps distinguish approved from rejected applicants, a naive model may rely heavily on age. But a counterfactual explanation that says “be older” is not useful recourse. Counterfactual training attempts to reduce the model’s relative sensitivity to such protected features and shift explanatory burden toward features that can actually be changed.

The paper supports this with both theory and experiments. The theoretical result is shown for a linear classifier under Gaussian class-density assumptions with common diagonal covariance. Under those conditions, protecting an immutable feature from the contrastive divergence penalty reduces classifier sensitivity to that feature relative to mutable discriminative features.

The empirical results are more mixed, and therefore more useful. With mutability constraints imposed, the average cost reduction is about 18.5% across datasets. Some gains are large: GMSC shows a cost reduction of about 66%, California Housing about 44%, and Overlapping about 41%. Synthetic datasets also show positive reductions. But Credit and MNIST show cost increases, and Adult’s cost result is not significant.

The paper gives a sensible explanation. Improved plausibility can remove cheap but bad counterfactuals from the solution space. Once the model stops accepting strange or adversarial shortcuts, some valid counterfactuals become harder to find or require more movement. That is not necessarily failure. It is the price of refusing to call nonsense “recourse.”

The integrated-gradient analysis adds another layer. For protected features, counterfactual training reduces model sensitivity in most datasets. Adult’s protected age variable shows roughly a one-third reduction in sensitivity. California Housing shows about a 20% reduction. Credit reduces sensitivity to age to zero in the reported table. MNIST, where the top and bottom pixel rows are protected, shows a reduction of more than half. GMSC is the major negative case: sensitivity to the protected feature increases, which the authors attribute to possible violated assumptions, interaction with other objective components, or baseline-choice issues.
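For readers who have not met integrated gradients, here is a minimal sketch on a toy logistic model. It approximates the path integral with a midpoint sum and then reports the protected feature's share of total attribution. The model, the weights, the all-zeros baseline, and the share statistic are assumptions for illustration; the paper's own sensitivity measure and baseline may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def model_gradient(w, b, x):
    """Gradient of the sigmoid output with respect to the input."""
    p = model(w, b, x)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(w, b, x, baseline, steps=200):
    """Midpoint-rule approximation of integrated gradients along a straight path."""
    attributions = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        g = model_gradient(w, b, point)
        for j in range(len(x)):
            attributions[j] += g[j] * (x[j] - baseline[j]) / steps
    return attributions

# Hypothetical features: [age, income, debt_ratio]; age is the protected one.
w, b = [0.8, 1.5, -2.0], -0.5
x, baseline = [0.6, 0.7, 0.2], [0.0, 0.0, 0.0]

attr = integrated_gradients(w, b, x, baseline)
protected_share = abs(attr[0]) / sum(abs(a) for a in attr)
```

The attributions sum to the difference between the model's output at `x` and at the baseline, so the result depends on which baseline is chosen. That dependence is one of the reasons the authors list baseline choice among the possible explanations for the GMSC result.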

This is one of the paper’s best practical messages: actionability is not a property of the explanation alone. It is also a property of the model’s learned sensitivities.

Robustness comes almost as a side effect, but not an accidental one

Counterfactual training also improves adversarial robustness.

This might sound like a second thesis, but it is actually part of the same mechanism. Counterfactual explanations and adversarial examples are closely related: both involve input perturbations that change model output. The difference is usually in intent, magnitude, and interpretation. A counterfactual asks for meaningful recourse. An adversarial example exposes fragility. But the search process can produce intermediate points that are useful for robustness training.

The paper uses nascent counterfactuals as adversarial examples when their perturbation magnitude remains below a threshold. This means the model receives adversarial-style training signals without separately running a full adversarial data-generation process. The authors describe these adversarial examples as coming essentially “for free” from the counterfactual search.

In the experiments, robustness improves strongly on real-world datasets under both FGSM and PGD attacks. Clean test accuracy is largely unaffected in most cases, while robust accuracy remains much higher for counterfactual-trained models as perturbation size increases. On some baseline models, robust accuracy drops to nearly zero under sufficiently large perturbations; counterfactual-trained models remain far more stable.
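FGSM, the simpler of the two attacks, takes a single step of size `eps` in the direction of the sign of the loss gradient. A minimal sketch on a toy logistic model follows; the weights, the input, and the value of `eps` are made up, and this is not the paper's evaluation setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Fast gradient sign method for a logistic model with cross-entropy loss.

    For this model the gradient of the loss with respect to x is (p - y) * w.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

w, b = [1.5, -2.0], -1.0            # hypothetical weights
x, y = [0.9, 0.1], 1                # correctly classified as class 1
x_adv = fgsm(w, b, x, y, eps=0.1)

p_clean = predict(w, b, x)          # just above 0.5
p_adv = predict(w, b, x_adv)        # pushed below 0.5 by the attack
```

PGD repeats steps of this kind and projects back into the `eps`-ball after each one. Robust accuracy is then simply accuracy measured on inputs like `x_adv` instead of `x`.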

The GMSC case is again worth reading carefully. The counterfactual-trained model has lower clean accuracy there, but the baseline’s accuracy collapses rapidly under adversarial perturbation. The authors interpret this as evidence that the baseline may be relying on fragile or meaningless associations. That is not proof of business superiority, but it is a useful warning: the model with the prettier clean score may be the less trustworthy model.

The ablation studies clarify the mechanism. The paper tests partial objectives: one using the adversarial robustness component without contrastive divergence, and one using contrastive divergence without adversarial loss. Both partial objectives can improve plausibility and robustness. But the full objective is more consistent, especially for adversarial robustness at higher perturbation sizes. In other words, the two components are not decorative. They each do work.

The appendix tests robustness and tuning sensitivity, not a second story

The supplementary sections are not just technical leftovers. They tell us how fragile the method is.

First, the choice of counterfactual generator matters. The paper tests three gradient-based generators: Generic, REVISE, and ECCCo. ECCCo generally works best as the backbone for counterfactual training because it targets faithfulness, which aligns with the training objective. Generic is simpler. REVISE depends on a surrogate variational autoencoder, so its counterfactuals may be less faithful to the model being trained. The grid-search results show that REVISE can produce weaker or even worse plausibility outcomes.

That matters operationally. If a company treats counterfactual training as a plug-and-play wrapper around any explanation generator, it may be disappointed. The generator is not an implementation detail. It determines the quality of the training signal.

Second, the number and maturity of counterfactuals matter. Increasing the maximum number of counterfactual-search steps generally improves results because more counterfactuals reach maturity. A low decision threshold, often $\tau = 0.5$, works well across many datasets because it increases the share of mature counterfactuals. Higher thresholds can help when maturity is not a problem, as in some synthetic settings, but can hurt when too few counterfactuals qualify.
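The threshold effect is easy to see with a toy calculation. The probabilities below are made up, and the function name is hypothetical; the point is only how quickly the pool of usable counterfactuals shrinks as the threshold rises.

```python
def mature_share(final_probs, tau):
    """Fraction of searches whose end point reaches target probability tau."""
    return sum(p >= tau for p in final_probs) / len(final_probs)

# Made-up target-class probabilities at the end of eight searches.
final_probs = [0.52, 0.61, 0.48, 0.93, 0.77, 0.55, 0.40, 0.88]

shares = {tau: mature_share(final_probs, tau) for tau in (0.5, 0.75, 0.9)}
# -> {0.5: 0.75, 0.75: 0.375, 0.9: 0.125}
```

With these numbers, moving the threshold from 0.5 to 0.9 cuts the mature share from three quarters to one eighth. Allowing more search steps works in the opposite direction, because it gives slow searches time to cross the threshold.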

Third, regularization matters. The energy regularization strength $\lambda_{reg}$ is particularly influential. Too little regularization can lead to instability and poor outcomes. The penalties on contrastive divergence and adversarial loss are less obviously sensitive, though extreme combinations can increase variability.

Fourth, counterfactual training does not necessarily need to run from the first epoch. The authors report that applying it after an initial conventional training phase can still improve explainability. For business adoption, that is one of the more interesting details. It suggests a possible fine-tuning route: start from an existing classifier, then apply counterfactual training to improve recourse behavior. That inference should be treated carefully, because the paper reports preliminary evidence rather than a full deployment study, but the direction is practical.

What the paper directly shows, and what business readers should infer

The paper directly shows that, across a range of classification datasets, counterfactual training can improve the plausibility of generated counterfactual explanations, often reduce recourse cost under mutability constraints, and increase adversarial robustness. It also shows that these gains depend on the generator, the maturity of counterfactuals, and regularization choices.

For business use, the most relevant inference is about workflow design. In high-stakes automated decision systems, explanation quality should not be treated only as a reporting-layer problem. If recourse quality matters, the training process itself may need to encode recourse constraints.

Paper result	What it directly supports	Business interpretation	Boundary
Implausibility drops under CT in most datasets	Counterfactuals better match target-class data	User advice may become more realistic	Gains vary; Adult and Overlapping show weaker results
Protected-feature sensitivity often falls	Mutability constraints can reshape learned sensitivity	Recourse can rely less on non-actionable variables	Proxy features may still carry protected information
Robust accuracy improves under FGSM and PGD	CT reduces adversarial fragility	Better explanations may coincide with more stable models	Not a replacement for specialized robust training
Validity sometimes drops	The valid solution space becomes narrower	Fewer cheap bad explanations may remain	Search settings may need adjustment to recover validity
Fine-tuning appears possible	CT may work after conventional pretraining	Existing models might be improved rather than rebuilt	Evidence is preliminary

For lenders, insurers, HR platforms, and public-sector decision systems, the practical lesson is not “use this paper tomorrow and declare compliance.” Please do not do that. The lesson is narrower and more valuable: if recourse is a product requirement, then training objectives should be designed with recourse in mind.

A compliance team may ask whether rejected applicants receive explanations. A better technical governance team asks whether the model has learned decision boundaries that make realistic, actionable explanations possible.

The deployment boundary: CT helps, but it does not solve recourse governance

Counterfactual training still inherits hard problems from algorithmic recourse.

First, immutable features may have proxies. Protecting age does not remove all information correlated with age. Education, employment history, income pattern, location, or credit history may carry proxy information. Some of these features are theoretically mutable but practically very hard to change. The model can shift burden from an explicitly protected feature to a socially loaded proxy. That is not unique to counterfactual training, but CT does not make it disappear.

Second, actionability can be unequal. A feature may be mutable for one person and unrealistic for another. “Increase income,” “change job type,” or “complete further education” may sound actionable in a feature table and still impose very different burdens across social groups. The paper explicitly notes the risk of unfairly assigned burden of recourse. This is where business governance must go beyond model metrics.

Third, plausibility and cost can conflict. More plausible explanations may be farther from the original input. CT sometimes reduces cost, but it can also increase cost when cheap but implausible shortcuts are removed. Businesses should not optimize a single metric and call the result humane. A low-cost explanation is not useful if it is implausible; a plausible explanation is not useful if it is socially unreachable.

Fourth, training becomes more expensive. The method generates counterfactuals during training, and the paper’s experiments used distributed CPU resources. The authors note that CT is parallelizable and may amortize some costs, but it is still more resource-intensive than conventional training. For small tabular classifiers, this may be manageable. For large industrial models, the engineering story is not yet written.

Finally, the paper is focused on classification. Many business decisions involve ranking, scoring, pricing, recommendation, or continuous prediction. Counterfactual explanations exist for some of these settings, but the conceptual and technical consensus is weaker. The paper’s method relies on target classes and non-target classes, so direct extension to regression-style systems remains future work.

The useful shift is accountability at training time

The most valuable part of this paper is not the largest percentage improvement in a table. The useful shift is conceptual and operational: explanation quality should be a model-training responsibility, not only a post-hoc interface responsibility.

That shift changes how we would design trustworthy AI systems.

In the old workflow, the model is trained for predictive performance, and explanation is attached afterward. If the explanations are strange, the explanation method is blamed. In the counterfactual-training workflow, the model is asked during training to support plausible, actionable counterfactuals. If the explanations are strange, the training objective itself is part of the investigation.

This is a healthier accountability structure. Not perfect. Not complete. But healthier.

For business leaders, the practical question is therefore not “Does this produce nicer explanations?” The sharper question is: “Are we training models whose decision boundaries make usable recourse possible in the first place?”

Counterfactual training gives one technical route toward that answer. It combines mature counterfactuals for plausibility, nascent counterfactuals for robustness, and feature-protection logic for actionability. The evidence is promising across several datasets, but uneven enough to keep serious readers awake. Good. Sleepy governance is how bad systems get procurement approval.

The broader lesson is simple: when explanations matter, the model should not be allowed to shrug and leave the mess to the explainer.

Cognaptus: Automate the Present, Incubate the Future.

¹ Patrick Altmeyer, Aleksander Buszydlik, Arie van Deursen, and Cynthia C. S. Liem, “Counterfactual Training: Teaching Models Plausible and Actionable Explanations,” arXiv:2601.16205, 2026.
