When Fairness Fails in Groups: From Lone Counterexamples to Discrimination Clusters

Imagine two fairness bugs.

In the first, changing a protected attribute while holding everything else constant shifts a model’s output enough to trigger one unfair decision.

In the second, the same underlying applicant profile can fracture into nineteen meaningfully different score bands as protected attributes change.

A conventional pairwise fairness test records both as violations. One counterexample each. Very tidy. Also not especially useful.

The first case shows that fairness has failed. The second shows that the model has entered a region of profound decision-making arbitrariness. Treating them as equivalent is rather like recording both a dripping tap and a burst water main as “one plumbing incident.”

The paper Uncovering Discrimination Clusters: Quantifying and Explaining Systematic Fairness Violations introduces HyFair, a framework designed to make this distinction measurable.¹ It begins with familiar individual-fairness testing, but then extends the audit into a diagnostic chain:

find a pairwise fairness violation;
measure how severely outcomes fragment around it;
identify the non-protected conditions associated with severe fragmentation;
use those conditions to guide mitigation;
test again, because mitigation can create a fresh set of fairness problems.

The important contribution is therefore not merely a better detector. It is a proposed workflow for moving from proof of failure to severity assessment, explanation, and regression testing.

A Pairwise Violation Says Fairness Broke, Not How Badly

Individual fairness usually asks whether similar individuals receive similar outcomes.

Suppose an input contains non-protected features such as income, education, working hours, and employment history, alongside protected attributes such as gender, race, or age. An auditor keeps the non-protected features fixed, changes the protected attributes, and checks whether the model’s output changes beyond an accepted tolerance.

When a sufficiently different outcome appears, the auditor has found an individual discriminatory instance.

This formulation is valuable because it produces a concrete counterexample. The model cannot hide behind aggregate performance: two otherwise equivalent inputs were treated differently because a protected attribute changed.

But the counterexample remains binary. It answers:

Does a fairness violation exist?

It does not answer:

How unstable does the model become once the violation is found?

That missing question matters because counterexamples do not all describe the same risk. One may reveal a narrow decision boundary where two outputs barely exceed the audit threshold. Another may expose a region where protected-attribute changes repeatedly scatter outcomes across much of the model’s score range.

Counting both as one violation conceals the difference.

It can also produce a misleading operational picture. A model generating many mild pairwise violations is not necessarily more dangerous than one generating fewer violations concentrated around extreme, repeatable outcome fragmentation. Raw counts measure how often the search found something. They do not, by themselves, measure how badly the model behaves when it does.

k-Discrimination Turns a Counterexample into a Severity Measure

HyFair generalizes pairwise discrimination into k-discrimination.

For a fixed set of non-protected features, the framework generates a counterfactual neighborhood by varying protected attributes. It then groups the resulting model outputs into meaningfully distinct clusters. Outputs separated by more than a chosen tolerance are placed into different buckets.

In simplified notation:

$$ k(x) = \left| \operatorname{Cluster}_{\epsilon} \left( {f(x,p): p \in P} \right) \right| $$

Here, $x$ represents the shared non-protected features, $p$ represents possible protected-attribute values, $f$ is the model, and $\epsilon$ defines when two outputs are different enough to belong to separate clusters.

A value of $k=2$ captures the familiar pairwise fairness violation. Higher values indicate that changing protected attributes produces increasingly fragmented outcomes.

This gives the audit a severity dimension:

Audit result	What it establishes	What remains unknown
No 2-discriminant counterexample found	The encoded individual-fairness property survived the completed search or verification procedure	Whether the property, domain, timeout, and counterfactual assumptions reflect the real deployment
A 2-discriminant counterexample exists	Protected-attribute changes can materially alter the outcome	Whether the violation is isolated or part of a severe pattern
A high-k counterfactual neighborhood exists	Similar profiles can receive many distinct outputs because protected attributes change	How common, realistic, harmful, or legally significant that region is

The distinction in the final row is crucial. A high $k$ measures outcome fragmentation, not the number of real people affected. It also does not directly measure financial harm, legal liability, or moral severity.

It is better understood as a risk-prioritization signal: a high-$k$ region is a place where the model’s decision logic becomes unusually dependent on protected attributes and therefore deserves closer inspection.

There is another terminological trap. “Discrimination clusters” may sound like population-level demographic groups. That is not what the paper primarily discovers. The cluster is formed from model outputs within a counterfactual neighborhood whose members share non-protected features. HyFair is mapping regions of model behavior, not directly estimating demographic-parity gaps across an observed population.

HyFair Uses Formal Search to Find the Door, Then Random Search to Map the Room

Directly verifying high levels of k-discrimination is expensive.

To establish that a model is k-discriminant through formal verification, an auditor may need to encode multiple copies of the neural network, hold their non-protected inputs constant, vary their protected attributes, and determine whether their outputs occupy at least $k$ distinct clusters. As $k$ grows, so does the computational burden.

HyFair avoids treating this as one enormous verification problem.

Instead, it divides the work between two techniques with different strengths.

Stage 1: Formal analysis finds pairwise violations or certifies the encoded property

HyFair represents the neural network as a mixed-integer linear program. The solver searches for two inputs that share non-protected features but differ in protected attributes and produce outputs separated by more than the fairness tolerance.

When the solver completes successfully without finding a violating pair, it can certify the encoded 2-fairness property. When it finds a counterexample, that case becomes a seed for deeper investigation.

Formal analysis is useful here because it can systematically locate violations that unguided testing may miss. But repeatedly extending formal verification to much larger $k$ values would become prohibitively expensive.

Stage 2: Randomized search explores severity around the seeds

Once HyFair finds a pairwise counterexample, it evaluates the full protected-attribute neighborhood and measures the resulting $k$. It then searches nearby non-protected inputs for cases with greater outcome fragmentation.

The framework compares random walks, simulated annealing, and simulated annealing augmented with nearest-neighbor search.

Simulated annealing is particularly useful because it does not always insist on moving immediately toward a better candidate. It can temporarily accept worse solutions, helping the search escape flat regions and local optima before locating more severe discrimination clusters.

HyFair also returns to the formal solver for fresh seeds when the randomized search stalls. Formal search discovers promising entrances; randomized exploration investigates what lies behind them.

This division of labour is the paper’s core mechanism:

Formal pairwise search
        ↓
Counterexample seed
        ↓
Local randomized exploration
        ↓
Maximum observed k-discrimination
        ↓
Shared-condition explanation
        ↓
Mitigation and renewed search

The hybrid design is more meaningful than a simple contest between MILP and random search. Each component is assigned the part of the problem it handles reasonably well.

The Search Results Show Why Rare Peaks Matter More Than Raw Counts

The authors evaluate HyFair on 20 fully connected ReLU neural-network benchmarks trained on the Adult Census Income and Bank Marketing datasets. The architectures range from three to eleven layers and from ten to 318 neurons.

Throughout the experiments, the maximum counterfactual set size is fixed at 20, and outputs separated by more than five percentage points are treated as distinct. Formal solver runs use a 100-second timeout, while randomized searches run for four hours. Results are averaged across repeated runs.

These design choices matter because the reported values of $k$ are properties of the model under the chosen audit configuration, not timeless constants engraved into the network.

The solver comparison supports seed quality, not the entire thesis

Against Fairify, an SMT-based fairness-verification baseline, HyFair’s MILP formulation finds more individual discriminatory instances in 17 of the 20 benchmarks. It finds higher maximum cluster counts in 80% of the cases and reaches its first discriminatory instance faster in 90%.

This supports the first stage of the workflow: HyFair often supplies more and better seeds for subsequent severity analysis.

It does not establish that MILP dominates SMT for every model. The paper notes that the two approaches can explore different regions. In some benchmarks, only one tool finds discriminatory instances. Fairify’s Z3-based search tends to return nearby counterexamples, while HyFair’s MILP formulation often identifies cases from more varied regions, including boundaries between safe and unsafe areas.

The practical interpretation is not “replace every fairness tool with MILP.” It is that seed diversity affects what the later audit can discover.

Simulated annealing is strongest at finding extremes, not at every search metric

Among the randomized strategies, simulated annealing finds instances at the maximum observed $k$ in 94% of the benchmark cases. It also reaches the first discriminatory instance faster than the alternatives in 72% of cases.

However, the nearest-neighbor variant performs best on average cluster count in 56% of the benchmarks. Random walks also remain competitive on some measures.

This comparison is best read as a component-selection experiment. It shows that simulated annealing is especially useful when the audit objective is to locate severe peaks. It does not show that one search method universally dominates every definition of fairness-testing performance.

AC2 demonstrates the difference between prevalence and severity

The Adult Census AC2 benchmark offers the paper’s clearest example.

Using simulated annealing, HyFair finds an average of approximately 2,245 individual discriminatory instances. Yet only about 3.8 of them, on average, reach the observed maximum of $k=19$.

The extreme cases are therefore rare within the detected violation set.

A count-based audit might concentrate on the thousands of discovered instances. A severity-oriented audit would also isolate the few cases where protected-attribute changes fragment nearly the entire available outcome range.

Neither metric makes the other irrelevant. They describe different aspects of risk:

Metric	Operational question
Number of individual discriminatory instances	How many violating cases did the search encounter?
Search success rate	How densely does the search encounter violations?
Average $k$ among violations	How fragmented is a typical violating neighborhood?
Maximum observed $k$	How severe is the worst discovered neighborhood?
Number of maximum-$k$ instances	How repeatable is the observed worst case?

A serious fairness audit should resist compressing all five into one reassuring score. Convenient dashboards have caused enough trouble already.

Decision Trees Explain Where the Model Becomes Sensitive to Protected Attributes

Discovering a high-$k$ case tells an auditor where the model behaves badly. It does not yet explain the conditions associated with that behavior.

HyFair addresses this through local perturbation and decision-tree learning.

For the discovered high-$k$ inputs, the framework generates 5,000 nearby samples and calculates their k-discrimination values. It labels the top 5% as highly discriminatory and trains a decision tree to distinguish them from lower-$k$ cases.

Paths leading to high-$k$ leaves become candidate explanation rules. HyFair then tests those rules by generating cases that violate their conditions and checking whether $k$ falls meaningfully.

The resulting rules describe combinations of non-protected attributes under which the model becomes unusually sensitive to protected attributes.

For the AC2 benchmark, one reported explanation includes conditions involving working hours, marital status, work class, native country, and capital gain. The rule identifies a hyper-rectangular region of the feature space associated with high discrimination.

That kind of explanation is operationally useful because it can be translated into a review condition:

When an input falls inside this region, the model’s output is unusually unstable under protected-attribute changes.

This is more actionable than receiving hundreds of isolated counterexamples with no shared structure.

The LIME comparison tests task fit, not universal explanation quality

HyFair’s explanations are compared with LIME. Across the benchmarks, HyFair produces more robust explanations in more than 78% of cases and covers a larger portion of the input space in every reported comparison. Its rule sets are often smaller or equal in size.

Those results support the claim that HyFair’s explanation method fits the discrimination-cluster task well.

But the comparison has an important boundary: LIME is designed to explain a model’s output for one input. It does not natively explain a collection of high-k counterfactual neighborhoods. The authors adapt it by extracting common features from multiple local explanations.

HyFair is therefore being compared with a baseline operating outside its natural job description. Winning that comparison is useful evidence of task suitability, not proof that decision-tree explanations are generally superior to LIME.

The explanations also remain associational. A decision-tree path can identify conditions that reliably coincide with high discrimination. It does not establish that those features cause the discriminatory behavior, nor that changing them would be an ethically or legally acceptable intervention.

The Mitigation Experiment Shows Why Retraining Can Move the Worst Case

After generating explanation rules, the paper tests two mitigation strategies:

use the decision-tree rules as guardrails that refuse queries associated with severe arbitrariness;
retrain the neural network using curated discriminatory instances as additional training data.

The mitigation experiments are exploratory rather than a complete production playbook. Still, they reveal the paper’s most important operational lesson.

Across the benchmarks, adding decision-tree guardrails to the original model performs best at reducing maximum k-discrimination in 67% of cases. Retrained models with decision-tree rules perform best in 22%, while retraining without the rules performs best in 11%.

The retrained models lose no more than two percentage points of predictive accuracy. That sounds encouraging until the fairness metrics are inspected separately.

In several cases, retraining reduces the number or search frequency of individual violations while increasing the maximum observed $k$. The model appears fairer more often, but the worst surviving cases become more severely fragmented.

For example, on AC8, retraining without decision-tree rules increases the observed maximum $k$ from 16.9 to 20, raises average $k$ from 12.3 to 18.3, and increases both the number and search rate of individual discriminatory instances.

The apparent fix has not merely failed. It has altered the fairness-risk landscape.

This is why mitigation cannot be evaluated with a single before-and-after count. A retrained model may reduce prevalence, increase peak severity, or create severe violations in regions not covered by the original explanation rules.

The correct sequence is iterative:

Discover severe region
        ↓
Explain shared conditions
        ↓
Apply guardrail or retrain
        ↓
Repeat the full search
        ↓
Compare peak severity, prevalence, and newly exposed regions

Fairness regression testing should be treated much like security regression testing. A patch that closes one vulnerability and opens another is not ready merely because the first test now passes.

Guardrails Reduce Model Exposure, but They Do Not Automatically Resolve Harm

Decision-tree rules are attractive because they can be deployed without changing the underlying neural network. A system can identify inputs inside a high-risk region and refuse to produce the normal automated decision.

This can reduce exposure to severe model arbitrariness. It can also create a new operational problem.

A refusal is not the same as a fair outcome.

In a lending, hiring, benefits, or service-access process, rejecting automated processing may delay decisions, impose additional documentation burdens, or systematically route certain cases into a slower channel. A guardrail can therefore relocate harm from discriminatory output to unequal access or service quality.

That does not make guardrails useless. It means the business process around them matters.

A more defensible deployment would route flagged cases to a defined fallback procedure, such as human review, a simpler validated policy, or a second independently audited model. The organization would then measure whether the fallback itself produces unequal burdens.

The paper directly shows that rules can reduce several k-discrimination metrics. The broader workflow implication—using them as monitored routing conditions rather than silent denials—is a business inference, not an experimentally demonstrated result.

A Business Audit Should Track Peaks, Prevalence, and Fix Regression Separately

HyFair suggests a more mature structure for fairness assurance.

Instead of asking only whether a model has passed or failed, organizations can maintain a fairness incident portfolio.

Audit stage	Paper mechanism	Business use	Required control
Establish the baseline	Formal search for pairwise violations	Determine whether the encoded fairness property can be violated	Document the protected attributes, domain constraints, tolerance, and solver outcome
Rank severity	Search for maximum and average $k$	Prioritize investigations by outcome fragmentation	Track both peak severity and prevalence
Diagnose regions	Decision-tree rules over high-$k$ neighborhoods	Identify operating conditions associated with severe model instability	Validate rule stability and counterfactual realism
Apply intervention	Guardrails or targeted retraining	Reduce exposure while preserving service delivery	Define fallback handling and acceptance criteria
Re-audit	Repeat search on the modified model	Detect relocated or newly created vulnerabilities	Require fairness regression testing before release

This workflow changes the questions asked during model approval.

Instead of:

Did the fairness metric improve?
Did the number of detected violations fall?
Did accuracy remain acceptable?

The review becomes:

Did maximum severity fall?
Did average severity fall?
Did violation prevalence fall?
Did the intervention create new severe regions?
Are the generated counterfactuals realistic?
What happens operationally when a guardrail activates?
Does the result hold under alternative thresholds and search budgets?

This is not a demand for more metrics for their own sake. It is recognition that fairness failures have different shapes, and those shapes imply different responses.

The Experiments Form One Diagnostic Chain, Not Four Independent Victories

The paper organizes its evaluation around four research questions. Their evidentiary roles are easier to interpret when placed inside the end-to-end mechanism.

Experiment	Likely purpose	What it supports	What it does not prove
MILP versus Fairify	Comparison with prior verification work	HyFair often finds more diverse and severe counterexample seeds faster	MILP is always the best verifier for every architecture or fairness definition
Random walk versus simulated annealing variants	Search-component comparison	Simulated annealing is effective for locating extreme observed k-discrimination	The observed maximum is the true global maximum under every timeout and seed
HyFair explanations versus adapted LIME	Explanation-method comparison	Cluster-oriented decision-tree rules are compact, robust, and broad within the tested setup	The rules are causal or generally superior to all explanation methods
Guardrails and retraining	Exploratory mitigation test	Explanation rules can reduce several discrimination metrics, while retraining can create new vulnerabilities	The interventions solve fairness in production or eliminate downstream harm

This distinction prevents the paper from being overread.

The search comparisons are useful because they validate components of the diagnostic workflow. The mitigation experiment is useful because it exposes failure modes. Neither should be inflated into a universal fairness guarantee.

What the Evidence Does Not Establish

The paper offers a useful audit framework, but its practical boundaries are substantial.

First, the experiments cover 20 fully connected ReLU neural networks trained on two tabular datasets. The results do not establish equivalent performance for transformers, tree ensembles, multimodal systems, foundation models, or models operating on richer unstructured inputs.

Second, the counterfactual-generation process tries combinations of protected attributes that may produce unrealistic or impossible profiles. The authors apply some rule-based corrections, but they acknowledge that more sophisticated generative methods may be needed to improve realism.

A high-$k$ region based largely on implausible counterfactuals is an audit artefact, not necessarily a real operational hazard.

Third, HyFair does not directly handle intersectional fairness. Repeating the procedure across combinations of protected attributes is possible, but computational cost and interpretation complexity will rise quickly.

Fourth, the decision-tree explanations divide the input space into rectangular regions. That makes the rules readable, but it may poorly approximate curved or highly entangled model behaviour. The rules identify associations and do not establish causal relationships.

Fifth, several central choices are fixed rather than extensively stress-tested: the five-percentage-point output threshold, the maximum counterfactual set size of 20, the four-hour randomized-search budget, the 100-second solver timeout, and the top-5% threshold used to label highly discriminatory samples.

The paper does not establish that benchmark rankings or mitigation conclusions remain unchanged under substantially different configurations.

Finally, certification remains bounded by the encoded fairness definition, domain constraints, model representation, timeout, and seed-dependent search process. Passing the procedure does not certify fairness in general. It certifies something narrower—and, unlike many corporate fairness claims, at least tells us what that narrower thing is.

From Fairness Counts to Fairness Incident Management

The most useful insight from HyFair is simple: once a model fails a fairness check, the next question should not merely be how many more failures can be found.

The audit should ask how severely outcomes fragment, where the fragmentation occurs, whether it repeats, and what happens after the model is changed.

k-discrimination supplies a severity signal. Formal verification supplies disciplined entry points. Randomized search maps severe regions. Decision trees translate those regions into inspectable conditions. Repeated testing reveals whether mitigation actually removed risk or merely moved it somewhere less convenient to notice.

The framework does not solve algorithmic fairness. It does something more practical: it turns fairness violations from an undifferentiated pile of counterexamples into incidents that can be ranked, investigated, and retested.

For organizations deploying consequential decision systems, that is a meaningful upgrade. A model with fewer detected violations is not necessarily safer. A model with better average fairness is not necessarily safer. A retrained model is certainly not safer merely because someone named the training job debias_final_v2.

Fairness assurance begins with finding a counterexample.

It becomes useful when the organization understands what surrounds it.

Cognaptus: Automate the Present, Incubate the Future.

Ranit Debnath Akash, Ashutosh Trivedi, Ashish Kumar, Gang Tan, Verya Monjezi, and Saeid Tizpaz-Niari, “Uncovering Discrimination Clusters: Quantifying and Explaining Systematic Fairness Violations,” arXiv:2512.23769, 2025. https://arxiv.org/abs/2512.23769 ↩︎

A Pairwise Violation Says Fairness Broke, Not How Badly#

k-Discrimination Turns a Counterexample into a Severity Measure#

HyFair Uses Formal Search to Find the Door, Then Random Search to Map the Room#

Stage 1: Formal analysis finds pairwise violations or certifies the encoded property#

Stage 2: Randomized search explores severity around the seeds#

The Search Results Show Why Rare Peaks Matter More Than Raw Counts#

The solver comparison supports seed quality, not the entire thesis#

Simulated annealing is strongest at finding extremes, not at every search metric#

AC2 demonstrates the difference between prevalence and severity#

Decision Trees Explain Where the Model Becomes Sensitive to Protected Attributes#

The LIME comparison tests task fit, not universal explanation quality#

The Mitigation Experiment Shows Why Retraining Can Move the Worst Case#

Guardrails Reduce Model Exposure, but They Do Not Automatically Resolve Harm#

A Business Audit Should Track Peaks, Prevalence, and Fix Regression Separately#

The Experiments Form One Diagnostic Chain, Not Four Independent Victories#

What the Evidence Does Not Establish#

From Fairness Counts to Fairness Incident Management#