XAI, But Make It Scalable: Why Experts Should Stop Writing Rules

Churn is a wonderfully inconvenient business problem. Customers do not leave in one elegant, universal way. Some leave because price finally annoyed them. Some leave because support failed at exactly the wrong moment. Some leave because a monthly contract made exit frictionless. Some leave because they were already mentally gone and the invoice merely made it official.

This is where explainable AI often starts pretending to be more organized than reality.

The usual XAI bargain is awkward. Post-hoc explainers such as LIME and SHAP can be applied broadly, but their explanations may shift when inputs shift slightly. Intrinsic explanation frameworks are more stable because explanations are part of the model design, but they often require expensive human rationales. In production language: one option gives you scalable stories, the other gives you stable paperwork. Neither is exactly a CFO’s dream.

A recent arXiv paper by Lawrence Krukrubo, Julius Odede, and Olawande Olusegun proposes a more useful compromise: let machines discover broad “safety” patterns, then let humans define the sparse “risk traps” that automated rule learners tend to miss.¹ The authors call the framework Hybrid LRR-TED. The interesting part is not merely that the hybrid model performs well. The interesting part is why it performs well: explanation work is asymmetric.

Machines are good at mapping the ordinary. Experts are good at naming the dangerous exception.

A shocking discovery, apparently: humans are more useful when we stop asking them to behave like documentation interns.

The false choice in XAI is scale versus stability

The paper starts from a familiar tension. Post-hoc explainers are attractive because they can sit on top of many black-box models. That makes them operationally convenient. But post-hoc explanations explain a prediction after the fact; they are not necessarily faithful to the model’s internal decision process. If the explanation changes under small perturbations, the business user receives something that looks like transparency but behaves like mood lighting.

The alternative is intrinsic interpretability. In this family, the explanation is not decoration added later. It is part of the model structure or training process. Teaching Explanations for Decisions, or TED, is the paper’s key example. TED does not train only on inputs and labels, $X \rightarrow Y$. It trains on inputs mapped to both labels and explanations, $X \rightarrow (Y, E)$. That gives the explanation a stronger role: the model is supervised to produce both the decision and the reason.
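One simple way to picture the $X \rightarrow (Y, E)$ setup is to fold each label-explanation pair into a single combined class and hand that to an ordinary multiclass classifier. The sketch below shows only that encoding step. The pair values and the helper name `build_codebook` are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, int]  # (label, explanation code)


def build_codebook(pairs: List[Pair]) -> Dict[Pair, int]:
    """Give every distinct (label, explanation) pair its own class id."""
    codebook: Dict[Pair, int] = {}
    for pair in pairs:
        codebook.setdefault(pair, len(codebook))
    return codebook


# Hypothetical training targets: each row carries a label AND an explanation code.
train_pairs = [("stay", 1), ("churn", 4), ("stay", 1), ("churn", 5)]

codebook = build_codebook(train_pairs)
combined = [codebook[p] for p in train_pairs]        # targets for a multiclass model
decode = {cls: pair for pair, cls in codebook.items()}  # class id -> (label, code)
```

A classifier trained on `combined` predicts a label and an explanation in one shot, and `decode` recovers the pair. It also means every training row needs an explanation code, not just a label.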

The problem is obvious. Someone must provide those explanations.

If each training example requires a human rationale, TED-style supervision can become a knowledge bottleneck. The model may be stable, but the organization needs experts to annotate thousands or millions of cases. In regulated industries, this is not merely expensive. It is politically painful. The best experts are usually busy doing the work that produces the revenue or manages the risk. They are not waiting lovingly beside a spreadsheet to label explanation codes.

So the paper asks a more practical question: what if experts do not need to write the whole explanation system?

The mechanism: machines map safety, experts mark traps

The Hybrid LRR-TED pipeline has three main stages. The sequence matters because the article’s business lesson lives inside the order of operations.

First, the authors use automated rule discovery. The customer churn data are transformed into a high-dimensional binary feature space using feature binarization. Continuous and categorical features become explicit logical propositions: for example, age thresholds, tenure thresholds, and their negations. This allows a rule model to work with interpretable logical conditions instead of raw continuous variables.
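As a rough illustration of that binarization step, the sketch below turns numeric features into threshold propositions and their negations. The feature names, thresholds, and the `binarize` helper are hypothetical; the paper's actual binarizer and cut points are not described here.

```python
from typing import Dict, List


def binarize(record: Dict[str, float],
             thresholds: Dict[str, List[float]]) -> Dict[str, int]:
    """Map numeric features to 0/1 propositions such as 'age <= 30' and 'age > 30'."""
    out: Dict[str, int] = {}
    for feature, cuts in thresholds.items():
        value = record[feature]
        for cut in cuts:
            holds = int(value <= cut)
            out[f"{feature} <= {cut}"] = holds
            out[f"{feature} > {cut}"] = 1 - holds  # explicit negation
    return out


# Hypothetical customer and thresholds, for illustration only.
customer = {"age": 42, "tenure_months": 7}
thresholds = {"age": [30, 50], "tenure_months": [6, 24]}

features = binarize(customer, thresholds)
```

Each key in `features` is a readable logical condition, which is what lets a downstream rule learner report its findings in plain language.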

Then Linear Rule Regression is used to discover sparse rules. In simplified form, the objective balances prediction error, sparsity, and rule complexity:

$$ \min_w \frac{1}{2}\lVert Y - Bw \rVert_2^2 + \lambda_1 \lVert w \rVert_1 + \lambda_2 \sum_j C(r_j) $$

Here, $B$ is the binarized rule feature matrix, $w_j$ is the learned weight for rule $r_j$, and $C(r_j)$ penalizes rule complexity. The point is not to produce a mysterious neural representation. The point is to find simple logical structures that carry predictive signal.
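To make the three terms concrete, here is a minimal sketch that evaluates the simplified objective for a given weight vector. It is not the authors' solver. One assumption is made loudly: the complexity penalty is charged only for rules with a nonzero weight, which is one reading of the simplified formula rather than something the article spells out. All numbers are hypothetical.

```python
from typing import List, Sequence


def lrr_objective(B: Sequence[Sequence[float]], y: Sequence[float],
                  w: Sequence[float], complexity: Sequence[float],
                  lam1: float, lam2: float) -> float:
    """0.5 * ||y - Bw||_2^2 + lam1 * ||w||_1 + lam2 * sum of C(r_j) over active rules."""
    residuals: List[float] = [
        yi - sum(bij * wj for bij, wj in zip(row, w)) for row, yi in zip(B, y)
    ]
    fit = 0.5 * sum(r * r for r in residuals)
    sparsity = lam1 * sum(abs(wj) for wj in w)
    # Assumption: only rules the model actually uses (w_j != 0) pay for complexity.
    complexity_penalty = lam2 * sum(c for c, wj in zip(complexity, w) if wj != 0)
    return fit + sparsity + complexity_penalty


# Hypothetical binarized rule matrix (3 customers x 2 rules) and targets.
B = [[1, 0], [1, 1], [0, 1]]
y = [1, 1, 0]
w = [0.5, 0.0]        # only the first rule is active
complexity = [1, 2]   # e.g. number of conditions in each rule

value = lrr_objective(B, y, w, complexity, lam1=0.1, lam2=0.01)
```

Raising `lam1` pushes more weights to zero, while raising `lam2` makes long, multi-condition rules more expensive to keep.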

The authors use this automated phase primarily to identify “Safety Nets”: rules associated with customers likely to stay. Instead of mapping every discovered rule to a separate explanation, they group safety rules into three tiers based on coefficient magnitude. Stronger retention drivers receive one explanation code, weaker ones another. This reduces explanation sparsity and gives the model a structured explanation vocabulary for safe customers.
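The tiering idea can be sketched as a ranking by absolute coefficient, split into three equal groups. The split rule (equal-sized groups by rank), the rule names, and the weights below are assumptions for illustration; the paper's exact cut-offs are not given in this article.

```python
from typing import Dict


def tier_safety_rules(weights: Dict[str, float], n_tiers: int = 3) -> Dict[str, int]:
    """Rank rules by |coefficient| and assign tier 1 (strongest) to n_tiers (weakest)."""
    ranked = sorted(weights, key=lambda rule: abs(weights[rule]), reverse=True)
    return {
        rule: position * n_tiers // len(ranked) + 1
        for position, rule in enumerate(ranked)
    }


# Hypothetical safety rules with negative churn coefficients (they favor staying).
safety_weights = {
    "tenure_months > 24": -0.90,
    "has_autopay": -0.70,
    "age > 50": -0.40,
    "uses_app_weekly": -0.35,
    "no_support_tickets": -0.10,
    "paperless_billing": -0.05,
}

tiers = tier_safety_rules(safety_weights)
```

Six rules collapse into three explanation codes here, which is the point: the classifier sees a handful of well-populated explanation classes instead of one thin class per rule.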

Second, humans enter, but narrowly. Experts define “Risk Traps”: specific churn triggers that are semantically meaningful and actionable. The paper describes these as manual domain predicates, assigned to explanation codes beyond the automated safety tiers. The full manual baseline contains eight expert rules, but the hybrid approach applies a Pareto-style selection process to choose a smaller, higher-value subset.

Third, the safety and risk explanations are fused into one explanation matrix, which initializes a TED-style support vector classifier. The model is trained not only to classify churn, but to align predictions with explanation codes. The paper also includes a default “Drift” state, Code 12, for customers who trigger neither a safety net nor a specific risk trap. That category matters because “no obvious safety” plus “no named risk” can itself be an operational signal. It is the model’s version of: nothing has broken yet, but do not relax.
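The fusion step can be pictured as a function that gives each customer exactly one explanation code. In the sketch below, only the Drift code (12) comes from the article. The other code numbers, the predicates, and the choice to check risk traps before safety nets are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Customer = Dict[str, float]
Rule = Tuple[int, Callable[[Customer], bool]]  # (explanation code, predicate)

DRIFT_CODE = 12  # named in the article as the default "Drift" state


def assign_explanation(customer: Customer,
                       risk_traps: List[Rule],
                       safety_nets: List[Rule]) -> int:
    """Return the first matching code; fall back to Drift if nothing fires."""
    # Assumption: an expert risk trap takes precedence over a safety net.
    for code, applies in risk_traps + safety_nets:
        if applies(customer):
            return code
    return DRIFT_CODE


# Hypothetical rules and codes.
safety_nets: List[Rule] = [(1, lambda c: c["tenure_months"] > 24)]
risk_traps: List[Rule] = [
    (4, lambda c: c["late_payments"] >= 2),
    (5, lambda c: c["support_tickets"] >= 3),
]

loyal = {"tenure_months": 36, "late_payments": 0, "support_tickets": 0}
late_payer = {"tenure_months": 36, "late_payments": 3, "support_tickets": 0}
drifting = {"tenure_months": 5, "late_payments": 0, "support_tickets": 1}

codes = [assign_explanation(c, risk_traps, safety_nets)
         for c in (loyal, late_payer, drifting)]
```

The resulting code column is what a TED-style classifier would be trained to predict alongside the churn label.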

The architecture can be summarized like this:

Component	Who creates it	Operational role	Why it matters
Safety Nets	Automated rule learner	Broad retention explanations	Captures dense, repeated patterns cheaply
Risk Traps	Human experts	Specific churn triggers	Captures sparse, actionable failure modes
Drift state	Framework design	Neither safe nor explicitly risky	Marks ambiguous pre-churn territory
TED-SVC classifier	Hybrid training process	Predicts label and explanation together	Makes explanation part of supervision, not post-hoc theater

This is the paper’s central mechanism. Not “AI replaces experts.” Not “experts supervise everything.” The mechanism is division of labor.

Machines cover the broad terrain. Humans place warning signs at cliffs.

Why four expert rules can beat eight

The paper’s most useful design move is the Pareto filter. The authors do not simply ask whether an expert rule sounds reasonable. That is how organizations end up with bloated policy manuals and fourteen approval buttons nobody understands. Instead, they evaluate the manual rules along two dimensions.

The first is coverage: does the rule capture enough of the churn population to matter? Rules covering less than 1% of the population are treated as candidates for automated handling rather than expert priority.

The second is orthogonality: does the rule capture something distinct, or is it redundant with other rules? The paper uses Jaccard similarity:

$$ J(A,B) = \frac{|A \cap B|}{|A \cup B|} $$

The selected “Golden Quartet” has low overlap: an average pairwise Jaccard similarity of 0.09 and a maximum pairwise similarity of 0.26. In plain business language, the four chosen rules do not keep pointing at the same problem with different labels. They cover different behavioral quadrants: financial risk, structural risk, interaction risk, and engagement risk.

That is the difference between expert knowledge and expert clutter.

An expert rule is valuable when it identifies a meaningful exception that the automated system would otherwise under-specify. It is less valuable when it merely decorates a pattern already captured elsewhere. The paper’s result is therefore not just “use fewer rules.” It is “use fewer non-redundant, high-coverage rules in the parts of the risk landscape where automation is weakest.”

That distinction matters. Otherwise, managers will read this paper and conclude that four is a magic number. Four is not magic. Four is the number that survived the paper’s coverage and uniqueness tests on this churn dataset. Please do not hold a meeting to rename your compliance framework “Golden Quartet.” The world has suffered enough.
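The two screening tests in this section can be sketched in a few lines: a coverage cut-off, then pairwise Jaccard similarity among the survivors. The 1% threshold comes from the article. The rule names and the customer-ID sets are made up, and the paper's full selection procedure may involve more than this.

```python
from itertools import combinations
from typing import Dict, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """|A intersect B| / |A union B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coverage(rule_hits: Set[int], population_size: int) -> float:
    """Share of the population that triggers the rule."""
    return len(rule_hits) / population_size


# Hypothetical expert rules -> IDs of customers who trigger them.
population_size = 1000
rules: Dict[str, Set[int]] = {
    "late_payment": set(range(0, 200)),
    "support_escalation": set(range(150, 400)),
    "monthly_contract": set(range(380, 700)),
    "rare_edge_case": set(range(990, 995)),  # 0.5% coverage
}

MIN_COVERAGE = 0.01  # the article's 1% cut-off
kept = [name for name, hits in rules.items()
        if coverage(hits, population_size) >= MIN_COVERAGE]

pairwise = {(a, b): jaccard(rules[a], rules[b]) for a, b in combinations(kept, 2)}
mean_overlap = sum(pairwise.values()) / len(pairwise)
max_overlap = max(pairwise.values())
```

A rule that fails the coverage test drops out before overlap is even measured, and a high `max_overlap` would flag two rules that are effectively describing the same customers.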

What the numbers actually support

The headline result is strong. The hybrid model with four expert rules reaches 94.00% Y+E accuracy, outperforming both the fully automated baseline and the full eight-rule manual TED baseline.

The paper reports the following efficiency frontier:

Model configuration	Expert rule count	Y+E accuracy	Interpretation
LRR, full automation	0	75.15%	Scalable, but weak alignment with risk explanations
Hybrid, behavioral trio	3	90.05%	Large gain from a small amount of expert input
Manual TED benchmark	8	92.90%	Strong but more expert-intensive
Hybrid, Golden 4	4	94.00%	Best reported configuration

The first jump is the most important. Moving from zero human rules to three behavioral rules raises Y+E accuracy from 75.15% to 90.05%. The fourth rule, related to monthly contract structure, pushes the hybrid model to 94.00%, surpassing the eight-rule manual benchmark by 1.1 percentage points.

That does not mean experts are irrelevant. It means the marginal value of expert input is not linear. The first few good rules carry much more value than the long tail of additional rules.
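The arithmetic behind those jumps is easy to reproduce from the table. The sketch below uses only the reported figures. The per-rule average for the first step is a convenience that treats the three behavioral rules as equally valuable, which the paper does not claim.

```python
# (configuration, expert rules, Y+E accuracy in %), as reported in the table above.
frontier = [
    ("LRR, full automation", 0, 75.15),
    ("Hybrid, behavioral trio", 3, 90.05),
    ("Hybrid, Golden 4", 4, 94.00),
]
manual_ted = ("Manual TED benchmark", 8, 92.90)

steps = []
for (_, rules_before, acc_before), (name, rules_after, acc_after) in zip(frontier, frontier[1:]):
    added = rules_after - rules_before
    gain = round(acc_after - acc_before, 2)          # percentage points
    steps.append((name, added, gain, round(gain / added, 2)))

# Gap between the four-rule hybrid and the eight-rule manual benchmark.
manual_gap = round(frontier[-1][2] - manual_ted[2], 2)
```

The first three rules add 14.90 points, the fourth adds 3.95, and the hybrid ends 1.10 points ahead of a benchmark that uses twice as many expert rules.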

The classification report for the four-rule hybrid is also useful for business interpretation:

Class	Precision	Recall	F1-score	Support
Stay	0.92	0.98	0.95	860
Churn	0.99	0.93	0.96	1140
Weighted average	0.96	0.95	0.95	2000

The churn precision of 0.99 is operationally meaningful. In retention campaigns, false positives are expensive. If the model wrongly flags many safe customers as likely churners, the company wastes discounts, support calls, and account-manager attention. High precision means the model is selective.

The churn recall of 0.93 means the model still captures most churners. It is not simply avoiding false positives by refusing to flag anyone. That balance is what makes the result commercially interesting: a retention team wants both budget discipline and broad risk capture.
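For readers who want the definitions behind those two numbers, here is a minimal sketch that computes precision and recall from raw counts and attaches a cost to each kind of error. The counts and the cost per offer are hypothetical and are not taken from the paper.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical campaign: customers flagged as churn risks.
tp = 90   # flagged and really at risk
fp = 10   # flagged but would have stayed anyway
fn = 30   # real churners the model missed

precision, recall = precision_recall(tp, fp, fn)

cost_per_offer = 20.0                # hypothetical retention offer cost
wasted_spend = fp * cost_per_offer   # money spent on customers who were never leaving
missed_churners = fn                 # churners who received no intervention
```

Precision controls `wasted_spend`; recall controls `missed_churners`. A retention team usually cares about both, which is why the pairing in the report matters more than either number alone.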

Still, one nuance should not be lost. The paper’s main comparison uses Y+E accuracy, meaning combined performance over labels and explanations. That is the correct metric for a TED-style framework, but it is not the same as ordinary predictive accuracy in a conventional churn model. A business reader should not compare the 94.00% directly against every random churn model in a vendor deck. The question here is narrower and more valuable: can the model predict while staying aligned with a structured explanation system?

On that question, the paper’s evidence is persuasive within its experimental setting.
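To see why Y+E accuracy is a stricter yardstick than label accuracy, consider the sketch below. It assumes a prediction counts as correct only when both the label and the explanation code match, which is one natural reading of "combined performance". The toy predictions are invented.

```python
from typing import Sequence


def accuracy(pred: Sequence, true: Sequence) -> float:
    """Fraction of positions where prediction equals ground truth."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def ye_accuracy(y_pred: Sequence, e_pred: Sequence,
                y_true: Sequence, e_true: Sequence) -> float:
    """Assumed definition: correct only if label AND explanation both match."""
    return accuracy(list(zip(y_pred, e_pred)), list(zip(y_true, e_true)))


# Toy data: five customers with hypothetical explanation codes.
y_true = ["stay", "stay", "churn", "churn", "churn"]
e_true = [1, 2, 4, 5, 12]

y_pred = ["stay", "stay", "churn", "churn", "stay"]   # one wrong label
e_pred = [1, 2, 4, 12, 1]                             # one extra wrong explanation

label_acc = accuracy(y_pred, y_true)
joint_acc = ye_accuracy(y_pred, e_pred, y_true, e_true)
```

Under this definition the joint score can never exceed the label-only score, so a Y+E figure and a conventional accuracy figure are not interchangeable.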

The Anna Karenina idea explains the asymmetry

The paper’s best phrase is the “Anna Karenina Principle of Churn.” Satisfied customers, like Tolstoy’s happy families, may resemble one another more than unhappy customers do. Retained customers form broad, dense behavioral patterns. Churning customers leave through scattered, specific pathways.

That asymmetry explains why automated rule learning naturally finds safety nets. It optimizes for simple, broad structures. If many retained customers share similar stable behaviors, the algorithm sees them. But churn triggers may be sparse and heterogeneous: payment delinquency here, support frustration there, contract structure somewhere else. Each one matters, but none may dominate the whole dataset cleanly enough for automated discovery to elevate it as a simple global rule.

This is also why the result is more interesting than a standard “human-in-the-loop improves AI” story. The loop is not generic. It is targeted.

Human input is not sprinkled over the model like regulatory parsley. It is concentrated where the data geometry makes automation weak: sparse disjoint risk traps.

That mechanism transfers conceptually to several business domains:

Domain	Machine-discovered “safety” patterns	Expert-defined “risk traps”
Customer churn	Stable usage, long tenure, regular payments	Payment delay, support escalation, contract downgrade
Credit risk	Consistent income, clean repayment history	Sudden utilization spike, employer instability, fraud indicators
Insurance	Ordinary claim behavior, stable policy profile	Suspicious claim timing, unusual provider pattern
Fraud detection	Normal transaction routines	Rare but high-risk behavioral combinations
B2B account health	Product adoption, routine engagement	Executive sponsor departure, unresolved implementation blocker

These are inferences for business use, not results directly tested in the paper. The paper tests a churn-style dataset using rule-binarized tabular data. But the design principle is broader: when negative outcomes are sparse and heterogeneous, expert effort should be allocated to the long-tail risks, not the obvious normal cases.

The business value is explanation-budget allocation

The practical lesson is not “replace expert judgment with rule learning.” That would be a cartoon version of the paper. The practical lesson is that explanation work has a budget, and most organizations spend it badly.

They ask experts to write exhaustive rules. Experts respond by writing rules for cases they remember, cases they fear, and cases that recently caused embarrassment. Some of those rules are useful. Others are cognitive scar tissue. The authors call this tendency “cognitive overfitting”: experts may overemphasize rare events and create complex “anxiety rules” while missing broad statistical regularities.

The hybrid framework offers a cleaner operating model.

Start with automated discovery to map broad regularities. Use that map to identify where the model has good coverage and where it does not. Then ask experts to define rules only for high-value gaps. Finally, train the model so explanations are part of the supervised target, not an after-the-fact narrative.

For a company building explainable decision systems, this changes the implementation workflow:

Old workflow	Hybrid workflow
Ask experts to define a complete rulebook	Let automation discover broad rule structure
Treat every expert rule as equally plausible	Filter expert rules by coverage and redundancy
Explain predictions after the model is trained	Train the model jointly on prediction and explanation
Measure accuracy alone	Measure outcome and explanation alignment
Add more rules when stakeholders feel nervous	Add rules only when they capture distinct risk traps

The ROI is not only lower annotation cost. It is lower cognitive cost. Fewer rules make the system easier to audit, easier to explain, and easier to maintain. In model governance, a shorter high-quality explanation vocabulary is often more valuable than a sprawling taxonomy that nobody can consistently apply.

There is also a retention-budget angle. A churn model with high precision can help avoid wasteful interventions. If the model flags churn risk with fewer false positives, retention resources can be directed toward customers who actually require attention. The paper’s reported churn precision of 0.99 is therefore not just a statistical nicety. It connects directly to campaign economics.

But that connection remains an inference. The paper does not run a live retention campaign, estimate discount savings, or measure downstream customer lifetime value. It shows a model architecture and experimental performance. Business value would need to be validated in deployment.

The evidence table is a benchmark; the framework is the real claim

It is tempting to treat the 94.00% result as the whole story. That would be convenient and slightly lazy, so naturally it will happen in many slide decks.

A better reading separates the paper’s tests by purpose:

Paper element	Likely purpose	What it supports	What it does not prove
LRR automated baseline at 75.15%	Main comparison	Automation alone struggles to align with full explanation needs	That all automated rule learners fail in all churn settings
Three-rule hybrid at 90.05%	Ablation-like efficiency test	A small amount of targeted domain input produces a large gain	That exactly three rules are generally sufficient
Manual TED at 92.90%	Benchmark against expert-heavy system	Exhaustive manual rules are not necessarily optimal	That human expertise is inferior
Four-rule hybrid at 94.00%	Main evidence for hybrid design	Selected expert exceptions plus automated safety rules can outperform the manual baseline	That hybrid XAI will beat all black-box or deep models
Precision/recall report	Operational interpretation	Churn predictions are selective and still broad enough to matter	That retention interventions will improve profit
Class imbalance discussion	Boundary condition	Real-world rarity may weaken automated safety discovery	That the framework is robust under rare-event churn

The distinction matters because the paper is short and focused. It demonstrates a promising mechanism. It does not close every question a production deployment would raise.

That is not a weakness if read correctly. It is a warning against using the paper as a universal recipe.

Where this result should not be over-sold

The first boundary is class balance. The experimental churn sample is relatively balanced, with churn around 56%. Many real production churn problems are not like that. In some businesses, churn events may be rare relative to the active customer base. The authors explicitly note that in extreme imbalance, automated discovery could degrade into something close to an “always stay” logic. In that case, experts may need to define much more of the churn logic manually.

This is a major caveat. The Anna Karenina mechanism depends on the relationship between dense safety patterns and sparse risk patterns. If the data distribution changes radically, the division of labor may shift too.
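A quick calculation shows why an “always stay” rule becomes tempting as churn gets rarer. The 56% figure is the article's; the 2% scenario and the population size are hypothetical.

```python
from typing import Dict


def always_stay_baseline(n_customers: int, churn_rate: float) -> Dict[str, float]:
    """Metrics for a degenerate model that predicts 'stay' for everyone."""
    n_churn = round(n_customers * churn_rate)
    return {
        "accuracy": (n_customers - n_churn) / n_customers,
        "churn_recall": 0.0,  # no churner is ever flagged
    }


balanced = always_stay_baseline(10_000, 0.56)  # roughly the paper's sample
rare = always_stay_baseline(10_000, 0.02)      # hypothetical production setting
```

At 56% churn the degenerate model is visibly bad. At 2% churn it scores high accuracy while catching no churners at all, which is the regime where experts would have to supply most of the churn logic themselves.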

The second boundary is data type. The framework relies on binarized tabular features and Boolean-style rule structure. That makes sense for churn, finance, telecom, insurance, and other structured decision domains. It is less obviously suitable for images, free-text sentiment, audio, or other perceptual domains where binarization can destroy important information.

The third boundary is causality. A safety net may be predictive without being causal. “High usage” may correlate with retention, but forcing usage through a clumsy campaign may not create loyalty. The authors mention causal inference as future work, and that is exactly right. Business teams should not treat every discovered safety rule as a lever. Some rules describe customers; others describe interventions.

The fourth boundary is effort measurement. The paper discusses reduced annotation effort through fewer human rules, especially the move from eight manual rules to four selected rules. That is a reasonable proxy, but real expert effort depends on more than rule count. It includes rule discovery workshops, validation meetings, compliance review, data engineering, monitoring, and periodic recalibration. Anyone who converts “50% fewer rules” into “50% lower project cost” deserves a spreadsheet intervention.

The useful lesson: make experts exception handlers

The strongest idea in the paper is not that four rules are better than eight. The strongest idea is that expert labor should be redeployed.

Experts should not be asked to describe the whole world. That is both expensive and philosophically suspicious. The whole world keeps changing; experts have meetings.

Instead, automated systems should first discover the broad structures. Then experts should audit the blind spots and define high-leverage exceptions. The final model should be trained so its explanations are part of its behavior, not a decorative PDF generated after the decision.

For Cognaptus readers, especially those thinking about business automation, this is the design pattern worth remembering:

Use automation to map what is common.
Use experts to name what is dangerous.
Filter expert input for coverage and uniqueness.
Train prediction and explanation together.
Validate the whole setup under the actual class balance and cost structure of the business.

That is a more mature version of Human-in-the-Loop AI. The human is not a moral ornament. The human is a scarce diagnostic resource.

The paper gives us a compact example of what scalable explainability could look like in structured business decisions: automated safety nets, expert risk traps, and a model trained to respect both. It does not solve every XAI problem. It does something more useful: it shows that the bottleneck may not be expert knowledge itself, but the naive way we ask experts to provide it.

So yes, experts should stop writing all the rules.

They should write the few rules that matter.

Cognaptus: Automate the Present, Incubate the Future.

¹ Lawrence Krukrubo, Julius Odede, and Olawande Olusegun, “Augmenting Intelligence: A Hybrid Framework for Scalable and Stable Explanations,” arXiv:2512.19557, December 2025. https://arxiv.org/abs/2512.19557
