Veto Later, Repair First

TL;DR for operators

Most decision systems treat hard constraints like a trapdoor. Candidate violates requirement, candidate disappears. Efficient, clean, and occasionally absurd.

The paper behind Repair-Augmented Constraint Learning, or RACL, argues that this is the wrong semantics for systems that already know how to modify an option before showing it to the user.¹ A flight missing a checked bag, a hotel missing breakfast, a product bundle missing an accessory, or a schedule slot needing a resource adjustment may not be a bad option. It may be a good option one repair away from being acceptable.

RACL’s move is simple but important: before vetoing a candidate, enumerate known repair operators, check whether each repair is applicable and affordable under the context budget, score the repaired candidate, and then decide. The output is not merely accept or reject. It also returns a structured credit category and, when accepted through repair, a repair plan.

The paper’s theoretical result is that terminal-veto systems have a structural blind spot: on the region of candidates that are initially infeasible but repairable into good options, no-repair semantics reject all of them. That is a conditional false-veto rate of 1.0 by construction, not because the model is undertrained, tired, or in need of another dashboard.

The empirical result is more nuanced. In controlled and schema-calibrated benchmarks, RACL removes false vetoes and recovers credit and repair plans cleanly. In the hardest raw DB1B-derived semi-synthetic tier, validation-selected RACL reduces false vetoes to 10 out of 4039 repairable-good test cases, or FVR 0.0025. But it does not win every metric. The strongest repair-search black-box baseline has lower effective decision regret, while RACL targets a much lower false-veto rate. That is not a bug. It is the operating choice the paper wants readers to notice.

For business use, the relevant question is not “Is RACL more accurate?” That is the laziest possible question, and naturally the one many dashboards would ask first. The better question is: “How much does it cost us to reject something we could have fixed?”

A rejected option is not always a bad option

Consider a travel recommender. A customer wants a short connection, refundable fare, and baggage included. One ticket violates the baggage requirement. A terminal-veto system rejects it. Another system notices that baggage can be added within the user’s budget, the final price still meets the customer’s threshold, and the route remains attractive. It accepts the repaired option and returns the plan: add checked bag.

That distinction looks operationally small. It is not.

A standard constraint pipeline usually separates the world into two roles. Constraints eliminate. Preferences rank. This is tidy when candidates are immutable. It is brittle when the platform can act.

RACL names the missing middle: the repair-before-veto problem. The input is a candidate, a context, and a finite repair ontology. The context includes a budget and an acceptability threshold. The repair ontology includes known operators with observable effects, applicability conditions, and costs. The output includes a decision, a credit category, and sometimes a selected repair.

This is narrower than “the AI discovers arbitrary fixes.” Good. Narrow is where deployable systems tend to live after the demo fog clears. Airlines know which ancillary services can be added. Retailers know which bundles can be completed. Configuration systems know which components can be swapped. Scheduling platforms know which rooms, people, or resources can be moved. The repair menu is often not mysterious. The question is whether the decision rule uses it before throwing the candidate away.

The paper’s main contribution is not a clever classifier that wins a benchmark. It is a change in what the classifier is allowed to mean.

The mechanism is repair enumeration before rejection

The core RACL rule is compact.

A candidate $x$ is evaluated in context $\psi$. The context contains a budget $b(\psi)$, an acceptability threshold $\tau(\psi)$, and possibly persona or user features. A finite repair ontology $R = \{r_1,\ldots,r_K\}$ contains the identity repair plus structured repair operators. A repair changes the candidate, has a cost, and may restore feasibility.

RACL first constructs the admissible repair set:

$$ A(x,\psi)=\{r \in R : r(x) \models \phi_{\psi},\ \rho(r,x,\psi)\leq b(\psi)\}. $$

Then it computes the best post-repair value:

$$ V^0(x,\psi)=\max_{r\in A(x,\psi)} f(r(x)), $$

with the usual convention that the maximum over an empty set is negative infinity. The decision is:

$$ y(x,\psi)=\mathbf{1}[V^0(x,\psi)\geq \tau(\psi)]. $$

In plain language: find all known repairs that make the candidate feasible and affordable, score the repaired versions, and accept only if the best repaired candidate clears the contextual threshold.
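The rule above is short enough to sketch directly. What follows is a minimal illustration, not the paper's implementation: the `Repair` and `Context` classes, the toy score, the prices, and the baggage example are all assumptions made here to show the shape of the computation.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Candidate = dict  # assumption: a candidate is a plain attribute dict


@dataclass(frozen=True)
class Repair:
    """One operator r in the finite ontology R (field names are illustrative)."""
    name: str
    apply: Callable[[Candidate], Candidate]  # r(x)
    cost: Callable[[Candidate], float]       # rho(r, x, psi); context dependence omitted


@dataclass(frozen=True)
class Context:
    budget: float                            # b(psi)
    threshold: float                         # tau(psi)
    feasible: Callable[[Candidate], bool]    # does a candidate satisfy phi_psi?


def racl_decide(x, ctx, repairs, score) -> Tuple[bool, Optional[str], float]:
    """Return (accept, selected repair name, best post-repair value)."""
    best_value, best_name = -math.inf, None  # max over the empty set is -inf
    for r in repairs:
        if r.cost(x) > ctx.budget:           # budget gate
            continue
        repaired = r.apply(x)
        if not ctx.feasible(repaired):       # the repair must restore feasibility
            continue
        value = score(repaired)
        if value > best_value:
            best_value, best_name = value, r.name
    return best_value >= ctx.threshold, best_name, best_value


# Toy travel example: the ticket lacks a checked bag, which costs 35 to add.
identity = Repair("identity", lambda x: x, lambda x: 0.0)
add_bag = Repair(
    "add_checked_bag",
    lambda x: {**x, "bag": True, "price": x["price"] + 35},
    lambda x: 35.0,
)
ontology = [identity, add_bag]


def toy_score(x: Candidate) -> float:
    return 1.0 - x["price"] / 500            # made-up preference score


def needs_bag(x: Candidate) -> bool:
    return x["bag"]


flight = {"price": 200, "bag": False}
with_budget = racl_decide(flight, Context(50.0, 0.5, needs_bag), ontology, toy_score)
tight_budget = racl_decide(flight, Context(20.0, 0.5, needs_bag), ontology, toy_score)
```

With a budget of 50 the sketch accepts the flight and returns `add_checked_bag` as the plan. With a budget of 20 the admissible set is empty, the best value stays at negative infinity, and the candidate is rejected.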

Notice what is not happening. RACL is not merely lowering the penalty for a violation. It is not saying, “Well, the constraint was broken, but the score was high, so let us pretend.” The repair must restore feasibility. The cost must fit the context budget. The repaired candidate must still be preferred enough.

That is why the paper is careful to separate RACL from soft constraints and post-hoc recourse. Soft constraints let violation and value trade against each other. Recourse usually explains how to change an input after a model has rejected it. RACL moves the repair step inside the decision semantics before rejection occurs.

This is the difference between a bouncer and a service desk. The bouncer says no. The service desk says, “You are missing one document; upload it now and the application can proceed.” Revolutionary? No. Operationally useful? Annoyingly often.

Credit categories make the decision auditable

Binary labels are too blunt for repair-aware systems. RACL therefore returns structured credit categories.

The taxonomy includes accepted-already-good, accepted-repairable-good, rejected-non-repairable, rejected-repairable-over-budget, rejected-feasible-suboptimal, and rejected-repairable-suboptimal. The evaluation merges some of these into broader buckets, but the business logic is the point: rejection has causes.

Outcome type	What it means	Operational next step
Accepted-already-good	Candidate already satisfies requirements and clears the threshold	Present normally
Accepted-repairable-good	Candidate becomes feasible and valuable after an affordable repair	Present with repair plan or adjusted offer
Rejected-non-repairable	No known repair restores feasibility	Stop or escalate to broader search
Rejected-repairable-over-budget	Repair exists but exceeds the context budget	Offer upgrade, ask for budget flexibility, or reject
Rejected-feasible-suboptimal	Candidate is feasible but not valuable enough	Rank lower or suppress
Rejected-repairable-suboptimal	Repair exists, but repaired version still fails preference threshold	Do not waste repair action

This is where RACL becomes more than a classifier. It becomes a routing layer.

A rejected-non-repairable candidate should not be treated like a rejected-repairable-over-budget candidate. One says the option cannot be fixed with the known tools. The other says it can be fixed, but not under the current budget. Those are different commercial conversations. Collapsing them into “0” is cheaper only if nobody downstream cares why opportunities disappear. Which, in a real business, is a brave assumption.
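One way to picture the routing layer is a function that maps what the repair search found to a category and a next step. This is a sketch under assumptions: the order in which the checks run, the tuple layout, and the treatment of a feasible candidate that only clears the threshold after a repair are choices made here, not details confirmed by the paper. The next-step strings are taken from the table above.

```python
from enum import Enum
from typing import Optional, Sequence, Tuple


class Credit(Enum):
    ACCEPTED_ALREADY_GOOD = "accepted-already-good"
    ACCEPTED_REPAIRABLE_GOOD = "accepted-repairable-good"
    REJECTED_NON_REPAIRABLE = "rejected-non-repairable"
    REJECTED_REPAIRABLE_OVER_BUDGET = "rejected-repairable-over-budget"
    REJECTED_FEASIBLE_SUBOPTIMAL = "rejected-feasible-suboptimal"
    REJECTED_REPAIRABLE_SUBOPTIMAL = "rejected-repairable-suboptimal"


# One evaluated non-identity repair (hypothetical layout):
# (name, repaired candidate is feasible, repair cost, post-repair score)
EvaluatedRepair = Tuple[str, bool, float, float]

NEXT_STEP = {
    Credit.ACCEPTED_ALREADY_GOOD: "Present normally",
    Credit.ACCEPTED_REPAIRABLE_GOOD: "Present with repair plan or adjusted offer",
    Credit.REJECTED_NON_REPAIRABLE: "Stop or escalate to broader search",
    Credit.REJECTED_REPAIRABLE_OVER_BUDGET: "Offer upgrade, ask for budget flexibility, or reject",
    Credit.REJECTED_FEASIBLE_SUBOPTIMAL: "Rank lower or suppress",
    Credit.REJECTED_REPAIRABLE_SUBOPTIMAL: "Do not waste repair action",
}


def assign_credit(
    original_feasible: bool,
    original_score: float,
    repairs: Sequence[EvaluatedRepair],
    budget: float,
    threshold: float,
) -> Tuple[Credit, Optional[str]]:
    """Return (credit category, selected repair name or None)."""
    if original_feasible and original_score >= threshold:
        return Credit.ACCEPTED_ALREADY_GOOD, None

    fixing = [r for r in repairs if r[1]]             # restores feasibility
    affordable = [r for r in fixing if r[2] <= budget]
    good = [r for r in affordable if r[3] >= threshold]
    if good:
        # Assumption: any acceptance that needs a repair lands in this category.
        best = max(good, key=lambda r: r[3])
        return Credit.ACCEPTED_REPAIRABLE_GOOD, best[0]

    if original_feasible:
        return Credit.REJECTED_FEASIBLE_SUBOPTIMAL, None
    if not fixing:
        return Credit.REJECTED_NON_REPAIRABLE, None
    if not affordable:
        return Credit.REJECTED_REPAIRABLE_OVER_BUDGET, None
    return Credit.REJECTED_REPAIRABLE_SUBOPTIMAL, None


bag = [("add_checked_bag", True, 35.0, 0.53)]
accepted = assign_credit(False, 0.0, bag, budget=50.0, threshold=0.5)
over_budget = assign_credit(False, 0.0, bag, budget=20.0, threshold=0.5)
```

The same candidate and the same repair produce two different rejections or acceptances depending only on the budget, which is exactly the distinction a bare 0 or 1 erases.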

Terminal veto has a structural false-veto gap

The paper’s cleanest theoretical point is also its most operationally uncomfortable one.

Define the repairable-good region as cases where the original candidate is infeasible, but some known repair is affordable and makes the repaired candidate valuable enough:

$$ G_{\text{repair}}=\{(x,\psi): x \not\models \phi_{\psi},\ \exists r\in R,\ \rho(r,x,\psi)\leq b(\psi),\ f(r(x))\geq \tau(\psi)\}. $$

A no-repair rule whose acceptance requires the original candidate to satisfy the hard requirement will reject every case in this region. Therefore its conditional false-veto rate on $G_{\text{repair}}$ is 1.

This is not a claim that the no-repair model has a global error rate of 100 percent. The paper is precise about that. The claim is conditional: among candidates that are initially infeasible but repairable into acceptable options, terminal-veto semantics fail by definition.

That matters because many business metrics blur this away. A model may look fine on overall accuracy while systematically discarding the exact class of candidates where the platform could have acted. False vetoes are not just false negatives. They are lost repairable opportunities.
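The conditional claim can be written out in one line. As a sketch, write $y_{\text{NR}}$ for any no-repair rule that accepts only if the original candidate satisfies the hard requirement (the subscript is notation introduced here for illustration, not the paper's), and assume $G_{\text{repair}}$ has positive probability:

$$ (x,\psi)\in G_{\text{repair}} \;\Rightarrow\; x \not\models \phi_{\psi} \;\Rightarrow\; y_{\text{NR}}(x,\psi)=0, \qquad\text{hence}\qquad \Pr\big[\,y_{\text{NR}}(x,\psi)=0 \mid (x,\psi)\in G_{\text{repair}}\,\big]=1. $$

No parameter of the scoring model appears anywhere in that chain, which is why more training data does not move the number.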

The second theoretical obstacle is identifiability. Binary labels alone cannot tell you whether a rejection came from non-repairability, over-budget repair, or insufficient post-repair value. A repairable-over-budget violation and a non-repairable violation can induce the same label at a fixed budget. Preference scale is also not identifiable from binary labels alone because affine transformations of score and threshold can preserve the label pattern.

The paper’s answer is not magical supervision. It requires contextual budget variation and feasible anchors. The model needs enough structure to separate “cannot fix,” “can fix but too expensive,” and “can fix but still not worth it.” The data has to contain the distinctions the decision rule is supposed to learn. Apparently the universe still refuses to provide semantic credit for free.
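Both identifiability obstacles can be seen in a toy setting. The sketch below is an illustration with made-up numbers, assuming an infeasible candidate with at most one feasibility-restoring repair; the helper `label` is hypothetical and only mimics the repair-before-veto rule.

```python
def label(repair_cost, post_repair_score, budget, threshold):
    """Binary label for an infeasible candidate with at most one fixing repair.

    repair_cost=None means no known repair restores feasibility.
    """
    if repair_cost is None or repair_cost > budget:
        return 0
    return int(post_repair_score >= threshold)


non_repairable = dict(repair_cost=None, post_repair_score=0.0)
over_budget = dict(repair_cost=50.0, post_repair_score=0.8)
cases = (non_repairable, over_budget)

# At one fixed budget the two causes of rejection are indistinguishable.
labels_fixed = [label(**c, budget=30.0, threshold=0.5) for c in cases]

# With a larger budget in the data, the labels separate.
labels_varied = [label(**c, budget=60.0, threshold=0.5) for c in cases]

# Preference scale: a positive affine map applied to both score and
# threshold leaves every binary label unchanged.
a, shift = 3.0, -1.0
scores, tau = [0.2, 0.5, 0.8], 0.5
original = [int(s >= tau) for s in scores]
rescaled = [int(a * s + shift >= a * tau + shift) for s in scores]
```

Here `labels_fixed` is `[0, 0]` while `labels_varied` is `[0, 1]`, and `original` equals `rescaled`. Budget variation is what breaks the first tie; feasible anchors are what pin down the scale in the second.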

The algorithm is simple because the assumptions do real work

RACL’s algorithm is not trying to discover the repair universe from scratch.

At decision time, it enumerates the known repairs. For each repair, it checks whether the repaired candidate satisfies the hard requirements and whether the repair cost fits the context budget. If no repair is admissible, it rejects and assigns either a structure credit (no known repair restores feasibility) or a budget credit (a repair exists but costs too much). If repairs are admissible, it selects the repair with the highest learned preference score. It accepts only if the guarded calibrated score of that repaired candidate clears the threshold.

That “known repair ontology” assumption is load-bearing. The system knows the repair operators, their observable effects, and their costs. In the experiments, hard requirements and repair costs are observed. Under a shared-weight linear preference model, the paper gives a pseudo-dimension bound of roughly $\tilde{O}(p\log K)$, where $p$ is the number of preference parameters and $K$ is the number of repairs. The finite repair max adds only logarithmic combinatorial factors under the observed-feasibility setup.

This is why the mechanism stays tractable. RACL is not solving every possible intervention problem. It is choosing among a finite menu of repairs. That makes it much closer to enterprise systems that already have catalogs, fare rules, entitlement logic, configuration constraints, and service bundles.

The paper also adds a validation-selected guard because calibration matters near thresholds. Raw candidate distributions can make affine score calibration unstable. The guard caps unstable scaling and score span relative to threshold span. The selected operating point is chosen on validation data by false-veto rate first, then effective decision regret and AUROC.

That priority order is important. If you optimize for AUROC first, you are answering a different question. AUROC asks about ranking quality over thresholds. RACL’s target problem asks whether repairable-good options are being thrown away.
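The priority order amounts to a lexicographic comparison. A minimal sketch follows, assuming strict tie-breaking and using invented validation numbers and guard names; the paper's actual grid and any tolerance it applies to ties are not described in this article.

```python
from typing import NamedTuple, Sequence


class OperatingPoint(NamedTuple):
    guard: str     # hypothetical guard setting name
    fvr: float     # false-veto rate on validation data
    edr: float     # effective decision regret on validation data
    auroc: float


def select_operating_point(points: Sequence[OperatingPoint]) -> OperatingPoint:
    """Lowest FVR first, then lowest EDR, then highest AUROC."""
    return min(points, key=lambda p: (p.fvr, p.edr, -p.auroc))


# Made-up validation results, for illustration only.
validation_points = [
    OperatingPoint("no_guard", 0.140, 0.054, 0.963),
    OperatingPoint("cap_2x", 0.004, 0.057, 0.930),
    OperatingPoint("cap_4x", 0.004, 0.055, 0.925),
]

chosen = select_operating_point(validation_points)
auroc_first = max(validation_points, key=lambda p: p.auroc)
```

On these invented numbers the FVR-first rule picks `cap_4x`, while an AUROC-first rule would pick `no_guard`. Same candidates, different question, different answer.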

The experiments test semantics, not just prediction

The paper’s experiments are organized across four tiers: Synthetic-MAXSAT, Expedia-schema, DB1B-schema, and DB1B-derived. The first three are controlled or schema-calibrated settings. The final tier uses real DB1B ticket-record distributions from 2016Q3, then injects ancillary repair attributes, contexts, labels, and credit categories. The paper is explicit that the raw DB1B-derived tier is not a natural airline repair-choice dataset; it is a controlled stress test on a real ticket distribution.

That distinction matters. The evidence is not “airline users behave this way in the wild.” The evidence is “on a real ticket distribution with injected repair semantics, the method behaves as predicted under a harder distribution.” That is still useful. It is not the same claim.

Evidence component	Likely purpose	What it supports	What it does not prove
Synthetic-MAXSAT	Main controlled evidence	RACL can close the false-veto gap and recover structure when the repair task is clean	Real-world adoption or robustness to messy catalogs
Expedia-schema	Main schema-calibrated evidence	Repair semantics work in a hotel-like schema setting	Natural Expedia user behavior
DB1B-schema	Main schema-calibrated ticket evidence	The no-repair gap persists in ticket-field settings	Natural airline ancillary decision data
Raw DB1B-derived tier	Main stress test	RACL remains effective on real ticket-record distributions with injected repair fields	Fully natural repair-choice generalization
BlackBox, CreditHead, RepairSearch baselines	Comparison with alternative designs	Separates label fitting, credit supervision, repair search, and RACL semantics	That RACL dominates every black-box metric
Validation guard stability	Robustness/sensitivity check	Guard choice is stable across validation splits	That guard selection is unnecessary
Guard-grid and noise stress tests	Robustness/sensitivity tests	Low-FVR behavior is not a single test-tuned point; moderate noise is tolerated	Robustness to incomplete repair libraries
Credit dependency analysis	Diagnostic/implementation evidence	Credit comes from the repair ontology, not from magic label inference	That RACL uniquely discovers credit without structure

The most useful part of this design is that it refuses to compare RACL only against weak baselines. BlackBox+RepairSearch receives the same repair ontology and repair search space as RACL, but uses a black-box label model to score repaired candidates. That is a strong comparison because it asks whether the RACL decision semantics and calibration matter when ontology access is held constant.

They do, but not in the cartoonishly convenient way.

The raw DB1B-derived tier is an operating trade-off, not a coronation

In the controlled and schema-calibrated tiers, the story is clean. HASSLE-style NoRepair has FVR 1.000 on repairable-good cases. RACL reduces FVR to 0.000 in Synthetic-MAXSAT, Expedia-schema, and DB1B-schema, with CreditAcc and PlanAcc reported as 1.000 at those RACL operating points. BlackBox can fit labels in some cases, but it does not recover credit.

The raw DB1B-derived tier is more interesting because it is less polite.

Model	Repair search	Credit labels	FVR	False vetoes out of 4039	EDR	CreditAcc	PlanAcc	AUROC
HASSLE-style NoRepair	No	No	1.0000	4039	0.1591	0.4162	—	0.4276
SoftPenalty	No search	No	0.3325	1343	0.0870	0.4162	—	0.8694
BlackBox	No search	No	0.1129	456	0.0549	0.4162	—	0.9828
BlackBox+RepairSearch	Yes	No	0.2633	about 1064	0.0296	1.0000	1.0000	0.9787
BlackBox+CreditHead	No search	Yes	0.1129	456	0.0549	0.9671	—	0.9828
RACL-Full	Yes	No	0.1399	565	0.0540	1.0000	1.0000	0.9632
RACL-ValidatedGuard	Yes	No	0.0025	10	0.0563	1.0000	1.0000	0.9252
Oracle-RACL	Yes	n/a	0.0000	0	0.0000	1.0000	1.0000	0.9978

The headline is RACL-ValidatedGuard: 10 false vetoes among 4039 repairable-good cases, while preserving perfect structured credit and repair-plan accuracy in this test. Compared with BlackBox+RepairSearch at about 1064 false vetoes, that is a large reduction in the specific failure mode RACL is built to target.

But the table also says something inconvenient, which makes it more credible. BlackBox+RepairSearch has lower effective decision regret: 0.0296 versus RACL-ValidatedGuard’s 0.0563. BlackBox and BlackBox+CreditHead also have higher AUROC than RACL-ValidatedGuard.

So the result is not “RACL beats everything.” The result is sharper: RACL can be tuned to aggressively avoid false vetoes, at the cost of some regret and ranking performance. The validation guard explicitly chooses FVR first. The AUROC drop from RACL-Full to RACL-ValidatedGuard is not an embarrassing accident; it is the price of selecting a low-false-veto operating point.

This is precisely the business decision. If rejecting a repairable-good option is expensive — lost sale, failed booking, abandoned application, poor customer experience, unnecessary escalation — then the RACL operating point may be preferred. If suboptimal ranking among accepted options is more expensive, a different operating point or even a repair-search black box may be better.

The paper reports an application-level cost sweep in the main text: RACL is preferred to BlackBox+RepairSearch whenever the ratio of false-veto cost to regret cost exceeds 0.102. Since the detailed supplement is not part of the visible main PDF, this should be read as the paper’s reported operating threshold, not as something independently reconstructed here.

Still, the framing is useful: choose the model by the cost of the error mode, not by a universal leaderboard metric. Radical stuff, apparently.
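As a rough consistency check, and not a reconstruction of the paper's sweep, one can plug the table's FVR and EDR values into a simple linear cost model. The linear form and the function name below are assumptions made here; the paper's supplement may define the cost differently.

```python
# FVR and EDR as listed in the raw DB1B-derived table above.
racl_guard = {"fvr": 0.0025, "edr": 0.0563}
bb_repair_search = {"fvr": 0.2633, "edr": 0.0296}


def expected_cost(model: dict, c_false_veto: float, c_regret: float) -> float:
    """Assumed linear cost: weight each error mode by its unit cost."""
    return c_false_veto * model["fvr"] + c_regret * model["edr"]


# RACL is cheaper when c_fv * FVR_racl + c_r * EDR_racl < c_fv * FVR_bb + c_r * EDR_bb,
# i.e. when c_fv / c_r exceeds the ratio of the EDR gap to the FVR gap.
breakeven = (racl_guard["edr"] - bb_repair_search["edr"]) / (
    bb_repair_search["fvr"] - racl_guard["fvr"]
)

racl_wins_at_0_2 = expected_cost(racl_guard, 0.2, 1.0) < expected_cost(bb_repair_search, 0.2, 1.0)
racl_wins_at_0_05 = expected_cost(racl_guard, 0.05, 1.0) < expected_cost(bb_repair_search, 0.05, 1.0)
```

Under this assumed model the breakeven ratio comes out at roughly 0.102, in line with the figure the paper reports. That agreement is suggestive rather than confirmatory, since the paper's own cost definition is not visible here.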

Black boxes can rank well and still fail the workflow

The raw results show why generic classifiers can look impressive while failing the job.

The label-only BlackBox reaches AUROC 0.9828 and lowers FVR to 0.1129. That looks strong until you ask for structured credit or a repair plan. CreditAcc remains 0.4162, the same as the no-repair baseline in the table. The model predicts labels; it does not explain whether the candidate is impossible, too expensive to fix, fixable but still poor, or repairable-good.

BlackBox+CreditHead improves credit accuracy to 0.9671 because it receives explicit credit labels. But it still does not output repair plans. BlackBox+RepairSearch does output credit and plans because it has the same repair ontology and search space as RACL. This is the key diagnostic: the ontology is doing serious work.

RACL does not uniquely “discover” credit from binary labels. The paper says this plainly. RACL derives credit from the repair ontology and decision semantics. A black-box model wrapped in the same search can also derive credit and plans. The meaningful difference is where the score and calibration enter the repair decision.

For enterprise readers, this is the lesson: do not confuse high AUROC with operational completeness. A ranking model can be statistically elegant and still useless as a process controller if it cannot say what action should happen next.

The business value is fewer wasted rejections, not prettier explanations

The practical pathway from RACL to business use is clearest in domains where three conditions hold.

First, candidates are often near-feasible. The rejected option is not garbage; it is missing a known adjustment.

Second, repair actions are already cataloged. Add a service, change a configuration, request a document, swap a component, move a resource, upgrade a tier, or modify a bundle.

Third, false vetoes are costly enough to matter. If the system rejects a fixable offer and the customer leaves, the loss is not philosophical.

This points to several business settings:

Domain	Candidate	Repair ontology	False-veto cost
Travel	Flight, hotel, itinerary	Add baggage, refundable fare, seat, connection buffer, date adjustment	Lost booking or inferior itinerary
Retail	Product or bundle	Add missing accessory, substitute compatible item, upgrade warranty	Lost basket or support friction
Configuration	Hardware/software package	Swap component, add license, change dependency	Failed quote or engineering escalation
Scheduling	Appointment or resource slot	Move time, add staff, change room, reserve equipment	Unfilled slot or manual coordination
Financial/service eligibility	Application or service plan	Request document, adjust plan, add verification	Abandoned application or unnecessary rejection

The paper directly shows that under known repair ontologies and observed repair effects/costs, repair-before-veto semantics can avoid a structural false-veto gap and recover repair plans in controlled and semi-synthetic benchmarks.

Cognaptus infers that similar logic can improve business systems where existing catalogs already encode repair actions but the decision layer still behaves as if violations are terminal. That inference is reasonable, but the paper does not prove it across production settings.

What remains uncertain is the messy part: whether a given business has a sufficiently complete repair ontology, whether repair costs are reliable, whether preference thresholds are calibrated, and whether multi-step repair paths matter. The paper handles single-step repair. The real world, naturally, enjoys composing small problems into larger ones for sport.

RACL changes what should be logged

A repair-aware system needs different telemetry from a standard classifier.

Overall accuracy is not enough. Label accuracy is not enough. Even AUROC is not enough. RACL’s metrics are closer to operational loss categories.

False-veto rate asks: among repairable-good cases, how many did the model reject?

Effective decision regret asks: when selecting among options, how far was the selected value below the oracle best value?

Credit accuracy asks: did the system classify the reason correctly?

Plan accuracy asks: did it choose the exact repair operator for repairable candidates?

Those metrics map more directly to workflow questions. How many fixable opportunities did we discard? How often did we pick a weaker option? Can downstream teams trust the rejection reason? Can the system execute or display the repair?
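The four metrics are simple to compute once the right columns are logged. The sketch below follows the plain-language definitions above; the exact normalization of effective decision regret in the paper is not given here, so the mean gap used below is an assumption, as are all function names.

```python
def false_veto_rate(is_repairable_good, accepted):
    """Share of repairable-good cases that the model rejected."""
    decisions = [a for g, a in zip(is_repairable_good, accepted) if g]
    return sum(1 for a in decisions if not a) / len(decisions)


def effective_decision_regret(selected_values, oracle_values):
    """Mean gap between the oracle-best value and the selected value (assumed form)."""
    gaps = [o - s for s, o in zip(selected_values, oracle_values)]
    return sum(gaps) / len(gaps)


def credit_accuracy(predicted_credit, true_credit):
    """Share of cases where the predicted credit category matches the true one."""
    return sum(p == t for p, t in zip(predicted_credit, true_credit)) / len(true_credit)


def plan_accuracy(predicted_plans, true_plans):
    """Exact-match rate on repair operators, over cases that have a true plan."""
    pairs = [(p, t) for p, t in zip(predicted_plans, true_plans) if t is not None]
    return sum(p == t for p, t in pairs) / len(pairs)


# The headline figure: 10 rejections among 4039 repairable-good cases.
fvr = false_veto_rate([True] * 4039, [False] * 10 + [True] * 4029)
```

Rounded to four decimals, `fvr` is 0.0025, matching the RACL-ValidatedGuard row.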

A production version should log at least the following:

Log field	Why it matters
Original feasibility status	Separates already-good from repair-needed candidates
Enumerated admissible repairs	Shows whether the ontology had options
Repair costs and budget gate	Diagnoses over-budget versus impossible cases
Post-repair score and threshold margin	Identifies near-threshold instability
Selected credit category	Enables workflow routing and audit
Selected repair plan	Enables action, not just explanation
False-veto proxy labels from downstream behavior	Supports monitoring when ground truth is delayed
Missing-template incidents	Detects ontology incompleteness

That last line matters because RACL’s failure mode is not mysterious. If the needed repair template is missing, the system cannot select it. It will reject or misclassify. This is not a weakness of machine learning in the abstract. It is a catalog governance problem wearing a model-shaped hat.
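A log record covering the fields in the table might look like the following. This is a hypothetical schema: every field name, the example values, and the JSON serialization are assumptions for illustration, not something the paper prescribes.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class RepairDecisionLog:
    candidate_id: str
    originally_feasible: bool                 # original feasibility status
    admissible_repairs: List[str]             # enumerated admissible repairs
    repair_costs: Dict[str, float]            # cost per considered repair
    budget: float                             # budget gate applied
    post_repair_score: Optional[float]        # None when nothing was admissible
    threshold: float
    credit_category: str
    selected_repair: Optional[str] = None     # the repair plan, if any
    downstream_outcome: Optional[str] = None  # false-veto proxy, filled in later
    missing_template: bool = False            # flag for ontology incompleteness

    def threshold_margin(self) -> Optional[float]:
        """Distance from the threshold; small values flag near-threshold cases."""
        if self.post_repair_score is None:
            return None
        return self.post_repair_score - self.threshold


record = RepairDecisionLog(
    candidate_id="flight-123",
    originally_feasible=False,
    admissible_repairs=["add_checked_bag"],
    repair_costs={"add_checked_bag": 35.0},
    budget=50.0,
    post_repair_score=0.53,
    threshold=0.5,
    credit_category="accepted-repairable-good",
    selected_repair="add_checked_bag",
)
log_line = json.dumps(asdict(record), sort_keys=True)
```

Keeping `missing_template` as an explicit field means ontology gaps show up in the same query as everything else, instead of hiding inside the non-repairable bucket.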

The deployment checklist starts before modeling

The paper is best read as an architecture prompt, not merely a modeling paper. Before implementing something RACL-like, an operator should answer five questions.

1. What repairs do we actually know? List the finite operators. Do not write “AI will fix it.” That is not an ontology; that is a resignation letter in YAML.

2. Are repair effects observable? The system must know what the repair changes and whether the repaired candidate satisfies the hard requirement.

3. Are repair costs contextual? Affordability depends on budgets, user segments, service tiers, or operational constraints. If the budget gate is wrong, the decision semantics will be wrong.

4. Where do preference anchors come from? The paper uses feasible non-repaired anchors to fit persona preference scores. In the raw DB1B-derived setting, the lack of clean zero-budget contexts is a documented limitation and motivates calibration guards. In business terms: you need trustworthy examples of what “good enough” looks like.

5. Which error is more expensive: false veto or regret? RACL-ValidatedGuard deliberately pushes false vetoes down. That may be correct for high-opportunity-cost workflows. It may be wrong where ranking regret dominates. The operating point is a policy decision, not a universal truth handed down from the ROC curve.

Boundaries that materially affect practical use

The known-repair-ontology assumption is the main boundary. RACL studies decision making when repair operators, effects, and costs are specified. It does not discover arbitrary repair actions from binary labels. Missing repair templates are a direct failure mode.

The repair plan is single-step. The model selects one operator, including the identity. Multi-step repair composition is left to future work. That matters in domains where a candidate needs two or three coordinated changes: change date, then add resource, then adjust price. A single-step method may understate what is actually repairable.

The raw DB1B-derived benchmark is semi-synthetic. The candidate distribution comes from real ticket records, but ancillary repair fields, contexts, labels, and credit categories are injected. This gives a harder distributional stress test than a purely synthetic toy, but it is not direct evidence from observed airline repair decisions.

Feasibility and repair costs are observed in the evaluated regime. If feasibility itself must be learned, the theoretical picture changes because another hypothesis class enters the system. That is not a minor implementation note. Many real systems do have noisy feasibility rules, stale inventories, or ambiguous constraints.

Finally, RACL is not optimized to be the best generic rank classifier. The paper explicitly treats AUROC and label accuracy as secondary to repair-aware decision quality. Anyone trying to sell this as “higher accuracy AI” has missed the point with impressive efficiency.

What Cognaptus would take from this

The immediate business lesson is not “replace your recommender with RACL.” It is “stop letting the classifier be the final authority when the platform has valid actions available.”

A good enterprise decision system should separate four questions:

Question	Terminal-veto system	Repair-aware system
Is the candidate feasible now?	If no, reject	If no, inspect repair options
Can a known repair restore feasibility?	Usually not represented	Enumerate repair ontology
Is the repair affordable in context?	Usually outside decision rule	Apply budget gate
Is the repaired candidate worth presenting?	Usually lost after veto	Score post-repair value

This is a small semantic change with large process consequences. It turns some rejections into offers, some dead ends into escalations, and some opaque failures into diagnosable categories.

It also forces organizations to maintain the boring assets that actually make AI useful: catalogs, constraints, repair actions, cost models, thresholds, and evaluation labels. The model is the glamorous part only if one has a very forgiving definition of glamour.

Conclusion: a violation is not always the end of the sentence

RACL’s core insight is that infeasibility should not automatically mean rejection when the system knows how to act. In many business systems, a candidate can be wrong in its current form and still be right after a cheap, known repair.

The paper’s theoretical contribution is to show why terminal-veto semantics structurally fail on repairable-good candidates. Its empirical contribution is to show that, with a known repair ontology, RACL can recover structured credit and repair plans while dramatically reducing false vetoes in controlled and DB1B-derived benchmarks. The raw tier is especially useful because it refuses to pretend there is no trade-off: lower false vetoes come with a different regret and AUROC profile.

For operators, the question is practical. How many valuable options are your systems rejecting because the repair action lives outside the decision rule?

If the answer is “we do not know,” the model is not the only thing requiring calibration.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yifan Wang, “Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions,” arXiv:2606.02326v1, 2026. https://arxiv.org/abs/2606.02326
