The big idea

RAG pipelines are only as reliable as their weakest link: generation that confidently asserts things the sources don’t support. HALT‑RAG proposes an unusually pragmatic fix: don’t fine‑tune a big model—ensemble two strong, frozen NLI models, add lightweight lexical features, train a tiny task‑adapted meta‑classifier, and calibrate it so you can abstain when uncertain. The result isn’t just accuracy; it’s a governable safety control you can dial to meet business risk.

Why this matters for operators

  • Policy knobs instead of vibes. With calibrated probabilities, product teams can codify a precision floor (e.g., ≥0.70) and a coverage target (e.g., keep 90% of answers) to match legal or brand risk.
  • Model‑agnostic, retrieval‑agnostic. Acts as a post‑hoc bouncer at the end of any RAG pipeline—no re‑training of your foundation model.
  • Production‑friendly metrics. Out‑of‑fold (OOF) evaluation + post‑hoc calibration gives you risk you can reason about, not just leaderboard points.

How HALT‑RAG works (in one minute)

  1. Chunk & pair. Split source and model output into premise–hypothesis pairs using non‑overlapping 320‑token windows; score each pair with RoBERTa‑large‑MNLI and DeBERTa‑v3‑large NLI heads.
  2. Pool & enrich. Pool NLI signals (max/mean of entail/contradict/neutral) and concatenate lexical features (sequence lengths, ratios, ROUGE‑L, Jaccard).
  3. Meta‑classify. Feed features into a simple classifier (LogReg for Summarization/Dialogue; LinearSVC for QA).
  4. Calibrate. Do 5‑fold OOF training; then isotonic regression (LogReg) or Platt scaling (SVC) to map scores → probabilities.
  5. Decide & abstain. Choose a threshold that maximizes F1 subject to a precision constraint; optionally widen a no‑go band around the threshold to meet a coverage target. (A feature‑extraction sketch follows this list; calibration and thresholding sketches appear after the scoreboard.)
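
In code, the front half of the pipeline is mostly frozen NLI scoring plus pooling. Below is a minimal sketch of steps 1–2, assuming Hugging Face transformers and the rouge-score package; the checkpoint ids, window handling, and feature ordering are illustrative stand‑ins for the paper's exact configuration, not a faithful reimplementation.

```python
# Sketch of steps 1-2: window the source and the model output, score every
# (source window, output window) pair with two frozen NLI heads, then pool
# the label probabilities and append cheap lexical features.
import numpy as np
import torch
from rouge_score import rouge_scorer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_CHECKPOINTS = [
    "roberta-large-mnli",          # RoBERTa-large MNLI head
    "microsoft/deberta-v3-large",  # stand-in id: swap in the DeBERTa-v3-large NLI head you use
]
WINDOW_TOKENS = 320
_SCORERS = [(AutoTokenizer.from_pretrained(c),
             AutoModelForSequenceClassification.from_pretrained(c).eval())
            for c in NLI_CHECKPOINTS]
_ROUGE = rouge_scorer.RougeScorer(["rougeL"])

def windows(text, tokenizer, size=WINDOW_TOKENS):
    """Non-overlapping token windows, decoded back to strings."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(ids[i:i + size]) for i in range(0, len(ids), size)] or [""]

@torch.no_grad()
def nli_probs(tokenizer, model, premise, hypothesis):
    """Probabilities for {entailment, neutral, contradiction} on one pair."""
    batch = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    probs = model(**batch).logits.softmax(-1).squeeze(0)
    return {model.config.id2label[i].lower(): probs[i].item() for i in range(probs.numel())}

def extract_features(source: str, generation: str) -> np.ndarray:
    # Score every premise-hypothesis window pair under both NLI models.
    rows = [nli_probs(tok, mdl, prem, hyp)
            for tok, mdl in _SCORERS
            for prem in windows(source, tok)
            for hyp in windows(generation, tok)]
    # Pool NLI signals: max and mean per label across all pairs.
    pooled = []
    for label in ("entailment", "neutral", "contradiction"):
        vals = [r.get(label, 0.0) for r in rows]
        pooled += [max(vals), float(np.mean(vals))]
    # Lexical features: lengths, length ratio, Jaccard overlap, ROUGE-L.
    src, gen = set(source.lower().split()), set(generation.lower().split())
    jaccard = len(src & gen) / max(len(src | gen), 1)
    rouge_l = _ROUGE.score(source, generation)["rougeL"].fmeasure
    lexical = [len(source.split()), len(generation.split()),
               len(generation.split()) / max(len(source.split()), 1),
               jaccard, rouge_l]
    return np.asarray(pooled + lexical, dtype=float)
```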

A quick scoreboard

| Task | Threshold t* | Precision | Recall | F1 | Calibration (ECE) |
|---|---|---|---|---|---|
| Summarization | 0.377 | 0.712 | 0.851 | 0.776 | 0.011 |
| QA | 0.395 | 0.984 | 0.974 | 0.979 | 0.005 |
| Dialogue | 0.421 | 0.706 | 0.776 | 0.739 | 0.013 |

Read: QA is almost linearly separable in this feature space; multi‑turn dialogue is toughest (coreference, pragmatics, long‑range context).
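
Those t* values fall out of a simple constrained search over calibrated out‑of‑fold scores. Here is a minimal scikit‑learn sketch of steps 3–5, assuming the logistic‑regression/isotonic pairing used for Summarization and Dialogue; the QA route would swap in LinearSVC calibrated with Platt (sigmoid) scaling. The 0.70 precision floor and the names X, y are illustrative.

```python
# Sketch of steps 3-5: 5-fold out-of-fold (OOF) scores from a linear
# meta-classifier, isotonic regression to calibrate them, and a sweep for
# the threshold t* that maximizes F1 subject to a precision floor.
# X is the feature matrix (NumPy array); y = 1 marks a hallucinated output.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score
from sklearn.model_selection import StratifiedKFold

def fit_calibrated_meta_classifier(X, y, n_splits=5, seed=0):
    oof = np.zeros(len(y))
    for tr, va in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        oof[va] = clf.predict_proba(X[va])[:, 1]
    # Calibrate on OOF scores so probabilities are honest on unseen data.
    calibrator = IsotonicRegression(out_of_bounds="clip").fit(oof, y)
    final_clf = LogisticRegression(max_iter=1000).fit(X, y)  # refit for deployment
    return final_clf, calibrator, calibrator.predict(oof)    # calibrated OOF probs

def select_threshold(calibrated_oof, y, precision_floor=0.70):
    """Maximize F1 over candidate thresholds that clear the precision floor."""
    best_t, best_f1 = None, -1.0
    for t in np.unique(np.round(calibrated_oof, 3)):
        preds = (calibrated_oof >= t).astype(int)
        if preds.sum() == 0:
            continue
        if precision_score(y, preds, zero_division=0) < precision_floor:
            continue
        f1 = f1_score(y, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```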

What actually moves the needle (ablation sense‑check)

| Remove/Change | ΔF1 (Summarization dev) | Takeaway |
|---|---|---|
| Entailment signal | −4.5 pts | Having evidence that something is supported matters more than spotting explicit contradictions. |
| Contradiction signal | −2.1 pts | Still important; catches hard errors. |
| Lexical features | −1.3 pts | Cheap, helpful context (length/overlap) improves separability. |
| Single NLI model (DeBERTa only) | −1.8 pts | Architectural diversity in the ensemble pays off. |

The abstention superpower

Abstention is the difference between research and operations. With calibrated probabilities you can trade a bit of coverage for cleaner precision.

| Task | Setting | Coverage | Precision | F1 |
|---|---|---|---|---|
| Summarization | Standard | 100% | 0.712 | 0.776 |
| Summarization | Selective (≈90%) | 89.4% | 0.798 | 0.782 |
| QA | Standard | 100% | 0.984 | 0.979 |
| QA | Selective (≈90%) | 90.6% | 0.998 | 0.980 |
| Dialogue | Standard | 100% | 0.706 | 0.739 |
| Dialogue | Selective (≈90%) | 90.2% | 0.783 | 0.724 |

Interpretation: even a blunt abstain band around t* gives you audit‑friendly precision lifts without crushing F1.
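
Operationally, the blunt band can be exactly that: widen a symmetric no‑go zone around t* until the answered fraction drops to the coverage target, and route everything inside the band to a fallback. A minimal sketch, assuming calibrated probabilities as a NumPy array; the step size and the 90% target are illustrative.

```python
# Selective prediction: symmetrically widen a no-go band around t* until the
# answered fraction drops to the coverage target; items inside the band abstain.
import numpy as np

def abstain_band(probs: np.ndarray, t_star: float,
                 coverage_target: float = 0.90, step: float = 0.005):
    width = 0.0
    while width <= 0.5:
        answer_mask = np.abs(probs - t_star) >= width  # True = confident enough to decide
        if answer_mask.mean() <= coverage_target:
            return width, answer_mask
        width += step
    return 0.5, np.abs(probs - t_star) >= 0.5

# Usage: decide normally on the answered subset, send the rest to a fallback.
# width, answered = abstain_band(calibrated_probs, t_star=0.377)
```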

Implementation notes for Cognaptus systems

  • Where to plug in. Place HALT‑RAG as a final verifier in your RAG DAG: retrieve → rerank → generate → HALT‑RAG → (answer | abstain→fallback).

  • Policy template (see the routing sketch after this list).

    • Precision floor: ≥0.80 for legal/health; ≥0.70 for general knowledge bases.
    • Coverage target: 85–95% depending on SLA.
    • Fallbacks: (a) ask‑for‑clarification; (b) re‑retrieve with narrower query; (c) defer to human.
  • Telemetry to log. Raw and calibrated scores, the chosen threshold, coverage/precision per route, abstention rate by intent, and the top‑k features that triggered the refusal (for explainability).

  • Latency budget. Two large NLI passes across multiple windows can pinch tail latencies. Mitigations: (1) short‑circuit after an early high‑confidence contradiction; (2) dynamic windowing (long for dialogue, short for QA); (3) cache per‑chunk NLI scores keyed by doc hash.
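
Tying these notes together, the verifier at the end of the DAG can reduce to a small, auditable decision function driven by a policy artifact rather than constants in code. A minimal sketch; the route names, threshold and band values, and fallback ordering below are illustrative placeholders, not tuned defaults.

```python
# Post-generation verifier: policy artifact in, (answer | abstain + fallback) out.
from dataclasses import dataclass

@dataclass(frozen=True)
class VerifierPolicy:
    t_star: float            # calibrated decision threshold for "hallucinated"
    band_width: float        # half-width of the abstain band around t_star
    precision_floor: float   # documented floor this policy was tuned to meet
    fallbacks: tuple         # ordered fallback actions when we abstain

POLICIES = {  # illustrative values only; ship these as versioned policy artifacts
    "legal":   VerifierPolicy(0.38, 0.08, 0.80, ("re_retrieve", "defer_to_human")),
    "general": VerifierPolicy(0.38, 0.04, 0.70, ("ask_clarification", "re_retrieve")),
}

def verify(route: str, hallucination_prob: float) -> dict:
    """Decide answer vs. abstain and return everything needed for audit logs."""
    p = POLICIES[route]
    in_band = abs(hallucination_prob - p.t_star) < p.band_width
    flagged = hallucination_prob >= p.t_star
    decision = "abstain" if (in_band or flagged) else "answer"
    return {
        "route": route,
        "decision": decision,
        "fallbacks": p.fallbacks if decision == "abstain" else (),
        "calibrated_score": hallucination_prob,  # log the raw score alongside this
        "threshold": p.t_star,
        "policy_precision_floor": p.precision_floor,
    }
```

Logging the returned record per request covers most of the telemetry list above and keeps threshold reviews auditable.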

Limits you should plan around

  • Fixed 320‑token windows miss cross‑window dependencies—especially painful in multi‑turn chat. A rolling or hierarchical context model would help.
  • Task‑adapted (not fully universal). The meta‑classifier and thresholds still need per‑task tuning; ship them as policy artifacts, not baked into code.
  • Clean‑context bias. Results are on benchmarks with relevant passages; real RAG retrieval noise will degrade performance. Stress‑test with adversarially irrelevant chunks.

Where we’d push next

  1. Entity‑graph features. Lightweight linking/relational checks to catch subtle “invented links.”
  2. Retriever‑aware abstention. When abstaining, use the reason for the abstention (low entailment vs. high contradiction) to drive targeted re‑retrieval prompts.
  3. Governance loops. Weekly threshold review tied to precision@intent KPIs; auto‑suggest new t* when drift is detected.

Bottom line: HALT‑RAG turns hallucination detection from a heuristic into a policy instrument. If you’re running RAG in production, calibrated abstention is the cheapest lever you can pull for safety without throttling your roadmap.

Cognaptus: Automate the Present, Incubate the Future