The big idea
RAG pipelines are only as reliable as their weakest link: generation that confidently asserts things the sources don’t support. HALT‑RAG proposes an unusually pragmatic fix: don’t fine‑tune a big model—ensemble two strong, frozen NLI models, add lightweight lexical features, train a tiny task‑adapted meta‑classifier, and calibrate it so you can abstain when uncertain. The result isn’t just accuracy; it’s a governable safety control you can dial to meet business risk.
Why this matters for operators
- Policy knobs instead of vibes. With calibrated probabilities, product teams can codify a precision floor (e.g., ≥0.70) and a coverage target (e.g., keep 90% of answers) to match legal or brand risk.
- Model‑agnostic, retrieval‑agnostic. Acts as a post‑hoc bouncer at the end of any RAG pipeline—no re‑training of your foundation model.
- Production‑friendly metrics. Out‑of‑fold (OOF) evaluation + post‑hoc calibration gives you risk you can reason about, not just leaderboard points.
How HALT‑RAG works (in one minute)
- Chunk & pair. Split source and model output into premise–hypothesis pairs using non‑overlapping 320‑token windows; score each pair with RoBERTa‑large‑MNLI and DeBERTa‑v3‑large NLI heads.
- Pool & enrich. Pool NLI signals (max/mean of entail/contradict/neutral) and concatenate lexical features (sequence lengths, ratios, ROUGE‑L, Jaccard).
- Meta‑classify. Feed features into a simple classifier (LogReg for Summarization/Dialogue; LinearSVC for QA); a minimal sketch of this flow follows the list.
- Calibrate. Do 5‑fold OOF training; then isotonic regression (LogReg) or Platt scaling (SVC) to map scores → probabilities.
- Decide & abstain. Choose a threshold that maximizes F1 subject to a precision constraint; optionally widen a no‑go band around the threshold to meet a coverage target.
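To make the flow concrete, here is a minimal sketch of the chunking, NLI scoring, pooling, and feature assembly, assuming Hugging Face transformers, NumPy, and scikit-learn; the checkpoint names, pooling details, and lexical features are illustrative choices rather than the paper's exact code.

```python
# Minimal sketch of the chunk -> NLI -> pool -> meta-classify flow.
# Checkpoint names, pooling choices, and lexical features are illustrative,
# not the authors' exact implementation.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NLI_CHECKPOINTS = [
    "roberta-large-mnli",          # ensemble member 1
    "microsoft/deberta-v3-large",  # stand-in: substitute a DeBERTa-v3-large NLI head
]

def nli_probs(model, tok, premise: str, hypothesis: str) -> dict:
    """Label -> probability for a single premise-hypothesis pair."""
    enc = tok(premise, hypothesis, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = model(**enc).logits.softmax(-1).squeeze()
    return {model.config.id2label[i].lower(): float(p) for i, p in enumerate(probs)}

def windows(text: str, tok, size: int = 320) -> list:
    """Non-overlapping 320-token windows (chunk & pair)."""
    ids = tok(text, add_special_tokens=False)["input_ids"]
    return [tok.decode(ids[i:i + size]) for i in range(0, len(ids), size)]

def halt_rag_features(source: str, output: str) -> np.ndarray:
    """Pooled NLI signals from both models plus lexical features (pool & enrich)."""
    feats = []
    for name in NLI_CHECKPOINTS:
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForSequenceClassification.from_pretrained(name).eval()
        pair_scores = [nli_probs(model, tok, prem, hyp)
                       for prem in windows(source, tok)
                       for hyp in windows(output, tok)]
        for label in ("entailment", "contradiction", "neutral"):
            vals = [s.get(label, 0.0) for s in pair_scores]
            feats += [max(vals), float(np.mean(vals))]  # max/mean pooling per label
    # Lexical features: lengths, length ratio, Jaccard overlap (ROUGE-L omitted for brevity).
    src, out = source.split(), output.split()
    jaccard = len(set(src) & set(out)) / max(1, len(set(src) | set(out)))
    feats += [len(src), len(out), len(out) / max(1, len(src)), jaccard]
    return np.array(feats)

# Meta-classify: the classifier is deliberately simple, e.g.
#   from sklearn.linear_model import LogisticRegression
#   meta = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# (LinearSVC would take its place for the QA task.)
```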
A quick scoreboard
Task | Threshold t* | Precision | Recall | F1 | Calibration (ECE) |
---|---|---|---|---|---|
Summarization | 0.377 | 0.712 | 0.851 | 0.776 | 0.011 |
QA | 0.395 | 0.984 | 0.974 | 0.979 | 0.005 |
Dialogue | 0.421 | 0.706 | 0.776 | 0.739 | 0.013 |
Read: QA is almost linearly separable in this feature space; multi‑turn dialogue is toughest (coreference, pragmatics, long‑range context).
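The ECE column comes from calibrated out-of-fold probabilities, as described in the "Calibrate" step above. A hedged sketch of that step with scikit-learn, assuming the LogReg + isotonic variant; the fold seed and binning are illustrative:

```python
# Out-of-fold training and isotonic calibration (a sketch, assuming scikit-learn).
# The Platt-scaled LinearSVC variant for QA would swap in CalibratedClassifierCV or similar.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def oof_calibrated_scores(X: np.ndarray, y: np.ndarray, n_folds: int = 5):
    """5-fold OOF scores, then isotonic regression mapping raw scores to probabilities."""
    oof = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    for tr, va in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        oof[va] = clf.predict_proba(X[va])[:, 1]
    iso = IsotonicRegression(out_of_bounds="clip").fit(oof, y)
    return iso.transform(oof), iso

def expected_calibration_error(p: np.ndarray, y: np.ndarray, bins: int = 10) -> float:
    """Binned gap between predicted probability and empirical positive rate (one common ECE)."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & (p <= hi) if hi == 1.0 else (p >= lo) & (p < hi)
        if mask.any():
            ece += mask.mean() * abs(float(y[mask].mean()) - float(p[mask].mean()))
    return ece
```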
What actually moves the needle (ablation sense‑check)
Remove/Change | ΔF1 (Summarization dev) | Takeaway |
---|---|---|
Entailment signal | −4.5 pts | Having evidence something is supported matters more than spotting explicit contradictions. |
Contradiction signal | −2.1 pts | Still important; catches hard errors. |
Lexical features | −1.3 pts | Cheap, helpful context (length/overlap) improves separability. |
Use single NLI (DeBERTa only) | −1.8 pts | Architectural diversity in the ensemble pays off. |
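If you want to run the same sanity check on your own feature set, the loop behind a table like the one above is straightforward; the feature-group-to-column mapping below is a hypothetical placeholder, not the paper's layout.

```python
# Illustrative ablation loop: drop one feature group, refit, compare dev-set F1.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical mapping from feature group to column indices in the feature matrix.
FEATURE_GROUPS = {
    "entailment": [0, 1], "contradiction": [2, 3],
    "neutral": [4, 5], "lexical": [6, 7, 8, 9],
}

def ablate(X_tr, y_tr, X_dev, y_dev):
    def fit_f1(cols):
        clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
        return f1_score(y_dev, clf.predict(X_dev[:, cols]))
    all_cols = sorted(c for cols in FEATURE_GROUPS.values() for c in cols)
    base = fit_f1(all_cols)
    for name, cols in FEATURE_GROUPS.items():
        kept = [c for c in all_cols if c not in cols]
        print(f"- without {name}: dF1 = {fit_f1(kept) - base:+.3f}")
```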
The abstention superpower
Abstention is the difference between research and operations. With calibrated probabilities you can trade a bit of coverage for cleaner precision.
Task | Setting | Coverage | Precision | F1 |
---|---|---|---|---|
Summarization | Standard | 100% | 0.712 | 0.776 |
Summarization | Selective (≈90%) | 89.4% | 0.798 | 0.782
QA | Standard | 100% | 0.984 | 0.979
QA | Selective (≈90%) | 90.6% | 0.998 | 0.980
Dialogue | Standard | 100% | 0.706 | 0.739
Dialogue | Selective (≈90%) | 90.2% | 0.783 | 0.724
Interpretation: even a blunt abstain band around t* gives you audit‑friendly precision lifts without crushing F1.
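Below is a hedged sketch of how the "Decide & abstain" rule can be implemented: pick t* by maximizing F1 under a precision floor, then widen a symmetric no-go band around t* until the abstention rate matches the coverage target. Function names and the class polarity (higher score = more likely hallucinated) are assumptions.

```python
# Threshold selection and abstention band, assuming scikit-learn metrics.
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(p: np.ndarray, y: np.ndarray, precision_floor: float = 0.70) -> float:
    """t* = argmax F1 among thresholds whose precision meets the floor."""
    prec, rec, thr = precision_recall_curve(y, p)
    f1 = 2 * prec * rec / np.clip(prec + rec, 1e-9, None)
    ok = np.where(prec[:-1] >= precision_floor)[0]   # thresholds align with prec[:-1]
    idx = ok[np.argmax(f1[:-1][ok])] if len(ok) else np.argmax(f1[:-1])
    return float(thr[idx])

def band_for_coverage(p: np.ndarray, t_star: float, coverage: float = 0.90) -> float:
    """Smallest symmetric no-go band around t* that abstains on ~(1 - coverage) of examples."""
    widths = np.sort(np.abs(p - t_star))
    n_abstain = int(np.floor((1.0 - coverage) * len(p)))
    return float(widths[n_abstain - 1]) if n_abstain > 0 else 0.0

def decide(p: float, t_star: float, band: float = 0.0) -> str:
    """Abstain inside the no-go band; otherwise flag or pass the answer."""
    if abs(p - t_star) <= band:
        return "abstain"
    return "hallucinated" if p >= t_star else "faithful"
```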
Implementation notes for Cognaptus systems
- Where to plug in. Place HALT‑RAG as a final verifier in your RAG DAG: retrieve → rerank → generate → HALT‑RAG → (answer | abstain→fallback). A gate sketch follows this list.
- Policy template.
  - Precision floor: ≥0.80 for legal/health; ≥0.70 for general knowledge bases.
  - Coverage target: 85–95% depending on SLA.
  - Fallbacks: (a) ask‑for‑clarification; (b) re‑retrieve with a narrower query; (c) defer to human.
- Telemetry to log. Raw and calibrated scores, the chosen threshold, coverage/precision per route, abstention rate by intent, and the top‑k features that triggered the refusal (for explainability).
- Latency budget. Two large NLI passes across multiple windows can pinch tail latencies. Mitigations: (1) short‑circuit after an early high‑confidence contradiction; (2) dynamic windowing (long for dialogue, short for QA); (3) cache per‑chunk NLI scores keyed by doc hash (see the caching sketch below).
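Here is a hypothetical sketch of the gate as it would sit at the end of the DAG, with the policy knobs above expressed as a config object; the `VerifierPolicy` fields, routing labels, and `score_fn` are placeholders for your own pipeline, not part of HALT‑RAG itself.

```python
# Hypothetical verifier gate at the end of a RAG pipeline; names are placeholders.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VerifierPolicy:
    precision_floor: float = 0.80   # used offline when fitting t_star (0.80 legal/health, 0.70 general)
    coverage_target: float = 0.90   # used offline when fitting the no-go band
    t_star: float = 0.40            # calibrated decision threshold for this task/route
    band: float = 0.05              # half-width of the abstain band around t_star

def verify_and_route(answer: str, sources: List[str], policy: VerifierPolicy,
                     score_fn: Callable[[str, List[str]], float]) -> dict:
    """Final verifier: commit to the answer, or abstain and name a fallback route."""
    p = score_fn(answer, sources)   # calibrated probability that the answer is unsupported
    record = {"score": p, "t_star": policy.t_star, "band": policy.band}  # telemetry to log
    if abs(p - policy.t_star) <= policy.band:
        # Uncertain zone: abstain and try a narrower re-retrieval or a clarification turn.
        return {**record, "action": "abstain", "fallback": "re_retrieve_narrower"}
    if p > policy.t_star:
        # Confident hallucination call: abstain and defer to a human.
        return {**record, "action": "abstain", "fallback": "defer_to_human"}
    return {**record, "action": "answer", "text": answer}
```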
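On the latency point, a per‑chunk score cache keyed by a content hash is the cheapest mitigation. A sketch with an in‑process dict; swap in Redis or similar for shared workers, and note that `score_fn` here stands in for the NLI call from the earlier sketch.

```python
# Sketch: memoize per-chunk NLI scores keyed by (model name, premise hash, hypothesis hash).
import hashlib

_nli_cache: dict = {}

def cached_nli(model_name: str, premise: str, hypothesis: str, score_fn) -> dict:
    key = (model_name,
           hashlib.sha256(premise.encode()).hexdigest(),
           hashlib.sha256(hypothesis.encode()).hexdigest())
    if key not in _nli_cache:
        _nli_cache[key] = score_fn(premise, hypothesis)  # e.g., nli_probs(...) from earlier
    return _nli_cache[key]
```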
Limits you should plan around
- Fixed 320‑token windows miss cross‑window dependencies—especially painful in multi‑turn chat. A rolling or hierarchical context model would help.
- Task‑adapted (not fully universal). The meta‑classifier and thresholds still need per‑task tuning; ship them as policy artifacts, not baked into code.
- Clean‑context bias. Results are on benchmarks with relevant passages; real RAG retrieval noise will degrade performance. Stress‑test with adversarially irrelevant chunks.
Where we’d push next
- Entity‑graph features. Lightweight linking/relational checks to catch subtle “invented links.”
- Retriever‑aware abstention. When abstaining, surface the why behind the refusal (low entailment vs. high contradiction) and use it to drive targeted re‑retrieval prompts; a speculative sketch follows this list.
- Governance loops. Weekly threshold review tied to precision@intent KPIs; auto‑suggest new t* when drift is detected.
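As a sketch of what retriever‑aware abstention could look like (our speculation, not something the paper implements): when the gate abstains, find the claim with the weakest support among the per‑pair NLI scores and turn it into the next retrieval query. The `pair_scores` structure is a hypothetical output of the scoring step.

```python
# Speculative sketch: turn an abstention's "why" into a targeted re-retrieval query.
def reretrieval_query(pair_scores: list) -> str:
    """pair_scores: [{'claim': str, 'entailment': float, 'contradiction': float}, ...]"""
    # Low entailment -> the claim is unsupported; high contradiction -> it is contradicted.
    weakest = min(pair_scores, key=lambda s: s["entailment"] - s["contradiction"])
    return f"evidence for: {weakest['claim']}"
```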
Bottom line: HALT‑RAG turns hallucination detection from a heuristic into a policy instrument. If you’re running RAG in production, calibrated abstention is the cheapest lever you can pull for safety without throttling your roadmap.
Cognaptus: Automate the Present, Incubate the Future