The big idea

RAG pipelines are only as reliable as their weakest link: generation that confidently asserts things the sources don’t support. HALT‑RAG proposes an unusually pragmatic fix: don’t fine‑tune a big model—ensemble two strong, frozen NLI models, add lightweight lexical features, train a tiny task‑adapted meta‑classifier, and calibrate it so you can abstain when uncertain. The result isn’t just accuracy; it’s a governable safety control you can dial to meet business risk.

Why this matters for operators

  • Policy knobs instead of vibes. With calibrated probabilities, product teams can codify a precision floor (e.g., ≥0.70) and a coverage target (e.g., keep 90% of answers) to match legal or brand risk.
  • Model‑agnostic, retrieval‑agnostic. Acts as a post‑hoc bouncer at the end of any RAG pipeline—no re‑training of your foundation model.
  • Production‑friendly metrics. Out‑of‑fold (OOF) evaluation + post‑hoc calibration gives you risk you can reason about, not just leaderboard points.

How HALT‑RAG works (in one minute)

  1. Chunk & pair. Split source and model output into premise–hypothesis pairs using non‑overlapping 320‑token windows; score each pair with RoBERTa‑large‑MNLI and DeBERTa‑v3‑large NLI heads.
  2. Pool & enrich. Pool NLI signals (max/mean of entail/contradict/neutral) and concatenate lexical features (sequence lengths, ratios, ROUGE‑L, Jaccard).
  3. Meta‑classify. Feed features into a simple classifier (LogReg for Summarization/Dialogue; LinearSVC for QA).
  4. Calibrate. Do 5‑fold OOF training; then isotonic regression (LogReg) or Platt scaling (SVC) to map scores → probabilities.
  5. Decide & abstain. Choose a threshold that maximizes F1 subject to a precision constraint; optionally widen a no‑go band around the threshold to meet a coverage target. (A feature‑extraction sketch follows this list; calibration and thresholding sketches appear after the scoreboard.)
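
In code, the front half of the pipeline is mostly frozen NLI scoring plus pooling. Below is a minimal sketch of steps 1–2, assuming Hugging Face transformers and the rouge-score package; the checkpoint ids, window handling, and feature ordering are illustrative stand‑ins for the paper's exact configuration, not a faithful reimplementation.

```python
# Sketch of steps 1-2: window the source and the model output, score every
# (source window, output window) pair with two frozen NLI heads, then pool
# the label probabilities and append cheap lexical features.
import numpy as np
import torch
from rouge_score import rouge_scorer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_CHECKPOINTS = [
    "roberta-large-mnli",          # RoBERTa-large MNLI head
    "microsoft/deberta-v3-large",  # stand-in id: swap in the DeBERTa-v3-large NLI head you use
]
WINDOW_TOKENS = 320
_SCORERS = [(AutoTokenizer.from_pretrained(c),
             AutoModelForSequenceClassification.from_pretrained(c).eval())
            for c in NLI_CHECKPOINTS]
_ROUGE = rouge_scorer.RougeScorer(["rougeL"])

def windows(text, tokenizer, size=WINDOW_TOKENS):
    """Non-overlapping token windows, decoded back to strings."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(ids[i:i + size]) for i in range(0, len(ids), size)] or [""]

@torch.no_grad()
def nli_probs(tokenizer, model, premise, hypothesis):
    """Probabilities for {entailment, neutral, contradiction} on one pair."""
    batch = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    probs = model(**batch).logits.softmax(-1).squeeze(0)
    return {model.config.id2label[i].lower(): probs[i].item() for i in range(probs.numel())}

def extract_features(source: str, generation: str) -> np.ndarray:
    # Score every premise-hypothesis window pair under both NLI models.
    rows = [nli_probs(tok, mdl, prem, hyp)
            for tok, mdl in _SCORERS
            for prem in windows(source, tok)
            for hyp in windows(generation, tok)]
    # Pool NLI signals: max and mean per label across all pairs.
    pooled = []
    for label in ("entailment", "neutral", "contradiction"):
        vals = [r.get(label, 0.0) for r in rows]
        pooled += [max(vals), float(np.mean(vals))]
    # Lexical features: lengths, length ratio, Jaccard overlap, ROUGE-L.
    src, gen = set(source.lower().split()), set(generation.lower().split())
    jaccard = len(src & gen) / max(len(src | gen), 1)
    rouge_l = _ROUGE.score(source, generation)["rougeL"].fmeasure
    lexical = [len(source.split()), len(generation.split()),
               len(generation.split()) / max(len(source.split()), 1),
               jaccard, rouge_l]
    return np.asarray(pooled + lexical, dtype=float)
```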

A quick scoreboard

| Task | Threshold t* | Precision | Recall | F1 | Calibration (ECE) |
|---|---|---|---|---|---|
| Summarization | 0.377 | 0.712 | 0.851 | 0.776 | 0.011 |
| QA | 0.395 | 0.984 | 0.974 | 0.979 | 0.005 |
| Dialogue | 0.421 | 0.706 | 0.776 | 0.739 | 0.013 |

Read: QA is almost linearly separable in this feature space; multi‑turn dialogue is toughest (coreference, pragmatics, long‑range context).
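
Those t* values fall out of a simple constrained search over calibrated out‑of‑fold scores. Here is a minimal scikit‑learn sketch of steps 3–5, assuming the logistic‑regression/isotonic pairing used for Summarization and Dialogue; the QA route would swap in LinearSVC calibrated with Platt (sigmoid) scaling. The 0.70 precision floor and the names X, y are illustrative.

```python
# Sketch of steps 3-5: 5-fold out-of-fold (OOF) scores from a linear
# meta-classifier, isotonic regression to calibrate them, and a sweep for
# the threshold t* that maximizes F1 subject to a precision floor.
# X is the feature matrix (NumPy array); y = 1 marks a hallucinated output.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score
from sklearn.model_selection import StratifiedKFold

def fit_calibrated_meta_classifier(X, y, n_splits=5, seed=0):
    oof = np.zeros(len(y))
    for tr, va in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        oof[va] = clf.predict_proba(X[va])[:, 1]
    # Calibrate on OOF scores so probabilities are honest on unseen data.
    calibrator = IsotonicRegression(out_of_bounds="clip").fit(oof, y)
    final_clf = LogisticRegression(max_iter=1000).fit(X, y)  # refit for deployment
    return final_clf, calibrator, calibrator.predict(oof)    # calibrated OOF probs

def select_threshold(calibrated_oof, y, precision_floor=0.70):
    """Maximize F1 over candidate thresholds that clear the precision floor."""
    best_t, best_f1 = None, -1.0
    for t in np.unique(np.round(calibrated_oof, 3)):
        preds = (calibrated_oof >= t).astype(int)
        if preds.sum() == 0:
            continue
        if precision_score(y, preds, zero_division=0) < precision_floor:
            continue
        f1 = f1_score(y, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```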

What actually moves the needle (ablation sense‑check)

| Remove/Change | ΔF1 (Summarization dev) | Takeaway |
|---|---|---|
| Entailment signal | −4.5 pts | Having evidence that something is supported matters more than spotting explicit contradictions. |
| Contradiction signal | −2.1 pts | Still important; catches hard errors. |
| Lexical features | −1.3 pts | Cheap, helpful context (length/overlap) improves separability. |
| Single NLI model (DeBERTa only) | −1.8 pts | Architectural diversity in the ensemble pays off. |

The abstention superpower

Abstention is the difference between research and operations. With calibrated probabilities you can trade a bit of coverage for cleaner precision.

| Task | Setting | Coverage | Precision | F1 |
|---|---|---|---|---|
| Summarization | Standard | 100% | 0.712 | 0.776 |
| Summarization | Selective (≈90%) | 89.4% | 0.798 | 0.782 |
| QA | Standard | 100% | 0.984 | 0.979 |
| QA | Selective (≈90%) | 90.6% | 0.998 | 0.980 |
| Dialogue | Standard | 100% | 0.706 | 0.739 |
| Dialogue | Selective (≈90%) | 90.2% | 0.783 | 0.724 |

Interpretation: even a blunt abstain band around t* gives you audit‑friendly precision lifts without crushing F1.
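
Operationally, the blunt band can be exactly that: widen a symmetric no‑go zone around t* until the answered fraction drops to the coverage target, and route everything inside the band to a fallback. A minimal sketch, assuming calibrated probabilities as a NumPy array; the step size and the 90% target are illustrative.

```python
# Selective prediction: symmetrically widen a no-go band around t* until the
# answered fraction drops to the coverage target; items inside the band abstain.
import numpy as np

def abstain_band(probs: np.ndarray, t_star: float,
                 coverage_target: float = 0.90, step: float = 0.005):
    width = 0.0
    while width <= 0.5:
        answer_mask = np.abs(probs - t_star) >= width  # True = confident enough to decide
        if answer_mask.mean() <= coverage_target:
            return width, answer_mask
        width += step
    return 0.5, np.abs(probs - t_star) >= 0.5

# Usage: decide normally on the answered subset, send the rest to a fallback.
# width, answered = abstain_band(calibrated_probs, t_star=0.377)
```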

Implementation notes for Cognaptus systems

  • Where to plug in. Place HALT‑RAG as a final verifier in your RAG DAG: retrieve → rerank → generate → HALT‑RAG → (answer | abstain→fallback).

  • Policy template (see the routing sketch after this list).

    • Precision floor: ≥0.80 for legal/health; ≥0.70 for general knowledge bases.
    • Coverage target: 85–95% depending on SLA.
    • Fallbacks: (a) ask‑for‑clarification; (b) re‑retrieve with narrower query; (c) defer to human.
  • Telemetry to log. Raw and calibrated scores, the chosen threshold, coverage/precision per route, abstention rate by intent, and the top‑k features that triggered the refusal (for explainability).

  • Latency budget. Two large NLI passes across multiple windows can pinch tail latencies. Mitigations: (1) short‑circuit after an early high‑confidence contradiction; (2) dynamic windowing (long for dialogue, short for QA); (3) cache per‑chunk NLI scores keyed by doc hash.
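
Tying these notes together, the verifier at the end of the DAG can reduce to a small, auditable decision function driven by a policy artifact rather than constants in code. A minimal sketch; the route names, threshold and band values, and fallback ordering below are illustrative placeholders, not tuned defaults.

```python
# Post-generation verifier: policy artifact in, (answer | abstain + fallback) out.
from dataclasses import dataclass

@dataclass(frozen=True)
class VerifierPolicy:
    t_star: float            # calibrated decision threshold for "hallucinated"
    band_width: float        # half-width of the abstain band around t_star
    precision_floor: float   # documented floor this policy was tuned to meet
    fallbacks: tuple         # ordered fallback actions when we abstain

POLICIES = {  # illustrative values only; ship these as versioned policy artifacts
    "legal":   VerifierPolicy(0.38, 0.08, 0.80, ("re_retrieve", "defer_to_human")),
    "general": VerifierPolicy(0.38, 0.04, 0.70, ("ask_clarification", "re_retrieve")),
}

def verify(route: str, hallucination_prob: float) -> dict:
    """Decide answer vs. abstain and return everything needed for audit logs."""
    p = POLICIES[route]
    in_band = abs(hallucination_prob - p.t_star) < p.band_width
    flagged = hallucination_prob >= p.t_star
    decision = "abstain" if (in_band or flagged) else "answer"
    return {
        "route": route,
        "decision": decision,
        "fallbacks": p.fallbacks if decision == "abstain" else (),
        "calibrated_score": hallucination_prob,  # log the raw score alongside this
        "threshold": p.t_star,
        "policy_precision_floor": p.precision_floor,
    }
```

Logging the returned record per request covers most of the telemetry list above and keeps threshold reviews auditable.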

Limits you should plan around

  • Fixed 320‑token windows miss cross‑window dependencies—especially painful in multi‑turn chat. A rolling or hierarchical context model would help.
  • Task‑adapted (not fully universal). The meta‑classifier and thresholds still need per‑task tuning; ship them as policy artifacts, not baked into code.
  • Clean‑context bias. Results are on benchmarks with relevant passages; real RAG retrieval noise will degrade performance. Stress‑test with adversarially irrelevant chunks.

Where we’d push next

  1. Entity‑graph features. Lightweight linking/relational checks to catch subtle “invented links.”
  2. Retriever‑aware abstention. When abstaining, use the reason for the abstention (low entailment vs. high contradiction) to drive targeted re‑retrieval prompts.
  3. Governance loops. Weekly threshold review tied to precision@intent KPIs; auto‑suggest new t* when drift is detected.

Bottom line: HALT‑RAG turns hallucination detection from a heuristic into a policy instrument. If you’re running RAG in production, calibrated abstention is the cheapest lever you can pull for safety without throttling your roadmap.

Cognaptus: Automate the Present, Incubate the Future