Silence is safe. That is the awkward little secret behind many “reliable AI” systems.

Ask a retrieval-augmented generation system a question. It drafts an answer. A factuality filter checks each claim. Risky claims are removed. The final answer is cleaner, safer, and statistically more defensible. On a dashboard, factuality goes up. In a meeting, everyone nods. In production, the user receives something that says almost nothing.

Congratulations. The system has become reliable in the same way an empty spreadsheet has no accounting errors.

The paper behind this article, “Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights,” studies this problem directly [1]. Its target is conformal factuality for RAG-based large language models: a method that uses a calibration set to filter generated claims so that the retained claims satisfy a formal factuality guarantee. The paper’s useful contribution is not merely another hallucination benchmark. It shows why a guarantee that looks strong at the claim level can become weak, expensive, or operationally dull once we ask the business question: “Did the system still answer the user?”

That is the truth filter paradox. The more aggressively we filter for truth, the more likely we are to delete usefulness. The villain is not conformal prediction. The villain is pretending that “no false claims survived” is the same as “the answer is good.” A small distinction, naturally, only large enough to ruin the product.

The filter does not make the answer better; it makes parts of it disappear

The core mechanism is simple enough to be dangerous.

A user asks a query $x$. A RAG system retrieves a reference document $R(x)$ and generates an answer $y$. The answer is then decomposed into atomic claims. Each claim receives a factuality score from a verifier or confidence model. A threshold is calibrated on held-out examples. Claims below the threshold are removed. The remaining claims are merged back into a final answer.

The paper studies this claim-level filtering pipeline under the conformal factuality framework. In plain terms, the method uses calibration data to choose a threshold such that, with high probability, every retained claim is factual:

$$ P(\forall c \in F(\tau_\alpha),\ c \text{ is factual}) \geq 1 - \alpha $$

Here $F(\tau_\alpha)$ is the set of claims that survive filtering at threshold $\tau_\alpha$, and $\alpha$ is the tolerated failure rate. The threshold is selected from calibration examples. For each calibration answer, the system asks: how strict would the threshold need to be so that all retained claims are factual? These example-level thresholds are then aggregated using a conformal quantile. The promise depends on a familiar but often under-loved condition: the calibration examples and future test examples must be exchangeable. In business language, the traffic used to set the filter must resemble the traffic where the filter is deployed.
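The calibration step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a claim is kept when its score is strictly above the threshold, takes each calibration answer's score to be the highest score among its false claims, and uses the usual split-conformal rank with a finite-sample correction. The function names and the toy data are hypothetical.

```python
import math

def example_score(claims):
    """Strictness needed for one calibration answer.

    `claims` is a list of (score, is_factual) pairs. Keeping only claims that
    score strictly above the returned value leaves no false claim behind.
    """
    false_scores = [score for score, is_factual in claims if not is_factual]
    return max(false_scores, default=float("-inf"))

def conformal_threshold(calibration_answers, alpha):
    """Aggregate example-level scores with a conformal quantile."""
    scores = sorted(example_score(claims) for claims in calibration_answers)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample correction
    if rank > n:
        return float("inf")  # too few calibration answers: keep nothing
    return scores[rank - 1]

def filter_claims(scored_claims, tau):
    """Keep claims scoring strictly above tau; nothing is regenerated."""
    return [text for text, score in scored_claims if score > tau]

# Toy calibration set (illustrative values, not data from the paper).
calibration = [
    [(0.90, True), (0.30, False)],
    [(0.80, True), (0.70, True)],
    [(0.95, True), (0.60, False)],
    [(0.50, True), (0.20, False)],
]
tau = conformal_threshold(calibration, alpha=0.5)
kept = filter_claims([("born in 1950", 0.85), ("won a prize", 0.25)], tau)
```

With only four calibration answers, asking for `alpha=0.1` pushes the rank past the end of the list and the threshold becomes infinite, so every claim is deleted. The guarantee still holds in that case, which is exactly the kind of success the rest of this article is suspicious of.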

That assumption is where the product story begins.

| Pipeline step | What happens technically | Business failure mode |
| --- | --- | --- |
| Generate answer | The RAG model writes a response from a retrieved reference | The answer may contain useful facts mixed with errors |
| Parse claims | The answer is split into atomic factual units | Parser changes can alter what gets evaluated |
| Score claims | A verifier estimates whether each claim is supported | Confidence scores may not separate subtle hallucinations |
| Calibrate threshold | A held-out set determines how strict filtering should be | Calibration data may not match deployment data |
| Filter and merge | Low-confidence claims are removed | The final answer may become empty, generic, or insufficient |

Notice what this pipeline does not do. It does not regenerate a better answer. It does not ask whether the remaining answer is complete. It does not know whether the user’s actual task has been solved unless additional metrics are introduced. It simply deletes claims that fail the threshold.

Deletion is useful. Deletion is also a blunt instrument. A compliance officer may enjoy blunt instruments; users usually do not.

The paper’s first move is to stop rewarding empty answers

The paper’s most important editorial move is also its most practical one: it refuses to treat an empty output as a success.

Traditional conformal factuality metrics can make a system look good when it has merely become evasive. Empirical factuality asks whether all retained claims are factual. If no claims remain, that condition is trivially satisfied. Power measures how many true claims survive relative to the original output. These are meaningful, but incomplete. They tell us something about error control, not enough about informativeness.

The authors therefore introduce informativeness-aware metrics. These are not decorative extras. They are the difference between evaluating a fact filter and evaluating a usable RAG system.

| Metric | What it asks | Why it matters operationally |
| --- | --- | --- |
| Empirical factuality | Are all retained claims factual? | Can be inflated by empty outputs |
| Power | What share of originally true claims survived? | Measures retention, but not task completion |
| Non-empty rate | Does the filtered answer keep at least one claim? | Detects vacuous “safe” answers |
| Non-vacuous empirical factuality | Among non-empty outputs, are retained claims factual? | Separates truth from abstention |
| Sufficient correctness | Does the answer contain enough correct information to recover the right answer? | Connects factuality to usefulness |
| Conditional sufficient correctness | When the original answer was sufficient, did filtering preserve sufficiency? | Isolates damage caused by the filter |

This metric set changes the interpretation of conformal factuality. The question is no longer only, “Can we control false claims?” It becomes, “Can we control false claims without deleting the answer’s value?”
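To make the six metrics concrete, here is a small sketch that computes them from per-answer records. It follows the plain-language definitions in the table, not the paper's code. The `Example` fields are hypothetical, the sufficiency flags are assumed to come from some external judge, and power is pooled across answers, which is one of several reasonable conventions.

```python
from dataclasses import dataclass

@dataclass
class Example:
    original: list[bool]     # factuality label of each claim before filtering
    kept: list[bool]         # factuality label of each claim that survived
    sufficient_before: bool  # judged: original answer was enough to recover the right answer
    sufficient_after: bool   # judged: filtered answer is still enough

def rate(flags):
    """Fraction of True values; NaN when there is nothing to average."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")

def report(examples):
    non_empty = [e for e in examples if e.kept]
    was_sufficient = [e for e in examples if e.sufficient_before]
    true_before = sum(sum(e.original) for e in examples)
    true_after = sum(sum(e.kept) for e in examples)
    return {
        # all([]) is True, so an empty answer counts as "factual" here.
        "empirical_factuality": rate(all(e.kept) for e in examples),
        "power": true_after / true_before if true_before else float("nan"),
        "non_empty_rate": rate(bool(e.kept) for e in examples),
        "non_vacuous_factuality": rate(all(e.kept) for e in non_empty),
        "sufficient_correctness": rate(e.sufficient_after for e in examples),
        "conditional_sufficient_correctness": rate(
            e.sufficient_after for e in was_sufficient
        ),
    }

# Toy records (illustrative only): two of the four answers were filtered to nothing.
examples = [
    Example([True, True, False], [True, True], True, True),
    Example([True, False], [], True, False),
    Example([True, True], [], True, False),
    Example([False, True], [False], False, False),
]
metrics = report(examples)
```

On this toy set, empirical factuality is 0.75 while non-vacuous factuality is 0.5 and sufficient correctness is 0.25. The headline number is flattered by the two empty answers.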

That sounds obvious. It is not obvious in evaluation practice. Many AI dashboards still reward models for avoiding visible errors more than for completing the user’s job. The paper’s metrics force the embarrassing audit question: did the system become safer, or did it just learn to say less?

References help the generator, but they do not rescue the guarantee

The paper evaluates RAG-style generation across factual biography, mathematical reasoning, and open-domain question answering tasks. It uses datasets including FActScore, a rare-person variant of FActScore, MATH, and Natural Questions. For the main experiments, the authors deliberately assume oracle retrieval: the reference is treated as sufficient for answering the query.

That assumption is not a weakness of the mechanism study. It is a controlled simplification. If factuality filtering struggles even when retrieval is not the bottleneck, then a real-world system with noisy retrieval has fewer excuses, not more.

The first major empirical pattern is straightforward: references help the generator produce better initial answers. The with-reference versus without-reference comparison is the paper’s main evidence on this point. When models receive relevant references, sufficient correctness improves across settings. On factual biographies and open-domain questions, references are especially important. On math problems, the gap narrows for larger models, but reference context still matters.

That result should not surprise anyone who has used RAG outside a demo. Retrieval helps because the model is no longer asked to hallucinate from memory with a straight face. But the paper’s deeper point is subtler: better initial answers do not eliminate the evaluation problem. A good reference can improve generation quality, while a claim filter can still erase too much, fail under shifted traffic, or become computationally expensive.

The business interpretation is therefore not “RAG works.” We already knew RAG often helps. The interpretation is: the quality of the retrieved context and the quality of the post-generation factuality filter are different control surfaces. Mixing them into one generic “accuracy” score is a lovely way to lose the root cause.

The scoring ablations mostly say: stop worshipping the larger scorer

The paper compares several ways to score claims. Some scorers are LLM-based confidence models. Others are entailment-based verifiers, including document-level and sentence-level natural language inference models. The experiments vary prompts, references, scoring formats, reasoning traces, evidence highlighting, multiple samples, and model scale.

These are mostly ablations and implementation comparisons. Their purpose is not to establish a new universal best verifier. Their purpose is to reveal which design choices actually matter in the filtering pipeline.

Several findings are useful for engineering teams.

First, numeric confidence scoring is generally better than Boolean yes-or-no scoring. That makes sense. A thresholding method needs a ranking signal. A binary answer turns a calibrated filter into a crude gate and then asks statistics to clean up the mess. Statistics is many things; janitorial magic is not one of them.

Second, sampling multiple confidence judgments can improve stability. Again, not shocking, but operationally important. If a scorer is noisy, one score per claim may be a poor basis for deletion. Averaging several samples can reduce variance, though each extra sample also increases cost.
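The first two findings are easy to see in a toy sketch. The scores below are invented for illustration, and averaging is only one plausible way to combine samples; the paper's aggregation may differ.

```python
from statistics import mean

def aggregate(samples):
    """Average several sampled confidence judgments for one claim."""
    return mean(samples)

def kept_at(scores, tau):
    """Names of claims whose score is strictly above the threshold."""
    return [name for name, score in scores.items() if score > tau]

# Hypothetical numeric judgments on a 0-1 scale, three samples per claim.
numeric = {
    "claim_a": aggregate([0.9, 0.8, 1.0]),
    "claim_b": aggregate([0.6, 0.7, 0.5]),
    "claim_c": aggregate([0.2, 0.4, 0.3]),
}
# The same claims scored once with yes/no, mapped to 1.0 / 0.0.
boolean = {"claim_a": 1.0, "claim_b": 1.0, "claim_c": 0.0}

# A numeric signal lets a threshold separate claim_a from claim_b.
numeric_kept = kept_at(numeric, 0.7)
# A Boolean signal has two levels: both "yes" claims survive, or neither does.
boolean_kept = kept_at(boolean, 0.7)
boolean_kept_strict = kept_at(boolean, 1.0)
```

With the Boolean scorer there is no threshold that keeps `claim_a` and drops `claim_b`. The calibration step can only choose between keeping every "yes" and keeping nothing.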

Third, chain-of-thought prompting and evidence highlighting do not reliably improve factuality scoring. This is an important negative result. It pushes against a common habit in LLM system design: when uncertain, add more prompt ceremony. The paper suggests that ceremony is not the same as calibration. A beautifully formatted prompt can still be a weak measurement instrument.

Fourth, scaling the scoring model does not guarantee better filtering. Larger LLMs do not automatically produce better conformal factuality behavior. In some model families, scaling helps; in others, the gain is weak or performance even degrades. Smaller models can be competitive. This matters because scoring happens per claim. If one answer yields many claims and every claim requires a large LLM call, the cost profile becomes unfriendly very quickly.

The more interesting comparison is between LLM confidence scorers and entailment-based verifiers. The paper finds that lightweight entailment models can match or outperform LLM confidence scorers in several settings, while using dramatically less compute. This should make product teams pause. A large model may be a good writer, but that does not mean it is the cheapest or most reliable fact checker for every claim.

A useful RAG stack may therefore look less glamorous than the sales deck: a capable generator, a disciplined parser, and a small verifier doing repetitive factuality work. Not every component needs to be a giant model wearing a crown.

High factuality can be purchased with low usefulness

The paper’s most business-relevant results appear when empirical factuality is read together with non-empty rate, sufficient correctness, and conditional sufficient correctness.

At high target factuality levels, conformal filtering often becomes conservative. It removes more claims. Empirical factuality improves. But non-empty rate and sufficient correctness can fall. In other words, the system achieves a cleaner answer by preserving less answer.

This is not a bug in the mathematics. It is a consequence of the objective.

The conformal guarantee concerns the claims that survive. If the safest way to satisfy the guarantee is to retain almost nothing, then the method has done what it was asked to do. The problem lies in asking too narrow a question. “Are the retained claims factual?” is a necessary question for enterprise AI. It is not a sufficient one.
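A threshold sweep on toy data shows the trade in miniature. The scores and labels are invented, and the sketch assumes claims are kept when they score strictly above the threshold.

```python
def sweep(answers, thresholds):
    """For each threshold, return (tau, empirical factuality, non-empty rate)."""
    rows = []
    for tau in thresholds:
        # Factuality labels of the claims that survive at this threshold.
        kept = [[ok for score, ok in claims if score > tau] for claims in answers]
        factuality = sum(all(k) for k in kept) / len(kept)  # empty counts as factual
        non_empty = sum(bool(k) for k in kept) / len(kept)
        rows.append((tau, factuality, non_empty))
    return rows

# Each answer is a list of (score, is_factual) claims. Illustrative values only.
answers = [
    [(0.90, True), (0.60, False), (0.40, True)],
    [(0.70, True), (0.30, False)],
    [(0.85, False), (0.50, True)],
    [(0.60, True), (0.55, True)],
]
rows = sweep(answers, [0.0, 0.5, 0.8, 0.95])
for tau, factuality, non_empty in rows:
    print(f"tau={tau:.2f}  factuality={factuality:.2f}  non_empty={non_empty:.2f}")
```

In this toy sweep, factuality climbs from 0.25 to 1.00 as the threshold tightens, and the non-empty rate falls from 1.00 to 0.00. The last row is perfectly factual and says nothing.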

Conditional sufficient correctness is especially useful here. It asks whether filtering preserves usefulness when the original answer was already sufficient. That isolates the filter’s damage. If the generator produced enough correct information, but filtering removed the pieces needed to answer the question, the failure belongs to the factuality layer, not the retrieval layer or the generator.

This is the metric a product manager should want on the dashboard. Not because it is fashionable, but because it tells you when your safety layer is quietly converting useful answers into polite fragments.

The guarantee breaks exactly where product traffic gets interesting

Conformal methods rely on calibration. Calibration relies on exchangeability. Deployment traffic has a charming habit of undermining both.

The paper tests robustness in two main ways. These are robustness and sensitivity tests, not side quests. They ask whether the factuality guarantee survives when the calibration distribution differs from test-time conditions or when the answer contains plausible hallucinated distractors.

The first robustness test uses calibration claims from a different distribution. The paper compares calibration on claims from a prior human-annotated source with calibration on same-distribution held-out claims. Under distribution shift, empirical factuality can fall below the target, especially at higher target factuality levels. Some entailment scorers appear more robust in certain settings, but often at the cost of power: they preserve fewer useful claims.

The second robustness test injects plausible distractor claims into answers. These are not cartoonishly false claims. They are designed to resemble hallucinations that a model might plausibly generate. As distractor rates increase, empirical factuality drops sharply. The filter has difficulty separating these plausible false claims from supported ones.

The authors also test whether calibrating with a matching distractor proportion can restore factuality. It can help, but the price is lower non-empty rate. The threshold becomes stricter. The system responds to harder traffic by deleting more.
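The mechanics of that trade can be sketched with a deliberately extreme toy case. Everything here is an assumption for illustration: one distractor is appended to every answer, every distractor gets the same hypothetical score, and the same answers double as calibration and test data, which a real evaluation would never do.

```python
import math

def conformal_threshold(answers, alpha):
    # Example-level score: highest score among false claims (-inf if none).
    scores = sorted(
        max((s for s, ok in a if not ok), default=float("-inf")) for a in answers
    )
    rank = math.ceil((len(scores) + 1) * (1 - alpha))
    return scores[rank - 1] if rank <= len(scores) else float("inf")

def evaluate(answers, tau):
    """Return (empirical factuality, non-empty rate) at threshold tau."""
    kept = [[ok for s, ok in a if s > tau] for a in answers]
    n = len(kept)
    return sum(all(k) for k in kept) / n, sum(bool(k) for k in kept) / n

def inject(answers, distractor_score):
    """Append one false-but-plausible claim to every answer."""
    return [a + [(distractor_score, False)] for a in answers]

clean = [
    [(0.90, True), (0.20, False)],
    [(0.80, True), (0.30, False)],
    [(0.70, True)],
    [(0.85, True), (0.10, False)],
]
shifted = inject(clean, distractor_score=0.75)

tau_clean = conformal_threshold(clean, alpha=0.25)
on_clean = evaluate(clean, tau_clean)        # calibration matches traffic
on_shifted = evaluate(shifted, tau_clean)    # distractors slip past the old threshold

tau_matched = conformal_threshold(shifted, alpha=0.25)
recalibrated = evaluate(shifted, tau_matched)  # factuality back, fewer answers survive
```

In this toy case the stale threshold lets every distractor through. Recalibrating on matching traffic restores factuality, but one of the four answers is now empty.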

This is the production lesson: robustness is not free. If your calibration set is made of neat benchmark claims and your deployment traffic contains messy user prompts, evolving model outputs, parser updates, retrieval drift, adversarially plausible content, and business-domain ambiguity, the formal guarantee may become a paper umbrella in a typhoon.

Not useless. Just not magic. AI evaluation could use fewer magic umbrellas.

The cheapest verifier may be the grown-up choice

The paper’s efficiency analysis is unusually important because factuality filtering is a repeated operation. A generator may answer once, but a verifier may score many claims per answer. The cost multiplier hides in the claim decomposition.

The authors compare LLM confidence scorers with smaller entailment models, including DeBERTa- and RoBERTa-based verifiers. They estimate floating-point operations under a generation and scoring setup. The precise ratios depend on model, sequence length, and inference assumptions, but the pattern is clear: entailment verifiers sit far below large LLM scorers in compute cost while remaining competitive in factuality filtering.
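A back-of-envelope sketch shows where the multiplier comes from. It uses the common rule of thumb of roughly 2 FLOPs per parameter per token for a forward pass, which ignores attention and decoding overhead. The parameter counts, token lengths, and claim counts are illustrative assumptions, not the paper's models or measurements.

```python
def scoring_flops(n_params, tokens_per_call, claims_per_answer, samples_per_claim=1):
    """Rough forward-pass cost of scoring every claim in one answer."""
    return 2 * n_params * tokens_per_call * claims_per_answer * samples_per_claim

# Hypothetical small entailment verifier: one pass per claim.
small_verifier = scoring_flops(
    n_params=4e8, tokens_per_call=512, claims_per_answer=20
)
# Hypothetical large LLM scorer: five sampled judgments per claim.
large_llm = scoring_flops(
    n_params=7e10, tokens_per_call=512, claims_per_answer=20, samples_per_claim=5
)
ratio = large_llm / small_verifier
print(f"large scorer costs about {ratio:.0f}x the small verifier per answer")
```

Under these made-up numbers the large scorer costs several hundred times more per answer. The exact figure is not the point. Claims per answer and samples per claim multiply model size, so the gap widens as the filter becomes more thorough.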

| Scorer category | Paper’s practical role | Operational interpretation |
| --- | --- | --- |
| Large LLM confidence scorer | Flexible claim scorer with natural-language judgment ability | Expensive when called per claim; scaling does not guarantee calibration gains |
| Smaller LLM scorer | Lower-cost confidence scorer | Can be competitive, but family-specific behavior varies |
| Document-level entailment verifier | Tests whether a claim is supported by the full reference | Strong cost-performance candidate for RAG factuality filtering |
| Sentence-level entailment verifier | Tests claim support against sentence evidence | Useful, but aggregation choices matter |

The business implication is not simply “use the smallest model.” That would be the usual cost-cutting overreaction, and we already have enough of those. The better lesson is to separate generation from verification. A model that is good at writing may not be the best model for scoring support. A verifier that is boring, small, and consistent may be exactly what the economics require.

For enterprise RAG, this matters because factuality filtering is not a one-off research expense. It becomes part of every production request. If the verifier is too expensive, teams will sample fewer claims, reduce evaluation coverage, or disable checks under latency pressure. Reliability systems that are too costly often become ceremonial systems. They exist in architecture diagrams and disappear during peak traffic.

What the paper directly shows, and what Cognaptus infers

The paper is careful about its experimental setting. Cognaptus should be equally careful about the business interpretation.

| Paper result | What the paper directly supports | Cognaptus business inference | Boundary |
| --- | --- | --- | --- |
| Informativeness-aware metrics reveal vacuity | Traditional factuality can look strong when outputs are empty or thin | RAG dashboards should track usefulness, not only factuality | Metric design still needs domain-specific success criteria |
| References improve initial generation quality | Retrieved context helps models produce more sufficient answers | Retrieval quality and factuality filtering should be diagnosed separately | Experiments use controlled references, often oracle-style |
| LLM scoring prompt choices matter unevenly | Numeric scoring and multiple samples help more reliably than prompt ornamentation | Invest in measurement design before prompt theatrics | Results vary by model family and task |
| Distribution shift weakens guarantees | Calibration mismatch can push empirical factuality below target | Recalibrate after model, parser, prompt, retrieval, or traffic changes | The exact failure rate depends on deployment distribution |
| Plausible distractors stress the filter | Hallucination-like false claims are hard to separate from true claims | Red-team factuality filters with realistic false claims, not only obvious errors | Generated distractors approximate but do not exhaust real-world failures |
| Lightweight entailment verifiers compete at lower compute | Smaller verifiers can offer strong cost-performance trade-offs | Use specialized verifiers where possible to protect margin and latency | Some domains may require custom entailment data or human validation |

The practical design pattern is clear: conformal factuality should be treated as a guardrail with instrumentation, not as a complete reliability solution. It can reduce unsupported claims. It cannot, by itself, guarantee that the answer remains useful, robust, or economically sensible.

Where this result should not be overread

The paper’s strongest lesson is about mechanism and measurement. It is not a final field manual for every enterprise RAG deployment.

First, the main setup assumes strong retrieval, often effectively oracle retrieval. That is analytically useful because it isolates filtering. But real systems also fail because retrieval misses the relevant document, retrieves stale documents, ranks the wrong passage, or mixes sources with conflicting authority. In those systems, factuality filtering is only one layer of the problem.

Second, the benchmarks cover important but bounded task types: factual biography, math reasoning, and open-domain question answering. Enterprise settings such as legal review, insurance claims, procurement policy, clinical documentation, or financial compliance may have different definitions of sufficient correctness. A claim can be true and still unusable if it omits a legally required condition.

Third, the paper uses automated claim parsing and factuality labeling, with human validation reported for sampled claims. That is reasonable for scale, but it means the evaluation layer is itself partly model-mediated. In high-stakes deployments, companies should validate the evaluator, not merely the generator. The thermometer also needs calibration.

Fourth, generated distractors are useful stress tests, but real hallucinations may be shaped by domain incentives, ambiguous source documents, user pressure, and prompt injection. Synthetic distractors can reveal brittleness; they cannot certify robustness against every failure mode.

These boundaries do not weaken the paper. They define how to use it. The result is not “conformal factuality fails.” The result is more precise: conformal factuality can control a narrow risk under matching conditions, but the operational value depends on informativeness, robustness, and verifier cost.

The business lesson is to instrument deletion

Most AI reliability conversations still focus on whether the model says something false. That is understandable. False claims are visible, embarrassing, and occasionally expensive. But after reading this paper, the more mature question is slightly different: what did the reliability layer delete to become safe?

A useful enterprise RAG evaluation suite should therefore include at least four layers.

First, it should measure claim factuality among retained claims. This is the traditional safety target, and it remains necessary.

Second, it should measure non-empty rate and non-vacuous factuality. A system that answers with nothing should not receive the same praise as a system that answers correctly.

Third, it should measure sufficient correctness for the user’s task. This requires task-specific definitions. A customer support bot, an internal policy assistant, and a financial research agent do not share the same notion of enough information.

Fourth, it should measure conditional sufficient correctness after filtering. This reveals whether the filter preserved value when the generator had already produced it.

That last metric deserves more attention. It is the audit trail for the safety layer. Without it, teams may blame the generator for failures caused by the filter. With it, they can see whether the system is hallucinating, retrieving poorly, scoring badly, or over-deleting.

The operational playbook is not glamorous, but it is clear:

  1. Calibrate on deployment-like data, not only benchmark examples.
  2. Recalibrate after changing the generator, parser, scorer, retrieval pipeline, prompt format, or user segment.
  3. Stress-test with plausible hallucinated claims, not only obvious falsehoods.
  4. Track usefulness metrics alongside factuality metrics.
  5. Consider specialized entailment verifiers before paying a large LLM to judge every claim.
  6. Treat empty answers as abstentions, not as factuality victories.
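Item 2 is the easiest one to forget, so it is worth automating. One possible sketch: store a fingerprint of the pipeline alongside the calibrated threshold, and flag the threshold as stale when the live configuration no longer matches. The configuration keys and version strings below are hypothetical, and this only catches declared changes. It cannot detect silent drift in user traffic, which still needs monitoring on fresh labeled samples.

```python
import hashlib
import json

def pipeline_fingerprint(config):
    """Stable hash of the components the calibration implicitly depends on."""
    canonical = json.dumps(config, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def needs_recalibration(calibrated_fingerprint, live_config):
    """True when the live pipeline differs from the one used for calibration."""
    return pipeline_fingerprint(live_config) != calibrated_fingerprint

# Hypothetical component versions recorded when the threshold was calibrated.
calibrated_config = {
    "generator": "gen-v3",
    "parser": "claims-v1",
    "scorer": "nli-doc-v2",
    "retriever": "index-2024-05",
    "prompt_format": "qa-v4",
    "user_segment": "support",
}
stored = pipeline_fingerprint(calibrated_config)

# A parser upgrade changes what counts as a claim, so the threshold is stale.
live_config = dict(calibrated_config, parser="claims-v2")
stale = needs_recalibration(stored, live_config)
```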

The theme is not anti-conformal. Quite the opposite. Conformal factuality is valuable because it makes assumptions visible. The problem begins when teams hide those assumptions behind a single reliability number and then act surprised when the product becomes timid, expensive, or brittle.

Reliable AI cannot mean silent AI

The paper’s central contribution is a useful correction to the reliability debate. It reminds us that a filtered answer is not automatically a better answer. It may be a safer answer. It may also be a thinner answer, a more expensive answer, or an answer that survives only under friendly calibration conditions.

For Cognaptus readers building AI systems, the lesson is direct. Do not buy “truth filters” as if they were universal disinfectant. Use them as measurable components in a broader reliability architecture. Measure what they remove. Measure whether the remaining answer still solves the task. Measure whether the guarantee survives the kind of traffic your users actually send when they are tired, impatient, and creatively destructive.

AI reliability should not be a contest to produce the cleanest silence. A useful system must say enough, say it correctly, and know when its own guardrails are deleting the point.

That is harder than a dashboard green light. It is also much closer to the work.

Cognaptus: Automate the Present, Incubate the Future.


  1. Yi Chen, Daiwei Chen, Sukrut Madhav Chikodikar, Caitlyn Heqi Yin, and Ramya Korlakai Vinayak, “Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights,” arXiv:2603.16817, March 17, 2026. https://arxiv.org/abs/2603.16817