Chart review is the boring part of medicine, which is exactly why AI systems should learn from it.
A clinical discharge summary does not fail only when it sounds clumsy. It fails when it tells a patient something that did not happen, invents a medication change, adds a procedure, misstates a timing detail, or turns a vague note into a confident medical fact. The prose may still be smooth. The bedside manner may even be excellent. Unfortunately, a hallucination delivered in fluent patient-friendly language is not safer because it has better manners.
That is the useful starting point for Hallucination Detection-Guided Preference Optimization for Clinical Summarization [1]. The paper is not simply another attempt to make a medical LLM “more aligned.” Its real contribution is more operational: it turns hallucination detection into a correction workflow, then turns that correction workflow into preference data. In other words, the model is not merely told to be faithful. It is shown where it went off-chart, asked to repair the specific unsupported content, and then trained to prefer the repaired version.
This mechanism matters because one of the tempting business assumptions around clinical LLMs is wrong: if a model is fine-tuned on clinician-written summaries, surely factuality should improve. In this paper, supervised fine-tuning does the opposite for the main LLaMA-3.1-8B-Instruct evaluation. It improves fluency and coherence, but hallucinations rise from 29 under prompting to 57 under SFT. That is not a rounding error. That is the model becoming more polished while becoming less grounded. A very modern failure mode, really: better packaging, worse contents.
The paper’s answer is not “prompt harder.” It is a two-stage mechanism: first, use hallucination detectors to guide self-refinement at inference time; second, convert those detector-guided revisions into preference pairs and train with direct preference optimization. The resulting methods are called HDSR and HDSR-PL. The distinction is important because it maps neatly onto a practical product decision: do you pay for detector-guided correction each time you generate, or do you use those corrections to train a model that behaves better without the repeated loop?
The error is a fact-control problem, not a style problem
The task studied in the paper is clinical summarization from Brief Hospital Course sections to Discharge Instructions, using datasets derived from MIMIC-IV-Note v2.2. The source is a clinician-facing hospital-course narrative. The target is a patient-facing summary that should explain what happened during the admission in clearer language.
This is a hard version of summarization because the output must do two things at once. It must translate medical notes into accessible language, and it must not introduce unsupported clinical facts. Those two goals can conflict. A model that tries to be helpful may smooth over ambiguity, add plausible advice, or infer clinical details that are common in similar cases but absent from this patient’s record.
The paper evaluates hallucinations at the entity level and uses clinician review for the main LLaMA-3.1-8B-Instruct results. The qualitative dimensions are familiar summarization metrics: consistency, coherence, fluency, and relevance. But the important measurement is not just whether the summary reads well. It is whether specific clinical claims are supported by the source.
That changes the business interpretation. If a hospital, insurer, care-navigation company, or health-record vendor treats clinical summarization as a writing problem, the natural solution is better prompts, better examples, or supervised fine-tuning on high-quality summaries. If it treats the problem as a fact-control problem, the solution has to include diagnosis: identify which claim is unsupported, then revise that claim without disturbing supported content.
The paper chooses the second route.
HDSR starts by marking the suspect span
Hallucination Detection Guided Self-Refinement, or HDSR, is an inference-time process. The model first generates a summary. A hallucination detector then compares that summary against the source clinical note and identifies unsupported or inconsistent content. The model receives the original source, the draft summary, and detector feedback, then revises the draft. The process can repeat until a fixed iteration limit is reached or no further hallucinations are detected.
The key detail is that the detector does not merely produce a general warning such as “this summary may contain hallucinations.” In the prompt design shown in the appendix, potentially incorrect or unsupported text is wrapped in <error> ... </error> tags. The revision instruction then tells the model to remove, correct, or keep each marked segment depending on whether the source supports it.
That makes HDSR different from generic self-reflection. A generic self-refinement loop asks the model to review its own answer. That can help, but it can also produce confident rearrangement rather than factual repair. HDSR narrows the model’s job: do not rewrite everything; look at this suspect content; check it against the source; repair only what needs repair.
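To make the tagging step concrete, here is a minimal sketch of how detector output could become the marked draft that the reviser sees. The paper's appendix describes the <error> ... </error> tags; the character-offset span format, the function name `mark_error_spans`, and the example sentence are assumptions made for illustration, not details from the paper.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed format)


def mark_error_spans(summary: str, spans: List[Span]) -> str:
    """Wrap each flagged character span of `summary` in <error>...</error> tags."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor or end > len(summary) or start >= end:
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        pieces.append(summary[cursor:start])                    # supported text, untouched
        pieces.append(f"<error>{summary[start:end]}</error>")   # suspect text, tagged
        cursor = end
    pieces.append(summary[cursor:])
    return "".join(pieces)


# Hypothetical draft: the warfarin claim is assumed to be absent from the source note.
draft = "You were admitted for chest pain and started on warfarin."
phrase = "started on warfarin"
start = draft.index(phrase)
marked_draft = mark_error_spans(draft, [(start, start + len(phrase))])
print(marked_draft)
```

Everything outside the tags passes through unchanged, which is the point: the reviser is told where to look rather than invited to rewrite the whole summary.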
A simplified view is:
| Step | What happens | Why it matters |
|---|---|---|
| Draft generation | The model writes a patient-facing discharge instruction summary from the Brief Hospital Course. | This is where unsupported content can enter. |
| Detection | A detector marks unsupported or contradictory spans in the draft. | The model receives localized factual feedback instead of vague criticism. |
| Targeted revision | The model revises the marked content while preserving supported information. | The correction is aimed at factuality, not cosmetic rewriting. |
| Iteration | The revised summary can be checked again. | The loop can catch remaining unsupported content, at additional cost. |
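The four steps in the table can be sketched as a short loop. This is an illustration under stated assumptions, not the authors' implementation: in the paper the detector is MedCat or a prompted LLM and the reviser is the summarization model itself, whereas `toy_detect`, `toy_revise`, `CLINICAL_TERMS`, the `max_iters` default, and the example note below are hypothetical stand-ins so the loop can run end to end.

```python
from typing import Callable, List

Detector = Callable[[str, str], List[str]]       # (source, summary) -> flagged items
Reviser = Callable[[str, str, List[str]], str]   # (source, draft, flagged) -> revision


def hdsr(source: str, draft: str, detect: Detector, revise: Reviser,
         max_iters: int = 3) -> str:
    """Detector-guided refinement: stop when nothing is flagged or the budget runs out."""
    summary = draft
    for _ in range(max_iters):
        flagged = detect(source, summary)
        if not flagged:
            break
        summary = revise(source, summary, flagged)
    return summary


CLINICAL_TERMS = ["warfarin", "pneumonia", "antibiotics"]  # hypothetical mini-vocabulary


def toy_detect(source: str, summary: str) -> List[str]:
    """Flag vocabulary terms that appear in the summary but not in the source."""
    return [t for t in CLINICAL_TERMS
            if t in summary.lower() and t not in source.lower()]


def toy_revise(source: str, draft: str, flagged: List[str]) -> str:
    """Drop sentences that mention a flagged term; keep every other sentence as written."""
    sentences = [s.strip().rstrip(".") for s in draft.split(". ") if s.strip()]
    kept = [s for s in sentences if not any(t in s.lower() for t in flagged)]
    return ". ".join(kept) + "." if kept else ""


source = "Patient admitted with pneumonia. Treated with antibiotics."
draft = ("You were admitted with pneumonia. You received antibiotics. "
         "You were started on warfarin.")
final = hdsr(source, draft, toy_detect, toy_revise)
print(final)
```

The detector and reviser are passed in as callables, so either can be swapped without touching the loop. That separation of detection from correction is the structural idea the sketch is meant to show.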
The authors use two detector families: MedCat, which links clinical concepts in the source and summary to biomedical ontologies, and prompt-based detectors following MedAlign annotation guidelines. These detectors are not magic truth machines. They are supervisory instruments. Their quality matters because detector false positives and false negatives become correction signals. Still, the paper’s core insight is that even imperfect detectors can structure the revision process better than asking a model to “be careful.”
That is the first business lesson: factuality control needs a sensor. Without a sensor, “alignment” becomes a corporate wish written in YAML.
HDSR-PL turns corrections into preference data
HDSR improves factuality at inference time, but it has an obvious operational cost. Each generated summary may require detector calls and revision iterations. In a production healthcare workflow, that cost may be acceptable for high-risk or low-volume use cases, but it becomes painful if summaries are generated at scale or inside latency-sensitive applications.
HDSR-PL addresses that problem by using the HDSR process to create preference pairs. The original draft becomes the non-preferred output. The detector-refined version becomes the preferred output. The model is then trained with direct preference optimization, so it learns to prefer more faithful summaries without needing the full detector-revision loop at inference time.
This is the paper’s more interesting move. Many discussions of preference optimization treat preference data as something that must be collected from humans or generated through broad AI feedback. Here, the preference signal comes from a concrete operational process: detect the unsupported content, revise it, then treat the revision as better than the original.
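A minimal sketch of that conversion step follows. The `prompt` / `chosen` / `rejected` field names follow a convention common in DPO tooling and are an assumption here, as is the choice to skip trajectories where refinement left the draft unchanged; the paper's own filtering rules are not reproduced in this article.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PreferencePair:
    prompt: str    # source note plus summarization instruction
    chosen: str    # detector-refined summary (preferred)
    rejected: str  # original draft (non-preferred)


def to_preference_pair(prompt: str, draft: str, refined: str) -> Optional[PreferencePair]:
    """Build a pair only when refinement actually changed the draft (assumed rule)."""
    if refined.strip() == draft.strip():
        return None  # nothing was corrected, so there is no preference signal
    return PreferencePair(prompt=prompt, chosen=refined, rejected=draft)


def build_dataset(trajectories: List[Tuple[str, str, str]]) -> List[PreferencePair]:
    """Turn (prompt, draft, refined) trajectories into a list of usable pairs."""
    pairs = (to_preference_pair(p, d, r) for p, d, r in trajectories)
    return [pair for pair in pairs if pair is not None]


# Hypothetical trajectories: the second one was already clean, so it yields no pair.
trajectories = [
    ("Summarize: <note A>", "You were started on warfarin.", "You were monitored overnight."),
    ("Summarize: <note B>", "You were treated for pneumonia.", "You were treated for pneumonia."),
]
dataset = build_dataset(trajectories)
print(len(dataset), dataset[0].chosen)
```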
The pipeline can be read as an amortization strategy:
| Method | Where the factuality work happens | Operational trade-off |
|---|---|---|
| HDSR | During inference, through detector-guided revision. | Better targeted correction, but more compute and latency per summary. |
| HDSR-PL | During training, by converting refined trajectories into DPO preference pairs. | Lower inference overhead, but requires training resources and inherits detector quality. |
This distinction is useful for AI product planning. HDSR is closer to a review workflow. HDSR-PL is closer to a model-improvement workflow. One watches the model while it works; the other teaches the model from earlier corrections.
Neither is universally better. HDSR may be preferable where every output can tolerate an extra verification loop, where source records are especially complex, or where auditability of correction steps matters. HDSR-PL may be preferable where scale, speed, and integration simplicity matter more. A sensible healthcare AI stack could use both: run HDSR in controlled environments to generate high-quality correction data, then periodically train an HDSR-PL model to reduce the cost of future inference.
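For readers who have not met direct preference optimization, the per-pair loss can be written in a few lines. This is the standard DPO formulation, not a reproduction of the paper's training code. The log-probabilities are hypothetical numbers, and `beta=0.1` is an illustrative default rather than a hyperparameter taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Before training, the policy equals the reference model, so the margin is zero.
untrained = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After training, the policy favors the refined summary more than the reference does.
improved = dpo_loss(-10.0, -20.0, -12.0, -15.0)
print(f"untrained={untrained:.4f} improved={improved:.4f}")
```

The loss falls only when the policy raises the refined summary's likelihood relative to the draft's by more than the frozen reference model does. That is how a detector-guided correction becomes a gradient.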
The main evidence: SFT gets smoother and less faithful
The main experimental result focuses on LLaMA-3.1-8B-Instruct on the Hallucination-Generated-DI benchmark. The numbers are simple enough to be dangerous, so they deserve interpretation.
| Model / method | Hallucination count | Consistency | Coherence | Fluency | Relevance | Average quality |
|---|---|---|---|---|---|---|
| GPT-5 prompting | 36 | 3.55 | 4.73 | 4.73 | 4.08 | 4.27 |
| LLaMA-3.1-8B prompting | 29 | 4.08 | 3.83 | 4.05 | 3.23 | 3.79 |
| LLaMA-3.1-8B SFT | 57 | 3.03 | 4.43 | 4.53 | 3.03 | 3.75 |
| LLaMA-3.1-8B HDSR, best with MedAlign | 22 | 4.13 | 4.48 | 4.53 | 3.95 | 4.27 |
| LLaMA-3.1-8B HDSR-PL, best with MedCat | 15 | 4.40 | 4.28 | 4.05 | 3.90 | 4.16 |
The first surprise is SFT. Supervised fine-tuning on clinician-written references increases hallucination count from 29 to 57. At the same time, fluency rises from 4.05 to 4.53 and coherence rises from 3.83 to 4.43. That combination is exactly why traditional summarization quality metrics can be misleading in clinical settings. A summary can become easier to read while becoming less faithful to the patient record.
The likely mechanism is not that SFT “breaks” the model in some mysterious way. More plausibly, SFT teaches the model the style and content patterns of target summaries. If the model learns how discharge instructions usually sound without receiving a strong enough signal about source-grounded factual constraints, it may become more willing to fill in clinically plausible details. Pattern imitation is not fact verification. This is not an accusation; it is simply what language models are good at.
HDSR moves in the opposite direction. With MedAlign as the detector, hallucinations fall to 22, while the average human-evaluated quality rises to 4.27. That is important because factuality improvement does not appear to come by making the output terse, awkward, or less useful. The system reduces hallucinations while preserving, and in some dimensions improving, summary quality.
HDSR-PL goes further on hallucination count, reducing it to 15 with MedCat-derived preference pairs. That is roughly a 48% reduction relative to the LLaMA-3.1 prompting baseline (29) and 74% relative to SFT (57). The quality average is slightly lower than HDSR's (4.16 versus 4.27), and fluency falls relative to both HDSR and SFT (4.05 versus 4.53), but the factuality result is the strongest among the tested LLaMA-3.1 variants.
This is the central trade-off. HDSR keeps a slightly higher quality average when direct detector feedback is available during generation. HDSR-PL internalizes enough of that signal to cut hallucinations further without repeated inference-time refinement, though it may lose some of the local revision benefit that HDSR gets from inspecting the current draft.
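The relative changes quoted in this section follow directly from the hallucination counts in the main table. A short check, with only the helper name `relative_reduction` being an invention of this sketch:

```python
def relative_reduction(before: int, after: int) -> float:
    """Percentage drop from `before` to `after` (negative means an increase)."""
    return (before - after) / before * 100


# Hallucination counts for LLaMA-3.1-8B-Instruct, copied from the main results table.
counts = {"prompting": 29, "SFT": 57, "HDSR (MedAlign)": 22, "HDSR-PL (MedCat)": 15}

pl_vs_prompting = relative_reduction(counts["prompting"], counts["HDSR-PL (MedCat)"])
pl_vs_sft = relative_reduction(counts["SFT"], counts["HDSR-PL (MedCat)"])
hdsr_vs_prompting = relative_reduction(counts["prompting"], counts["HDSR (MedAlign)"])
sft_vs_prompting = relative_reduction(counts["prompting"], counts["SFT"])

print(f"HDSR-PL vs prompting: {pl_vs_prompting:.1f}%")
print(f"HDSR-PL vs SFT:       {pl_vs_sft:.1f}%")
print(f"HDSR vs prompting:    {hdsr_vs_prompting:.1f}%")
print(f"SFT vs prompting:     {sft_vs_prompting:.1f}%")
```

The last line comes out negative because SFT nearly doubles the count instead of reducing it.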
The ablation says the detector is not decoration
The paper includes a comparison between self-refinement without detectors and detector-guided self-refinement. This is best understood as an ablation: it asks whether the improvement comes from self-refinement in general or from the detector signal specifically.
| Method | Hallucination count | Consistency | Coherence | Fluency | Relevance | Likely purpose of test |
|---|---|---|---|---|---|---|
| Self-Refine without detector | 31 | 3.85 | 4.25 | 4.58 | 4.20 | Ablation: isolate generic revision loop. |
| HDSR with MedCat | 25 | 4.08 | 4.55 | 4.58 | 4.13 | Ablation/comparison: add concept-linking detector feedback. |
| HDSR with MedAlign | 22 | 4.13 | 4.48 | 4.53 | 3.95 | Ablation/comparison: add prompted clinical error detector feedback. |
Self-refinement without detectors produces 31 hallucinations, worse than the prompting baseline of 29 in the main table. That does not mean self-refinement is always harmful. It means that, in this task, asking the model to revise without localized detection feedback is not enough. The model may improve wording, add missing information, or reorganize the summary, but it does not reliably remove unsupported clinical content.
Adding detector feedback changes the result. HDSR with MedCat reduces hallucinations to 25; HDSR with MedAlign reduces them to 22. The detector is therefore not decorative. It is the part of the loop that tells the model where factual attention should be paid.
For business teams, this ablation is more valuable than another leaderboard comparison. It says the control layer matters. If a product team builds a “review and rewrite” button around a medical LLM without a detector, they may be buying an expensive paraphraser with a stethoscope sticker on it.
The error-type table shows what gets fixed
The paper also breaks hallucinations into categories: unsupported condition, procedure, medication, time, location, number, name, word, other, contradicted fact, and incorrect fact. This table is not the main result by itself. It functions as explanatory evidence: it shows which kinds of errors the methods reduce.
For LLaMA-3.1-8B, unsupported conditions, procedures, and medications are prominent under prompting and SFT. Prompting produces 8 unsupported conditions, 2 unsupported procedures, and 1 unsupported medication. SFT raises these to 15, 11, and 8 respectively. That is clinically meaningful because these are not trivial wording problems. They concern what the patient had, what was done, and what medication was used.
HDSR variants reduce several of these categories. HDSR with MedAlign brings unsupported conditions down to 4 and holds procedures at 2 and medications at 1, level with prompting on those two and well below SFT. HDSR-PL with MedCat reduces unsupported conditions to 3, procedures to 1, and medications to 0.
There is a nuance, though. HDSR-PL with MedCat still shows 8 contradicted facts in the table, the same number as the prompting baseline. So the headline hallucination reduction does not mean every error type is solved equally. The method appears especially strong against unsupported clinical entities, while contradicted facts may remain harder or more dependent on detector behavior.
This distinction matters for deployment. A care-summary product should not report only aggregate hallucination counts. It should track error families. A system that reduces invented medications but leaves contradictions unchanged is very different from one that reduces contradictions but still invents follow-up procedures. Risk is not evenly distributed across error categories.
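Tracking error families can be as simple as keeping counts per category and comparing each method against a baseline. The sketch below uses only the unsupported-condition, procedure, and medication counts quoted in this section; the dictionary layout and the function name `change_vs_baseline` are assumptions for illustration.

```python
from typing import Dict

# Unsupported-entity counts for LLaMA-3.1-8B, as quoted in the text above.
unsupported: Dict[str, Dict[str, int]] = {
    "prompting":        {"condition": 8,  "procedure": 2,  "medication": 1},
    "SFT":              {"condition": 15, "procedure": 11, "medication": 8},
    "HDSR (MedAlign)":  {"condition": 4,  "procedure": 2,  "medication": 1},
    "HDSR-PL (MedCat)": {"condition": 3,  "procedure": 1,  "medication": 0},
}


def change_vs_baseline(method: str, baseline: str = "prompting") -> Dict[str, int]:
    """Per-category difference in error count (negative means fewer errors)."""
    return {category: unsupported[method][category] - unsupported[baseline][category]
            for category in unsupported[baseline]}


for method in ("SFT", "HDSR (MedAlign)", "HDSR-PL (MedCat)"):
    print(method, change_vs_baseline(method))
```

Read this way, the aggregate hides something: against the prompting baseline, HDSR with MedAlign gets its gain in these three categories entirely from unsupported conditions, while HDSR-PL with MedCat improves all three.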
Smaller-model results support the direction, but they are not the main proof
The paper reports limited results for LLaMA-3.2-3B-Instruct and Gemma-3-4B-IT. HDSR-PL reduces hallucinations from 26 to 13 for LLaMA-3.2-3B and from 15 to 13 for Gemma-3-4B-IT. Quality metrics also improve slightly.
These results are useful, but they should be read with discipline. The authors state that, due to resource constraints, the smaller-model hallucination counts use clinician-provided annotations from a single clinician, and the qualitative evaluation uses automatic LLM-as-judge rather than the fuller clinician evaluation used for the main LLaMA-3.1 results.
So the smaller-model findings are best classified as a limited robustness or exploratory extension. They suggest the approach is not uniquely tied to one model size or model family. They do not carry the same evidentiary weight as the main LLaMA-3.1 evaluation.
This is where business readers often make a small but costly mistake. They see “works across models” and treat it as a procurement green light. The better reading is narrower: the mechanism looks portable enough to justify further testing on your own model and note type. It does not remove the need for local validation.
The appendix is an implementation clue, not a second thesis
The appendix contains prompts for summarization, self-refinement with and without detectors, and MedAlign-style hallucination detection. These details are easy to skip, but they clarify how the mechanism actually operates.
The summarization prompt asks the assistant to help patients understand medical records, summarize the Brief Hospital Course in one paragraph, avoid medical jargon, and start with “You were admitted.” The self-refinement prompt emphasizes that the source facts are authoritative, instructs the model not to invent diagnoses, medications, procedures, or dates, and asks it to preserve source terminology where appropriate.
The detector-guided revision prompt adds the decisive instruction: text inside <error> ... </error> tags should be removed, corrected, rewritten from source facts, or retained if accurate. The MedAlign detection prompts define error categories and span-labeling rules, including how to handle deidentified information and how to label the smallest useful error span.
This appendix material is best treated as implementation detail. It does not independently prove the method works. But it explains why the method is more precise than generic critique-and-rewrite. The prompts operationalize a principle: separate detection from correction, and make correction local.
That principle is portable beyond this paper. In many enterprise AI systems, the mistake is to ask one model to generate, audit, revise, and justify everything in one conversational blob. HDSR suggests a cleaner architecture: generate first, detect second, revise third, and train later if the correction traces become reliable enough.
What Cognaptus infers for business use
The paper directly shows that detector-guided self-refinement and detector-derived preference learning reduce hallucinations in a specific clinical summarization task. The business interpretation is broader but should stay bounded.
| Paper finding | Directly shown | Cognaptus business inference | Boundary |
|---|---|---|---|
| HDSR reduces hallucinations from 29 to 22 for LLaMA-3.1-8B. | Detector-guided revision improves factuality in the evaluated BHC-to-DI task. | Use detector-guided refinement where factual risk is high and latency is acceptable. | Depends on detector quality and may require clinical validation for each note type. |
| HDSR-PL reduces hallucinations to 15. | Preference learning from refined summaries can amortize factuality gains. | Build correction-data pipelines, not just prompt libraries. | Requires training capability and inherits flaws in the generated preference pairs. |
| SFT raises hallucinations to 57. | Reference-summary fine-tuning can worsen factual grounding while improving style. | Do not treat clinician-written examples as sufficient factuality supervision. | SFT behavior may vary by dataset, model, and training setup. |
| Detector feedback beats detector-free self-refinement. | Localized factual feedback matters in this experiment. | Add a verification layer before trusting revision loops. | A weak detector may create false confidence or incorrect edits. |
For healthcare AI vendors, the most important shift is from “model quality” to “factuality operations.” A product should not merely ask which foundation model performs best. It should ask:
- What factual claims can the system detect as unsupported?
- Which detector is responsible for each claim type?
- How are detector outputs converted into revision instructions?
- Which corrections are logged as training data?
- Which error categories remain stubborn after refinement?
That is not as glamorous as saying “agentic clinical copilot.” Good. Glamour is not an FDA-adjacent control strategy.
Where the method fits in a healthcare AI stack
A practical deployment architecture could use this paper’s mechanism in three layers.
First, use detector-guided review for high-risk summaries. This is the HDSR layer. It fits use cases where a patient-facing summary, prior-authorization note, care-navigation explanation, or discharge instruction must be checked against a source record before release. The cost is higher inference overhead, but the benefit is targeted factual repair.
Second, save revision trajectories as a controlled training asset. Each pair of draft and detector-refined summary can become a candidate preference pair. Not all pairs should be accepted blindly. Some should be sampled for clinician review, especially where detectors disagree or where changes affect medications, procedures, diagnoses, or follow-up instructions.
Third, periodically train or adapt a model using preference optimization. This is the HDSR-PL layer. The operational goal is not to eliminate verification forever. It is to reduce the frequency and severity of predictable hallucinations before verification begins.
This creates a useful flywheel:
Generate summary
↓
Detect unsupported clinical claims
↓
Revise the marked claims
↓
Store draft–revision pairs
↓
Train preference model
↓
Generate fewer unsupported claims next cycle
The loop is valuable because it turns production errors into structured learning material. Many AI teams already collect user feedback. Much less often do they collect machine-readable correction traces that can be transformed into preference data. The latter is more expensive. It is also more useful.
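The second layer, deciding which draft and revision pairs need a clinician before they enter training data, could start from a routing rule like the one below. This is a hypothetical sketch: the `CandidatePair` fields, the category labels, and the rule itself are assumptions, and a real pipeline might sample from the low-risk pool as well rather than accepting it wholesale.

```python
from dataclasses import dataclass
from typing import List, Set

# Edit categories that should always get clinician eyes (assumed label set).
HIGH_RISK: Set[str] = {"medication", "procedure", "diagnosis", "follow-up"}


@dataclass
class CandidatePair:
    pair_id: str
    edited_categories: Set[str]  # what kind of content the revision changed
    detectors_agree: bool        # whether the detectors flagged the same content


def needs_clinician_review(pair: CandidatePair) -> bool:
    """Route to review if detectors disagree or the edit touches high-risk content."""
    return (not pair.detectors_agree) or bool(pair.edited_categories & HIGH_RISK)


candidates: List[CandidatePair] = [
    CandidatePair("a", {"wording"}, detectors_agree=True),
    CandidatePair("b", {"medication"}, detectors_agree=True),
    CandidatePair("c", {"wording"}, detectors_agree=False),
]
review_queue = [p.pair_id for p in candidates if needs_clinician_review(p)]
auto_accept = [p.pair_id for p in candidates if not needs_clinician_review(p)]
print(review_queue, auto_accept)
```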
The boundary conditions are not footnotes; they are product requirements
The paper’s limitations are practical rather than cosmetic.
First, HDSR and HDSR-PL rely on detector quality. A detector that misses subtle contradictions will not guide the model to fix them. A detector that flags supported content may induce unnecessary edits. In clinical summarization, false confidence in the detector layer could become a new failure mode rather than a safety feature.
Second, the evidence is task-specific. The evaluation focuses on MIMIC-IV-Note-derived Brief Hospital Course to Discharge Instruction summarization. Other clinical note types may have different factual structures. A radiology impression, emergency department note, medication reconciliation document, or insurance appeal letter would need its own validation.
Third, the cost profile differs across HDSR and HDSR-PL. HDSR adds detector and revision calls at inference time. HDSR-PL reduces inference overhead but requires training resources and a reliable pipeline for preference data creation. The right choice depends on volume, latency, regulatory exposure, and the cost of human review.
Fourth, aggregate hallucination count is not enough. The error-type analysis shows that some categories improve more than others. A production system should monitor error categories, not only total hallucinations. “Fewer hallucinations” is good. “No unsupported medication changes” is the kind of sentence that matters in a procurement meeting.
The real lesson: alignment needs instruments
The paper is valuable because it makes factuality control less mystical. It does not ask us to trust that a model has become more responsible because it was fine-tuned on better examples. It inserts an instrument into the workflow: a hallucination detector. Then it uses the detector twice, first as an inference-time guide and then as a preference-data generator.
That is the mechanism-first reading. Detection marks the unsupported content. Refinement repairs the marked content. Preference learning converts the repair into a training signal. DPO amortizes part of the correction behavior into the model. The result is not a perfect clinical summarizer, and the paper does not claim that. It is a more disciplined way to move from fluent summarization toward source-grounded summarization.
For business use, the takeaway is equally plain. In healthcare AI, better writing is not the same as better reliability. Supervised fine-tuning can teach a model to sound more like the target document while weakening its attachment to the source. If the product risk lies in unsupported facts, then the control system must see unsupported facts.
The humble detector, in this paper, does the unglamorous work. It points at the suspicious sentence and says: check this against the chart.
That is not a full safety case. But it is a better engineering habit than hoping fluency has a conscience.
Cognaptus: Automate the Present, Incubate the Future.
[1] Shamanth Kuthpadi Seethakantha, Dung Ngoc Thai, Vara Prasad Gudi, Simran Tiwari, Rami Matar, Avijit Mitra, Wenlong Zhao, Wael Salloum, and Andrew McCallum, “Hallucination Detection-Guided Preference Optimization for Clinical Summarization,” arXiv:2605.28910v1, 27 May 2026, https://arxiv.org/abs/2605.28910.