Mind the Gap: When Clinical LLMs Learn from Their Own Mistakes

Mistakes are usually treated as waste.

In clinical AI, they are treated even more nervously: logged, redacted, escalated, converted into a slide deck, and then politely buried under the next benchmark table. Understandable. Nobody wants a medical agent whose product roadmap reads like “learning through patient-adjacent embarrassment.”

But the paper “Closing Reasoning Gaps in Clinical Agents with Differential Reasoning Learning” makes a useful move: it treats mistakes not as isolated failures, but as structured raw material for improving future reasoning.¹ The core idea is not that a clinical LLM should “reflect” harder, nor that we should throw more guidelines into the prompt until the context window starts whimpering. The idea is more surgical: compare the model’s reasoning with a better reference reasoning trace, locate the precise gap, convert that gap into a reusable instruction, and retrieve that instruction when a similar case appears later.

That is the paper’s real contribution. It is not ordinary medical RAG. It is not just in-context learning with a stethoscope. It is a method for mining process-level corrections from prior reasoning discrepancies.

The distinction matters because most healthcare AI systems do not fail only by lacking facts. They fail by mishandling the path from fact to hypothesis to action. A patient has risk factors, but the model underweights them. A physician’s plan suggests concern, but the model treats discharge as reassurance. A lab value and wound description are both present, but the model discusses them separately instead of synthesizing them. The answer may even be correct while the reasoning is clinically untrustworthy. Annoying, yes. Also very common. We call this “decision support,” then act surprised when clinicians ask to see the decision.

This paper proposes a framework called Differential Reasoning Learning, or DRL. Its mechanism is simple enough to explain, but difficult enough to implement well:

Clinical case
   ↓
Reference reasoning + agent reasoning
   ↓
Typed reasoning graphs
   ↓
Graph discrepancy analysis
   ↓
Reusable corrective instruction
   ↓
Differential Reasoning Knowledge Base
   ↓
Retrieved patch for future cases

That loop is the article. The benchmark numbers are important, but they only make sense after the mechanism is clear.

The gap is not the wrong answer; it is the wrong route to the answer

Clinical reasoning is not a list of facts. It is a relational structure.

A symptom supports a hypothesis. A hypothesis motivates a test. A test changes the confidence in a diagnosis. A patient’s age, frailty, comorbidities, and discharge instructions jointly affect disposition risk. These relationships are where many LLM failures hide. If a model omits one clinically relevant node or connects two facts through the wrong inference path, the final answer can become unreliable even when the generated paragraph sounds competent.

DRL makes that structure explicit. The framework takes two reasoning traces for the same case:

a reference reasoning trace, derived from physician-authored rationale, clinical guidelines, expert knowledge, or a stronger teacher model;
an agent reasoning trace, extracted from the model’s own chain-of-thought-style reasoning.

Both traces are converted into directed acyclic graphs. In the paper’s schema, the graph contains:

Graph element	What it represents	Why it matters clinically
Facts	Symptoms, vitals, labs, history, demographics, documented findings	The evidence base the model is allowed to use
Hypotheses	Diagnoses, clinical conditions, ruled-in or ruled-out possibilities	The intermediate reasoning layer
Actions	Tests, treatments, assessments, monitoring, prescriptions	The clinical plan or recommendation layer
Final node	The answer, disposition, or final diagnosis	The endpoint, not the whole story
Edges	Supports, contradicts, or suggests-test relationships	The actual reasoning path
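To make the schema concrete, here is a minimal sketch of how such a typed graph could be represented. The class names, field names, and the example case are illustrative assumptions, not the paper's implementation; only the node and edge categories come from the table above.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    FACT = "fact"
    HYPOTHESIS = "hypothesis"
    ACTION = "action"
    FINAL = "final"


class EdgeType(Enum):
    SUPPORTS = "supports"
    CONTRADICTS = "contradicts"
    SUGGESTS_TEST = "suggests_test"


@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: NodeType
    text: str


@dataclass
class ReasoningGraph:
    """Hypothetical container for one reasoning trace (reference or agent)."""
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (src_id, dst_id, EdgeType)

    def add_node(self, node_id, node_type, text):
        self.nodes[node_id] = Node(node_id, node_type, text)

    def add_edge(self, src, dst, edge_type):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, dst, edge_type))

    def is_acyclic(self):
        # Kahn's algorithm: a DAG can be fully consumed by repeatedly
        # removing nodes that have no remaining incoming edges.
        indegree = {n: 0 for n in self.nodes}
        for _, dst, _ in self.edges:
            indegree[dst] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        seen = 0
        while ready:
            current = ready.pop()
            seen += 1
            for src, dst, _ in self.edges:
                if src == current:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        return seen == len(self.nodes)


# Illustrative case, loosely based on the elderly-fall pattern discussed in this article.
graph = ReasoningGraph()
graph.add_node("f1", NodeType.FACT, "elderly patient after a fall")
graph.add_node("h1", NodeType.HYPOTHESIS, "unsafe discharge risk")
graph.add_node("a1", NodeType.ACTION, "assess social support")
graph.add_node("out", NodeType.FINAL, "return visit admission: yes")
graph.add_edge("f1", "h1", EdgeType.SUPPORTS)
graph.add_edge("h1", "a1", EdgeType.SUGGESTS_TEST)
graph.add_edge("h1", "out", EdgeType.SUPPORTS)
print(graph.is_acyclic())
```

The acyclicity check is there because the paper describes the traces as directed acyclic graphs; an extractor that emits a cycle has produced something the comparison step should not trust.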

This is not graph decoration. The graph turns free-form clinical prose into something that can be compared. Once the model’s reasoning and the reference reasoning are both represented as structured objects, DRL can ask more useful questions than “Was the final answer correct?”

It can ask:

Did the model miss a clinically important factor?
Did it invent or overemphasize an unsupported factor?
Did it connect the right facts to the wrong hypothesis?
Did it recommend or imply actions inconsistent with the physician’s documented plan?
Did it reach the right answer through a path a clinician would not endorse?

That is the first useful shift: the paper treats reasoning quality as inspectable structure, not vibes wearing a lab coat.

Graph edit distance becomes a clinical error lens

After graph extraction, DRL compares the reference graph and the agent graph using a clinically weighted form of graph edit distance.

In ordinary terms, graph edit distance asks: how much work would it take to transform one graph into another? In this paper, the “edits” are interpreted clinically. Missing nodes, hallucinated nodes, and broken reasoning paths become different categories of error.

The authors use an LLM-as-a-judge to perform semantic matching because exact string matching would be too brittle. “Elderly patient living alone” and “limited social support in geriatric discharge” may be phrased differently but represent the same clinical idea. The judge also applies different weights to different node types: actions are treated as more consequential than hypotheses, and hypotheses more consequential than facts. That weighting is sensible. A model hallucinating an irrelevant historical detail is bad; a model hallucinating an admission plan or contraindicated treatment is worse. One is noise. The other is noise with a pager.

The paper’s discrepancy analysis focuses on three error families:

Error family	Clinical meaning	Example pattern
Missing or mismatched nodes	The model omitted an important fact, hypothesis, or action	Ignoring immunosuppression in infection-risk assessment
Hallucinated or irrelevant nodes	The model introduced unsupported clinical content	Discussing severe complications not grounded in the note
Path or edge errors	The model connected evidence, hypotheses, and actions incorrectly	Treating discharge as proof of low risk rather than a decision that may itself contain caution signals
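A rough sketch of the first two error families, under loud assumptions: the numeric weights below are invented (the paper, as summarized here, only says actions outweigh hypotheses, which outweigh facts), and normalized exact matching stands in for the LLM judge's semantic matching. Edge and path errors are left out to keep the sketch short.

```python
# Hypothetical weights: only the ordering action > hypothesis > fact is from the article.
NODE_WEIGHTS = {"fact": 1.0, "hypothesis": 2.0, "action": 3.0}


def normalize(text):
    # Stand-in for semantic matching; a real system would not rely on string equality.
    return " ".join(text.lower().split())


def node_discrepancy(reference, agent):
    """Compare two lists of (node_type, text) pairs.

    Returns nodes missing from the agent graph, nodes present only in the
    agent graph, and a weighted cost summing both kinds of discrepancy.
    """
    ref = {(t, normalize(x)) for t, x in reference}
    agt = {(t, normalize(x)) for t, x in agent}
    missing = sorted(ref - agt)
    hallucinated = sorted(agt - ref)
    cost = sum(NODE_WEIGHTS[t] for t, _ in missing + hallucinated)
    return {"missing": missing, "hallucinated": hallucinated, "cost": cost}


reference = [
    ("fact", "Elderly patient living alone"),
    ("hypothesis", "unsafe discharge risk"),
    ("action", "assess social support"),
]
agent = [
    ("fact", "elderly patient living alone"),
    ("hypothesis", "unsafe discharge risk"),
    ("action", "order head CT"),
]

report = node_discrepancy(reference, agent)
print(report["missing"])       # the omitted action
print(report["hallucinated"])  # the unsupported action
print(report["cost"])          # two action-level discrepancies at weight 3.0 each
```

Because both discrepancies sit at the action level, this toy case costs more than the same pair of errors among facts would, which is the point of the weighting.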

This is where DRL moves beyond evaluation. The discrepancy report is not the endpoint. It becomes training material.

The “knowledge base” stores reasoning patches, not medical trivia

The paper’s most business-relevant idea is the Differential Reasoning Knowledge Base, or DR-KB.

A normal medical RAG system retrieves documents: guidelines, reference pages, drug labels, clinical notes, prior cases. That can help when the model lacks information. But it does not directly fix the model’s recurring reasoning habits. If the agent repeatedly underweights social support in elderly fall cases, adding more geriatric discharge documents may help, but only indirectly. The model still has to infer the correction at runtime.

DR-KB stores something different: distilled corrective instructions produced from prior reasoning discrepancies.

For example, a discrepancy analysis may find that the agent failed to consider social support for an elderly fall patient. The generated instruction might tell the agent that, in geriatric fall cases, it should explicitly assess social support and functional status before judging safe discharge. This instruction is not a raw exemplar. It is a compact, reusable reasoning patch.

That is why the “learning from mistakes” phrase should not be misunderstood. DRL is not letting the model update its own weights after every answer, which would be a thrillingly bad governance plan. Instead, it creates a reviewable layer of corrective knowledge outside the model. The base model stays fixed. The patch library grows.

At inference time, DRL retrieves the top relevant patches from DR-KB and injects them into the prompt for the new case. The model is not merely reminded of medical facts. It is reminded of specific reasoning mistakes it is prone to make under similar conditions.
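The retrieve-and-inject step might look something like the sketch below. Token-overlap (Jaccard) similarity is a deliberately simple stand-in, since this article does not specify the paper's retriever; the knowledge-base layout, function names, and prompt wording are all hypothetical. The instruction texts paraphrase corrections described in this article.

```python
def tokens(text):
    # Crude tokenizer for illustration only.
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_patches(case_text, knowledge_base, k=3):
    """Return up to k patches whose context overlaps the new case."""
    case_tokens = tokens(case_text)
    scored = [(jaccard(case_tokens, tokens(p["context"])), p) for p in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(case_text, patches):
    lines = ["Reasoning reminders from prior discrepancies:"]
    lines += [f"- {p['instruction']}" for p in patches]
    lines += ["", "Case:", case_text]
    return "\n".join(lines)


# Hypothetical DR-KB entries: a retrieval context plus a corrective instruction.
dr_kb = [
    {"context": "elderly fall discharge",
     "instruction": "Explicitly assess social support and functional status before judging safe discharge."},
    {"context": "wound infection leukocytosis",
     "instruction": "Synthesize lab values and wound findings into one cumulative risk estimate."},
    {"context": "urinary retention",
     "instruction": "Anchor reasoning on documented findings and the physician's actual plan."},
]

case = "Elderly patient seen after a fall, discharge planned"
selected = retrieve_patches(case, dr_kb, k=2)
print(build_prompt(case, selected))
```

Patches with zero overlap are dropped even when fewer than k remain, on the assumption that an irrelevant reminder is worse than no reminder.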

The difference is operationally important:

Approach	What gets retrieved	What problem it mainly solves	Business limitation
Traditional RAG	Documents, guidelines, knowledge snippets	Missing information	May not fix bad reasoning structure
In-context learning	Similar examples	Pattern imitation	Expensive in tokens; examples may be weakly targeted
Fine-tuning	Updated parameters	Broad behavioral adaptation	Costly, slower to govern, harder to audit
DRL	Corrective reasoning patches	Recurring process-level failures	Depends on high-quality reference rationales and reliable graph extraction

That last row is the novelty. DRL is a patch system for the clinical reasoning process, not a bigger filing cabinet.

The strongest evidence appears under domain shift

The paper evaluates DRL on MedQA, MedMCQA, and a Return Visit Admission prediction task, which it calls RVA-QA. The RVA task is based on emergency department notes and asks whether a patient will return to the ED and be admitted within nine days after discharge.

This is a useful test setting because it is closer to institutional clinical reasoning than standard medical multiple-choice benchmarks. It is not just “does the model know the textbook answer?” It asks the model to synthesize messy evidence: comorbidities, symptoms, clinician concern, discharge plan, follow-up instructions, functional status, and the possibility of deterioration after discharge.

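The main result is easy to misread. DRL improves performance on MedQA and MedMCQA, but the improvements are modest: about 2 points over base Qwen3-8B on each, and 2.4 and 4.4 points over base LLaMA-3.1-8B-Instruct. On those two benchmarks, specialized models such as MedReason-8B and HuatuoGPT-o1-8B still match or narrowly exceed DRL(Qwen). The dramatic result is on RVA-QA.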

Method	MedQA accuracy	MedMCQA accuracy	RVA-QA accuracy
DRL(Qwen)	72.20 ± 1.64	64.80 ± 2.17	81.28 ± 0.47
Qwen3-8B	70.20 ± 1.10	62.80 ± 0.84	56.97 ± 0.57
DRL(LLaMA)	53.60 ± 3.36	56.60 ± 3.05	65.23 ± 0.37
LLaMA-3.1-8B-Instruct	51.20 ± 4.09	52.20 ± 0.58	49.91 ± 0.69
MedReason-8B	72.40 ± 0.55	56.80 ± 0.37	53.49 ± 1.43
HuatuoGPT-o1-8B	60.80 ± 1.10	65.00 ± 1.10	49.54 ± 0.99
MedPRM-8B	62.20 ± 0.84	52.20 ± 0.49	47.98 ± 0.87

The headline number is DRL(Qwen) reaching 81.28 ± 0.47 on RVA-QA, compared with 56.97 ± 0.57 for the base Qwen3-8B model. That is a 24.31-point improvement. DRL(LLaMA) also improves over LLaMA-3.1-8B-Instruct by 15.32 points on the same task.
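The arithmetic is easy to check against the table. This short snippet recomputes the gain of each DRL variant over its own base model, using only the mean accuracies reported above (the ± spreads are ignored; the variable and function names are just for illustration).

```python
accuracy = {
    "DRL(Qwen)": {"MedQA": 72.20, "MedMCQA": 64.80, "RVA-QA": 81.28},
    "Qwen3-8B": {"MedQA": 70.20, "MedMCQA": 62.80, "RVA-QA": 56.97},
    "DRL(LLaMA)": {"MedQA": 53.60, "MedMCQA": 56.60, "RVA-QA": 65.23},
    "LLaMA-3.1-8B-Instruct": {"MedQA": 51.20, "MedMCQA": 52.20, "RVA-QA": 49.91},
}

PAIRS = [("DRL(Qwen)", "Qwen3-8B"), ("DRL(LLaMA)", "LLaMA-3.1-8B-Instruct")]


def gains_over_base(drl, base):
    """Percentage-point gain of a DRL variant over its base model, per task."""
    return {
        task: round(accuracy[drl][task] - accuracy[base][task], 2)
        for task in accuracy[drl]
    }


for drl, base in PAIRS:
    print(drl, "vs", base, gains_over_base(drl, base))
```

The RVA-QA gains are roughly an order of magnitude larger than the MedQA and MedMCQA gains for Qwen, which is the asymmetry the rest of this section interprets.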

The interpretation should be careful. This does not prove clinical readiness. It does suggest that discrepancy-mined reasoning patches are particularly useful when the task distribution shifts away from open medical QA into institution-specific, note-heavy prognosis. That is exactly where many healthcare AI deployments live: not in clean benchmark questions, but in local workflows, local documentation styles, and local decision thresholds.

The specialized medical models do not transfer well to RVA-QA in this experiment. That is not shocking. Models fine-tuned on open medical QA may learn benchmark-shaped reasoning. RVA-QA asks for a different kind of reasoning: patient trajectory, disposition risk, and synthesis across messy ED notes. The paper’s result therefore supports a practical claim: for local clinical-agent deployments, targeted reasoning correction may beat generic medical fine-tuning when the deployment task is narrow, messy, and institution-specific.

That sentence is less glamorous than “AI doctor solves medicine.” It is also more useful.

The ablations show this is not just more examples in the prompt

The paper’s ablation results matter because they test the mechanism rather than only the final outcome. On RVA-QA, the authors compare DRL with and without physician-authored rationales, across different retrieval depths. They also compare against an in-context learning baseline using the same training dataset.

RVA-QA setting	Top-1	Top-3	Top-5	Top-10
DRL with physician rationales	74.59 ± 0.47	80.00 ± 0.23	81.28 ± 0.47	81.28 ± 0.53
DRL without physician rationales	72.75 ± 0.43	80.37 ± 0.27	79.63 ± 0.34	80.92 ± 0.47
In-context learning	68.72 ± 0.53	77.80 ± 0.49	74.68 ± 0.47	Not reported due to token limit

These results serve three different purposes.

Test	Likely purpose	What it supports	What it does not prove
With vs. without physician rationales	Ablation	Expert rationales help build stronger reference graphs: the with-rationale variant leads at Top-1 (by 1.84 points), Top-5, and Top-10, though not at Top-3	That physician rationales are always available or consistent
Top-k retrieval depth	Sensitivity / implementation test	Retrieval depth matters; performance improves sharply from Top-1 to Top-3/Top-5	That more retrieved patches are always better in larger settings
DRL vs. ICL	Mechanism comparison	Reasoning patches outperform simply showing examples at the same retrieval depth (Top-1 to Top-5)	That ICL is weak in all clinical tasks

The ICL comparison is especially useful. If DRL were merely “more context,” then retrieved examples should be competitive. They are not: ICL trails both DRL variants at every reported depth, by roughly 2 to 7 points. The likely reason is that DRL retrieves instructions distilled from errors, not full cases that the model must interpret from scratch. This is exactly the kind of evidence one wants when evaluating an applied AI architecture: not just “it works,” but “the proposed mechanism contributes something distinguishable.”

The top-k pattern also deserves interpretation. Moving from Top-1 to Top-3 produces a large gain in every setting. With physician rationales, Top-5 ties Top-10 for the best score (81.28); without them, Top-5 dips slightly below Top-3 and Top-10. Either way, going past Top-5 adds little. That suggests the method benefits from enough coverage to capture relevant failure modes, but eventually approaches a context-budget plateau. In production, that plateau is not a nuisance detail. It is the difference between a useful retrieval layer and a prompt stuffed with so much “guidance” that the model needs a guidance counselor.
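The plateau is easier to see as marginal gains between consecutive retrieval depths. This snippet recomputes them from the mean accuracies in the ablation table; the names are illustrative, and the unreported ICL Top-10 cell is kept as `None` rather than guessed.

```python
TOP_K = [1, 3, 5, 10]

rva_by_depth = {
    "DRL with physician rationales": [74.59, 80.00, 81.28, 81.28],
    "DRL without physician rationales": [72.75, 80.37, 79.63, 80.92],
    "In-context learning": [68.72, 77.80, 74.68, None],  # Top-10 not reported
}


def marginal_gains(scores):
    """Point change between each retrieval depth and the next one."""
    gains = []
    for prev, curr in zip(scores, scores[1:]):
        if prev is None or curr is None:
            gains.append(None)
        else:
            gains.append(round(curr - prev, 2))
    return gains


for setting, scores in rva_by_depth.items():
    steps = [f"Top-{a}->Top-{b}" for a, b in zip(TOP_K, TOP_K[1:])]
    print(setting, dict(zip(steps, marginal_gains(scores))))
```

Most of the benefit arrives in the first step from Top-1 to Top-3; later steps are small and not always positive, which is what a retrieval-depth budget should be tuned around.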

The physician review is face validity, not a clinical trial

The paper includes physician qualitative review of representative RVA-QA cases. This section is valuable, but it should be interpreted with the right label: face validity and error analysis, not outcome validation.

The cases illustrate why process-level correction matters.

In one urinary retention case, the base agent predicted no RVA, and the ground truth was no RVA. The final answer was correct. But the reasoning over-branched, referenced undocumented labs, and introduced speculative complications. DRL identified these weaknesses and generated instructions to anchor reasoning on documented findings and the physician’s actual plan.

That is a subtle but important point: DRL is not only for wrong answers. It can also diagnose wrong or messy reasoning behind right answers. In clinical decision support, this matters because a correct answer with unstable reasoning is not something one should comfortably operationalize. It is a lucky intern with good handwriting.

In a rash and wound infection case, the agent predicted no RVA, but the ground truth was yes. The physician judged risk as moderate to high because of baseline vulnerability, leukocytosis, concerning wound features, and explicit 24–48 hour return precautions. DRL’s generated instruction corrected the model’s under-synthesis of cumulative risk factors. This is the cleanest illustration of the framework’s value: the model had many of the ingredients, but it failed to combine them into the clinically relevant risk picture.

In a fall, rib fracture, and tube bleeding case, the agent predicted RVA correctly, but the clinician noted that the reasoning mixed evidence across concurrent problems and underweighted admission cues such as frailty, cachexia, functional decline, and device-related symptoms. DRL guidance addressed some of these issues. However, the paper also reports an inconsistency: the DRL assessment said the agent failed to predict disposition even though the prediction was correct. Good. That limitation should be visible. A system designed to audit reasoning must itself be auditable when it audits badly. Otherwise we have merely created a second confident narrator.

The physician review therefore supports the paper’s qualitative claim: DRL can produce corrections that clinicians recognize as relevant. It does not prove that DRL improves patient outcomes, reduces adverse events, or is ready for autonomous deployment. The authors do not need to have proven that for the paper to be useful. They just need to show a credible mechanism and early evidence. They mostly do.

The business value is not cheaper training; it is cheaper institutional adaptation

For healthcare AI vendors, the interesting business implication is not that DRL avoids fine-tuning. Avoiding fine-tuning is nice. It saves money, reduces operational burden, and keeps model governance cleaner. But that is not the deeper value.

The deeper value is that DRL offers a structured way to adapt a clinical agent to local reasoning failures.

Hospitals differ. Patient populations differ. ED workflows differ. Documentation habits differ. Admission thresholds differ. A model that performs well on a general medical QA benchmark may still miss the signals that matter in a specific institution’s return-admission workflow. DRL fits this world because it does not assume the base model must internalize everything through parameter updates. It builds an external correction layer from observed discrepancies.

A practical deployment pathway would look like this:

Operational step	What the organization does	Governance advantage
Collect cases	Sample local model outputs, clinician rationales, guidelines, and reviewed outcomes	Creates a controlled improvement dataset
Extract reasoning graphs	Convert reference and agent reasoning into structured representations	Makes reasoning failures inspectable
Compare discrepancies	Identify missing factors, hallucinations, and broken inference paths	Moves review from anecdote to taxonomy
Generate patches	Distill recurring failures into concise instructions	Produces editable, auditable guidance
Retrieve at inference	Inject relevant patches for similar future cases	Adapts behavior without changing model weights
Review patch library	Clinicians approve, edit, retire, or version instructions	Keeps responsibility with accountable humans
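The last row, clinician review of the patch library, is the part most likely to be skipped in a prototype, so here is one hedged sketch of what it could mean in code. The status names, the transition rules, and the `Patch` class are assumptions made for illustration; the paper does not prescribe a lifecycle.

```python
from dataclasses import dataclass, field

# Hypothetical lifecycle: drafts can be approved or retired; approved patches can be retired.
ALLOWED = {"draft": {"approved", "retired"}, "approved": {"retired"}, "retired": set()}


@dataclass
class Patch:
    patch_id: str
    instruction: str
    status: str = "draft"
    version: int = 1
    history: list = field(default_factory=list)  # (version, old_status, new_status, reviewer)

    def transition(self, new_status, reviewer):
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"cannot move from {self.status} to {new_status}")
        self.history.append((self.version, self.status, new_status, reviewer))
        self.status = new_status

    def edit(self, new_instruction, reviewer):
        if self.status == "retired":
            raise ValueError("retired patches cannot be edited")
        self.history.append((self.version, self.status, "draft", reviewer))
        self.instruction = new_instruction
        self.version += 1
        self.status = "draft"  # edited text must be approved again before use


def retrievable(patches):
    """Only clinician-approved patches are eligible for prompt injection."""
    return [p for p in patches if p.status == "approved"]


patch = Patch(
    "geri-fall-001",
    "In geriatric fall cases, explicitly assess social support and "
    "functional status before judging safe discharge.",
)
patch.transition("approved", reviewer="clinician_a")
library = [
    patch,
    Patch("draft-002", "Anchor reasoning on documented findings and the physician's actual plan."),
]
live = retrievable(library)
print([p.patch_id for p in live])
```

The design choice worth keeping, whatever the details, is that an edit sends a patch back to draft: the text the model sees is always text a named reviewer has approved in its current form.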

This is especially relevant for clinical analytics teams building decision-support copilots around high-risk workflows: ED revisits, discharge planning, medication reconciliation, specialist referral triage, or follow-up prioritization. In these settings, the problem is rarely “the model has never heard of sepsis.” The problem is that the model does not reliably weigh the right combination of local cues.

Cognaptus would interpret DRL as part of a broader pattern in enterprise AI: the valuable layer is shifting from general model capability to workflow-specific correction memory. The model supplies language and broad reasoning ability. The institution supplies reviewed discrepancies. The product turns those discrepancies into reusable control logic. Less cinematic than artificial general intelligence. More likely to pass a governance meeting.

Where this approach can break

DRL depends on several components that are themselves fallible.

First, graph extraction is performed by an LLM. If the extractor misclassifies facts, hypotheses, or actions, the downstream discrepancy analysis can become noisy. A patch generated from a bad graph may confidently correct the wrong thing. That is not a minor implementation detail; it is the foundation of the method.

Second, the graph edit distance analysis uses an LLM-as-a-judge for semantic matching and reasonableness checks. This is necessary because clinical language is full of paraphrase and context. But judges can be biased, inconsistent, or overconfident. The paper’s own physician-review section shows that DRL-generated assessments can contain inconsistencies. Any production version would need validation, monitoring, and selective human review for high-stakes patches.

Third, DRL assumes access to good reference reasoning. Physician-authored rationales improve the reference graph, but such rationales are not always available, complete, or consistent. Clinical medicine is not always governed by a single clean gold standard. Guidelines may be incomplete. Physician reasoning may vary. Local practice patterns may encode both expertise and habit. A DR-KB is only as good as the reasoning it learns from. Very profound, yes: if you distill flawed judgment, you get artisanal flawed judgment.

Fourth, the internal RVA-QA setting is promising but bounded. The dataset is constructed from one clinical domain and converted into QA format. The reported physician review is useful, but not large-scale validation. The method should be tested across more institutions, specialties, note types, and prospective workflows before anyone treats it as a general safety solution.

Finally, there is a deployment issue around reasoning traces. The paper uses agent reasoning traces as input to the DRL process. In production, many systems should avoid relying on unfiltered private chain-of-thought as an operational artifact. A safer implementation may use structured rationale outputs, evidence maps, or model-generated reasoning summaries designed for audit, rather than exposing or storing raw hidden reasoning. The paper’s mechanism can survive that adjustment, but the engineering choice matters.

What the paper directly shows, and what business readers should infer

The paper directly shows that DRL can improve accuracy on selected medical QA tasks, with the largest gains on the RVA-QA domain-shift task. It shows that physician rationales and retrieval depth matter. It shows that discrepancy-derived patches outperform a comparable in-context learning setup in the reported RVA-QA ablation. It also shows, through representative clinician review, that some DRL-generated corrections align with clinical critique.

Cognaptus infers a broader business lesson: for enterprise AI, especially in regulated workflows, a model’s error history can become an improvement asset if it is structured properly. The useful unit is not the failed answer. It is the reusable correction extracted from the gap between weak reasoning and better reasoning.

What remains uncertain is equally clear. We do not yet know how DRL scales across hospitals, how stable the patch library remains over time, how often patches conflict, how clinicians should approve them, or how much improvement survives in prospective deployment. We also do not know whether the same approach works equally well outside clinical settings, though the mechanism is obviously tempting for finance, compliance, legal review, and operations. Tempting, of course, is not the same as validated. Product teams occasionally confuse the two, usually right before the pilot becomes “strategic learning.”

The useful future is an auditable correction loop

The best way to read this paper is not as a claim that clinical LLMs can now fix themselves. That framing is too loose and too theatrical.

A better reading is this: clinical agents need a memory not only of facts, but of corrected reasoning failures. DRL proposes a way to build that memory outside the model, in a form that can be inspected, edited, and retrieved. It turns “the model got this wrong” into “the model missed this kind of factor under this kind of context, so future prompts should include this correction.”

That is a practical architecture. It respects the reality that healthcare AI must be improved under governance constraints. It also respects the reality that most deployed AI systems will not be retrained every time a new local failure mode appears. A patchable, auditable, institution-specific reasoning layer is not glamorous. It is merely the kind of thing that can survive contact with procurement, compliance, and clinicians who have seen enough magic demos for one lifetime.

The paper’s core message is therefore simple: mind the gap, but do not just admire it. Convert it into a patch.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jinsong Liu, Yuhang Jiang, Ramayya Krishnan, Rema Padman, Yiye Zhang, and Jiang Bian, “Closing Reasoning Gaps in Clinical Agents with Differential Reasoning Learning,” arXiv:2602.09945, 2026.
