Clinic.

That is where the comforting AI story starts to wobble.

In a benchmark, a clinical model receives a clean question, enough context, and a scoring rule that usually rewards the right answer. In a clinic, the same model sees an elderly patient with multiple conditions, incomplete records, medication changes from years ago, possible specialist involvement, ambiguous prescribing history, and a problem that may not require action at all. The model is not merely being asked, “Can you spot a risk?” It is being asked, “Do you understand whether this risk is real, current, important, and safely actionable?”

That distinction is the useful part of this paper.

In “A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care,” the authors evaluate an LLM-based medication-safety review system on real NHS primary-care electronic health record data, rather than synthetic vignettes or exam-style questions.[1] The headline result looks excellent: 100% sensitivity for detecting patients with clinically significant prescribing issues. The operational result is much less flattering: the system produced a fully correct output in only 46.9% of reviewed patients.

This is not a story about a model that knew nothing. It is a story about a model that often knew enough to be dangerous.

The mechanism is simple, and that is why it matters: detecting that “something may be wrong” is easier than deciding exactly what is wrong, whether intervention is warranted, and how to intervene without creating a new problem. Healthcare AI vendors like to sell the first capability. Clinical workflows need the second.

The paper tests action quality, not just medical trivia

Medication safety is a good stress test for clinical AI because it sits between knowledge and judgment.

A prescribing review is not just a pharmacology quiz. It requires reading medication lists, diagnoses, labs, dates, risk factors, care context, and healthcare-system clues. A drug combination may be unsafe in one patient and intentional in another. A missing medication may be a serious omission, or a deliberate de-intensification in a frail patient. A recommendation to stop a drug may be clinically correct in principle but unsafe if done abruptly.

The authors studied electronic health records from NHS Cheshire and Merseyside, covering 2,125,549 adults. From a 200,000-patient test set, they selected 300 patients across different sources of medication-safety risk and clinical complexity. After exclusions for data quality or insufficient information, 277 patients remained for expert clinician review.

The evaluated system used gpt-oss-120b, a 120-billion parameter model, configured as a medication-safety reviewer. It received structured patient profiles converted into chronological Markdown. It did not use external knowledge sources such as NICE guidelines, the British National Formulary, dm+d, or PubMed. It produced structured outputs: whether intervention was needed, probability score, identified clinical issues, supporting evidence, and proposed intervention.

That design choice is important. This was not an optimized production agent with retrieval, tool use, specialist escalation, and repeated self-checks. It was a deliberately simple single-pass system. That makes the result less like a final product benchmark and more like a behavioral probe: what happens when a strong open model is asked to perform real medication review under realistic data constraints?

The answer: it detects risk impressively, then loses a lot of safety value in the translation from detection to action.

The central failure is the drop from “something is wrong” to “do this safely”

The paper’s evaluation framework is the most important design feature. It separates three levels of performance:

| Evaluation level | Question | Why it matters |
| --- | --- | --- |
| Level 1: Issue identification | Did the system correctly flag that a patient had a clinically significant issue? | This is detection. It supports triage and review prioritisation. |
| Level 2: Issue correctness | If it flagged an issue, did it identify the correct issue or issues? | This is clinical reasoning. The wrong explanation can send the workflow in the wrong direction. |
| Level 3: Intervention appropriateness | If it identified the issue correctly, did it recommend an appropriate intervention? | This is action quality. It is where patient safety actually lives. |

At Level 1, the system looks excellent. Among 206 patients where the clinician found an issue requiring intervention, the system identified all 206. Sensitivity was 100%, with specificity of 83.1% among the 71 patients where the clinician judged no intervention was required. Overall binary accuracy was 95.7%.

A dashboard would love this result. A procurement slide would adore it. A clinician should keep reading.

At Level 2, the picture changes. Of the 206 positive cases, the system correctly identified all issues in 121 cases, or 58.7%. In the remaining 85 cases, it identified some relevant issue but not the complete correct set. Importantly, the model was rarely completely off-track. It was often partly right.

That is precisely the problem. A partly correct clinical recommendation is not merely “some value.” It may be a persuasive incomplete story.

At Level 3, the narrowing continues. Among the 121 cases where the system correctly identified all issues, only 71 had fully appropriate interventions. Another 46 had partially appropriate interventions, and 4 had inappropriate interventions. When true negatives are included, the system produced a fully correct output in 130 of 277 patients: 46.9%.
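
The funnel is easier to audit as arithmetic. The sketch below recomputes each rate from the counts quoted above. The variable names are illustrative, and the true-negative count is not quoted directly in the text: it is derived here as the fully correct total minus the fully appropriate interventions.

```python
# Counts as reported in the article; variable names are illustrative.
positives = 206            # clinician found an issue requiring intervention
negatives = 71             # clinician judged no intervention was required
flagged_positives = 206    # Level 1: positives the system flagged
all_issues_correct = 121   # Level 2: complete and correct issue set
fully_appropriate = 71     # Level 3: fully appropriate intervention
fully_correct_total = 130  # fully correct outputs, true negatives included

reviewed = positives + negatives
# Derived, not quoted: fully correct outputs that were not positive cases.
true_negatives = fully_correct_total - fully_appropriate

rates = {
    "sensitivity": flagged_positives / positives,
    "specificity": true_negatives / negatives,
    "binary accuracy": (flagged_positives + true_negatives) / reviewed,
    "issues fully correct (of positives)": all_issues_correct / positives,
    "intervention fully appropriate (of Level 2)": fully_appropriate / all_issues_correct,
    "fully correct output (of all reviewed)": fully_correct_total / reviewed,
}

for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

Read as a funnel over positive cases only, 71 of 206 ends up fully correct, roughly 34.5%. The 46.9% figure is higher because it also credits the true negatives.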

The practical translation is blunt:

| Metric | What it seems to say | What it actually means for workflow |
| --- | --- | --- |
| 100% sensitivity | The model did not miss positive cases in this sample. | Useful for triage, assuming the same operating conditions hold. |
| 83.1% specificity | The model was not wildly over-triggering. | False positives remain manageable but still require review. |
| 46.9% fully correct output | Less than half of patients received complete issue-plus-intervention correctness. | Not safe as an autonomous recommendation engine. |
| 95.7% binary accuracy | Strong headline performance. | Potentially misleading if used as a deployment-readiness metric. |

This is the paper’s main lesson. Sensitivity is about not missing smoke. Safety is about knowing whether it is a fire, whether it is yours to extinguish, and whether water will make the situation worse.

Apparently that last part is where things get expensive.

Why the model failed: context beat knowledge

The most useful part of the paper is not the headline metric. It is the failure taxonomy.

Across 148 patients with errors, the authors identified 178 distinct failure instances. Only 25 instances, or 14%, were factual errors: hallucinated drug compositions, pharmacological knowledge gaps, or incorrect guideline thresholds. The remaining 153 instances, or 86%, were contextual reasoning failures.

That means the dominant problem was not “the model does not know medicine.” The dominant problem was “the model does not know how to apply medical knowledge under messy clinical context.”

The five failure categories make this clearer.

| Failure reason | Count | What went wrong | Business interpretation |
| --- | --- | --- | --- |
| Overconfidence in uncertainty | 51 | The model acted when it should have gathered more information or consulted a specialist. | Clinical AI needs uncertainty-triggered workflow actions, not just confident text generation. |
| Protocol vs patient gap | 49 | The model applied guidelines without enough adjustment for frailty, palliative care, goals of care, or competing risks. | Guideline compliance is not the same as patient-centered safety. |
| Protocol vs practice gap | 30 | The model misunderstood how UK prescribing and care delivery actually work. | Tacit operational knowledge must be represented somewhere in the system. |
| Coherent but factually incorrect | 25 | The model reasoned fluently from wrong facts, such as drug composition hallucinations or incorrect pharmacology. | Retrieval and drug databases help here, but this is a minority of failures. |
| Process blindness | 23 | The model identified a valid clinical goal but recommended an unsafe path to get there. | Systems must reason about sequencing, transition risk, and shared decision-making. |
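
For readers who want to check the 14% versus 86% split, here is a small sketch that recomputes it from the five category counts. The grouping of exactly one category as "factual" follows the article's own framing; the dictionary and variable names are illustrative.

```python
# Failure instances per category, as listed in the taxonomy above.
failure_counts = {
    "Overconfidence in uncertainty": 51,
    "Protocol vs patient gap": 49,
    "Protocol vs practice gap": 30,
    "Coherent but factually incorrect": 25,
    "Process blindness": 23,
}

# Only this category is treated as a factual error; the rest are contextual.
FACTUAL_CATEGORIES = {"Coherent but factually incorrect"}

total_failures = sum(failure_counts.values())
factual_failures = sum(
    count for name, count in failure_counts.items() if name in FACTUAL_CATEGORIES
)
contextual_failures = total_failures - factual_failures

for name, count in failure_counts.items():
    print(f"{name}: {count} ({count / total_failures:.1%})")

print(f"factual: {factual_failures / total_failures:.0%}")
print(f"contextual: {contextual_failures / total_failures:.0%}")
```

The two largest categories alone account for 100 of the 178 instances, which is why controls aimed only at factual accuracy leave most of the risk in place.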

This distribution is inconvenient for the usual product roadmap.

If factual errors dominated, the fix would be straightforward: add retrieval, connect to drug databases, fine-tune on medical guidelines, and run another benchmark. Some of that is still useful. The paper gives examples where automatic lookup against authoritative medication sources could likely reduce hallucinated drug-composition errors.

But if most failures come from context, then “more knowledge” is not enough. The system needs to decide when available data are insufficient, when a guideline should yield to patient context, when a record is not the same as actual medication exposure, and when the first safe intervention is not a medication change but an information-gathering step.

That is not a database problem. It is a workflow reasoning problem.

Overconfidence under uncertainty: the model preferred action over asking

The largest failure category was overconfidence under uncertainty, with 51 instances.

This is the most commercially relevant failure because it directly clashes with how AI products are often positioned. Users want decisive output. Product demos reward confident recommendations. Interfaces often nudge models toward completion rather than escalation. Clinical safety often requires the opposite.

The paper describes cases where the system acted on historical information without verifying whether it remained current, recommended discontinuing specialist-initiated medications without appropriate consultation, or made medication changes without understanding the original indication.

One vignette involved methotrexate alongside an NSAID. The system recognized a plausible interaction risk and recommended withholding methotrexate. But methotrexate may be specialist-managed, and stopping it without specialist discussion could trigger relapse of a serious condition. The issue was not that the model failed to see risk. It saw risk too eagerly and converted uncertainty into action.

For business use, this suggests a different design requirement:

A safe clinical AI system should not only output recommendations. It should classify what kind of next step is safe.

Possible next steps include: proceed with recommendation, request missing information, check an authoritative protocol, consult specialist records, escalate to a clinician, or mark the case as unsuitable for automated recommendation.
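
One way to picture this is a gate that sits between the model's draft and the user. The following is a minimal sketch, not the paper's method: the `CaseSignals` fields, the rule order, and the function name are all hypothetical, and a real gate would need clinically validated rules.

```python
from dataclasses import dataclass
from enum import Enum


class NextStep(Enum):
    PROCEED = "proceed with recommendation"
    REQUEST_INFO = "request missing information"
    CHECK_PROTOCOL = "check an authoritative protocol"
    CONSULT_SPECIALIST_RECORDS = "consult specialist records"
    ESCALATE = "escalate to a clinician"
    UNSUITABLE = "unsuitable for automated recommendation"


@dataclass
class CaseSignals:
    """Hypothetical flags a workflow might derive for one draft recommendation."""
    record_usable: bool         # enough data to reason about the case at all
    data_is_current: bool       # the triggering information is not stale
    indication_known: bool      # why the drug was started is documented
    specialist_initiated: bool  # the drug is managed by a specialist
    threshold_verified: bool    # any guideline threshold was checked at source
    changes_medication: bool    # the draft starts, stops, or alters a drug


def classify_next_step(signals: CaseSignals) -> NextStep:
    """Pick the safest next step; the rule order is an illustrative assumption."""
    if not signals.record_usable:
        return NextStep.UNSUITABLE
    if not signals.data_is_current or not signals.indication_known:
        return NextStep.REQUEST_INFO
    if signals.specialist_initiated and signals.changes_medication:
        return NextStep.CONSULT_SPECIALIST_RECORDS
    if not signals.threshold_verified:
        return NextStep.CHECK_PROTOCOL
    if signals.changes_medication:
        return NextStep.ESCALATE
    return NextStep.PROCEED


# A case shaped like the methotrexate vignette: specialist-managed drug,
# and the draft recommendation would withhold it.
methotrexate_like = CaseSignals(
    record_usable=True,
    data_is_current=True,
    indication_known=True,
    specialist_initiated=True,
    threshold_verified=True,
    changes_medication=True,
)
print(classify_next_step(methotrexate_like).value)
```

The point of the design is that "proceed" is the last branch reached, not the default. Every earlier branch is a reason to do less.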

This is where agentic design becomes relevant, but not in the Silicon Valley sense of “let the agent do more things.” In clinical AI, agentic capability is valuable only if it lets the system do fewer unsafe things: pause, look up, ask, verify, and defer.

Protocols failed when patients stopped looking like textbook cases

The second major failure group was the protocol-vs-patient gap, with 49 instances.

Here the model applied standard guidelines without adequately adjusting for individual patient context. This is the classic trap of clinical AI: guideline adherence feels objective, but clinical care is full of legitimate exceptions.

The paper gives the example of an 80-year-old woman with coronary heart disease, heart failure, chronic kidney disease, and documented frailty. Several cardiovascular medications had been discontinued months earlier. The system flagged the omissions and recommended restarting aspirin, statin, ACE inhibitor, and beta blockers. For a younger or less frail patient, that might look reasonable. In this case, the clinician judged the recommendation inappropriate, reading the earlier discontinuations as likely advance care planning rather than oversight.

The model was not “anti-medical.” It was too medical in the wrong way. It treated absence from a guideline pathway as a defect to repair, without enough awareness of goals of care, prognosis, burden of treatment, and the possibility that de-intensification was intentional.

For buyers and builders, the lesson is that safety evaluation cannot stop at “does the model know the guideline?” The harder question is: can it identify when the guideline is not the governing logic?

That requires patient-context data that structured EHR exports often do not contain: preferences, adherence, frailty nuance, specialist reasoning, family discussions, end-of-life goals, and the tacit “why” behind prior decisions. The paper’s structured profiles included coded information but no free-text clinical notes. That limitation matters because many of the missing contextual signals live precisely outside neat coded fields.

Practice knowledge is not the same as medical knowledge

The third failure category is especially interesting for enterprise AI: protocol-vs-practice gap.

In 30 instances, the model misunderstood how healthcare delivery works in practice. It misread duplicate prescriptions, misunderstood UK prescribing conventions, or confused prescription records with actual patient exposure.

One example involved ramipril prescribed as 2.5 mg and 1.25 mg concurrently. The model flagged this as duplication and recommended stopping the 1.25 mg tablet. But the combined dose of 3.75 mg may be intentional because it cannot be achieved with a single tablet strength.

This is not exactly pharmacology. It is local operational literacy.
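
The ramipril case can be expressed as a tiny check: before calling two lines of the same drug a duplicate, ask whether their combined dose exists as a single strength. This is a sketch under stated assumptions. The strength list below is an illustrative placeholder, not a lookup against dm+d or the BNF, and the function name and return strings are hypothetical.

```python
from typing import Sequence

# Assumed single-unit strengths in mg, for illustration only. A real system
# would read these from a maintained medicines dictionary.
ASSUMED_STRENGTHS_MG = {"ramipril": (1.25, 2.5, 5.0, 10.0)}


def triage_same_drug_lines(drug: str, doses_mg: Sequence[float]) -> str:
    """Return a review hint for concurrent prescriptions of one drug."""
    if len(doses_mg) < 2:
        return "no concurrent lines"
    strengths = ASSUMED_STRENGTHS_MG.get(drug.lower())
    if strengths is None:
        return "unknown strengths: ask a clinician"
    combined = sum(doses_mg)
    # If one unit could deliver the combined dose, two lines look more suspicious.
    if any(abs(combined - strength) < 1e-9 for strength in strengths):
        return "possible duplication: combined dose exists as a single strength"
    return "possible dose construction: confirm intent before changing"


print(triage_same_drug_lines("ramipril", [2.5, 1.25]))
```

Both outcomes are hints for a human reviewer, not verdicts. Two lines that add up to an available strength can still be intentional, which is exactly the kind of local convention the check cannot know.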

Every enterprise workflow has its equivalent. In finance, a duplicated transaction may be fraud or a settlement convention. In logistics, an odd routing pattern may be waste or a customs workaround. In healthcare, duplicate prescriptions may be unsafe duplication or correct dose construction. The model needs the practice grammar of the institution, not just the official policy manual.

This is why generic “clinical reasoning” claims are too broad. A system can know medical facts and still misunderstand the administrative and operational forms through which medicine is delivered.

For AI implementation, the implication is practical: local workflow knowledge must be modeled, tested, and maintained. It cannot be assumed to emerge automatically from a larger foundation model.

Factual errors were real, but they were not the main event

The paper does not excuse hallucination. It documents it.

In 25 instances, the system produced coherent reasoning from incorrect facts. Some involved hallucinated drug compositions. The system repeatedly misidentified Monomil XL, an isosorbide mononitrate brand, implying in different cases that it contained clopidogrel, was a calcium channel blocker, or was an opioid. Other errors involved misunderstanding systemic risk from topical preparations or applying incorrect threshold values.

These errors matter because fluent clinical prose can make wrong facts look respectable. A bad recommendation with a clear rationale is often more persuasive than a vague one. Wonderful. Exactly what everyone wanted: confidence with footnotes it made up internally.

But the numeric distribution matters. Factual errors were 14% of failure instances. Retrieval, medication dictionaries, and guideline lookup could reduce these failures. They would not solve the other 86%.

That is the procurement lesson: RAG is a necessary control for some risks, not a complete safety architecture.

A better deployment architecture would pair retrieval with context gates. For example:

| Risk type | Likely control | Why it is insufficient alone |
| --- | --- | --- |
| Wrong drug composition | Drug database lookup | Does not decide whether action is appropriate for this patient. |
| Wrong guideline threshold | NICE/BNF retrieval | Does not capture frailty, preferences, or specialist context. |
| Duplicate prescription misread | Local prescribing rules and examples | Still requires distinguishing intentional from accidental duplication. |
| Premature intervention | Uncertainty calibration and escalation rules | Requires workflow design, not just better text generation. |
| Unsafe transition | Sequencing checklist and clinician review | Requires process reasoning beyond endpoint recommendation. |

The practical question is not “Should we use RAG?” It is “Which failure modes does RAG actually cover, and which ones remain exposed?”

Process blindness: correct endpoint, unsafe path

The fifth category, process blindness, is smaller than overconfidence or protocol misapplication but strategically important.

In these cases, the system identified a reasonable clinical endpoint but recommended an unsafe route. It advised starting anticoagulation without prerequisite bleeding-risk assessment. It recommended antihypertensive treatment without confirming hypertension through home or ambulatory monitoring. It advised immediate discontinuation of medications that require tapering. It suggested stopping contraception before arranging an alternative.

This is a different kind of failure from hallucination. The final state may be clinically sensible. The transition is not.

Many enterprise AI systems fail this way. They optimize for the answer, not the pathway. In clinical work, the pathway is part of the answer. A safe plan includes timing, prerequisites, monitoring, substitution, patient consent, and handoff.

For healthcare AI, this means intervention generation should be evaluated as a process plan, not just a recommendation label. “Stop drug X” is not complete if the safe version is “taper drug X over several weeks, monitor withdrawal symptoms, and arrange alternative therapy before discontinuation.”
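
Treating an intervention as an ordered plan makes this testable. The sketch below checks whether each step's prerequisite appears earlier in the plan. The rule table is a hypothetical illustration built from the failure examples above, with placeholder drug names; it is not a clinical rule set.

```python
from typing import Dict, List, Tuple

PlanStep = Tuple[str, str]  # (action, target), e.g. ("stop", "drug_x")

# Hypothetical prerequisite rules: a step on the left is only safe if the
# step on the right appears earlier in the same plan.
PREREQUISITES: Dict[PlanStep, PlanStep] = {
    ("stop", "drug_x"): ("taper", "drug_x"),
    ("start", "anticoagulant"): ("assess", "bleeding_risk"),
    ("start", "antihypertensive"): ("confirm", "hypertension"),
    ("stop", "contraception"): ("arrange", "alternative_contraception"),
}


def missing_prerequisites(plan: List[PlanStep]) -> List[Tuple[PlanStep, PlanStep]]:
    """Return (step, missing prerequisite) pairs, checked in plan order."""
    completed = set()
    problems = []
    for step in plan:
        required = PREREQUISITES.get(step)
        if required is not None and required not in completed:
            problems.append((step, required))
        completed.add(step)
    return problems


abrupt_plan = [("stop", "drug_x")]
staged_plan = [("taper", "drug_x"), ("monitor", "withdrawal"), ("stop", "drug_x")]

print(missing_prerequisites(abrupt_plan))
print(missing_prerequisites(staged_plan))
```

Both plans reach the same endpoint. Only the staged one passes, which is the distinction a recommendation label alone cannot express.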

The paper’s failure examples therefore point toward a more demanding evaluation standard: not only whether the model found the right clinical concern, but whether it understood the order of operations.

The supporting tests mostly reinforce the main story

The paper includes several additional analyses. They are useful, but they do different jobs. Treating all of them as equal “findings” would blur the argument.

| Test or analysis | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Three-level clinician evaluation | Main evidence | Binary detection can hide poor issue-plus-intervention correctness. | It does not establish prospective patient-outcome impact. |
| Failure taxonomy and vignettes | Main explanatory evidence | Most errors were contextual reasoning failures rather than factual gaps. | Counts may shift under different prompts, richer notes, or agentic workflows. |
| Patient complexity analysis | Sensitivity / explanatory analysis | Performance declined as complexity increased, though age, medication count, and comorbidity were intercorrelated. | It does not isolate a single independent causal driver of failure. |
| Self-consistency and anchoring assessment | Robustness / bias check | The system showed substantial output variability; non-blinded clinician review may have inflated agreement. | It does not fully correct the evaluation for anchoring bias. |
| Multi-model comparison | Comparison test | Larger within-architecture model scale and medium reasoning effort performed best in this setup; medical fine-tuning alone did not close the gap. | It does not prove general superiority across all clinical tasks or closed models. |
| Ethnicity counterfactual test | Fairness-oriented sensitivity test | Adding White, Asian, or Black ethnicity labels did not significantly change average performance in this experiment. | It does not establish broad demographic fairness across real-world subgroups or outcomes. |
| Prescribing safety indicator appendix | Implementation detail plus supplementary analysis | Deterministic indicators help sample high-risk cases but do not replace clinician judgment. | It does not provide a complete automated ground truth. |
| Population-level extrapolation | Deployment-facing supplementary estimate | In a random indicator-negative subset, estimated binary performance remained strong. | It still concerns binary flagging, not full action safety. |

The self-consistency finding is especially worth noting. Across repeated runs, the system showed substantial output variability. The LLM-as-judge scorer was much more consistent than the reviewed system itself, suggesting that the variation was not merely evaluation noise. The model could give meaningfully different medication-safety assessments for the same patient profile.

That matters operationally. If repeated runs produce different outputs, “the model’s recommendation” is not a stable object. It is a sample from a distribution. Clinical governance must decide whether to average, vote, escalate disagreement, or treat inconsistency as a risk signal.
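
A minimal sketch of the "treat inconsistency as a risk signal" option follows. It assumes each run yields a yes/no intervention flag. The 0.8 agreement threshold and the function name are arbitrary illustrations, not values from the paper.

```python
from collections import Counter
from typing import Dict, Sequence, Union


def summarize_runs(
    intervention_flags: Sequence[bool],
    min_agreement: float = 0.8,  # illustrative threshold, not from the paper
) -> Dict[str, Union[str, float]]:
    """Collapse repeated runs on one patient into a decision or an escalation."""
    if not intervention_flags:
        raise ValueError("at least one run is required")
    counts = Counter(intervention_flags)
    majority_flag, majority_count = counts.most_common(1)[0]
    agreement = majority_count / len(intervention_flags)
    if agreement < min_agreement:
        # Disagreement across runs is itself routed to a human.
        decision = "escalate_disagreement"
    elif majority_flag:
        decision = "intervention_needed"
    else:
        decision = "no_intervention"
    return {"decision": decision, "agreement": agreement}


print(summarize_runs([True, True, True, True, True]))
print(summarize_runs([True, True, False, True, False]))
```

This only covers the binary flag. Comparing the identified issues and proposed interventions across runs would need a richer notion of agreement, which is where most of the variability is likely to matter.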

The model comparison is also useful, but should not be over-read. The best-performing configuration was gpt-oss-120b with medium reasoning effort. The smaller gpt-oss-20b performed substantially worse. Gemma-family models performed worse still, although medical fine-tuning improved Gemma relative to its base version. Higher reasoning effort did not monotonically improve performance; in this setup it increased false positives and reduced overall score relative to medium effort.

The lesson is not “always use this model.” The lesson is that model choice, scale, reasoning configuration, and workflow scaffolding interact. Procurement based on a single medical benchmark is, technically speaking, vibes with paperwork.

Ground truth was itself a clinical problem

One of the paper’s most business-relevant appendices concerns ground truth.

Before relying on clinician review, the authors investigated whether structured EHR codes could provide reliable labels. They found they could not. Medication change rates after structured medication reviews were nearly identical whether codes indicated an issue had been found or not: 30.8% versus 30.2% within three months. Even prescribing safety indicators showed 30.1% disagreement with clinician judgment among reviewed indicator-positive cases.

This is not a small methodological footnote. It is a warning about evaluation infrastructure.

Many healthcare AI products will be tempted to validate themselves cheaply using existing structured codes, downstream medication changes, or rule-based proxies. The paper shows why that can fail. Medication safety is contextual. A code may not capture why a decision was made. A medication change may not indicate whether the original issue was correctly identified. A deterministic indicator may flag a real complexity signal even when the specific rule does not warrant intervention.

For business leaders, the uncomfortable implication is that real clinical AI validation requires expert review. It may also require access to richer context: notes, preferences, adherence, secondary-care information, care goals, and local workflow knowledge. That makes evaluation slower and more expensive. It also makes it more honest.

There is no free A/B test hidden inside the EHR. Annoying, but reality often is.

What healthcare AI buyers should infer — and what they should not

The paper directly shows four things.

First, a strong LLM system can perform very well at binary detection of medication-safety issues in a real NHS primary-care sample.

Second, that same system can fail to produce fully correct issue-plus-intervention outputs in more than half of reviewed patients.

Third, the dominant failure mechanism was contextual reasoning, not missing medical knowledge.

Fourth, structured EHR data alone were insufficient to create reliable ground truth or provide all context needed for safe judgment.

Cognaptus would infer several business implications from this, but these are inferences, not direct clinical trial outcomes.

For triage, the system looks promising. A tool that flags patients for review with high sensitivity and strong positive predictive value could help prioritise scarce pharmacist or clinician attention. In the population-level extrapolation appendix, binary estimates remained strong in the random indicator-negative subset. That supports the idea that LLMs may help find cases worth human review.

For autonomous recommendation, the evidence is not supportive. The fully correct output rate is too low, the failures are too contextual, and the system’s single-pass design lacks the ability to request missing information or verify local practice assumptions.

For product design, the next step is not merely “add a bigger medical model.” A safer architecture would likely need:

  1. authoritative medication and guideline lookup;
  2. uncertainty detection that triggers information gathering;
  3. local prescribing-practice knowledge;
  4. explicit sequencing checks for medication transitions;
  5. self-consistency or disagreement detection;
  6. clinician-in-the-loop review for action recommendations;
  7. evaluation metrics that separate detection, issue correctness, and intervention appropriateness.

For governance, the evaluation standard should shift from “Does it find risks?” to “What decisions is the system allowed to make after it finds them?”

That distinction determines the product category. A triage assistant is one thing. A recommendation engine is another. A semi-autonomous medication reviewer is something else entirely, and should be regulated, evaluated, and insured accordingly.

Boundaries: where this result applies, and where it should not be stretched

The study is unusually valuable because it uses real clinical data and detailed failure analysis, but its boundaries are clear.

The data came from one NHS Integrated Care Board and used structured EHR profiles without free-text notes or complete secondary-care records. That likely made some contextual reasoning harder. A system with notes, specialist letters, patient preferences, adherence information, and richer care-plan context might perform differently.

The evaluation used one experienced clinician, and the clinician reviewed the system output rather than independently assessing each case first. The authors acknowledge that this non-blinded design may have introduced anchoring bias and inflated agreement. Their self-consistency analysis suggests the observed binary agreement may indeed be higher than the model’s own run-to-run stability would allow, which is what anchoring would predict.

The system was deliberately simple: single-pass inference, no external knowledge tools, no self-consistency checks, no iterative prompt refinement, no agentic information-gathering. That is a limitation if the question is “what is the best possible clinical AI workflow?” It is a strength if the question is “what failure behaviors appear before we hide them under scaffolding?”

The task was medication safety review, mostly involving chronic prescribing decisions. The harm profile may differ from emergency prescribing, acute diagnosis, or triage. In this study, most failures were classified as no harm or mild harm if implemented without review; 7.5% were moderate, one was potentially severe, and none were classified as likely to cause or accelerate death. That should not be generalized to all clinical AI settings.

Finally, the model comparison was limited to the evaluated open-model configurations and this specific pipeline. It does not establish universal model rankings. It does, however, warn against assuming that medical fine-tuning or higher reasoning effort automatically solves real-world clinical judgment.

The real product is not the model; it is the safety workflow

The cleanest way to read this paper is not “LLMs are unsafe in healthcare.” That is too lazy. The system did something genuinely useful: it found every clinician-confirmed positive case in this sample. For overloaded health systems, that is not trivial.

The better reading is more specific: LLMs may be good at surfacing medication-safety concerns before they are good at safely resolving them.

That should change product design.

A clinical AI system should not be evaluated as a single text box that emits advice. It should be evaluated as a workflow component with permission boundaries. What can it flag? What can it recommend? What must it verify? When must it ask for more information? When must it stop? When should disagreement across repeated runs become a safety signal? When should a clinician be required before any action reaches the patient record?

The paper’s 46.9% fully correct rate is not just a metric. It is a map of the missing middle between benchmark competence and clinical deployment.

Detection is cheap. Judgment is not. And in healthcare, the expensive part is usually where the patient is.

Cognaptus: Automate the Present, Incubate the Future.


  1. Oliver Normand, Esther Borsi, Mitch Fruin, Lauren E. Walker, Jamie Heagerty, Chris C. Holmes, Anthony J. Avery, Iain E. Buchan, and Harry Coppock, “A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care,” arXiv:2512.21127, 2025, https://arxiv.org/abs/2512.21127