Opening — Why this matters now
Healthcare AI has entered its most dangerous phase: the era where models look good enough to trust. Clinician‑level benchmark scores are routinely advertised, pilots are quietly expanding, and decision‑support tools are inching closer to unsupervised use. Yet beneath the reassuring metrics lies an uncomfortable truth — high accuracy does not equal safe reasoning.
This paper delivers a rare and sobering reality check. Instead of synthetic vignettes or exam‑style questions, it evaluates a large language model performing medication safety reviews on real NHS primary care data. The results are impressive, unsettling, and — most importantly — instructive.
Background — From benchmarks to bedside
Medication safety is not a niche problem. It is one of the largest sources of preventable harm in modern healthcare, particularly in ageing populations with multimorbidity and polypharmacy. Structured medication reviews help, but the NHS lacks the capacity to perform them at scale.
LLMs seem like an obvious solution. Prior studies show models matching or exceeding clinicians on medical exams, diagnostic tasks, and curated datasets. However, most of that evidence rests on controlled, well‑posed problems. Real clinical work is neither.
This study targets the evaluation gap directly: What happens when an LLM is dropped into messy, longitudinal, incomplete electronic health records — and asked to make safety‑critical judgments?
Analysis — What the study actually did
The authors evaluated an LLM‑based medication safety review system on real NHS primary care data covering over 2.1 million adults. From this population, they carefully sampled 277 patients across a spectrum of clinical complexity.
Key design choices matter here:
- Real EHR data — no synthetic cases, no exam questions
- Clinician‑graded outputs, not automated scoring alone
- Hierarchical evaluation, separating detection from reasoning and action
The three‑level evaluation framework
| Level | Question being asked | Why it matters |
|---|---|---|
| Level 1 | Is any safety issue present? | Pure detection ability |
| Level 2 | Are the right issues identified? | Clinical reasoning |
| Level 3 | Is the proposed intervention appropriate? | Real‑world safety |
This structure exposes a critical illusion: strong top‑line accuracy can coexist with deeply flawed decision‑making downstream.
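To make the hierarchy concrete, here is a minimal sketch of how such grading could be encoded; the field names and pass rules are our assumptions for illustration, not the authors' actual protocol.

```python
from dataclasses import dataclass

# Field names and pass rules are illustrative assumptions, not the authors' protocol.

@dataclass
class ReviewOutput:
    flagged_issue: bool        # Level 1: did the system flag any safety issue?
    issues: frozenset          # Level 2: which specific issues did it name?
    intervention_safe: bool    # Level 3: did clinicians judge the proposed action safe?

@dataclass
class ClinicianReference:
    has_issue: bool            # ground truth: a genuine safety issue exists
    issues: frozenset          # the issues the reviewing clinicians identified

def grade(output: ReviewOutput, ref: ClinicianReference) -> dict:
    """Grade one patient hierarchically: each level counts only if the one above passed."""
    detection = output.flagged_issue == ref.has_issue
    reasoning = detection and ref.has_issue and output.issues == ref.issues
    action = reasoning and output.intervention_safe
    return {"level_1_detection": detection,
            "level_2_reasoning": reasoning,
            "level_3_action": action}
```

Under a scheme like this, a headline sensitivity figure summarises only the first key; a patient counts as fully correct only when all three pass.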
Findings — Strong detection, brittle judgment
At first glance, the system looks excellent.
- Sensitivity: 100% — it never missed a patient with a genuine safety issue
- Specificity: 83% — reasonable restraint in low‑risk cases
But the illusion collapses quickly.
Only 46.9% of patients received a fully correct output — meaning correct issue identification and a safe, appropriate intervention.
In other words: the model almost always sensed that something was wrong, but frequently failed to understand what was wrong or what to do about it.
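To see how both claims can be true at once, it helps to put rough numbers on them. The counts below are back-calculated from the reported percentages and patient total, so treat them as approximations rather than published figures.

```python
# Back-calculated from the reported rates; approximations, not published counts.
n_patients = 277
fully_correct_rate = 0.469      # correct issues AND a safe, appropriate intervention

fully_correct = round(fully_correct_rate * n_patients)   # ~130 patients
not_fully_correct = n_patients - fully_correct           # ~147, in line with the 148
                                                         # patients in whom clinicians
                                                         # later found failures

print(f"Perfect sensitivity, yet only {fully_correct}/{n_patients} fully correct reviews.")
```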
Where things went wrong
Across 148 patients, clinicians identified 178 distinct failure instances. These failures cluster into a striking pattern.
| Failure category | Share of failures | What it reveals |
|---|---|---|
| Contextual reasoning failures | ~86% | The dominant problem |
| Factual / knowledge errors | ~14% | A minority issue |
This alone overturns a common industry assumption: that better retrieval, bigger knowledge bases, or more medical fine‑tuning will solve clinical AI safety.
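In absolute terms, the imbalance is roughly six to one; the counts below are back-calculated from the reported shares and therefore approximate.

```python
# Absolute counts implied by the reported shares of 178 failure instances
# (back-calculated approximations, not published figures).
total_failures = 178
contextual = round(0.86 * total_failures)   # ~153 contextual-reasoning failures
factual = total_failures - contextual       # ~25 factual / knowledge errors
print(contextual, factual)                  # roughly a six-to-one imbalance
```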
The five failure patterns that matter
The paper identifies five recurring failure modes. None are exotic. All are deeply human — and deeply dangerous when automated.
1. Overconfidence under uncertainty
The system often acted decisively when it should have paused. Missing information did not trigger clarification; it triggered action.
Typical errors included (a guard-rail sketch follows this list):
- Changing specialist‑initiated medications without consultation
- Acting on outdated historical data
- Failing to request confirmation before irreversible changes
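A plausible mitigation is a hard gate in front of any proposed change: no action unless the relevant context exists, is current, and is the model's to act on. The sketch below only illustrates the idea; the required fields, the 12-month freshness window, and the outcome strings are assumptions for this example, not part of the evaluated system.

```python
from datetime import date, timedelta

# Illustrative guard-rail only: the required fields, the 12-month freshness window,
# and the outcome strings are assumptions for this sketch, not the evaluated system.

REQUIRED_CONTEXT = ("indication", "initiating_clinician", "latest_renal_function")
MAX_DATA_AGE = timedelta(days=365)

def safe_to_act(record: dict, today: date) -> str:
    """Gate any proposed change: act only when context is complete, current, and ours to change."""
    missing = [field for field in REQUIRED_CONTEXT if record.get(field) is None]
    if missing:
        return "REQUEST_INFO: " + ", ".join(missing)

    last_reviewed = record.get("last_reviewed")
    if last_reviewed is None or today - last_reviewed > MAX_DATA_AGE:
        return "REQUEST_INFO: confirm current status; the record may be outdated"

    if record.get("specialist_initiated"):
        return "ESCALATE: consult the initiating specialist before any change"

    return "PROPOSE_CHANGE"
```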
2. Protocol over patient
Guidelines were applied mechanically, without adjusting for:
- Frailty
- Palliative care goals
- Competing risks
In several cases, the model attempted to “fix” patients whose treatment plans were intentionally de‑intensified for end‑of‑life care.
3. Protocol over practice
The model understood documentation better than it understood how healthcare actually works.
Examples:
- Treating intentional split‑dose prescriptions as errors
- Misreading supply‑chain substitutions as duplicate therapy
- Confusing prescription records with actual medication exposure
This is tacit system knowledge — rarely written down, never benchmarked.
4. Coherent but wrong
Some outputs were beautifully reasoned — and factually incorrect.
Hallucinated drug compositions and misapplied pharmacology appeared, but importantly, these were not the dominant failure mode.
5. Process blindness
Even when the model identified the correct clinical goal, it often proposed unsafe paths to get there:
- Abrupt medication cessation without tapering
- Starting therapies without prerequisite risk assessment
- Removing treatment before arranging a safe alternative
The model optimized endpoints, not transitions.
Implications — What this means for AI deployment
Three uncomfortable conclusions emerge.
1. Benchmarks are not safety tests
Passing medical exams proves knowledge. It does not prove judgment. The gap between detection and safe action is where real harm lives.
2. RAG will not save you
Most failures were not caused by missing information but by misapplied information; retrieval-augmented generation fixes the former, not the latter. More documents will not teach models when not to act.
3. Clinical AI must become agentic — carefully
Safe systems will need the ability to:
- Recognize uncertainty
- Request additional information
- Defer decisions
- Escalate rather than intervene
Ironically, this makes them slower, more cautious, and less “decisive” — traits current product incentives actively discourage.
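In code terms, the abilities listed above amount to widening the output space beyond "intervene". The sketch below shows one way to express that; the action names, confidence score, and threshold values are illustrative assumptions, not the paper's design.

```python
from enum import Enum

class Action(Enum):
    PROPOSE_INTERVENTION = "propose_intervention"
    REQUEST_INFO = "request_info"
    DEFER = "defer"
    ESCALATE = "escalate"

# Thresholds are placeholders; in practice they would be set and audited clinically.
CONFIDENCE_TO_ACT = 0.9
CONFIDENCE_TO_DEFER = 0.5

def choose_action(confidence: float, missing_context: bool, high_risk_change: bool) -> Action:
    """Prefer asking, deferring, or escalating over intervening whenever unsure."""
    if missing_context:
        return Action.REQUEST_INFO
    if high_risk_change:
        # e.g. specialist-initiated drugs or intentionally de-intensified care plans
        return Action.ESCALATE
    if confidence >= CONFIDENCE_TO_ACT:
        return Action.PROPOSE_INTERVENTION
    if confidence >= CONFIDENCE_TO_DEFER:
        # queue for a human-led structured medication review instead of acting now
        return Action.DEFER
    return Action.ESCALATE
```

Note the asymmetry: three of the four outcomes mean asking, waiting, or handing off, which is exactly the kind of "indecisiveness" that current product incentives penalise.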
Conclusion — Detection is cheap, judgment is not
This study delivers a necessary correction to the current AI narrative in healthcare. High sensitivity is easy. Contextual judgment is hard. Translating knowledge into safe action remains the unsolved problem.
Until evaluation frameworks reflect that reality, claims of clinician‑level AI should be treated with the appropriate clinical response: cautious skepticism.
Cognaptus: Automate the Present, Incubate the Future.