When 100% Sensitivity Isn’t Safety: How LLMs Fail in Real Clinical Work

Opening — Why this matters now
Healthcare AI has entered its most dangerous phase: the era where models look good enough to trust. Clinician‑level benchmark scores are routinely advertised, pilots are quietly expanding, and decision‑support tools are inching closer to unsupervised use. Yet beneath the reassuring metrics lies an uncomfortable truth — high accuracy does not equal safe reasoning. ...

December 25, 2025 · 5 min · Zelina

When Benchmarks Rot: Why Static ‘Gold Labels’ Are a Clinical Liability

Opening — Why this matters now
Clinical AI has entered an uncomfortable phase of maturity. Models are no longer failing loudly; they are failing quietly. They produce fluent answers, pass public benchmarks, and even outperform physicians on narrowly defined tasks — until you look closely at what those benchmarks are actually measuring. The paper at hand dissects one such case: MedCalc-Bench, the de facto evaluation standard for automated medical risk-score computation. The uncomfortable conclusion is simple: when benchmarks are treated as static truth, they slowly drift away from clinical reality — and when those same labels are reused as reinforcement-learning rewards, that drift actively teaches models the wrong thing. ...

December 23, 2025 · 4 min · Zelina

Knows the Facts, Misses the Plot: LLMs’ Knowledge–Reasoning Split in Clinical NLI

The gist
A new clinical natural language inference (NLI) benchmark isolates what models know from how they reason — and the results are stark. State‑of‑the‑art LLMs ace targeted fact checks (≈92% accuracy) but crater on the actual reasoning tasks (≈25% accuracy). The collapse is most extreme in compositional grounding (≈4% accuracy), where a claim depends on multiple interacting clinical constraints (e.g., drug × dose × diagnosis × schedule). Scaling yielded fluent prose, not reliable inference. ...

August 18, 2025 · 4 min · Zelina