When 100% Sensitivity Isn’t Safety: How LLMs Fail in Real Clinical Work
Clinic. That is where the comforting AI story starts to wobble. In a benchmark, a clinical model receives a clean question, enough context, and a scoring rule that usually rewards the right answer. In a clinic, the same model sees an elderly patient with multiple conditions, incomplete records, medication changes from years ago, possible specialist involvement, ambiguous prescribing history, and a problem that may not require action at all. The model is not merely being asked, “Can you spot a risk?” It is being asked, “Do you understand whether this risk is real, current, important, and safely actionable?” ...