Cover image

Twin Peaks: When Alzheimer’s AI Learns to Remember What Clinics Forget

Opening — Why this matters now Healthcare AI has spent years trying to look impressive in carefully lit laboratory conditions. Alzheimer’s disease, with its irregular follow-ups, missing scans, incomplete biomarkers, and deeply uneven patient trajectories, is less polite. It is not a clean benchmark. It is a bureaucracy of biology. That is why the arXiv paper “CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer’s Disease” deserves attention.1 It does not merely ask whether a model can classify Alzheimer’s disease from a snapshot. That problem is already crowded, noisy, and occasionally dressed up as clinical transformation. Instead, the paper asks a harder and more operationally relevant question: can an AI system model an individual patient’s cognitive trajectory over time, using fragmented clinical evidence, while remaining accurate, calibrated, and fair across demographic groups? ...

April 29, 2026 · 12 min · Zelina
Cover image

When 100% Sensitivity Isn’t Safety: How LLMs Fail in Real Clinical Work

Opening — Why this matters now Healthcare AI has entered its most dangerous phase: the era where models look good enough to trust. Clinician‑level benchmark scores are routinely advertised, pilots are quietly expanding, and decision‑support tools are inching closer to unsupervised use. Yet beneath the reassuring metrics lies an uncomfortable truth — high accuracy does not equal safe reasoning. ...

December 25, 2025 · 5 min · Zelina
Cover image

When Benchmarks Rot: Why Static ‘Gold Labels’ Are a Clinical Liability

Opening — Why this matters now Clinical AI has entered an uncomfortable phase of maturity. Models are no longer failing loudly; they are failing quietly. They produce fluent answers, pass public benchmarks, and even outperform physicians on narrowly defined tasks — until you look closely at what those benchmarks are actually measuring. The paper at hand dissects one such case: MedCalc-Bench, the de‑facto evaluation standard for automated medical risk-score computation. The uncomfortable conclusion is simple: when benchmarks are treated as static truth, they slowly drift away from clinical reality — and when those same labels are reused as reinforcement-learning rewards, that drift actively teaches models the wrong thing. ...

December 23, 2025 · 4 min · Zelina
Cover image

Knows the Facts, Misses the Plot: LLMs’ Knowledge–Reasoning Split in Clinical NLI

The gist A new clinical natural language inference (NLI) benchmark isolates what models know from how they reason—and the results are stark. State‑of‑the‑art LLMs ace targeted fact checks (≈92% accuracy) but crater on the actual reasoning tasks (≈25% accuracy). The collapse is most extreme in compositional grounding (≈4% accuracy), where a claim depends on multiple interacting clinical constraints (e.g., drug × dose × diagnosis × schedule). Scaling yielded fluent prose, not reliable inference. ...

August 18, 2025 · 4 min · Zelina