Clinical Decision Support

The Mask Is Not the Model: MMIR-TCM Makes Clinical Memory Inspectable

TL;DR for operators: How should a clinical AI system move from a noisy image to a recommendation without hiding every judgment inside one model? The practical answer is to separate image standardization, structured interpretation, retrieval, and recommendation generation so each stage can be inspected, corrected, and validated independently. MMIR-TCM’s strongest evidence comes from removing those supports one at a time. Removing clinical-case memory produced the largest overall loss in prescription reasoning. Formal diagnostic-theory memory mattered most for syndrome differentiation, while removing tongue findings particularly weakened prescription generation. By contrast, tongue segmentation, the architecture’s most visible component, improved semantic performance only modestly. ...

Calibrated Confidence: When AI Learns to Doubt Itself (Just Enough)

A doctor does not need an assistant that sounds certain all the time. That is just an intern with better typography. What the doctor needs is narrower and more useful: an assistant that knows when its answer deserves a second look. In high-stakes work, the confidence attached to an answer is not decoration. It is workflow metadata. It tells the system whether to proceed, pause, escalate, or ask someone with a license and malpractice insurance. ...
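To make "confidence as workflow metadata" concrete, here is a minimal sketch of a router that turns a confidence score into a next step. The function name `route_by_confidence` and both thresholds are hypothetical, chosen only for illustration; they do not come from the article, and a real deployment would set cut-offs from calibration data.

```python
def route_by_confidence(confidence: float,
                        proceed_at: float = 0.90,
                        pause_at: float = 0.60) -> str:
    """Map a confidence score in [0, 1] to a workflow action.

    The thresholds are illustrative assumptions, not values from the article.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= proceed_at:
        return "proceed"    # answer is used as-is
    if confidence >= pause_at:
        return "pause"      # answer is held for a second look
    return "escalate"       # answer goes to a human reviewer


if __name__ == "__main__":
    for score in (0.95, 0.70, 0.30):
        print(score, route_by_confidence(score))
```

The routing only helps if the score is calibrated: a model that reports 0.95 on everything would send every case down the "proceed" branch.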

Too Many Doctors in the Room? Benchmarking the Rise of Medical AI Agent Teams

Doctors know the problem. A difficult case enters the room. One specialist sees a radiology pattern. Another notices a metabolic clue. A third worries about a rare diagnosis. Everyone has a useful fragment. Then the meeting gets longer, the notes get messier, and somehow the final answer becomes less clear than the first opinion. ...

Triage by Token: When Context Clues Quietly Override Clinical Judgment

A patient walks into an emergency department. Or arrives by ambulance. Or lives far from the hospital. Or has private insurance. Or has missed prior appointments. Clinically, those details may be background noise. In triage, the core question is supposed to be sharper: how sick is this patient, how urgent is the risk, and what resources are likely needed? The Emergency Severity Index, or ESI, is not a lifestyle quiz with a stethoscope attached. ...

Doctor GPT, But Make It Explainable

Triage begins with messy language. A patient does not usually arrive as a clean feature vector. They arrive with “I feel tired,” “my stomach is strange,” “I have fever but not always,” or the classic: “I searched online and now I am either fine or dying.” Traditional diagnostic models are not built for this level of human poetry. They prefer structured fields, stable vocabularies, and the fantasy that symptoms behave like dropdown menus. ...

Dreams Decoded: When Vision–Language Models Learn to Read Your Brain Waves

Sleep looks simple until someone has to label it. A patient lies still. Sensors record electrical activity. The night becomes a long strip of waveforms. Then a sleep technologist, following clinical scoring rules, breaks the record into 30-second epochs and assigns stages: Wake, N1, N2, N3, REM. That sounds mechanical. It is not. N1 can look annoyingly close to REM. Wake can share alpha activity with early sleep. Signals are noisy. Humans disagree. Machines, when handed the wrong representation, fail with impressive confidence. Very on brand. ...
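The epoching step itself is the easy part, and a short sketch shows why the difficulty lies in the labels rather than the slicing. The function name `split_into_epochs` and the 100 Hz sampling rate in the usage example are assumptions for illustration; only the 30-second epoch length and the five stage names come from the text above.

```python
from typing import List, Sequence

STAGES = ("Wake", "N1", "N2", "N3", "REM")  # stage labels named in the text
EPOCH_SECONDS = 30                           # epoch length named in the text


def split_into_epochs(signal: Sequence[float],
                      sampling_rate_hz: int) -> List[Sequence[float]]:
    """Cut a single-channel recording into consecutive 30-second epochs.

    A trailing partial epoch is dropped (a simplifying assumption).
    """
    if sampling_rate_hz <= 0:
        raise ValueError("sampling_rate_hz must be positive")
    samples_per_epoch = EPOCH_SECONDS * sampling_rate_hz
    n_full = len(signal) // samples_per_epoch
    return [signal[i * samples_per_epoch:(i + 1) * samples_per_epoch]
            for i in range(n_full)]


if __name__ == "__main__":
    # 95 seconds of dummy samples at an assumed 100 Hz.
    recording = [0.0] * (95 * 100)
    epochs = split_into_epochs(recording, sampling_rate_hz=100)
    print(len(epochs), len(epochs[0]))
```

Each resulting epoch would then receive one label from `STAGES`, which is where human disagreement and model error enter.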

Graph Medicine: When RAG Stops Guessing and Starts Diagnosing

Hospitals do not suffer from a shortage of medical text. They suffer from a shortage of medical text that machines can use without becoming dangerously imaginative. Clinical guidelines are full of thresholds, exceptions, disease associations, diagnostic pathways, and terminology that looks tidy only until someone tries to automate it. A guideline may say one thing about a biomarker in the context of cardiovascular risk, another in renal disease, and something subtly different when age, sex, postoperative status, or treatment history enters the room. This is exactly the sort of nuance that makes large language models useful—and also exactly the sort of nuance that makes them risky. ...

Charting a Better Bedside: When Agentic RL Teaches RAG to Diagnose

TL;DR for operators: Diagnosis is not a search-box problem. A clinician does not simply type a symptom list, read a guideline, and pick a disease like ordering takeaway. The useful work is iterative: form a hypothesis, compare against similar cases, notice what does not fit, retrieve again, ignore plausible-looking rubbish, and only then commit. ...