A patient walks into a clinic and tells the doctor several things at once: chest tightness, shortness of breath, leg swelling, leg pain, maybe a history of walking too much, maybe some anxiety, maybe something that sounds more obviously cardiac. The dangerous part is not the word “chest.” The dangerous part is the chain: leg swelling and pain may suggest deep vein thrombosis; shortness of breath may suggest pulmonary embolism; pulmonary embolism can kill.
In the paper behind today’s article, GPT-4o missed that chain in a simulated emergency-style prompt. It prioritized the more obvious chest symptoms and failed the DVT-to-PE risk assessment, receiving the worst score on that probe.[1] That single miss is the right place to begin, because the paper is not mainly about whether Qwen3-Max beats GPT-4o on a small benchmark. Leaderboards are the snack food of AI commentary: instantly satisfying, nutritionally suspicious.
The real question is harder: can an LLM handle the messy first layer of medicine, before the patient has been converted into a clean case summary?
That layer is where patients ramble, understate, exaggerate, contradict themselves, hide fear inside jokes, mention crucial symptoms as side comments, and describe serious disease in language that would make a medical ontology quietly resign. The paper calls the resulting failure pattern AI-MASLD: AI-Metabolic Dysfunction-Associated Steatotic Liver Disease. The metaphor is theatrical, but the problem is practical. When a model is fed too much irrelevant, emotionally noisy, contradictory, or poorly ordered clinical information, it may accumulate “information fat”: redundant summaries, wrong priorities, hallucinated symptoms, and missed red flags.
This is not a cute pathology. It is an intake safety problem.
The patient narrative is the real input, not the polished medical case
Most optimistic claims about medical LLMs are built on cleaned inputs. A structured exam question says what matters. A doctor-written case report has already performed the first act of clinical intelligence: filtering the patient’s messy story into medically useful form. An electronic health record may be noisy, yes, but it is still a professional artifact, not a frightened person talking in circles.
The paper targets the gap between those two worlds. It evaluates four mainstream LLMs — GPT-4o, Gemini 2.5, DeepSeek 3.1, and Qwen3-Max — using twenty clinician-designed “medical probes.” These probes are grouped into five dimensions:
| Probe dimension | What it tests | Why it matters in clinical AI |
|---|---|---|
| Noise filtering | Extracting core medical facts from irrelevant life details | The model must find signal before diagnosis can even begin |
| Priority triage | Identifying the most urgent risk among distractors | Patient-facing tools must not chase the loudest symptom |
| Contradiction detection | Finding conflicts between patient beliefs and clinical warning signs | Patients often misread their own severity |
| Fact-emotion separation | Separating objective symptoms from anxiety, metaphor, complaint, or fear | Clinicians need usable facts, not emotional sludge |
| Timeline sorting | Reconstructing disease progression from nonlinear narratives | Temporal order changes diagnostic interpretation |
The setup is intentionally narrow: pure text, simulated patient narratives, clinician-defined gold standards, and a 0–4 inverse scoring scale where lower is better. A perfect model across twenty probes would score 0; the worst possible total would be 80.
The result: no model looked clean.
Qwen3-Max performed best with 16/80, followed by DeepSeek 3.1 at 23/80, GPT-4o at 27/80, and Gemini 2.5 at 32/80. The ranking matters, but the score pattern matters more. The models did not fail evenly. They failed in shapes.
The DVT-to-PE miss shows why triage is not keyword matching
The most business-relevant failure is the priority-triage probe involving DVT and pulmonary embolism. The gold-standard judgment was that leg swelling and pain should be treated as the priority because of the possible DVT-to-PE cascade. Chest tightness and shortness of breath were not irrelevant, but the clinical danger lay in linking them to the leg symptoms.
GPT-4o received a 4, the worst score. Gemini 2.5 also received a 4. DeepSeek 3.1 received a 1. Qwen3-Max received a 0.
This is the kind of result that breaks lazy procurement logic. In many organizations, “best general model” quietly becomes “safe enough for specialized workflow.” That shortcut is expensive in medicine. The paper shows that a model can know medical terms and still fail to weight them correctly under narrative pressure.
The distinction is important:
| Capability | Looks like | Failure mode |
|---|---|---|
| Medical knowledge | Knows that DVT and PE are related | Still may not activate that relation in a noisy case |
| Symptom extraction | Lists leg swelling, chest symptoms, and dyspnea | Treats all facts as equally important |
| Clinical prioritization | Identifies the hidden lethal chain | Follows the most salient or familiar symptom instead |
| Safe triage support | Escalates the right risk for human review | Provides plausible but unsafe reassurance or wrong ordering |
That is why the DVT-to-PE case should not be read as a “gotcha” against one model. It is a design warning. Clinical intake is not a search task. It is a risk-weighting task under uncertainty, with ugly input.
A chatbot that lists symptoms is not yet a triage assistant. It is a typist with better grammar.
The strongest overall model still failed under extreme noise
Qwen3-Max had the best total score. That should be reported. It should not be worshipped.
In Probe 1.1, the patient narrative mixes car accident frustration, snoring, sleep disturbance, leg cramps, weakness, yellowish facial appearance, greasy takeout, and a past abnormal liver-function marker. The gold-standard extraction is concise: recent fatigue, yellowing face, nocturnal leg cramps, and abnormal liver-function history.
Here the paper reports a striking pattern. GPT-4o scored 2 because it included secondary information such as gender and snoring. Gemini 2.5, DeepSeek 3.1, and Qwen3-Max each scored 4 because they fabricated “hypersomnia” or equivalent daytime sleepiness, even though the original patient statement did not say that.
This is the paper’s “information steatosis” metaphor in miniature. Under heavy narrative load, the model does not merely omit; it may metabolize noise into invented clinical content. That is worse than verbosity. A verbose but faithful summary wastes clinician time. A fabricated symptom can redirect clinical reasoning.
The business implication is uncomfortable: the best model in the aggregate can still fail catastrophically on the wrong input shape. Model selection cannot be based only on total score. It must include failure-mode review.
For healthcare deployment, the evaluation question should be:
Which failures remain after the average score looks acceptable?
That question is less glamorous than “which model wins?” Unfortunately, it is also the one that matters.
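A failure-mode review of this kind can be sketched in a few lines. The snippet below uses only the two per-probe results and the totals reported in this article; the probe keys and the function name `worst_case_failures` are illustrative labels, not the paper's identifiers, and the threshold of 4 is an assumption about what counts as a worst-case miss.

```python
# Per-probe scores reported in the article (0 = best, 4 = worst).
# Probe keys are illustrative labels, not the paper's identifiers.
PROBE_SCORES = {
    "noise_probe_1.1": {"GPT-4o": 2, "Gemini 2.5": 4, "DeepSeek 3.1": 4, "Qwen3-Max": 4},
    "triage_dvt_to_pe": {"GPT-4o": 4, "Gemini 2.5": 4, "DeepSeek 3.1": 1, "Qwen3-Max": 0},
}
# Aggregate totals out of 80, as reported in the article.
TOTALS = {"GPT-4o": 27, "Gemini 2.5": 32, "DeepSeek 3.1": 23, "Qwen3-Max": 16}


def worst_case_failures(probe_scores, threshold=4):
    """Map each model to the probes where it scored at or above `threshold`."""
    failures = {}
    for probe, by_model in probe_scores.items():
        for model, score in by_model.items():
            if score >= threshold:
                failures.setdefault(model, []).append(probe)
    return failures


failures = worst_case_failures(PROBE_SCORES)
# List models from best to worst total, with their worst-score probes alongside.
for model in sorted(TOTALS, key=TOTALS.get):
    print(f"{model}: total {TOTALS[model]}/80, worst-score probes: {failures.get(model, [])}")
```

Read this way, the model with the best total still appears in the failure list, which is the point of asking the question at all.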
The five dimensions separate “chat ability” from clinical utility
The paper’s main evidence is the twenty-probe score table and its dimension-level analysis. The supplementary materials add the probe design, scoring rules, and model-response tables. That supplementary section is not an ablation or robustness test; it functions mainly as an implementation audit trail. It shows what the models were asked, what the gold standards were, and why scores were deducted.
The dimension totals are revealing:
| Dimension | GPT-4o | Gemini 2.5 | DeepSeek 3.1 | Qwen3-Max | Interpretation |
|---|---|---|---|---|---|
| Noise filtering | 8 | 11 | 9 | 8 | Universal weak point |
| Priority triage | 5 | 12 | 2 | 2 | Sharpest safety divergence |
| Contradiction detection | 7 | 4 | 1 | 3 | DeepSeek strongest here |
| Fact-emotion separation | 7 | 5 | 7 | 3 | Qwen3-Max strongest here |
| Timeline sorting | 0 | 0 | 4 | 0 | Mostly solved, except DeepSeek redundancy |
The pattern is not “LLMs are bad at medicine.” That is too crude. The pattern is more useful: current models are relatively good at arranging timelines, less reliable at filtering noisy patient speech, and highly uneven at clinical prioritization and contradiction detection.
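The dimension table can be cross-checked with simple arithmetic. The sketch below copies the reported dimension scores, recomputes each model's total, and adds two derived views that are this article's own framing rather than the paper's: the summed score per dimension and the gap between the best and worst model in each dimension.

```python
MODELS = ["GPT-4o", "Gemini 2.5", "DeepSeek 3.1", "Qwen3-Max"]

# Dimension scores as reported (lower is better), in the order of MODELS.
DIMENSION_SCORES = {
    "Noise filtering": [8, 11, 9, 8],
    "Priority triage": [5, 12, 2, 2],
    "Contradiction detection": [7, 4, 1, 3],
    "Fact-emotion separation": [7, 5, 7, 3],
    "Timeline sorting": [0, 0, 4, 0],
}

# Total per model: sum down each column of the table.
model_totals = {
    model: sum(row[i] for row in DIMENSION_SCORES.values())
    for i, model in enumerate(MODELS)
}

# Total per dimension: sum across the four models.
dimension_totals = {dim: sum(row) for dim, row in DIMENSION_SCORES.items()}

# Spread per dimension: gap between the worst and best model.
dimension_spread = {dim: max(row) - min(row) for dim, row in DIMENSION_SCORES.items()}

print(model_totals)
print(dimension_totals)
print(dimension_spread)
```

The column sums reproduce the reported 27, 32, 23, and 16. Noise filtering carries the highest combined score, and priority triage shows the widest gap between models.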
That gives product teams a practical map. If the task is to reorder already-extracted symptom events, several models may be adequate. If the task is to decide which messy patient statement deserves urgent escalation, the evaluation bar must be much higher.
Timeline sorting is the low-friction capability. Priority triage is the landmine.
Information steatosis is not just verbosity; it is operational drag
The paper identifies three failure modes: catastrophic functional failure, information steatosis, and hallucination or fabrication.
The second one deserves special attention because many AI products hide it behind politeness. A model gives the right answer, then surrounds it with caveats, secondary details, emotional paraphrase, differential diagnoses, explanations nobody asked for, and reassuring language. The answer is technically present. The clinician still has to excavate it.
In a low-stakes consumer app, verbosity is annoying. In clinical intake, verbosity becomes workflow debt.
Imagine a nurse or physician reviewing dozens of AI-generated intake summaries. If each summary contains the right warning sign buried under soft language and secondary facts, the product has not saved time. It has shifted filtering labor from the front of the workflow to the clinician’s already overloaded attention budget.
This is where “information steatosis” becomes a useful metaphor, despite its slightly dramatic naming. The problem is not that the model lacks content. The problem is inefficient metabolism: too much input becomes too much output, and the medically useful signal is diluted.
For business use, that means evaluation should include output density, not only correctness. A summary that is correct but clinically bloated may still fail the product requirement.
A practical intake metric could track four things:
| Evaluation target | Useful question |
|---|---|
| Signal capture | Did the model include the clinically necessary facts? |
| Noise rejection | Did it exclude irrelevant life details and emotional filler? |
| Risk weighting | Did it put dangerous possibilities first? |
| Output density | Can a clinician act on it quickly? |
The paper does not provide an industrial benchmark for these metrics. It does provide a prototype of the kind of stress test healthcare vendors should be running before making patient-facing claims.
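One rough way to turn the four questions above into numbers is sketched below. Everything here is an assumption for illustration: the function `intake_metrics`, the exact-string matching of facts, the fact lists loosely reconstructed from the DVT-to-PE example, and the sample model output. It is not the paper's scoring method.

```python
def intake_metrics(gold_facts, priority_fact, extracted_facts, summary_text):
    """Score one intake summary against a clinician-defined gold standard.

    Facts are compared as exact strings, which is a simplification; a real
    scorer would need clinician review or a normalization step.
    Assumes gold_facts is non-empty.
    """
    gold = set(gold_facts)
    extracted = set(extracted_facts)
    captured = gold & extracted
    word_count = len(summary_text.split())
    return {
        # Signal capture: share of gold facts that made it into the output.
        "signal_capture": len(captured) / len(gold),
        # Noise rejection: share of extracted items that are gold facts.
        "noise_rejection": len(captured) / len(extracted) if extracted else 0.0,
        # Risk weighting: is the gold-standard priority item listed first?
        "priority_first": bool(extracted_facts) and extracted_facts[0] == priority_fact,
        # Output density: words a clinician reads per captured gold fact.
        "words_per_captured_fact": word_count / len(captured) if captured else float("inf"),
    }


# Hypothetical gold standard and model output for a DVT-to-PE style narrative.
gold = ["leg swelling", "leg pain", "chest tightness", "shortness of breath"]
model_facts = ["chest tightness", "shortness of breath", "leg swelling", "work stress"]
summary = ("Patient reports chest tightness and shortness of breath, "
           "some leg swelling, and stress at work.")

metrics = intake_metrics(gold, "leg swelling", model_facts, summary)
print(metrics)
```

In this made-up case the output captures three of four gold facts and includes one irrelevant item, yet the most telling field is `priority_first`: the leg symptom is present but not ranked first.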
Contradictions are where patient belief becomes clinical risk
Contradiction detection is one of the paper’s strongest design choices. Patients rarely present themselves as diagnostic puzzles. They present interpretations: “my gastritis is worse,” “my blood pressure is controlled,” “this cough is probably nothing,” “I feel fine except for this strange symptom.”
A safe model must distinguish the patient’s interpretation from the clinical facts.
Probe 3.3 captures this neatly. The patient believes recurring heartburn is the main issue and says over-the-counter medication helps. But the new symptoms — progressive dysphagia and unintentional weight loss — are alarm signs that should shift attention toward possible esophageal cancer. The contradiction is not semantic. It is clinical: perceived improvement in one symptom coexists with new signals of a potentially serious condition.
GPT-4o was the weakest model on the contradiction-detection dimension (7 points), while DeepSeek 3.1 was the strongest (1 point). Again, the point is not brand drama. The point is capability granularity. A model can be fluent, knowledgeable, and still too compliant with the patient’s frame.
This matters for AI intake because patient-facing systems are exposed to self-diagnosis. If the model politely accepts the patient’s interpretation, it may become a very smooth amplifier of under-triage.
The safer behavior is not to contradict every patient. That would be useless and unpleasant. The safer behavior is to detect when the patient’s explanation is inconsistent with higher-risk evidence, then escalate the uncertainty clearly.
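As a minimal sketch of that behavior, the snippet below keeps the patient's interpretation and the reported symptoms separate, then escalates only when a listed alarm sign appears. The alarm-sign set is taken from the Probe 3.3 description above; the function name `check_patient_frame` and the symptom strings are hypothetical, and a real list would have to come from clinicians.

```python
# Alarm signs taken from the article's Probe 3.3 description. A real list
# would be maintained by clinicians; this set is only illustrative.
ALARM_SIGNS = {"progressive dysphagia", "unintentional weight loss"}


def check_patient_frame(patient_belief, reported_symptoms):
    """Flag listed alarm signs that appear alongside the patient's own explanation."""
    alarms = sorted(ALARM_SIGNS.intersection(reported_symptoms))
    if alarms:
        note = (f"Patient attributes symptoms to {patient_belief}, "
                f"but also reports: {', '.join(alarms)}. Escalate for review.")
    else:
        note = f"No listed alarm signs alongside patient belief: {patient_belief}."
    return {"escalate": bool(alarms), "alarm_signs": alarms, "note": note}


result = check_patient_frame(
    "heartburn that improves with over-the-counter medication",
    ["recurring heartburn", "progressive dysphagia", "unintentional weight loss"],
)
print(result["note"])
```

The design choice is that the patient's belief is recorded, not overruled: the note states both the belief and the conflicting evidence and leaves the judgment to a human reviewer.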
Fact-emotion separation is not bedside empathy; it is data cleaning
One of the paper’s better distinctions is between being a pleasant dialogue partner and being a useful clinical tool. The two overlap, but they are not the same.
In fact-emotion separation probes, the model must extract objective medical facts from emotionally charged language. A patient may describe memory decline through self-mockery, joint pain through weather metaphors, reflux through irony, or a child’s illness through parental panic. The model’s job is not to delete emotion; emotion is clinically meaningful. The job is to separate the objective fact from the emotional wrapper.
Qwen3-Max performed best in this dimension, with a dimension score of 3 compared with 7 for GPT-4o, 5 for Gemini 2.5, and 7 for DeepSeek 3.1.
That result points to a subtle product requirement. In patient-facing healthcare AI, emotional sensitivity is not automatically good. If the system over-absorbs emotional language, it may inflate urgency, summarize metaphors as symptoms, or create longer outputs that sound caring but require cleanup.
A useful clinical intake assistant might need two separate channels:
- Clinical facts: symptoms, duration, triggers, severity, progression, relevant history.
- Emotional context: fear, distress, denial, shame, frustration, caregiver anxiety.
Mixing them into a single paragraph looks humane. It is also how output becomes soup.
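The two-channel idea maps naturally onto a small data structure. The sketch below is one possible shape, assuming a hypothetical `IntakeNote` class; the field names mirror the two channels above, and the example values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IntakeNote:
    """Keeps objective findings and emotional context in separate channels."""
    clinical_facts: list = field(default_factory=list)
    emotional_context: list = field(default_factory=list)

    def clinician_view(self):
        """Render facts first, with emotional context on its own labeled line."""
        facts = "; ".join(self.clinical_facts) or "none recorded"
        context = "; ".join(self.emotional_context) or "none recorded"
        return f"Clinical facts: {facts}\nEmotional context: {context}"


# Hypothetical example: a panicked parent describing a child's fever.
note = IntakeNote(
    clinical_facts=["child has fever", "onset: last night"],
    emotional_context=["caregiver anxiety"],
)
print(note.clinician_view())
```

Keeping the channels as separate fields means the emotional context is preserved for the clinician without being able to leak into the list of symptoms.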
What the paper directly shows, and what Cognaptus infers
The paper directly shows a controlled stress-test result: four LLMs were evaluated on twenty simulated text probes, using clinician-defined gold standards and an inverse 0–4 scoring scale. The models showed different performance profiles. Qwen3-Max scored best overall. Gemini 2.5 scored worst. Noise filtering was the weakest shared dimension. Timeline sorting was the easiest. High-risk failures included missed DVT-to-PE reasoning, hallucinated symptoms under noise, and redundant outputs that buried core information.
Cognaptus’ business inference is broader but bounded: healthcare AI governance should not treat medical exam performance, generic model ranking, or polished case accuracy as sufficient evidence for direct patient-facing deployment. Before an LLM is used in intake, triage support, symptom collection, or pre-consultation summarization, it should be stress-tested on raw patient-language scenarios.
That inference applies most clearly to workflows where the model is upstream of a clinician:
| Workflow | Risk exposed by the paper | Governance response |
|---|---|---|
| Patient intake chatbot | Missing hidden red flags in noisy narratives | Require escalation-oriented stress tests |
| Pre-visit summarization | Producing verbose summaries that still need clinician filtering | Measure output density and signal-to-noise ratio |
| Symptom triage assistant | Prioritizing salient symptoms over lethal causal chains | Test risk-weighting, not just extraction |
| Chronic disease follow-up | Accepting patient self-assessment despite contradictory symptoms | Add contradiction-detection probes |
| Multilingual or culturally varied intake | Misreading metaphors, family narratives, or indirect expressions | Build local narrative probes before deployment |
What remains uncertain is equally important. The paper does not prove that these exact rankings will generalize to every medical specialty, language, deployment interface, prompt design, or model version. It does not test real patient conversations, audio tone, imaging, multimodal signals, or clinician-in-the-loop workflow performance. It does not show whether fine-tuning, retrieval, structured prompting, or multi-agent pipelines would reduce the observed failures. It also uses a small, custom benchmark; that is valuable for diagnosis, not enough for certification.
So the correct takeaway is not “do not use LLMs in healthcare.” The correct takeaway is: do not confuse cleaned-case competence with raw-intake safety.
The missing benchmark is an “intake liver panel”
The paper proposes the idea of future “AI FibroScan” style testing for AI-MASLD — a standardized stress test for clinical information metabolism. The metaphor is playful, but the product idea is serious.
Healthcare organizations already understand pre-deployment validation. The missing piece is that many validations still evaluate the wrong layer. They test answer quality after the clinical facts have been selected. The harder problem is fact selection itself.
A useful “intake liver panel” for medical LLMs would test at least five capabilities:
| Capability | Stress condition | Failure to catch |
|---|---|---|
| Noise filtering | Irrelevant life details mixed with subtle symptoms | Signal loss and symptom fabrication |
| Priority triage | Multiple urgent-sounding complaints with one hidden lethal chain | Wrong escalation |
| Contradiction detection | Patient self-diagnosis conflicts with warning signs | Unsafe agreement |
| Fact-emotion separation | Anxiety, metaphor, irony, or caregiver fear | Emotional contamination of clinical summary |
| Timeline reconstruction | Nonlinear story with corrections and delayed details | Wrong disease progression |
This is not merely a technical evaluation. It changes procurement. Instead of asking vendors, “What is your model’s medical benchmark score?” healthcare buyers should ask, “Show me the model’s failure cases on raw patient narratives similar to our population.”
That one sentence would make several demos less magical. Good.
The safest architecture may be less chatty and more modular
The paper briefly suggests future interventions: training on more authentic unstructured clinical dialogue, RLHF targeting warning symptoms, information-filtering algorithms, prompt engineering, and mixture-of-experts systems. These are not tested interventions in the paper; they are future directions. Still, they point toward a reasonable architecture.
A production system for clinical intake should probably not be a single general chatbot producing one polished answer. It should be modular:
- Extractor: pulls symptoms, duration, severity, triggers, history, medications, and context.
- Noise filter: removes irrelevant details while preserving possible weak signals.
- Risk scorer: flags red-flag combinations and hidden causal chains.
- Contradiction detector: compares patient interpretation against clinical warning signs.
- Summary compressor: produces a short clinician-facing note.
- Escalation guardrail: sends high-risk or low-confidence cases to human review.
This architecture is less charming than a conversational doctor-bot. It is also less likely to bury a pulmonary embolism signal under a paragraph of bedside-manner theater.
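To make the modular idea concrete, here is a deliberately tiny sketch of four of the six stages: extractor, risk scorer, summary compressor, and escalation guardrail. The extractor's fixed vocabulary doubles as a crude noise filter, and the contradiction detector is left out. All names, phrases, and the single red-flag rule are assumptions for illustration; the rule comes from the DVT-to-PE example in this article and is not clinical guidance.

```python
from dataclasses import dataclass, field

# Toy vocabulary: phrase in the narrative -> normalized clinical fact (assumption).
SYMPTOM_TERMS = {
    "leg is swollen": "leg swelling",
    "leg hurts": "leg pain",
    "chest feels tight": "chest tightness",
    "short of breath": "shortness of breath",
}

# Toy red-flag rule based on the article's DVT-to-PE example (not clinical guidance).
RED_FLAG_CHAINS = [
    ({"leg swelling", "shortness of breath"}, "possible DVT-to-PE chain"),
]


@dataclass
class IntakeResult:
    facts: list = field(default_factory=list)
    flags: list = field(default_factory=list)
    summary: str = ""
    needs_human_review: bool = False


def extract(narrative):
    """Extractor: keep only phrases in the vocabulary, dropping everything else."""
    text = narrative.lower()
    return [fact for phrase, fact in SYMPTOM_TERMS.items() if phrase in text]


def score_risk(facts):
    """Risk scorer: flag a chain when all of its required facts are present."""
    present = set(facts)
    return [label for required, label in RED_FLAG_CHAINS if required <= present]


def compress(facts, flags):
    """Summary compressor: one short line, red flags before individual facts."""
    return "; ".join(flags + facts)


def run_intake(narrative):
    facts = extract(narrative)
    flags = score_risk(facts)
    # Escalation guardrail: review on any red flag, or when nothing was extracted.
    needs_review = bool(flags) or not facts
    return IntakeResult(facts, flags, compress(facts, flags), needs_review)


narrative = ("My car got hit last week and I'm so stressed. My chest feels tight, "
             "I get short of breath, and oh, my left leg is swollen too.")
result = run_intake(narrative)
print(result.summary)
print("Needs human review:", result.needs_human_review)
```

Each stage is a plain function with its own input and output, so each can be stress-tested and replaced separately. A production version would swap the keyword lookup for a model call and the single rule for a clinician-maintained rule set.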
The business value is not replacing clinicians. It is reducing low-value intake labor without degrading safety. That only works if the AI improves the signal before it reaches the clinician. If the clinician must re-filter the AI’s output, the product has merely added a second waiting room.
Boundaries: useful warning, not final diagnosis
The paper is strongest as a stress-test demonstration and weakest if treated as a universal model ranking.
Several boundaries should discipline interpretation.
First, the benchmark is small: twenty probes across five categories. That is enough to expose failure modes, not enough to establish broad clinical reliability.
Second, the prompts are simulated rather than drawn from de-identified real patient conversations. Simulation allows clean scoring, but real patients bring more linguistic diversity, cultural variation, missing context, and follow-up dynamics.
Third, the study is text-only. That avoids multimodal confounding, but real clinical encounters may include voice tone, facial expression, imaging, sensor data, and clinician follow-up questions.
Fourth, the cases lean toward internal medicine and metabolism-related scenarios. Surgical, psychiatric, pediatric, obstetric, emergency, and oncology workflows may stress different capabilities.
Fifth, the results are model-version specific. LLMs change quickly. A benchmark result from one API version should be treated like a lab value at a point in time, not a permanent character judgment.
These limitations do not weaken the core warning. They define where the warning applies. The paper does not tell us which model should run clinical AI. It tells us what kind of failure clinical AI governance must look for.
The article’s uncomfortable conclusion
The seductive story is that medical LLMs are close to doctor-level because they can answer hard medical questions. The paper’s correction is sharper: answering a hard question is not the same as surviving a messy patient narrative.
A patient does not hand the model a differential diagnosis prompt. The patient says the leg hurts, the chest feels tight, the stomach is probably gastritis, the child is a furnace, the memory is a sieve, the face looks yellowish, the car accident ruined the week, and maybe the liver test was abnormal last month.
Somewhere inside that mess is the clinical signal.
AI-MASLD is an awkward name for a useful idea: models can become metabolically inefficient under real communication load. They may retain facts, but fail to filter. They may know risks, but fail to prioritize. They may sound empathetic, but contaminate the summary. They may be correct enough to impress a demo and messy enough to waste a clinician’s time.
For healthcare AI, the next serious benchmark is not another polished exam. It is the waiting room.
Cognaptus: Automate the Present, Incubate the Future.
[1] Yuan Shen, Xiaojun Wu, and Linghua Yu, “AI-MASLD Metabolic Dysfunction and Information Steatosis of Large Language Models in Unstructured Clinical Narratives,” arXiv:2512.11544, 2025. https://arxiv.org/pdf/2512.11544