When Bigger Isn’t Smarter: Stress‑Testing LLMs in the ICU

A hospital does not buy “intelligence.” It buys a workflow.

That distinction sounds obvious until an AI vendor arrives with a model that has billions of parameters, a clinical pretraining story, and the gentle implication that smaller models are now museum pieces. In the ICU, however, the useful question is not whether the model can talk like a doctor. It is whether it can predict tomorrow’s clinical deterioration from messy notes better than simpler systems that cost less, run faster, and cause fewer infrastructure headaches.

The paper behind today’s article asks exactly that unfashionable question. Malhotra and colleagues benchmark GatorTron Base, Llama 8B, and Mistral 7B embeddings for predicting next-day abnormal shock index in ICU patients, then compare them with smaller language-model (SLM) baselines such as Word2Vec+Doc2Vec, BioClinicalBERT+DocBERT, and BioBERT+DocBERT.¹ The answer is not “LLMs fail.” That would be too easy, and also wrong. The answer is more useful: LLMs are competitive, but not clearly superior. In clinical AI procurement, that difference is the gap between “promising technology” and “expensive assumption wearing a white coat.”

The real contest is not LLM versus medicine. It is LLM versus a cheaper baseline.

The paper focuses on shock prediction, using shock index as the target signal. Shock index is defined as heart rate divided by systolic blood pressure:

$$ SI = \frac{HR}{SBP} $$

The authors label an epoch abnormal when $SI \geq 0.7$ and normal when $SI < 0.7$. They are not trying to classify whether the word “shock” appears in a note. That would be a fairly cheap trick, and clinical NLP has enough cheap tricks already. Instead, they map physician notes to next-day labels generated from continuous vital-sign data. A new abnormal shock-index episode must last more than thirty minutes and be preceded by at least twenty-four hours of normal shock index, reducing the risk that the model is merely detecting a deterioration already in progress.
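The labeling rule above can be sketched in a few lines. This is a minimal illustration, not the authors’ code: it assumes one shock-index sample per minute with no gaps, and it assumes that any abnormal reading resets the 24-hour normal window. The function names and the per-minute list are hypothetical.

```python
from typing import List, Optional

SI_THRESHOLD = 0.7        # from the article: SI >= 0.7 is abnormal
MIN_EPISODE_MIN = 30      # episode must last more than 30 minutes
MIN_NORMAL_MIN = 24 * 60  # preceded by at least 24 hours of normal SI


def shock_index(hr: float, sbp: float) -> float:
    """Shock index = heart rate / systolic blood pressure."""
    return hr / sbp


def first_new_onset(si_per_minute: List[float]) -> Optional[int]:
    """Return the minute at which the first qualifying abnormal episode starts.

    Assumes one sample per minute (an illustration-only simplification).
    Returns None when no episode satisfies both conditions.
    """
    normal_run = 0  # consecutive normal minutes seen just before position i
    i, n = 0, len(si_per_minute)
    while i < n:
        if si_per_minute[i] < SI_THRESHOLD:
            normal_run += 1
            i += 1
            continue
        # An abnormal run starts at i; find where it ends.
        j = i
        while j < n and si_per_minute[j] >= SI_THRESHOLD:
            j += 1
        if (j - i) > MIN_EPISODE_MIN and normal_run >= MIN_NORMAL_MIN:
            return i
        normal_run = 0  # assumption: any abnormal run resets the normal window
        i = j
    return None
```

A series of 1,440 normal minutes followed by 31 abnormal minutes yields an onset at minute 1440; shortening the abnormal run to 30 minutes yields no onset, because the rule requires more than thirty.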

That design matters. Predicting a future trajectory is harder than extracting a named entity or summarizing a discharge note. It asks the text representation to carry signals about an evolving patient state, not just retrieve obvious labels. The authors also mask directly revealing words and therapeutics, including diagnoses such as “shock” and “septic” and drugs such as dobutamine, dopamine, adrenaline, and noradrenaline. This does not remove all possible leakage, but it shows the paper is trying to avoid the easiest shortcut.
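The masking step might look something like the sketch below. It uses only the terms the article names; the real mask list is presumably longer, and the `[MASK]` token and whole-word matching are assumptions made for illustration.

```python
import re

# Only the terms named in the article; the full list used in the paper
# is not given here, so treat this as a placeholder.
MASK_TERMS = ["shock", "septic", "dobutamine", "dopamine",
              "adrenaline", "noradrenaline"]

# Whole-word, case-insensitive match (an assumption of this sketch).
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in MASK_TERMS) + r")\b",
    flags=re.IGNORECASE,
)


def mask_note(text: str, token: str = "[MASK]") -> str:
    """Replace directly revealing terms with a placeholder token."""
    return _PATTERN.sub(token, text)


print(mask_note("Septic shock, started Noradrenaline."))
# [MASK] [MASK], started [MASK].
```

Whole-word matching leaves variants such as “shocked” or misspellings untouched, which is one concrete way residual leakage can survive a mask list.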

The final textual cohort is small: 355 normal shock-index patients and 87 abnormal shock-index patients. That is enough to run a benchmarking study, but not enough to declare a clinical product ready for deployment. It is exactly the sort of dataset where fashionable model assumptions should be stress-tested rather than admired from a safe distance.
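The class balance is worth a quick calculation, because it sets a reference point for every accuracy figure that follows. The sketch below assumes the evaluation data mirror the full cohort’s ratio, which the article does not state, so treat the result as orientation rather than a finding.

```python
def class_balance(normal: int, abnormal: int) -> dict:
    """Summarize cohort size, minority rate, and the always-majority accuracy."""
    total = normal + abnormal
    return {
        "total": total,
        "abnormal_rate": abnormal / total,
        # Accuracy of a trivial rule that always predicts the majority class.
        "majority_accuracy": normal / total,
    }


stats = class_balance(normal=355, abnormal=87)
print(stats["total"])                          # 442
print(f"{stats['abnormal_rate']:.3f}")         # 0.197
print(f"{stats['majority_accuracy']:.3f}")     # 0.803
```

With roughly one abnormal patient in five, accuracy alone says little; the per-class metrics carry most of the information.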

The benchmark is a layered comparison, not a single leaderboard

The paper’s workflow is deliberately hybrid. The LLMs are not used as direct clinical decision agents. They generate embeddings from physician-note content, especially history of present illness and therapeutics-related context. Those embeddings are then fed into conventional classifiers: Logistic Regression, Random Forest, Gradient Boosting, AdaBoost, and XGBoost.

This choice is easy to underappreciate. In many business conversations, “using an LLM” means asking a model to reason, classify, or generate a final decision. Here, the LLM is closer to a feature extractor. The downstream classifier still decides. That makes the comparison more operationally realistic for many hospital analytics teams: the question becomes whether richer language embeddings improve a predictive pipeline enough to justify their added complexity.
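The “LLM as feature extractor” shape is easy to show in miniature. Everything in this sketch is a stand-in: `toy_embedder` is a deterministic hashed bag-of-words, not a GatorTron or Llama call, and `NearestCentroid` replaces the tree ensembles used in the paper. The point is only the structure: swap the embedder, keep the classifier fixed, compare.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Embedder = Callable[[str], Vector]


def toy_embedder(note: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for an LLM embedding call (hashed word counts)."""
    vec = [0.0] * dim
    for word in note.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0  # deterministic bucket per word
    return vec


class NearestCentroid:
    """Tiny stand-in for the downstream classifier."""

    def fit(self, X: Sequence[Vector], y: Sequence[int]) -> "NearestCentroid":
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X: Sequence[Vector]) -> List[int]:
        def dist(a: Vector, b: Vector) -> float:
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids, key=lambda c: dist(x, self.centroids[c]))
                for x in X]


def run_pipeline(embed: Embedder, clf, train_notes, train_labels, test_notes):
    """Embed notes, fit the classifier on embeddings, predict on new notes."""
    clf.fit([embed(n) for n in train_notes], train_labels)
    return clf.predict([embed(n) for n in test_notes])
```

Calling `run_pipeline` twice with different `embed` functions and the same classifier is, in spirit, the comparison the benchmark runs.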

The paper’s experimental structure can be read as five comparisons:

| Comparison | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| LLM embeddings versus SLM baselines | Main evidence | Whether larger models outperform smaller clinical/text models on this shock-index prediction task | Whether LLMs are generally inferior or superior in healthcare |
| Pretrained GatorTron versus fine-tuned GatorTron | Main evidence / ablation | Whether task-specific fine-tuning improves performance on the small cohort | Whether fine-tuning would fail with larger, more diverse data |
| Cross-entropy versus focal loss | Ablation for class imbalance | Whether loss-function choice helps with an imbalanced abnormal-shock cohort | Whether focal loss is universally better for clinical prediction |
| Dropout-rate variants | Robustness / sensitivity test | Whether performance is highly sensitive to regularization settings | Whether the final model is robust under hospital deployment shifts |
| SHAP analysis for Random Forest | Exploratory interpretability extension | Which features influence the best GatorTron-based classifier and whether they resemble prior SLM findings | Whether the model’s reasoning is clinically causal |

This is why the paper is best read comparison by comparison. It is not selling a new grand architecture. Its value is in the uncomfortable cross-checks.

The headline result: GatorTron performs well, but the old baselines refuse to die

The strongest LLM result comes from pretrained GatorTron Base embeddings paired with Random Forest. That configuration reaches accuracy of 0.80 ± 0.0161, recall of 0.805 ± 0.0152, and F1 score of 0.74 ± 0.0266. Among GatorTron classifiers, Gradient Boosting delivers the highest AUC-ROC at 0.68 ± 0.0389.

On the surface, that sounds like the clinical LLM did its job. It did. The problem, for anyone trying to justify an expensive deployment, is that the smaller baselines are not embarrassed.

Word2Vec+Doc2Vec with AdaBoost reports recall of 0.81 ± 0.0070 and accuracy of 0.80 ± 0.0070. BioBERT+DocBERT with Random Forest reports accuracy and recall of 0.81 ± 0.0072. BioBERT+DocBERT with XGBoost reaches F1 score of 0.77 ± 0.0096. BioBERT+DocBERT with Gradient Boosting reaches AUC-ROC of 0.68 ± 0.0139, essentially matching the best GatorTron AUC-ROC reported in the paper.

So the paper’s real result is not “LLMs are bad.” It is: when tested as embeddings inside a classical predictive pipeline, LLMs do not automatically dominate smaller alternatives. In some cells they look strong. In other cells, simpler or smaller systems match them. The result is a draw with expensive costumes.

That is already enough to change the business conversation. A hospital analytics buyer does not need to ask, “Is this an LLM?” The better question is, “Against which smaller model, on our clinical prediction target, with our alerting cost, and under our operating constraints?”
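The near-ties reported in this section can be checked crudely by asking whether the reported ranges overlap. The sketch assumes each “±” is one standard deviation across runs or folds, which the article does not specify, and overlap is only a rough screen, not a significance test.

```python
from typing import Tuple

Interval = Tuple[float, float]


def interval(mean: float, spread: float) -> Interval:
    """Turn a 'mean ± spread' report into a (low, high) range."""
    return (mean - spread, mean + spread)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two closed ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Figures quoted in this article (best LLM cell vs. best SLM cell).
comparisons = {
    "accuracy": (interval(0.80, 0.0161), interval(0.81, 0.0072)),
    "f1":       (interval(0.74, 0.0266), interval(0.77, 0.0096)),
    "auc_roc":  (interval(0.68, 0.0389), interval(0.68, 0.0139)),
}

for metric, (llm, slm) in comparisons.items():
    print(metric, "overlap" if overlaps(llm, slm) else "separated")
```

All three pairs overlap under this reading, which is what “a draw” looks like numerically. A proper comparison would need the per-run scores and a paired test.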

High recall is not the same as clinical readiness

The most tempting number in the paper is GatorTron+Random Forest’s recall of 0.805. For an ICU deterioration task, recall matters. Missing a patient heading toward physiological decompensation is costly, clinically and ethically. A model that catches more possible abnormal shock-index cases deserves attention.

But recall alone is not a deployment plan. The same GatorTron+Random Forest configuration reports specificity of 0.25 ± 0.0604. Several ensemble models show the same pattern: respectable accuracy and recall, but weak specificity. The SLM baselines often show similar trade-offs. Logistic Regression variants can produce higher specificity, but often with much lower accuracy and recall.

This is where metric interpretation becomes operational, not academic. A low-specificity warning system may create too many false alerts. In an ICU, alert fatigue is not a minor user-experience issue; it is a workflow hazard. A model that is useful as a silent risk score, a second-look triage layer, or a research signal may still be too noisy as a front-line alarm.
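A back-of-the-envelope calculation shows why specificity drives alert burden. The inputs below are illustrative, not the paper’s figures: the article does not state which class the paper treats as positive or how recall is averaged across classes, so the reported numbers should not be plugged in directly.

```python
def alert_burden(n_patients: float, prevalence: float,
                 sensitivity: float, specificity: float) -> dict:
    """Expected true and false alerts for a binary early-warning rule."""
    positives = n_patients * prevalence
    negatives = n_patients - positives
    true_alerts = sensitivity * positives
    false_alerts = (1.0 - specificity) * negatives
    total = true_alerts + false_alerts
    return {
        "true_alerts": true_alerts,
        "false_alerts": false_alerts,
        # Share of alerts that are real (positive predictive value).
        "ppv": true_alerts / total if total else 0.0,
    }


# Hypothetical ward: 100 patients, 20% deteriorate, and a rule with
# 80% sensitivity and 25% specificity.
print(alert_burden(100, 0.20, 0.80, 0.25))
```

Under these made-up inputs the rule raises about 16 true alerts and 60 false ones, so roughly one alert in five is real. That ratio, not the recall figure, is what determines whether staff keep listening.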

The paper does not provide a full clinical utility analysis, threshold optimization, calibration assessment, or prospective deployment study. That is not a flaw so much as a boundary. The benchmark tells us which representation-classifier combinations are promising. It does not tell us how many nurses get interrupted, how often clinicians override the signal, or whether patient outcomes improve. Procurement teams should notice the difference before the pilot begins, preferably before the invoice arrives.

Fine-tuning did not rescue the large model story

The authors also fine-tune GatorTron using cross-entropy loss and focal loss. This matters because one common response to weak benchmark dominance is: “Fine, but we have not tuned it yet.” That response is plausible. It is also not a blank cheque.

In the paper, fine-tuning produces limited and inconsistent gains. For Random Forest, the pretrained (not fine-tuned) GatorTron setup reports accuracy of 0.80, recall of 0.805, and F1 score of 0.74. The cross-entropy fine-tuned version drops to accuracy and recall of 0.72 and F1 of 0.64. The focal-loss version reaches accuracy and recall of 0.73 and F1 of 0.63. For Gradient Boosting, focal loss improves specificity relative to cross-entropy, but the pretrained version remains stronger on precision, recall, F1, and AUC-ROC.

This is not proof that fine-tuning is useless. It is evidence that fine-tuning on a small cohort can fail to improve a predictive clinical pipeline, even when the loss function is chosen to address class imbalance. The authors themselves point to the limited fine-tuning cohort as a likely reason.

There is a more general lesson here. Fine-tuning is often discussed as if it were a ritual: add task data, stir with GPUs, receive performance. In real predictive healthcare, the ritual can produce a model that is more specialized and less useful. When the target is patient trajectory prediction, the missing ingredient may not be “more clever tuning.” It may be larger, more diverse, trajectory-oriented training data.
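For readers who have not met focal loss, the difference from cross-entropy fits in a few lines. This is the standard binary form from Lin et al.; the `gamma` and `alpha` defaults below come from that original formulation, and the values used in the benchmark are not given in this article.

```python
import math


def cross_entropy(p: float, y: int) -> float:
    """Binary cross-entropy for predicted P(y=1) = p and true label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t)**gamma.

    gamma and alpha are illustrative defaults, not the paper's settings.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p=0.9) is down-weighted far more than a hard one (p=0.1).
for p in (0.9, 0.1):
    print(p, focal_loss(p, 1) / cross_entropy(p, 1))
```

The modulating factor shrinks the contribution of examples the model already gets right, so training attention shifts to hard cases. That can help with imbalance, but it does nothing to create signal that 87 abnormal patients cannot supply.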

The dropout tests are robustness checks, not a second thesis

The paper also tests dropout variations under focal-loss fine-tuning, using selected dropout/epoch combinations such as 0.7/13, 0.999999/13, and 0.9999/20. These results should not be overread. They are best treated as sensitivity checks around regularization, not as the main argument.

Across classifiers, the differences are generally modest. Random Forest varies more than some others, with accuracy ranging from 0.73 to 0.76. Gradient Boosting accuracy stays around 0.68 to 0.69. XGBoost moves from 0.65 to 0.69. The paper’s interpretation is that no single dropout configuration clearly dominates and that performance is relatively insensitive to dropout changes, conditional on selecting a reasonable number of epochs and achieving convergence.

For business readers, this matters because robustness is often confused with superiority. The dropout tests do not show that the LLM pipeline is clinically ready. They show that, within this experimental setting, regularization tweaks do not overturn the main comparison. The big conclusion remains intact: the larger model is capable, but not decisively better.

The SHAP plot quietly weakens the magic story

The authors use SHAP to examine feature importance for the best GatorTron Random Forest model. They report that heparin sodium prophylaxis and coumadin strongly affect the onset-of-shock prediction, while famotidine and risperidone influence the negative class. They also note that these findings correlate with prior work, suggesting that LLMs are learning similar features to SLMs.

That last point is more interesting than it first looks. If a much larger model learns roughly the same operational signals as smaller models, its advantage may not come from “deeper clinical understanding.” It may simply be another way to encode the same predictive correlates from the notes. That can still be valuable. Better embeddings can improve engineering convenience or transferability. But it is not the same as discovering a new clinical reasoning layer.

In business terms, the SHAP result pushes against the mythology of hidden intelligence. The model may be useful because it compresses text into predictive features, not because it has become an ICU fellow who happens to live in a vector space. Annoying distinction, yes. Expensive distinction, also yes.
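It helps to see what a SHAP value actually is. SHAP attributes a prediction to input features using Shapley values; the sketch below computes them exactly by brute-force enumeration for a toy three-feature model. It is not the authors’ setup: the SHAP library uses faster model-specific algorithms, and the model, inputs, and baseline here are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(model: Callable[[List[float]], float],
                   x: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values: features outside a coalition take baseline values."""
    n = len(x)

    def value(coalition: set) -> float:
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z: List[float]) -> float:
    """Hypothetical risk score: feature 0 raises it, feature 1 lowers it."""
    return 2.0 * z[0] - 1.0 * z[1] + 0.0 * z[2]


print(shapley_values(toy_model, x=[1, 1, 1], baseline=[0, 0, 0]))
```

The attributions sum to the difference between the prediction and the baseline prediction. Nothing in that arithmetic says why a feature matters clinically; it only says how much the model leaned on it.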

What this means for healthcare AI buyers

The paper directly shows that, on this MIMIC-III shock-index prediction benchmark, LLM embeddings paired with conventional classifiers perform competitively but do not consistently beat smaller baselines. It also shows that fine-tuning GatorTron under the tested settings does not reliably improve performance, and that dropout variations do not materially change the story.

Cognaptus infers three business lessons from that evidence.

First, clinical AI evaluation should begin with strong baselines, not model prestige. If BioBERT+DocBERT or Word2Vec+Doc2Vec can match an LLM-based pipeline on the target metric, then the LLM must justify itself through other advantages: easier maintenance, better transfer across tasks, lower feature-engineering burden, superior integration, or improved performance on a clinically prioritized metric. “It is bigger” is not a justification. It is a hosting bill with a personality.

Second, predictive healthcare is not the same product category as summarization. A model that writes a plausible discharge summary or extracts medication names may still struggle to forecast physiological deterioration. Prediction requires temporal signal, cohort design, label discipline, and evaluation against future outcomes. Hospitals should not let success in documentation workflows quietly migrate into claims about early-warning systems.

Third, the right deployment question is not “LLM or no LLM?” It is “Which component should be large?” A large model may be useful as a feature extractor. A smaller classifier may be better as the final decision layer. A task-specific SLM may be enough. A multimodal model combining notes and vitals may be better than text alone. The architecture should follow the clinical bottleneck, not the conference trend.

Where this benchmark should not be stretched

This paper is a useful stress test, not a universal verdict on LLMs in medicine.

The cohort is small: 442 textual cases after construction, with 355 normal and 87 abnormal shock-index patients. The data come from MIMIC-III, a valuable but retrospective and historical ICU database. The target is one physiological decompensation proxy: next-day abnormal shock index. The models are mostly evaluated as embedding generators inside downstream classical classifiers, not as end-to-end clinical agents. The study does not establish prospective clinical benefit, alert calibration, bedside usability, or outcome improvement.

Those boundaries matter because the wrong takeaway would be another lazy slogan: “LLMs do not work in the ICU.” The better takeaway is narrower and sharper: for this specific predictive task, under this data regime, larger language models did not automatically outperform smaller language-model pipelines. Scale helped enough to be competitive. It did not help enough to end the comparison.

That is exactly the kind of result healthcare AI needs more often. Not a victory parade. Not a funeral. A benchmark.

The smaller model is not the enemy. The untested assumption is.

The attractive story is that clinical AI advances by replacing small models with large ones. The more useful story is that clinical AI advances by matching model design to task structure. For summarization, conversational support, and broad text interpretation, large models may bring obvious advantages. For a narrow, imbalanced, trajectory-prediction problem built from ICU notes and vital-derived labels, smaller baselines can remain stubbornly competitive.

The paper’s quiet warning is that “clinical pretraining” and “large context” do not automatically translate into better prediction of future patient states. If the training objective has not taught the model to understand trajectories, and the local dataset is too small to teach that skill reliably, model size alone may not close the gap.

So yes, stress-test the LLM in the ICU. But stress-test the procurement logic too. Ask what the model is compared against. Ask which metric matters clinically. Ask how false alarms are handled. Ask whether fine-tuning improves the result or merely improves the sales narrative.

In the ICU, bigger may still be useful. It just does not get to skip the exam.

Cognaptus: Automate the Present, Incubate the Future.

¹ Chehak Malhotra, Mehak Gopal, Akshaya Devadiga, Pradeep Singh, Ridam Pal, Ritwik Kashyap, and Tavpritesh Sethi, “Benchmarking LLMs for Predictive Applications in the Intensive Care Units,” arXiv:2512.20520, 2025, https://arxiv.org/abs/2512.20520.
