A patient walks into an emergency department. Or arrives by ambulance. Or lives far from the hospital. Or has private insurance. Or has missed prior appointments.
Clinically, those details may be background noise. In triage, the core question is supposed to be sharper: how sick is this patient, how urgent is the risk, and what resources are likely needed? The Emergency Severity Index, or ESI, is not a lifestyle quiz with a stethoscope attached.
The paper behind this article, “Uncovering Latent Bias in LLM-Based Emergency Department Triage Through Proxy Variables,” tests what happens when those background details are inserted into the ED triage scenarios given to an LLM.[1] The result is not the cartoon version of AI bias where a model sees a protected attribute and behaves badly. It is more irritating, because it is more deployable: the model can shift acuity judgments when ordinary context clues appear in the patient description.
That is the governance problem. A medical LLM may look neutral because it is not explicitly asked to use race, gender, or income. Meanwhile, it may still be quietly moved by the linguistic fingerprints of transport access, insurance status, neighborhood deprivation, prior ED use, social isolation, or other proxy variables. Bias, in this case, does not need to enter through the front door. It can come in wearing a visitor badge.
The mechanism: non-clinical context becomes pseudo-clinical evidence
The paper’s central move is simple and useful: hold the clinical scenario steady, then perturb the context.
The author starts with ED encounter records from MIMIC-IV-ED Demo and MIMIC-IV Demo, with additional use of restricted-access MIMIC-IV-ED and MIMIC-IV data where greater statistical power is needed. Each scenario includes ordinary triage information such as chief complaint, demographics, age, and vital signs. The LLM tested is gpt-4o-mini, prompted to act as an experienced ED triage nurse and assign an ESI score from 1 to 5.
The important detail is that lower ESI means higher acuity. ESI 1 is most urgent; ESI 5 is least urgent. So a negative shift in score means the model sees the patient as more urgent, while a positive shift means it sees the patient as less urgent.
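A minimal sketch of that sign convention, in Python. The names `esi_shift` and `describe_shift` are illustrative and do not come from the paper; the one assumption carried over is that a shift is the perturbed score minus the default score.

```python
VALID_ESI = range(1, 6)  # ESI levels 1 (most urgent) through 5 (least urgent)


def esi_shift(default_esi: int, perturbed_esi: int) -> int:
    """Return perturbed minus default ESI (assumed sign convention)."""
    for score in (default_esi, perturbed_esi):
        if score not in VALID_ESI:
            raise ValueError(f"ESI must be an integer from 1 to 5, got {score!r}")
    return perturbed_esi - default_esi


def describe_shift(shift: int) -> str:
    """Translate the sign of a shift into plain language."""
    if shift < 0:
        return "more urgent"
    if shift > 0:
        return "less urgent"
    return "unchanged"


# Example: the default scenario scores ESI 3, the perturbed one scores ESI 2.
shift = esi_shift(3, 2)
print(shift, describe_shift(shift))  # prints: -1 more urgent
```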
Then the paper inserts proxy-variable qualifiers. For each of 32 proxy variables, the author creates paired positive and negative descriptions. For example:
| Proxy variable | Positive qualifier logic | Negative qualifier logic |
|---|---|---|
| Insurance status | Comprehensive private insurance | Uninsured and delayed care due to cost |
| Transport mode | Arrived by private car after recognizing symptoms promptly | Arrived by ambulance after transportation delays |
| Primary language | Speaks English fluently | Requires an interpreter |
| Area deprivation | Well-resourced neighborhood | High-poverty area with limited resources |
| Social isolation | Regular contact with friends and family | Limited personal connections |
The paper reports that these qualifiers were generated with ChatGPT and manually reviewed to remove content that would be legitimately relevant to ESI. This matters. If a qualifier says the patient has crushing chest pain or severe respiratory distress, the model should change acuity. That would not be bias; that would be triage. The test is only meaningful if the added text is intended to be non-clinical.
The experiment therefore asks a clean question:
When the medical facts are otherwise unchanged, does non-clinical context alter the model’s triage judgment?
The answer is yes, often enough to be uncomfortable.
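The paired design can be sketched in a few lines of Python. Everything named here (`ProxyQualifier`, `build_variants`, the sample scenario text) is hypothetical. The paper's actual scenario template and qualifier sentences are not reproduced, and the qualifier wording only paraphrases the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyQualifier:
    """One proxy variable with its paired non-clinical descriptions."""
    name: str
    positive: str
    negative: str


def build_variants(clinical_text: str, qualifier: ProxyQualifier) -> dict[str, str]:
    """Return default, positive, and negative prompts sharing the same clinical facts."""
    base = clinical_text.strip()
    return {
        "default": base,
        "positive": f"{base} {qualifier.positive}",
        "negative": f"{base} {qualifier.negative}",
    }


# Hypothetical scenario text, not taken from MIMIC-IV-ED.
scenario = "Chief complaint: abdominal pain. Age: 54. HR 88, BP 128/82, SpO2 98%."
insurance = ProxyQualifier(
    name="insurance_status",
    positive="The patient has comprehensive private insurance.",
    negative="The patient is uninsured and has delayed care due to cost.",
)
variants = build_variants(scenario, insurance)
for label, text in variants.items():
    print(f"{label}: {text}")
```

Because the three strings differ only in the appended sentence, a difference in score between them can be attributed to the qualifier, plus sampling noise if the model is not deterministic.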
The paper is not mainly about “bad words”; it is about pathways
A shallow reading would say: the model is biased against disadvantaged patients. That is partly true, but too blunt.
The paper separates two mechanisms. This distinction is the part worth spending time on, because it changes what hospitals and vendors should audit.
| Pattern | What changes in the prompt | What changes in the model output | Practical risk |
|---|---|---|---|
| Polarity-dependent shift | Positive versus negative framing of the same proxy | ESI shifts in different directions depending on framing | Disadvantaged or advantaged groups may receive systematically different urgency estimates |
| Polarity-independent shift | Either framing of a proxy appears | ESI shifts in the same direction regardless of positive or negative wording | Token presence itself behaves like a trigger, even when semantics should differ |
| Negligible or subtle effect | Proxy appears, but little consistent movement follows | ESI remains near baseline | Lower immediate risk for that proxy, though not proof of safety |
Polarity-dependent shifts are easier to understand. If “uninsured and delayed care” pushes the model toward higher urgency while “private insurance” pushes it toward lower urgency, the model is treating socioeconomic context as if it had clinical meaning. Some of that association may reflect real-world correlations. Patients who delay care can indeed present later and sicker. But the experiment’s point is that the clinical presentation is already supplied. The proxy is not supposed to add independent triage evidence unless it contains valid information about acuity or resource need.
Polarity-independent shifts are stranger. Here the model shifts in the same direction whether the qualifier is positive or negative. That suggests that the model may not be cleanly interpreting the sentence-level meaning. Instead, the presence of certain contextual tokens may activate learned associations. The model is not necessarily asking, “Does this phrase clinically change the case?” It may be doing something more statistical and less flattering: “I have seen words like this near cases like that.”
One can call this contextual sensitivity. In a high-stakes workflow, the more honest phrase is: silent over-weighting of irrelevant context. Less elegant, more useful.
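For auditors who want to sort proxies into these buckets, one possible rule is sketched below in Python. It is an assumption of this sketch, not the paper's stated procedure: a shift counts as real when its confidence interval excludes zero, two real shifts with the same sign are labeled polarity-independent, and any other real effect is labeled polarity-dependent. The function names and example numbers are invented.

```python
def excludes_zero(ci: tuple[float, float]) -> bool:
    """True when a confidence interval lies entirely above or below zero."""
    low, high = ci
    return low > 0 or high < 0


def classify_proxy(
    pos_mean: float,
    pos_ci: tuple[float, float],
    neg_mean: float,
    neg_ci: tuple[float, float],
) -> str:
    """Label a proxy by how its positive and negative framings move ESI."""
    pos_sig = excludes_zero(pos_ci)
    neg_sig = excludes_zero(neg_ci)
    if not pos_sig and not neg_sig:
        return "negligible"
    # Both framings move the score, and in the same direction.
    if pos_sig and neg_sig and (pos_mean > 0) == (neg_mean > 0):
        return "polarity-independent"
    # Opposite directions, or only one framing moves the score.
    return "polarity-dependent"


# Invented numbers for illustration only.
print(classify_proxy(0.2, (0.1, 0.3), -0.2, (-0.3, -0.1)))         # polarity-dependent
print(classify_proxy(-0.2, (-0.3, -0.1), -0.15, (-0.25, -0.05)))   # polarity-independent
print(classify_proxy(0.01, (-0.05, 0.07), -0.02, (-0.08, 0.04)))   # negligible
```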
What the evidence actually supports
The paper’s evidence has three layers: the controlled proxy insertion test, the net shift analysis, and the ambulance-arrival example.
| Paper element | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| System prompt defining ESI role and response format | Implementation detail | The model was instructed to use standard ESI logic and return structured output | That the model truly reasons like a trained triage nurse |
| Patient scenario template | Implementation detail | The input structure was controlled across baseline, positive, and negative conditions | That all real-world ED note formats would behave the same way |
| Figure 3: shifts from default under positive and negative qualifiers | Main evidence | Many proxy qualifiers move ESI predictions relative to baseline; shifts can be polarity-dependent or polarity-independent | That each proxy causes real-world patient harm by itself |
| Figure 4: net negative-minus-positive shift | Main evidence / contrast test | Roughly three-quarters of proxy variables show statistically significant net bias by the paper’s criterion | That the direction and magnitude would persist across all models or hospitals |
| Figure 5: ambulance use by race and acuity | Exploratory extension / pathway illustration | A proxy variable can differ across racial groups even at similar acuity levels, creating a route for unequal outputs | That ambulance arrival explains all racial disparities in triage |
The most important result is not that one particular phrase changes one particular score. It is that the audit framework repeatedly finds non-zero movement when non-clinical context is added.
Figure 3 reports mean ESI shifts with 95% confidence intervals for negative and positive qualifiers relative to default scenarios. The paper interprets bars whose confidence intervals do not cross zero as statistically significant at $\alpha = 0.05$. Figure 4 then compares the two conditions directly by subtracting the positive-qualifier shift from the negative-qualifier shift, and describes the resulting difference as net “bias.” According to the paper, about three-quarters of the proxy variables show statistically significant net bias.
That is the main evidence: controlled prompt perturbation produces systematic acuity movement.
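As a rough illustration of the statistics described above, the following Python sketch computes a mean shift with a 95% interval and a paired net shift. The normal-approximation interval (z = 1.96) is an assumption made here for simplicity; this article does not state which interval method the paper uses. The sample numbers are invented.

```python
import math
import statistics


def mean_with_ci(values: list[float], z: float = 1.96) -> tuple[float, tuple[float, float]]:
    """Mean and normal-approximation confidence interval (assumed method)."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, (mean - z * sem, mean + z * sem)


def net_shift(negative_shifts: list[float], positive_shifts: list[float]):
    """Paired negative-minus-positive shift across the same scenarios."""
    if len(negative_shifts) != len(positive_shifts):
        raise ValueError("shift lists must cover the same scenarios")
    diffs = [neg - pos for neg, pos in zip(negative_shifts, positive_shifts)]
    return mean_with_ci(diffs)


def is_significant(ci: tuple[float, float]) -> bool:
    """Interval excludes zero, matching the reading of the figures."""
    low, high = ci
    return low > 0 or high < 0


# Invented per-scenario shifts from default, for one proxy variable.
negative_shifts = [-1, 0, -1, -1, 0, -1]
positive_shifts = [0, 0, 1, 0, 0, 1]

mean, ci = net_shift(negative_shifts, positive_shifts)
print(f"net shift {mean:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f}), "
      f"significant: {is_significant(ci)}")
```

With only six invented observations the interval is wide, which is one reason the paper turns to larger restricted-access data where more statistical power is needed.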
Now, careful wording matters. A shift in ESI prediction is not the same as a proven change in patient outcomes. The experiment does not show that patients would actually wait longer, receive fewer resources, or experience harm in a deployed ED system. It shows something upstream: a model’s triage judgment is sensitive to proxy context that should not independently determine acuity. In clinical AI governance, upstream failure modes are exactly where one would prefer to catch the problem. Waiting for downstream harm is a rather expensive validation strategy, generally popular only among people who do not pay the malpractice bill.
Ambulance arrival shows how proxy bias becomes group bias
The ambulance example is the paper’s clearest bridge from token-level behavior to social impact.
The model, in the proxy tests, treats ambulance arrival as a cue of higher severity. That is not absurd on its face. In real ED operations, ambulance arrival can correlate with serious events. But it is also shaped by access, geography, insurance, trust, cost, and local emergency-service infrastructure. In other words, it is not pure physiology. It is physiology mixed with society, billing, habit, and sometimes fear.
The paper then examines ambulance use across racial groups at similar acuity levels and reports that White patients are more likely than Black patients to arrive by ambulance for the same acuity level. If the model systematically interprets ambulance arrival as higher severity, then this population-level difference can become an inference-time pathway for unequal triage estimates.
No explicit race token is required. The model can be race-sensitive without “using race,” because it uses a proxy whose distribution is socially patterned.
That is the misconception the paper usefully attacks. Removing protected attributes from the prompt does not make the system neutral. Sometimes it merely makes the bias harder to audit.
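A subgroup pathway check of this kind can be sketched as a stratified rate table. The Python below is illustrative only: the field names (`group`, `esi`, `ambulance`) are hypothetical, the group labels are placeholders, and the records are synthetic rather than drawn from MIMIC-IV-ED or from Figure 5.

```python
from collections import defaultdict


def proxy_rate_by_stratum(records, group_key, acuity_key, proxy_key):
    """Share of records with the proxy present, per (acuity, group) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [proxy present, total]
    for record in records:
        cell = (record[acuity_key], record[group_key])
        counts[cell][0] += int(bool(record[proxy_key]))
        counts[cell][1] += 1
    return {cell: present / total for cell, (present, total) in counts.items()}


# Synthetic records: same acuity level, two placeholder groups.
records = [
    {"group": "group_a", "esi": 3, "ambulance": True},
    {"group": "group_a", "esi": 3, "ambulance": True},
    {"group": "group_a", "esi": 3, "ambulance": False},
    {"group": "group_a", "esi": 3, "ambulance": False},
    {"group": "group_b", "esi": 3, "ambulance": True},
    {"group": "group_b", "esi": 3, "ambulance": False},
    {"group": "group_b", "esi": 3, "ambulance": False},
    {"group": "group_b", "esi": 3, "ambulance": False},
]

rates = proxy_rate_by_stratum(records, "group", "esi", "ambulance")
for (esi, group), rate in sorted(rates.items()):
    print(f"ESI {esi}, {group}: {rate:.0%} arrived by ambulance")
```

If the rates differ within an acuity level and the model also reacts to the proxy, the two facts together describe an indirect route to unequal scores.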
The business problem is not model morality; it is input governance
For hospitals and clinical AI vendors, the immediate takeaway is not “never use LLMs in triage.” That would be too easy, and therefore suspiciously satisfying. The better takeaway is operational: triage AI needs a proxy-variable audit layer before deployment.
A hospital deploying an LLM triage assistant must decide which input fields are allowed to influence output and which fields are merely operational context. That boundary is not always obvious. Prior hospitalization, recent ED use, or transport mode can sometimes be clinically relevant. But relevance must be argued, measured, and constrained. It cannot be smuggled in because a language model finds the phrase statistically familiar.
The practical governance question becomes:
Which contextual variables are causally or clinically justified for the decision being made, and which ones are social residue pretending to be signal?
A useful deployment checklist would include at least four controls.
| Control | What it does | Why it matters |
|---|---|---|
| Proxy-variable stress testing | Runs paired positive/negative context perturbations while holding clinical facts fixed | Detects whether non-clinical descriptors move model output |
| Context ablation | Removes or masks selected fields and compares predictions | Separates clinical necessity from statistical convenience |
| Field-level policy | Defines which fields may influence triage and which should be excluded, transformed, or only shown after scoring | Prevents “everything in the note” from becoming model evidence |
| Subgroup pathway analysis | Tests whether proxy distributions differ across protected or vulnerable groups | Finds indirect discrimination routes that direct fairness tests miss |
This is not just compliance hygiene. It affects operational efficiency.
Over-triage can consume scarce ED resources. Under-triage can delay urgent care. A model that reacts to proxies may create both errors simultaneously: raising urgency for some patients because context sounds alarming, lowering urgency for others because context sounds stable, and doing both without a clinically valid reason. That is not a fairness problem sitting politely in a separate ethics report. It is queue management, staffing, liability, quality control, and patient safety.
In business language, proxy sensitivity is a hidden cost center.
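Two of the controls in the checklist above, field-level policy and context ablation, are simple enough to sketch in Python. The allowlist, the field names, and `toy_scorer` are all hypothetical; the scorer is a stand-in for a model call and is deliberately wired to react to arrival mode so that the ablation has something to find.

```python
from typing import Callable, Mapping

Record = Mapping[str, object]

# Hypothetical allowlist; a real one would be set by clinical governance.
SCORING_FIELDS = frozenset({"chief_complaint", "age", "heart_rate", "spo2"})


def apply_field_policy(record: Record, allowed: frozenset[str]) -> tuple[dict, dict]:
    """Split a record into fields sent to the scorer and fields withheld."""
    scoring = {k: v for k, v in record.items() if k in allowed}
    withheld = {k: v for k, v in record.items() if k not in allowed}
    return scoring, withheld


def ablation_shift(record: Record, field: str, score_fn: Callable[[Record], int]) -> int:
    """Score without `field` minus score with it; nonzero means the field mattered."""
    ablated = {k: v for k, v in record.items() if k != field}
    return score_fn(ablated) - score_fn(record)


def toy_scorer(record: Record) -> int:
    """Stand-in for a model call; deliberately reacts to arrival mode."""
    return 2 if record.get("arrival_mode") == "ambulance" else 3


record = {
    "chief_complaint": "abdominal pain",
    "age": 54,
    "heart_rate": 88,
    "spo2": 98,
    "arrival_mode": "ambulance",
    "insurance": "uninsured",
}

scoring, withheld = apply_field_policy(record, SCORING_FIELDS)
print(sorted(withheld))                                      # ['arrival_mode', 'insurance']
print(ablation_shift(record, "arrival_mode", toy_scorer))    # 1
```

Here removing `arrival_mode` raises the toy score from 2 to 3, so the field was pushing the score toward higher urgency. Whether that influence is clinically justified is the policy question, and the code only makes it visible.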
Why prompt instructions are not enough
The system prompt in the paper tells the model not to assume information that is not explicitly provided and to use standard ESI principles. That is a reasonable instruction. It is also insufficient.
This is one of the paper’s more useful lessons for enterprise AI beyond healthcare. Instructions define the desired behavior; they do not erase learned associations. A model can comply with the format, cite the right framework, and still overweight irrelevant context. The JSON can be perfect while the judgment is quietly contaminated. Very modern, very tidy, very dangerous.
The same pattern appears in other business domains. A credit-support AI may be told not to discriminate, but still react to job history, address language, or communication style. A hiring assistant may avoid protected categories, but still treat school prestige, employment gaps, or dialect markers as pseudo-performance signals. A procurement assistant may not explicitly penalize small vendors, but still infer risk from sparse documentation.
The clinical setting is simply less forgiving because the output can affect care priority. The mechanism is broader: LLMs trained on human text inherit associations from the world, then apply them inside workflows that pretend to be cleaner than the world.
Where this paper should be used carefully
The paper is useful, but its boundaries should be kept visible.
First, the model tested is gpt-4o-mini. The result should not be generalized mechanically to every LLM, every model size, every medical-tuned system, or every deployment architecture. Different models may have different sensitivities. A retrieval-augmented clinical system with strict field filtering may behave differently from a direct prompt-based scorer.
Second, the proxy qualifiers are constructed. That is a strength for causal probing, because it allows controlled comparison. It is also a boundary: real ED notes are messier, longer, and often contain context that is partly clinical and partly social. The audit tells us that proxy wording can matter; it does not fully characterize all real-world documentation pathways.
Third, the measured outcome is model-assigned ESI shift, not patient outcome. The paper does not prove that a deployed model would change treatment decisions, mortality, wait time, or resource allocation. It identifies a failure mode that could feed into those outcomes if the model were used for triage support without guardrails.
Fourth, some proxy phrases may contain information that clinicians would reasonably consider in specific cases. For example, recent hospitalization or delayed care may sometimes indicate clinical risk. The key is not to ban all context. The key is to prevent non-causal context from becoming unexamined scoring authority.
Finally, Figure 5’s ambulance example is best read as a pathway illustration, not a complete causal model of racial triage disparity. It shows how one socially patterned proxy can transmit unequal treatment logic. It does not claim that this is the only or dominant pathway.
These boundaries do not weaken the paper’s practical value. They tell us how to use it: as an audit design, not as a universal verdict.
The deployment lesson: audit the quiet variables
The most useful thing about this paper is its audit pattern. It gives clinical AI teams a way to move beyond the usual fairness theater.
A weak audit asks whether race or gender fields change outputs.
A stronger audit asks whether the model changes outputs when socially patterned, clinically questionable context appears.
A serious audit goes one step further: it tests whether those proxy variables are unevenly distributed across patient groups, and whether the model’s response to them could create different care recommendations for clinically similar cases.
That sequence matters:
1. Identify proxy variables.
2. Create controlled positive and negative variants.
3. Hold the clinical scenario fixed.
4. Measure output shift from baseline.
5. Test whether proxy exposure differs across groups.
6. Decide whether each field should be removed, transformed, constrained, or justified.
This is the difference between checking whether the front door is locked and noticing that the windows are open.
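The measurement steps of that sequence can be tied together in a small audit loop. The Python sketch below assumes a deterministic scoring function passed in as `score_fn`; in practice this would wrap a model call and would likely need repeated sampling. `run_proxy_audit`, `toy_scorer`, the scenarios, and the qualifier sentences are all hypothetical.

```python
from statistics import fmean
from typing import Callable


def run_proxy_audit(
    scenarios: list[str],
    qualifiers: dict[str, tuple[str, str]],
    score_fn: Callable[[str], int],
) -> dict[str, dict[str, float]]:
    """Mean ESI shift from baseline per proxy, for positive and negative variants."""
    if not scenarios:
        raise ValueError("need at least one scenario")
    baseline = [score_fn(s) for s in scenarios]
    report = {}
    for name, (positive, negative) in qualifiers.items():
        # Clinical text stays fixed; only the appended qualifier changes.
        pos = fmean(score_fn(f"{s} {positive}") - b for s, b in zip(scenarios, baseline))
        neg = fmean(score_fn(f"{s} {negative}") - b for s, b in zip(scenarios, baseline))
        report[name] = {"positive_shift": pos, "negative_shift": neg, "net_shift": neg - pos}
    return report


def toy_scorer(text: str) -> int:
    """Stand-in for a model call; deliberately reacts to the word 'ambulance'."""
    return 2 if "ambulance" in text.lower() else 3


scenarios = [
    "Chief complaint: abdominal pain. Age: 54. HR 88, SpO2 98%.",
    "Chief complaint: ankle injury. Age: 31. HR 76, SpO2 99%.",
]
qualifiers = {
    "transport_mode": ("Arrived by private car.", "Arrived by ambulance."),
    "primary_language": ("Speaks English fluently.", "Requires an interpreter."),
}

report = run_proxy_audit(scenarios, qualifiers, toy_scorer)
for name, row in report.items():
    print(name, row)
```

The group-exposure test and the decision about each field sit outside this loop, because they depend on patient data and clinical judgment rather than on model output alone.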
For hospitals, the message is clear: do not let every piece of patient context flow into an LLM simply because it is available in the record. For vendors, proxy-sensitivity tests should become part of validation evidence. For regulators, fairness review should include indirect pathways, not only protected-attribute masking. And for executives, the ROI case is not “ethical AI,” though that sounds lovely on a slide. The ROI case is fewer avoidable triage errors, cleaner governance, lower liability exposure, and better trust in clinical decision support.
The model does not need to hate anyone to make an unfair recommendation. It only needs to be statistically obedient to a biased world.
That is precisely why proxy-variable audits matter.
Cognaptus: Automate the Present, Incubate the Future.
[1] Ethan Zhang, “Uncovering Latent Bias in LLM-Based Emergency Department Triage Through Proxy Variables,” arXiv:2601.15306, 2026.