Lost in Translation: When Multilingual LLMs Miss the Medical Plot

Accuracy is a seductive number.

It is tidy, executive-friendly, and easy to put in a slide deck. A model gets 82% accuracy, someone says “good enough,” and suddenly a clinical workflow is being “transformed.” Healthcare, as usual, has a way of punishing this kind of optimism. Not loudly at first. Quietly. Through false negatives, silent majority-class prediction, and a dashboard that looks reassuring until someone asks the rude question: what exactly did the model miss?

That is the useful discomfort in “Are LLMs Truly Multilingual? Exploring Zero-Shot Multilingual Capability of LLMs for Information Retrieval: An Italian Healthcare Use Case.”¹ The paper does not test a glamorous medical reasoning agent. It tests something more operationally familiar: can open-source multilingual LLMs extract comorbidities from Italian electronic health records in a zero-shot, on-premises setting?

The answer is not “LLMs are useless.” That would be too easy, and also wrong. The sharper answer is: in this setting, headline accuracy is not enough evidence that a multilingual LLM understands clinical text well enough to replace a validated extraction pipeline.

That distinction matters. Many healthcare AI discussions still carry a hidden assumption: if a model is multilingual, strong on leaderboards, and semantically richer than regular expressions, it should be able to read non-English clinical notes better than old pattern-matching rules. After all, regex is the duct tape of NLP. Surely a modern LLM can beat duct tape.

This paper is a reminder that duct tape, when designed by clinicians who know the hospital’s language habits, can be annoyingly hard to replace.

The study asks a practical question, not a leaderboard question

The paper’s setup is deliberately close to a realistic hospital constraint. The authors use 8,223 Italian anamnesis records from electronic health records, focusing on five clinically relevant comorbidities in the cardiac domain:

Comorbidity in Italian	English meaning
Fibrillazione atriale	Atrial fibrillation
Insufficienza renale	Kidney failure
BPCO / Broncopneumopatia cronica ostruttiva	COPD
Diabete mellito	Diabetes mellitus
Ipertensione arteriosa	Hypertension

The extraction task is binary: for each record and each comorbidity, determine whether the condition is present.

The models are also chosen for a realistic reason. Because healthcare data creates privacy and licensing constraints, the study uses open-source models that can run on-premises: OpenLLaMA 3B and 7B, Mistral 7B, Mixtral 8x7B, and Qwen2.5 3B and 7B. The authors use a zero-shot setup with a standard prompt, and they classify one comorbidity at a time rather than asking the model to identify all conditions in one pass.

That design choice is important. The paper is not asking whether the best possible prompt-engineered or fine-tuned clinical LLM can solve the task. It is asking whether a clinician or hospital team can take a multilingual open-source model, apply a simple direct prompt, and safely use it for comorbidity extraction.

That is a much more dangerous question, because it resembles how AI often enters organizations: not through a polished research protocol, but through a practical hope that a general model can remove a boring workflow.
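The one-condition-at-a-time setup described above can be sketched as a simple loop. This is a minimal illustration, not the paper's code: the prompt wording, the `ask_model` callable, the answer parser, and the `fake_model` stand-in are all hypothetical, since the article does not reproduce the study's actual prompt or inference stack.

```python
from typing import Callable, Dict, List

# Condition names as listed in the table above.
COMORBIDITIES: List[str] = [
    "fibrillazione atriale",
    "insufficienza renale",
    "BPCO",
    "diabete mellito",
    "ipertensione arteriosa",
]

# Hypothetical prompt wording; the study's real prompt is not shown in the article.
PROMPT_TEMPLATE = (
    "Read the following Italian clinical note and answer only 'yes' or 'no'.\n"
    "Does the patient have {condition}?\n\nNote:\n{note}\n"
)


def parse_answer(raw: str) -> int:
    """Map a free-text reply to 1 (present) or 0 (absent)."""
    words = raw.strip().lower().split()
    first = words[0].strip(".,;:!") if words else ""
    # Anything that is not a clear yes is treated as 0 (an assumption).
    return 1 if first in {"yes", "si", "sì"} else 0


def classify_record(note: str, ask_model: Callable[[str], str]) -> Dict[str, int]:
    """Ask one binary question per comorbidity instead of one multi-label question."""
    labels: Dict[str, int] = {}
    for condition in COMORBIDITIES:
        prompt = PROMPT_TEMPLATE.format(condition=condition, note=note)
        labels[condition] = parse_answer(ask_model(prompt))
    return labels


def fake_model(prompt: str) -> str:
    """Stand-in for an on-premises LLM: says yes only on a literal mention."""
    question, note = prompt.split("Note:\n", 1)
    condition = question.split("have ", 1)[1].split("?", 1)[0]
    return "Yes." if condition.lower() in note.lower() else "No."


example = classify_record("Paziente con diabete mellito e BPCO.", fake_model)
print(example)
```

One design detail is worth noticing even in a sketch: mapping every unparseable reply to 0 quietly pushes the system toward the negative class, which is exactly the kind of behavior aggregate accuracy will not reveal.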

The first baseline is boring, but not naive

Before testing the LLMs, the authors build a regular expression-based annotation pipeline. This is not a straw-man baseline assembled by someone who has never seen a clinical note. The regex patterns are created with clinician involvement, because clinical records use abbreviations, variants, and domain-specific phrasing.

For example, diabetes may appear as “Diabetes mellitus,” “DM,” type-specific diabetes, insulin-dependent diabetes, or disease variants with complications. The paper uses this example to make a familiar but often forgotten point: pattern matching in medicine is not just string matching. It is institutional language engineering.
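To make the diabetes example concrete, here is a minimal sketch of what such a pattern might look like. The patterns are illustrative guesses at common variants, not the clinician-built patterns from the paper, and the function name `has_diabetes_mention` is hypothetical.

```python
import re

# Illustrative patterns only; NOT the clinician-designed patterns from the study.
DIABETES_PATTERN = re.compile(
    r"""
      \bdiabet\w*                        # diabete, diabetes, diabetico, ...
    | \bDM(?:\s?(?:tipo\s?)?[12])?\b     # DM, DM2, DM tipo 2
    | \bN?IDDM\b                         # IDDM / NIDDM abbreviations
    """,
    re.IGNORECASE | re.VERBOSE,
)


def has_diabetes_mention(note: str) -> bool:
    """Return True if the note contains any of the illustrative diabetes variants."""
    return DIABETES_PATTERN.search(note) is not None


print(has_diabetes_mention("Paziente con DM tipo 2 in terapia insulinica"))
print(has_diabetes_mention("Terapia con DMARD per artrite"))
```

Even this toy version shows why clinician input matters. The word boundary after `DM` keeps an unrelated abbreviation such as DMARD from matching, and the sketch does nothing about negated mentions, which a production pattern set would have to handle.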

The regex labels then serve as an automated reference over the 8,223 records. This reference is imperfect, so the authors also manually annotate a subset of 100 records that regex classified as negative, with two clinicians reviewing the five comorbidities case by case until agreement.

This creates an evidence ladder:

Evidence layer	Likely purpose	What it supports	What it does not prove
Regex annotation over 8,223 records	Main scalable reference	How LLMs compare against a clinician-designed extraction baseline	That regex is flawless ground truth
Manual annotation of 100 regex-negative records	Ground-truth check on false negatives	Whether regex misses clinically relevant positives	Full-dataset human-labeled performance
Accuracy comparison	Initial screening metric	Whether model outputs appear aligned at a high level	Whether the model catches positives reliably
Precision, recall, and F1 for class 1	Main diagnostic evidence	Whether positive comorbidity extraction works	Full semantic competence across clinical language
Confusion matrices on the manual subset	Diagnostic / failure-mode analysis	Whether models are learning or defaulting to a class pattern	Generalized model behavior across all hospitals or tasks

The sequencing matters. If one reads only the accuracy charts, the models can look competitive. If one reads the precision, recall, F1 scores, and confusion matrices, the story changes.

That is the whole plot. The medical plot, as it happens.

The first accuracy result looks more encouraging than it is

When compared against the automated regex labels, the overall accuracy results initially seem to leave room for optimism. OpenLLaMA 3B and Mixtral 8x7B perform poorly overall, below 35% accuracy. But OpenLLaMA 7B, Mistral 7B, Qwen2.5 3B, and Qwen2.5 7B all show overall accuracy above 70%. Mistral 7B reaches 82.67% against the regex reference.

That number sounds respectable. In another context, someone might put it in a procurement memo.

Then the paper opens the box.

For the positive class — the clinically interesting class, because the task is to detect whether a comorbidity is present — the behavior is uneven and often troubling.

Model	Positive-class behavior against regex labels	Practical interpretation
OpenLLaMA 3B	Recall is 1.0 for all five comorbidities, but precision is very low for several conditions	It catches positives by over-predicting positives; that creates many false alarms
OpenLLaMA 7B	Better than 3B, but precision and recall remain weak or uneven across comorbidities	Scaling helps, but does not solve generalization
Mistral 7B	Very high precision for several conditions, but recall is low for kidney failure, COPD, diabetes, and hypertension	It avoids many false positives but misses many real positives
Mixtral 8x7B	High recall for most comorbidities, but low precision	Larger or more advanced architecture does not guarantee useful extraction
Qwen2.5 3B	Precision, recall, and F1 are zero for all five comorbidities	High overall accuracy can coexist with failure to identify positives
Qwen2.5 7B	Almost all positive-class metrics are zero, except a small signal for atrial fibrillation	The model largely fails the clinically relevant class

This is the metric trap. Accuracy can reward a model for being correct on the dominant negative class while failing the positive cases that the workflow actually cares about. In a balanced textbook dataset, this is already a problem. In clinical extraction, it becomes operationally dangerous.

Mistral 7B illustrates the subtle version of the trap. It has strong precision: when it predicts a positive, it is often right. But its recall is weak for several comorbidities. Against regex labels, its recall is 0.11 for kidney failure, 0.15 for COPD, and 0.42 for both diabetes and hypertension. In a screening-style extraction task, low recall means missed conditions.

OpenLLaMA 3B shows the opposite failure mode. It has perfect recall across comorbidities, but low precision for several of them. It finds positives by calling too many things positive. That is not understanding. That is a smoke alarm that goes off whenever someone makes toast.

Qwen2.5 shows the most executive-dashboard-friendly failure. Overall accuracy can look respectable, but positive-class metrics collapse. If a model mostly predicts “not present” in a dataset where many condition-record pairs are negative, accuracy can look fine while the model does not do the job.

So the first important result is not “Mistral beats OpenLLaMA” or “Qwen underperforms.” The important result is that model ranking depends heavily on which metric is allowed to speak.
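The three failure profiles described in this section can be reproduced on synthetic labels. The sketch below uses invented data (10 positives among 100 condition-record pairs) and three stylized predictors; none of the numbers come from the paper. They only show how accuracy and positive-class metrics can point in opposite directions.

```python
from typing import Dict, List


def class1_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy plus precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Synthetic data: 10 positives followed by 90 negatives.
y_true = [1] * 10 + [0] * 90

predictors = {
    "always_negative": [0] * 100,          # never reports the condition
    "always_positive": [1] * 100,          # reports the condition everywhere
    "cautious": [1] * 2 + [0] * 98,        # finds 2 of 10 positives, no false alarms
}

results = {name: class1_metrics(y_true, pred) for name, pred in predictors.items()}
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

On this toy data the always-negative predictor scores 90% accuracy with zero recall, the always-positive predictor scores perfect recall with 10% accuracy, and the cautious predictor posts the best accuracy while missing 8 of 10 positives. Ranking them by accuracy alone would pick a system that misses most of what the workflow is looking for.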

Manual annotation makes the reference cleaner, not the model safer

The authors know regex is imperfect. That is why they manually review 100 regex-negative records. This review is not a decorative validation step. It targets the part of the pipeline where clinical risk is most obvious: false negatives.

The manual review finds that regex misses some positives. Hypertension has the most false negatives, followed by atrial fibrillation; for both, the paper reports a misclassification rate of 10% or more in the reviewed regex-negative subset. Kidney failure, COPD, and diabetes show lower false-negative rates, at 4% or less. Measured against manual annotation, the regex approach reaches 92.2% overall accuracy on this subset. COPD performs best at 99%, while hypertension is weakest at 80%.

This is a useful result because it prevents the article from becoming a cartoon in which regex is perfect and LLMs are bad. The regex system has blind spots. It misses some clinically meaningful mentions. Its performance varies by comorbidity.
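The reported percentages can be turned back into counts with a little arithmetic. The sketch below rests on an assumption the article implies but does not state outright: that each of the 100 reviewed records was regex-negative for all five comorbidities, so every regex error on the subset is a missed positive. Only the accuracy figures quoted above are used; the function name is hypothetical.

```python
N_REVIEWED = 100        # manually reviewed regex-negative records
N_COMORBIDITIES = 5

# Per-comorbidity regex accuracy against manual annotation, as quoted above.
reported_accuracy = {"COPD": 0.99, "hypertension": 0.80}
reported_overall_accuracy = 0.922


def implied_missed_positives(accuracy: float, n_decisions: int) -> int:
    """If every prediction is negative, each error is a missed positive."""
    return round((1 - accuracy) * n_decisions)


missed = {
    name: implied_missed_positives(acc, N_REVIEWED)
    for name, acc in reported_accuracy.items()
}
overall_missed = implied_missed_positives(
    reported_overall_accuracy, N_REVIEWED * N_COMORBIDITIES
)

print(missed)
print(overall_missed)
```

Under that assumption, the figures imply 1 missed COPD case and 20 missed hypertension cases among the 100 records, and 39 missed positives across the 500 record-condition decisions. It also implies something less comfortable: any system that answered "not present" for everything would post the same accuracy on this subset.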

But here is the less convenient part: even after moving to the manual reference subset, the LLMs still do not clearly become safe replacements.

On the manual comparison, several LLMs show high overall accuracy. OpenLLaMA 7B, Mistral 7B, Mixtral 8x7B, and both Qwen2.5 models exceed 80% overall accuracy, while OpenLLaMA 3B remains below 10%. The paper also reports that Mistral 7B and OpenLLaMA 7B improve by about 10 percentage points compared with their automated-label comparison; Qwen2.5 models improve by about 20 points; Mixtral 8x7B improves by 47.48 points.

Again, it sounds promising.

Then the confusion matrix ruins the party, as good diagnostics often do.

The authors compare confusion matrices for the least accurate and best-performing models in the manual setting. OpenLLaMA 3B tends to classify comorbidities as positive. Mistral 7B tends to classify most comorbidities as negative on the manually annotated subset. The paper concludes that even the model that appears to generalize well in parts of the regex comparison does not show real semantic understanding when checked against manual annotation; it largely predicts the majority class.

This is the second major lesson: manual validation does not merely correct the baseline. It exposes whether model accuracy comes from extraction or from class habit.

A model that gets many negatives right may be acceptable for some low-risk filtering tasks. But for comorbidity extraction, especially when the operational purpose is clinical review, cohort building, risk stratification, or downstream analytics, missed positives can distort the entire workflow.
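One way to separate extraction from class habit is to compare a model's accuracy with the accuracy of a predictor that always outputs the most common label. The sketch below uses invented labels (20 positives in 100 records) and invented predictions; the helper names are hypothetical and nothing here reproduces the paper's confusion matrices.

```python
from collections import Counter
from typing import List, Tuple


def confusion_matrix(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tn, fp, fn, tp) for binary labels."""
    counts = Counter(zip(y_true, y_pred))
    return counts[(0, 0)], counts[(0, 1)], counts[(1, 0)], counts[(1, 1)]


def majority_class_accuracy(y_true: List[int]) -> float:
    """Accuracy of a predictor that always outputs the most common label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Synthetic subset: 20 positives followed by 80 negatives.
y_true = [1] * 20 + [0] * 80
# Synthetic model output: 3 of 20 positives found, 2 false alarms among 80 negatives.
y_pred = [1] * 3 + [0] * 17 + [0] * 78 + [1] * 2

tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
baseline = majority_class_accuracy(y_true)
lift = accuracy - baseline
predicted_positive_rate = (tp + fp) / len(y_pred)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.2f} baseline={baseline:.2f} lift={lift:.2f}")
print(f"predicted positive rate={predicted_positive_rate:.2f}")
```

In this toy case the model reaches 81% accuracy, but the always-negative baseline already reaches 80%. A lift of one percentage point, combined with a predicted positive rate of 5% against a true rate of 20%, is the signature of a model leaning on the majority class.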

The paper is really about evidence discipline

A lazy summary of the paper would say: “Six multilingual LLMs were tested on Italian EHRs, and they did not beat regex.”

That is true, but too flat. The more useful reading is that the paper demonstrates an evidence discipline for evaluating clinical LLMs.

The study asks three questions:

Can LLMs extract comorbidities from Italian EHRs in zero-shot mode?
Can they substitute for a regular expression-based approach?
Is there a best model among the selected models?

The evidence answers them with increasing skepticism.

Yes, some LLMs produce outputs that align with references at a superficial level. No, they do not safely substitute for the regex approach in this setting. And while Mistral 7B appears strongest in some comparisons, its manual-subset behavior weakens the idea that there is a clean “best” model for this task.

The paper’s strongest contribution is not that it discovers a universal flaw in multilingual LLMs. It does not. The stronger contribution is narrower and more useful: it shows that in a zero-shot, on-premises, Italian clinical extraction workflow, model evaluation must move past aggregate accuracy before anyone makes deployment claims.

That sounds obvious. Many expensive mistakes do.

Why multilingual capability does not equal clinical extraction capability

The reader misconception here is understandable. Multilingual LLMs are trained across many languages. Clinical notes are text. Comorbidity extraction is a classification task. Therefore, the model should use semantic understanding to outperform brittle rules.

The missing piece is that “understanding Italian” is not the same as understanding Italian clinical documentation in a specific hospital context.

Clinical notes are compressed, local, abbreviated, repetitive, and uneven. They may contain disease names, acronyms, medication clues, historical mentions, negated mentions, and institution-specific conventions. The paper does not need to prove every one of these mechanisms separately to show the operational result: the tested models do not generalize consistently across comorbidities under a simple zero-shot prompt.

There is also a second issue: the deployment constraint changes the model universe. Hospitals may prefer on-premises models for privacy, licensing, and governance reasons. That pushes the evaluation toward open-source models that can be locally deployed. The paper therefore tests a practical class of models, not necessarily the strongest closed proprietary systems available through APIs.

That distinction should not be brushed aside. It is tempting to respond, “A better model would solve this.” Perhaps. But “use a better model” is not a governance strategy. It is a procurement reflex wearing a lab coat.

For a hospital or health-tech company, the relevant question is not whether some model somewhere could do better. The relevant question is whether the model being deployed, under the constraints actually faced, has been validated for the clinical text it will process.

This paper says: not by accuracy alone.

The business meaning is governance-first, not LLM-last

The practical lesson for healthcare organizations is not “never use LLMs for EHR extraction.” It is “do not replace a validated extraction pipeline with a zero-shot LLM just because the model is multilingual and locally deployable.”

There are at least four business implications.

Business decision	What the paper directly shows	Cognaptus interpretation	Boundary
Replacing regex with zero-shot LLM extraction	The tested LLMs do not reliably outperform clinician-designed regex and show unstable positive-class behavior	Replacement is not justified without task-specific validation	Applies to six open-source models, Italian cardiac anamnesis text, five comorbidities
Using accuracy as the main KPI	High accuracy can hide poor recall, false positives, or majority-class prediction	KPI design must include class-specific precision, recall, F1, and confusion matrices	Exact thresholds depend on clinical use case
Choosing a model from leaderboards	Models selected for multilingual capability still vary sharply by comorbidity and metric	Leaderboards are weak evidence for local clinical extraction readiness	The paper does not test all possible models
Keeping regex in the workflow	Regex reaches 92.2% overall accuracy against manual review but misses some positives	Regex should be audited and improved, not dismissed as obsolete	Manual review subset is limited

This creates a more sober adoption pattern.

LLMs may still be useful as assistive tools. They can help identify candidate regex gaps, surface suspicious negative cases for review, suggest synonym expansions, or support a human-in-the-loop annotation process. They may also perform better with in-context learning, fine-tuning, better prompt design, or domain adaptation. The authors explicitly leave future work in that direction.

But the paper does not support the idea that a hospital can take an open-source multilingual model, give it a simple prompt, and let it replace a clinician-designed pattern-matching pipeline for comorbidity extraction.

That is not conservatism. That is reading the confusion matrix.

The uncomfortable ROI lesson: cheap automation can make expensive data

The business case for LLM-based EHR extraction is easy to imagine. Manual review is expensive. Regex requires expert maintenance. LLMs promise flexible extraction across diseases, languages, and document styles. If they work, they could reduce annotation cost, accelerate cohort discovery, and improve downstream analytics.

But the paper points to a quieter cost: unreliable automation can create cheap structured data that is expensive to trust.

If an extraction system misses kidney failure or hypertension cases, the downstream damage may not appear immediately. It may surface later as biased cohort selection, inaccurate risk profiles, weak quality metrics, or flawed research datasets. The organization gets a clean table, but the table lies politely.

That is why the proper ROI question is not:

Can an LLM extract comorbidities faster than clinicians?

The better question is:

Can the LLM produce structured clinical variables with measurable error behavior that is acceptable for the downstream decision?

Those are different questions. The first is about labor substitution. The second is about information reliability.

Regex has a cost: it needs domain experts, maintenance, and adaptation to new language patterns. LLMs have a different cost: validation, monitoring, prompt/version control, drift detection, and failure-mode analysis. In clinical data workflows, the cheaper-looking option is not necessarily cheaper once error governance is priced in.

The study therefore supports a staged adoption model:

Keep the existing validated extraction pipeline as the reference system.
Use LLMs to generate candidate cases or candidate pattern expansions.
Evaluate class-specific precision and recall, not only accuracy.
Audit performance separately by disease category.
Use clinician review on strategically chosen disagreement cases.
Consider prompt engineering, in-context examples, or fine-tuning only after baseline failure modes are understood.
Monitor the deployed system continuously if it ever reaches production.

This is less glamorous than “AI reads the medical record.” It is also more likely to survive contact with actual medical records.
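The clinician-review step in the staged model above can be sketched as a disagreement queue between the two systems. This is a hypothetical illustration: the data layout, the function name, and the ordering rule are assumptions, not something the paper prescribes.

```python
from typing import Dict, List, Tuple

# record_id -> {comorbidity: 0 or 1}
Labels = Dict[str, Dict[str, int]]
# (record_id, comorbidity, regex_label, llm_label)
Disagreement = Tuple[str, str, int, int]


def disagreement_queue(regex_labels: Labels, llm_labels: Labels) -> List[Disagreement]:
    """List record-condition pairs where regex and the LLM disagree."""
    queue: List[Disagreement] = []
    for record_id, regex_row in regex_labels.items():
        llm_row = llm_labels.get(record_id, {})
        for condition, regex_value in regex_row.items():
            llm_value = llm_row.get(condition)
            if llm_value is not None and llm_value != regex_value:
                queue.append((record_id, condition, regex_value, llm_value))
    # Regex-negative / LLM-positive cases first: candidate regex blind spots.
    queue.sort(key=lambda item: (item[2], item[0], item[1]))
    return queue


regex_labels = {
    "r1": {"diabetes": 0, "hypertension": 1},
    "r2": {"diabetes": 1, "hypertension": 0},
}
llm_labels = {
    "r1": {"diabetes": 1, "hypertension": 1},
    "r2": {"diabetes": 0, "hypertension": 1},
}

queue = disagreement_queue(regex_labels, llm_labels)
for item in queue:
    print(item)
```

Sorting regex-negative, LLM-positive pairs to the top reflects the article's emphasis on false negatives: those are the cases where the LLM may have found a mention the patterns missed, and a clinician decides which system was right. The LLM proposes; it does not overwrite the reference.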

What the paper does not prove

The boundary of the study is important.

The paper tests six open-source multilingual models, not every LLM. It uses Italian EHR anamnesis text from a cardiac-domain setting, not all clinical specialties, languages, or note types. It focuses on five comorbidities. It intentionally uses a zero-shot prompt to simulate direct model use by clinicians without prompt engineering. It does not evaluate fine-tuning, in-context learning, retrieval augmentation, specialized clinical models, proprietary API models, or hybrid systems that combine regex, LLMs, and human review.

The manual annotation subset is also specific: 100 regex-negative records reviewed by clinicians. That is valuable because false negatives matter, but it is not the same as a fully manual gold standard over all 8,223 records.

These boundaries do not weaken the paper’s practical message. They sharpen it. The study is not a final verdict on LLMs in healthcare. It is a warning against premature substitution under a common deployment pattern: general multilingual model, local deployment, simple prompt, high-risk extraction task, and an accuracy number that looks better than the underlying behavior.

The article’s real takeaway: multilingual is not medical

The paper’s title asks whether LLMs are truly multilingual. The business question is slightly different: even if they are multilingual, are they operationally reliable in a local clinical workflow?

For this study, the answer is no.

The tested LLMs can produce superficially strong accuracy under some comparisons. But when the authors inspect positive-class metrics and confusion matrices, the weakness becomes clear: some models over-predict positives, some miss positives, and some appear accurate because they lean into the negative majority class. Mistral 7B looks strongest in parts of the evaluation, yet manual comparison suggests that its apparent performance may rely heavily on predicting negatives rather than demonstrating robust semantic extraction.

That is the uncomfortable lesson. A model can be multilingual enough to process the words and still not be clinical enough to support the workflow.

For healthcare leaders, the practical rule is simple: do not ask whether the model sounds intelligent. Ask whether it fails in a way your process can detect, measure, and tolerate.

Regex may be old. It may be ugly. It may require clinicians to list awkward abbreviations and local phrasing. But in this paper, the boring baseline has something the zero-shot LLMs do not yet have: validated behavior that can be inspected.

In healthcare AI, that is not a minor feature. That is the plot.

Cognaptus: Automate the Present, Incubate the Future.

¹ Vignesh Kumar Kembu, Pierandrea Morandini, Marta Bianca Maria Ranzini, and Antonino Nocera, “Are LLMs Truly Multilingual? Exploring Zero-Shot Multilingual Capability of LLMs for Information Retrieval: An Italian Healthcare Use Case,” arXiv:2512.04834, submitted December 4, 2025, https://arxiv.org/abs/2512.04834.
