CURE Enough: When Multimodal EHR Models Finally Grow Up

Hospitals do not run on clean datasets. They run on discharge notes, lab panels, repeated admissions, missing context, and the occasional clinical abbreviation that looks like it escaped from a tax form.

That is the awkward reality behind chronic-disease prediction. The patient record is not just text. It is not just lab values. It is not just a sequence of visits. It is all three, with timing doing much of the quiet work. A patient returning after 42 days does not mean the same thing as a patient returning after 420 days, even when the diagnosis code looks identical. Healthcare operations already know this. Many AI models, bless their expensive little hearts, still behave as if they do not.

The paper behind this article introduces CURENet, a multimodal EHR model designed for chronic-disease prediction across clinical notes, textualised abnormal lab results, and irregular visit histories.¹ Its useful contribution is not that it adds a large language model to medicine. That sentence has been printed so often it should come with a recycling fee. The more interesting move is that CURENet tries to build a unified patient representation: one that treats clinical language, lab abnormality, and visit cadence as mutually informative rather than as separate evidence piles waiting to be concatenated at the end.

That distinction matters. In chronic care, the “prediction” problem is rarely a one-shot diagnosis. It is trajectory recognition.

CURENet starts with the right clinical nuisance: the record is fragmented

The paper frames chronic-disease prediction as a multilabel task over patient visit sequences. Given a patient’s prior visits, the model predicts which diseases are likely to be present at a later visit. The authors also evaluate heart-failure prediction as a binary classification task.
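The task framing above can be sketched in a few lines. This is a minimal illustration, not the paper's data pipeline: the label vocabulary, the `make_examples` helper, and the choice to build one training example per visit prefix are all assumptions made for the sake of the example.

```python
from typing import List, Set, Tuple

# Hypothetical label vocabulary; the paper uses the ten most prevalent
# chronic conditions in each dataset.
LABELS = ["hypertension", "diabetes", "heart_failure"]

def make_examples(visits: List[Set[str]]) -> List[Tuple[List[Set[str]], List[int]]]:
    """Pair each visit history (visits 0..t-1) with a multi-hot target for visit t."""
    examples = []
    for t in range(1, len(visits)):
        history = visits[:t]
        target = [1 if label in visits[t] else 0 for label in LABELS]
        examples.append((history, target))
    return examples

# One invented patient with three visits, each carrying a set of diagnoses.
patient = [
    {"hypertension"},
    {"hypertension", "diabetes"},
    {"hypertension", "heart_failure"},
]
examples = make_examples(patient)
for history, target in examples:
    print(len(history), "prior visit(s) ->", target)
```

A patient with a single visit yields no examples under this framing, which is consistent with the requirement of at least two documented visits.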

The important detail is not merely the label format. It is the input design.

CURENet uses three streams of EHR information:

Information source	How CURENet represents it	Why it matters operationally
Clinical notes	Chief complaint, current illness, medical history, and admission medication sections are processed by a fine-tuned Medical-LLaMA3-8B model	Notes contain symptoms, context, medication history, and physician judgement that structured codes often flatten
Abnormal lab results	Lab values are converted into templated text sentences and passed through the same language-model pathway	This sidesteps part of the LLM-tabular mismatch by making abnormal labs semantically readable
Visit timing	Visit duration and inter-visit gaps are encoded through a Time-Series Transformer	Chronic instability often appears as longer stays, shorter gaps, or irregular recurrence patterns

This is a mechanism-first paper. The architecture is the argument.

A standard medical LLM can read notes. A time-series model can process visits. A tabular model can consume labs. CURENet’s premise is that none of those alone is enough for chronic disease because the signal is distributed. A lab abnormality means one thing when embedded in a worsening clinical narrative and another when it appears as an isolated blip. A diagnosis history means one thing when visits are stable and another when readmissions accelerate. The model is trying to learn those cross-modal interactions.

Technically, the paper does this through a two-stream representation extractor. The language stream uses Medical-LLaMA3-8B to encode notes plus text-converted abnormal labs. The temporal stream uses a Time-Series Transformer to encode visit duration and inter-visit intervals. The resulting semantic and temporal embeddings are concatenated and passed into a multilayer perceptron, which produces the final patient representation for prediction.
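The fusion step described above is simple enough to sketch without a deep-learning framework. The following is a toy illustration only: the embedding sizes, the single hidden layer, the random weights, and the per-label sigmoid output are assumptions, and the random vectors stand in for what Medical-LLaMA3-8B and the Time-Series Transformer would actually produce.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)

def init_layer(rng: random.Random, n_in: int, n_out: int) -> Layer:
    """Randomly initialised dense layer; a stand-in for trained parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x: List[float], layer: Layer) -> List[float]:
    """Apply a dense layer: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def fuse_and_predict(semantic: List[float], temporal: List[float],
                     hidden: Layer, output: Layer) -> List[float]:
    """Concatenate the two embeddings, apply an MLP, return per-label probabilities."""
    fused = semantic + temporal                        # concatenation
    h = [max(0.0, v) for v in dense(fused, hidden)]    # ReLU hidden layer
    return [1.0 / (1.0 + math.exp(-v)) for v in dense(h, output)]  # sigmoid per label

# Hypothetical sizes; the real embeddings are far larger.
SEM_DIM, TIME_DIM, HIDDEN_DIM, N_LABELS = 8, 4, 6, 10
rng = random.Random(0)
semantic = [rng.gauss(0.0, 1.0) for _ in range(SEM_DIM)]   # placeholder language embedding
temporal = [rng.gauss(0.0, 1.0) for _ in range(TIME_DIM)]  # placeholder temporal embedding
hidden_layer = init_layer(rng, SEM_DIM + TIME_DIM, HIDDEN_DIM)
output_layer = init_layer(rng, HIDDEN_DIM, N_LABELS)
probs = fuse_and_predict(semantic, temporal, hidden_layer, output_layer)
```

The point of the sketch is the shape of the computation: the prediction head sees both embeddings at once, so it can learn interactions between what the notes say and how the visits are spaced.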

This is not exotic in the theatrical sense. It is not a moonshot architecture with seven Greek letters and one mystery module called “reasoning”. It is closer to a practical engineering correction: stop pretending that EHR modalities are independent just because the hospital database stores them that way.

The lab-text trick is inelegant, which is partly why it is interesting

One of CURENet’s more pragmatic choices is to convert abnormal lab results into templated text. Instead of asking the LLM to reason directly over raw tabular lab data, the model rewrites abnormal results into templated sentences that list each recorded abnormal test item with its value and unit.

This may sound slightly clumsy. It is. It is also a reasonable workaround.

LLMs are strong at textual semantics and still uneven with tabular and time-series structure. The paper acknowledges this limitation and chooses a translation strategy: make part of the structured data legible to the language model, while leaving visit timing to a transformer built for sequences. That division of labour is the useful part. CURENet does not pretend the LLM is magically good at everything. It gives the LLM the part of the record it can plausibly interpret and gives temporal irregularity to a separate encoder.
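A rough sketch of the lab-to-text translation might look like this. The template wording, the `LabResult` type, and the sample values are invented for illustration; the paper's actual template is not reproduced here.

```python
from typing import List, NamedTuple

class LabResult(NamedTuple):
    item: str
    value: float
    unit: str
    abnormal: bool  # assumed to be flagged upstream, e.g. by the lab system

def labs_to_text(labs: List[LabResult]) -> str:
    """Render only the abnormal results as one templated sentence."""
    abnormal = [lab for lab in labs if lab.abnormal]
    if not abnormal:
        return "No abnormal lab results were recorded."
    parts = [f"{lab.item} {lab.value:g} {lab.unit}" for lab in abnormal]
    return "Abnormal lab results: " + "; ".join(parts) + "."

# Invented example values, not taken from either dataset.
labs = [
    LabResult("Creatinine", 2.1, "mg/dL", True),
    LabResult("Sodium", 140.0, "mEq/L", False),
    LabResult("Potassium", 5.9, "mEq/L", True),
]
sentence = labs_to_text(labs)
print(sentence)
```

Normal results are dropped before templating, which keeps the text short and leaves the language model with only the values that deviate.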

For business readers, that is a subtle but important lesson. The near-term value of clinical AI may not come from one giant model swallowing the whole hospital. It may come from modular systems that route each data type through the least-bad representation pathway. Less glamorous, more deployable. Terrible news for keynote slides, excellent news for actual systems.

Timing is not metadata; it is clinical evidence

The paper’s second major contribution is its handling of irregular visits. CURENet explicitly models two time features: the duration of each visit and the gap between visits. These are coarse signals, but they are clinically meaningful.

A longer stay can indicate severity, complexity, complications, or care intensity. A shorter interval between visits can suggest instability, poor disease control, or emerging comorbidity. Chronic disease is not merely a list of diagnoses; it is a pattern of recurrence and deterioration over time.

The Time-Series Transformer processes these temporal features across patient histories, using masking and padding to handle variable-length sequences. The model is not using high-frequency physiological monitoring or intra-visit trajectories. That boundary matters. CURENet is not a continuous patient-monitoring model. It is a visit-sequence model, built around admission and discharge timestamps.
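The two time features, plus the padding and masking step, can be sketched from admission and discharge timestamps alone. This is an assumption-laden illustration: the gap is measured here from the previous discharge to the next admission, the first visit gets a gap of zero, and the helper names and `max_len` are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

Visit = Tuple[datetime, datetime]  # (admission, discharge)

def time_features(visits: List[Visit]) -> List[Tuple[float, float]]:
    """Return (stay_days, gap_days) per visit, in chronological order."""
    feats = []
    prev_discharge = None
    for admit, discharge in sorted(visits):
        stay = (discharge - admit).total_seconds() / 86400
        # Assumption: the first visit has no predecessor, so its gap is 0.
        gap = 0.0 if prev_discharge is None else (admit - prev_discharge).total_seconds() / 86400
        feats.append((stay, gap))
        prev_discharge = discharge
    return feats

def pad_with_mask(feats: List[Tuple[float, float]], max_len: int):
    """Keep the most recent max_len visits, zero-pad the rest, and mark real steps with 1."""
    if max_len < 1:
        raise ValueError("max_len must be positive")
    kept = feats[-max_len:]
    n_pad = max_len - len(kept)
    padded = kept + [(0.0, 0.0)] * n_pad
    mask = [1] * len(kept) + [0] * n_pad
    return padded, mask

visits = [
    (datetime(2020, 1, 1), datetime(2020, 1, 5)),
    (datetime(2020, 2, 16), datetime(2020, 2, 18)),
]
feats = time_features(visits)          # a 4-day stay, then a 2-day stay after a 42-day gap
padded, mask = pad_with_mask(feats, max_len=4)
```

The mask lets a sequence model ignore the padded positions, so a patient with two visits and a patient with twenty can share a batch without the zeros being read as real visits.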

Still, this is enough to address a common weakness in EHR prediction models. Many systems flatten patient history into a static vector, then act surprised when chronic disease refuses to behave like a spreadsheet row. CURENet treats the patient record as a sequence. For chronic care, that is not a luxury feature. It is the assignment.

What the experiments actually show

The authors evaluate CURENet on MIMIC-III and a private Far Eastern Memorial Hospital dataset from Taiwan. Patients with at least two documented visits are included. The reported split is patient-level, intended to avoid leakage from the same patient appearing in both training and testing partitions.
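A patient-level split is easy to get wrong, so a small sketch helps. The helper name, the 20% test fraction, and the patient identifiers below are assumptions; the paper's actual split ratio is not stated in this article.

```python
import random
from typing import List, Sequence, Tuple

def patient_level_split(patient_ids: Sequence[str], test_fraction: float = 0.2,
                        seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split record indices so that every patient lands in exactly one partition."""
    unique = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_patients = set(unique[:n_test])
    train_idx = [i for i, pid in enumerate(patient_ids) if pid not in test_patients]
    test_idx = [i for i, pid in enumerate(patient_ids) if pid in test_patients]
    return train_idx, test_idx

# One entry per visit record; several records share a patient.
record_patients = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"]
train_idx, test_idx = patient_level_split(record_patients)

train_patients = {record_patients[i] for i in train_idx}
held_out_patients = {record_patients[i] for i in test_idx}
print("shared patients:", train_patients & held_out_patients)
```

Splitting on patients rather than on records is what prevents a model from being tested on a later visit of someone it has already seen in training.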

The multilabel task focuses on the ten most prevalent chronic conditions in each dataset. For MIMIC-III, these include hypertension, cardiac arrhythmias, diabetes without chronic complications, valvular disease, congestive heart failure, chronic pulmonary disease, fluid and electrolyte disorders, neurological conditions, renal failure, and complicated hypertension. For FEMH, the top conditions include hypertension, diabetes, heart disease, cancer, asthma, liver disease, hyperlipidaemia, cerebrovascular disease, kidney disease, and lung disease.

The main evidence is the performance comparison against BERT, Mistral, Llama-family and Meta-Llama baselines, and Medical-LLaMA3-8B variants. CURENet performs best across the main classification metrics reported in Table 2.

A careful reading is needed here. The abstract says CURENet achieves over 94% accuracy in predicting the top 10 chronic conditions across the two datasets. The table is more nuanced. For MIMIC-III, CURENet reports 0.9492 training accuracy and 0.9166 validation accuracy. For FEMH, it reports 0.9458 training accuracy and 0.9425 validation accuracy. So yes, there is a 94%-plus figure in the results. No, it should not be read as “the model diagnoses chronic disease with 94% real-world accuracy everywhere.” That would be the usual AI translation error: converting a benchmark number into a procurement slogan.

The more informative comparison is against the strongest baseline, LoRA Medical-LLaMA3-8B:

Dataset	Model	Validation F1 macro	Validation accuracy	Interpretation
MIMIC-III	LoRA Medical-LLaMA3-8B	0.8466	0.9068	Strong clinical language baseline
MIMIC-III	CURENet	0.8551	0.9166	Modest but consistent gain from multimodal temporal fusion
FEMH	LoRA Medical-LLaMA3-8B	0.8435	0.9252	Strong baseline on private hospital data
FEMH	CURENet	0.8720	0.9425	Larger gain, suggesting useful benefit from local multimodal signals

That is the business-relevant version of the result. The gain is not magic. It is not a replacement for clinical judgement. It is evidence that, when notes, abnormal labs, and visit timing are represented together, the model extracts more useful signal than a language model operating mainly through text.
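The size of the gain is worth computing explicitly rather than eyeballing. The snippet below only restates the validation figures from the comparison above and subtracts them; the dictionary layout and names are illustrative.

```python
# Validation metrics as reported in the comparison above: (F1 macro, accuracy).
results = {
    "MIMIC-III": {"baseline": (0.8466, 0.9068), "curenet": (0.8551, 0.9166)},
    "FEMH": {"baseline": (0.8435, 0.9252), "curenet": (0.8720, 0.9425)},
}

gains = {}
for dataset, models in results.items():
    base_f1, base_acc = models["baseline"]
    cure_f1, cure_acc = models["curenet"]
    # Absolute differences, rounded to the four decimals the table reports.
    gains[dataset] = (round(cure_f1 - base_f1, 4), round(cure_acc - base_acc, 4))

for dataset, (f1_gain, acc_gain) in gains.items():
    print(f"{dataset}: F1 macro +{f1_gain:.4f}, accuracy +{acc_gain:.4f}")
```

On the reported numbers this works out to roughly +0.0085 F1 macro and +0.0098 accuracy on MIMIC-III, against +0.0285 and +0.0173 on FEMH: under one point on the public benchmark, and between about two and three points on the private hospital data.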

The heart-failure experiment is a secondary task. Its purpose is to test whether the same architecture helps in a clinically important binary prediction setting. The paper reports that CURENet performs better than LoRA Medical-LLaMA3-8B across recall, AUC, and F1, especially on FEMH. The exact numeric values behind the heart-failure figure are not exposed in the paper’s HTML text, so the safe interpretation is directional: the architecture appears to improve sensitivity and balance for heart-failure prediction, but this article should not pretend the figure provides procurement-grade thresholds.

The ablation is the paper’s most useful reality check

The ablation study asks a simple question: do the modalities actually matter?

This is where the paper becomes more convincing. On MIMIC-III, the authors remove clinical text and lab text separately. The full model performs best. More importantly, removing clinical notes hurts badly.

Variant	Precision	Recall	F1 macro	F1 weighted	Accuracy
Without TEXT	0.7144	0.6686	0.6634	0.7903	0.8097
Without LABTEXT	0.8399	0.8209	0.8124	0.8838	0.8910
Full CURENet	0.8839	0.8571	0.8551	0.9110	0.9166

The ranking metrics tell the same story. Full CURENet reaches Recall@5 of 0.9580 and NDCG@5 of 0.9172, compared with 0.9283 and 0.8759 without lab text, and 0.8023 and 0.7058 without clinical text.
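For readers unfamiliar with the ranking metrics, here is a minimal sketch of Recall@k and NDCG@k for a single patient. It uses the common binary-relevance definitions; whether the paper uses exactly these formulas is an assumption, and the scores and labels below are invented.

```python
import math
from typing import List, Sequence

def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring labels, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def recall_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Share of the true labels that appear among the top-k predictions."""
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    return sum(labels[i] for i in top_k(scores, k)) / n_pos

def ndcg_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Discounted gain of the top-k ranking, normalised by the best possible ranking."""
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(top_k(scores, k)))
    ideal_hits = min(k, sum(labels))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Invented scores for five candidate conditions; three are truly present.
scores = [0.9, 0.2, 0.8, 0.1, 0.7]
labels = [1, 0, 0, 1, 1]
recall3 = recall_at_k(scores, labels, k=3)
ndcg3 = ndcg_at_k(scores, labels, k=3)
print(round(recall3, 4), round(ndcg3, 4))
```

Recall@k asks whether the right conditions made the shortlist; NDCG@k also rewards placing them near the top, which is closer to how a triage queue is actually read.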

This ablation is not a side decoration. It is the strongest support for the mechanism. It shows that the architecture’s advantage does not come merely from the authors picking a good medical LLM. The combined inputs matter. Clinical notes matter most. Lab text still adds value. Temporal fusion is designed to organise that evidence into a trajectory-aware representation, although the ablation described here removes text inputs only and does not isolate the temporal stream.

The heart-failure ablation reinforces this. Removing text collapses recall to 0.0348, while the full CURENet reaches recall of 0.6585 and accuracy of 0.9063. That does not mean the model is ready for autonomous screening. It means that, for this setup, clinical narrative is not optional. If a vendor offers chronic-disease risk scoring while treating notes as inconvenient unstructured waste, the correct response is a raised eyebrow and a longer due-diligence call.

The case study and embeddings are interpretive, not decisive

The paper includes a case study comparing predictions over three visits for a MIMIC-III patient. CURENet predicts hypertension during the first visit even though it is not yet in the ground truth, and that diagnosis appears in a later visit. The authors interpret this as evidence that CURENet can detect latent clinical signals before formal documentation.

That is plausible, but it needs discipline. A single case study is not proof of early diagnosis. It is an explanatory example showing how the model may integrate longitudinal cues. Its likely purpose is interpretability, not main evidence.

The embedding analysis has a similar role. Using t-SNE visualisation, the paper reports that CURENet produces more separated disease clusters than LoRA Medical-LLaMA3-8B across MIMIC-III and FEMH. In MIMIC-III, the separation is especially noted for diabetes mellitus, congestive heart failure, and hypertension. In FEMH, clusters are more dispersed overall, which the authors attribute to greater clinical heterogeneity, but CURENet still maintains more coherent disease-specific grouping.

Again, useful but not definitive. Cleaner clusters suggest the model learns more discriminative representations. They do not prove clinical causality, fairness, safety, or deployment readiness. Embedding plots are a microscope, not a regulatory dossier.

A simple way to read the evidence stack:

Paper component	Likely purpose	What it supports	What it does not prove
Main performance table	Main evidence	CURENet improves multilabel prediction over language-model baselines on the reported datasets	Real-world diagnostic accuracy across hospitals
Heart-failure experiment	Secondary task / comparison	The architecture may transfer to a focused high-risk condition	Safe automated heart-failure screening
Ablation study	Mechanism test	Notes and lab-text both contribute; notes are especially important	That all modalities will help equally in every EHR system
Case study	Interpretability example	The model can produce clinically plausible trajectory-aware predictions	Reliable early diagnosis
t-SNE embedding plots	Exploratory representation analysis	CURENet learns more separated disease representations	Causal disease understanding or deployment readiness

This table is also a useful antidote to AI paper-reading theatre. Not every figure carries the same evidentiary weight. Some figures are the argument. Some are supporting scenery. Some are there because reviewers, like everyone else, enjoy pictures.

The real business value is earlier risk stratification, not autonomous diagnosis

For hospitals, payers, and clinical AI vendors, CURENet’s most relevant implication is not “LLMs can diagnose chronic disease.” That interpretation is both lazy and dangerous.

The practical pathway is more specific: use multimodal EHR histories to improve risk stratification for chronic-care management. A system like CURENet could help identify patients whose notes, abnormal labs, and visit cadence suggest worsening disease trajectory or under-recognised comorbidity. That could support care coordination, follow-up prioritisation, readmission-risk workflows, and specialist referral queues.

The operational value would come from ranking and triage, not from replacing physicians. In chronic care, even a modest improvement in recall may matter if it helps clinical teams find high-risk patients earlier. But the economics depend on workflow design. A model that produces more alerts without changing staffing, scheduling, or accountability is not decision support. It is a notification-shaped liability.

The architecture also points toward a more mature vendor conversation. Buyers should not only ask, “What is your model accuracy?” They should ask:

Procurement question	Why CURENet makes it relevant
Which EHR modalities are actually used?	Chronic risk often lives across notes, labs, and timelines
Are lab values handled as raw tables, text, embeddings, or rules?	Representation choices affect portability across hospital systems
Does the model account for irregular visits?	Visit gaps and duration can signal instability
Are results validated at patient level?	Leakage can make EHR models look smarter than they are
What happens when notes are missing or low quality?	The ablation suggests text is a major driver
Is the model optimised for ranking, recall, calibration, or diagnosis?	Clinical workflows need different metrics for different decisions

This is where the paper quietly becomes useful. It gives executives and clinical AI teams a better checklist. Not a buying recommendation. A sharper interrogation script.

The boundaries are real, and they affect deployment

CURENet’s limitations are not generic “more research is needed” wallpaper. They directly affect how the result should be used.

First, the datasets are limited to MIMIC-III and one private hospital dataset from Taiwan. That is broader than a single benchmark, but it is not enough to establish global generalisability. Hospital coding practices, note styles, lab panels, follow-up patterns, and population characteristics vary. A model tuned to one institution’s rhythm may stumble when the music changes.

Second, the model excludes imaging and continuous physiological monitoring. It uses admission and discharge timestamps, not dense intra-visit data. That makes the model more practical for ordinary EHR deployment, but it also limits the clinical granularity of its temporal reasoning.

Third, the model depends heavily on EHR completeness. The ablation makes this obvious. If clinical notes drive much of the performance gain, then missing, templated, low-quality, or inconsistently written notes will degrade the system. In healthcare AI, “garbage in, garbage out” is not a cliché. It is an implementation plan nobody wanted.

Fourth, the paper does not provide prospective real-world validation. It evaluates retrospective prediction using existing datasets. That is appropriate for research. It is not enough for clinical deployment. Prospective validation would need to test how the model behaves in live workflows, how clinicians respond to its outputs, whether it improves outcomes, and whether it creates biased or excessive interventions.

Finally, bias and privacy remain central. The paper notes de-identification and institutional review approval for FEMH, but unstructured clinical notes can carry sensitive information and historical bias. A model that learns from past care patterns may also learn past care inequities. This is not a reason to avoid multimodal EHR modelling. It is a reason not to deploy it like a dashboard widget and call that governance.

CURENet is a sign of maturity because it is not trying to be magical

The most encouraging thing about CURENet is that it does not rely on the fantasy that a medical LLM alone can absorb all clinical reality. It separates what language models are good at from what temporal models are good at, then fuses their representations for the actual prediction task.

That is a more adult version of healthcare AI. Less “ask the chatbot what the patient has”. More “build a patient trajectory representation from the messy evidence clinicians already use”.

The result is not a finished clinical product. It is a research architecture with promising retrospective performance, meaningful ablation support, and clear deployment boundaries. The important lesson is structural: chronic-disease prediction improves when the model treats the EHR as a multimodal timeline rather than a text document with attachments.

For business leaders, the takeaway is equally simple. The winners in clinical AI will not be the vendors with the loudest medical LLM wrapper. They will be the ones that make fragmented hospital data behave like a coherent patient history.

Annoyingly practical. Usually the best kind of progress.

Cognaptus: Automate the Present, Incubate the Future.

¹ Cong-Tinh Dao et al., “CURENet: Combining Unified Representations for Efficient Chronic Disease Prediction,” arXiv:2511.11423, 2025, https://arxiv.org/abs/2511.11423.
