Timeline Triage: How LLMs Learn to Read Between Clinical Lines

Hospital notes are not databases that forgot to wear a spreadsheet costume.

They are fragments of care: treatment names, planned cycles, delayed doses, discontinued regimens, relative dates, typos, abbreviations, and the occasional phrase that looks obvious until two clinicians disagree about what it actually means. For oncology, that mess matters. A chemotherapy timeline is not just a historical summary; it is the skeleton of a patient’s treatment journey. Get the timeline wrong, and downstream systems may misunderstand what was given, when it started, when it ended, and whether a patient fits a registry, audit, research cohort, or trial-matching rule.

The ChemoTimelines 2025 paper from UW-BioNLP, “Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction,” is useful because it does not present one shiny model and then ask everyone to clap politely.¹ It compares several ways to make LLMs extract systemic anticancer treatment events from raw clinical notes, then shows how those extractions survive, or fail, when normalized and aggregated into patient-level timelines.

That comparison is the real story. The fine-tuned Qwen3-14B system won the shared task with an official test score of 0.678. But the paper’s practical lesson is not “fine-tune a 14B model and go home.” Clinical extraction pipelines are not judged at the moment the model emits JSON. They are judged after the JSON is normalized, deduplicated, mapped into timelines, and punished by strict exact-match evaluation. In other words: the model is only the first employee in the bureaucracy.

The rest of the bureaucracy matters.

The task is not extraction; it is extraction that survives chronology

The paper focuses on Subtask 2 of ChemoTimelines 2025: generating patient-level chemotherapy timelines from raw clinical notes. The dataset covers breast cancer, melanoma, and ovarian cancer, with 69 patients and 2,910 note files in the training set, 27 patients and 1,272 notes in development, and 53 patients with 2,121 notes in the private test set.

The target output is a timeline of triplets:

a systemic anticancer therapy entity;
a time expression;
a relation such as BEGINS-ON, ENDS-ON, or CONTAINS-1.
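
To make the target concrete, here is a minimal sketch of one triplet as a data structure. The class and field names are illustrative assumptions, the relation list covers only the three labels named above, and the example values are invented rather than taken from the dataset.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(str, Enum):
    # Only the labels named in the text; the task's full label set may be larger.
    BEGINS_ON = "BEGINS-ON"
    ENDS_ON = "ENDS-ON"
    CONTAINS_1 = "CONTAINS-1"


@dataclass(frozen=True)
class Triplet:
    sact: str           # systemic anticancer therapy mention, as written in the note
    relation: Relation  # how the therapy relates to the time expression
    time: str           # time expression, raw or already normalized


# Hypothetical example values.
t = Triplet("carboplatin", Relation.BEGINS_ON, "2013-05-02")
```

Making the triplet frozen keeps it hashable, which matters later: deduplication and exact-match scoring both reduce to set operations over these objects.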

This sounds tidy until one remembers what clinical notes look like. A note may say a drug was started “today,” completed “last week,” administered as part of a regimen, scheduled but not yet given, mentioned in a past history, or listed in a table where a date appears nearby but does not mean what a model thinks it means. The model also has to know that some medication-like strings are not systemic anticancer therapy at all.

UW-BioNLP’s design splits the work into two stages:

Note-level extraction: an LLM extracts treatment-time-relation triplets from individual notes.
Timeline aggregation: extracted time expressions are normalized, relative dates are anchored to the document time, duplicated events are merged, and patient-level timelines are produced using the official aggregation script.
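
The second stage can be sketched in a few lines. This is a toy stand-in under stated assumptions, not the official aggregation script or Timenorm: the function names, the three-word relative-date lookup, and the lowercase merge key are all hypothetical choices made for illustration.

```python
from datetime import date, timedelta

# Toy lookup of relative expressions; a real normalizer covers far more language.
RELATIVE_OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1}


def normalize_time(expr, doc_time):
    """Return an ISO date string, or None when the expression cannot be resolved."""
    key = expr.strip().lower()
    if key in RELATIVE_OFFSETS:
        # Relative expressions are anchored to the note's document time.
        return (doc_time + timedelta(days=RELATIVE_OFFSETS[key])).isoformat()
    try:
        return date.fromisoformat(key).isoformat()
    except ValueError:
        return None


def aggregate(notes):
    """notes: list of (doc_time, [(sact, relation, time_expr), ...]) for one patient."""
    timeline = set()
    for doc_time, triplets in notes:
        for sact, relation, expr in triplets:
            normalized = normalize_time(expr, doc_time)
            if normalized is None:
                continue  # expressions that cannot be normalized are dropped
            timeline.add((sact.lower(), relation, normalized))  # set membership deduplicates
    return sorted(timeline, key=lambda t: (t[2], t[0], t[1]))
```

Even in this toy version, two notes that describe the same start date in different words collapse into one timeline entry, while a vague expression silently disappears. Both behaviors shape the final score.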

That split is more than an engineering convenience. It lets the authors observe an uncomfortable truth: better note-level extraction usually helps, but it does not mechanically translate into better timeline-level performance. Normalization and aggregation can rescue bad outputs, discard useful ones, or amplify small errors. Clinical AI, as usual, refuses to be impressed by a single metric.

The comparison matrix: five ways to make an LLM read a chart

The paper compares five main strategies for note-level extraction, plus a simple ensemble.

Strategy	Likely purpose in the paper	Main evidence it provides	What it does not prove
Prompting baseline	Implementation baseline	Shows that careful instructions alone are weak and unstable	Does not show the ceiling of prompting with retrieval or heavier prompt engineering
Thinking mode	Inference-time comparison	Shows that chain-of-thought-style self-checking can reduce false positives and recover missed events	Does not make reasoning perfectly aligned with final output
Dictionary + LLM verification	Precision-recall and cost tradeoff	Shows high recall, interpretability, and lower token burden when candidate sentences are filtered first	Does not guarantee test-set coverage of abbreviations, typos, or local naming styles
Supervised fine-tuning	Main performance evidence	Produces the best development and official test performance	Does not prove broad clinical deployment readiness
SFT + DPO	Exploratory preference-alignment extension	Tests whether recall-favoring preference optimization improves downstream timelines	Does not clearly outperform SFT
Ensemble	Error-interaction stress test	Shows that concatenating predictions can accumulate errors instead of correcting them	Does not rule out smarter calibrated ensembling

This table matters because it blocks the lazy reading of the paper. The result is not a model ranking page. It is closer to an operational decision matrix.

A hospital AI team, a registry vendor, or a clinical research platform would not choose among these methods only by leaderboard score. They would also care about annotation cost, latency, model hosting, auditability, ontology maintenance, failure modes, and whether the pipeline behaves predictably when documentation style changes. The paper does not solve all of those questions, but it gives unusually concrete evidence for asking them.

Fine-tuning wins, but the win is not just “more learning”

The strongest system is supervised fine-tuning on Qwen3-14B. On the development set, SFT with Qwen3-14B reaches an official score of 0.644. On the private test set, the corresponding final submission reaches 0.678, the highest reported official score.

That is the headline result. But the mechanism is more interesting than the headline.

The SFT setup trains the model to perform note-level extraction using the same general prompt format as the baseline, with outputs serialized as JSON objects containing SACT, relation, and time. Compared with older sentence-level approaches, the authors emphasize three differences: they use note-level context rather than only a sentence window, they output structured JSON rather than specialized triplet linearization, and they fine-tune a newer 14B Qwen3 model rather than older or smaller architectures.
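
Structured JSON output is only useful if something checks it. The sketch below parses a model response and keeps well-formed items; the exact key spelling ("SACT", "relation", "time") and the closed relation set are assumptions based on the description above, not the paper's published schema.

```python
import json

# Assumed label set and key names; adjust to the real task schema.
ALLOWED_RELATIONS = {"BEGINS-ON", "ENDS-ON", "CONTAINS-1"}
REQUIRED_KEYS = ("SACT", "relation", "time")


def parse_extractions(raw):
    """Parse a model response into a list of valid triplet dicts, skipping malformed items."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if isinstance(data, dict):
        data = [data]  # tolerate a single object instead of a list
    if not isinstance(data, list):
        return []
    valid = []
    for item in data:
        if not isinstance(item, dict):
            continue
        # Every required key must be present and hold a non-empty string.
        if any(not isinstance(item.get(k), str) or not item[k].strip() for k in REQUIRED_KEYS):
            continue
        if item["relation"].strip() not in ALLOWED_RELATIONS:
            continue
        valid.append({k: item[k].strip() for k in REQUIRED_KEYS})
    return valid
```

A validator like this does not make the extraction correct. It only guarantees that whatever reaches the normalizer has the shape the normalizer expects.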

The practical interpretation is simple: when enough annotated examples exist, fine-tuning teaches the model the local extraction style of the task. It learns not only what counts as systemic anticancer therapy, but also the annotation conventions: what surface forms to preserve, how to express relations, and when not to infer.

That last phrase is doing work. Clinical extraction systems often fail not because they lack medical vocabulary, but because they over-help. They normalize when asked not to normalize. They infer exact dates from vague phrases. They merge drug aliases that the gold standard wants separated. They treat a planned event as a real event because, frankly, the note almost invites the mistake.

Fine-tuning makes the model less like a general medical assistant and more like a disciplined annotation worker. Less glamorous. More useful.

Still, the result has boundaries. The training data reflects the documentation style, cancer types, and annotation conventions of this challenge. SFT’s victory does not prove that the same model would generalize cleanly across hospitals, EHR systems, cancer types, treatment eras, or multilingual notes. It proves that under this task design, with these data and this evaluation, fine-tuning was the strongest tested strategy.

That is already valuable. It is just not a deployment certificate. Sorry, procurement slide deck.

Thinking mode behaves like a slower reviewer, not a magic clinician

The thinking approach is the paper’s most tempting result for teams that lack annotated data. With thinking enabled, Qwen3 models perform far better than the prompting baseline. On development, Qwen3-30B-A3B with thinking reaches a note-level micro F1 of 0.526 and an official timeline score of 0.596. With rule-based postprocessing, the final thinking submission reaches 0.625 on development and 0.644 on test.

The authors’ error analysis suggests why thinking helps. The model often checks whether candidate therapies are actually systemic anticancer treatments and whether time expressions are specific enough to extract. It can reject examples like supportive medications or ambiguous expressions, and it can notice hidden events that a one-pass model may miss.

That makes thinking useful as a baseline for future systems. It is a way to buy reasoning behavior at inference time instead of buying annotation and fine-tuning work upfront.

But this tradeoff is not free. Thinking increases the output budget substantially: the paper raises the maximum token limit for thinking models from 4,096 to 20,480, a fivefold increase, to allow complete outputs. That is not a rounding error in production. More reasoning means more latency, more compute cost, and more operational variability.

There is also a deeper problem: reasoning traces do not guarantee faithful final outputs. The appendix gives a wonderfully annoying example. The model correctly reasons that “Tc” in Tc-99m MDP is a radiopharmaceutical used in a bone scan and should not be tagged as chemotherapy. Then the final output still keeps the erroneous tag. The model passed the oral exam and failed the form.

For business use, this makes thinking attractive but dangerous to oversell. It can improve extraction behavior and provide useful diagnostic traces. But the trace is not a binding contract. In regulated clinical workflows, the final structured output still needs validation, not vibes with a transcript.

Dictionaries are interpretable, high-recall, and allergic to the real world

The dictionary-enhanced system is the most operationally interesting method because it resembles how many practical clinical NLP systems are actually built. It uses a chemotherapy dictionary from HemOnc.org, generic chemotherapy mentions, and training/development annotations from Subtask 1. It tags candidate mentions, asks Qwen3 to verify and augment them, then performs relation extraction using local sentence context.

This pipeline is not as elegant as a fine-tuned model reading full notes. It is also not naive. The authors use the dictionary to reduce the search space: fewer than 6% of development-set sentences contain systemic anticancer treatment annotations, so there is a strong efficiency argument for first finding candidate sentences and only then spending LLM compute.

The supplementary table is important here. Dictionary tagging alone achieves nearly perfect recall across cancer types on development, but lower precision for annotated sentences. Adding LLM verification improves precision while keeping recall near 1.0. For breast cancer annotated sentences, precision rises from 0.7322 to 0.8244 while recall remains 1.0000. For melanoma, precision rises from 0.8021 to 0.8296 with recall at 0.9962. For ovarian cancer, precision rises only slightly, while recall falls from 1.0000 to 0.9943.
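
To see what the precision gain is worth when recall stays fixed, one can fold the quoted breast cancer figures into F1. The F1 values below are derived here from the precision and recall quoted above; they are not numbers reported in the paper.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Breast cancer, annotated sentences, development set (figures quoted above).
dictionary_only = f1(0.7322, 1.0000)
with_llm_check = f1(0.8244, 1.0000)
print(round(dictionary_only, 4), round(with_llm_check, 4))  # 0.8454 0.9037
```

With recall pinned at 1.0, every point of precision the LLM check recovers shows up almost directly in F1, which is why a verification pass is cheap insurance on top of a high-recall tagger.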

This is not a secondary curiosity. It is an ablation-style test of the dictionary layer. The dictionary finds almost everything it knows how to spell. The LLM cleans up some false positives. Together, they create a more interpretable and potentially cheaper extraction path.

Then the test set ruins the party, as test sets are paid to do.

The dictionary-enhanced final submission scores 0.632 on development but only 0.545 on test. The authors attribute this partly to term coverage problems: the test set includes variants such as “bev” for Bevacizumab and “interfuron” for interferon, which the dictionary missed. The system also produces false positives when abbreviations collide with non-therapy contexts, such as recognizing “FEC” inside a pulmonary-function expression.
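
Both failure modes are easy to reproduce with a toy matcher. The sketch below assumes a five-term dictionary and invented sentences; the real system uses a much larger dictionary and an LLM verification step that this example leaves out. The pulmonary-function sentence in particular is a hypothetical stand-in, since the source does not quote the original expression.

```python
import re

# Tiny illustrative term list; not the paper's dictionary.
TERMS = ["bevacizumab", "carboplatin", "paclitaxel", "interferon", "FEC"]

# Whole-word, case-insensitive match on any dictionary term.
PATTERN = re.compile(r"\b(" + "|".join(re.escape(t) for t in TERMS) + r")\b", re.IGNORECASE)


def candidate_sentences(sentences):
    """Return (sentence, matched_terms) only for sentences with a dictionary hit."""
    hits = []
    for s in sentences:
        found = [m.group(0) for m in PATTERN.finditer(s)]
        if found:
            hits.append((s, found))
    return hits


# Invented sentences for illustration.
sentences = [
    "Patient tolerated carboplatin well.",
    "No fever overnight.",
    "Started bev last week.",            # abbreviation missing from the dictionary
    "Received interfuron in 2010.",      # misspelling missing from the dictionary
    "PFTs: FEV1/FEC ratio reduced.",     # regimen acronym colliding with a non-therapy context
]
hits = candidate_sentences(sentences)
```

Only two of the five sentences reach the LLM, which is the efficiency argument. One of those two is a false alarm, and two true mentions never arrive at all, which is the maintenance argument.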

This is the dictionary bargain. You get interpretability, controllability, and efficiency. In return, you inherit the maintenance burden of synonyms, typos, local abbreviations, and clinical shorthand. Every hospital has its own dialect. Some of those dialects were apparently trained by raccoons walking across a keyboard.

For enterprise use, the dictionary approach is not obsolete. It may be exactly right when auditability and cost matter more than maximum score. But it should be treated as a maintained knowledge asset, not a one-time lookup table.

DPO is sensible in theory and modest in this experiment

The DPO experiment tests a reasonable hypothesis: if downstream aggregation can deduplicate and resolve conflicts, maybe the note-level extractor should favor recall over precision. Missing a true event is costly, while extra events might be filtered later.

The authors warm up models with SFT, generate multiple candidate outputs, choose high-recall candidates as preferred responses, and train using Direct Preference Optimization. This is an exploratory extension rather than the central result.
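
A minimal sketch of recall-based pair selection follows. The paper states only that high-recall candidates are chosen as preferred responses; pairing the best candidate against the worst, and skipping ties, are assumptions made here for illustration. The prompt/chosen/rejected layout is a common convention for preference data, not a detail confirmed by the source.

```python
def recall(pred, gold):
    """Fraction of gold triplets found in the prediction."""
    return len(set(pred) & set(gold)) / len(gold) if gold else 1.0


def build_preference_pair(prompt, candidates, gold):
    """Pick the highest-recall candidate as 'chosen' and the lowest as 'rejected'.

    Returns None when all candidates tie, since a tie carries no preference signal.
    """
    scored = sorted(candidates, key=lambda c: recall(c, gold))
    worst, best = scored[0], scored[-1]
    if recall(best, gold) == recall(worst, gold):
        return None
    return {"prompt": prompt, "chosen": best, "rejected": worst}
```

Note what this objective ignores: a candidate with perfect recall and a pile of false positives is still "preferred." That is the bet the experiment makes on downstream aggregation.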

The evidence is mixed. On the test set, SFT + DPO reaches 0.666, close to SFT’s 0.678 but not better. On development, Qwen3-14B SFT scores 0.644, while Qwen3-14B SFT + DPO scores 0.622. Smaller models show small or inconsistent changes. The preference datasets are also tiny: 9 pairs for Qwen3-14B, 27 for Qwen3-8B, and 30 for Qwen3-4B.

So the right conclusion is not “DPO failed.” It is more precise: this particular recall-favoring DPO setup did not outperform SFT under the official timeline metric.

That distinction matters. Preference optimization may still help if preference pairs are larger, clinician-reviewed, calibrated against timeline-level outcomes rather than note-level recall, or designed around specific failure modes such as start/end relation errors. But in this paper, DPO is not the hero. It is the intern with promise and not enough data.

Ensembling fails because clinical extraction errors do not politely cancel

The ensemble method concatenates note-level predictions from SFT, SFT + DPO, and thinking with postprocessing, then sends them into the same normalization and aggregation pipeline. The intuitive hope is familiar: combine models, cover each other’s misses, improve final performance.

The actual result is blunt. The ensemble scores 0.562 on development and 0.603 on test, lower than each of its individual ingredients in the final submissions.

This is one of the paper’s most useful practical warnings. Ensembling does not automatically work when errors are structured and downstream processing is brittle. In this task, extra predictions are not harmless. They can create false positives, duplicate variants, conflicting start/end relations, and irregular time expressions. Aggregation can deduplicate some repetition, but it cannot transform a pile of inconsistent guesses into a reliable patient timeline.
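
The arithmetic behind that warning can be shown with a constructed example. The three toy systems below are invented so that each finds two of three gold triplets and adds two distinct errors; the scorer is a simplified exact-match F1 standing in for the official metric. It illustrates the mechanism and is not evidence about the paper's systems.

```python
def prf(pred, gold):
    """Exact-match precision, recall, and F1 over sets of triplets."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


g1 = ("paclitaxel", "BEGINS-ON", "2013-05-02")
g2 = ("paclitaxel", "ENDS-ON", "2013-08-01")
g3 = ("carboplatin", "CONTAINS-1", "2013-06-01")
gold = {g1, g2, g3}

# Each toy system makes different mistakes: alias variants, coarse dates, wrong relations.
system_a = {g1, g2, ("taxol", "BEGINS-ON", "2013-05-02"), ("paclitaxel", "ENDS-ON", "2013-08")}
system_b = {g1, g3, ("paclitaxel", "CONTAINS-1", "2013-05-02"), ("carbo", "CONTAINS-1", "2013-06-01")}
system_c = {g2, g3, ("paclitaxel", "BEGINS-ON", "2013-05-09"), ("carboplatin", "ENDS-ON", "2013-06-01")}

ensemble = system_a | system_b | system_c  # concatenate, then deduplicate exact repeats

for name, preds in [("A", system_a), ("B", system_b), ("C", system_c), ("ensemble", ensemble)]:
    print(name, [round(x, 3) for x in prf(preds, gold)])
```

The ensemble reaches perfect recall and still scores lower than any single system, because exact deduplication removes only identical triplets. Near-duplicates with a different spelling, granularity, or relation survive as fresh false positives.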

The business lesson is direct: do not merge clinical extraction systems just because each one has a respectable score. Before ensembling, inspect whether their errors are complementary at the level that matters. Here, the level that matters is not token classification or note-level recall. It is patient-level timeline correctness after normalization.

That is a more expensive evaluation. Naturally, it is also the one that counts.

The hidden middle layer: normalization can rescue or punish the model

The paper repeatedly shows that the timeline score is not a clean mirror of note-level extraction quality. Under thinking, Qwen3-30B-A3B has much stronger note-level micro precision, recall, and F1 than Qwen3-14B and Qwen3-32B, but their official development scores are almost the same. Under prompting, Qwen3-14B gets an unusually high official score despite poor note-level behavior, partly because irregular time expressions are discarded by Timenorm.

This is not a statistical oddity. It is the result of the middle layer.

The pipeline normalizes time expressions using a Timenorm module anchored to each note’s document time. Relative expressions that cannot be normalized are discarded. Events are then deduplicated and aggregated into patient-level timelines using the official script.

That means a model can be punished for extracting a clinically meaningful event with a time expression the normalizer cannot handle. Another model can be accidentally rewarded when its bad outputs are thrown away before scoring. The garbage chute becomes a performance feature. Elegant? No. Operationally real? Absolutely.

The error analysis makes the problem concrete. “Last week” relative to one date may be normalized into a full date, while “next week” relative to another becomes a week-level expression. A month-day expression like “January 9” can be anchored to the wrong year. Surface variants such as il2, il-2, and interleukin-2 can remain as separate entities. Regimen names and component drugs can overlap. Start and end events can be retained differently from the gold timeline.
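
The wrong-year problem is a policy choice hiding inside the normalizer. The sketch below resolves a month-day expression under two hypothetical policies; neither is claimed to be what Timenorm does, and the function and policy names are invented for illustration.

```python
from datetime import date, datetime


def resolve_month_day(expr, doc_time, policy="same_year"):
    """Attach a year to a month-day expression such as 'January 9'.

    policy="same_year": always use the document's year.
    policy="most_recent_past": step back one year if the date would fall after the note.
    February 29 is not handled for non-leap document years.
    """
    # A dummy year keeps parsing unambiguous; only month and day are used.
    parsed = datetime.strptime(expr + " 2000", "%B %d %Y")
    candidate = date(doc_time.year, parsed.month, parsed.day)
    if policy == "most_recent_past" and candidate > doc_time:
        candidate = date(doc_time.year - 1, parsed.month, parsed.day)
    return candidate


doc_time = date(2014, 1, 2)  # hypothetical note date
print(resolve_month_day("January 9", doc_time))                      # 2014-01-09
print(resolve_month_day("January 9", doc_time, "most_recent_past"))  # 2013-01-09
```

For a note written on January 2, the two policies disagree by a full year. Which one is right depends on whether the sentence describes a scheduled dose or a past one, and that information lives in the note, not in the date string.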

For builders, this is the most important architectural lesson in the paper. A clinical LLM extractor is not a standalone product. It is part of a pipeline whose deterministic pieces may be as important as the neural model.

What businesses should actually take from the comparison

The paper directly shows that, in this shared task, a fine-tuned Qwen3-14B note-level extractor plus normalization and aggregation achieved the best official performance. It also shows that thinking mode is competitive without fine-tuning, that dictionary-enhanced extraction can provide high recall and interpretability, that DPO did not surpass SFT in this setup, and that simple ensembling made things worse.

Cognaptus’ business interpretation is broader but still bounded: clinical AI teams should choose extraction strategy based on workflow economics, not leaderboard aesthetics.

Business situation	Likely technical preference	Reason	Boundary
High-volume registry abstraction with recurring note styles	Fine-tuned extraction plus validated aggregation	Annotation cost can be amortized; consistent formats reward adaptation	Requires curated labels and monitoring for documentation drift
Early prototype with limited labels	Thinking-mode extraction	Lower upfront annotation burden and useful error inspection	Higher latency and no guarantee that reasoning matches output
Cost-sensitive screening with strong terminology control	Dictionary + LLM verification	Reduces token volume and improves auditability	Needs active dictionary maintenance and local abbreviation coverage
Safety-critical timeline generation	Hybrid system with human review	The paper reveals multiple failure layers beyond model extraction	The paper does not test live clinician-in-the-loop deployment
Multi-model production stack	Calibrated ensemble only after error analysis	Simple concatenation accumulated errors	Must evaluate at patient-timeline level, not just note-level metrics

This is where the paper becomes more than an academic benchmark. Many organizations want to “use LLMs on EHR notes.” That phrase is almost content-free. The real question is: which layer of the workflow should be learned, which should be rule-based, which should be auditable, and where should humans enter the loop?

The answer will differ by use case. Trial matching may tolerate candidate generation with review. Registry submission may demand stricter standardization. Real-world evidence extraction may care about recall across messy historical records. Care coordination may care about whether the latest treatment status is usable today, not whether every historical event gets an exact-match score.

The paper does not decide those product choices. It gives the comparison needed to make them less theatrical.

The boundary: this is a strong benchmark result, not a hospital deployment study

The limitations are not decorative; they affect how the result should be used.

First, the systems are customized to ChemoTimelines. The task uses three cancer types and a specific evaluation framework. The official test gold labels remain private, so detailed test-set error analysis is not possible.

Second, the model comparison is selective. The authors test Qwen3 variants and MedGemma-27B, but not the full landscape of contemporary open and closed models. MedGemma is the only medicine-specialized model included.

Third, the evaluation is strict and task-specific. Exact triplet matching is useful for benchmarking, but real clinical workflows may care about different error costs. Confusing BEGINS-ON and CONTAINS-1 may matter greatly in one setting and less in another. Missing a discontinued therapy may be critical for safety; duplicating a regimen component may be annoying but fixable.

Fourth, no live deployment is shown. There is no prospective EHR integration, no clinician-in-the-loop validation, no regulatory workflow, and no measurement of operational cost under production load. The paper is about extraction systems under a shared-task benchmark. That is a valuable stage, but not the final stage.

The practical conclusion is therefore disciplined: use the paper to design and compare clinical timeline extraction architectures, not to claim that one model is ready to automate oncology documentation without supervision.

The better mental model: clinical LLMs are pipeline components, not oracle boxes

The cleanest lesson from this paper is that clinical timeline extraction is a systems problem.

Fine-tuning wins because it adapts the model to the annotation style. Thinking helps because it adds inference-time self-checking. Dictionaries help because they reduce search space and expose controllable terminology. DPO is plausible but not decisive here. Ensembling fails because errors do not cancel at the timeline layer. Normalization quietly shapes the final score more than a model-centric reader might expect.

That is the article’s callback: the model is only the first employee in the bureaucracy.

For business leaders, the boring version of this lesson is the profitable one. Do not ask, “Which LLM should we use for clinical notes?” Ask instead:

What exactly must be extracted?
Which mistakes are tolerable, reviewable, or dangerous?
Do we have enough labels to fine-tune?
Can we maintain a domain dictionary?
How expensive is inference-time reasoning?
Does the normalizer handle the hospital’s actual date language?
Are we evaluating note-level outputs or final patient-level timelines?

The paper’s winning score is useful. Its comparison is more useful. It shows that the future of clinical AI will not be built by throwing larger models at unstructured notes and hoping chronology emerges out of politeness.

Chronology needs design.

And in healthcare, design is where the expensive mistakes usually hide.

Cognaptus: Automate the Present, Incubate the Future.

¹ Tianmai M. Zhang, Zhaoyi Sun, Sihang Zeng, Chenxi Li, Neil F. Abernethy, Barbara D. Lam, Fei Xia, and Meliha Yetisgen, “UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction,” arXiv:2512.04518, 2025.
