TL;DR for operators

Belmadani et al. study a question every serious enterprise LLM team eventually meets after the prototype stops looking magical: which adaptation bill is actually worth paying? [1] In French medical question answering, they compare continual pretraining (CPT), supervised fine-tuning (SFT), and CPT followed by SFT across Gemma, Mistral, and Llama-family models, with general, instruction-tuned, and medical initializations.

The headline is not “domain adaptation works.” That would be the polite brochure version. The useful result is more specific: for multiple-choice medical QA, SFT is the strongest cost-effective default when labeled QA data exists. CPT+SFT most often reaches the top score, but the margins over SFT are usually small and frequently not statistically decisive. CPT alone is the awkward guest: expensive, sometimes useful, but rarely persuasive by itself.

The paper also punctures a familiar assumption: starting from a medically pretrained model is not automatically better. Medical initialization alone does not reliably dominate; instruction tuning often matters more, especially when the model must follow structured answer formats rather than merely recognize biomedical-looking words.

For open-ended QA, the results are messier. CPT improves overlap metrics such as ROUGE-L and BERTScore, but the authors’ error analysis shows a verbosity problem: longer answers can score better without necessarily being clinically better. LLM-as-a-Judge evaluation sometimes prefers CPT+SFT or instruction-tuned behavior, but statistically significant improvements are rare. In other words, the free-form answer section is useful, but it is not where procurement should pretend certainty lives.

The business translation is simple: do not make expensive continual pretraining your default ritual because the domain is serious. Serious domains need better evidence, not grander ceremonies. Start with a strong instruction-tuned baseline, run parameter-efficient SFT on high-quality labeled QA data, evaluate against native-language benchmarks, and only add CPT when the marginal gain is worth the compute, carbon, and engineering complexity.

The adaptation budget is the real experiment

The paper is framed as medical LLM adaptation, but its more general value is managerial: it turns “make the model domain-specific” into a set of budget choices.

There are three bills on the table.

First, CPT: continue training the whole model on unlabeled domain text. In this paper, that means using NACHOSsmall, a French medical corpus of roughly 4 GB drawn from medical websites, reports, drug leaflets, health authority sources, theses, clinical cases, and related material. CPT is representation-level adaptation. It asks the model to absorb more domain language before anyone tells it exactly what task to perform.

Second, SFT: train the model on instruction-response examples. Here, the authors use MedInjection-FR, a large French biomedical instruction dataset. The paper’s main setup uses the train and validation portions, containing 543,505 instruction-response pairs, with a mixture dominated by single-answer multiple-choice questions, plus smaller shares of multiple-answer MCQ and open-ended QA. SFT is behavior-level adaptation. It tells the model what kind of answer the task expects.

Third, CPT+SFT: pay both bills, in sequence. Read more medical French, then learn the task.

That comparison matters because enterprises often confuse “domain seriousness” with “pretraining necessity.” Healthcare, finance, law, engineering, compliance: the moment the domain sounds expensive, someone suggests domain pretraining as if it were a protective amulet. The paper’s evidence says: maybe. But first check whether labeled task supervision gets you most of the way for much less money.

The design is unusually useful because the authors do more than compare adaptation methods on one base model. They vary model family, size, and initialization. They include Gemma-4B, Mistral-7B, Llama-7B, and Llama-13B variants. They compare general models, instruction-tuned models, and medically adapted models. They evaluate multiple-choice QA and open-ended QA. They examine constrained decoding and greedy decoding. They also run statistical significance tests, error analyses, translated-benchmark checks, cross-lingual comparisons, and computational cost accounting.

That is the right shape for an enterprise question. Nobody should care whether CPT wins on a single leaderboard under a single model initialization unless their deployment budget is also imaginary.

What the paper directly tests

The core evaluation uses French medical QA from the MedInjection-FR test set, including both native French examples and translated examples from English medical benchmarks. The test suite contains 14,533 native French medical examples and 13,293 translated examples. The authors evaluate multiple-choice questions with single correct answers (MCQU), multiple-choice questions with multiple correct answers (MCQ), and open-ended questions (OEQA).

For MCQA, they use Exact Match for single-answer questions and both Exact Match and Hamming score for multi-answer questions. The Hamming score matters because a multiple-answer question can be partially correct: selecting two correct options out of three is not the same as selecting nonsense, even if Exact Match punishes both. For the main constrained-decoding MCQA setup, the model is restricted to valid answer options. That removes the tedious failure mode where a model produces a paragraph, a disclaimer, or a charming hallucinated fifth option when the task asked for “A, C, D.” Progress, apparently.
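
The gap between the two multi-answer metrics is easier to see in code. The sketch below assumes the common intersection-over-union form of the Hamming score; the paper's exact formula is not reproduced here, and the function names and example option sets are illustrative.

```python
def exact_match(predicted: set[str], gold: set[str]) -> float:
    """Return 1.0 only when the predicted option set equals the gold set."""
    return 1.0 if predicted == gold else 0.0


def hamming_score(predicted: set[str], gold: set[str]) -> float:
    """Intersection-over-union of predicted and gold options (assumed definition)."""
    union = predicted | gold
    if not union:
        return 1.0  # both sets empty: treat as a perfect match
    return len(predicted & gold) / len(union)


gold = {"A", "C", "D"}
partial = {"A", "C"}   # two of the three correct options
nonsense = {"B", "E"}  # no correct option at all

for name, pred in [("partial", partial), ("nonsense", nonsense)]:
    print(name, exact_match(pred, gold), round(hamming_score(pred, gold), 3))
```

Exact Match gives both predictions zero. The Hamming score is the only one of the two that notices the partial answer was mostly right.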

For OEQA, the authors use overlap metrics including ROUGE-L and BERTScore, and also LLM-as-a-Judge evaluation using MedGemma-27B, chosen based on prior physician-agreement work. This is important because open-ended medical answers are harder to score: a concise correct answer and a verbose correct answer may share little surface text, while a verbose mediocre answer can stuff enough vocabulary into the output to look impressive to overlap metrics.

The paper’s experiments can be sorted by their role:

| Test or analysis | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Constrained MCQA results | Main evidence | Which adaptation strategy performs best when answer format is controlled | That the same strategy is safest in free-form clinical use |
| Greedy MCQA appendix | Robustness / sensitivity test | Whether conclusions survive when output is not forced into valid options | That greedy decoding is the preferred deployment mode |
| OEQA automatic metrics | Secondary / exploratory evidence | How adaptation affects free-form generation under lexical and semantic-overlap scoring | That higher overlap means better clinical reasoning |
| LLM-as-a-Judge OEQA | Secondary qualitative check | Whether generated answers look better to a medical judge model | That the answers are clinically safe or physician-equivalent |
| Cross-lingual French-English comparison | Main evidence for RQ3 | Whether French adaptation transfers to English benchmarks | That all languages benefit symmetrically |
| Translated benchmark confidence analysis | Evaluation diagnostic | Translated benchmarks can inflate accuracy and confidence | That native and translated tasks differ only because of translation |
| Probability and near-miss analysis | Mechanism diagnostic | Multi-answer failures are often exact-set failures, not pure uncertainty | That ranking confidence alone solves MCQ |
| PEFT vs full fine-tuning comparison | Implementation detail / preliminary ablation | Why DoRA PEFT is a sensible SFT implementation in this setup | That PEFT always beats full fine-tuning everywhere |
| Contamination study | Exploratory robustness check | No direct evidence of verbatim NACHOS memorization was found | That NACHOS was definitely absent from pretraining |

This classification matters because the paper is broad enough to tempt lazy readers into treating every appendix result as another headline. They are not headlines. The main operational recommendation comes from MCQA, especially constrained MCQA. OEQA and benchmark-translation analyses are there to keep the recommendation honest.

CPT+SFT wins often, but SFT is the default adults should budget for

The most important MCQA result has two layers.

At the leaderboard layer, CPT+SFT most often achieves the best scores. Across the model families and initializations, the combined strategy more frequently reaches the top aggregated Exact Match and top MCQ/MCQU scores than CPT or SFT alone.

At the decision layer, the story changes. When CPT+SFT beats SFT, the margin is usually small. The authors report that when CPT+SFT attains the highest score, its margin over SFT rarely exceeds 1.3 points. When SFT beats CPT+SFT, the gap can be larger: the paper gives Llama-7B Instruct as an example where SFT exceeds CPT+SFT by 3.12 points, and Mistral-7B Instruct, where SFT leads by 1.44 points.

That is the paper’s central business lesson. CPT+SFT may be the maximum-score option, but SFT is the performance-efficiency default. The difference between those two phrases is where budgets go to die.

The cost table makes the point less philosophical. For representative 7B configurations, CPT costs about $1,216.42, SFT costs about $361.12, and CPT+SFT is the sum: about $1,577.54. The main text summarizes the 7B comparison as more than $1,500 for CPT+SFT versus about $360 for SFT, with a much larger emissions footprint. For 13B models, the contrast becomes more severe: CPT is listed at about $6,082.08, SFT at about $1,391.27, and CPT+SFT again means paying both.

| Model size | CPT cost | SFT cost | CPT+SFT implied cost | Operational reading |
|---|---|---|---|---|
| 4B | $1,824.62 | $832.48 | $2,657.10 | SFT is cheaper, though not tiny |
| 7B | $1,216.42 | $361.12 | $1,577.54 | SFT is the obvious first experiment |
| 13B | $6,082.08 | $1,391.27 | $7,473.35 | CPT+SFT needs a very good reason |
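
The arithmetic behind the table can be made explicit. This is a minimal sketch: the dollar figures are the ones quoted above, while the helper names and the 1.3-point gain passed in at the end are illustrative inputs, not results from the paper.

```python
# Costs in USD as quoted in the table above.
COSTS = {
    "4B": {"cpt": 1824.62, "sft": 832.48},
    "7B": {"cpt": 1216.42, "sft": 361.12},
    "13B": {"cpt": 6082.08, "sft": 1391.27},
}


def combined_cost(size: str) -> float:
    """CPT+SFT is priced as the sum of the two stages."""
    c = COSTS[size]
    return round(c["cpt"] + c["sft"], 2)


def cost_multiple(size: str) -> float:
    """How many times more CPT+SFT costs than SFT alone."""
    return combined_cost(size) / COSTS[size]["sft"]


def extra_cost_per_point(size: str, gain_points: float) -> float:
    """Extra dollars per point gained over SFT; the gain is a caller-supplied input."""
    if gain_points <= 0:
        return float("inf")  # paying more for no gain
    return COSTS[size]["cpt"] / gain_points


for size in COSTS:
    print(size, combined_cost(size), round(cost_multiple(size), 2))

# Hypothetical input: a 1.3-point gain, the margin the text says is rarely exceeded.
print(round(extra_cost_per_point("7B", 1.3), 2))
```

Framing the upgrade as dollars per point is a design choice, not the paper's metric. It simply forces the gain and the bill into the same sentence.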

The authors’ PEFT setup also matters. SFT uses DoRA, a parameter-efficient method. A preliminary comparison on FrenchMedMCQA shows DoRA outperforming LoRA, VeRA, and full fine-tuning in that setup: DoRA reaches 0.2435 Exact Match with 0.602% trainable parameters, while full fine-tuning reaches 0.1121 with 100% of parameters trainable. This is not a universal law of nature. It is an implementation justification: for this study, SFT is not “cheap because we undertrained it.” It is cheap because the adaptation method is deliberately efficient and empirically adequate.

The mistake would be to summarize this as “CPT+SFT is best.” That is technically true often enough to be misleading. The operator’s version is sharper: run SFT first; make CPT+SFT justify itself with a statistically credible and business-relevant gain.
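
One way to make "statistically credible and business-relevant" concrete is a paired bootstrap over per-question correctness. This is a generic technique swapped in for illustration; the text above does not specify which significance test the authors used, and the data, the function names, and the `min_gain` threshold are all hypothetical.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy difference (A minus B) on shared questions."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same questions")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample questions with replacement
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def worth_upgrading(correct_a, correct_b, min_gain=0.01):
    """True only if the whole interval sits above a business-chosen minimum gain (a fraction)."""
    lo, _ = paired_bootstrap_ci(correct_a, correct_b)
    return lo > min_gain


# Hypothetical per-question correctness (1 = right, 0 = wrong) on 200 shared questions.
sft = [1] * 120 + [0] * 80
cpt_sft = [1] * 122 + [0] * 78  # two extra correct answers, a 1-point gain

print(paired_bootstrap_ci(cpt_sft, sft))
print(worth_upgrading(cpt_sft, sft))
```

The decision rule checks the lower bound of the interval rather than the point estimate, so a gain that could plausibly be noise does not trigger the more expensive pipeline.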

CPT alone has the résumé of a specialist and the habits of a consultant

CPT is intuitively attractive. If the model will answer medical questions, surely it should read more medical text. That is the white-coat fallacy: domain exposure looks like competence because the vocabulary improves.

The paper is not kind to that assumption.

For MCQA, CPT alone is unstable. It can improve performance in some cases, but it can also degrade performance relative to the base model. The authors report that CPT is the strategy that most often fails to produce statistically significant improvements over the base model. The configurations where it fails include MedGemma, all Llama-13B variants, Llama-7B General and Instruct, and Mistral-7B General and Medical.

The mechanism is not mysterious. CPT teaches the model more about domain language distribution. It does not necessarily teach the model to answer a structured medical QA prompt, choose exactly one option, return multiple correct letters, or obey the evaluation format. It changes internal representations; it does not guarantee task behavior.

This is why CPT becomes more useful when paired with SFT. SFT supplies the task interface. CPT may improve the underlying domain substrate, but SFT teaches the model what the business process actually wants. In operational language: CPT can improve the warehouse; SFT trains the picker. If the picker still grabs the wrong item, nobody cares that the warehouse shelves are medically themed.

For companies, the implication is uncomfortable but useful. Unlabeled domain data is often easier to obtain than labeled task data, especially in regulated sectors. But easier data does not automatically create the right behavior. CPT-only programs can become expensive consolation prizes: impressive training runs whose practical value appears mainly in controlled metrics or vague claims of domain fluency.

Medical initialization is not a free competence badge

The paper also compares starting points: general models, instruction-tuned models, and medically adapted models. This is where another expensive assumption takes a beating.

A medically initialized model sounds like the natural choice. It has seen biomedical material. It carries the right branding. It may even have a reassuring name. Excellent. Put the badge on the wall. The results still need to work.

For MCQA, initialization effects vary by metric and task format. Instruction-tuned models dominate the more demanding exact-match settings for multiple-answer MCQ. The authors report that the highest MCQ Exact Match and aggregated MCQA Exact Match are achieved by Llama-13B Instruct, while the best MCQ Hamming score is achieved by Gemma-4B Instruct. For MCQ Exact Match, the best score within each family always comes from an instruction-tuned variant.

For single-answer MCQU, the pattern is less instruction-dominated. General models most frequently achieve the best performance, followed by medical models, while instruction-tuned models do not consistently lead. The authors interpret this as a task-format difference: selecting one plausible answer may depend more on answer ranking and model knowledge, while exact multi-label output needs stronger instruction alignment.

That distinction is useful. In business deployments, “QA” is not one task. A model that ranks a single answer well may fail when asked to emit a precise set, follow a schema, or satisfy a routing rule. Domain knowledge and output discipline are different capabilities. Medical pretraining may help one; instruction tuning may help the other.

For OEQA, medical initialization is even less glamorous. The paper reports that medical models never achieve the top ROUGE-L score, and LLM-as-a-Judge comparisons show that when differences are significant, medical models are consistently outperformed across families, especially under SFT and CPT. The safe interpretation is not “medical models are bad.” It is narrower and more useful: medical initialization alone does not reliably improve downstream French medical QA, and in this evaluation it is not a substitute for instruction alignment or task-specific adaptation.

The business rule is therefore: do not buy the label; test the behavior. A medically pretrained base model may be a good ingredient. It is not a deployment strategy.

Constrained decoding hides one problem and reveals another

The paper evaluates MCQA under constrained decoding and greedy decoding. This is not a trivial implementation detail. It changes what kind of failure the model is allowed to show.

Under constrained decoding, the model can only choose valid answer options. That is appropriate for evaluating the underlying ranking among options, and it prevents format errors from contaminating the score. Under greedy decoding, the model must produce the answer format on its own. That tests instruction-following and output discipline more directly.
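
The difference between the two modes can be shown with a toy next-token table. This is a simplified sketch for a single-answer question: the scores and token names are invented, and a real constrained decoder operates on model logits rather than a hand-written dictionary.

```python
def greedy_pick(token_scores: dict[str, float]) -> str:
    """Greedy decoding: take the highest-scoring next token, whatever it is."""
    return max(token_scores, key=token_scores.get)


def constrained_pick(token_scores: dict[str, float], valid_options: list[str]) -> str:
    """Constrained decoding: take the highest-scoring token among valid options only."""
    allowed = {tok: s for tok, s in token_scores.items() if tok in valid_options}
    if not allowed:
        raise ValueError("no valid option has a score")
    return max(allowed, key=allowed.get)


# Hypothetical next-token scores after a single-answer question prompt.
scores = {"The": 2.1, "B": 1.7, "A": 0.4, "C": -0.3, "D": -1.0, "E": -1.2}
options = ["A", "B", "C", "D", "E"]

print(greedy_pick(scores))                # a prose token: a format failure
print(constrained_pick(scores, options))  # the best-ranked valid option
```

The same model yields a format error in one mode and a scoreable answer in the other, which is why the two evaluations measure different things.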

The appendix shows that under greedy decoding, instruction-tuned models become much more important. The authors conclude that for greedy medical MCQA, the best configuration is to start from an instruction-tuned model and apply SFT only. CPT and CPT+SFT offer no consistent benefit in that setting and can be avoided unless constrained decoding is explicitly required.

This gives operators a clean deployment distinction.

If your system controls output through a classifier head, constrained decoder, tool wrapper, schema validator, or post-processor, then the constrained MCQA results are closer to your operating environment. In that case, SFT remains the sensible default, and CPT+SFT is a premium upgrade only if it clears a meaningful margin.

If your model must generate the answer directly, instruction tuning and SFT become more important. The model needs not only to know the answer, but to express it in the correct shape. This is where many enterprise LLM evaluations quietly cheat: they score knowledge while deploying generation.

The paper’s error analysis explains why the multi-answer case is stubborn. MCQ predictions are not simply more uncertain than MCQU predictions. In fact, MCQ can show lower entropy and higher maximum probability, meaning the model is locally confident. The problem is exact set construction. The model may rank the right options near the top but still omit one or include an extra one. The near-miss rates in the Mistral analysis remain relatively stable across adaptation strategies, indicating that stronger confidence does not automatically become exact multi-label correctness.
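
A near-miss is easy to reproduce on paper. In the sketch below, the per-option probabilities, the 0.5 selection threshold, and the definition of a near-miss as "off by exactly one option" are all assumptions made for illustration; the paper's own definitions may differ.

```python
def select_options(probs: dict[str, float], threshold: float = 0.5) -> set[str]:
    """Turn per-option probabilities into a predicted set (thresholding is an assumption)."""
    return {opt for opt, p in probs.items() if p >= threshold}


def classify_prediction(predicted: set[str], gold: set[str]) -> str:
    """Label a multi-answer prediction as exact, near_miss, or miss."""
    if predicted == gold:
        return "exact"
    # The symmetric difference counts omitted options plus extra options.
    if len(predicted ^ gold) == 1:
        return "near_miss"
    return "miss"


def gold_ranked_on_top(probs: dict[str, float], gold: set[str]) -> bool:
    """True when the top-|gold| options by probability are exactly the gold options."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return set(ranked[: len(gold)]) == gold


# Hypothetical independent per-option probabilities for one multi-answer question.
probs = {"A": 0.93, "C": 0.88, "D": 0.46, "B": 0.07, "E": 0.03}
gold = {"A", "C", "D"}

predicted = select_options(probs)
print(sorted(predicted), classify_prediction(predicted, gold), gold_ranked_on_top(probs, gold))
```

Here the ranking is perfect and the model is confident about A and C, yet the emitted set still fails Exact Match because D falls just under the cut.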

That is operationally important. If the task requires “select every applicable contraindication,” near-misses are not adorable. They are errors wearing a slightly better suit.

Open-ended QA is where metrics start negotiating with verbosity

The paper includes OEQA because medical QA is not only multiple choice. That is the correct instinct. It is also where interpretation becomes more fragile.

On overlap-based metrics, CPT looks helpful. It improves ROUGE-L and BERTScore-F1 across many configurations, especially for Mistral and Llama families. This suggests that domain-adaptive language modeling can help generate more medically relevant text.

But the paper does not stop there, which is fortunate. The authors analyze answer lengths and find a verbosity bias. CPT-adapted models systematically produce longer responses, with higher mean and median word counts across model families. Longer answers can cover more terms and phrases, which helps ROUGE-L and BERTScore. That does not necessarily mean they are more correct.
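
A basic verbosity check needs nothing more than word counts and a correlation. The sketch below is not the authors' analysis; the answers, lengths, and overlap scores are invented placeholders, and a strong correlation would only flag a length effect worth investigating, not prove one.

```python
from statistics import mean, median


def length_stats(answers: list[str]) -> dict[str, float]:
    """Mean and median word counts for a list of generated answers."""
    counts = [len(a.split()) for a in answers]
    return {"mean": mean(counts), "median": median(counts)}


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


# Placeholder outputs standing in for two adaptation strategies.
short_answers = ["one two three", "one two three four five"]
long_answers = ["one two three four five six seven eight", "one two three four five six"]
print(length_stats(short_answers), length_stats(long_answers))

# Hypothetical answer lengths paired with hypothetical overlap scores.
lengths = [5, 12, 20, 31]
overlap_scores = [0.20, 0.26, 0.31, 0.38]
print(round(pearson(lengths, overlap_scores), 3))
```

If score tracks length this closely on real outputs, comparing systems within matched length buckets is a reasonable next step before trusting the metric.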

Instruction-tuned models, especially under SFT, often generate shorter and more controlled answers. These shorter answers can be penalized by overlap metrics while receiving better LLM-as-a-Judge scores. The judge appears to prefer concise, controlled answers in some settings. This creates a familiar evaluation comedy: one metric rewards saying more; another rewards not rambling. Both may be partially right, and neither is a physician signing off on clinical correctness.

The authors report that CPT yields statistically significant gains in only three OEQA cases and SFT in two, while CPT+SFT never yields statistically significant gains over the base model in OEQA. That should calm everyone down. OEQA evidence is useful for diagnosis, not for sweeping claims.

For business users, OEQA should be evaluated with stronger task-specific rubrics. A medical answer is not good because it has high lexical overlap, nor because one judge model likes it. It is good because it is clinically correct, complete enough for the use case, appropriately cautious, traceable to evidence where needed, and formatted for the workflow. The paper gestures toward this by using LLM-as-a-Judge based on prior physician-agreement evaluation, but it does not turn OEQA into a solved procurement metric. Nor should it pretend to.

French adaptation can transfer to English, but language advantage is model-family specific

The cross-lingual section tests whether French medical adaptation improves English benchmark performance. This is not a side curiosity. For multilingual organizations, the question is whether investment in one language creates reusable capability or just a local patch.

The answer is: sometimes broadly, but not symmetrically.

Mistral models perform better on translated French benchmarks than on the original English data, and this pattern persists after adaptation. The authors interpret this as evidence that Mistral encodes French more effectively in this setting. Gemma and Llama show the opposite pattern: they generally perform better on native English benchmarks, and adaptation gains are often larger in English than in French, even though the adaptation data is French.

That is a useful warning. Cross-lingual transfer is real here, but it rides on the model’s pre-existing language geometry. French medical data can improve English benchmark performance, but the size and direction of gains depend on the model family. The adaptation data is not the only language signal in the system; the base model’s pretraining mixture and multilingual capacity are already steering the result.

For enterprise deployment, this means multilingual adaptation should be treated as portfolio design, not translation plumbing. If the organization supports French, English, Arabic, Spanish, and Tagalog medical workflows, it should not assume that one fine-tuning run transfers equally. It should test native-language performance, translated-benchmark performance, and calibration behavior separately. Convenient symmetry is not a strategy. It is a spreadsheet wish.

Translated benchmarks make models look better and more confident

One of the most practically valuable parts of the paper is the translated-benchmark analysis.

The authors compare behavior on a native French benchmark, MediQAl, and a translated benchmark, MedMCQA. Both are MCQU datasets of comparable size. The result is clear: all models achieve higher Exact Match on the translated benchmark. That higher accuracy is systematically accompanied by lower predictive entropy, meaning models are also more confident. Even worse, the paper reports that models often become more confident when wrong on translated benchmarks.
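
Confident errors can be measured directly. The sketch below computes Shannon entropy over answer-option probabilities and averages it over wrong answers only; the record layout (`probs`, `correct`) and every number in it are hypothetical, not values from the paper.

```python
import math


def predictive_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a distribution over answer options."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy_when_wrong(records: list[dict]) -> float:
    """Average entropy over wrongly answered questions; lower means more confident errors."""
    wrong = [predictive_entropy(r["probs"]) for r in records if not r["correct"]]
    if not wrong:
        raise ValueError("no wrong answers to average over")
    return sum(wrong) / len(wrong)


# Hypothetical records: option probabilities plus whether the top option was correct.
native = [
    {"probs": [0.40, 0.30, 0.20, 0.10], "correct": False},
    {"probs": [0.70, 0.10, 0.10, 0.10], "correct": True},
]
translated = [
    {"probs": [0.85, 0.05, 0.05, 0.05], "correct": False},
    {"probs": [0.90, 0.04, 0.03, 0.03], "correct": True},
]

print(round(mean_entropy_when_wrong(native), 3))
print(round(mean_entropy_when_wrong(translated), 3))
```

A lower figure on the translated set would mean the model is surer of itself exactly when it is wrong, which is the pattern the paper warns about.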

This is a serious evaluation problem. A translated benchmark is not just a native benchmark in another outfit. It may carry artifacts from the original data, translation process, answer style, or benchmark construction. Models can find it easier, and their confidence can become less trustworthy.

The business consequence is blunt: translated benchmarks can make a model look production-ready before native users get the privilege of discovering otherwise. If an organization operates in a non-English domain, native-language evaluation is not a nice-to-have. It is the part where the test stops flattering the vendor.

A sensible evaluation suite should include:

| Evaluation item | Why it belongs |
|---|---|
| Native-language benchmark | Measures performance in the actual linguistic environment |
| Translated benchmark | Supports comparison with established English tasks but should not stand alone |
| Confidence and entropy analysis | Detects overconfidence, especially on translated data |
| Format adherence checks | Separates knowing the answer from producing usable output |
| Human or expert rubric sampling | Catches clinical, legal, or domain-specific failures metrics miss |
| Length-controlled OEQA review | Prevents verbosity from masquerading as quality |

This is not bureaucratic caution. It is how one avoids deploying a model that is fluent in benchmark artifacts.
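
Of the items in that suite, format adherence is the cheapest to automate. The sketch below assumes the target format is a comma-separated list of letters from A to E, which is an illustrative choice rather than the paper's specification; the sample outputs are invented.

```python
import re
from typing import Optional

# Accepts answers such as "A", "A, C, D" or "a,c" and nothing else (assumed target format).
ANSWER_PATTERN = re.compile(r"^\s*[A-E](\s*,\s*[A-E])*\s*$", re.IGNORECASE)


def parse_answer(raw: str) -> Optional[set[str]]:
    """Return the set of selected letters, or None when the output breaks the format."""
    if not ANSWER_PATTERN.match(raw):
        return None
    return {part.strip().upper() for part in raw.split(",")}


def format_adherence_rate(outputs: list[str]) -> float:
    """Share of raw model outputs that parse as a valid answer."""
    if not outputs:
        return 0.0
    return sum(parse_answer(o) is not None for o in outputs) / len(outputs)


# Hypothetical raw generations from a model under greedy decoding.
outputs = ["A, C, D", "b", "The correct answers are A and C.", "A, F", ""]
print([parse_answer(o) for o in outputs])
print(format_adherence_rate(outputs))
```

Reporting this rate next to accuracy keeps "knew the answer" and "produced usable output" as separate numbers.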

The practical decision framework

The paper’s conclusion already gives guidelines by data availability. Cognaptus can translate those into an enterprise decision framework.

| Situation | Recommended first move | What to avoid | When to upgrade |
|---|---|---|---|
| Only unlabeled domain text is available | Treat CPT as exploratory; expect modest and unstable MCQA gains | Presenting CPT-only as production adaptation | Upgrade when labeled task data becomes available |
| Labeled QA data is available | Run parameter-efficient SFT first | Starting with CPT because the domain sounds important | Add CPT only if SFT leaves measurable, valuable gaps |
| Both unlabeled text and labeled QA data are available | Compare SFT against CPT+SFT under statistical testing | Paying for CPT+SFT when gains are tiny or insignificant | Use CPT+SFT when marginal score gains justify compute and complexity |
| Output must be directly generated | Start from instruction-tuned models and SFT | Evaluating only constrained decoding | Add output validators or constrained generation |
| Open-ended answers matter | Use multiple metrics plus expert review | Trusting ROUGE-L, BERTScore, or judge scores alone | Add clinical rubric scoring and length controls |
| Non-English deployment matters | Use native benchmarks | Relying on translated benchmarks | Add calibration and confidence audits |

The deeper rule is that adaptation strategy should follow failure diagnosis.

If the model knows the domain but fails the format, SFT and instruction alignment are the likely fix. If the model lacks domain vocabulary or background exposure, CPT may help, especially for generation. If the model ranks correct options but fails exact set output, adaptation alone may not solve it; decoding constraints, multi-label calibration, or task-specific selection mechanisms may be needed. If the model performs well only on translated benchmarks, the issue is evaluation validity, not adaptation glamour.
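
The diagnosis-first rule can be written down as a small lookup. This is only an encoding of the paragraph above: the `Diagnosis` fields and the wording of each recommendation are hypothetical labels, and real triage would rest on measured failure rates rather than four booleans.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    knows_domain: bool               # answers are right in substance when format is ignored
    follows_format: bool             # output matches the required answer shape
    exact_set_ok: bool               # multi-answer sets are exactly right, not near-misses
    native_matches_translated: bool  # native-benchmark scores hold up against translated ones


def recommend(d: Diagnosis) -> list[str]:
    """Map observed failures to the first interventions suggested in the text."""
    steps = []
    if not d.native_matches_translated:
        steps.append("fix evaluation: add native-language benchmarks")
    if d.knows_domain and not d.follows_format:
        steps.append("SFT and instruction alignment")
    if not d.knows_domain:
        steps.append("consider CPT")
    if d.follows_format and not d.exact_set_ok:
        steps.append("decoding constraints or multi-label calibration")
    return steps or ["no adaptation change indicated"]


print(recommend(Diagnosis(True, False, True, True)))
print(recommend(Diagnosis(True, True, False, False)))
```

Evaluation validity is checked first on purpose: if the benchmark is flattering the model, every later diagnosis inherits the flattery.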

The paper is valuable because it makes these distinctions visible. It does not let “medical” become one undifferentiated excuse for larger training runs.

What Cognaptus infers, and what the paper does not show

The paper directly shows that, in this French medical QA setting, SFT is the most practical default for MCQA when labeled QA data exists. It also directly shows that CPT+SFT often reaches the highest MCQA scores, but with small, sometimes statistically uncertain gains over SFT. It directly shows that CPT alone is unstable for MCQA, that medical initialization does not reliably dominate, that translated benchmarks inflate performance and confidence, and that OEQA metrics are sensitive to answer length.

Cognaptus infers that enterprise LLM adaptation should be treated as a staged investment decision. Start with the cheapest intervention likely to fix the observed failure. Do not pretrain because the domain has a regulatory department. Do not buy a medical base model and assume the workflow is solved. Do not trust translated benchmarks without native checks. Do not use open-ended overlap metrics as if they were clinical QA.

What remains uncertain is also important.

The study is specific to French medical QA. Other languages, domains, and resource regimes may behave differently. The experiments are zero-shot; few-shot prompting is excluded to preserve controlled comparison. Reinforcement learning, preference optimization, and reward-driven clinical alignment are not tested. The cost analysis prices compute and emissions, not the human labor required to create high-quality instruction data. The OEQA evaluation does not fully establish clinical correctness, semantic equivalence, or reasoning validity. The contamination study finds no direct evidence of verbatim NACHOS memorization, but it is inconclusive because reliable non-member biomedical controls are hard to construct.

These boundaries do not weaken the paper’s main message. They prevent people from turning it into a slogan, which is usually where useful research goes to become a sales deck.

The white coat still needs a task

The most tempting story in domain adaptation is that more domain exposure produces better domain intelligence. Sometimes it does. But this paper shows why that story is too coarse for operational work.

CPT gives the model more domain language. SFT gives it task behavior. Instruction tuning gives it output discipline. Constrained decoding changes the failure surface. Native benchmarks change the evaluation truth. Open-ended metrics reward behaviors that may or may not be clinically desirable. Medical initialization provides a starting point, not a guarantee.

The result is not anti-CPT. It is anti-ritual. CPT can be useful, especially when paired with SFT or when generation quality genuinely depends on richer domain language modeling. But for French medical multiple-choice QA, parameter-efficient SFT is the first serious move, not the cheap compromise.

The white coat is not the treatment. It is wardrobe. The treatment is matching the adaptation method to the actual failure, then making the improvement pay rent.

Cognaptus: Automate the Present, Incubate the Future.


  1. Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, and Benoit Favre, “Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA,” arXiv:2606.19266, 2026, https://arxiv.org/abs/2606.19266