Heart of Scale: Why Bigger ECG Models Don’t Always Beat Better Biases

A hospital does not buy an ECG model because it enjoys leaderboard furniture. It buys one because somebody wants a cheap, reliable signal from a noisy waveform: rhythm abnormality, structural heart disease, ICU risk, mortality risk, maybe a demographic or physiological clue that was not explicitly labeled during pre-training.

The usual AI procurement instinct is simple: bigger model, more data, more labels, better result. It is tidy. It is also the sort of tidy belief that survives mainly because it has not met enough out-of-distribution tests.

The paper “How Do Electrocardiogram Models Scale?” asks a narrow but commercially important question: when ECG models get larger, when pre-training datasets get larger, and when supervised label sets get larger, what exactly improves?¹ The answer is not “scale wins.” The answer is more useful and less comforting: ECG scaling depends on the coupling between architecture and pre-training paradigm.

That phrase sounds academic, so let us make it operational. In this paper, architecture mostly means ResNet versus Transformer. Pre-training paradigm means supervised learning (SL) versus self-supervised learning (SSL). The paper’s central result is that these two choices do not merely add independent performance points. They change the economics of scaling.

A ResNet can be more parameter-efficient for ECG transfer because its convolutional inductive bias fits local waveform morphology. SSL can be more data- and transfer-efficient because it does not force the representation to collapse around the source labels. Supervised label scaling can improve in-distribution performance, but once the target task moves away from the source labels, the payoff becomes much less obedient.

That is the mechanism. The rest of the article explains why it matters.

The paper is not asking whether foundation models help; it is asking why scaling looks inconsistent

The background problem is familiar in medical AI. ECG foundation models have become larger and more varied. Some use convolutional backbones. Some use Transformers. Some are trained with diagnostic labels. Others use masked reconstruction, contrastive learning, or hybrid self-supervised objectives.

Existing comparisons therefore mix several moving parts. If one model is larger, trained with SSL, built as a Transformer, and evaluated on a new disease task, while another is smaller, supervised, and convolutional, the result may be interesting but not diagnostic. It tells us who won that race, not which wheel mattered.

The authors argue that this is why the ECG foundation-model literature has produced a slightly awkward picture. Compact models sometimes match or beat much larger ones. SSL sometimes helps strongly, but previous comparisons often used SL baselines that were much smaller. Larger datasets sometimes help, but not uniformly. The field is not exactly confused; it has been running multi-variable experiments and then acting surprised when the variables interact.

This paper’s design is valuable because it separates the variables. The authors pre-train more than 120 ECG models, ranging from 20K to 200M parameters, mainly on the CODE dataset with 2.3M records from 1.6M patients. They also use MIMIC-IV ECG, with 0.8M records and 927 labels, to study supervised label scaling. The core comparison covers four model families:

Architecture	Paradigm	Implementation role in the paper	What it isolates
ResNet	SL	A supervised convolutional ECG model	Local waveform bias plus label-driven learning
ResNet	SSL	Contrastive multi-segment coding	Local waveform bias plus label-free representation learning
Transformer	SL	HeartLang-style classifier	Tokenized long-range modeling plus label-driven learning
Transformer	SSL	HeartLang-style masked autoencoding	Tokenized long-range modeling plus label-free reconstruction

The evaluation separates in-distribution (ID) performance on held-out CODE from out-of-distribution (OOD) transfer across 11 datasets, including CPSC2018, CSN, EchoNext, PTB-XL subsets, and MIMIC-IV-derived tasks such as sex, ICU admission, and mortality. These tasks cover adult ECG interpretation, structural heart disease detection, acute-care prediction, and patient characteristics.

That variety matters. If a model only performs well on source-like rhythm labels, it may be useful, but it is not a general ECG foundation model. If it transfers to structural heart disease or patient-characteristic tasks, it is learning something less tied to the original label vocabulary. That is where the scaling story becomes interesting.

The first mechanism: supervised ECG models hit a model-size ceiling before they hit a data ceiling

The first result looks at ID scaling. In simple terms, the authors fit loss as a function of model size and pre-training dataset size. The standard form is:

$$ L(N, D) = E + A \cdot N^{-\alpha} + B \cdot D^{-\beta} $$

Here, $N$ is model size, $D$ is pre-training data size, $E$ is the irreducible loss floor, and $\alpha$ and $\beta$ measure how efficiently loss falls as model size or data size increases.

For supervised models on the ID CODE task, the model-size curves saturate quickly. The paper reports ID parameter-scaling exponents of $\alpha_{ID}=0.943$ for Transformer-SL and $0.912$ for ResNet-SL, and the curves flatten beyond roughly $10^6$ parameters as the loss approaches the estimated floor. The interpretation is not that supervised models are bad. It is that, for the source diagnostic labels, moderate capacity can already absorb much of the label structure.

The data curves tell the other half. For SL models, data scaling remains far from saturation, with $\beta_{ID}=0.129$ for Transformer-SL and $0.39$ for ResNet-SL. In business language: once the supervised ECG model is large enough for the available label task, more parameters become a less attractive purchase. More relevant data may still matter.
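To make the saturation argument concrete, here is a minimal Python sketch of the scaling form above. The exponents are the ones quoted in this article; the function names and the idea of reporting "fraction of a term remaining after a 10× scale-up" are illustrative choices, not something taken from the paper. The constants $E$, $A$, and $B$ are not reported here, so the sketch only compares how fast each reducible term shrinks.

```python
# Exponents as quoted in the article for the in-distribution CODE task.
ID_EXPONENTS = {
    "Transformer-SL": {"alpha": 0.943, "beta": 0.129},
    "ResNet-SL": {"alpha": 0.912, "beta": 0.39},
}


def scaling_loss(n_params, n_records, floor, a, alpha, b, beta):
    """Evaluate L(N, D) = E + A * N^-alpha + B * D^-beta.

    floor, a, and b are hypothetical inputs; the article does not report them.
    """
    return floor + a * n_params ** -alpha + b * n_records ** -beta


def remaining_fraction(exponent, scale_factor=10.0):
    """Fraction of one power-law term left after scaling its input by scale_factor."""
    return scale_factor ** -exponent


if __name__ == "__main__":
    for name, exps in ID_EXPONENTS.items():
        model_left = remaining_fraction(exps["alpha"])
        data_left = remaining_fraction(exps["beta"])
        print(
            f"{name}: 10x parameters leaves {model_left:.0%} of the model term, "
            f"10x data leaves {data_left:.0%} of the data term"
        )
```

With these exponents, each 10× increase in parameters removes close to 90% of whatever model-size term is left, so that term reaches the floor quickly. The data term shrinks far more slowly, which is the arithmetic behind "more relevant data may still matter."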

SSL behaves differently. The paper reports no visible saturation for SSL across the observed model and data ranges. That is not magic; it is objective design. SSL tasks such as masked reconstruction or contrastive learning are broader than predicting a finite diagnostic label set. They continue to reward capacity and data because there is more structure to learn.

So the first correction is simple:

Reader belief	Correction from the paper	Operational meaning
“Bigger supervised ECG models should keep improving.”	SL models saturate on ID parameter scaling around moderate capacity.	Do not spend compute on larger supervised backbones unless the source task, label space, or deployment target justifies it.
“If SSL helps, it is just because it is fashionable.”	SSL remains unsaturated across model and data scales in the observed range.	SSL can be a better scaling target when representation breadth matters.
“More labels solve the same problem as SSL.”	Label scaling helps ID, but OOD payoff depends on architecture and label composition.	Labels are useful, but they are not a universal substitute for representation learning.

This is the first mechanism: supervised ECG learning can become label-task-limited, while self-supervised learning keeps exposing representational work for larger models and datasets.

The second mechanism: ResNet buys parameter efficiency; SSL buys data and transfer efficiency

The OOD results are the paper’s real business payload.

In OOD evaluation, all scaling exponents are positive. Increasing model size or pre-training data generally improves transfer. So this is not an anti-scaling paper. The point is subtler: not all forms of scaling buy the same unit of improvement.

For parameter scaling, ResNet has the advantage. The authors find that ResNet models have mean parameter-scaling exponents $\alpha$ that are 1.3× larger than Transformers under SL and 2.5× larger under SSL. They translate this into a striking comparison: to match the excess-loss reduction achieved by a 10-fold increase in ResNet parameters, a Transformer would require a 19-fold increase in the SL setup and a 288-fold increase in the SSL setup.

That is not a small efficiency gap. That is the sort of gap that turns “let’s use a Transformer because the slide deck looks modern” into an expensive sentence.
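The fold-increase comparison follows from the power-law form. If excess loss falls as $N^{-\alpha}$, a Transformer matches a ResNet's $k_R$-fold scale-up when $k_T^{-\alpha_T} = k_R^{-\alpha_R}$, that is, $k_T = k_R^{\alpha_R/\alpha_T}$. The sketch below assumes exactly that pure power-law relationship; the function names are hypothetical. Plugging in the rounded ratios of 1.3 and 2.5 gives roughly 20-fold and 316-fold, so the reported 19-fold and 288-fold figures presumably come from unrounded exponents (ratios near 1.28 and 2.46).

```python
import math


def equivalent_scale_factor(reference_factor, exponent_ratio):
    """Scale-up needed by the slower-scaling model to match the faster one.

    exponent_ratio is alpha_fast / alpha_slow (e.g. ResNet alpha / Transformer alpha).
    Assumes excess loss follows a pure power law in parameter count.
    """
    return reference_factor ** exponent_ratio


def implied_exponent_ratio(reference_factor, slower_factor):
    """Invert the relation: which exponent ratio makes the two scale-ups equivalent?"""
    return math.log(slower_factor) / math.log(reference_factor)


if __name__ == "__main__":
    for setup, ratio in [("SL", 1.3), ("SSL", 2.5)]:
        factor = equivalent_scale_factor(10.0, ratio)
        print(f"{setup}: rounded ratio {ratio} implies a {factor:.1f}-fold Transformer scale-up")
    for setup, reported in [("SL", 19.0), ("SSL", 288.0)]:
        ratio = implied_exponent_ratio(10.0, reported)
        print(f"{setup}: reported {reported:.0f}-fold implies an exponent ratio of {ratio:.2f}")
```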

For data scaling, the paradigm gap dominates. Under SSL, the mean data-scaling exponent $\beta$ is 16× larger than the SL counterpart for Transformers and 4× larger for ResNets. In other words, SSL makes additional pre-training data more useful for OOD transfer. This is exactly where many ECG deployments live: new hospital, new patient mix, different label definitions, different downstream question.

The mechanism is not hard to see. ResNets encode a bias that suits ECG morphology: local patterns, rhythm structure, waveform features, and temporal regularities. Transformers can model long-range dependencies, but in this study they need more scale before that flexibility pays off. SSL, meanwhile, learns representations not restricted to the source diagnostic labels. It can preserve information that supervised training might treat as irrelevant.

That gives a practical four-cell map:

Choice	Main scaling strength shown in the paper	Business interpretation	Boundary
ResNet-SL	Strong parameter efficiency and strong absolute OOD performance on many adult ECG tasks	Often a cost-effective baseline when target tasks resemble source labels	Less compelling when transfer target is far from supervised label space
ResNet-SSL	Robust low-data and OOD behavior, especially where labels are limited	Good candidate for representation reuse under constrained annotation budgets	Still not automatically dominant over ResNet-SL when large relevant labels are available
Transformer-SL	Weaker OOD scaling; vulnerable to supervised bottlenecks	Risky as a default scaling path for ECG transfer	May still work for particular label-rich, source-aligned settings
Transformer-SSL	Stronger data and transfer scaling; can win at very large model sizes	Attractive if compute budget supports large models and target tasks are farther from source labels	Less parameter-efficient; small and medium deployments may not see the payoff

The key word is allocation. A buyer or builder should not ask only, “Which model is best?” A better question is: “What kind of scaling budget do we actually have: parameters, unlabeled ECGs, labels, fine-tuning data, or deployment validation time?”

Different budgets point to different models.

The third mechanism: in-distribution improvement is a weak proxy for clinical transfer

The paper also uses loss-to-loss scaling to test whether improvements in ID loss predict OOD improvements. The transfer relationship is modeled as:

$$ \Delta L_{OOD} \approx K \cdot (\Delta L_{ID})^\kappa $$

The exponent $\kappa$ measures transfer efficiency. A value below 1 means diminishing returns: ID improvements still help, but each additional unit of ID improvement translates less efficiently to OOD gains.

Across nearly all conditions, $\kappa < 1$. This confirms what many applied teams discover after spending money the entertaining way: a cleaner validation curve on the source distribution is not the same as robust transfer.

The SSL models show higher average transfer exponents. Transformer-SSL reaches a mean $\kappa$ of 0.466, compared with 0.128 for Transformer-SL. ResNet-SSL reaches 0.270, compared with 0.133 for ResNet-SL. The difference is especially clear on tasks that do not look like the source adult ECG interpretation task. For EchoNext, a structural heart disease detection task, Transformer-SSL reaches $\kappa=0.339$, while Transformer-SL sits at only 0.050. For sex prediction, ResNet-SSL has a 7.6× higher transfer exponent than ResNet-SL.
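A quick way to read these exponents is to ask how much OOD excess loss remains after the ID excess loss is cut in half. The sketch below assumes that $\Delta L$ denotes loss above its estimated floor, which is how loss-to-loss relationships are usually written but is an interpretation on our part. The constant $K$ cancels in the ratio, so only $\kappa$ is needed. The function name is hypothetical.

```python
# Mean transfer exponents as quoted in the article.
KAPPA = {
    "Transformer-SSL": 0.466,
    "Transformer-SL": 0.128,
    "ResNet-SSL": 0.270,
    "ResNet-SL": 0.133,
}


def ood_excess_remaining(kappa, id_excess_remaining):
    """Fraction of OOD excess loss left when ID excess loss shrinks to id_excess_remaining.

    Derived from Delta_L_OOD ~ K * Delta_L_ID ** kappa; K cancels in the ratio.
    """
    return id_excess_remaining ** kappa


if __name__ == "__main__":
    for name, kappa in KAPPA.items():
        left = ood_excess_remaining(kappa, 0.5)
        print(f"{name}: halving ID excess loss leaves {left:.1%} of OOD excess loss")
```

Under this reading, halving the ID excess loss removes a bit more than a quarter of the OOD excess loss for Transformer-SSL but less than a tenth for either supervised family.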

This is where the paper’s mechanism-first reading becomes sharper. SL models can look strong because they approach the ID loss floor. But once they are near that floor, further ID gains may have little remaining transferable signal. SSL models are less tightly optimized around the source labels, so their ID improvement can carry broader representation changes.

That does not mean SSL always wins. It means SSL is more likely to help when the target task asks for information not directly aligned with the source labels. Structural heart disease from ECG is a good example. Biological sex prediction is another. Whether these are clinically valuable in a given product is a separate question, but as transfer diagnostics they reveal how much representation survives beyond the source task.

The evidence map: what each experiment is doing

The paper is dense because it contains main scaling experiments, label-scaling experiments, compute allocation, external model benchmarking, and appendix analyses. These are not all doing the same job. Treating every figure as an equal headline would be a fine way to produce a confused article and a worse investment memo.

Test or section	Likely purpose	What it supports	What it does not prove
ID parameter/data scaling on CODE	Main evidence	SL saturates early in model size but not in data on the ID task; SSL remains unsaturated across observed scales	That SSL is always clinically superior
OOD parameter/data scaling across 11 datasets	Main evidence	ResNets are more parameter-efficient; SSL is more data-efficient for transfer	That one architecture dominates every task and budget
Loss-to-loss scaling	Main evidence	ID gains transfer imperfectly; SSL has higher transfer efficiency, especially for unseen tasks	That ID validation can be ignored
Absolute OOD performance curves	Main evidence plus interpretation	ResNet-based models often achieve the lowest OOD loss; Transformer-SSL can win at very large scale	That small Transformers are a good default for ECG
Compute-optimal allocation	Analytical extension	Different paradigms should allocate marginal compute differently	Exact deployment cost without local infrastructure assumptions
MIMIC-IV label scaling	Mechanism test / sensitivity test	More labels improve ID, but OOD gains depend on label choice and architecture	That supervised label expansion is useless
External foundation-model benchmark	Comparison with prior work	Real public models show the same architecture-paradigm pattern	A perfectly controlled comparison, since fine-tuning protocols differ
Fine-tuning sample-efficiency appendix	Exploratory extension	Pretraining can reduce downstream data needs, especially for aligned tasks	A general claim about all architectures, since the analysis focuses on the Transformer-SL family

This table matters because the paper’s conclusion is not built from a single leaderboard. It is built from converging mechanisms: the same architecture-paradigm interaction appears in controlled scaling, transfer exponents, compute allocation, label scaling, and external benchmarks.

The compute result changes the procurement question

The compute-allocation section is easy to skip because it looks like a technical appendix wearing a main-section hat. That would be a mistake.

The authors adapt the logic of compute-optimal scaling. If a fixed compute budget $C$ must be divided between model size and data size, the optimal allocation follows:

$$ N^\ast(C) \propto C^{\beta/(\alpha+\beta)} $$

and

$$ D^\ast(C) \propto C^{\alpha/(\alpha+\beta)} $$
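A short derivation shows where these exponents come from. It is a sketch under an assumption this article does not state: that compute scales with the product of model size and data size, $C = c \cdot N \cdot D$ for some constant $c$, as is common in the language-model scaling literature. The paper's own compute accounting may differ.

Substituting $D = C/(cN)$ into the reducible part of the loss gives

$$ f(N) = A \cdot N^{-\alpha} + B \cdot (C/c)^{-\beta} \cdot N^{\beta} $$

Setting the derivative with respect to $N$ to zero:

$$ \alpha A \cdot N^{-\alpha-1} = \beta B \cdot (C/c)^{-\beta} \cdot N^{\beta-1} $$

which rearranges to

$$ N^\ast(C) = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)} (C/c)^{\beta/(\alpha+\beta)}, \qquad D^\ast(C) = \frac{C}{c \cdot N^\ast(C)} \propto C^{\alpha/(\alpha+\beta)} $$

The two exponents sum to 1, so they can be read as shares of each additional order of magnitude of compute. A large $\alpha$ relative to $\beta$ means the model-size term is exhausted quickly, and most of the growth goes to data.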

The exponents tell us where marginal compute should go. On CPSC2018, Transformer-SL assigns 90% of marginal compute to data, while ResNet-SL assigns 79% to data. Transformer-SSL flips the picture, directing 73% toward model size. The paper reports that these trends hold across the remaining OOD datasets in the supplementary analysis.
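The percentages can be turned back into exponent ratios with a few lines of Python. This is a sketch under two assumptions that are ours rather than the paper's: that "share of marginal compute to data" means the data exponent $\alpha/(\alpha+\beta)$, and that compute is proportional to $N \cdot D$. The function names are hypothetical.

```python
def allocation_shares(alpha, beta):
    """Exponent shares of compute-optimal growth: model gets beta/(a+b), data gets alpha/(a+b)."""
    total = alpha + beta
    return {"model": beta / total, "data": alpha / total}


def implied_alpha_over_beta(data_share):
    """Invert the data share alpha/(alpha+beta) to recover the ratio alpha/beta."""
    return data_share / (1.0 - data_share)


def growth_per_10x_compute(data_share):
    """How much N and D each grow when compute grows 10x, assuming C is proportional to N*D."""
    return {"model": 10.0 ** (1.0 - data_share), "data": 10.0 ** data_share}


if __name__ == "__main__":
    # Data shares quoted in the article for CPSC2018 (Transformer-SSL: 73% to model, so 27% to data).
    reported = {"Transformer-SL": 0.90, "ResNet-SL": 0.79, "Transformer-SSL": 0.27}
    for name, share in reported.items():
        ratio = implied_alpha_over_beta(share)
        growth = growth_per_10x_compute(share)
        print(
            f"{name}: alpha/beta ~ {ratio:.2f}; 10x compute -> "
            f"{growth['model']:.2f}x model, {growth['data']:.2f}x data"
        )
```

Read this way, a 90% data share corresponds to a parameter exponent about nine times the data exponent, and a tenfold compute increase would grow the dataset roughly eightfold while the model grows by about a quarter.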

For a business team, this is more useful than a generic “scale the model” recommendation. If the organization is building a supervised ECG model, the paper suggests that the next marginal dollar may be better spent on broader or cleaner data rather than a larger backbone. If the organization is building a Transformer-SSL model, larger model capacity may be a more defensible use of compute.

This distinction also changes vendor evaluation. A vendor that says “our model is larger” has not answered the relevant question. Larger under which paradigm? Larger relative to what data budget? Larger for which target distribution? Larger after what local validation? Bigger is not a strategy. It is an invoice line.

More supervised labels help, but they do not erase the architecture-paradigm problem

The label-scaling experiment is important because it attacks a plausible counterargument: perhaps SSL only looks useful because supervised models do not have enough labels. Give supervised learning hundreds of labels, and the problem disappears.

The authors test this using MIMIC-IV ECG. They pre-train models on subsets ranging from 9 to 927 classes. The result is split.

First, increasing the number of pre-training labels consistently improves ID performance across both frequent and tail labels. That is the expected result, and it matters. Supervised labels are not decorative. If the target is close to the label space, more and better labels can help.

Second, OOD performance is much more variable. It depends strongly on label composition and architecture. Under the 927-class regime, Transformer-SL fails to convert increased model capacity into stable OOD gains and performs worse than both ResNet-SL and Transformer-SSL. ResNet-SL, however, performs comparably to its SSL counterpart under the large-label setting.

That is a more nuanced message than “SSL beats SL.” The better interpretation is:

More labels improve the source task.
More labels may improve average transfer.
More labels do not guarantee stable transfer when architecture and target distribution are unfavorable.
ResNet can make supervised label scaling much more competitive.

For healthcare AI teams, this is a warning against label-count vanity. A 927-label pre-training corpus sounds impressive. It may be impressive. But if many labels are long-tailed, noisy, weakly related to the deployment target, or attached to a model architecture that struggles to translate capacity into OOD gains, the label count alone is not a due-diligence answer.

The external benchmark confirms the pattern, but it is not the cleanest proof

The paper also compares public ECG foundation models and train-from-scratch baselines across 10 downstream tasks. This section is useful because it links the controlled experiment to recognizable models.

The best overall foundation model in the benchmark is ECG-FM, a 90.9M-parameter Transformer-SSL model pre-trained on 1.5M samples, with a mean AUROC of 0.882. ECGFounder, a 30.7M-parameter CNN/ResNet-style SL model pre-trained on 10M samples, follows with 0.866. The best from-scratch CNN baseline, InceptionTime, reaches 0.812; the best from-scratch Transformer baseline, PatchTST, reaches 0.784.

A few details are more informative than the ranking itself:

Model or group	Paper-reported benchmark signal	Interpretation
ECG-FM	Mean AUROC 0.882	Large Transformer-SSL can excel at sufficient model scale
ECGFounder	Mean AUROC 0.866	Large supervised CNN/ResNet-style pre-training remains highly competitive
ResNet (Ribeiro)	Mean AUROC 0.856 with 7.1M parameters and 2.3M data	Smaller ResNet-SL can remain strong when architecture fits the signal
HuBERT (Base)	Mean AUROC 0.839 despite 93M parameters and 9.1M samples	Size and data do not rescue every architecture/objective/fine-tuning combination
InceptionTime / PatchTST	0.812 / 0.784 as best from-scratch CNN / Transformer baselines	Large-scale pre-training helps, but architecture still affects baseline strength

This benchmark should be read as comparison with prior work, not as the paper’s cleanest causal evidence. Public foundation models differ in pre-training data, objectives, hyperparameters, input processing, and fine-tuning recipes. The authors use best-practice fine-tuning rather than linear probing here, which makes sense for benchmarking but makes the comparison less controlled than the main scaling experiments.

Still, the benchmark supports the same story. ResNet-like models remain highly efficient. Large Transformer-SSL models can win when scale is large enough. Data size alone does not decide the ranking. Parameter size alone does not decide the ranking. One can almost hear the spreadsheet sighing.
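The claim that neither size nor data alone decides the ranking can be checked directly against the four pre-trained models whose parameter counts and sample counts are quoted above. The sketch below simply sorts them three ways; the figures are the ones in this article, and the helper name is hypothetical.

```python
# (name, parameters in millions, pre-training samples in millions, mean AUROC)
# Values as quoted in this article's benchmark summary.
MODELS = [
    ("ECG-FM", 90.9, 1.5, 0.882),
    ("ECGFounder", 30.7, 10.0, 0.866),
    ("ResNet (Ribeiro)", 7.1, 2.3, 0.856),
    ("HuBERT (Base)", 93.0, 9.1, 0.839),
]


def ranking(column):
    """Model names sorted from largest to smallest value in the given tuple column."""
    return [row[0] for row in sorted(MODELS, key=lambda row: row[column], reverse=True)]


by_params = ranking(1)
by_data = ranking(2)
by_auroc = ranking(3)

if __name__ == "__main__":
    print("By AUROC:     ", by_auroc)
    print("By parameters:", by_params)
    print("By data:      ", by_data)
```

The largest model by parameters (HuBERT) is last on AUROC, and the model with the least pre-training data (ECG-FM) is first. Four models are far too few for a statistical claim, but they are enough to rule out the single-variable story.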

What Cognaptus infers for business use

The paper directly shows scaling behavior under specific ECG datasets, architectures, objectives, and evaluation protocols. The business implications require a second step. Here is the clean separation.

Paper result	Cognaptus inference for practice	Uncertainty boundary
SL models saturate in ID parameter scaling, while data scaling remains important	For supervised ECG systems, prioritize data quality, label relevance, and validation before simply increasing model size	Depends on label set, disease prevalence, and deployment distribution
ResNets are 1.3× to 2.5× more parameter-efficient than Transformers for OOD scaling	Small and medium ECG products should treat ResNet baselines as serious production candidates, not old-fashioned leftovers	Other architectures, such as state-space models, were not tested
SSL is up to 16× more data-efficient and up to 7.6× more transfer-efficient on unseen tasks	SSL deserves priority when target tasks differ from source labels or labels are scarce	The SSL methods tested are CMSC and HeartLang-style masking; other SSL methods may behave differently
Transformer-SSL can overtake at very large model sizes	Large organizations with compute and validation capacity may rationally pursue Transformer-SSL	The cost-benefit curve depends on infrastructure, regulatory demands, and downstream task mix
Label scaling improves ID but gives variable OOD gains	Label expansion should be evaluated by target-task relevance, not by raw label count	MIMIC-IV labels are long-tailed ICD-10 cardiac labels; private curated labels may differ
Fine-tuning can save downstream data, especially for aligned tasks	Foundation models can reduce annotation burden for productization	The appendix analysis is narrower and should not be generalized too aggressively

The immediate product lesson is that ECG AI strategy should begin with the deployment target, not the architecture fashion cycle.

For a low-cost diagnostic support tool focused on source-like adult ECG interpretation, a ResNet-SL or ResNet-heavy supervised approach may be hard to beat on ROI. It is parameter-efficient, empirically strong, and operationally simpler.

For a platform intended to support many downstream ECG tasks, especially tasks not directly represented in the source label set, SSL becomes more attractive. It keeps more representation options alive. That is not poetic; it is a transfer-efficiency argument.

For a large-scale foundation-model vendor, Transformer-SSL may be rational, but only if the model is actually scaled enough to overcome its parameter inefficiency and if the business can afford the validation burden. A small Transformer-SSL model sold as a general ECG foundation model should be asked some unfriendly questions. Politely, of course. We are professionals.

The boundaries are narrow enough to matter

The paper is careful about limitations, and they matter for business interpretation.

First, the main pre-training analysis uses CODE, a large public ECG dataset with six expert-annotated labels. The authors deliberately avoid Harvard-Emory despite its larger cohort size because its machine-generated, long-tailed label set would complicate the controlled architecture-paradigm analysis. This is a reasonable research choice, but it means the conclusions come from a single large public source rather than a fully heterogeneous multi-source corpus.

Second, only two major architecture families are tested: ResNet and Transformer. These are important and widely used, but they are not the entire design space. State-space models, hybrid CNN-Transformer systems, and other biosignal-specific architectures may change the frontier.

Third, SSL is represented by specific methods: contrastive multi-segment coding for ResNet and HeartLang-style masked autoencoding for Transformer. The paper argues these are canonical and simple enough to represent broad paradigms, but different masking strategies, contrastive sampling rules, or multimodal objectives could shift the results.

Fourth, the main OOD scaling evaluation uses linear probing on frozen representations. This is good for isolating representation quality and avoiding fine-tuning hyperparameter chaos. But production systems often use full fine-tuning, partial freezing, calibration, ensembling, and site-specific adaptation. The external benchmark partly addresses this, but with less experimental control.

Finally, clinical deployment is not only model performance. Calibration, fairness across patient subgroups, signal quality, lead configuration, hospital workflow, explainability, clinician trust, regulatory classification, and post-market monitoring all remain outside the scaling-law frame. A model can scale beautifully and still be annoying, unsafe, or commercially useless. Nature is generous that way.

The article-level takeaway: scaling is not a ladder; it is a routing problem

The most tempting reading of this paper is “ResNet good, SSL good, Transformer big.” That is not wrong, but it is too flat.

The better reading is that ECG model scaling is a routing problem. If you route compute into model size under supervised learning, you may hit a label-task ceiling. If you route compute into data under SSL, you may get better OOD transfer. If you route a small budget into Transformers, you may pay for flexibility before it pays you back. If you route a large label expansion into a poorly matched architecture or target task, the ID curve may applaud while the OOD curve shrugs.

The paper’s contribution is therefore not just empirical. It gives ECG AI builders a better procurement grammar:

Do not ask whether the model is large. Ask whether its architecture is parameter-efficient for ECG signals.
Do not ask whether the dataset is large. Ask whether the training paradigm can convert that data into transferable representations.
Do not ask whether there are many labels. Ask whether those labels cover the target mechanism or merely decorate the source distribution.
Do not ask whether ID performance improved. Ask how much of that improvement survives distribution shift.

For Cognaptus readers, the broader lesson extends beyond ECG. In narrow technical domains, especially biosignals, scaling laws are not a license to import language-model instincts without inspection. Architecture carries assumptions. Objectives decide what information survives. Labels define what the model learns to ignore. Compute only amplifies the path you chose.

A bigger model is not a better model. It is a louder bet on a mechanism. The smart work is knowing which mechanism you are buying.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jiawei Li, Fabio Bonassi, Ming Jin, Stefan Gustafsson, Johan Sundström, Thomas B. Schön, and Antônio H. Ribeiro, “How Do Electrocardiogram Models Scale?”, arXiv:2605.17276, 2026. https://arxiv.org/abs/2605.17276
