The Heart of the Model: ECG Foundation Models Need the Right Backbone Before More Data

Cost is not always about size.

That is an inconvenient sentence for anyone trying to sell a larger medical foundation model by waving parameter counts like a hospital procurement trophy. In ECG modeling, the expensive question is not simply whether one can pretrain on more recordings. The harder question is whether the model architecture and pretraining task actually match the structure of the signal.

A recent arXiv paper by M. A. Al-Masud and Nils Strodthoff gives an unusually useful answer. It studies ECG foundation models under a controlled setup: five self-supervised pretraining objectives, multiple backbone architectures, pretraining data up to 11 million ECG samples, and downstream evaluation across 26 clinically relevant tasks drawn from 10 public datasets with 1,622 classification and regression targets.¹ That matters because most foundation-model comparisons are aesthetically impressive and scientifically annoying. One model changes the dataset, another changes the architecture, a third changes the objective, and then everyone pretends the leaderboard explains causality. Very elegant, if one enjoys fog machines.

This paper is valuable because it removes some of that fog. Its strongest practical message is not merely “ECG foundation models scale.” They do, but that is the second-order lesson. The stronger lesson is that domain fit is a first-order design variable. For ECG time series, a structured state-space backbone, especially S4, consistently beats Transformer and CNN alternatives under the same pretraining objectives. Among the self-supervised objectives, contrastive predictive coding, or CPC, produces the most transferable representations across diverse clinical tasks, with JEPA usually close behind.

For business readers, the implication is simple but not small: the ROI of a medical foundation model may depend less on buying the largest general-purpose architecture and more on diagnosing whether the model’s inductive bias fits the operational signal. In plainer language: before paying for scale, check whether the model is listening in the right way.

ECG foundation models are judged by transfer, not by architectural fashion

An ECG is not an image with twelve decorative stripes. It is a physiological time series: ordered, rhythmic, multi-lead, and clinically interpreted through temporal morphology. A single QRS complex, a rhythm irregularity, a segment shift, or a longer-horizon pattern can each carry diagnostic meaning. That makes ECG modeling a good test case for whether foundation-model recipes imported from language, vision, or speech survive contact with a different modality.

Readers already familiar with self-supervised learning and sequence-model backbones can skip the next paragraph. For a quick refresher on applied AI concepts and deployment thinking, Cognaptus Academy is the better starting room than yet another breathless thread about “AI transformation.”²

Self-supervised pretraining asks a model to learn useful representations from unlabeled data by solving a proxy task: predict masked regions, match teacher representations, distinguish true future states from false ones, or assign inputs to learned clusters. The bet is that solving the proxy task forces the model to learn structure that later transfers to supervised clinical tasks. A foundation model is useful only if that transfer works. A model that learns a beautiful internal code but fails downstream is not a foundation model. It is expensive pattern art.

The paper tests this transfer in three evaluation modes:

Evaluation mode	What changes during downstream training	What it tells us
Full finetuning	Encoder and prediction head are adapted	Whether the pretrained model is a good starting point
Frozen evaluation	Encoder stays fixed; a stronger head adapts	Whether the representation is useful without changing the encoder
Linear evaluation	Encoder stays fixed; only a linear head is trained	Whether the representation is already cleanly separable

This distinction is important. Full finetuning can hide weak representations because enough task-specific adaptation may rescue them. Frozen and linear evaluation are harsher. They ask whether pretraining created reusable features rather than merely a tolerable initialization.
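The three modes differ mainly in which parameters receive gradient updates downstream. The following is a minimal sketch of that distinction in plain Python. The mode keys, the head labels, and the parameter counts are hypothetical names and numbers chosen for illustration, not identifiers or figures from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalMode:
    """Describes which parts of the model are updated downstream."""
    label: str
    train_encoder: bool  # True if pretrained encoder weights are updated
    head: str            # kind of prediction head placed on top


# Illustrative encoding of the three modes; keys and head labels are
# hypothetical names for this sketch, not identifiers from the paper.
MODES = {
    "finetune": EvalMode("Full finetuning", train_encoder=True, head="prediction head"),
    "frozen": EvalMode("Frozen evaluation", train_encoder=False, head="stronger head"),
    "linear": EvalMode("Linear evaluation", train_encoder=False, head="linear head"),
}


def trainable_parameter_count(mode_key, encoder_params, head_params):
    """Number of parameters that receive gradient updates in a given mode."""
    mode = MODES[mode_key]
    return head_params + (encoder_params if mode.train_encoder else 0)


# Hypothetical sizes: a 3M-parameter encoder and a 10K-parameter head.
for key in MODES:
    print(MODES[key].label, trainable_parameter_count(key, 3_000_000, 10_000))
```

With a fixed encoder, everything the downstream task can use has to already be present in the pretrained features, which is why the frozen and linear rows are the harsher tests.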

The controlled comparison is the real contribution

The study compares five self-supervised objectives: data2vec, DinoSR, JEPA, CPC, and HuBERT++. It also compares three broad backbone families: S4-based structured state-space models, Transformers, and a CNN-style Net1D backbone. The authors use a shared encoder design where possible, with a lightweight CNN stem followed by a sequential backbone. That design choice is not glamorous, but it is what makes the comparison interpretable.

The pretraining corpus is also deliberately staged. The authors train on subsets ranging from small HEEDB samples to a full combined corpus of HEEDB, HEEDB-Emory, and CODE-15%, totaling roughly 11 million ECG samples. They also run a matched-scale comparison between HEEDB and MIMIC-IV-ECG at around 753K to 759K samples. This is not a random tour through datasets. It is a controlled attempt to separate objective choice, architecture choice, dataset scale, and downstream transfer.

The downstream evaluation is broad enough to punish narrow tricks. It covers adult ECG interpretation, pediatric ECG interpretation, cardiac structure and function, cardiac outcomes, non-cardiac outcomes, acute care predictions, and patient characteristics. That breadth matters because a model aligned too closely with one diagnostic category may look strong in a familiar evaluation and weaker elsewhere.

The paper also uses statistical rankings with confidence intervals rather than pretending every tiny difference in AUROC deserves a trophy. Good. Small differences in medical-model tables often look like scientific certainty only because the table has enough decimal places.
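To make the point about decimal places concrete, here is one common way to put an interval around an AUROC difference: a paired bootstrap over test samples. This is an assumed, generic technique for illustration. The text above does not say which statistical procedure the authors use, and the function names below are hypothetical.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_diff(labels, scores_a, scores_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for AUROC(a) - AUROC(b), resampling the same
    test cases for both models (paired bootstrap)."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:  # a resample needs both classes for AUROC
            continue
        a = auroc(y, [scores_a[i] for i in idx])
        b = auroc(y, [scores_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

If the resulting interval contains zero, the two models are not distinguishable on that task at the chosen level, whatever the third decimal place of the table suggests.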

S4 is the first-order design choice

The most important empirical result is architectural. Across all five self-supervised objectives, the S4 backbone consistently outperforms Transformer and Net1D alternatives on most downstream tasks, while performing on par on the remainder. The gap is especially visible for JEPA, CPC, and HuBERT++, and on harder tasks such as pediatric ECG interpretation and cardiac structure prediction.

The practical advantage is not only predictive performance. It is also computational efficiency. In the paper’s efficiency comparison, the S4 variants use about 3 million parameters, while the Transformer variants use about 19.2 million and Net1D about 10 million. The S4 forward/backward compute is reported at about 1.741 / 5.213 GFLOPs, compared with roughly 27.410 / 82.207 for the Transformer and 8.845 / 73.817 for Net1D. Under the measured inference setting, S4 also uses far less GPU memory and generally achieves higher throughput.

That combination is the useful part: better task performance, smaller model, lower compute. This is not the usual “small model is cheaper but weaker” compromise. In this domain, the smaller architecture can be better because its inductive bias better matches the data.
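The efficiency figures quoted above are easier to compare as ratios against S4. The short calculation below uses only the approximate numbers given in the text. The dictionary layout and function name are assumptions made for this sketch.

```python
# Approximate figures as quoted in the text (parameters in millions,
# forward/backward compute in GFLOPs).
MODELS = {
    "S4": {"params_m": 3.0, "fwd_gflops": 1.741, "bwd_gflops": 5.213},
    "Transformer": {"params_m": 19.2, "fwd_gflops": 27.410, "bwd_gflops": 82.207},
    "Net1D": {"params_m": 10.0, "fwd_gflops": 8.845, "bwd_gflops": 73.817},
}


def ratios_versus(baseline, models):
    """Each model's cost divided by the baseline's cost, per metric."""
    base = models[baseline]
    return {
        name: {metric: stats[metric] / base[metric] for metric in base}
        for name, stats in models.items()
        if name != baseline
    }


ratios = ratios_versus("S4", MODELS)
for name, r in ratios.items():
    print(name, {metric: round(value, 2) for metric, value in r.items()})
```

On these figures the Transformer carries about 6.4 times the parameters and roughly 16 times the forward and backward compute of S4, while Net1D needs about 5 times the forward compute and about 14 times the backward compute.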

The authors support this with CKA representation-similarity analysis. S4 layers develop progressively distinct representations. Transformer blocks, by contrast, often show high similarity across intermediate layers, and Net1D displays more uniform similarity patterns. The interpretation is not that CKA proves clinical validity. It does not. The interpretation is narrower and more useful: the S4 backbone appears to create a more differentiated internal hierarchy for ECG signals, which is consistent with its stronger downstream transfer.
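For readers who have not met CKA, the sketch below computes the widely used linear form of centered kernel alignment between two sets of activations recorded on the same inputs. It is an illustration under assumptions: the text above does not say which CKA variant the authors use, and the helper names are hypothetical. A value near 1 means two layers encode nearly the same geometry, and a value near 0 means they are unrelated.

```python
import math


def center_columns(X):
    """Subtract each feature's mean; X is a list of per-sample feature lists."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def cross_cov_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y for row-aligned X and Y."""
    total = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            s = sum(x[i] * y[j] for x, y in zip(X, Y))
            total += s * s
    return total


def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_cov_sq_norm(Xc, Yc)
    denominator = math.sqrt(cross_cov_sq_norm(Xc, Xc) * cross_cov_sq_norm(Yc, Yc))
    return numerator / denominator


# Toy activations for four inputs (made-up numbers).
layer_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
layer_b = [[0.0, 2.0], [-2.0, 0.0], [0.0, -2.0], [2.0, 0.0]]  # rotated, scaled copy of layer_a
layer_c = [[1.0], [-1.0], [1.0], [-1.0]]                      # unrelated direction

print(linear_cka(layer_a, layer_b))
print(linear_cka(layer_a, layer_c))
```

Because linear CKA ignores rotations and uniform rescaling, the rotated copy scores as identical to the original. High similarity across many intermediate blocks therefore suggests that those blocks are not adding much new structure.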

Recall the earlier point about transfer. Full finetuning can rescue a model. Frozen and linear evaluation are less forgiving. The fact that the S4 advantage appears across objectives and is mechanistically supported by representation analysis makes the architecture result harder to dismiss as a training accident.

CPC wins because ECG is a temporal prediction problem before it is a benchmark table

Among the five pretraining objectives, CPC performs best overall, with JEPA usually the strongest competitor. DinoSR and HuBERT++ occupy the middle. data2vec lags most consistently across evaluation modes and scaling regimes.

The paper’s explanation is plausible: CPC’s sequential prediction task aligns naturally with ECG data. CPC learns by predicting future latent states from causal context and distinguishing true future steps from negative alternatives. ECG signals are temporal, rhythm-sensitive, and sequentially structured. A pretraining objective that forces the model to anticipate what comes next in the signal may teach features that remain useful across clinical tasks.
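The "distinguish the true future from negatives" step is usually written as an InfoNCE-style loss: the predicted future latent is scored against the true future latent and a set of negatives, and the loss is the negative log-probability assigned to the true one. The sketch below is a simplified illustration that uses a plain dot product as the score. The scoring function, the vector sizes, and the function names are assumptions, and the paper's exact implementation is not described in the text above.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(prediction, positive, negatives):
    """Negative log-softmax score of the true future among all candidates."""
    scores = [dot(prediction, positive)] + [dot(prediction, n) for n in negatives]
    top = max(scores)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_denominator - scores[0]


# Toy latents (made-up numbers): one true future step and two negatives.
true_future = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

good_prediction = [1.0, 0.0]   # points at the true future
bad_prediction = [-1.0, 0.0]   # points at a negative instead

print(info_nce_loss(good_prediction, true_future, negatives))
print(info_nce_loss(bad_prediction, true_future, negatives))
```

The loss falls only when the context-based prediction lands closer to what actually comes next than to the alternatives, which is the sense in which the objective rewards temporal anticipation.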

JEPA also performs strongly, especially under finetuning. But CPC appears more transferable in broader categories such as cardiac structure, outcomes, and patient characteristics. Under frozen and linear evaluation, the differences become more revealing. CPC’s relative position changes depending on evaluation mode, but the overall pattern still points to stronger task transfer than data2vec and most other alternatives.

This is where business readers should resist a lazy conclusion. The lesson is not “always use CPC.” The paper tests ECG foundation models under specific architectures, datasets, and evaluation protocols. The lesson is that pretraining objectives are operational choices. They shape what kind of representation the organization is buying.

A hospital analytics team, a medical-device company, and an AI vendor may all say they are “using ECG foundation models.” That phrase hides the real question: what proxy task taught the model to understand the signal? If the answer is vague, the deployment risk is not vague. It is merely postponed.

Scaling helps, but scale does not forgive bad design

The paper does find meaningful scaling behavior. Models are pretrained on progressively larger datasets, including 18K, 45K, 106K, 753K, and 11M samples. CPC and JEPA show the clearest power-law loss scaling, with reported validation-loss scaling exponents of 0.189 and 0.062 respectively. The authors also examine whether lower pretraining loss correlates with lower downstream residual error, and they find positive correlations, especially for CPC, JEPA, and DinoSR.

The scaling form used for downstream residual error is conceptually simple:

$$ E(N) = aN^{-b} + c $$

Here, $N$ is training-set size, $a$ scales the reducible part of the error, $b$ governs the rate of improvement, and $c$ is the residual error floor. If $b$ is meaningful and the fit is good, more data continues to buy improvement. If the curve is noisy or flat, scale is either being wasted or blocked by another bottleneck.
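A short sketch shows how the formula turns into a budget question. The code below evaluates the curve, recovers the constants from synthetic points by a straight-line fit in log-log space with the floor assumed known, and computes how much more data is needed to halve the reducible error. All constants are hypothetical and chosen for illustration. Only the five dataset sizes come from the text.

```python
import math


def residual_error(n, a, b, c):
    """E(N) = a * N^(-b) + c."""
    return a * n ** (-b) + c


def fit_power_law(sizes, errors, floor):
    """Least-squares fit of log(E - floor) = log(a) - b * log(N).
    Assumes the floor c is known and every error exceeds it."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e - floor) for e in errors]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def data_multiplier_to_halve(b):
    """Factor k with a * (k * N)^(-b) = 0.5 * a * N^(-b), i.e. k = 2^(1/b)."""
    return 2.0 ** (1.0 / b)


# Dataset sizes from the text; a, b, c are hypothetical.
sizes = [18_000, 45_000, 106_000, 753_000, 11_000_000]
errors = [residual_error(n, a=2.0, b=0.2, c=0.1) for n in sizes]

a_hat, b_hat = fit_power_law(sizes, errors, floor=0.1)
print(a_hat, b_hat, data_multiplier_to_halve(b_hat))
```

For the hypothetical $b = 0.2$, halving the reducible error takes $2^{1/0.2} = 32$ times more data, and no amount of data removes the floor $c$. Small exponents make scale an expensive lever, which is the arithmetic behind treating it as a late-stage budget decision.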

The paper’s scaling results are encouraging, but they are not a blank check. Scaling is clearer for some tasks than others. PTB-XL superclasses show comparatively clean behavior, while other downstream datasets are more mixed. That distinction matters. A vendor can truthfully say that a model improves with scale and still be hiding the fact that the improvement is uneven across deployment tasks.

For business planning, scaling should be treated as a budget decision after architecture and objective are credible. More unlabeled ECG data may improve performance. But if the backbone is poorly matched or the proxy task teaches the wrong abstraction, additional data may just finance a more elaborate mistake.

The appendix is mostly quality control, not a second thesis

The appendices are useful because they test whether the main story survives changes in setup. They should not be read as a pile of disconnected tables. Their likely purposes are different.

Test or result	Likely purpose	What it supports	What it does not prove
S4 model-dimension ablation	Ablation	A 512-dimensional S4 configuration is a sensible default; larger is not automatically better	That 512 is universally optimal for every ECG corpus or deployment constraint
Learning-rate ablations for SSL objectives	Implementation sensitivity test	Some performance differences are not merely one unlucky learning rate	Complete hyperparameter optimality across all methods
Backbone comparison across objectives	Main evidence	S4 is consistently stronger than Transformer and Net1D in this controlled ECG setup	That Transformers are inferior for all physiological signals or all model sizes
CKA representation analysis	Mechanistic support	S4 and CPC develop more differentiated internal representations	Direct clinical interpretability of learned features
HEEDB vs. MIMIC matched-scale comparison	Robustness / dataset sensitivity test	Matched-scale pretraining source changes results only modestly in this setup, with MIMIC often slightly ahead	That dataset composition never matters
Input-size comparison	Implementation detail / sensitivity test	Shorter 2.5-second inputs often perform well in this setup	That short inputs are sufficient for every clinical endpoint
Domain pretraining before finetuning	Adaptation test	Continued domain pretraining can improve downstream performance	That every organization should always run another pretraining stage
HuBERT++ versus HuBERT-ECG	Method refinement comparison	EMA targets, Sinkhorn-Knopp soft assignments, and S4 backbone improve over the prior HuBERT-style setup	That HuBERT++ beats CPC overall

The table makes one thing clear: the paper’s main argument rests on the controlled backbone/objective/scaling comparison. The appendices mostly check configuration choices, sensitivity, computational cost, and method refinements. That is exactly where appendices are useful. They reduce the chance that the headline is a fragile artifact of setup.

The business value is cheaper diagnosis of model design, not just cheaper inference

The obvious business reading is that S4 may reduce training and inference cost. That is true, but incomplete. The deeper business value is diagnostic. The paper gives teams a way to ask better procurement and development questions before committing to a medical-AI architecture.

Paper finding	Cognaptus business interpretation	Operational action	Boundary
S4 dominates Transformer and Net1D across objectives	Domain-specific inductive bias can beat general architectural fashion	Benchmark candidate models by modality fit, not only parameter count or brand familiarity	Evidence is ECG-specific, not a universal anti-Transformer verdict
CPC and JEPA are strongest among tested SSL objectives	The proxy task determines what representation the model buys	Evaluate pretraining objective as a product requirement, not a research footnote	The best objective may shift for other physiological signals
Scaling improves loss and often downstream performance	More unlabeled data can pay off after design is credible	Build data-scaling plans only after objective/backbone validation	Scaling gains are uneven across tasks
Frozen and linear evaluation reveal representation quality	Finetuning alone can hide weak pretraining	Include frozen-feature and linear-probe checks in vendor evaluation	These checks do not replace clinical validation
Domain-adaptive pretraining helps in appendix tests	Local data adaptation may improve fit before supervised training	Use continued pretraining where workflows have enough safe, governed unlabeled data	Adds governance, compute, and monitoring burden
Models are research-use, not clinically validated	Deployment requires validation, review, and accountability	Keep human review and clinical governance in the loop	The paper is design guidance, not a clinical product approval

For healthcare AI vendors, the lesson is product discipline. Do not sell a foundation model by saying it was trained on a lot of ECGs. Say what architecture it uses, why that architecture fits ECG signals, which pretraining objective shaped the representation, and how it performs under frozen and task-adapted evaluation. A buyer who asks those questions will be irritating. Good. Irritating buyers prevent expensive nonsense.

For hospitals and clinical analytics teams, the lesson is governance. A foundation model should not enter workflow simply because its benchmark table looks better than a smaller baseline. The evaluation should map to actual use: triage support, risk prediction, downstream report generation, clinical-review prioritization, or feature extraction for internal models. Each use case changes the acceptable error profile, human-review requirement, and audit trail.

For CFOs, the lesson is that model cost is not only GPU cost. It includes data curation, validation, integration, model monitoring, exception handling, legal review, and the clinical time spent checking outputs. A smaller, better-matched architecture may save money twice: once in compute, and again by producing representations that require less downstream contortion.

The main boundary is deployment validity, not research usefulness

The paper’s own limitations are important and refreshingly specific. It focuses on pure self-supervision and does not include weak supervision from diagnostic statements, text, or multimodal clinical signals. It does not deeply interpret learned representations beyond probing and CKA-style analysis. It also notes that data2vec might improve under more extensive hyperparameter optimization.

Those limitations do not weaken the paper’s core value. They define its proper use. The study should guide research and development choices for ECG foundation models. It should not be treated as a deployment certificate, a clinical validation study, or proof that one architecture should dominate every medical time-series problem.

The broader impact statement is also clear: the models are intended for research use and have not been validated for clinical application. In business terms, the result belongs in model-design strategy and technical due diligence, not in a hospital workflow without further validation. A model that wins a research benchmark still has to survive local data drift, device variation, demographic differences, integration failures, and clinician trust. The paper does not solve those problems. It makes the upstream design conversation less foolish.

The useful lesson is to scale after you understand the signal

The cleanest takeaway is this: ECG foundation models should not be built by copying the architecture fashion of the month and then compensating with more data. The paper provides evidence that for ECG, architecture fit comes first, pretraining objective comes next, and scaling is useful once those choices are credible.

That order matters. In many AI deployments, organizations reverse it. They buy scale, discover the workflow is fragile, then add governance language afterward like parsley on a badly cooked steak. ECG modeling is too clinically consequential for that ritual.

The better sequence is more disciplined:

1. Choose a backbone that matches the signal.
2. Choose a pretraining task that teaches transferable structure.
3. Test representation quality under frozen and finetuned conditions.
4. Scale data only after the first three steps survive.
5. Validate locally before clinical or operational use.

The paper does not make ECG foundation models simple. It makes them less mystical. That is already a service. In a market where “foundation model” is often used as a spell rather than a specification, a controlled study that says “this design works better, here is where, here is why, and here is what we still do not know” is unusually useful.

The heart of the model is not its size. It is whether the model’s assumptions beat in time with the signal.

Cognaptus: Automate the Present, Incubate the Future.

¹ M. A. Al-Masud and Nils Strodthoff, “Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study,” arXiv:2605.12241v1, 12 May 2026, https://arxiv.org/html/2605.12241.
² Cognaptus Academy, https://cognaptus.com/academy/.
