The Clean Label Fairy Is Not Coming

TL;DR for operators

Hospitals do not label images the same way. Radiologists disagree on contours. Pathologists disagree on grades. Automatically generated masks miss structures, add structures, or quietly confuse one target for another. In centralized AI, those errors are already irritating. In federated learning, they become operationally awkward because the data cannot simply be pooled, inspected, cleaned, and morally forgiven by a heroic annotation team.

The paper behind this article introduces a benchmark suite for federated noisy-label learning (FNLL) in medical image segmentation, using six real-world noisy datasets, four client-noise scenarios, and noise-type-specific evaluation metrics.¹ Its useful contribution is not another clever model with a commemorative acronym. The useful contribution is a controlled comparison environment where noisy-label methods can be selected according to the kind of label failure actually present.

The headline result is simple enough: FedSelect is the strongest overall method in the benchmark, IOP-FL is the main competitor, and FedAvg remains a strong baseline. That last part is the small corporate tragedy. A method designed for noisy labels does not automatically beat the default federated averaging baseline. Performance gains are dataset-dependent, scenario-dependent, and metric-dependent. Apparently, reality declined to become a leaderboard.

For healthcare AI teams, the business interpretation is this: do not buy or deploy a federated noisy-label method as if “noisy label” were one disease. Diagnose whether the dominant problem is contour disagreement, missed or extra structures, or class confusion. Then consider the client-noise pattern: is every site partly noisy, are only some sites fully noisy, or is everyone noisy all the way down? The paper’s decision guide is useful precisely because it converts “label quality” from a vague complaint into a selection problem.

The boundary matters. Evidence is strongest for contour-related noise because every dataset contains contour variation. Evidence is moderate for instance-level noise, especially with MMIA carrying much of the signal. Evidence for class confusion is preliminary because it mainly comes from GleasonHD, a difficult and noisy pathology dataset with severe rater disagreement. Use the benchmark as a governance tool, not as scripture engraved on a GPU.

The easy mistake is to blame federation

A familiar deployment story goes like this. Several hospitals want a segmentation model. Nobody wants to centralize patient data. Federated learning looks like the civilized compromise: each site trains locally, the server aggregates updates, privacy constraints remain manageable, and the model allegedly benefits from everyone’s data without anyone surrendering the raw images.

Then performance disappoints.

The convenient diagnosis is that federation is the culprit. Distributed data are heterogeneous. Client models drift. The aggregation step dilutes local specialization. The privacy-preserving setup blocks convenient debugging. All true enough, and all incomplete.

This paper points to a less glamorous source of trouble: the labels themselves. In medical image segmentation, labels are not just class names attached to images. They are spatial masks. A mask says where a tumor begins, where an organ ends, which structure belongs to which class, and whether a suspected object exists at all. That makes label noise geometrically richer than ordinary classification noise. A wrong image label says “cat” instead of “dog.” A wrong segmentation mask can be slightly too large, slightly too small, missing an instance, adding a false instance, swapping classes, or doing several of these while maintaining a serene expression.

Federated learning amplifies the problem because label conventions vary by institution, rater, scanner, workflow, and source system. Worse, the whole point of federation is that nobody gets to inspect all labels centrally. A noisy site may reveal itself only through weaker model behavior. By then, the model has already been trained through aggregation, which is a polite word for “letting one institution’s annotation habits affect everyone else.”

The paper’s move is to stop treating label noise as an abstract defect and instead benchmark methods against observable noise forms.

The benchmark compares methods, not vibes

The benchmark suite is built around six medical segmentation datasets with real-world label variability rather than synthetic corruption. The datasets cover CT, fundus photography, microscopy, micro-CT, and MRI. The six are LIDC, RIGA, GleasonHD, MouseT, MMIS, and MMIA. The clean references come from majority voting, STAPLE fusion, or expert labels, depending on the dataset. Noisy labels are drawn from individual raters or automatically generated masks.
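To make the "clean reference" idea concrete, here is a minimal sketch of pixel-wise majority voting over several raters' binary masks. It is an illustration, not the paper's fusion code: the `majority_vote` helper, the toy masks, and the rule that ties go to background are all assumptions, and STAPLE fusion is a different, more involved procedure that is not shown.

```python
from typing import List

Mask = List[List[int]]  # 2D binary mask: 1 = foreground, 0 = background


def majority_vote(rater_masks: List[Mask]) -> Mask:
    """Fuse binary masks pixel-wise: foreground if more than half of the raters marked it."""
    if not rater_masks:
        raise ValueError("need at least one rater mask")
    n_raters = len(rater_masks)
    rows, cols = len(rater_masks[0]), len(rater_masks[0][0])
    fused = []
    for r in range(rows):
        row = []
        for c in range(cols):
            votes = sum(mask[r][c] for mask in rater_masks)
            # Assumption: ties (possible with an even number of raters) go to background.
            row.append(1 if 2 * votes > n_raters else 0)
        fused.append(row)
    return fused


# Three hypothetical raters annotating the same 2x3 image.
rater_a = [[0, 1, 1], [0, 1, 0]]
rater_b = [[0, 1, 0], [0, 1, 1]]
rater_c = [[1, 1, 0], [0, 0, 0]]
consensus = majority_vote([rater_a, rater_b, rater_c])
print(consensus)  # [[0, 1, 0], [0, 1, 0]]
```

Each individual rater mask then plays the role of a "noisy" label relative to the fused consensus.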

That design choice matters. Synthetic label noise is tidy. Real clinical label noise is not. Synthetic noise often changes labels according to a neat probability rule. Real segmentation noise emerges from ambiguous anatomy, rater practice, image quality, institutional procedure, and the limits of human agreement. It is not merely “random wrongness.” It has structure.

The benchmark evaluates four client-noise scenarios:

Scenario	What it means	Operational analogue
clean	all clients use clean labels	idealized reference condition
noisy	all clients use noisy labels	everyone’s annotation process is imperfect
roa	each client contains a fraction of noisy samples	every hospital has mixed-quality labels
roc	some clients are fully noisy and others clean	some hospitals or annotation pipelines are systematically unreliable
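The four scenarios in the table above can be sketched as a rule for deciding, per client and per sample, whether training sees the clean or the noisy label. This is a hedged illustration: `assign_noise_flags`, the site names, the client sizes, and the default 50% fraction for roa are assumptions made for the example, not the benchmark's actual data loader.

```python
import random
from typing import Dict, FrozenSet, List


def assign_noise_flags(client_sizes: Dict[str, int], scenario: str,
                       noisy_fraction: float = 0.5,
                       noisy_clients: FrozenSet[str] = frozenset(),
                       seed: int = 0) -> Dict[str, List[bool]]:
    """Return one flag per sample for each client: True = train on the noisy label."""
    rng = random.Random(seed)
    flags = {}
    for client, n in client_sizes.items():
        if scenario == "clean":
            flags[client] = [False] * n
        elif scenario == "noisy":
            flags[client] = [True] * n
        elif scenario == "roa":
            # Every client gets the same share of noisy samples.
            chosen = set(rng.sample(range(n), round(noisy_fraction * n)))
            flags[client] = [i in chosen for i in range(n)]
        elif scenario == "roc":
            # Whole clients are either fully noisy or fully clean.
            flags[client] = [client in noisy_clients] * n
        else:
            raise ValueError(f"unknown scenario: {scenario}")
    return flags


sizes = {"site_a": 40, "site_b": 10, "site_c": 30}  # hypothetical federation
roa = assign_noise_flags(sizes, "roa")
roc = assign_noise_flags(sizes, "roc", noisy_clients=frozenset({"site_a"}))
for name, flags in (("roa", roa), ("roc", roc)):
    print(name, {client: sum(f) / len(f) for client, f in flags.items()})
```

Under roa every site reports a noisy share of 0.5; under roc the same federation shows 1.0 at one site and 0.0 at the others, which is the difference between a diffuse problem and a trust problem.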

The distinction between roa and roc is especially useful. A hospital network where every site has some imperfect masks is not the same as a network where one or two sites are mostly unreliable. The first is a diffuse quality-control problem. The second is a client-level trust problem. Many benchmark designs flatten this difference because flattening things is how benchmarks stay tidy and how deployment teams later discover surprises.

The methods compared are also deliberately varied:

Method	Family	Core idea
FedAvg	baseline	average local model updates
FedA3I	noise-aware aggregation	estimate client segmentation tendencies and adjust aggregation
IOP-FL	personalization	combine local and global optimization information
FedCorr	label correction	identify noisy clients or samples and correct labels using global predictions
FedSelect	sample and client selection	use training dynamics to select useful samples and weight clients

This is not a comparison of five interchangeable tricks. It is a comparison of assumptions. FedA3I assumes that aggregation can be made smarter by estimating client reliability or tendency. IOP-FL assumes that local-client specificity should be preserved rather than averaged away. FedCorr assumes noisy labels can be detected and corrected. FedSelect assumes useful samples and clients can be selected from training dynamics.

The paper’s value is that these assumptions are forced to compete under the same federated segmentation framework.
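For reference, the baseline that every other method has to beat is short enough to fit on a napkin. The sketch below shows sample-size-weighted parameter averaging in the style of standard FedAvg; the `fedavg` function, the parameter names, and the toy values are assumptions for illustration, and the benchmark's own implementation may differ in detail.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened weights


def fedavg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighting each client by its number of training samples."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    aggregated = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        aggregated[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return aggregated


# Two hypothetical sites; site_a holds three times as many samples as site_b.
site_a = {"conv.weight": [1.0, 2.0], "conv.bias": [0.0]}
site_b = {"conv.weight": [3.0, 6.0], "conv.bias": [4.0]}
global_params = fedavg([site_a, site_b], client_sizes=[30, 10])
print(global_params)  # {'conv.weight': [1.5, 3.0], 'conv.bias': [1.0]}
```

Note what the weighting implies: nothing in this rule knows about label quality, so a large site with unreliable masks pulls the global model harder than a small site with careful ones. The four noise-aware methods are different bets on how to fix that.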

What the paper directly shows

The benchmark’s main evidence comes from the comparative performance analysis across datasets, methods, and client-noise scenarios. Dice is used for general overlap performance. Noise-specific metrics are used to avoid pretending that all segmentation errors are the same: HD95 for contour disagreement, foreground-background instance-level F1 for missed or extra structures, and class confusion for foreground class swaps.
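As a reminder of what the general-purpose metric measures, here is a minimal Dice sketch for binary masks. The `dice` helper and the toy masks are illustrative, and scoring two empty masks as 1.0 is an assumed convention: the paper handles such edge cases explicitly, but its exact choices are not reproduced here.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]  # 2D binary mask: 1 = foreground, 0 = background


def foreground(mask: Mask) -> Set[Tuple[int, int]]:
    """Coordinates of all foreground pixels."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def dice(pred: Mask, ref: Mask, empty_score: float = 1.0) -> float:
    """Dice overlap: 2 * |pred AND ref| / (|pred| + |ref|)."""
    pred_fg, ref_fg = foreground(pred), foreground(ref)
    if not pred_fg and not ref_fg:
        # Assumption: two empty masks count as perfect agreement.
        return empty_score
    return 2 * len(pred_fg & ref_fg) / (len(pred_fg) + len(ref_fg))


ref = [[0, 1, 1], [0, 1, 0]]
pred = [[0, 1, 0], [0, 1, 1]]
print(round(dice(pred, ref), 3))  # 0.667
```

Dice only counts overlapping pixels. It cannot tell whether the missing overlap came from a shifted boundary, a missed structure, or a swapped class, which is why the benchmark adds noise-specific metrics.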

The broad result is this:

Finding	Evidence in the paper	Business meaning	Boundary
FedSelect is strongest overall	highest overall rank stability across Dice and noise-specific analyses	a strong default candidate when no single noise type dominates	not always best by every robustness definition
IOP-FL is the main competitor	best Dice in many dataset-scenario comparisons; significant Dice gains over FedAvg in noisy regimes	personalization can matter when client heterogeneity and label noise interact	not uniformly dominant across all noise-specific metrics
FedAvg remains competitive	often hard to beat; some FNLL methods fail to improve over it	baseline discipline is mandatory; “advanced” is not a result	strong baseline does not mean label noise is harmless
FedCorr can reduce degradation in harsher settings	smallest clean-referenced Dice degradation in roc and fully noisy scenarios	label correction may help when client-level noise is severe	much of this behavior is linked to difficult cases such as GleasonHD
FedA3I is weakest overall in this benchmark	frequently underperforms FedAvg and other FNLL methods	noise-aware aggregation alone may be insufficient under diverse real-world noise	not a universal judgment on all aggregation-based methods

The most important distinction is that “best absolute performance” and “least degradation from clean training” are different questions. FedSelect ranks best overall in absolute performance. FedCorr, however, shows the smallest Dice degradation in the harsher roc and fully noisy settings. That does not make FedCorr the universal winner. It means robustness has multiple definitions. Procurement decks hate this. Reality remains unmoved.
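A tiny worked example shows how the two questions can crown different winners. The numbers below are invented purely for illustration and are not taken from the paper; `method_a` and `method_b` are placeholder names.

```python
from typing import Dict

# Hypothetical Dice scores, invented for illustration; NOT results from the paper.
scores: Dict[str, Dict[str, float]] = {
    "method_a": {"clean": 0.90, "roc": 0.84},
    "method_b": {"clean": 0.85, "roc": 0.82},
}


def best_absolute(results: Dict[str, Dict[str, float]], scenario: str) -> str:
    """Method with the highest score in the given noisy scenario."""
    return max(results, key=lambda m: results[m][scenario])


def least_degradation(results: Dict[str, Dict[str, float]], scenario: str) -> str:
    """Method that loses the least relative to its own clean-label training run."""
    return min(results, key=lambda m: results[m]["clean"] - results[m][scenario])


print(best_absolute(scores, "roc"))      # method_a
print(least_degradation(scores, "roc"))  # method_b
```

Here `method_a` still scores higher under roc even though it dropped by about 0.06, while `method_b` dropped by only about 0.03 from a lower starting point. Which one counts as "more robust" depends entirely on which function you call.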

For operators, the practical version is simple. Decide what failure you are optimizing against. Are you trying to maximize final clean-reference segmentation quality? Are you trying to minimize deterioration when some client labels are corrupted? Are you trying to protect boundary accuracy, instance detection, or class identity? These are not the same target.

The noise type changes the method choice

The paper’s comparison-based structure is valuable because it does not treat label noise as one bucket. It separates three practical failure modes.

First, contour noise. This is boundary disagreement: two raters see the same structure but draw its edge differently. The paper evaluates this using HD95. Because contour variation appears across all datasets, this is the best-supported part of the benchmark. FedSelect ranks best overall for contour robustness, and the authors report that it is the only FNLL method consistently matching or outperforming FedAvg in that analysis. The separation is clearest in contour-dominated datasets such as RIGA, MouseT, and MMIS.
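For readers who have not computed HD95 by hand, the sketch below shows one simple way to do it on 2D binary masks: extract boundary pixels, measure each boundary point's distance to the nearest boundary point of the other mask, and take the 95th percentile in both directions. Everything here is an assumption made for illustration: the helper names, the 4-neighbour boundary definition, the nearest-rank percentile, pixel units instead of physical spacing, and the handling of empty masks. Library implementations often interpolate percentiles or pool the two directions, so values can differ slightly.

```python
import math
from typing import List, Set, Tuple

Mask = List[List[int]]
Point = Tuple[int, int]


def boundary(mask: Mask) -> Set[Point]:
    """Foreground pixels with at least one 4-neighbour that is background or off-image."""
    rows, cols = len(mask), len(mask[0])
    edge = set()
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if not (0 <= rr < rows and 0 <= cc < cols) or not mask[rr][cc]:
                    edge.add((r, c))
                    break
    return edge


def directed_distances(src: Set[Point], dst: Set[Point]) -> List[float]:
    """For every point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def percentile_nearest_rank(values: List[float], q: float) -> float:
    """Nearest-rank percentile (no interpolation)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def hd95(pred: Mask, ref: Mask) -> float:
    """Symmetric 95th-percentile Hausdorff distance, in pixels."""
    a, b = boundary(pred), boundary(ref)
    if not a and not b:
        return 0.0  # assumption: two empty masks agree perfectly
    if not a or not b:
        return float("inf")  # assumption: one empty mask is maximally wrong
    return max(percentile_nearest_rank(directed_distances(a, b), 95),
               percentile_nearest_rank(directed_distances(b, a), 95))


def square(top: int, left: int, size: int = 3, rows: int = 5, cols: int = 7) -> Mask:
    """Toy mask containing one filled square."""
    return [[1 if top <= r < top + size and left <= c < left + size else 0
             for c in range(cols)] for r in range(rows)]


ref = square(1, 1)
pred = square(1, 2)    # same structure, drawn one pixel to the right
print(hd95(pred, ref))  # 1.0
```

A one-pixel shift of the whole contour yields an HD95 of one pixel, regardless of how large the structure is. That sensitivity to boundary placement, rather than to area overlap, is what makes the metric suitable for contour disagreement.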

Second, instance noise. This means structures are missed or extra structures are added. A model trained on such labels may learn not merely a fuzzy boundary but an unreliable sense of whether something exists. The paper evaluates this using foreground-background instance-level F1 across datasets where that noise is present. FedSelect again ranks first overall, with IOP-FL and FedAvg as strong competitors. The signal is clearest in MMIA, where instance-level noise is especially prominent.
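An instance-level F1 can be sketched as follows: split each mask into connected components, count a reference structure as found if any predicted foreground touches it, count a predicted structure as spurious if it touches no reference foreground, and combine the counts. The matching rule, the 4-connectivity, the empty-mask convention, and the helper names are all assumptions for illustration; the paper's foreground-background instance-level F1 may use a different matching criterion.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]
Point = Tuple[int, int]


def components(mask: Mask) -> List[Set[Point]]:
    """4-connected foreground components, found by flood fill."""
    rows, cols = len(mask), len(mask[0])
    seen, found = set(), []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            stack, comp = [(r, c)], set()
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                comp.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            found.append(comp)
    return found


def instance_f1(pred: Mask, ref: Mask, empty_score: float = 1.0) -> float:
    """F1 over structures: hits, misses, and spurious detections."""
    pred_inst, ref_inst = components(pred), components(ref)
    if not pred_inst and not ref_inst:
        return empty_score  # assumption: nothing to find, nothing predicted
    pred_fg = set().union(*pred_inst)
    ref_fg = set().union(*ref_inst)
    tp = sum(1 for inst in ref_inst if inst & pred_fg)      # reference structures that were hit
    fn = len(ref_inst) - tp                                  # reference structures that were missed
    fp = sum(1 for inst in pred_inst if not inst & ref_fg)   # predicted structures with no counterpart
    return 2 * tp / (2 * tp + fp + fn)


ref = [[1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0]]
pred = [[1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0]]
print(instance_f1(pred, ref))  # 0.5
```

In the toy case the prediction finds one structure, misses one, and invents one, giving an F1 of 0.5. The boundary quality of the structure it did find plays no role, which is the point: this metric asks whether things exist, not where their edges are.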

Third, class confusion. This is the multiclass case where foreground voxels are swapped between classes. The paper evaluates it only on GleasonHD, because that is the relevant dataset for this noise type in the benchmark. FedSelect ranks best overall, followed by FedCorr, but the authors are appropriately cautious. GleasonHD is difficult, mixed-noise, low-performance, and characterized by strong inter-rater disagreement. A single dataset is a thin foundation for sweeping claims. Even in AI, sometimes one example is still one example.
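One plausible way to isolate class swaps from the other two failure modes is to look only at pixels that both the prediction and the reference call foreground, and ask how often the class differs. The sketch below does exactly that. The `class_confusion_rate` helper and its definition are assumptions for illustration; the paper's class-confusion metric is not specified in this article and may be computed differently.

```python
from typing import List, Optional

LabelMap = List[List[int]]  # 0 = background, 1..K = foreground classes


def class_confusion_rate(pred: LabelMap, ref: LabelMap) -> Optional[float]:
    """Share of jointly-foreground pixels whose class differs; None if there are none."""
    shared = swapped = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, t in zip(pred_row, ref_row):
            if p > 0 and t > 0:  # ignore pixels where either side says background
                shared += 1
                swapped += int(p != t)
    return swapped / shared if shared else None


ref = [[0, 1, 1, 2],
       [0, 2, 2, 0]]
pred = [[0, 1, 2, 2],
        [1, 2, 1, 0]]
print(class_confusion_rate(pred, ref))  # 0.4
```

Restricting the count to shared foreground means a missed or extra region does not inflate the number: the extra pixel in the bottom-left corner of `pred` is ignored, and only the two genuine class swaps out of five shared pixels are counted.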

The decision guide in the paper converts this into a compact selection map:

Dominant concern	clean	roa	roc	noisy	Overall
Contour noise, HD95	FedSelect	FedAvg	FedSelect	FedAvg	FedSelect
Instance noise, F1	FedSelect	IOP-FL	FedSelect	FedSelect	FedSelect
Class confusion	FedSelect	FedCorr	FedAvg	FedSelect	FedSelect
General Dice	FedSelect	IOP-FL	FedSelect	FedSelect	FedSelect
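The selection map above is small enough to encode directly, which makes it easy to wire into an evaluation plan. The sketch below transcribes the table as a lookup and attaches the evidence grading this article gives for each noise type. The dictionary keys, the `suggest_method` helper, and the short evidence labels are naming choices made here, not part of the paper.

```python
from typing import Dict, Tuple

# Transcribed from the decision-guide table above.
DECISION_GUIDE: Dict[str, Dict[str, str]] = {
    "contour": {"clean": "FedSelect", "roa": "FedAvg", "roc": "FedSelect",
                "noisy": "FedAvg", "overall": "FedSelect"},
    "instance": {"clean": "FedSelect", "roa": "IOP-FL", "roc": "FedSelect",
                 "noisy": "FedSelect", "overall": "FedSelect"},
    "class_confusion": {"clean": "FedSelect", "roa": "FedCorr", "roc": "FedAvg",
                        "noisy": "FedSelect", "overall": "FedSelect"},
    "dice": {"clean": "FedSelect", "roa": "IOP-FL", "roc": "FedSelect",
             "noisy": "FedSelect", "overall": "FedSelect"},
}

# Evidence strength as characterised in this article's TL;DR.
EVIDENCE: Dict[str, str] = {
    "contour": "strongest",
    "instance": "moderate",
    "class_confusion": "preliminary",
}


def suggest_method(concern: str, scenario: str = "overall") -> Tuple[str, str]:
    """Return (starting candidate, evidence strength) for a concern and scenario."""
    if concern not in DECISION_GUIDE:
        raise KeyError(f"unknown concern: {concern}")
    if scenario not in DECISION_GUIDE[concern]:
        raise KeyError(f"unknown scenario: {scenario}")
    return DECISION_GUIDE[concern][scenario], EVIDENCE.get(concern, "not graded")


print(suggest_method("contour", "roa"))          # ('FedAvg', 'strongest')
print(suggest_method("class_confusion", "roc"))  # ('FedAvg', 'preliminary')
```

Returning the evidence grade alongside the method name is deliberate. A lookup that answers "FedAvg" for class confusion under roc should also remind the caller that this cell rests on a single difficult dataset.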

This table is not a vendor shortlist. It is a diagnostic scaffold. Its purpose is to force the deployment team to ask: what kind of label imperfection dominates our data, and how is it distributed across clients?

That question is much better than “which noisy-label method is best?” The latter sounds decisive and is usually a trap.

The benchmark also tests the benchmark

The paper includes several components that should be read with different evidentiary weights. Treating every table and appendix as the same kind of proof is how technical reports become interpretive fog machines.

Paper component	Likely purpose	What it supports	What it does not prove
Dataset noise analysis, Figures 2–3	main evidence for dataset characterization	real-world label noise differs by dataset and noise type	universal prevalence of each noise type in all clinical settings
Dice comparison, Table 3 and Figure 4	main comparative evidence	FedSelect and IOP-FL form the top tier; FedAvg remains strong	that any method wins for every site and objective
Clean-referenced degradation, Figure 5	robustness/sensitivity test	roc and fully noisy regimes are more damaging than roa; robustness depends on definition	that absolute best performance equals smallest degradation
Noise-specific metrics, Figure 6 and appendix tables	main evidence for method selection by noise type	contour, instance, and confusion errors need different metrics	that the class-confusion conclusion is as strong as contour-noise evidence
Wilcoxon tests against FedAvg	statistical comparison with prior baseline	some gains are significant, especially IOP-FL for Dice and FedSelect for instance F1	that all apparent rank differences are robust
Hyperparameter grid search appendix	implementation detail	comparisons are standardized by average-performance tuning	dataset-specific optimal tuning has been exhausted
Metric edge-case handling appendix	implementation detail with practical importance	empty masks and absent classes are handled explicitly	clinical validity of the metric choices by itself

This matters because the paper’s strongest claim is not “FedSelect wins.” The stronger claim is that the benchmark is discriminative enough to reveal when “winning” changes meaning.

For example, the paper reports that within-client partial noise in roa, with roughly half the samples noisy per client, induces only minor degradation. In contrast, roc, where some clients are fully noisy and others clean, causes larger losses; the effective noise levels in roc range from 29.33% to 73.97%, with a mean of 53.2%. Fully noisy training degrades performance the most. That pattern is operationally important because it suggests that federated deployments may be hurt less by diffuse imperfection than by systematic site-level unreliability.
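The wide roc range follows from simple arithmetic: when whole clients are noisy, the effective noise level is the share of all training samples held by those clients, so it depends on which clients are affected and how large they are. A minimal sketch, with hypothetical site names and sizes that are not the benchmark's client splits:

```python
from itertools import combinations
from typing import Dict, Iterable


def effective_noise(client_sizes: Dict[str, int], noisy_clients: Iterable[str]) -> float:
    """Fraction of all training samples that sit on fully noisy clients."""
    noisy = set(noisy_clients)
    noisy_samples = sum(n for client, n in client_sizes.items() if client in noisy)
    return noisy_samples / sum(client_sizes.values())


sizes = {"site_a": 120, "site_b": 60, "site_c": 20}  # hypothetical federation

# Same count of noisy clients (two of three), different choice of which two.
levels = {pair: effective_noise(sizes, pair) for pair in combinations(sizes, 2)}
for pair, level in levels.items():
    print(pair, f"{level:.0%}")
```

With these invented sizes, "two noisy clients out of three" means anywhere from 40% to 90% noisy samples. A client-level noise rate is therefore not a sample-level noise rate, which is why roa and roc should be read as complementary probes rather than matched doses.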

A little mess everywhere is annoying. A lot of mess concentrated in one client is governance.

FedAvg survives because baselines are not decorations

The most useful misconception to kill is the belief that a dedicated noisy-label federated method should automatically beat FedAvg. It sounds reasonable. FedAvg is old, simple, and easy to underestimate. Surely methods with noise estimation, correction, personalization, or selection should dominate it.

They do not.

The paper explicitly finds that FedAvg remains a strong baseline, and that FedA3I and FedCorr frequently fail to improve over it in this benchmark. Statistical testing reinforces the point. Corrected significant Dice gains over FedAvg are mainly observed for IOP-FL in the roa, roc, and fully noisy scenarios. FedSelect shows particularly strong improvements for instance-level F1 across scenarios. But the overall pattern is selective, not universal.
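For teams that want to run the same kind of paired comparison on their own results, here is a self-contained sketch of an exact two-sided Wilcoxon signed-rank test. It is an illustration under stated assumptions: the function names and the paired scores are invented, zero differences are dropped, the exact p-value is obtained by enumerating sign patterns (practical only for small samples), and the multiple-comparison correction that the paper applies is not shown. In practice a maintained statistics library is the better choice.

```python
from itertools import product
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_exact(method: Sequence[float], baseline: Sequence[float]) -> float:
    """Two-sided exact p-value of the Wilcoxon signed-rank test on paired scores."""
    diffs = [m - b for m, b in zip(method, baseline) if m != b]  # drop zero differences
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely: enumerate all 2^n.
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return hits / 2 ** len(ranks)


# Hypothetical paired Dice scores over six evaluation cases; NOT from the paper.
baseline = [0.80, 0.75, 0.60, 0.88, 0.70, 0.66]
candidate = [0.83, 0.76, 0.65, 0.90, 0.74, 0.72]
print(wilcoxon_exact(candidate, baseline))  # 0.03125
```

The candidate wins all six pairs, and the smallest two-sided p-value six pairs can ever produce is 2/64, about 0.031. With few paired cases, "not significantly better than FedAvg" can simply mean "not enough cases to tell", which is worth remembering before reading too much into any single rank difference.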

This is not embarrassing for the field. It is useful. In enterprise AI, baseline survival is a governance gift. It prevents teams from mistaking algorithmic novelty for deployment value. If the advanced method only wins under certain noise regimes, then the business case must include the cost of diagnosing those regimes.

That cost is not imaginary. To choose the right method, a healthcare AI team needs some way to characterize label noise: boundary disagreement, missing or extra instances, class swapping, and client-level concentration. That means sampling, audit design, inter-rater analysis, metadata inspection, model-error review, or some combination. The paper’s benchmark reduces uncertainty about method behavior, but it does not eliminate the operational work of knowing one’s data.

The clean-label fairy is not coming. Neither is the method-selection fairy.

What Cognaptus infers for deployment design

The paper directly shows benchmark performance under its chosen datasets, scenarios, methods, and metrics. From that, Cognaptus would infer three practical design rules for healthcare AI programs using federated segmentation.

First, label-noise diagnosis should precede method selection. Do not start with FedSelect, IOP-FL, or FedCorr as brands of magic. Start by asking what kind of label failure the deployment has. Boundary disagreement suggests contour-sensitive evaluation. Missed lesions, missing organs, or spurious structures suggest instance-level analysis. Multiclass grading or tissue-category confusion suggests class-confusion checks. If a team cannot describe the dominant noise pattern, it is not ready to confidently select a mitigation method. It is only ready to perform an expensive experiment while wearing a lab coat.

Second, client-noise distribution should be treated as a risk model. A federation where each hospital has 50% imperfect masks is not the same as a federation where one hospital is mostly unreliable and another is clean. The paper’s roa and roc scenarios make that distinction visible. In business terms, this affects contracting, data governance, monitoring, and escalation. A noisy client is not merely a data issue. It may be a workflow issue, a vendor issue, a staffing issue, or a protocol issue.

Third, baseline comparison should be mandatory. FedAvg is not glamorous, but it is the reference point that keeps the evaluation honest. A noisy-label method that cannot beat FedAvg for the relevant noise type and scenario is not a “more advanced solution.” It is an additional dependency with an acronym.

The operational workflow implied by the paper looks like this:

Step	Operator question	Technical action	Decision consequence
1. Characterize labels	What kind of errors dominate?	compare noisy masks against consensus, expert labels, or audit samples	choose relevant metrics
2. Characterize clients	Is noise diffuse or concentrated by site?	estimate within-client and across-client label quality	choose robustness scenario
3. Run baseline	How strong is FedAvg?	benchmark under clean-referenced validation where possible	establish minimum acceptable gain
4. Select method	Which method improves the relevant metric?	compare FedSelect, IOP-FL, FedCorr, or others using scenario-specific evidence	avoid universal-method mythology
5. Monitor drift	Does label quality change over time?	track performance by site and error type	trigger audit or reconfiguration
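Steps 3 and 4 of the workflow above amount to a gate: a candidate method is kept only if it beats FedAvg on the metric that matches the diagnosed failure mode, by at least a margin agreed in advance. A minimal sketch, in which the function name, the candidate names, the margin, and every number are hypothetical:

```python
from typing import Dict, List

# Direction of improvement per metric; HD95 is a distance, so lower is better.
HIGHER_IS_BETTER: Dict[str, bool] = {"dice": True, "instance_f1": True, "hd95": False}


def methods_worth_keeping(results: Dict[str, float], metric: str,
                          min_gain: float, baseline: str = "FedAvg") -> List[str]:
    """Candidates that improve on the baseline by at least min_gain, best first."""
    sign = 1 if HIGHER_IS_BETTER[metric] else -1
    reference = results[baseline]
    gains = {method: sign * (value - reference)
             for method, value in results.items() if method != baseline}
    kept = [method for method, gain in gains.items() if gain >= min_gain]
    return sorted(kept, key=lambda method: gains[method], reverse=True)


# Hypothetical HD95 values for a contour-dominated task; NOT from the paper.
hd95_results = {"FedAvg": 4.0, "candidate_x": 3.0, "candidate_y": 4.5, "candidate_z": 3.8}
kept = methods_worth_keeping(hd95_results, metric="hd95", min_gain=0.5)
print(kept)  # ['candidate_x']
```

Setting `min_gain` before looking at results is the design choice that matters. It forces the team to state what improvement would justify the extra dependency, and it leaves FedAvg as the answer when nothing clears the bar.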

The return on investment is not “higher Dice score” in the abstract. It is fewer blind deployments, fewer wasted federation cycles, and earlier detection of site-level annotation problems. In clinical AI, that is not a decorative benefit. It is the difference between a model governance process and a shared hallucination with compliance paperwork.

The decision guide is useful because it is not universal

A weak decision guide pretends to know too much. This one is useful because it is constrained.

FedSelect is the strongest overall default. That does not mean FedSelect should be selected automatically for every deployment. It means that, across this benchmark’s combination of real-world noisy datasets, scenarios, and metrics, FedSelect is the most consistent performer. Consistency is valuable, especially when early deployment evidence is limited. It is not the same thing as guaranteed superiority.

IOP-FL deserves attention because it performs strongly in many Dice comparisons and is the recommended choice under roa for both general Dice and instance F1. That hints at a practical point: when every client has mixed-quality labels, preserving or exploiting client-specific optimization may be more valuable than aggressively selecting or correcting. Diffuse imperfection may reward personalization.

FedCorr is not the overall winner, but its smaller clean-referenced degradation in harsher regimes is informative. Label correction may matter most when the problem is severe and concentrated. The method’s behavior on GleasonHD also suggests that difficult mixed-noise datasets can change the interpretation of robustness. There is no shame in a method being situational. Most useful things are.

FedA3I’s weak overall showing is also informative. Noise-aware aggregation sounds attractive because aggregation is the federated bottleneck. But if label noise is not reducible to the aggregation signal being estimated, or if the noise types are too diverse, aggregation adjustment may not be enough. The method may be aiming at the right layer of the federation stack but the wrong granularity of error.

That is the core business lesson: method families encode assumptions. A benchmark is valuable when it exposes those assumptions before procurement, not after deployment.

The boundaries are part of the result

The paper’s limitations are not boilerplate. They materially affect how the benchmark should be used.

The largest boundary is noise-type balance. Contour-related noise appears across the datasets, so conclusions about contour robustness are the strongest. Instance-level noise is represented across several datasets, but the clearest separation comes from MMIA. Class-confusion evidence is mainly from GleasonHD, which is difficult, noisy, and affected by strong inter-rater disagreement. Therefore, any strong claim about class-confusion robustness should be treated as preliminary.

Another boundary is hyperparameter strategy. The paper tunes method-specific hyperparameters by grid search and selects final settings based on average performance across datasets. That supports fair comparison. It does not guarantee the best possible setting for a specific hospital network, disease area, or label process. Average tuning is a benchmarking choice, not a deployment law.

A third boundary concerns roa versus roc. The paper sensibly warns that these are complementary probes, not perfectly matched causal comparisons. In roa, the noisy fraction is around half within each client. In roc, the effective noisy-sample fraction varies widely because client sizes differ. That is not a flaw; it is a realistic nuisance. Federated clients are rarely equal-size laboratory cubes.

Finally, the benchmark uses public datasets and clean references derived from consensus, STAPLE, or expert annotations. That is appropriate for evaluation, but real deployments may not have such clean references available. An enterprise team may need to construct smaller audit sets, use adjudication panels, or rely on targeted review rather than full clean-label validation.

In other words, the benchmark tells teams how methods behave when label-noise structure is known well enough to evaluate. It does not remove the need to build the machinery that knows.

The practical lesson: benchmark the mess before automating it

The paper is not exciting because it announces that noisy labels are bad. Everyone already knew that, although some people still need quarterly reminders. It is useful because it makes noisy-label federated segmentation comparable across real datasets, client-noise structures, and error types.

The industry temptation is to compress this into a simple recommendation: use FedSelect. That is not wrong, but it is too shallow. The better reading is that FedSelect is a strong default in this benchmark, IOP-FL is a serious alternative, FedAvg must remain in the comparison, and method choice should follow noise diagnosis.

Healthcare AI does not need more confidence theater. It needs deployment processes that can say: this site has boundary inconsistency, that site has missing instances, this task has class confusion, and this method is appropriate because it improves the metric that corresponds to the actual failure mode. Dry, specific, and much less likely to become a postmortem.

The clean label fairy is not coming. The benchmark suite is what we get instead. Frankly, it is more useful.

Cognaptus: Automate the Present, Incubate the Future.

¹ Markus Bujotzek, Dimitrios Bounias, Stefan Denner, Ralf Floca, Maximilian Fischer, Peter Neher, and Klaus H. Maier-Hein, “Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection,” arXiv:2606.16868v1, 2026, https://arxiv.org/abs/2606.16868.
