Label Me Twice, Generate Me Once: The New Discipline of Data-Efficient AI

In enterprise AI, the glamorous part is still the model. Bigger context windows, better agents, faster inference, shinier demos—the usual fireworks display. But for many real deployments, especially in healthcare, legal review, insurance, industrial inspection, and compliance, the real bottleneck is less theatrical: labeled data.

Not just data. Labeled data.

Not just labeled data. Correct labeled data.

And not just correct labeled data. Correct labels for the cases that actually move the model’s decision boundary, expose rare failure modes, or determine whether a system misses a clinically important abnormality. Tiny detail. Easy to overlook, unless one enjoys watching expensive AI projects become very confident filing cabinets.

Two recent arXiv papers are useful because they attack this bottleneck from different ends of the same operational chain. One paper, “Deep Active Re-Labeling: Toward Noise-Resilient Annotation Efficiency,” studies what happens when deep active learning selects highly informative samples but human annotators make mistakes.¹ The other, “Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios,” studies whether synthetic MRI data can help train automated focal cortical dysplasia detection when voxelwise expert annotations are scarce.²

Read together, they make a sharper point than either paper makes alone: annotation efficiency is not a single trick. It has at least two levers.

First, expand coverage where labeled data is scarce.

Second, protect correctness where labels matter most.

That sounds obvious. Most useful things do, once someone else has done the work.

The shared problem: expert labels are expensive, slow, and unevenly reliable

The business problem behind both papers is familiar. Deep learning systems often need many labeled examples, but expert labeling is slow and expensive. In generic image tasks, this is inconvenient. In medical imaging, legal evidence review, underwriting, fraud investigation, industrial defect inspection, and regulatory monitoring, it becomes structural.

The FCD paper gives the concrete clinical version. Automated focal cortical dysplasia detection requires MRI scans paired with voxelwise lesion delineations. These labels are not casual tags created by someone clicking through a weekend annotation platform. They are expert-created lesion regions, often involving subtle abnormalities such as gray-white matter junction blurring, cortical thickening, or abnormal FLAIR signal. The paper notes that manual voxelwise annotation is time-consuming and labor-intensive, while multi-site dataset collection creates logistical and regulatory barriers.²

The active re-labeling paper gives the algorithmic version. Deep active learning tries to reduce labeling cost by selecting the most informative unlabeled samples for annotation. The problem is that these selected samples are influential by design. If the annotator labels them incorrectly, the error is no longer harmless background noise. It can damage the model more than an incorrect label on a randomly chosen sample.¹

This is the central tension:

Data strategy goal	What can go wrong	Operational consequence
Label fewer examples	The chosen examples become more important	Mistakes on selected examples become more damaging
Use synthetic data	Generated cases may not match real-world distributions	Coverage improves, but artifacts and false positives may appear
Rely on experts	Experts are scarce and imperfect	Review time must be allocated, not merely requested
Scale model training	More data does not guarantee better signal	Data quality becomes a workflow problem, not a storage problem

The lazy interpretation is: “AI can reduce annotation cost.” True, but incomplete. The slightly less lazy interpretation is: “AI can help decide where annotation effort should go.” Better. The useful interpretation is: “AI data operations should manage marginal value per expert minute.”

There. Less catchy, more accurate. Business often works that way.

Step 1: synthetic data expands coverage when real labels are scarce

The FCD paper addresses the scarcity side of the chain. The authors generate synthetic MRI volumes exhibiting focal cortical dysplasia by first creating anatomically plausible binary lesion masks and then using a SPADE-based conditional image generation framework to synthesize T1w and FLAIR images. The training and evaluation design compares three detection models:

Model	Training data	Business interpretation
FCDD-1	35 FCD patients + 35 controls	Low-data baseline
FCDD-2	Same real data + 35 synthetic FCD + 35 synthetic healthy images	Synthetic augmentation strategy
FCDD-3	70 FCD patients + 70 controls	Expanded real-data upper-bound reference

This design matters because it does not merely ask, “Can synthetic data help?” It asks a more operational question: “How does synthetic augmentation compare with having more real labeled data?”

The answer is properly inconvenient.

Synthetic augmentation helped in the low-data setting, but it did not beat equivalent real data. In the full test cohort, subject-level sensitivity increased from 36.1% without augmentation to 44.3% with synthetic augmentation, while predicted probabilities at true lesion sites improved from 0.83 ± 0.11 to 0.89 ± 0.12. But the expanded real-data model achieved much higher sensitivity, 73.8%, and slightly higher confidence at lesion sites, 0.90 ± 0.14.²

So synthetic data is not magic replacement data. It is a bridge when real labeled examples are scarce. A useful bridge, not a teleporter.
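One way to put those numbers in proportion is to ask how much of the distance between the low-data baseline and the expanded real-data reference the synthetic augmentation actually covered. The sketch below is a back-of-the-envelope calculation on the sensitivities quoted above, not a metric the authors report; the helper name `gap_closed` is hypothetical.

```python
def gap_closed(baseline: float, augmented: float, reference: float) -> float:
    """Fraction of the baseline-to-reference gap recovered by augmentation."""
    if reference == baseline:
        raise ValueError("reference equals baseline, so the gap is undefined")
    return (augmented - baseline) / (reference - baseline)


# Subject-level sensitivities (percent) quoted in the text above.
sensitivity_share = gap_closed(baseline=36.1, augmented=44.3, reference=73.8)
print(f"Share of the sensitivity gap recovered: {sensitivity_share:.0%}")
```

On these figures, synthetic augmentation recovers roughly a fifth of the sensitivity gap. That fits the reading of synthetic data as a bridge rather than a substitute.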

The paper’s limitations are also important. Synthetic augmentation improved detection of some missed cases, but it also introduced false positives. The authors report that of nine cases missed without augmentation, five were identified by the synthetic-augmented model, at the cost of four false detections. They also argue that synthetic augmentation has diminishing utility once real data reaches sufficient scale.²

That is the business lesson hiding inside the technical result: generated data can improve coverage, but coverage is not correctness.

Synthetic data answers the question:

“What types of examples are missing from the training set?”

It does not automatically answer:

“Which labels are wrong, misleading, or decision-critical?”

For that, we need the other paper.

Step 2: active re-labeling protects correctness where mistakes are costly

The active re-labeling paper starts from a nice, uncomfortable observation: active learning can be outsmarted by its own intelligence.

In ordinary active learning, the model selects samples expected to be most informative. In deep active learning, these samples can have high influence on the learned decision surface. If a human annotator makes errors on such samples, the model may suffer more than it would under passive random sampling. In the authors’ framing, noisy labels on actively selected samples can cause active learning to lose its advantage and sometimes perform worse than passive learning.¹
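For readers who have not seen a selection step written down, here is a minimal sketch of one common rule, entropy-based uncertainty sampling. It is an illustration only: this article does not specify which acquisition function the paper uses, and `select_most_uncertain` is a hypothetical helper that works on predicted class probabilities.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, budget):
    """Return indices of the `budget` pool samples with the highest entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled samples, three classes each.
pool = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform: the model has no idea
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],  # torn between two classes
]
picked = select_most_uncertain(pool, budget=2)
print(picked)
```

The samples such a rule picks are, by construction, the ones the model is least sure about. That is why a wrong human label on them costs more than a wrong label on an easy sample.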

Their proposed remedy is not to ask for unlimited new labels. It is to spend part of the fixed annotation budget on re-labeling already labeled samples.

That choice is important. In budget-constrained expert workflows, the question is usually not:

“Can we buy infinite expert attention?”

The question is:

“Should the next expert minute label a new case or inspect an old one?”

The paper proposes deep active re-labeling, where a portion of the annotation budget is allocated to identifying and re-annotating likely noisy samples. The method combines learned representations from the deep model with maximum-margin classifier logic to detect two types of suspicious labels:

Suspicious label type	Detection intuition	Why it matters
Decision-boundary noise	A labeled point lies close to the decision surface	It can distort the model’s boundary
Inconsistency noise	A point’s label conflicts with nearby samples	It may represent a broader labeling inconsistency

The authors also use a dynamic weighting strategy so the re-labeling process shifts focus across active learning rounds. Early on, the classifier’s margin is still unstable; later, margin-based correction becomes more useful. They include an exponential moving average to reduce variance in re-labeling scores.¹
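A minimal sketch of that idea, under stated assumptions: the linear weight schedule and the smoothing factor `alpha` below are illustrative choices, not the paper's formulas, and both function names are hypothetical. The only properties taken from the description above are that the margin signal gains weight in later rounds and that scores are smoothed with an exponential moving average.

```python
def blended_score(margin_score, inconsistency_score, round_idx, total_rounds):
    """Shift weight from the inconsistency signal to the margin signal over rounds."""
    # Weight on the margin signal: 0.0 in round 0, rising linearly to 1.0 in the last round.
    w_margin = round_idx / max(total_rounds - 1, 1)
    return w_margin * margin_score + (1.0 - w_margin) * inconsistency_score


def ema_update(previous, current, alpha=0.5):
    """Exponential moving average; the first observation is taken as-is."""
    if previous is None:
        return current
    return alpha * current + (1.0 - alpha) * previous


# One sample's raw (margin, inconsistency) scores across five hypothetical rounds.
raw_scores = [(0.9, 0.2), (0.1, 0.3), (0.8, 0.2), (0.7, 0.1), (0.9, 0.2)]
smoothed = None
history = []
for round_idx, (m, c) in enumerate(raw_scores):
    raw = blended_score(m, c, round_idx, total_rounds=len(raw_scores))
    smoothed = ema_update(smoothed, raw)
    history.append(round(smoothed, 3))
print(history)
```

The smoothing matters because a single noisy round should not be enough to push a sample into, or out of, the re-labeling queue.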

The experiments use MNIST, FashionMNIST, CIFAR-10, and PathMNIST, with noisy labels introduced through label flipping. The main experiments use a 30% noise rate, with additional tests at 10% and 50%. Under the same annotation budget, their method outperforms baselines including no re-labeling, random re-labeling, DFAL, and ActiveLab in the reported settings.¹
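Label flipping at a fixed rate is straightforward to reproduce on your own data. The sketch below assumes symmetric flipping, where each corrupted label moves to a different class chosen uniformly at random. That is an assumption, since this article does not describe the paper's exact noise model, and `flip_labels` is a hypothetical helper.

```python
import random


def flip_labels(labels, num_classes, noise_rate, seed=0):
    """Return (noisy_labels, flipped_indices) with a fixed share of labels corrupted."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    if num_classes < 2:
        raise ValueError("need at least two classes to flip a label")
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    noisy = []
    for i, y in enumerate(labels):
        if i in flipped:
            # Pick any class except the true one, so every flip is a real error.
            noisy.append(rng.choice([c for c in range(num_classes) if c != y]))
        else:
            noisy.append(y)
    return noisy, flipped


clean = [i % 10 for i in range(100)]
noisy, flipped = flip_labels(clean, num_classes=10, noise_rate=0.3)
print(sum(a != b for a, b in zip(clean, noisy)))  # number of corrupted labels
```

Keeping the set of flipped indices makes it possible to measure later how many injected errors a re-labeling strategy actually found.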

The most useful part for business readers is not the exact architecture. It is the budget logic.

The paper treats labels as assets that can degrade or become suspicious. Re-labeling becomes a form of audit. Not a ceremonial audit, where everyone nods gravely and updates a spreadsheet. A targeted audit, where the system identifies which labels are likely to create the largest downstream damage.

The combined chain: coverage first, correctness always

The two papers fit together as a complementary logic chain.

Logic-chain step	FCD synthetic MRI paper	Deep active re-labeling paper
Constraint	Not enough real expert-labeled cases	Expert labels may contain errors
Main lever	Generate synthetic labeled examples	Re-inspect high-risk labels
Primary benefit	Better coverage in low-data regimes	Better correctness under fixed budgets
Primary risk	Synthetic data may introduce false positives or fail to generalize	Re-labeling consumes budget that could label new samples
Business interpretation	Use generation to stretch scarce data	Use active review to protect high-value labels

This chain is more valuable than a serial summary because most organizations face both problems at once. Their datasets are incomplete and noisy. Their experts are scarce and imperfect. Their models need more examples and cleaner labels. Unfortunately, the budget is usually allergic to doing everything.

A practical AI data workflow should therefore ask four questions:

1. Where is coverage thin? These are candidates for synthetic augmentation, simulation, weak labeling, or targeted data acquisition.
2. Where is correctness fragile? These are candidates for re-labeling, second review, adjudication, or audit.
3. Which labels have high model leverage? A wrong label near a decision boundary may be more costly than a wrong label in an easy region.
4. Where does expert time have the highest marginal value? The best use of expert time may be labeling a new rare case, correcting a suspicious old case, or validating a generated example.

The formula is simple enough:

$$ \text{Annotation ROI} = \frac{\text{Expected model improvement} - \text{Risk introduced}}{\text{Expert time and operational cost}} $$

The difficult part is estimating the numerator. Naturally, that is the part most dashboards ignore.
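Written as code, the formula is a one-liner. The work is in the inputs. The sketch below uses made-up estimates purely to show how the ratio ranks competing actions. The action names, the numbers, and the choice of expert minutes as the cost unit are all assumptions.

```python
def annotation_roi(expected_improvement, risk_introduced, cost):
    """Net expected benefit per unit of cost, mirroring the formula above."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (expected_improvement - risk_introduced) / cost


# Hypothetical estimates: (expected improvement, risk introduced, cost in expert minutes).
candidates = {
    "label new rare case": (0.040, 0.002, 20.0),
    "re-label boundary case": (0.030, 0.001, 8.0),
    "add unvalidated synthetic batch": (0.050, 0.045, 5.0),
}
ranked = sorted(candidates,
                key=lambda name: annotation_roi(*candidates[name]),
                reverse=True)
print(ranked)
```

With these inputs, the cheap synthetic batch ranks last despite the largest headline improvement, because most of its gain is cancelled by the risk term. Change the risk estimate and the ranking changes with it, which is the point.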

What the papers show—and what they do not

The FCD paper shows that synthetic MRI augmentation can improve automated detection in a low-data clinical setting. It also shows a hierarchy: equivalent real data remains more effective than synthetic augmentation when available. This is not a failure of synthetic data. It is a useful boundary condition.

The active re-labeling paper shows that active learning under noisy annotation benefits from revisiting selected labels. It also shows that not all re-labeling is equally useful. Random re-labeling may help somewhat, but strategic re-labeling can do better because it targets labels likely to be noisy and influential.

Together, the papers do not prove that businesses can replace expert annotators with generators. They also do not prove that active learning plus re-labeling is automatically safe in every domain. The active re-labeling experiments use controlled label-flip noise, and the FCD paper is a specialized medical imaging study with modest sample sizes and site-generalization challenges.

The correct conclusion is narrower and more useful:

In expert-heavy AI systems, data efficiency should be designed as an operating system for expert attention.

Synthetic generation supplies breadth. Active re-labeling supplies discipline. Evaluation decides whether either one is actually helping.

The business framework: build a data flywheel, not a label factory

A label factory tries to produce as many labeled examples as possible. This sounds productive, especially to managers who enjoy unit-count metrics. It is also how organizations accidentally optimize for annotation volume instead of model value.

A data flywheel is different. It routes examples through different treatments depending on their role in model performance.

Data item type	Recommended action	Business metric to watch
Rare but plausible cases	Generate or actively acquire similar examples	Coverage of minority scenarios
High-uncertainty real cases	Expert label or adjudicate	Error reduction per expert hour
Boundary-critical labeled cases	Re-label or second review	Change in validation performance
Generated cases	Validate realism and failure modes	Synthetic-to-real generalization
Repeated false positives	Trace synthetic artifacts or label ambiguity	Precision, review burden
Repeated false negatives	Expand coverage around missed regions	Sensitivity, recall, loss severity

This is especially relevant for regulated or high-consequence domains. A medical AI team should not simply ask whether synthetic data improves average sensitivity. It should ask whether it increases false positives in specific imaging patterns, whether generated lesions generalize across scanners, and whether expert review should target disagreement cases.

An insurance AI team should not simply generate more claim scenarios. It should ask whether rare synthetic claims distort fraud thresholds or whether high-value disputed cases need re-labeling.

A legal AI team should not simply label more documents. It should ask whether privileged, ambiguous, or precedent-sensitive documents should receive second review because their labels influence retrieval and classification behavior.

A manufacturing AI team should not simply augment defect images. It should ask whether generated defects match real sensor noise, lighting, and material variability—and whether mislabeled near-boundary defects are silently teaching the system the wrong tolerance.

The common principle is the same:

Generate where the world is underrepresented. Re-check where the model is overconfident for the wrong reason.

Why “cheap labels” are a dangerous KPI

The current AI market loves cost reduction. Understandable. Annotation budgets are painful, and expert review is even worse because experts tend to want salaries, context, and occasional sleep.

But “cost per label” is often the wrong metric. The better metric is cost per useful training signal.

A cheap synthetic example that adds no new variation is not useful. A cheap label on an easy sample may be harmless but low-value. A costly re-labeling action on a boundary-critical mislabeled sample may be very valuable. A generated rare case may be valuable if it exposes a missing scenario, but dangerous if it introduces systematic artifacts.

The two papers make this point from opposite directions. The FCD paper shows that synthetic data can improve low-data performance but does not remove the value of real data. The active re-labeling paper shows that active learning can reduce label volume but becomes fragile when important labels are wrong.

So the business goal is not “less annotation.” It is better annotation allocation.

That means AI teams need workflows that record more than labels. They need metadata about source, uncertainty, reviewer disagreement, model influence, synthetic origin, domain shift, and post-deployment error patterns. Without that, the organization has no memory. It just has a dataset with confidence issues. Very modern, very expensive.
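As a rough illustration of what "more than labels" could look like, here is a minimal record type that carries the metadata listed above alongside the label itself. The field names and the disagreement measure are assumptions for the sake of the sketch, not a schema taken from either paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LabelRecord:
    """One label plus the context needed to audit it later (field names are illustrative)."""
    sample_id: str
    label: str
    source: str                                 # e.g. "expert", "crowd", "synthetic"
    is_synthetic: bool = False
    model_uncertainty: Optional[float] = None   # e.g. predictive entropy at labeling time
    model_influence: Optional[float] = None     # e.g. a boundary-proximity score
    reviewer_labels: List[str] = field(default_factory=list)
    domain_tag: Optional[str] = None            # e.g. scanner site or document type
    deployment_errors: int = 0                  # post-deployment errors traced to this sample

    def reviewer_disagreement(self) -> float:
        """Share of reviewers who did not choose the most common label."""
        if not self.reviewer_labels:
            return 0.0
        top = max(self.reviewer_labels.count(v) for v in set(self.reviewer_labels))
        return 1.0 - top / len(self.reviewer_labels)


record = LabelRecord("scan-0042", "lesion", source="expert",
                     reviewer_labels=["lesion", "lesion", "no lesion"])
print(round(record.reviewer_disagreement(), 2))
```

Even a small record like this lets a team ask later which labels were contested, which came from a generator, and which have been implicated in production errors.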

A practical implementation pattern

For teams building AI products in expert-heavy domains, the combined lesson can be turned into a simple operating pattern.

1. Start with a real labeled core

Synthetic data should not be the foundation floating alone in the void. The FCD paper’s generator depends on real imaging structure, lesion labels, and healthy scans. Even when synthetic data helps, it is anchored to real data.

For business teams, the first priority is a small but carefully governed labeled core: high-quality examples, clear definitions, reviewer guidelines, and traceable annotation provenance.

2. Map coverage gaps

Identify where the dataset is thin: rare classes, unusual contexts, minority populations, edge device conditions, uncommon document types, underrepresented lesion locations, or special operating environments.

This is where synthetic data or targeted acquisition can help. The point is not to flood the model with decorative variety. The point is to fill strategically important holes.

3. Track label risk

Not all labels deserve equal trust. Labels should be scored for uncertainty, disagreement, model influence, and downstream risk. In active learning settings, selected samples deserve special scrutiny: the model asked for them precisely because they were informative.

A practical system might track:

Risk signal	Meaning
High model uncertainty	The sample may sit near a decision boundary
Reviewer disagreement	The label definition may be ambiguous
High loss after training	The model struggles to fit the sample
Neighbor inconsistency	Similar examples have different labels
High business consequence	Mistake carries regulatory, clinical, or financial cost
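The signals in the table could be combined into a single review-priority score. The sketch below uses a plain weighted average with hypothetical weights and assumes each signal has already been scaled to the range 0 to 1. A real team would tune or learn the weights, and nothing here is prescribed by either paper.

```python
# Hypothetical weights, one per risk signal in the table above.
DEFAULT_WEIGHTS = {
    "model_uncertainty": 0.25,
    "reviewer_disagreement": 0.25,
    "post_training_loss": 0.15,
    "neighbor_inconsistency": 0.15,
    "business_consequence": 0.20,
}


def label_risk(signals, weights=DEFAULT_WEIGHTS):
    """Weighted average of risk signals in [0, 1]; missing signals count as 0."""
    for name, value in signals.items():
        if name not in weights:
            raise KeyError(f"unknown risk signal: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be scaled to [0, 1]")
    total_weight = sum(weights.values())
    return sum(w * signals.get(name, 0.0) for name, w in weights.items()) / total_weight


def review_queue(records, top_n):
    """Return the ids of the `top_n` riskiest labels, riskiest first."""
    ranked = sorted(records, key=lambda item: label_risk(item[1]), reverse=True)
    return [sample_id for sample_id, _ in ranked[:top_n]]


records = [
    ("a", {"model_uncertainty": 0.9, "reviewer_disagreement": 0.5, "business_consequence": 1.0}),
    ("b", {"model_uncertainty": 0.1}),
    ("c", {"neighbor_inconsistency": 1.0, "post_training_loss": 0.8}),
]
print(review_queue(records, top_n=2))
```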

4. Spend expert time dynamically

The next expert action should be chosen from multiple options: label a new sample, inspect an old label, validate a synthetic sample, resolve disagreement, or review a repeated failure pattern.

That is the operating shift. Expert review becomes a portfolio allocation problem.
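Treating the session as an allocation problem can be sketched with a simple greedy planner: rank candidate actions by estimated value per expert minute, then take them in order until the time runs out. The values and durations below are invented for illustration, `plan_expert_session` is a hypothetical helper, and greedy selection is a heuristic rather than an optimal solution to the underlying knapsack problem.

```python
def plan_expert_session(actions, minutes_available):
    """Greedy plan over (name, estimated_value, minutes) tuples.

    Actions are taken in descending order of value per minute; any action that
    no longer fits in the remaining time is skipped. Minutes must be positive.
    """
    ranked = sorted(actions, key=lambda a: a[1] / a[2], reverse=True)
    plan, remaining = [], minutes_available
    for name, _value, minutes in ranked:
        if minutes <= remaining:
            plan.append(name)
            remaining -= minutes
    return plan, remaining


# Hypothetical candidate actions for one 40-minute expert session.
actions = [
    ("label new rare case", 8.0, 20),
    ("re-check boundary label", 6.0, 10),
    ("validate synthetic sample", 1.5, 5),
    ("resolve reviewer disagreement", 5.0, 15),
    ("review repeated false positive", 3.0, 30),
]
plan, minutes_left = plan_expert_session(actions, minutes_available=40)
print(plan, minutes_left)
```

The hard part, as with any portfolio, is the value estimates. The planner only makes the trade-off explicit.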

5. Evaluate with failure-specific metrics

Average accuracy is too blunt. The FCD paper reports sensitivity, specificity, detection rate, false positives, lesion-site confidence, and multi-site behavior. That is the right instinct. Business systems need metrics tied to actual failure costs.

For example:

Domain	Metric that matters beyond accuracy
Healthcare imaging	Sensitivity, specificity, false positives per subject, site generalization
Insurance claims	Missed fraud, false investigation burden, high-value claim errors
Legal review	Privilege leakage, missed responsive documents, reviewer escalation rate
Manufacturing	Missed defects, false rejects, defect localization quality
Compliance	Missed violations, false alerts, audit defensibility
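The healthcare row can be made concrete with a few standard definitions. The sketch below computes sensitivity, specificity, false positives per subject, and a per-site breakdown from raw counts. The counts are invented to show how a pooled number can hide a weak site; they are not results from the FCD paper.

```python
def sensitivity(tp, fn):
    """True positives as a share of all actual positives (recall)."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn, fp):
    """True negatives as a share of all actual negatives."""
    return tn / (tn + fp) if (tn + fp) else float("nan")


def false_positives_per_subject(fp_counts):
    """Mean number of false detections per subject."""
    return sum(fp_counts) / len(fp_counts) if fp_counts else float("nan")


def per_site_sensitivity(results):
    """`results` maps site name -> (tp, fn); returns site name -> sensitivity."""
    return {site: sensitivity(tp, fn) for site, (tp, fn) in results.items()}


# Hypothetical counts for two imaging sites.
site_counts = {"site_a": (18, 2), "site_b": (9, 11)}
pooled = sensitivity(tp=18 + 9, fn=2 + 11)
by_site = per_site_sensitivity(site_counts)
print(round(pooled, 3), by_site)
print(false_positives_per_subject([0, 2, 1, 0, 1]))
```

In this made-up example the pooled sensitivity looks moderate while one site detects fewer than half of its cases, which is the kind of failure an average conceals.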

Cheap labels do not help if the wrong failure mode survives.

The uncomfortable conclusion

The next phase of enterprise AI will not be won only by companies with bigger models. It will be won by companies that know how to manage evidence.

These two papers are about annotation, but the broader lesson applies to AI operations generally. A model learns from the signals it receives. If the organization supplies shallow coverage, the model learns a narrow world. If it supplies noisy labels, the model learns mistakes with impressive mathematical commitment. If it supplies synthetic examples without validation, the model may learn artifacts. If it supplies expert review without prioritization, the budget evaporates politely.

The solution is not to worship real data or dismiss synthetic data. Real data is better when available, but often unavailable at the necessary scale. Synthetic data is useful when coverage is thin, but it must be tested. Active learning is useful when labels are expensive, but it must include mechanisms for correcting important mistakes.

That is the actual data strategy:

Use synthetic generation to expand the map.
Use active re-labeling to repair the compass.
Use evaluation to decide whether the journey is improving or merely becoming more automated.

A less poetic version: stop treating annotation as a one-way production line. Treat it as a governed feedback system.

Yes, that is less exciting than “AI replaces labeling.” It is also more likely to survive contact with reality, which remains annoyingly well funded.

Cognaptus: Automate the Present, Incubate the Future.

¹ Md Abdullah Al Forhad and Weishi Shi, “Deep Active Re-Labeling: Toward Noise-Resilient Annotation Efficiency,” arXiv:2606.08718, 2026. https://arxiv.org/html/2606.08718
² Prabhjot Kaur, Hakim Ouaalam, Sedat Kandemirli, Sanjay P. Prabhu, and Simon K. Warfield, “Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios,” arXiv:2606.07381, 2026. HTML was unavailable during reading, so the arXiv PDF was used. https://arxiv.org/pdf/2606.07381
