Scar Tissue, Synthetic Data: Teaching AI to See the Invisible

Synthetic data has a seductive sales pitch: when real data is scarce, expensive, or ethically awkward to collect, generate more of it. Simple. Almost too simple. Which, in AI, usually means the invoice has not arrived yet.

The paper behind this article, LGESynthNet: Controlled Scar Synthesis for Improved Scar Segmentation in Cardiac LGE-MRI Imaging, is interesting because it refuses that easy story.¹ It does not merely ask whether a model can generate plausible cardiac MRI images. It asks a more operational question: can generated scar tissue help a downstream model detect and segment real scar tissue better?

That distinction matters. A synthetic image can look realistic and still be useless. Worse, it can look realistic and actively harm the model trained on it. In medical imaging, where the target structure may be tiny, ambiguous, and clinically consequential, “looks good” is not a sufficient evaluation criterion. It is a mood, not a metric.

LGESynthNet focuses on late gadolinium enhancement cardiac MRI, often abbreviated as LGE MRI. In these scans, enhancement patterns can reveal myocardial scar or fibrosis. The clinical problem is important, but the machine-learning problem is irritatingly inconvenient: scar segmentation needs pixel-level annotations, those annotations are expensive, and scar regions can be small, subtle, and variable. The paper notes that myocardial scars may occupy less than 1% of the image. That is exactly the sort of target that makes generative augmentation look attractive—and exactly the sort of target that makes careless augmentation dangerous.

The useful way to read this paper is not “diffusion model beats GAN.” That would be the obvious summary, and also the less useful one. The better comparison is between three things that are often collapsed into one vague phrase called “synthetic data quality”:

- whether the generated image looks like a plausible medical image;
- whether the generated scar actually follows the requested location and shape;
- whether the generated image improves the downstream diagnostic task.

The paper’s central lesson is that these three can diverge. Business teams building healthcare AI pipelines should pay attention to that divergence. It is where much of the ROI—and much of the risk—hides.

The problem is not image generation; it is controllable scar generation

The authors train LGESynthNet on 429 scar-positive LGE images from 79 patients. That number is small by general computer-vision standards, but realistic for specialized medical imaging work. The full dataset includes 212 patients, with 159 patients and 1,516 images used for training and 53 patients and 497 images used for testing in the downstream task. The images came from 1.5T Siemens scanners, and the pixel-wise ground truth was derived semi-automatically from expert segment-level annotations.

This last detail matters. The paper is not operating in a clean fantasy world where unlimited expert contours exist. It is closer to the world healthcare AI teams actually live in: partial clinical labels, expensive refinement, small positive cohorts, and a downstream model that still needs to work.

LGESynthNet frames scar synthesis as an inpainting task. During training, the model receives a positive LGE image with the scar region masked out, plus a scar edge map. It learns to fill in the missing scar-like enhancement. During inference, the method can take a negative LGE image and insert a synthetic scar according to a generated conditioning mask and matching anatomical caption.

That design choice is important because the model is not asked to hallucinate an entire heart scan from scratch. It receives anatomical context from the existing image and is asked to synthesize the missing enhancement. The operational logic is almost conservative: do not invent the whole patient; modify a controlled region inside a real-looking anatomical frame.

The system adds three important controls:

| Component | What it does | Operational consequence |
| --- | --- | --- |
| Inpainting with image context and scar edge maps | Gives the model the surrounding anatomy and the intended scar boundary | Reduces the generation problem from “make a plausible scan” to “insert a plausible scar here” |
| Anatomy-aware captions | Describe scar location and transmurality using myocardial segments and radial layers | Gives text conditioning clinical meaning rather than generic image labels |
| Reward-guided supervision and quality filtering | Uses a scar segmentation model to encourage and later filter conditioning fidelity | Connects generation to the downstream task instead of relying on appearance alone |

The reward model is especially worth unpacking. The authors train a scar segmentation model on real data, freeze it, and use it during LGESynthNet training as an additional supervisory signal. The generated image is passed through this fixed segmentation model; the predicted scar mask is compared with the intended mask. The total training loss combines the diffusion loss with this reward loss at early diffusion steps. In plainer language: the generator is nudged not only to make a plausible image, but also to make an image where another scar detector can recover the scar where it was supposed to be.

This is not a magic guarantee. The authors themselves point out that selecting an effective reward model is difficult when strong general-purpose LGE segmentation models are not available. But it is a pragmatic move. When perfect annotation is scarce, a weaker task model can still provide useful pressure on the generator. A small teacher can still tell the student, “at least put the scar where you said you would.”
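The reward-guided objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the soft Dice form of the reward loss, the weight `reward_weight`, the cutoff `t_cutoff`, and the reading of “early steps” as step indices below that cutoff are all assumptions made for the example.

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """1 - soft Dice between a predicted probability mask and a binary target mask."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def total_loss(diffusion_loss, pred_mask, target_mask, t,
               t_cutoff=200, reward_weight=0.1):
    """Diffusion loss plus a reward term that is active only when t < t_cutoff.

    t_cutoff and reward_weight are hypothetical values, not taken from the paper.
    pred_mask stands in for the output of the frozen segmentation model.
    """
    if t >= t_cutoff:
        return diffusion_loss
    return diffusion_loss + reward_weight * soft_dice_loss(pred_mask, target_mask)

# Toy check: a scar recovered in the right place adds no penalty; a missed scar does.
target = np.zeros((8, 8))
target[2:4, 2:4] = 1
missed = np.zeros((8, 8))
print(total_loss(0.5, target, target, t=50))   # reward term is zero
print(total_loss(0.5, missed, target, t=50))   # reward term adds about 0.1
print(total_loss(0.5, missed, target, t=900))  # reward term inactive
```

The point of the gating is that the generator is only held to the segmentation check where the sketch assumes the check is informative; everywhere else it trains as an ordinary diffusion model.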

The cleanest image was not the most useful image

The strongest editorial hook in the paper is Table 1. It compares image quality and conditioning adherence across several methods: SPADE with limited context, SPADE with full context, ControlNet, ControlNet with template guidance, and LGESynthNet.

SPADE-FC is the trap. It achieves extremely high image-quality metrics: whole-image SSIM of 0.974 and cropped-region SSIM of 0.986. If a team stopped there, SPADE-FC would look like the winner. A beautiful table row. Very suitable for a slide deck.

But its conditioning performance is weak. SPADE-FC has a TP-only Dice of 0.272 and a pass rate of 18.8%, where pass rate means the share of generated samples whose predicted scar mask overlaps the conditioning mask with Dice above 0.6. In the authors’ interpretation, SPADE-FC often reproduces the input image without realistic scar synthesis. It is visually faithful, but not necessarily faithful to the requested scar.
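For readers who want the pass-rate definition in executable form, here is a minimal sketch. The 0.6 threshold comes from the paper as described above; the function names and the choice to score two empty masks as 0.0 are assumptions of this example.

```python
import numpy as np

def dice(pred, target):
    """Dice overlap between two binary masks.

    Returns 0.0 when both masks are empty (an assumption of this sketch).
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 0.0
    return 2.0 * np.logical_and(pred, target).sum() / denom

def pass_rate(pred_masks, cond_masks, threshold=0.6):
    """Share of samples whose predicted mask overlaps the conditioning mask above threshold."""
    scores = [dice(p, c) for p, c in zip(pred_masks, cond_masks)]
    return sum(s > threshold for s in scores) / len(scores)

# Toy example: one scar lands one pixel off target, one lands in the wrong place.
cond = np.zeros((10, 10), dtype=bool)
cond[2:6, 2:6] = True
near = np.zeros((10, 10), dtype=bool)
near[2:6, 3:7] = True      # 12 of 16 pixels overlap, Dice = 0.75
far = np.zeros((10, 10), dtype=bool)
far[6:10, 6:10] = True     # no overlap, Dice = 0.0
print(pass_rate([near, far], [cond, cond]))  # 0.5
```

Note what the metric does not see: it scores where the detected scar sits, not whether the surrounding image is any good.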

LGESynthNet, by contrast, has much lower SSIM: 0.587 for the whole image and 0.582 around the left ventricle. On an image-quality-only leaderboard, that looks worse. Yet its conditioning metrics are comparable to SPADE-FC’s, and it is more useful in downstream testing. It is not the prettiest generator. It is the generator that better survives contact with the actual task.

The paper’s comparison can be summarized this way:

| Method | What looks good | What breaks | Business interpretation |
| --- | --- | --- | --- |
| SPADE-FC | Very high SSIM and RMSE performance | Poor downstream impact; likely copies image context without useful scar synthesis | Visual realism alone can overstate synthetic-data value |
| SPADE-LC | Better conditioning than several alternatives | Incoherent background and limited downstream segmentation gain | Simplified control can help detection but may distort anatomy |
| Template-guided ControlNet | Intended to preserve structural consistency | Very poor conditioning adherence in this setting | A technique useful elsewhere may fail when the small target is too hard to steer |
| LGESynthNet | Balanced realism and conditioning; best downstream result among tested methods | Still low pass rate and lower SSIM than SPADE-FC | Task-aligned filtering beats cosmetic realism |

This is the paper’s most useful business idea: synthetic data should be evaluated as a pipeline asset, not as a visual artifact. A synthetic MRI image does not create value because it looks plausible to a model evaluator. It creates value if it improves the performance, robustness, or coverage of the production model that depends on it.

There is a quietly brutal lesson here for AI product teams. If your evaluation dashboard has only “fidelity” metrics, you may be optimizing for a model that is very good at producing synthetic screenshots of your ambitions.

The downstream task exposes what image metrics hide

The downstream experiment is where the paper becomes practically interesting.

The authors train a DenseNet121-based model to jointly predict scar and myocardium. They compare real-only training against hybrid training where 300 synthetic samples from each generative method are added. Synthetic images are generated from negative LGE images using parametrically defined scar masks. Before being used for training, the samples go through a quality-control filter: the reward model predicts scar masks on the generated images, and only samples with Dice overlap above 0.6 against the conditioning mask are retained. For selected samples, the predicted mask—not the original ellipsoid—is used as the downstream training ground truth, to better reflect realistic scar boundaries.

That is an important implementation detail, not a decorative one. The pipeline does not blindly trust the original synthetic instruction mask. It lets the quality model reinterpret the generated output and uses that predicted mask as the training label. This is a form of task-aware synthetic-data cleaning.
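A rough sketch of that cleaning step follows. The `predict_mask` callable is a hypothetical stand-in for the frozen reward model, and the function names are invented for illustration; only the 0.6 Dice threshold and the relabeling rule come from the paper as summarized above.

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks (0.0 if both are empty)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 2.0 * (a & b).sum() / denom if denom else 0.0

def filter_and_relabel(images, cond_masks, predict_mask, threshold=0.6):
    """Keep samples whose predicted scar matches the requested mask.

    Returns (image, predicted_mask) pairs: the predicted mask, not the
    conditioning mask, becomes the training label for retained samples.
    """
    kept = []
    for image, cond in zip(images, cond_masks):
        pred = predict_mask(image)          # hypothetical reward-model call
        if dice(pred, cond) > threshold:
            kept.append((image, pred))
    return kept

# Toy usage with a thresholding "model" standing in for the real segmenter.
cond = np.zeros((10, 10), dtype=bool)
cond[2:6, 2:6] = True
hit = np.zeros((10, 10))
hit[2:6, 3:7] = 1.0                         # scar rendered slightly off target
miss = np.zeros((10, 10))                   # no scar rendered at all
kept = filter_and_relabel([hit, miss], [cond, cond], lambda img: img > 0.5)
print(len(kept))                            # 1
```

The retained label follows what the image actually contains, which is why a slightly shifted scar still yields a consistent image and label pair.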

The results are concise:

| Training setup | Dice | Dice, TP-only | Accuracy | Balanced accuracy | Confusion matrix [TN, FP, FN, TP] |
| --- | --- | --- | --- | --- | --- |
| Real images only | 0.72 | 0.32 | 0.77 | 0.73 | [11, 11, 1, 30] |
| Real + SPADE-LC | 0.77 | 0.31 | 0.85 | 0.86 | [20, 2, 6, 25] |
| Real + SPADE-FC | 0.75 | 0.19 | 0.68 | 0.72 | [21, 1, 16, 15] |
| Real + LGESynthNet | 0.77 | 0.35 | 0.89 | 0.90 | [21, 1, 5, 26] |

LGESynthNet improves Dice from 0.72 to 0.77 and balanced accuracy from 0.73 to 0.90 when 300 synthetic images are added. SPADE-LC also improves Dice to 0.77 and balanced accuracy to 0.86, but LGESynthNet performs better on TP-only Dice and patient-level detection. SPADE-FC, despite its excellent image-quality metrics, reduces accuracy and balanced accuracy relative to the real-only baseline.
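The patient-level numbers can be checked directly from the reported confusion matrices. The sketch below assumes the standard definition of balanced accuracy, the mean of sensitivity and specificity; the function name is illustrative.

```python
def detection_metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity, specificity and balanced accuracy from a 2x2 confusion matrix."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced = (sensitivity + specificity) / 2
    return accuracy, sensitivity, specificity, balanced

# Confusion matrices as reported, in [TN, FP, FN, TP] order.
reported = {
    "Real images only":   (11, 11, 1, 30),
    "Real + SPADE-LC":    (20, 2, 6, 25),
    "Real + SPADE-FC":    (21, 1, 16, 15),
    "Real + LGESynthNet": (21, 1, 5, 26),
}

for name, cm in reported.items():
    acc, sens, spec, bal = detection_metrics(*cm)
    print(f"{name}: acc={acc:.2f} sens={sens:.2f} spec={spec:.2f} balanced={bal:.2f}")
```

Running it reproduces the table's accuracy and balanced accuracy columns. It also shows where the movement comes from. With LGESynthNet, false positives fall from 11 to 1 while false negatives rise from 1 to 5, so the gain is mostly a specificity gain. With SPADE-FC, sensitivity collapses to 15 of 31 scar-positive patients.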

That is the paper’s core comparison in miniature. The generator that looks most faithful to the input image is not the one that makes the downstream model best. In fact, it appears to produce a harmful training signal for scar detection.

For a business reader, the distinction is not academic. Many synthetic-data vendors, internal platform teams, and product demos can produce attractive generated examples. Attractive examples are easy to screenshot. Downstream lift is harder. But downstream lift is where economic value lives.

In healthcare AI, the relevant question is not “can we create more images?” It is:

Can we create the right distribution of additional cases, with controlled pathology, filtered for task fidelity, in a way that improves the target clinical model on real test data?

LGESynthNet gives a partial “yes” for a narrow LGE scar segmentation setup. It does not give a universal “yes” for synthetic medical imaging. That boundary is not a weakness. It is the difference between research evidence and brochure prose.

The number of synthetic samples helps, until it stops being the story

The paper includes an additional experiment varying the number of LGESynthNet samples added to downstream training. This is best read as a sensitivity test rather than the paper’s main thesis.

The reported results are:

| Synthetic samples | Dice | Dice, TP-only | Accuracy | Balanced accuracy | Confusion matrix [TN, FP, FN, TP] |
| --- | --- | --- | --- | --- | --- |
| 500 | 0.78 | 0.36 | 0.91 | 0.91 | [21, 1, 4, 27] |
| 1000 | 0.77 | 0.36 | 0.92 | 0.93 | [21, 1, 3, 28] |
| 1500 | 0.77 | 0.37 | 0.91 | 0.91 | [20, 2, 3, 28] |

The trend is encouraging but not linear. Balanced accuracy improves at 500 and peaks at 1000, while Dice stays around 0.77–0.78. Adding 1500 samples does not continue improving everything. That is exactly what one should expect if synthetic data is useful but not infinite nutrition. At some point, more generated examples may become redundant, distributionally narrow, or limited by the quality of the conditioning process.

This matters for implementation. A healthcare AI team should not treat synthetic augmentation as a bulk commodity. The first 300 or 500 controlled samples may cover underrepresented scar configurations. The next 1,000 may improve detection balance. The next 10,000, if produced without better diversity and validation, may simply teach the model the quirks of the generator.

The business version is simple: the ROI curve of synthetic data is likely concave. The first useful samples can be valuable. The thousandth sample is not automatically worth the same as the hundredth. Procurement departments may dislike this sentence. Reality will survive their disappointment.
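The concavity claim can be made concrete with the balanced-accuracy figures reported above. This is back-of-envelope arithmetic on single reported runs, not a fitted curve. The helper name is illustrative, and treating the real-only baseline as the “0 samples” point is an assumption of the sketch.

```python
# (synthetic samples added, reported balanced accuracy)
reported = [(0, 0.73), (300, 0.90), (500, 0.91), (1000, 0.93), (1500, 0.91)]

def marginal_gain_per_100(points):
    """Change in balanced accuracy per 100 extra samples between consecutive settings."""
    gains = []
    for (n0, b0), (n1, b1) in zip(points, points[1:]):
        gains.append((n1, (b1 - b0) / (n1 - n0) * 100))
    return gains

for n, gain in marginal_gain_per_100(reported):
    print(f"up to {n} samples: {gain:+.4f} balanced accuracy per 100 samples")
```

The first 300 samples buy roughly 0.057 per 100; every later step buys 0.005 or less, and the last step is negative. With 53 test patients, one patient changing category moves accuracy by about 0.02, so the later differences should be read as noise-level rather than as a precise optimum at 1000.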

The ablations support the control design, but they are not a second victory lap

The paper’s ablation table changes three design choices: the input condition, the text encoder, and the caption type. These tests should be read as implementation evidence for LGESynthNet’s design, not as separate proof that every component will generalize across all medical imaging tasks.

The ablation results report image-quality metrics rather than downstream task metrics. Compared with the final LGESynthNet row in Table 1, using semantic masks instead of semantic edges lowers SSIM to 0.497 whole-image and 0.490 cropped. Replacing BiomedBERT with OpenCLIP gives 0.554 whole-image and 0.550 cropped. Using a constant caption gives 0.580 whole-image and 0.572 cropped, slightly below the full LGESynthNet SSIM values.

The likely purpose of these ablations is to test whether the domain-specific design choices help the generator preserve image quality under controlled scar synthesis. They suggest that edge conditioning, biomedical text encoding, and descriptive captions are useful. The paper’s discussion also states that edge inputs outperform masks and that the biomedical encoder and descriptive captions improve performance.

But the ablation table does not, by itself, prove that every hospital system needs BiomedBERT in every synthetic-data pipeline. It supports a narrower and more useful inference: when the synthetic target is anatomically specific, domain-aware conditioning can matter. Generic text conditioning may not understand the difference between “posteroseptal,” “endocardial,” and “some bright thing inside the myocardium.” The last phrase is how models behave when humans are too generous with the word “understanding.”

A practical synthetic-data pipeline for medical AI should therefore treat conditioning as a domain object, not just a prompt. In this paper, the conditioning includes AHA segment logic, radial layers, scar masks, edge maps, and captions. The text is not decoration; it is an interface to anatomical control.
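To make “conditioning as a domain object” tangible, here is a small sketch that builds a caption from structured inputs instead of free text. The segment names follow the standard AHA 17-segment nomenclature. The layer vocabulary and the caption template are hypothetical, since the paper's exact caption format is not reproduced in this article.

```python
# Standard AHA 17-segment names, keyed by segment number.
AHA_SEGMENTS = {
    1: "basal anterior", 2: "basal anteroseptal", 3: "basal inferoseptal",
    4: "basal inferior", 5: "basal inferolateral", 6: "basal anterolateral",
    7: "mid anterior", 8: "mid anteroseptal", 9: "mid inferoseptal",
    10: "mid inferior", 11: "mid inferolateral", 12: "mid anterolateral",
    13: "apical anterior", 14: "apical septal", 15: "apical inferior",
    16: "apical lateral", 17: "apex",
}

# Hypothetical radial-layer vocabulary for this sketch.
LAYERS = ("subendocardial", "mid-wall", "subepicardial", "transmural")

def build_caption(segment_ids, layer):
    """Turn structured scar attributes into a caption string (template is illustrative)."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    names = [AHA_SEGMENTS[s] for s in sorted(set(segment_ids))]
    return f"{layer} scar in segments: {', '.join(names)}"

print(build_caption([9, 8], "subendocardial"))
# subendocardial scar in segments: mid anteroseptal, mid inferoseptal
```

The design choice worth copying is the validation: an unknown segment or layer fails loudly instead of producing a caption the text encoder will quietly misread.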

The right metric stack is realism, alignment, and utility

The authors explicitly argue that generation should be assessed across image realism, conditioning consistency, and downstream utility. This is the cleanest operational framework in the paper.

For business use, I would translate it into a three-gate evaluation pipeline:

| Gate | Question | Metric examples from the paper | Failure mode if ignored |
| --- | --- | --- | --- |
| Realism | Does the synthetic image remain plausible as an LGE MRI image? | SSIM, RMSE, visual inspection around myocardium | The model learns artifacts or anatomy-breaking shortcuts |
| Alignment | Did the generated scar follow the requested location and rough shape? | TP-only Dice; pass rate (share of samples with Dice above 0.6 under the quality module) | The dataset contains labels that do not match the image |
| Utility | Does the synthetic data improve the downstream model on real test cases? | Segmentation Dice, TP-only Dice, accuracy, balanced accuracy, confusion matrix | The synthetic data looks good but hurts deployment performance |

This stack is valuable because each gate catches a different kind of error. Realism catches ugly images. Alignment catches fake labels. Utility catches beautiful irrelevance.
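A minimal sketch of the three gates as code, fed with figures quoted in this article. The thresholds `min_ssim` and `min_pass_rate` are hypothetical placeholders chosen for illustration, not values from the paper, and the class and function names are invented.

```python
from dataclasses import dataclass

@dataclass
class GeneratorReport:
    name: str
    ssim: float                # realism proxy (whole-image SSIM)
    pass_rate: float           # alignment proxy (share of samples with Dice > 0.6)
    balanced_accuracy: float   # downstream utility on real test data

def gate_results(report, baseline_balanced_accuracy,
                 min_ssim=0.5, min_pass_rate=0.15):
    """Evaluate the three gates; the two minimum thresholds are hypothetical."""
    return {
        "realism": report.ssim >= min_ssim,
        "alignment": report.pass_rate >= min_pass_rate,
        "utility": report.balanced_accuracy > baseline_balanced_accuracy,
    }

# Values as quoted in this article; the real-only baseline is 0.73.
spade_fc = GeneratorReport("SPADE-FC", ssim=0.974, pass_rate=0.188, balanced_accuracy=0.72)
lgesynth = GeneratorReport("LGESynthNet", ssim=0.587, pass_rate=0.181, balanced_accuracy=0.90)

for report in (spade_fc, lgesynth):
    print(report.name, gate_results(report, baseline_balanced_accuracy=0.73))
```

Under these placeholder thresholds both generators clear the first two gates, and only the utility gate separates them.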

The SPADE-FC result illustrates why the third gate is essential. High image realism did not translate into downstream improvement. The synthetic data was not merely less useful than expected; it was associated with worse accuracy and balanced accuracy than real-only training. In a business setting, that is the difference between “we augmented the dataset” and “we paid to make the model worse.”

The quality-control filter also deserves business attention. The authors reuse the reward model to select generated samples whose predicted scar mask overlaps sufficiently with the conditioning mask. This filter checks location and rough shape, although the authors note it does not directly assess texture quality. That is a precise limitation, and an important one. The filter is task-aware, but it is not omniscient.

A mature production pipeline would likely need additional gates: radiologist review for a subset, scanner/vendor stratification, external-site validation, and monitoring for generator-specific artifacts. The paper does not provide those. It provides a research prototype showing why such gates are necessary.

What the paper directly shows, and what Cognaptus infers

This paper directly shows that, on the authors’ dataset and experimental setup, controlled synthetic LGE scar images from LGESynthNet improve downstream scar segmentation and patient-level detection compared with real-only training. It also shows that image-quality metrics alone can be misleading: SPADE-FC has the best SSIM but performs poorly downstream.

Cognaptus infers a broader product lesson: synthetic medical data is most valuable when it is designed as a controlled augmentation system, not a content-generation system. The core asset is not the generator by itself. The asset is the loop:

1. define clinically meaningful conditions;
2. generate images under those conditions;
3. filter for condition fidelity;
4. train the downstream model;
5. evaluate on real clinical cases;
6. adjust the generation strategy based on task failures.

That loop is where business value can emerge. It can reduce dependence on scarce positive cases, target rare pathology configurations, and improve model behavior in clinically important edge cases. But it is also where cost enters: domain rules, segmentation reward models, quality filters, compute, and validation are not free. Synthetic data is cheaper than collecting every rare case manually only when the pipeline is disciplined enough not to manufacture noise at scale.

For healthcare AI companies, the paper suggests three practical design principles.

First, synthetic data should be condition-first. A team should decide which clinical variations are underrepresented before generating anything. In LGESynthNet, the authors control scar location, size, and transmural extent. That is more useful than generating a vague pile of plausible scans.

Second, synthetic data needs task-aware filtering. If the intended scar is not actually represented in the generated image, the sample should not be treated as valid training data. This sounds obvious until one remembers how many datasets are assembled by scripts that never look back.

Third, downstream validation should outrank image fidelity. A generated sample that improves balanced accuracy is more valuable than one that wins a beauty contest judged by SSIM. Medical AI does not need prettier synthetic images. It needs synthetic images that teach the model something true enough to transfer.

The boundaries are narrow, and that is exactly why they are useful

The paper’s limitations are not generic footnotes. They materially affect how the result should be used.

The data is single-center and single-vendor: scans came from 1.5T Siemens MAGNETOM Aera systems. That means the paper does not establish multi-center, multi-vendor generalization. A model trained or validated on one acquisition environment may not behave the same under different scanners, protocols, reconstruction pipelines, or patient populations.

The ground truth is semi-automated. Expert readers assigned binary segment labels using the AHA 17-segment model, and pixel-wise masks were derived with an n-standard-deviation method and manually refined. This is reasonable, but it is not the same as fully manual pixel-level expert annotation for every image. The segmentation target itself inherits assumptions from the label-generation process.

The synthetic scar masks used at inference are simple ellipsoids placed in anatomically defined regions. That supports controlled experiments, but real LGE patterns can be more complex. The authors explicitly identify more realistic clinical patterns as future work.
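For a sense of how simple such a conditioning mask is, here is a sketch that rasterizes a rotated ellipse. It only draws the shape at a given center. Placing it inside a specific myocardial segment, as the paper does, would need an anatomical segmentation that this example does not model. The function and parameter names are illustrative.

```python
import numpy as np

def ellipse_mask(shape, center, radii, angle_deg=0.0):
    """Binary mask of an ellipse.

    shape:  (rows, cols) of the image
    center: (row, col) of the ellipse center
    radii:  (row_radius, col_radius) before rotation
    """
    rows, cols = np.indices(shape)
    y = rows - center[0]
    x = cols - center[1]
    theta = np.deg2rad(angle_deg)
    # Rotate coordinates into the ellipse's own axes.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return (xr / radii[1]) ** 2 + (yr / radii[0]) ** 2 <= 1.0

mask = ellipse_mask((256, 256), center=(128, 128), radii=(4, 8))
print(mask.sum(), f"{mask.mean():.4%}")  # pixel count and share of the image
```

At this size the ellipse covers well under 1% of a 256×256 image, the same order of sparsity the paper describes for real scars, which is a reminder of how little signal the generator and the downstream model have to work with.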

The pass rates are also low across methods. LGESynthNet’s pass rate in Table 1 is 18.1%, and SPADE-FC’s is 18.8%. Real images have a pass rate of 35.6% under the same quality-module criterion. In other words, condition adherence remains hard. The paper’s result is not “we solved controlled scar synthesis.” It is closer to “with the right controls and filtering, enough useful synthetic samples can be selected to improve a downstream model.” That is a less glamorous claim. It is also much more believable.

Finally, the disclosure matters. Two authors are employees of Siemens Healthineers, and the paper states that the presented concepts are research results that are not commercially available, with no guarantee of future commercial availability. This does not weaken the technical result, but it does shape how business readers should interpret product readiness.

The business value is not cheaper images; it is better scarcity management

The lazy business interpretation of this paper would be: synthetic data reduces annotation costs. That is partly true, but incomplete.

The sharper interpretation is that controlled synthetic data helps manage scarcity. Scar-positive LGE examples are not just expensive; they are unevenly distributed across locations, transmural patterns, image quality, and clinical presentations. A model trained only on what happens to be available may be underexposed to the cases that matter most.

LGESynthNet shows a way to intervene in that distribution. Instead of waiting for more rare cases, the pipeline creates controlled approximations, filters them, and tests whether they improve real-case performance. This is closer to curriculum design than data dumping.

That framing changes the procurement question. A hospital AI group or medtech company should not ask a synthetic-data vendor, “How many images can you generate?” It should ask:

| Procurement question | Better version |
| --- | --- |
| How realistic are the images? | Which downstream error modes do the images reduce? |
| How many samples can be generated? | Which underrepresented clinical conditions can be controlled? |
| Can clinicians recognize them as plausible? | Do task models trained with them improve on external real data? |
| Is the generator state-of-the-art? | Is the generator embedded in a validation loop with rejection criteria? |

This is where the paper is useful beyond cardiac MRI. The technical details are domain-specific, but the evaluation logic travels. In fraud detection, rare-event insurance, industrial defect inspection, and medical imaging, synthetic data has the same basic temptation: fill the minority class. The danger is also the same: generate examples that match the label schema but not the real-world phenomenon.

LGESynthNet does not remove that danger. It demonstrates one disciplined way to reduce it.

Synthetic data should be audited like training infrastructure

The most important idea to take from the paper is not that latent diffusion can synthesize cardiac scar. It is that synthetic data becomes useful only when the generation process is tied to the downstream learning problem.

That means synthetic-data pipelines should be audited like training infrastructure. The generator, conditioning rules, reward model, quality filter, label assignment, sample-count policy, and downstream validation all affect the final model. If any part is weak, the word “synthetic” becomes a polite way to say “unverified.”

The paper’s best result is therefore not just the Dice improvement from 0.72 to 0.77, or the balanced accuracy improvement from 0.73 to 0.90 with 300 LGESynthNet samples. Those numbers are important, but they are tied to a specific dataset and setup. The durable lesson is the evaluation hierarchy: realism is necessary, alignment is harder, and downstream utility is the judge.

That hierarchy is useful for anyone building AI systems under data scarcity. It says: do not buy synthetic data by the kilogram. Do not trust image fidelity as a proxy for learning value. Do not assume that a visually convincing minority-class example is a useful training example.

Synthetic scar tissue is a good metaphor for synthetic data itself. The useful kind is not merely visible. It must attach to the right structure, follow the right boundaries, and change the behavior of the system around it.

Otherwise, it is just decorative fibrosis.

Cognaptus: Automate the Present, Incubate the Future.

¹ Athira J. Jacob, Puneet Sharma, and Daniel Rueckert, “LGESynthNet: Controlled Scar Synthesis for Improved Scar Segmentation in Cardiac LGE-MRI Imaging,” arXiv:2603.18356, 2026.
