Synthetic Data’s Ghost Problem: Auditing the Leaks That Weren’t

TL;DR for operators

Synthetic data privacy reviews should stop treating every rare match as proof of memorization. That is the useful correction in Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data, a paper that turns synthetic-data auditing into a controlled experiment rather than an anxious string search.¹

The paper’s mechanism is simple enough to be dangerous in the right way: split the source corpus into training and holdout records; generate synthetic data from the training split; extract rare features from training, holdout, and synthetic data; then ask whether synthetic matches are disproportionately concentrated in the training split. Matches against training records are potential true disclosures. Matches against holdout records are phantom disclosures: things that look like leaks but could have appeared even if that record had never been used.

That distinction matters because raw disclosure counts are a bad privacy metric. In one SFT Finance experiment, the authors report 763 PII matches, but 271 of them are holdout-side phantom matches. A naïve dashboard would call all 763 “leaks,” perhaps with a red icon and a meeting. The audit asks the better question: is there excess training-side disclosure beyond the phantom rate?

Empirically, the answer depends strongly on the generation method. Non-private rewrites and non-private supervised fine-tuning show significant evidence of disclosure across PII, verbatim strings, and semantic similarity tests. DP-SGD-generated synthetic data still produces matches, sometimes many of them, but those matches are distributed close enough to the train/holdout baseline that the tests fail to reject the zero-learning null. This is the entire point: a scary match count is not the same thing as evidence of training-data leakage. Privacy theater has paperwork; this has a control group.

For business use, the framework is best understood as a synthetic-data release gate. It can work without model access, canaries, or shadow-model training. It tells reviewers what kind of information was disclosed, whether the training set is overrepresented, and what empirical privacy-loss lower bounds or membership-inference signals are visible. Its boundaries are equally important: it needs a holdout control set, enough rare-feature matches to have statistical power, and feature extractors that actually represent the risks the business cares about. Failure to reject the zero-learning null is not a certificate of innocence. It is evidence that this audit, with these disclosure definitions, did not find excess training-set disclosure. Less dramatic, more useful.

The real privacy question is not “did we find a match?”

A synthetic data review often begins with a comforting ritual. Generate the synthetic dataset. Run a PII detector. Search for duplicated strings. Count suspicious overlaps. If the count is low enough, call the release “privacy-preserving.” If the count is high, call a governance meeting and use the word “risk” until everyone looks serious.

The paper’s central observation is that this ritual confuses two different events.

A synthetic record might reproduce a rare feature from the training set because the generator memorized or reconstructed it. That is a true disclosure. But the same synthetic record might also reproduce a rare feature from a holdout record that was not used for generation. That is a phantom disclosure. It looks like a leak, but it cannot have been caused by training on that specific heldout user.

This is not an edge case. Phantoms arise naturally when the generator understands structure, when common formats produce coincidences, or when the base model has inductive bias. A phone number-shaped string can be generated because the model knows what phone numbers look like. A bank-form-style document can contain plausible account-like values because the format invites them. A synthetic comment can semantically resemble a heldout comment because people on the internet, despite their best efforts, are not infinitely original.

The paper therefore replaces the crude question — “did any rare source feature appear in synthetic data?” — with a causal one:

Would this apparent disclosure have been more likely if the corresponding user had been in the training set?

That is the mechanism-first heart of the paper. The contribution is not merely that the authors found leaks. It is that they built a way to stop over-attributing every frightening overlap to memorization.

The audit pipeline: split, extract, compare, test

The framework starts from a source dataset $D$. Each record is randomly assigned to either a training split $D_{\text{train}}$ or a holdout split $D_{\text{hold}}$, with the experiments using $p=0.5$ as the training inclusion probability. The synthetic generation mechanism sees only the training split. The auditor then examines the training data, holdout data, and synthetic output using the same disclosure lens.

The pipeline can be read as:

Source corpus → random train/holdout split → train-side generation of synthetic data → feature extraction from train, holdout, and synthetic data → train/synthetic rare-feature matches = candidate true disclosures → holdout/synthetic rare-feature matches = phantom disclosures → hypothesis test: is the train side too heavy?
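
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ implementation: the function names, the `extract` callback, and the choice to treat a feature as rare when exactly one source record contains it are assumptions made for the example. The generator stays outside the code on purpose, since the audit only needs its output.

```python
import random
from typing import Callable, Dict, Hashable, Iterable, List, Sequence, Set, Tuple


def split_corpus(records: Sequence[str], p_train: float = 0.5,
                 seed: int = 0) -> Tuple[List[str], List[str]]:
    """Assign each record independently to train (prob. p_train) or holdout."""
    rng = random.Random(seed)
    train, holdout = [], []
    for rec in records:
        (train if rng.random() < p_train else holdout).append(rec)
    return train, holdout


def count_disclosures(train: Sequence[str], holdout: Sequence[str],
                      synthetic: Iterable[str],
                      extract: Callable[[str], Set[Hashable]]) -> Dict[str, int]:
    """Count rare source features that reappear in the synthetic data, per side.

    Assumption: a feature is "rare" if exactly one source record contains it.
    """
    owners: Dict[Hashable, Set[Tuple[str, int]]] = {}
    for side, recs in (("train", train), ("holdout", holdout)):
        for idx, rec in enumerate(recs):
            for feat in extract(rec):
                owners.setdefault(feat, set()).add((side, idx))

    synthetic_feats: Set[Hashable] = set()
    for rec in synthetic:
        synthetic_feats.update(extract(rec))

    counts = {"train": 0, "holdout": 0}
    for feat, recs in owners.items():
        if len(recs) == 1 and feat in synthetic_feats:
            side, _ = next(iter(recs))
            counts[side] += 1  # train = candidate true disclosure, holdout = phantom
    return counts
```

With a toy word-level extractor such as `lambda rec: set(rec.split())`, the returned dictionary gives the two numbers every later test consumes: train-side matches and holdout-side matches.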

This structure is deliberately black-box. The audit does not require access to model weights, logits, prompts, gradients, or internal traces. It requires the synthetic output, the private corpus that was synthesized, and a holdout sample drawn from the same distribution. That makes the method unusually relevant to actual synthetic-data governance, where the release artifact is often a dataset and the model is either inaccessible, outsourced, retired, proprietary, or all four because enterprise procurement enjoys collecting failure modes.

The feature layer is also modular. The paper instantiates three disclosure types:

Disclosure class	What it detects	Operational meaning
PII leakage	Rare PII-like strings detected in source data and found in synthetic data	“The synthetic data appears to contain identity-linked or sensitive fields.”
Regurgitation	Rare contiguous token sequences, using n-grams	“The model reproduced source text verbatim or near-verbatim.”
Semantic reconstruction	Source records that are unusually close to synthetic records in embedding space	“The model preserved meaning or structure even when the words changed.”
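
As an illustration of the regurgitation extractor in the table, here is a small sketch of rare n-gram extraction. Whitespace tokenization, the default `n=3`, and the reading of the rarity parameter `k` as "shared by at most `k` records" are assumptions for the example; the paper’s tokenizer and span lengths may differ, and PII or embedding extractors would need their own tooling.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> Set[NGram]:
    """All contiguous n-word windows in a text (lowercased, whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def rare_ngrams(records: Sequence[str], n: int = 3, k: int = 1) -> Dict[NGram, Set[int]]:
    """Map each n-gram held by at most k records to the indices of those records."""
    holders: Dict[NGram, Set[int]] = defaultdict(set)
    for idx, text in enumerate(records):
        for gram in word_ngrams(text, n):
            holders[gram].add(idx)
    return {gram: ids for gram, ids in holders.items() if len(ids) <= k}
```

Keeping the record indices alongside each n-gram is what lets a later step say which split a matched feature points back to.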

This is a cleaner design than a one-score membership inference system. A single risk score may be good for a benchmark leaderboard. It is less helpful to a privacy officer asking whether the model leaked phone numbers, job-posting text, user comments, or only the general vibe of an HR department slowly losing the will to live.

The paper’s design keeps the disclosure witness attached to the metric. That matters. In business terms, it turns “privacy risk detected” into something closer to “rare 11–20 word substrings from the training corpus appear in the release” or “semantic reconstructions are concentrated among training records.” The second version is actionable. The first is a mood.

Phantoms are not noise; they are the baseline

The most important concept in the paper is the phantom disclosure. A phantom is a match between the synthetic data and a holdout record. Because the holdout record was not used to generate the synthetic dataset, that match cannot be attributed to that record’s inclusion in training.

The paper identifies several reasons phantoms can occur:

First, generalization. If many source records share a format, geography, vocabulary, or domain convention, the generator may produce a value that incidentally overlaps with a holdout record.

Second, inductive bias. A model may already know how emails, forms, identifiers, and phone numbers are structured. It does not need to memorize a particular phone number to produce a plausible number-shaped string.

Third, contamination. In public benchmark datasets, a foundation model may have seen some data before the experiment. For private enterprise deployments, the authors argue this is less central because the data is useful precisely because it is not public. Still, it is a reminder that public synthetic-data auditing benchmarks are rarely as clean as their tables would like us to believe.

The business consequence is straightforward: disclosure counts need denominators and baselines. A synthetic dataset with 500 apparent matches may look more concerning than one with 5, but not if those 500 split roughly evenly between training and holdout records, so that the training side is not overrepresented. Conversely, a small number of highly train-concentrated PII matches can be more serious than a large but balanced pile of phantoms.
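
A back-of-envelope way to read such counts is to use the holdout side as the phantom baseline and subtract it from the train side. The helper below is a hypothetical point estimate under the assumption that phantoms hit train and holdout records at the same per-record rate; it carries no uncertainty and is not the paper’s test statistic.

```python
def excess_train_disclosures(train_matches: int, holdout_matches: int,
                             p_train: float = 0.5) -> float:
    """Train-side matches minus the phantom count expected on the train side.

    Assumption: phantoms occur at the same per-record rate in both splits, so
    the expected train-side phantom count is holdout_matches * p / (1 - p).
    """
    expected_phantoms_on_train = holdout_matches * p_train / (1.0 - p_train)
    return train_matches - expected_phantoms_on_train
```

Applied to the SFT Finance PII counts quoted in this article (492 train-side, 271 holdout-side), the estimate is 221 excess matches rather than 763 “leaks.” Whether that excess is statistically meaningful is a question for the formal tests.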

This is where the paper quietly attacks a common dashboard failure. Most governance tools are good at counting. Far fewer are good at asking whether the count means what the dashboard color implies.

The statistical engine: privacy as a hypothesis test

Once the audit has train-side and holdout-side disclosures, the paper turns the comparison into formal tests.

The first is the zero-learning test. Under the zero-learning null, the synthetic output distribution does not depend on the input data. If that were true, rare features appearing in the synthetic data should not be unusually concentrated among training records. With a random 50/50 split, the train-side share should behave like chance, adjusted for feature incidence. Large train-side concentration is evidence against zero learning.

The paper uses a test statistic $T$ that aggregates the matched rare features associated with training records. It then constructs a one-sided test: reject the null when $T$ is too large. The paper also reports a lower bound $\hat{p}$ on the effective training-side match probability. When $\hat{p}$ meaningfully exceeds $0.5$ under the 50/50 split, the result indicates train-side overrepresentation.
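
To show the shape of such a test, the sketch below treats each match as an independent coin flip that lands train-side with probability 0.5 under the null. That binomial model is a simplifying assumption for illustration: the paper’s $T$ and $\hat{p}$ aggregate features and adjust for incidence, so this code will not reproduce the reported values. All function names are hypothetical.

```python
import math


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """log P[X = k] for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def upper_tail(k: int, n: int, p: float) -> float:
    """P[X >= k], summed in log space so large counts do not overflow."""
    if k <= 0:
        return 1.0
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(logs)
    return min(1.0, math.exp(peak) * sum(math.exp(v - peak) for v in logs))


def zero_learning_pvalue(train_matches: int, holdout_matches: int,
                         p_train: float = 0.5) -> float:
    """One-sided p-value: is the train side heavier than chance allows?"""
    n = train_matches + holdout_matches
    return upper_tail(train_matches, n, p_train)


def train_share_lower_bound(train_matches: int, holdout_matches: int,
                            alpha: float = 0.05, iters: int = 50) -> float:
    """Exact (Clopper-Pearson style) one-sided lower bound on the train share.

    Found by bisection: the smallest p at which seeing this many train-side
    matches is no longer surprising at level alpha.
    """
    n = train_matches + holdout_matches
    if train_matches == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if upper_tail(train_matches, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo
```

A lower bound above 0.5 corresponds to rejecting the null at level `alpha` in this simplified model; a bound at or below 0.5 means the train side is not demonstrably overrepresented.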

The second is the DP-bounded learning test. This asks a weaker, more realistic question: are the observed disclosures consistent with a mechanism satisfying a target $\epsilon$-differential privacy bound? The authors derive tests with controlled type I error and empirical lower bounds on privacy loss. The practical meaning is not “we computed the true privacy of the system.” It is: given these observed disclosures and this test, privacy loss must be at least this large for the result to be compatible with the null.
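
To make the DP-bounded null concrete, a simplified reading under pure $\epsilon$-DP and a 50/50 split (an assumption for illustration; the paper’s own derivation, which also has to handle $\delta$, may differ) is that no observer can assign a matched record to the training split with probability above $e^{\epsilon}/(1+e^{\epsilon})$. Inverting that bound turns a train-side lower bound $\hat{p} > 0.5$ into an empirical lower bound on $\epsilon$:

$$
\Pr[\text{train} \mid \text{match}] \;\le\; \frac{e^{\epsilon}}{1+e^{\epsilon}}
\quad\Longrightarrow\quad
\hat{\epsilon} \;=\; \ln\frac{\hat{p}}{1-\hat{p}} .
$$

As arithmetic only, a lower bound of $\hat{p}=0.836$ would map to $\hat{\epsilon}\approx\ln(0.836/0.164)\approx 1.63$, and any $\hat{p}\le 0.5$ yields no positive bound. These are worked illustrations of the formula, not figures reported by the paper.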

The third layer is a user-match membership inference attack. Instead of only counting feature matches in aggregate, the framework can score individual records by similarity to the synthetic dataset and ask whether those scores predict training membership. The paper reports AUC for these attacks, allowing comparison with prior canary-based data auditing.
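
A user-match attack of this kind reduces to two pieces: a per-record score and an AUC over train versus holdout records. The sketch below uses a hypothetical score (the number of a record’s rare features that reappear in the synthetic data) and computes AUC directly as the probability that a random training record outscores a random holdout record. The paper’s scoring functions may differ.

```python
from typing import Hashable, Sequence, Set


def user_match_score(record_features: Set[Hashable],
                     synthetic_features: Set[Hashable]) -> int:
    """Hypothetical score: how many of this record's features reappear."""
    return len(record_features & synthetic_features)


def membership_auc(train_scores: Sequence[float],
                   holdout_scores: Sequence[float]) -> float:
    """AUC as P[train score > holdout score], counting ties as one half."""
    wins = 0.0
    for t in train_scores:
        for h in holdout_scores:
            if t > h:
                wins += 1.0
            elif t == h:
                wins += 0.5
    return wins / (len(train_scores) * len(holdout_scores))
```

An AUC near 0.5 means the scores carry no membership signal; values well above 0.5 mean that the synthetic release helps an observer tell training records from holdout records.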

The causal language in the paper is careful. The audit supports a population-level causal claim: training-set membership caused an excess rate of disclosures. It does not claim to trace a specific synthetic output back to a specific training record with mechanistic certainty. That distinction will disappoint anyone looking for a magical leak detector. Good. Magic has had enough budget.

What each experiment is actually doing

The experiments are not one undifferentiated evidence pile. They serve different purposes, and treating every table as the same kind of proof would flatten the paper’s argument.

Evidence block	Likely purpose	What it supports	What it does not prove
Rewrites: embedding two-sample tests	Main evidence	Synthetic rewrites are closer to training records than holdout records across the six rewrite datasets at the rare-record threshold.	It does not identify the specific disclosed feature or prove individual provenance.
Rewrites: feature-match tests	Main evidence	PII and rare-string matches can be concentrated in training data despite rewrite instructions.	It does not catch all semantic leakage, especially rephrased information.
Rewrites: user-match AUC	Main evidence and comparison with prior work	Disclosure-derived scores can support membership inference and often outperform the canary-echo baseline, especially with embeddings.	AUC alone is not a legal privacy threshold; significance depends on sample size.
Table 4 and Appendix A.4 examples	Interpretability / exploratory extension	Embedding-based methods find semantic leakage that n-gram overlap can miss.	Examples are illustrative, not a second benchmark.
Rarity parameter and rarity-threshold figures	Robustness / sensitivity test	Audit power depends on feature rarity, match counts, and whether rare or common patterns are being tested.	The figures do not establish universal optimal thresholds.
SFT versus DP-SGD feature and user tests	Main evidence and generation-method comparison	Non-private SFT leaks; DP-SGD matches are largely consistent with phantom disclosures.	It does not prove DP-SGD eliminates every possible leakage class.
Canary appendix	Exploratory extension and comparison	The framework can incorporate canaries, with mixed gains; canaries are not necessary for the main audit.	Canary success does not automatically explain natural data disclosures.

This distinction matters because the paper’s strongest result is not a single number. It is the consistency of the mechanism across several audit views: feature-level witnesses, user-level inference, DP-bounded lower bounds, and sensitivity checks.

Rewrites leak even when politely instructed not to

The rewrite setting is the one most operators will recognize. Take a real record. Ask an LLM to rewrite it. Tell the model not to reproduce exact sentences. Tell it not to include PII. Release the result as “synthetic” or “anonymized.” Add a compliance slide. Everybody loves a compliance slide.

The paper tests this workflow using six public datasets: Finance, NYT comments, Panorama, Panorama+, Postings, and Tweets. The rewrite model is prompted to avoid exact sentences and PII. The paper then audits the synthetic rewrites.

The first result is broad: using a two-sample embedding similarity test, synthetic rewrites are more similar to training records than to holdout records across all six datasets at the reported rarity threshold. This is main evidence that the rewrite process carries training-set signal.
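
One way to picture a two-sample similarity test: score every source record by its closest synthetic record, then ask whether training records score higher than holdout records. The sketch below uses cosine similarity and a permutation test on the difference in means. Both choices are assumptions for illustration, since this article does not specify the exact test the paper uses, and the embeddings are assumed to be precomputed vectors.

```python
import math
import random
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_similarity(vec: Sequence[float],
                       synthetic_vecs: Sequence[Sequence[float]]) -> float:
    """Similarity of one source record to its closest synthetic record."""
    return max(cosine(vec, s) for s in synthetic_vecs)


def permutation_pvalue(train_scores: Sequence[float],
                       holdout_scores: Sequence[float],
                       n_perm: int = 2000, seed: int = 0) -> float:
    """One-sided p-value for mean(train) > mean(holdout) under label shuffling."""
    rng = random.Random(seed)
    n_train = len(train_scores)
    observed = (sum(train_scores) / n_train
                - sum(holdout_scores) / len(holdout_scores))
    pooled = list(train_scores) + list(holdout_scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        left, right = pooled[:n_train], pooled[n_train:]
        if sum(left) / len(left) - sum(right) / len(right) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one keeps the estimate above zero
```

Shuffling the train/holdout labels is what supplies the baseline here: it simulates a world in which membership has no effect on similarity to the synthetic data.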

The feature-match results show why the holdout baseline matters. In the Postings dataset, the DLP (PII detection) audit finds 121 total PII matches, with 116 on the training side and only 5 on the holdout side. The reported lower bound $\hat{p}=0.836$ sits well above the 0.5 expected under a 50/50 split, so the test rejects the zero-learning null. That is not just “we found PII-like strings.” It is “PII-like matches are far too concentrated among records the model actually saw.”

The string results are similarly uncomfortable. For Finance rewrites, the paper reports 28,278 rare string disclosures, with 26,602 on the training side and 1,676 on the holdout side, yielding $\hat{p}=0.792$. Panorama rewrites show 8,362 rare-string disclosures, 8,293 training-side and 69 holdout-side, with $\hat{p}=0.548$. Tweets show 128 rare-string disclosures, 127 training-side and 1 holdout-side, with $\hat{p}=0.533$.

Those are not abstract privacy vibes. They are concrete witness classes: rare strings and PII-like features that appear in the synthetic release and disproportionately point back to the training split.

Still, feature matching is not the whole story. Some rewrite datasets show little or no feature-match evidence for certain extractors, yet user-match and embedding methods reveal semantic leakage. The paper reports user-match AUC scores for rewrites where embeddings reach 0.93 on NYT comments and 0.86 on Tweets. That means semantic similarity can identify training membership far better than chance in those settings, even when exact n-gram overlap is not the main signal.

This is important for enterprise practice because many privacy filters are still built around exact strings, regexes, and PII detectors. Those tools are useful. They are also insufficient. A model can preserve the substance of a record while changing enough surface text to dodge a substring detector. The paper’s examples show training records and synthetic rewrites sharing distinctive semantic structure even after substantial rephrasing. The memo was not deleted. It was paraphrased with plausible deniability.

SFT leaks; DP-SGD mostly collapses into the phantom baseline

The second major experimental block compares non-private supervised fine-tuning with DP-SGD fine-tuning. The setup uses Finance, Panorama, and Panorama+, the datasets with explicit fictitious PII insertion. The SFT model is a Gemma 3 1B instruction checkpoint fine-tuned without privacy constraints. The DP-SGD variant is trained with an $(\epsilon,\delta)$-DP optimizer, with $\epsilon=10$ and $\delta=N^{-1.1}$, where $N$ is the training-set size. For both SFT and DP-SGD, the authors sample 100,000 synthetic records from the resulting model.

The SFT results are unsurprising in the useful sense: non-private fine-tuning leaks. In Table 5, the audit rejects the zero-learning null for SFT on at least one feature type in all three datasets. Finance DLP has 763 total PII disclosures, 492 training-side and 271 holdout-side, with $\hat{p}=0.590$. Finance strings are more severe: 20,491 total disclosures, 11,942 training-side and 8,549 holdout-side, with $\hat{p}=0.827$. Panorama strings show 5,114 disclosures, 3,211 training-side and 1,903 holdout-side, with $\hat{p}=0.654$.

Now comes the more interesting part. DP-SGD does not produce zero matches. For Finance strings, the paper reports 13,693 total disclosures: 7,269 training-side and 6,424 holdout-side. A bad audit would wave around “13,693 rare string matches” and perhaps order everyone to stop using models until Q4. The paper’s audit instead asks whether those matches are disproportionately train-side. The answer is no: $\hat{p}=0.458$, and the test fails to reject the zero-learning null. For Finance DLP, DP-SGD has 26 total disclosures, 15 train and 11 holdout, with $\hat{p}=0.302$. Panorama has small or zero counts depending on extractor. Across DP-SGD cases, the authors report no significant DP-bounded lower-bound estimates.

That distinction is the paper’s practical payload. DP-SGD does not make synthetic data visually free of overlaps. It makes the overlaps look statistically like phantoms, at least for the tested feature classes and datasets. This is a much more realistic privacy story than “no matches will ever appear.” It is also much more useful for governance because real releases rarely have the decency to be perfectly clean.

The user-match results reinforce the pattern. For SFT, the paper reports significant AUC for Finance and Panorama using at least one proposed method: Finance substring features reach 0.63 versus 0.48 for the canary-echo baseline; Panorama reaches 0.53 versus 0.50. For DP-SGD, the proposed methods sit near chance: Finance 0.52 or 0.50, Panorama 0.50 or 0.46. The canary baseline produces a significant Panorama+ score even under DP-SGD, which is a reminder that canary-style tests measure a particular attack surface, not necessarily natural disclosure behavior.

The appendix is mostly about power, not a second thesis

The appendix adds useful checks, but its role is not to overturn the main story.

The rarity analysis is a sensitivity test. The main experiments focus on $k=1$, meaning features unique to one user. The appendix varies $k$, allowing features shared by multiple users. As $k$ increases, the number of valid audit features tends to fall, which reduces statistical power and widens the gap between point estimates and conservative lower bounds. The operational lesson is not “always use $k=1$.” It is that feature rarity is a power and threat-model choice. Unique secrets are clean for attribution; shared secrets may be business-relevant but statistically harder.

The rarity-threshold tests are also sensitivity checks. For rewrites, low rarity thresholds detect leakage across all datasets. As the audit includes more common records, power decreases in many cases. For SFT, increasing the threshold can sometimes help, consistent with the observation that fine-tuned models may generate common patterns more readily than rare unique features. This is exactly the kind of detail that matters in production: the best audit lens depends on how the generator learned, not on a universal threshold handed down from the mountain.

The canary appendix is an exploratory extension and comparison with prior work. The framework can incorporate canaries by placing train canaries into the training split and holdout canaries into the holdout split. The gains are mixed: the paper reports a slight boost on Panorama and negligible impact on Finance. More importantly, the authors note that canaries do not necessarily provide a causal explanation for natural data disclosures. That is the right emphasis. Canary tests are useful, but they answer a different question: “Can the system echo planted secrets?” The paper’s main framework asks: “Did the actual synthetic release disclose actual training-subject information beyond the phantom baseline?”

Those are related questions. They are not the same question. Governance systems that collapse them into one score are asking for a very expensive false sense of precision.

What the paper directly shows

The paper directly demonstrates four things.

First, black-box synthetic-data auditing can be done using only the released synthetic data, the source corpus, and a holdout control set. This is operationally significant because many real synthetic-data workflows do not give auditors access to the model internals. The audit follows the data artifact, not the model architecture.

Second, phantom disclosures are common enough that raw disclosure counts can substantially overstate privacy leakage. The Finance SFT DLP result is the clean example: 763 apparent PII matches include 271 holdout-side phantoms. This does not make the remaining training-side excess harmless. It makes the measurement less stupid.

Third, semantic leakage matters. Embedding-based user-match tests find leakage that exact n-gram methods can miss, particularly in rewrite settings. A model can leak structure, topics, relationships, or distinctive phrasing without copying a contiguous 20-token span. This is bad news for anyone whose privacy program is mostly regex plus optimism.

Fourth, explicit privacy training changes the pattern. In the tested DP-SGD settings, disclosures still appear, but they are not statistically concentrated enough in the training set to reject the zero-learning null. The authors interpret this as evidence that DP protections mitigate true disclosures and that observed matches are primarily phantoms.

What Cognaptus would infer for business use

The business interpretation is not “this paper solves synthetic-data privacy.” It does not. The useful inference is narrower and stronger: enterprises can turn synthetic-data release review into a controlled audit with interpretable witnesses.

Paper result	Business inference	Boundary
The audit needs no model access, canaries, or reference models.	It can be used as a release gate for internal generators, vendor-generated synthetic data, and archived generation workflows.	The auditor still needs the source corpus and a valid holdout set.
Phantom disclosures can be common.	Release dashboards should report train matches, holdout matches, and excess disclosure, not just raw overlap counts.	Phantom rates depend on dataset distribution, feature extractor, and base-model contamination risk in public data.
PII, n-gram, and embedding extractors reveal different risks.	Privacy review should define disclosure classes by threat model: identity fields, verbatim text, semantic reconstruction, or domain-specific secrets.	Untested disclosure classes remain unmeasured.
Non-private rewrites and SFT show leakage.	“Rewrite it with an LLM” is not a privacy guarantee. Neither is “we told the model not to include PII.” Charming instructions, weak contract.	Results depend on model, prompt, data, and sampling setup.
DP-SGD matches are largely indistinguishable from phantoms.	Formal privacy mechanisms can materially reduce true disclosure risk, even when apparent matches remain.	The paper tests specific datasets, models, and $\epsilon=10$ DP-SGD settings; it does not certify all DP workflows.
User-match AUC can be computed from disclosure scores.	Audits can quantify whether synthetic data enables membership inference without training shadow models.	AUC is not itself a business risk threshold; significance and harm depend on context.

For enterprises, this suggests a practical governance workflow:

1. Reserve a holdout set before synthetic generation. If the generator has already seen everything, the clean causal comparison is gone. Retroactive governance is a popular genre of fiction.
2. Define disclosure classes before running the audit. Include generic classes such as PII, rare strings, and semantic similarity, but add domain-specific extractors for account codes, medical concepts, contractual phrases, proprietary identifiers, transaction patterns, or other secrets that actually matter.
3. Generate synthetic data using the exact release configuration. Prompt changes, filtering, sampling temperature, fine-tuning method, and post-processing all belong inside the audited pipeline.
4. Report three layers: raw apparent disclosures, phantom-adjusted excess disclosures, and formal test outcomes such as p-values, $\hat{p}$ lower bounds, empirical $\epsilon$ lower bounds, or user-match AUC.
5. Tie release decisions to risk classes. A dataset with balanced phantom-like string overlaps may be acceptable for low-risk analytics. A dataset with train-concentrated PII or semantic reconstructions of sensitive records should trigger remediation, stronger privacy mechanisms, or release denial.

This is not glamorous. It is much better than glamorous.

Where the framework does not travel cleanly

The first boundary is the holdout requirement. The audit is strongest when the data owner controls the train/holdout partition before generation or has a natural holdout set that truly was not used. If the entire source corpus was already used to generate the synthetic data, the causal comparison is compromised. You can still run overlap checks. You cannot cleanly estimate phantoms in the same way.

The second boundary is feature coverage. The empirical lower bounds are valid for the chosen disclosure classes. If the auditor only tests PII strings and n-grams, the audit says little about semantic reconstruction. If the auditor tests generic embeddings but misses domain-specific secrets, the audit may understate risk. The paper explicitly frames feature selection as flexible and domain-specific; the hard work does not disappear. It moves into the audit design.

The third boundary is statistical power. Rare features are useful because they support attribution, but if too few matches exist, the test may lack power. Shared features complicate the picture. The appendix shows that as feature frequency changes, sample size and confidence bounds shift. In production, this means “no evidence” can mean no leakage, but it can also mean weak instrumentation.
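
A rough sense of the power problem, under a simplified model in which each of $m$ matches independently falls on the training side with probability $0.5$ under the null (an assumption for illustration, not the paper’s test statistic): even if every match is train-side, the one-sided p-value is $0.5^{m}$, so rejection at level $\alpha$ requires

$$
0.5^{m} \le \alpha \iff m \ge \log_2\frac{1}{\alpha}, \qquad \alpha = 0.05 \;\Rightarrow\; m \ge 5 .
$$

Four perfectly train-concentrated matches give $0.5^{4}=0.0625$ and cannot reject at the 5% level; five give $0.03125$. Any holdout-side match pushes the required count higher.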

The fourth boundary is population-level causality. The framework can establish that training membership causes an excess disclosure rate. It does not prove that one synthetic record came from one source record in a mechanistic sense. That kind of provenance would require different tools.

The fifth boundary is empirical scope. The paper evaluates particular public datasets, a rewrite model, Gemma-based SFT, and DP-SGD under a stated privacy configuration. The conclusion should not be inflated into “DP-SGD always fixes synthetic data” or “rewrites always leak.” The more precise statement is better: in these experiments, non-private generation shows excess training-side disclosure, while DP-SGD disclosures are largely consistent with the phantom baseline.

The useful governance artifact is the disclosure receipt

Synthetic data has been sold as a privacy escape hatch for years. The paper does not reject that ambition. It makes it less lazy.

The central move is replacing apparent disclosure with excess disclosure. A rare match is not automatically memorization. A holdout match is not automatically harmless. A raw count is not automatically risk. The audit earns its interpretation by comparing training and holdout behavior under a defined disclosure class.

That makes the framework valuable for enterprises because it produces a release artifact that is legible: what was tested, what matched, how many matches were phantoms, whether the training split was overrepresented, whether membership inference worked, and what empirical privacy-loss lower bounds were visible.
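
As a sketch of what such an artifact could look like in code, the dataclass below records one disclosure class per receipt. The class name, field names, and the rule that a $\hat{p}$ lower bound above 0.5 flags overrepresentation under a 50/50 split are hypothetical choices for illustration, not a schema defined by the paper.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DisclosureReceipt:
    """One audited disclosure class for one synthetic release (hypothetical schema)."""
    disclosure_class: str                    # e.g. "PII", "rare strings", "embeddings"
    train_matches: int                       # candidate true disclosures
    holdout_matches: int                     # phantom disclosures
    p_hat_lower: Optional[float] = None      # lower bound on the train-side share
    epsilon_lower: Optional[float] = None    # empirical privacy-loss lower bound
    user_match_auc: Optional[float] = None   # membership-inference signal

    @property
    def total_matches(self) -> int:
        return self.train_matches + self.holdout_matches

    @property
    def train_overrepresented(self) -> Optional[bool]:
        """None when no bound was computed; assumes a 50/50 split otherwise."""
        if self.p_hat_lower is None:
            return None
        return self.p_hat_lower > 0.5


receipt = DisclosureReceipt("PII", train_matches=492, holdout_matches=271,
                            p_hat_lower=0.590)
record = asdict(receipt)  # plain dict, ready to serialize next to the release
```

The example values are the SFT Finance PII figures quoted earlier in this article. Leaving untested layers as `None` keeps the receipt honest about what the audit did not measure.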

This is what synthetic-data governance should look like. Not a model card full of adjectives. Not a promise that the data is “anonymous” because someone asked the model nicely. A disclosure receipt.

And yes, the receipt may still be ugly. That is rather the point.

Cognaptus: Automate the Present, Incubate the Future.

¹ Kareem Amin, Rudrajit Das, Alessandro Epasto, Adel Javanmard, Dennis Kraft, Mónica Ribero, and Sergei Vassilvitskii, “Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data,” arXiv:2606.16952, 2026, https://arxiv.org/abs/2606.16952.
