Synthetic Data’s Ghost Problem: Auditing the Leaks That Weren’t
TL;DR for operators
Synthetic data privacy reviews should stop treating every rare match as proof of memorization. That is the useful correction in Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data, a paper that turns synthetic-data auditing into a controlled experiment rather than an anxious string search.1 The paper’s mechanism is simple enough to be dangerous in the right way: split the source corpus into training and holdout records; generate synthetic data from the training split; extract rare features from training, holdout, and synthetic data; then ask whether synthetic matches are disproportionately concentrated in the training split. Matches against training records are potential true disclosures. Matches against holdout records are phantom disclosures: things that look like leaks but could have appeared even if that record had never been used. ...
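The split-and-compare logic described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function names (`split_corpus`, `rare_features`, `audit`), the choice of whitespace tokens as "features", and the rarity threshold `max_count` are all hypothetical, and the generator is left out, so the synthetic records are simply passed in.

```python
import random
from collections import Counter


def split_corpus(records, holdout_fraction=0.5, seed=0):
    """Shuffle a copy of the records and split into (train, holdout)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def tokens(records):
    """Assumption: a 'feature' is a whitespace-delimited token."""
    return {tok for rec in records for tok in rec.split()}


def rare_features(records, max_count=1):
    """Tokens that occur in at most max_count records of the corpus."""
    counts = Counter(tok for rec in records for tok in set(rec.split()))
    return {tok for tok, n in counts.items() if n <= max_count}


def audit(train, holdout, synthetic, max_count=1):
    """Compare synthetic matches on train-only vs holdout-only rare features."""
    rare = rare_features(train + holdout, max_count)
    train_tok, holdout_tok, synth_tok = tokens(train), tokens(holdout), tokens(synthetic)
    # Keep only rare features that belong to exactly one side of the split.
    train_rare = (rare & train_tok) - holdout_tok
    holdout_rare = (rare & holdout_tok) - train_tok
    potential = train_rare & synth_tok   # potential true disclosures
    phantom = holdout_rare & synth_tok   # phantom disclosures
    return {
        "potential_disclosures": potential,
        "phantom_disclosures": phantom,
        "train_match_rate": len(potential) / len(train_rare) if train_rare else 0.0,
        "holdout_match_rate": len(phantom) / len(holdout_rare) if holdout_rare else 0.0,
    }


# Toy data, invented for illustration only.
train = ["alice zq17 likes tea", "bob likes tea"]
holdout = ["carol xk99 likes tea", "dave mm42 likes tea"]
synthetic = ["erin zq17 likes tea", "frank xk99 likes tea"]
report = audit(train, holdout, synthetic)
```

In this toy run, `zq17` counts as a potential disclosure and `xk99` as a phantom: the synthetic data reproduced a rare holdout token that the generator could never have seen. The quantity an operator would watch is the gap between `train_match_rate` and `holdout_match_rate`; a real audit would need far more records and a proper statistical test before reading anything into that gap.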