Collapse to Forget: Turning Model Collapse into a Privacy Feature for LLMs
TL;DR for operators

When an LLM leaks sensitive, copyrighted, or otherwise forbidden information, the obvious repair is to fine-tune it away from the bad answer. That sounds sensible until you notice the small operational comedy: the remediation process keeps using the very answer it is supposed to remove. The paper behind this article proposes Partial Model Collapse (PMC), a machine unlearning method that avoids directly optimising on ground-truth forget answers. Instead, PMC asks the model the sensitive question, samples multiple responses from the model itself, selects a response that is less like the model’s original answer, and fine-tunes on that self-generated response while also training on retain data to preserve general utility.1 ...
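The loop described above (ask, sample, select, fine-tune alongside retain data) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the token-overlap similarity, the function names `select_forget_target` and `build_pmc_batch`, the `generate` callable, and the idea of recording the model's original answers in a dictionary before unlearning are all hypothetical stand-ins. The selection criterion the paper actually uses may differ.

```python
from typing import Callable, Dict, List, Tuple


def jaccard_similarity(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]. An assumed stand-in for the paper's similarity measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_forget_target(original_answer: str, samples: List[str]) -> str:
    """Pick the sampled response that is least similar to the model's original answer."""
    return min(samples, key=lambda s: jaccard_similarity(original_answer, s))


def build_pmc_batch(
    forget_questions: List[str],
    original_answers: Dict[str, str],      # model outputs recorded before unlearning (assumption)
    retain_pairs: List[Tuple[str, str]],   # ordinary (prompt, answer) pairs that preserve utility
    generate: Callable[[str], str],        # hypothetical: samples one response from the current model
    num_samples: int = 4,
) -> List[Tuple[str, str]]:
    """Assemble (prompt, target) pairs for one fine-tuning step."""
    batch: List[Tuple[str, str]] = []
    for question in forget_questions:
        # Sample several responses from the model itself.
        samples = [generate(question) for _ in range(num_samples)]
        # The training target is self-generated, chosen to differ from the original answer.
        target = select_forget_target(original_answers[question], samples)
        batch.append((question, target))
    # Retain data is trained on as-is, alongside the forget examples.
    batch.extend(retain_pairs)
    return batch
```

In a real setup, `generate` would wrap a sampling call on the model being unlearned, and the returned batch would feed a standard supervised fine-tuning step. Repeating the loop means each round samples from the already-updated model.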