TL;DR for operators

When an LLM leaks sensitive, copyrighted, or otherwise forbidden information, the obvious repair is to fine-tune it away from the bad answer. That sounds sensible until you notice the small operational comedy: the remediation process keeps using the very answer it is supposed to remove.

The paper behind this article proposes Partial Model Collapse (PMC), a machine unlearning method that avoids directly optimising on ground-truth forget answers. Instead, PMC asks the model the sensitive question, samples multiple responses from the model itself, selects a response that is less like the model’s original answer, and fine-tunes on that self-generated response while also training on retain data to preserve general utility [1].

The business interpretation is not “privacy solved”. It is narrower and more useful: PMC offers a route for output-level remediation when the goal is to stop a model from producing specific information without retraining from scratch and without repeatedly feeding verified sensitive targets back into the optimisation loop.

The evidence is strongest on TOFU, a fictitious question-answering unlearning benchmark. Across Phi-1.5, Llama-3.2-3B-Instruct, and Gemma-3-12B-it, PMC improves the trade-off between unlearn quality and utility compared with GA, GD, DPO, NPO, SimNPO, and the fixed-refusal “I don’t know” baseline. The paper also tests leakage under sampling and prefilling attacks, where PMC looks more robust than prior methods.

The operator’s caution is just as important. PMC removes information from outputs under the tested evaluation regimes. It does not prove that the model’s weights are clean, that all adversarial attacks fail, or that a compliance team can replace documentation with vibes. The method is promising precisely because it is practical, not because it is magic with a loss function.

The easiest way to forget is not to stare harder at the secret

Imagine a model that can answer a question it should not answer. Perhaps the answer contains personal information. Perhaps it reproduces copyrighted text. Perhaps it reveals a customer-specific detail that should never have survived the training pipeline.

A conventional unlearning workflow often does something like this: take the unwanted answer, use it in the unlearning objective, and push the model away from it. Gradient ascent reduces the likelihood of the target sequence. Preference methods can treat the sensitive answer as a negative example. Refusal fine-tuning can teach the model to answer with a fixed phrase such as “I don’t know”.

That approach has a surface-level logic. It also has a design smell. If the information is sensitive enough to remove, why is the unlearning procedure repeatedly optimising on it?

PMC starts from that discomfort. Its core move is simple but not obvious: do not train against the ground-truth forget answer. Train on the model’s own alternative generations until the output distribution for the forget query drifts away from the original answer.

In other words, collapse is usually the disease. Here it becomes the treatment.

Model collapse becomes useful only when it is conditional

Model collapse normally refers to a failure mode: generative models trained repeatedly on their own outputs can lose distributional richness. Diversity shrinks. Rare modes disappear. The system becomes less faithful to the original data-generating process. This is generally bad news for model quality.

PMC does not celebrate collapse everywhere. That would be a fast route to a useless model, which is technically a privacy-preserving system in the same way that a brick is a secure laptop.

The paper’s idea is conditional collapse. Trigger collapse only for the questions whose answers should be forgotten. Preserve behaviour on retain queries.

The simplified mechanism looks like this:

  1. Keep a retain set of normal question-answer pairs.
  2. Keep a forget set of questions, but not their ground-truth answers.
  3. For each forget question, sample several answers from the current model.
  4. Score those samples with a reward function that prefers movement away from the original model output.
  5. Fine-tune the model on the selected self-generated answer.
  6. Also fine-tune on retain examples, controlled by a utility-retention parameter.
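Steps 3 and 4 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a best-of-n argmax selection and a reward of one minus a whitespace-tokenised ROUGE-L F1 against the original answer. The names `select_training_target`, `toy_sample_fn`, and the example strings are hypothetical, and a real run would sample from the model being unlearned rather than from a fixed pool.

```python
from itertools import cycle

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L style F1 on lowercased whitespace tokens (a simplification)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r) if c and r else 0
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def reward(candidate, original_answer):
    """Higher when the candidate is less like the original model answer."""
    return 1.0 - rouge_l_f1(candidate, original_answer)

def select_training_target(question, original_answer, sample_fn, num_samples=5):
    """Sample several answers and keep the one with the highest reward."""
    candidates = [sample_fn(question) for _ in range(num_samples)]
    return max(candidates, key=lambda c: reward(c, original_answer))

# Hypothetical stand-in for sampling from the current model.
_pool = cycle([
    "The author was born in Lagos in 1975.",
    "The author was born in 1975.",
    "I am not sure about that author.",
])

def toy_sample_fn(question):
    return next(_pool)

original = "The author was born in Lagos in 1975."
target = select_training_target(
    "Where was the author born?", original, toy_sample_fn, num_samples=3
)
print(target)
```

The selected string becomes the fine-tuning target for that forget question in the next update. Note that only the forget question and the model's original output appear here; no ground-truth forget answer is passed in.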

The practical training loss combines two pressures:

$$ \text{loss} = \lambda \cdot \text{retain loss} + \text{forget-query self-generation loss} $$

The retain term keeps the model useful. The forget term nudges the model’s own output distribution away from the answer it used to give. The balance parameter $\lambda$ matters because too much forgetting can damage utility, while too much retention can leave the forbidden behaviour intact.
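To make the role of $\lambda$ concrete, here is a small sketch of the combined objective. It assumes both terms are mean negative log-likelihoods computed from per-token log-probabilities; the function names and the numbers are hypothetical, and the paper's exact normalisation may differ.

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood of one sequence."""
    return -sum(token_logprobs) / len(token_logprobs)

def pmc_style_loss(retain_logprobs, forget_sample_logprobs, lam):
    """lam * retain loss + loss on the selected self-generated answers.

    retain_logprobs: per-token log-probs of ground-truth retain answers.
    forget_sample_logprobs: per-token log-probs of the selected
        self-generated answers for forget questions.
    """
    retain_loss = sum(mean_nll(s) for s in retain_logprobs) / len(retain_logprobs)
    forget_loss = sum(mean_nll(s) for s in forget_sample_logprobs) / len(forget_sample_logprobs)
    return lam * retain_loss + forget_loss

# Hypothetical per-token log-probs for one retain answer and one selected sample.
retain_batch = [[-0.1, -0.2, -0.3]]
forget_batch = [[-1.0, -2.0]]

for lam in (0.5, 1.0, 5.0):
    print(lam, round(pmc_style_loss(retain_batch, forget_batch, lam), 3))
```

Raising `lam` makes the retain term dominate the gradient, which is the mechanism behind the utility-versus-forgetting trade-off described above.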

The important distinction is that PMC does not say: “make this answer impossible.” It says: “for this question, repeatedly move toward self-generated alternatives that score better under the unlearning reward.”

That is a different kind of intervention. It is less like deleting a file and more like redirecting a habit.

The privacy trick is reducing dependence on the target answer

The paper’s privacy argument is not that PMC never touches sensitive content. That would be too clean, and suspiciously convenient.

In the experiments, PMC uses a reward based on dissimilarity from the model’s original greedy answer. If the original model answer contains sensitive information, that original answer is itself sensitive. The method therefore does not eliminate all sensitive-data handling.

What it changes is the role of the sensitive target in optimisation. Existing target-dependent methods may keep applying gradient updates against the fixed answer throughout unlearning. PMC instead samples from the current model and fine-tunes on selected self-generated responses. As the model drifts away from the sensitive answer, later optimisation increasingly uses non-sensitive alternatives.

That distinction matters in deployment. In many real systems, the organisation may not have clean access to the original training samples that caused the leak. It may only know that a particular prompt produces a problematic output. PMC fits that scenario better than methods requiring verified ground-truth forget sequences.

| Design choice | Conventional target-dependent unlearning | Partial Model Collapse |
| --- | --- | --- |
| Needs ground-truth forget answers | Often yes | No |
| Uses sensitive target directly in loss | Often yes | Avoids ground-truth forget targets |
| Main mechanism | Suppress or penalise fixed answer | Shift output distribution through self-generated alternatives |
| Likely operational risk | Reinforcement or leakage through target dependence | Reward design, sampling cost, residual output leakage |
| Best read as | Direct correction | Distributional remediation |

The subtlety is worth keeping. PMC is not a clean-room deletion certificate. It is a way to perform remediation with less reliance on the very data that caused the remediation problem.

The theorem says the collapse can converge, not that compliance is done

The paper supports PMC with a theoretical argument before moving to LLM experiments. The warm-up starts with categorical distributions. If a distribution is repeatedly refit using retained categories plus self-generated samples, probability mass for non-retained categories can vanish over iterations. This is the “partial” in partial collapse: the system does not collapse into one useless point; the unwanted part fades while retained structure anchors the process.
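The categorical warm-up is easy to simulate. The sketch below is a toy version under stated assumptions, not the paper's exact construction: four categories, a fixed retain set covering only `A` and `B`, and a maximum-likelihood refit on retain counts plus fresh samples from the current distribution. All names and sizes are hypothetical.

```python
import random
from collections import Counter

def refit_with_self_samples(probs, retain_counts, num_self_samples, rng):
    """One iteration: sample from the current model, add retain data, refit by frequency."""
    categories = list(probs)
    draws = rng.choices(
        categories, weights=[probs[c] for c in categories], k=num_self_samples
    )
    counts = Counter(draws)
    for category, n in retain_counts.items():
        counts[category] += n
    total = sum(counts.values())
    return {c: counts[c] / total for c in categories}

rng = random.Random(0)
probs = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
retain_counts = {"A": 100, "B": 100}  # "C" and "D" play the forget categories

initial_forget_mass = probs["C"] + probs["D"]
for _ in range(30):
    probs = refit_with_self_samples(probs, retain_counts, num_self_samples=200, rng=rng)
final_forget_mass = probs["C"] + probs["D"]

print(initial_forget_mass, final_forget_mass)
```

With 200 retain examples and 200 self-samples per round, the forget categories lose roughly half their expected mass each iteration, because only the self-generated half of the refit data can still contain them. The retained categories never vanish, since the retain counts are re-added every round.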

The more relevant result for question-answering uses a preference-guided self-generation process. For forget queries, the model samples candidate responses. A Bradley-Terry-style preference model selects higher-reward responses. Under idealised assumptions—no statistical approximation error, no function approximation error, and non-zero initial probability of maximum-reward responses—the expected reward converges to the maximum reward and the variance vanishes.

Translated out of theorem voice: if better alternatives are already somewhere in the model’s output distribution, and the selection process consistently favours them, repeated relearning concentrates the model on those alternatives.

That is useful, but the boundary is sharp. The theorem does not prove that a deployed transformer has erased a memory from its weights. It proves convergence for an idealised distributional process. The empirical question is whether real LLM fine-tuning behaves enough like the theory to be operationally useful.

That is where the experiments matter.

The main experiment is a Pareto test, not a leaderboard

The paper evaluates PMC primarily on TOFU, a fictitious dataset of 4,000 question-answering pairs designed for unlearning research. The authors fine-tune models on the full dataset and then unlearn the “forget10” split, which is the largest forget set and therefore the hardest TOFU split used in the paper.

The models are Phi-1.5, Llama-3.2-3B-Instruct, and Gemma-3-12B-it. The baselines include Gradient Ascent, Gradient Difference, DPO, NPO, SimNPO, and the fixed-refusal IDK method. For fairness, the paper runs broad hyperparameter searches: 100 configurations per method, repeated across five random seeds for most settings. Gemma is treated more selectively because of cost; the paper reports only one run per experiment there and omits DPO/NPO because their reference-model dependence limits scalability.

The headline result is not “PMC wins a score.” The better interpretation is: PMC shifts the Pareto frontier.

The paper measures unlearn quality using ROUGE-L overlap on forget and paraphrased-forget outputs, transformed so higher is better. It measures utility using ROUGE-L on retain examples plus world-facts and real-authors evaluations, normalised for readability.
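With both axes oriented so that higher is better, a Pareto comparison reduces to a dominance check. The sketch below uses made-up run names and scores purely for illustration; none of the numbers come from the paper.

```python
def pareto_front(points):
    """Return the names of runs that no other run dominates.

    points maps a run name to (unlearn_quality, utility), higher is better for both.
    A run is dominated if another run is at least as good on both axes
    and strictly better on at least one.
    """
    front = []
    for name, (quality, utility) in points.items():
        dominated = any(
            q2 >= quality and u2 >= utility and (q2 > quality or u2 > utility)
            for other, (q2, u2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical (unlearn_quality, utility) scores for four hyperparameter runs.
runs = {
    "run_a": (0.90, 0.40),
    "run_b": (0.60, 0.80),
    "run_c": (0.50, 0.70),
    "run_d": (0.80, 0.60),
}

front = pareto_front(runs)
print(front)
```

Here `run_c` drops out because `run_b` beats it on both axes. A method "shifts the frontier" when its runs push this non-dominated set outward relative to the other methods' runs, which is a stronger statement than winning on a single averaged score.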

PMC dominates the baselines in the utility-unlearning plots across all three models. Existing methods can unlearn, but they tend to pay for it by damaging utility or failing to move far enough from the fine-tuned model. PMC’s advantage is that it changes the distribution from within: it fine-tunes on responses the model is already likely to produce rather than forcing it away from a fixed string.

That is why the result is business-relevant. Operators rarely need the theoretically strongest possible forgetting if it wrecks the product. They need a remediation method that reduces leakage while preserving enough capability to keep the system usable.

Fixed refusals can hide knowledge rather than remove it

The IDK baseline is operationally tempting. It teaches the model to answer sensitive questions with “I don’t know”. For a demo, this looks excellent. Ask the question. Get refusal. Tick the governance box. Go home early.

The paper shows why that can be misleading.

To test robustness under sampling, the authors draw 100 answers from each unlearned model for each forget question and compute the worst-case ROUGE-L overlap with the ground-truth answer. This matters because greedy decoding can hide low-probability leaks. A model that refuses once may still leak when sampled repeatedly.
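A worst-case check of this kind is short to write. The sketch below is an assumption-laden stand-in: it uses `difflib.SequenceMatcher` on whitespace tokens as the overlap score rather than ROUGE-L, and the sampled answers are a hand-written list instead of real model output.

```python
from difflib import SequenceMatcher

def overlap(candidate, reference):
    """Token-level similarity in [0, 1]; a stand-in for ROUGE-L."""
    return SequenceMatcher(
        None, candidate.lower().split(), reference.lower().split()
    ).ratio()

def worst_case_leakage(samples, ground_truth):
    """Highest overlap with the ground truth across all sampled answers."""
    return max(overlap(s, ground_truth) for s in samples)

ground_truth = "The author was born in Lagos."
greedy_answer = "I don't know."

# Hypothetical batch of 100 sampled answers: 99 refusals and one leak.
samples = ["I don't know."] * 99 + ["The author was born in Lagos."]

greedy_score = overlap(greedy_answer, ground_truth)
sampled_score = worst_case_leakage(samples, ground_truth)
print(greedy_score, sampled_score)
```

The greedy score alone reports no leak, while the worst case over the batch exposes the single leaking sample. That gap is the reason to report the maximum rather than the mean.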

PMC substantially reduces leakage under this sampling test. IDK also looks better under sampling than some baselines, but the paper then performs a prefilling attack: ask the forget question and force the model to continue from a prefix such as “The answer is:”.

That bypasses the learned refusal surface. Under prefilling, IDK leaks substantially more. The model has learned to say the correct social sentence, not necessarily to lose the underlying answer. Very corporate.
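A prefilling probe can be expressed as a small harness around any generation function. Everything below is hypothetical: `toy_refusal_model` is a hand-written stand-in that mimics a refusal-tuned model, and the `assistant_prefix` parameter is an assumed interface rather than a real library's API.

```python
def toy_refusal_model(question, assistant_prefix=""):
    """Hypothetical model: refuses by default, completes a pre-filled answer."""
    if assistant_prefix:
        return assistant_prefix + " a marine biologist."
    return "I don't know."

def prefill_probe(generate, question, secret_fragment, prefixes=("", "The answer is:")):
    """For each forced prefix, report whether the output contains the secret."""
    results = {}
    for prefix in prefixes:
        output = generate(question, assistant_prefix=prefix)
        results[prefix] = secret_fragment.lower() in output.lower()
    return results

leaks = prefill_probe(
    toy_refusal_model,
    question="What is the author's profession?",
    secret_fragment="marine biologist",
)
print(leaks)
```

With a real model, `generate` would place the prefix at the start of the assistant turn before decoding. The harness itself stays the same, which makes it cheap to add further prefixes to the probe set.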

PMC performs better across both sampling and prefilling settings because the method changes the output distribution more broadly rather than simply installing a refusal template on top.

The practical lesson is blunt: refusal behaviour is not the same as unlearning. A compliance evaluation that only checks the default answer can be fooled by a model that has learned a stage performance.

Target-dependent unlearning can leak by making the answer too unlikely

One of the paper’s most interesting sections is not the headline Pareto plot. It is the analysis of side effects in target-dependent unlearning.

The authors examine what happens when methods such as NPO suppress target tokens. The intended effect is contextual: if asked about a specific fictional author’s profession, the model should not reveal the sensitive answer. The unintended effect is broader: the method may reduce probabilities for the same tokens in unrelated contexts.

The paper tests this using tokens from TOFU forget answers that also appear in WikiText. NPO substantially reduces the probability of generating those tokens even outside the forget context. PMC does not show the same systematic suppression. Its probability differences are centred much closer to zero, while NPO’s distribution skews negative.

That is not just a quality issue. It creates a leakage channel.

The paper constructs a multiple-choice dataset from 84 TOFU forget questions. It then scores answer options using inverse perplexity. The leakage hypothesis is deliciously perverse: if target-dependent unlearning makes the correct answer unusually unlikely, an adversary can recover the answer by choosing the option the model dislikes most.

For NPO, this pattern appears. Correct answers often become the least likely option, especially when the minimum probability is very low. PMC does not show that same pattern.

This flips the intuition. Suppression can reveal. If the model flinches dramatically at the right answer, the flinch is information.
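The attack itself is only a few lines. The sketch below assumes inverse perplexity is computed as the exponential of the mean token log-probability, and that the adversary simply picks the lowest-scoring option. The log-probabilities are invented to mimic an over-suppressed model and are not results from the paper.

```python
import math

def inverse_perplexity(token_logprobs):
    """exp(mean log-prob), i.e. 1 / perplexity of the option's tokens."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def least_likely_choice(option_logprobs):
    """Adversary's guess: the option the model finds least likely."""
    scores = {opt: inverse_perplexity(lp) for opt, lp in option_logprobs.items()}
    return min(scores, key=scores.get)

def attack_accuracy(questions):
    """Fraction of questions where the least likely option is the correct one."""
    hits = sum(least_likely_choice(q["options"]) == q["correct"] for q in questions)
    return hits / len(questions)

# Hypothetical per-token log-probs from a model that over-suppresses true answers.
suppressed_model = [
    {"correct": "B", "options": {
        "A": [-1.0, -1.2], "B": [-6.0, -7.5], "C": [-1.5, -0.9], "D": [-2.0, -1.1]}},
    {"correct": "D", "options": {
        "A": [-0.8, -1.0], "B": [-1.3, -1.6], "C": [-1.1, -0.7], "D": [-5.5, -8.0]}},
]

accuracy = attack_accuracy(suppressed_model)
chance = 1 / 4
print(accuracy, chance)
```

An accuracy well above the one-in-four chance level on a real forget set would indicate that suppression has become a recoverable signal. An accuracy near chance is the behaviour one would hope to see after unlearning.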

The appendix tests robustness, not a second thesis

The paper’s appendix is doing several different jobs. Treating every appendix figure as equal would blur the argument, so it is better to separate purpose from evidence.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Figure 3 / Figure 7 Pareto plots | Main evidence | PMC improves utility-unlearning trade-offs on TOFU across three models | General unlearning guarantees |
| Figure 4 sampling and prefilling | Robustness test | PMC leaks less under tested attack formats | Resistance to all adversarial prompting |
| Figure 5 side effects | Diagnostic comparison with prior work | Target-dependent methods can distort token probabilities and leak through least-likely-choice behaviour | That all target-dependent methods fail in all settings |
| Figure 6 epoch/sample/$\lambda$ ablations | Ablation | PMC behaviour depends on training duration, sampling diversity, and retention weight | One universal hyperparameter recipe |
| Appendix A.1 sampling settings | Sensitivity test | Temperature and top-p affect the unlearning-utility trade-off | That more diversity is always better |
| Appendix A.2 runtime | Implementation practicality | PMC is costlier early but remains competitive in tested runs | Production cost across all model sizes |
| Appendix A.7 self-BLEU reward | Exploratory reward variant | Alternative rewards may improve utility and still produce semantic unlearning | A fully validated reward strategy |
| Appendix A.8 ARC/MMLU utility | Extended utility check | PMC has minor impact on general benchmarks in tested models | No hidden capability degradation |
| Appendix A.10 MUSE-news | Exploratory extension | PMC can be adapted beyond Q&A | Strong conclusions beyond TOFU |

The ablations are especially useful for operators because they show what the method is sensitive to.

More epochs can improve unlearn quality for PMC beyond the point where some baselines have already plateaued. More samples improve the chance of finding a better self-generated alternative, but larger sample sizes can increase variance in utility. Larger $\lambda$ improves utility retention but can weaken unlearning. Higher sampling temperature and top-p can strengthen unlearning by increasing diversity, but too much diversity can damage utility.

This is not a plug-and-play knob. It is a control system.

The business value is remediation without replaying the liability

PMC should interest three groups inside an AI-operating organisation.

The first is privacy and compliance. When a model produces an output that must be removed, retraining from scratch is usually infeasible. Conventional unlearning can require the sensitive sequence. PMC offers a narrower intervention: use the problematic question and the model’s own outputs to push the behaviour away from the leak.

The second is product operations. A model that refuses everything is safe in the same way a closed store has no customer complaints. PMC’s Pareto advantage matters because customer-facing systems need selective remediation, not a bonfire of capability.

The third is security evaluation. The paper makes a strong case that default-response testing is inadequate. Sampling, prefilling, multiple-choice probability analysis, and unrelated-context token checks reveal different failure modes. If a vendor claims it has “unlearned” something because the model now says “I don’t know”, the next question should be: under which decoding settings, prefixes, and probability probes?

A practical deployment workflow inspired by PMC would look something like this:

| Operational step | PMC-informed interpretation |
| --- | --- |
| Identify problematic prompt families | Define forget questions, not necessarily ground-truth forget answers |
| Generate baseline outputs | Record what the model currently produces and what must stop appearing |
| Run collapse-based remediation | Fine-tune on selected self-generated alternatives plus retain data |
| Evaluate beyond greedy decoding | Test sampling, prefilling, paraphrases, and multiple-choice leakage |
| Monitor utility | Track retain tasks and unrelated benchmark behaviour |
| Document residual risk | State that the claim is output suppression under tested regimes, not weight-level deletion |

That last row is where many governance programmes become allergic. It is also where the credibility lives.

PMC is not a universal delete key

The paper is careful about its boundaries, and operators should be even more so.

First, PMC focuses on removing information from model outputs. That is a valid and measurable goal, but it is not the same as proving that the model behaves exactly like one never trained on the data. The authors explicitly note that current evaluation protocols are not sufficient to establish that stronger distributional claim.

Second, the method depends on the model assigning non-zero probability to better alternatives. If the model’s output distribution for a forget query had already collapsed completely onto the sensitive answer, PMC would have no useful alternative to amplify. The authors do not observe this as a practical issue in their LLM experiments, but it remains a theoretical boundary.

Third, reward design determines post-unlearning behaviour. The paper’s main experiments use a simple ROUGE-L-based reward that favours divergence from the original greedy output. That can produce refusals, but it can also produce hallucinations or gibberish. In a production system, the reward should probably include coherence, harmlessness, and domain-specific refusal behaviour. Otherwise, the model may forget the secret by becoming weird, which is not usually a product requirement.

Fourth, cost is real. PMC samples multiple candidate responses during unlearning, so it can be more expensive than some baselines. The paper’s runtime analysis finds the method practical in its tested setup, with Phi-1.5 completing within about 40 minutes, Llama-3.2-3B-Instruct around 20 minutes, and Gemma-3-12B-it around 200 minutes on H200 hardware. Those numbers are useful, but they are not a universal procurement guide.

Fifth, the MUSE-news experiment is only a pilot. The paper extends PMC beyond Q&A by applying it to a reduced news-text unlearning setup, but the authors explicitly avoid drawing robust conclusions because they consider the evaluation insufficient. That restraint is welcome. It is also a reminder that unlearning benchmarks are still less mature than the dashboards built on top of them.

The real misconception is that “I don’t know” means “I forgot”

The most dangerous misconception in LLM unlearning is not that model collapse is always bad. That is merely incomplete.

The more operationally dangerous misconception is that a clean refusal equals successful unlearning.

A model can refuse in the default path while leaking under sampling. It can refuse until prefilling nudges it back into answer-completion mode. It can suppress the correct answer so aggressively that the suppression itself becomes a signal. It can pass a simple benchmark while distorting unrelated token probabilities.

PMC matters because it attacks this surface-level theatre. It tries to move the model’s conditional output distribution, not just dress the old distribution in a polite refusal suit.

That does not make it a final answer to machine unlearning. It makes it a better question for the field: what should count as forgetting when LLMs can leak through generation, probability, paraphrase, and attack-conditioned behaviour?

Collapse is useful when it is aimed, measured, and contained

The paper’s most useful contribution is conceptual. It takes model collapse, a phenomenon usually framed as degradation, and turns it into a targeted unlearning mechanism. That reversal is elegant, but the operational value comes from the details: conditional application, retain-set anchoring, preference-guided self-generation, and robustness evaluation beyond greedy answers.

For businesses, the takeaway is not to rush PMC into every model governance workflow tomorrow morning. The takeaway is to update the mental model of unlearning.

Deleting model behaviour is not only about pushing down a forbidden string. Sometimes the safer path is to let the model walk away from its own answer, one self-generated alternative at a time.

That is less dramatic than “right to be forgotten, solved”. It is also much closer to engineering.

Cognaptus: Automate the Present, Incubate the Future.


[1] Yan Scholten, Sophie Xhonneux, Leo Schwinn, and Stephan Günnemann, “Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs,” arXiv:2507.04219.