Auditing the Illusion of Forgetting: When Unlearning Isn’t Enough

Deletion requests sound simple until the model answers politely.

A user asks for data to be removed. A publisher demands that copyrighted passages stop being reproduced. A compliance team wants evidence that a fine-tuned model no longer carries traces of a forbidden dataset. The model is run through an unlearning method, the surface tests improve, the dashboard turns less red, and everyone enjoys the brief spiritual comfort of a green checkmark.

The awkward question is whether the model forgot—or merely learned not to say the incriminating thing out loud.

That is the problem examined in Auditing Language Model Unlearning via Information Decomposition by Anmol Goel, Alan Ritter, and Iryna Gurevych.¹ The paper’s central claim is not just that current LLM unlearning is imperfect. That would be useful, but not especially surprising. The sharper claim is that standard audits can certify the wrong thing. A model may pass output-level forgetting tests while information about the supposedly forgotten data remains linearly decodable from its internal representations.

In other words: the model may have stopped confessing, but the evidence is still in the files.

The paper calls this shallow unlearning. The useful contribution is that the authors do not stop at naming the problem. They propose a representation-level audit based on Partial Information Decomposition (PID), separating what was actually erased from what remains shared between the original and unlearned models. That residual shared information then becomes more than a diagnostic curiosity: it correlates with vulnerability to post-unlearning attacks and can support an inference-time abstention mechanism.

For business readers, the practical lesson is uncomfortable but clean. Unlearning should not be treated as a behavior patch. It should be treated as an audit problem.

The dangerous misconception: clean outputs mean clean memory

The reader misconception this paper attacks is very plausible: if an unlearned model stops generating forgotten content, or if its Forget Quality score improves, then the data has been removed.

That belief is attractive because it fits how software compliance is often managed. You define a test, run the system, compare outputs, and archive the result. This works reasonably well when the system’s behavior is the system. A database either returns a deleted row or it does not. A website either exposes a file or it does not.

LLMs are not so polite.

A language model can change its output behavior without removing the internal features that encode information about the target data. It can refuse, paraphrase, dodge, or become less helpful on certain prompts while still retaining a representation that makes membership information recoverable by a separate probe. That distinction matters because many unlearning evaluations are still dominated by behavior-facing tests.

The paper contrasts three audit levels:

Audit level	What it observes	What it may miss
Output-level audit	Whether the model still produces forgotten content or behaves like a retain-only model	Internal traces that no longer appear in ordinary generations
Attack-level audit	Whether a membership inference attack succeeds	Leakage missed by a weak or poorly configured attack
Representation-level audit	Whether internal activations still encode membership information	Requires deeper model access and careful estimator design

The paper’s argument begins with a simple probing experiment. Take representations from the residual stream of a supposedly unlearned LLM. Train a linear classifier to predict whether an input belonged to the forget set. If the classifier performs well, membership information is still present in the representations.

This is not a fancy adversarial jailbreak. It is a linear probe. The point is almost impolite: if a simple classifier can still detect membership, the privacy story is not exactly Fort Knox.
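The probing recipe above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the "activations" are synthetic Gaussian vectors standing in for residual-stream states, and every name here (`synthetic_activations`, `train_probe`, `run_demo`, the `shift` parameter) is hypothetical. A real audit would extract hidden states from the unlearned model and would likely use an off-the-shelf logistic regression.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def synthetic_activations(n, dim, shift, rng):
    # Gaussian stand-ins for hidden states (assumption: NOT real model data).
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]

def train_probe(xs, ys, epochs=100, lr=0.1):
    # Linear probe: logistic regression fitted with batch gradient descent.
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            grad_w = [g + err * xj for g, xj in zip(grad_w, x)]
            grad_b += err
        w = [wj - lr * g / len(xs) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b

def probe_scores(w, b, xs):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in xs]

def auroc(scores, labels):
    # Probability that a random forget example outscores a random retain example.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def run_demo(seed=0, shift=0.8, dim=8):
    # shift > 0 simulates a membership trace that survived unlearning.
    rng = random.Random(seed)
    forget = synthetic_activations(200, dim, shift, rng)
    retain = synthetic_activations(200, dim, 0.0, rng)
    train_x, train_y = forget[:100] + retain[:100], [1] * 100 + [0] * 100
    test_x, test_y = forget[100:] + retain[100:], [1] * 100 + [0] * 100
    w, b = train_probe(train_x, train_y)
    return auroc(probe_scores(w, b, test_x), test_y)

print(f"held-out probe AUROC: {run_demo():.2f}")
```

Scoring on held-out examples matters here: a probe that only separates its own training set shows memorization by the probe, not information in the representation.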

In the paper’s Figure 2, probe AUROC scores on TOFU remain well above chance across layers and unlearning algorithms, including Gradient Ascent, Gradient Difference, SimNPO, and RMU. The layer-by-layer pattern matters because the leakage is not only a final-output artifact. It appears inside the representational pipeline. The authors connect this probe evidence to mutual information through Fano’s inequality: low prediction error implies non-trivial information between representations and membership labels.
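To make the Fano step concrete, here is a worked bound under two assumptions that are not taken from the paper: the membership label is balanced, so $H(Y) = 1$ bit, and the probe's error rate $P_e$ is an illustrative 10%. The paper reports AUROC rather than error rates, so this shows the shape of the argument rather than reproducing its numbers. Writing $Z$ for the probed representation and $H_b$ for the binary entropy function, Fano's inequality for a binary label gives:

$$ H(Y \mid Z) \le H_b(P_e) \quad\Longrightarrow\quad I(Y; Z) = H(Y) - H(Y \mid Z) \ge 1 - H_b(P_e) $$

$$ H_b(0.1) = -0.1 \log_2 0.1 - 0.9 \log_2 0.9 \approx 0.469 \quad\Longrightarrow\quad I(Y; Z) \ge 0.531 \text{ bits} $$

A probe at chance ($P_e = 0.5$) gives $H_b = 1$ and a bound of zero, which is why only above-chance probes carry audit weight.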

The result is the conceptual entry point for the whole paper. The model’s public behavior is not enough. The audit must ask what information remains inside.

Why mutual information alone is too blunt

Once we accept that internal representations matter, the next temptation is to measure mutual information between a model representation and the membership label. Let $Y$ denote whether an input belongs to the forget set, $B$ the base model representation, and $U$ the unlearned model representation. Then $I(Y;B)$ and $I(Y;U)$ tell us how much information each representation contains about membership.

Useful, but incomplete.

The central business question is not only whether the unlearned model contains membership information. It is whether the unlearning process removed information that was previously present. For that, we need to compare the base and unlearned representations.

The paper uses Partial Information Decomposition to break the joint information of $B$ and $U$ about $Y$ into interpretable components:

$$ I(Y; B, U) = I^B_{\text{uniq}} + I^U_{\text{uniq}} + I_{\cap} + I_{\text{syn}} $$

The terms have specific audit meanings:

PID component	Technical meaning	Meaning for unlearning audit
$I^B_{\text{uniq}}$	Information about $Y$ uniquely present in the base model	Information successfully erased by unlearning
$I^U_{\text{uniq}}$	Information uniquely present in the unlearned model	Information newly or differently encoded after unlearning
$I_{\cap}$	Information about $Y$ shared by base and unlearned models	Residual knowledge that survived unlearning
$I_{\text{syn}}$	Information available only from $B$ and $U$ jointly	Joint structure not directly attributable to either model alone

The paper’s terminology is operationally helpful. Unlearned Knowledge is $I^B_{\text{uniq}}$: information present in the base representation but absent from the unlearned one. Residual Knowledge is $I_{\cap}$: information shared across both representations after unlearning.

This distinction is the article’s main mechanism. Without it, all audit results collapse into a crude before-and-after comparison. With it, the auditor can say something more precise:

The model erased this much. It retained this much. The retained part is the problem.
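The bookkeeping behind that statement can be sketched as follows. It relies on the standard PID consistency identities, which the article does not spell out: $I(Y;B) = I^B_{\text{uniq}} + I_{\cap}$ and $I(Y;U) = I^U_{\text{uniq}} + I_{\cap}$, together with the four-term decomposition above. The function name, the dataclass, and the input values are hypothetical; in practice each input would itself be an estimate.

```python
from dataclasses import dataclass

@dataclass
class PIDAtoms:
    unique_base: float       # "Unlearned Knowledge": in B, not in U
    unique_unlearned: float  # newly or differently encoded in U
    redundant: float         # "Residual Knowledge": shared by B and U
    synergistic: float       # only available from B and U jointly

def pid_atoms(i_base, i_unlearned, i_joint, redundancy):
    """Recover the four atoms from I(Y;B), I(Y;U), I(Y;B,U) and a redundancy value."""
    unique_base = i_base - redundancy
    unique_unlearned = i_unlearned - redundancy
    synergistic = i_joint - unique_base - unique_unlearned - redundancy
    return PIDAtoms(unique_base, unique_unlearned, redundancy, synergistic)

# Illustrative inputs in bits (made up for this sketch, not from the paper).
atoms = pid_atoms(i_base=0.9, i_unlearned=0.5, i_joint=1.0, redundancy=0.4)
print(atoms)
```

Note the direction of error: if the redundancy value is only a lower bound, the unique terms computed this way are upper bounds.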

RINE turns the audit into something measurable

PID is elegant, but elegance alone does not audit a production model. The hard part is estimating redundancy in high-dimensional continuous representations.

The paper uses Redundant Information Neural Estimation (RINE). In practice, the authors train two decoders—implemented as logistic regression probes—on the base and unlearned model representations. Each decoder predicts the membership label. The redundancy estimate is based on the idea that information is residual when both decoders can extract the same membership-relevant signal while agreeing in their predictions.

The important audit detail is that RINE gives a lower bound on true redundant information. That sounds like a limitation, and mathematically it is. Operationally, it can be useful. If a lower bound already shows high residual knowledge, the model fails the audit without needing a perfect estimate of everything it retained.

This is the same logic as finding a leak in a pressure test. You do not need to know the exact total water loss to reject the pipe. A visible leak is enough.
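A rough sketch of the "two decoders that must agree" idea follows. It is a simplified illustration in the spirit of the description above, not the paper's estimator: it scores already-computed decoder probabilities instead of training the decoders jointly, and the penalty form (mean absolute difference) and the `beta` weight are assumptions made for this example.

```python
import math

def binary_entropy_bits(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def cross_entropy_bits(probs, labels, eps=1e-12):
    # Average negative log-likelihood of the true membership label, in bits.
    total = 0.0
    for p, y in zip(probs, labels):
        q = p if y == 1 else 1.0 - p
        total -= math.log2(max(q, eps))
    return total / len(labels)

def disagreement(p_base, p_unlearned):
    # Assumed penalty: mean absolute gap between the two decoders' outputs.
    return sum(abs(a - b) for a, b in zip(p_base, p_unlearned)) / len(p_base)

def redundancy_estimate_bits(p_base, p_unlearned, labels, beta=1.0):
    """Label entropy minus the averaged decoder loss, penalized for disagreement."""
    h_y = binary_entropy_bits(sum(labels) / len(labels))
    loss = 0.5 * (cross_entropy_bits(p_base, labels)
                  + cross_entropy_bits(p_unlearned, labels))
    loss += beta * disagreement(p_base, p_unlearned)
    return max(0.0, h_y - loss)

labels = [1, 1, 0, 0]
shared = redundancy_estimate_bits([0.99, 0.99, 0.01, 0.01],
                                  [0.99, 0.99, 0.01, 0.01], labels)
erased = redundancy_estimate_bits([0.99, 0.99, 0.01, 0.01],
                                  [0.5, 0.5, 0.5, 0.5], labels)
print(f"both decoders agree: {shared:.3f} bits; unlearned at chance: {erased:.3f} bits")
```

The estimate is high only when both decoders predict membership well and agree with each other. Any slack in either decoder pushes the number down, so imperfect decoders understate rather than overstate what is shared.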

The authors also connect the framework to Blackwell sufficiency, using it to justify residual knowledge as decision-relevant information that remains available after unlearning. For most business readers, the key intuition is simpler: if the unlearned representation still supports the same membership decision as the base representation, the unlearning procedure has not fully removed the relevant trace.

The main evidence: Forget Quality can overstate deletion

The paper evaluates unlearning on TOFU and MUSE, using model families including LLaMA, Gemma, and Qwen. The unlearning algorithms include Gradient Ascent, Gradient Difference, NPO, SimNPO, and RMU. The core comparison reports traditional Forget Quality alongside the PID-based measures of Unlearned Knowledge and Residual Knowledge.

This is the main evidence, not an ablation. It tests whether the proposed audit reveals information that conventional metrics miss.

On TOFU, the pattern is consistent. Gradient Ascent and Gradient Difference can look respectable under Forget Quality while retaining substantial residual knowledge. RMU generally has lower residual knowledge and higher unlearned knowledge.

A few concrete numbers make the problem visible:

Model	Method	Forget Quality ↑	Unlearned Knowledge ↑	Residual Knowledge ↓
LLaMA	Gradient Ascent	0.55	0.22	0.41
LLaMA	Gradient Difference	0.62	0.35	0.32
LLaMA	RMU	0.72	0.81	0.08
LLaMA	Retrained exact baseline	1.00	—	0.002
Gemma	Gradient Ascent	0.42	0.18	0.39
Gemma	RMU	0.56	0.78	0.07
Qwen	Gradient Ascent	0.49	0.25	0.45
Qwen	RMU	0.72	0.84	0.09

The exact retraining baseline is important. It is not just decorative. In the paper, the retrained model shows near-zero residual knowledge—0.002 bits for LLaMA and Qwen, 0.003 for Gemma. This gives the audit a sanity check: when the data really is excluded from training, the residual signal nearly vanishes.

That contrast is the uncomfortable evidence. Approximate unlearning methods may improve behavior while leaving a measurable representational residue. Retraining gets close to the ideal but is usually too expensive for large LLMs. Approximate methods are practical, but this paper shows why “practical” should not be mistaken for “complete.”
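One way to turn the table above into an audit gate is sketched below. The rows are the Forget Quality and Residual Knowledge values quoted in this article; the 0.10 limit and the verdict wording are hypothetical choices for illustration, not thresholds proposed by the paper.

```python
# (model, method, forget_quality, residual_knowledge) as quoted in the table above.
ROWS = [
    ("LLaMA", "Gradient Ascent", 0.55, 0.41),
    ("LLaMA", "Gradient Difference", 0.62, 0.32),
    ("LLaMA", "RMU", 0.72, 0.08),
    ("Gemma", "Gradient Ascent", 0.42, 0.39),
    ("Gemma", "RMU", 0.56, 0.07),
    ("Qwen", "Gradient Ascent", 0.49, 0.45),
    ("Qwen", "RMU", 0.72, 0.09),
]

RESIDUAL_LIMIT = 0.10  # hypothetical audit threshold, not from the paper

def residual_verdict(residual, limit=RESIDUAL_LIMIT):
    # The estimate is a lower bound: exceeding the limit is conclusive,
    # staying under it only means "nothing flagged by this check".
    return "FAIL" if residual > limit else "not flagged"

def audit(rows, limit=RESIDUAL_LIMIT):
    return [(model, method, fq, residual_verdict(rk, limit))
            for model, method, fq, rk in rows]

for model, method, fq, verdict in audit(ROWS):
    print(f"{model:6s} {method:20s} FQ={fq:.2f} residual check: {verdict}")
```

The verdict is deliberately asymmetric. A row that is "not flagged" still needs output tests and attack tests before anyone calls the deletion complete.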

The MUSE results in the appendix play a robustness role. They repeat the audit on a different benchmark involving BBC News and Harry Potter-related knowledge. The pattern remains similar: simpler methods leave more residual knowledge; RMU tends to reduce it more strongly. This does not create a second thesis. It supports the first one across a broader unlearning setting.

The algorithm ranking is useful, but not the whole story

It would be easy to turn the results into a lazy ranking: RMU good, Gradient Ascent bad, everyone else somewhere in the middle. That is not wrong, but it undersells the paper.

The deeper result is that different metrics answer different questions.

Forget Quality asks whether the unlearned model behaves like a model trained only on retained data. Residual Knowledge asks whether membership information remains shared between the base and unlearned representations. These are related but not identical. A method can improve one while failing the other.

That matters for procurement, governance, and model risk management. A vendor saying “our unlearning method improves Forget Quality” is not equivalent to saying “we removed the internal membership trace.” The first is a behavioral statement. The second is a representational claim. Those should not be sold under the same label.

A better audit memo would separate the evidence:

Claim	Evidence needed	What the paper contributes
The model no longer outputs forgotten content	Output and task-level tests	Existing metrics such as Forget Quality
The model no longer encodes membership traces	Representation-level information audit	PID/RINE Residual Knowledge
The model is less vulnerable after unlearning	Attack correlation or red-team evidence	Residual Knowledge correlates with attack success
The model can reduce leakage at inference	Deployment-time control	Risk-score-based abstention

This is the business value of the mechanism-first view. It prevents one metric from pretending to be the whole audit.

Residual knowledge behaves like attack surface

The paper’s second major result asks whether Residual Knowledge predicts vulnerability to attacks after unlearning. This section is best read as evidence for operational relevance, not as a proof that residual knowledge explains every attack path.

The authors use attacks from prior work showing that unlearning can be reversed through additional fine-tuning on benign samples or by orthogonalizing the refusal direction. They compare the correlation between attack success rates and their residual knowledge metric against a strong membership inference baseline, Min-K%++.

The reported correlations favor the audit metric. For example, under the fine-tuning attack setting, the paper reports correlations of 0.60 for LLaMA, 0.65 for Gemma, and 0.71 for Qwen using the proposed audit metric. The corresponding MIA correlations are lower: 0.41, 0.39, and 0.51. Under orthogonalization, the proposed audit metric again shows stronger correlations than the MIA baseline.

This is not merely a nice statistical footnote. It gives Residual Knowledge an operational interpretation: models with more residual trace are more vulnerable to post-unlearning recovery.
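For teams that want to run the same kind of check on their own models, the calculation is an ordinary correlation between a per-model audit score and a per-model attack success rate. The sketch below uses Pearson correlation, which is an assumption: the article does not say which coefficient the paper reports. The input lists are placeholders invented for this example and are not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# PLACEHOLDER values, one entry per unlearned checkpoint (not from the paper).
residual_knowledge = [0.40, 0.30, 0.20, 0.10]
attack_success_rate = [0.80, 0.50, 0.45, 0.15]

r = pearson(residual_knowledge, attack_success_rate)
print(f"correlation between residual score and attack success: {r:.2f}")
```

With only a handful of checkpoints the coefficient is noisy, so it is better read as a direction of evidence than as a calibrated risk estimate.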

The business inference is straightforward but bounded. A high residual knowledge score should be treated as a risk flag. It can justify escalation: stronger unlearning, exact retraining for high-value cases, additional access controls, or restricted deployment. But it should not be oversold as a complete vulnerability score for every possible attack. The paper tests specific attack families and datasets, not the entire universe of future adversaries. Annoyingly, attackers have a habit of not respecting benchmark boundaries.

Abstention is a safety layer, not a magic eraser

The paper’s third contribution is an inference-time abstention mechanism. This is where the audit becomes a deployment control.

The authors define a risk score using two membership decoders: one trained on the base model representation and one on the unlearned model representation. Let:

$p_1 = f_1(\text{forget} \mid x_1)$ for the base representation;
$p_2 = f_2(\text{forget} \mid x_2)$ for the unlearned representation.

The proposed score is:

$$ \text{RiskScore}(x) = \frac{1}{2}(p_1 + p_2) \cdot (1 - |p_1 - p_2|) $$

The score is high when both conditions hold:

both decoders think the input resembles the forget set;
the base and unlearned decoders agree.

That second condition is important. If the base model flags an input as forget-set-like but the unlearned model does not, this may indicate successful removal. If both agree confidently, the trace likely survived.

The qualitative example in the paper makes the mechanism easy to understand:

Sample type	$p_1$	$p_2$	Interpretation	Risk score
Retain sample	0.09	0.12	Both decoders see low forget probability	0.10
Forget sample, successfully unlearned	0.95	0.17	Base remembers, unlearned diverges	0.12
Forget sample, residual trace	0.92	0.85	Both agree it looks forget-set-like	0.82
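The score is simple enough to check by hand. The sketch below implements the formula as written above and evaluates it on the three $(p_1, p_2)$ pairs from the table. The abstention threshold of 0.5 and the function names are hypothetical; the article does not state the threshold the paper uses.

```python
def risk_score(p1, p2):
    """RiskScore = mean forget probability, discounted by decoder disagreement."""
    return 0.5 * (p1 + p2) * (1.0 - abs(p1 - p2))

ABSTAIN_THRESHOLD = 0.5  # hypothetical cut-off chosen for illustration

def should_abstain(p1, p2, threshold=ABSTAIN_THRESHOLD):
    return risk_score(p1, p2) >= threshold

# (p1, p2) pairs from the qualitative example in the table above.
EXAMPLES = {
    "retain sample": (0.09, 0.12),
    "forget sample, successfully unlearned": (0.95, 0.17),
    "forget sample, residual trace": (0.92, 0.85),
}

for name, (p1, p2) in EXAMPLES.items():
    print(f"{name}: score={risk_score(p1, p2):.2f} abstain={should_abstain(p1, p2)}")
```

The multiplicative agreement term is what separates the second and third rows. Both have a high base-model probability, but only the row where the unlearned decoder concurs ends up with a high score.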

This is a clever little mechanism because it does not merely ask whether the model is uncertain. In fact, the paper explicitly compares against predictive entropy as an uncertainty baseline. On LLaMA with RMU, predictive entropy improves Forget Quality from 0.72 to 0.77 while reducing utility from 0.51 to 0.48. The proposed risk score improves Forget Quality to 0.83 with utility at 0.49.

That comparison is a robustness-style check for the abstention claim. It tests whether the risk score is doing something more targeted than generic uncertainty-based refusal. The result suggests that it is.

Still, abstention should be understood correctly. It is not unlearning. It is a guardrail over imperfect unlearning. It reduces leakage opportunities by refusing high-risk queries, but the residual information may still exist inside the model. The pipe still has a weak joint; the system just avoids turning on the water in certain rooms.

Smaller-model and MUSE results support generality, not universality

The appendix adds two useful checks.

First, the authors test smaller models: LLaMA-3 1B and Gemma 2B. The purpose is scalability and consistency. The same broad pattern appears: Gradient Ascent and Gradient Difference retain more residual knowledge, while RMU achieves lower residual knowledge and higher unlearned knowledge. For LLaMA-3 1B, RMU reaches residual knowledge of 0.08, compared with 0.29 for Gradient Ascent. For Gemma 2B, RMU reaches 0.09, compared with 0.33 for Gradient Ascent.

Second, the MUSE appendix table repeats the audit on another dataset. Again, RMU tends to have the lowest residual knowledge: 0.09 for LLaMA, 0.06 for Gemma, and 0.08 for Qwen. SimNPO is also relatively strong, while gradient-based methods retain more.

These are robustness and sensitivity checks. They strengthen confidence that shallow unlearning is not a single-table accident. They do not prove the framework covers all languages, all model architectures, all foundation pretraining data, or all future unlearning algorithms.

That distinction matters. A good article should not inflate appendix consistency into universal law. The results are strong enough without wearing a fake crown.

What this means for AI governance teams

The paper’s practical message is not “stop using approximate unlearning.” Exact retraining is often infeasible, and approximate unlearning will remain operationally attractive. The better message is: stop treating approximate unlearning as complete without representation-level evidence.

For organizations deploying fine-tuned LLMs, the audit stack should separate four layers:

Governance layer	Practical question	Recommended evidence
Behavioral testing	Does the model still answer about deleted data?	Forget Quality, output tests, targeted prompts
Representation audit	Does the model still encode membership traces?	PID/RINE Residual Knowledge and Unlearned Knowledge
Attack validation	Can known recovery attacks restore access?	Post-unlearning attack tests and red-team evaluation
Runtime mitigation	Should the model answer this input?	Risk-score abstention or other privacy-aware routing

The paper is most valuable in the second layer. It gives teams a way to quantify internal traces rather than relying on the model’s manners.

For vendor evaluation, the implication is equally direct. A serious unlearning claim should specify:

the unlearning method used;
the forget and retain benchmark design;
output-level metrics;
representation-level residual knowledge;
attack tests;
utility impact;
whether the evaluator had white-box access;
whether abstention or filtering is being used after unlearning.

If a vendor only reports that the model “no longer answers about the deleted data,” that is not an unlearning audit. It is customer-service theater with a dashboard.
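A procurement team could encode that checklist as a small record and ask what is missing before accepting a claim. The sketch below is one possible shape. The class name, field names, and field types are hypothetical and simply mirror the eight items listed above.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class UnlearningClaim:
    # One field per checklist item; None means the vendor did not report it.
    unlearning_method: Optional[str] = None
    benchmark_design: Optional[str] = None
    output_metrics: Optional[dict] = None
    residual_knowledge: Optional[float] = None
    attack_tests: Optional[str] = None
    utility_impact: Optional[str] = None
    white_box_access: Optional[bool] = None
    post_unlearning_filtering: Optional[bool] = None

    def missing_evidence(self):
        """Names of checklist items the claim leaves unreported."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

# A claim that only reports behavior: "the model no longer answers".
behavior_only = UnlearningClaim(output_metrics={"forget_quality": 0.72})
print("missing:", ", ".join(behavior_only.missing_evidence()))
```

Using `None` rather than `False` for unreported items keeps "the vendor says no filtering is applied" distinct from "the vendor did not say".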

What the paper directly shows, and what we infer

The paper directly shows that, on the tested benchmarks and models, membership information can remain linearly decodable from internal representations after unlearning. It also directly shows that PID/RINE-style decomposition can quantify Unlearned Knowledge and Residual Knowledge, and that Residual Knowledge correlates more strongly with tested attack success rates than a strong MIA baseline. Finally, it directly shows that a representation-based risk score improves Forget Quality with limited utility loss in the tested LLaMA setting.

Cognaptus infers three business implications.

First, compliance workflows should distinguish behavioral removal from representational removal. The first is easier to observe; the second is closer to the privacy risk that matters.

Second, unlearning audits should become more like security audits. Black-box testing is useful, but high-stakes claims need white-box inspection. The paper makes the same analogy: source-code audits provide stronger assurance than black-box penetration tests alone.

Third, runtime abstention can be a practical bridge while unlearning remains imperfect. Not a legal miracle. Not a certificate of deletion. A bridge.

The uncertainty boundary is also clear. This approach needs access to model internals or at least comparable model versions. It focuses mainly on membership information, not all sensitive attributes. The experiments are mostly English-language benchmark settings for fine-tuned LLMs. The estimator provides a practical lower bound, not a certified exact-unlearning guarantee. And for frontier black-box models, the paper only sketches possible extensions through logits or log probabilities; that remains less direct than representation-level auditing.

The hard boundary: this is not certified forgetting

The most important limitation is that the framework does not provide certified exact unlearning. Exact unlearning means the model behaves as if the forgotten data had never been included in training. For large LLMs, that usually implies retraining or stronger formal mechanisms that are expensive or currently impractical at scale.

The paper offers an audit, not a magic delete key.

This distinction should shape how businesses use the result. A high Residual Knowledge score is strong evidence of incomplete unlearning. A low score is reassuring, especially when supported by output tests and attack tests, but it is not a mathematical proof that no information remains anywhere in the model.

The paper’s estimator choice has the same one-sided nature. Because RINE provides a lower bound, high residual knowledge is damning; low residual knowledge is encouraging but should not be treated as omniscience. Audits often work this way. Finding the problem is easier than proving the absence of all possible problems.

The authors also note that they audit membership information rather than specific sensitive attributes. That is a significant practical boundary. In real deployments, a deletion request may concern personal identifiers, copyrighted passages, medical details, or proprietary customer records. Membership is a useful audit target, but it is not the whole privacy surface.

Finally, the tested setting is still benchmark-centered. TOFU’s fictitious authors and MUSE’s news/copyright-style tasks are useful, but enterprise data rarely arrives as tidy benchmark partitions. Production unlearning will need data lineage, retention-set design, prompt distribution analysis, and post-deployment monitoring. Alas, governance refuses to fit into a neat CSV file. Very rude of it.

From forgetting as promise to forgetting as evidence

The best part of this paper is that it changes the audit question.

The weak question is: “Does the model still say the deleted thing?”

The stronger question is: “Can the deleted thing still be decoded from what the model internally represents?”

The strongest operational question is: “How much was removed, how much remains, and what controls activate when residual traces are risky?”

That progression is what makes the paper useful for AI governance. It does not pretend approximate unlearning is hopeless. It shows that approximate unlearning needs better evidence. It does not reduce privacy to legal slogans. It gives deployers a measurable object: residual knowledge.

The business takeaway is therefore not dramatic. It is procedural.

If a model has been unlearned, audit the representations. Compare the base and unlearned versions. Measure what is unique to the base and what remains shared. Validate against attack recovery. Add abstention when risk remains high. Report the limits honestly.

For a field addicted to green checkmarks, this is healthy discipline.

Forgetting is not a press release. It is an information trace problem. And if the trace is still linearly decodable, the model has not forgotten. It has merely learned the corporate art of saying nothing while keeping the memo.

Cognaptus: Automate the Present, Incubate the Future.

¹ Anmol Goel, Alan Ritter, and Iryna Gurevych, “Auditing Language Model Unlearning via Information Decomposition,” arXiv:2601.15111, 2026.
