Opening — Why this matters now
“Right to be forgotten” has quietly become one of the most dangerous phrases in AI governance. On paper, it sounds clean: remove a user’s data, comply with regulation, move on. In practice, modern large language models (LLMs) have turned forgetting into a performance art. Models stop saying what they were trained on—but continue remembering it internally.
This gap between behavioral compliance and representational reality is no longer academic. GDPR, the EU AI Act, and similar regimes are moving from intent to enforcement. If a model can still reconstruct erased information under the right pressure, it hasn’t forgotten anything that matters.
Background — The comfort of shallow audits
Most unlearning evaluations today focus on outputs: does the model still answer questions about deleted data? Forget Quality scores and membership inference attacks (MIAs) are the standard tools for answering this. Unfortunately, both are blunt instruments.
Output-based tests reward surface-level compliance. A model can learn not to speak without actually unlearning. MIAs, meanwhile, struggle at LLM scale and often degrade to near-random performance, offering false reassurance rather than real guarantees.
This creates what the paper terms shallow unlearning: the model appears compliant while retaining linearly decodable traces of deleted data inside its representations. From a regulatory standpoint, this is worse than failure—it is misdirection.
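To make "linearly decodable traces" concrete, the sketch below trains a simple linear probe on hidden states, assuming pooled representation vectors have already been extracted for forget-set and held-out examples; the array names and probe setup are illustrative, not the paper's audit protocol.

```python
# Minimal linear-probe sketch: can membership in the deleted (forget) set still be
# read off the unlearned model's hidden states? Assumes pooled hidden-state vectors
# were already extracted; `reps_forget` / `reps_holdout` are illustrative names.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def residual_probe_auc(reps_forget: np.ndarray, reps_holdout: np.ndarray) -> float:
    """AUC near 0.5 means no linearly decodable trace of the deleted data;
    AUC well above 0.5 means the 'forgotten' data is still encoded internally."""
    X = np.vstack([reps_forget, reps_holdout])
    y = np.concatenate([np.ones(len(reps_forget)), np.zeros(len(reps_holdout))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
```

If such a probe separates "deleted" from held-out examples well above chance, the model is behaviorally compliant but representationally leaky.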
Analysis — Auditing memory, not behavior
The paper proposes a sharp pivot: stop asking what the model says and start measuring what the model knows.
The core idea is to compare internal representations before and after unlearning, treating them as two information sources about a sensitive target (e.g., data membership). Instead of measuring raw mutual information—which collapses everything into a single number—the authors apply Partial Information Decomposition (PID).
PID decomposes information into four components:
| Component | Meaning in unlearning |
|---|---|
| Unique (Base) | Information successfully erased |
| Unique (Unlearned) | Information newly introduced |
| Redundant | Information that survived unlearning |
| Synergistic | Information only recoverable jointly |
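In standard PID notation (the framework originally due to Williams and Beer), these four terms sum to the joint information that the base and unlearned representations carry about a sensitive target Y, such as membership in the deleted data. The symbols below are an illustrative restatement rather than the paper's exact notation:

```latex
% Z_b = base-model representation, Z_u = unlearned-model representation,
% Y = sensitive target (e.g., membership in the deleted data)
\begin{aligned}
I(Z_b, Z_u; Y) &= \underbrace{U_b}_{\text{unique to base}}
              + \underbrace{U_u}_{\text{unique to unlearned}}
              + \underbrace{R}_{\text{redundant}}
              + \underbrace{S}_{\text{synergistic}}, \\
I(Z_b; Y) &= U_b + R, \qquad I(Z_u; Y) = U_u + R.
\end{aligned}
```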
Two quantities matter operationally:
- Unlearned Knowledge: information uniquely present in the base model but absent after unlearning.
- Residual Knowledge: redundant information shared between base and unlearned models.
If residual knowledge is non-trivial, forgetting is incomplete—no matter how polite the outputs look.
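In the notation above, the two audit quantities and the success condition can be written compactly:

```latex
% Unlearned Knowledge = information only the base model carried about Y;
% Residual Knowledge  = information both models still share about Y.
\text{Unlearned Knowledge} = U_b = I(Z_b; Y) - R,
\qquad
\text{Residual Knowledge} = R,
\qquad
\text{effective forgetting} \iff R \approx 0 \ \text{and}\ U_b \approx I(Z_b; Y).
```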
Findings — What actually survives unlearning
Across multiple benchmarks (TOFU, MUSE), models (LLaMA, Gemma, Qwen), and algorithms, the results are uncomfortable.
Gradient-based unlearning methods often score well on Forget Quality while retaining large amounts of residual knowledge. In contrast, more representation-aware methods—especially RMU—consistently reduce residual traces.
A simplified snapshot:
| Method | Forget Quality ↑ | Residual Knowledge ↓ |
|---|---|---|
| Gradient Ascent | High | High (bad) |
| Gradient Difference | High | High |
| NPO / SimNPO | Medium–High | Medium |
| RMU | High | Low |
| Retrain (gold standard) | Perfect | ~0 |
The key takeaway is blunt: output-level forgetting metrics systematically overstate privacy compliance.
Implications — Residual knowledge is attack surface
Residual knowledge is not just a theoretical artifact. The paper shows it correlates strongly with post-unlearning adversarial attack success. Models that “forgot” convincingly at the surface were often the easiest to recover information from.
More usefully, measuring residual knowledge enables a practical defense: inference-time abstention.
By comparing how strongly base and unlearned models agree that a query belongs to deleted data, the system can compute a lightweight risk score. High agreement plus high confidence triggers abstention—silence instead of leakage.
Unlike uncertainty-based methods, this mechanism targets memory persistence, not generic hesitation. The result: stronger forgetting with minimal utility loss.
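As a rough illustration (not the paper's implementation), the gate below abstains when the base and unlearned models agree that a query looks like deleted data and the unlearned model is itself confident; the membership-scoring callables and thresholds are hypothetical placeholders.

```python
# Illustrative inference-time abstention gate (hypothetical scorers and thresholds).
from typing import Callable

def should_abstain(
    query: str,
    score_base: Callable[[str], float],       # base model's "looks like deleted data" score in [0, 1]
    score_unlearned: Callable[[str], float],  # unlearned model's score in [0, 1]
    agreement_threshold: float = 0.8,
    confidence_threshold: float = 0.7,
) -> bool:
    s_b, s_u = score_base(query), score_unlearned(query)
    agreement = 1.0 - abs(s_b - s_u)          # both models place the query similarly
    confident = s_u >= confidence_threshold   # unlearned model still treats it as deleted data
    return agreement >= agreement_threshold and confident

def answer_or_abstain(
    query: str,
    generate: Callable[[str], str],
    score_base: Callable[[str], float],
    score_unlearned: Callable[[str], float],
) -> str:
    if should_abstain(query, score_base, score_unlearned):
        return "I can't help with that."  # silence instead of leakage
    return generate(query)
```

The gate fires only when persistence (cross-model agreement) and residual confidence coincide, which is what distinguishes it from generic uncertainty-based refusal.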
Conclusion — From theater to evidence
This work reframes unlearning from a binary checkbox into a measurable, decomposable process. It exposes why current audits fail, provides a principled alternative, and—critically—connects theory to deployable safeguards.
The uncomfortable implication is that black-box audits are no longer defensible for high-stakes unlearning. Just as security evolved from penetration tests to source audits, privacy compliance in LLMs will demand white-box evidence.
If forgetting is a legal right, then residual knowledge is legal debt—and this paper finally gives us a balance sheet.
Cognaptus: Automate the Present, Incubate the Future.