Opening — Why this matters now

Continual learning is supposed to be the adult version of fine-tuning: learn new things, keep the old ones, don’t embarrass yourself. Yet large language models still forget with the enthusiasm of a goldfish.

Recent work complicated this picture by arguing that much of what we call forgetting isn’t real memory loss at all. It’s misalignment. This paper pushes that idea further — and sharper. It shows that most modern task alignment is shallow, fragile, and only a few tokens deep. And once you see it, a lot of puzzling behaviors suddenly stop being mysterious.

Background — Catastrophic forgetting meets its imposter

Classic continual learning treats performance drops as knowledge destruction. The standard toolkit follows accordingly:

| Approach | What it protects | Cost |
|---|---|---|
| Regularization (EWC, SI) | Important weights | Hyperparameter-heavy |
| Replay | Old data | +30–50% compute |
| Parameter isolation | Separate capacity | Model bloat |
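For orientation, the first row refers to penalties of the EWC form (standard formulation, not specific to this paper), which anchor the parameters that mattered most for earlier tasks:

$$ \mathcal{L}(\theta) = \mathcal{L}_{\text{new}}(\theta) + \frac{\lambda}{2} \sum_i F_i \left( \theta_i - \theta_i^{*} \right)^2 $$

Here $F_i$ is the diagonal Fisher information for parameter $i$, $\theta^{*}$ the optimum found on earlier tasks, and $\lambda$ the knob behind the table's "hyperparameter-heavy" label.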

The 2025 introduction of spurious forgetting disrupted this logic. Models can lose accuracy while their internal representations remain largely intact. In those cases, a handful of samples and a few epochs can “restore” performance — something true forgetting cannot explain.

But that earlier work stopped at description. It told us that alignment breaks, not how deeply, when, or why it keeps happening.

Analysis — The shallow vs. deep alignment framework

This paper’s central claim is simple and uncomfortable:

Most task alignment in LLMs only survives the first 3–5 output tokens.

That’s shallow alignment.

What alignment depth actually means

Alignment is measured token-by-token, producing an alignment score $A_t \in [0,1]$. From this, the paper defines alignment depth:

$$ D = \max \{\, k : A_t \ge 0.7 \;\; \forall\, t \le k \,\} $$

  • Shallow alignment: $D \le 5$
  • Deep alignment: $D > 10$

Most standard training methods — including freezing strategies and replay — plateau at $D \approx 3$.
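To make the definition concrete, here is a minimal sketch of the depth computation. The function names and the list-of-scores interface are mine, not the paper's; only the 0.7 threshold and the shallow/deep cutoffs come from the definitions above.

```python
from typing import Sequence

def alignment_depth(scores: Sequence[float], tau: float = 0.7) -> int:
    """Largest k such that A_t >= tau for every t <= k (0 if the first token already fails)."""
    depth = 0
    for a in scores:          # scores[t-1] plays the role of A_t
        if a < tau:
            break
        depth += 1
    return depth

def classify(depth: int) -> str:
    # Cutoffs follow the paper's definitions: shallow if D <= 5, deep if D > 10.
    if depth <= 5:
        return "shallow"
    if depth > 10:
        return "deep"
    return "intermediate"

# Alignment holds for three tokens, then collapses -> D = 3, i.e. shallow.
print(alignment_depth([0.92, 0.85, 0.78, 0.55, 0.40]), classify(3))
```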

Why shallow alignment emerges

Transformers carry a built-in positional bias: gradient magnitudes decay with output-token position:

$$ \lVert \nabla L_t \rVert \propto \alpha^{t} \quad (\alpha \approx 0.6\text{–}0.7) $$

Early tokens dominate optimization, early stopping rewards them, and alignment quietly collapses after a few steps. The model looks competent — until the first tokens wobble.
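A quick back-of-the-envelope check (mine, not the paper's) makes the imbalance tangible: with $\alpha$ in the quoted range, the first five output positions soak up well over 80% of the total gradient mass.

```python
# Illustration of the alpha**t decay, truncated at 50 output tokens.
for alpha in (0.6, 0.65, 0.7):
    weights = [alpha ** t for t in range(1, 51)]
    share = sum(weights[:5]) / sum(weights)
    print(f"alpha={alpha}: first 5 tokens carry {share:.0%} of gradient mass")
```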

Why shallow alignment explains everything

| Phenomenon | Explanation |
|---|---|
| Spurious forgetting | Early-token misalignment derails generation |
| Fast recovery | Only output alignment needs repair |
| Fine-tuning fragility | A few updates can flip shallow alignment |
| Freezing works (sometimes) | Representations survive, but alignment doesn't deepen |

This is not a training bug. It’s a structural bias.

Findings — Measuring, detecting, and fixing the problem

Quantitative detection (in real time)

The paper replaces post-hoc guesswork with live monitoring:

  • Alignment depth tracking every 100 steps
  • Reversibility score combining alignment, CKA similarity, and gradient norms
  • Spurious forgetting score with 86–91% identification accuracy

Detection overhead: ~5% compute. That’s cheaper than pretending nothing is wrong.
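A rough sketch of what such a live monitor could look like in practice; the function name, thresholds, and the returned fields are illustrative assumptions, not the paper's API. It reuses `alignment_depth` from the sketch above.

```python
def monitor_step(step, probe_scores, cka_similarity, grad_norm,
                 depth_floor=5, cka_floor=0.9):
    """Every 100 steps, estimate alignment depth and flag likely spurious forgetting."""
    if step % 100 != 0:
        return None
    depth = alignment_depth(probe_scores)
    # Heuristic reading of the reversibility idea: representations still match
    # the pre-task model (high CKA) while alignment has gone shallow.
    likely_spurious = depth <= depth_floor and cka_similarity >= cka_floor
    return {"step": step, "depth": depth, "cka": cka_similarity,
            "grad_norm": grad_norm, "likely_spurious": likely_spurious}
```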

Deep alignment training

Three interventions actually move the needle:

  1. Token-position weighted loss — later tokens matter again
  2. Multi-position alignment regularization — smooth alignment across steps
  3. Sequential alignment curriculum — depth is enforced, not hoped for
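Here is a minimal sketch of the first intervention, a token-position weighted loss; the exact weighting schedule is my assumption, chosen to counteract the $\alpha^t$ decay above, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def position_weighted_loss(logits, targets, alpha=0.65):
    """Cross-entropy with per-position weights that grow for later tokens.

    logits: (batch, seq_len, vocab), targets: (batch, seq_len).
    """
    batch, seq_len, vocab = logits.shape
    per_token = F.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1), reduction="none"
    ).reshape(batch, seq_len)
    positions = torch.arange(1, seq_len + 1, device=logits.device, dtype=per_token.dtype)
    # Inverse of the alpha**t gradient decay; clamp the exponent so very long
    # sequences do not overflow, then normalize to keep the loss scale stable.
    weights = alpha ** (-positions.clamp(max=64))
    weights = weights / weights.sum()
    return (per_token * weights).sum(dim=1).mean()
```

Under a weighting like this, a mistake at token 20 contributes roughly as much gradient as one near the start, which is exactly what standard next-token training fails to guarantee.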

Results speak clearly:

| Training | Alignment depth $D$ | Forgetting rate |
|---|---|---|
| Standard | ≤ 3 | 11–13% |
| Deep alignment | > 12 | 2–3% |

Adaptive mitigation

Instead of freezing everything just in case, the system reacts:

  • Selective repair for spurious forgetting (50–100 samples)
  • Replay only for true forgetting
  • Adaptive freezing when alignment becomes brittle

Performance gains: +3.3–7.1% over fixed strategies, with ~12% overhead.
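Stitched together with the monitoring sketch above, the adaptive policy reduces to a small dispatcher; the thresholds and strategy names are illustrative, not the paper's.

```python
def choose_mitigation(signal, depth_floor=5, cka_floor=0.9):
    """Map a monitoring signal to one of the three mitigation modes."""
    if signal is None:
        return "continue_training"
    if signal["likely_spurious"]:
        # Representations intact: realign with a small repair set (~50-100 samples).
        return "selective_repair"
    if signal["cka"] < cka_floor:
        # Representations actually drifted: this is true forgetting, replay old data.
        return "replay"
    if signal["depth"] <= depth_floor:
        # Alignment brittle but not yet broken: freeze the most sensitive layers.
        return "adaptive_freeze"
    return "continue_training"
```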

Implications — Why this matters beyond continual learning

This framework quietly reframes several open problems:

  • Fine-tuning safety: shallow alignment is why small updates can be catastrophic
  • Adversarial robustness: deep alignment creates self-correcting generations
  • Agent reliability: long-horizon reasoning fails when alignment dies early

In short: if your model reasons beyond five tokens, shallow alignment is technical debt.

Conclusion — Alignment depth is the real bottleneck

This paper doesn’t just extend prior work. It diagnoses a structural weakness in how we train language models.

Catastrophic forgetting is often not catastrophic. It’s shallow.

And once alignment depth becomes measurable, detectable, and trainable, pretending otherwise looks increasingly lazy.

Cognaptus: Automate the Present, Incubate the Future.