Opening — Why this matters now
Continual learning is supposed to be the adult version of fine-tuning: learn new things, keep the old ones, don’t embarrass yourself. Yet large language models still forget with the enthusiasm of a goldfish.
Recent work complicated this picture by arguing that much of what we call forgetting isn’t real memory loss at all. It’s misalignment. This paper pushes that idea further and makes it sharper: most modern task alignment is shallow, fragile, and only a few tokens deep. And once you see it, a lot of puzzling behaviors suddenly stop being mysterious.
Background — Catastrophic forgetting meets its imposter
Classic continual learning treats performance drops as knowledge destruction. The standard toolkit follows accordingly:
| Approach | What it protects | Cost |
|---|---|---|
| Regularization (EWC, SI) | Important weights | Hyperparameter-heavy |
| Replay | Old data | +30–50% compute |
| Parameter isolation | Separate capacity | Model bloat |
The 2025 introduction of spurious forgetting disrupted this logic. Models can lose accuracy while their internal representations remain largely intact. In those cases, a handful of samples and a few epochs can “restore” performance — something true forgetting cannot explain.
But that earlier work stopped at description. It told us that alignment breaks, not how deeply, when, or why it keeps happening.
Analysis — The shallow vs. deep alignment framework
This paper’s central claim is simple and uncomfortable:
Most task alignment in LLMs only survives the first 3–5 output tokens.
That’s shallow alignment.
What alignment depth actually means
Alignment is measured token-by-token, producing an alignment score $A_t \in [0,1]$. From this, the paper defines alignment depth:
$$ D = \max \{\, k : A_t \ge 0.7 \;\; \forall\, t \le k \,\} $$
- Shallow alignment: $D \le 5$
- Deep alignment: $D > 10$
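For concreteness, a minimal sketch of that definition in Python, assuming you already have per-token alignment scores $A_t$ (how the paper computes $A_t$ is not reproduced here; the 0.7 threshold is the paper's, everything else is illustrative):

```python
from typing import Sequence

def alignment_depth(scores: Sequence[float], threshold: float = 0.7) -> int:
    """Depth D = largest k such that A_t >= threshold for all t <= k.

    `scores` holds the per-token alignment score A_t for t = 1..T.
    Returns 0 if even the first token falls below the threshold.
    """
    depth = 0
    for a_t in scores:
        if a_t < threshold:
            break
        depth += 1
    return depth

# Example: alignment holds for the first 3 tokens, then collapses -> shallow (D <= 5)
print(alignment_depth([0.95, 0.88, 0.74, 0.41, 0.22]))  # 3
```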
Most standard training methods — including freezing strategies and replay — plateau at $D \approx 3$.
Why shallow alignment emerges
Autoregressive training is positionally biased. Gradient magnitudes decay with output token position:
$$ \| \nabla L_t \| \propto \alpha^{t}, \qquad \alpha \approx 0.6\text{–}0.7 $$
Early tokens dominate optimization, early stopping rewards them, and alignment quietly collapses beyond the first few token positions. The model looks competent — until the first tokens wobble.
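A quick back-of-the-envelope illustration of how fast that decay bites, taking $\alpha = 0.65$ from the middle of the reported range (the normalization here is illustrative, not the paper's):

```python
# Relative gradient magnitude per output token position, ||grad L_t|| ~ alpha**t.
alpha = 0.65  # within the 0.6-0.7 range reported above
weights = [alpha**t for t in range(1, 11)]
total = sum(weights)
for t, w in enumerate(weights, start=1):
    print(f"token {t:2d}: {w / total:5.1%} of the gradient signal")
# Tokens 1-3 already absorb roughly three quarters of the signal,
# so later positions barely shape the loss -- alignment stays shallow.
```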
Why shallow alignment explains everything
| Phenomenon | Explanation |
|---|---|
| Spurious forgetting | Early-token misalignment derails generation |
| Fast recovery | Only output alignment needs repair |
| Fine-tuning fragility | Few updates can flip shallow alignment |
| Freezing works (sometimes) | Representations survive, alignment doesn’t deepen |
This is not a training bug. It’s a structural bias.
Findings — Measuring, detecting, and fixing the problem
Quantitative detection (in real time)
The paper replaces post-hoc guesswork with live monitoring:
- Alignment depth tracking every 100 steps
- Reversibility score combining alignment, CKA similarity, and gradient norms
- Spurious forgetting score with 86–91% identification accuracy
Detection overhead: ~5% compute. That’s cheaper than pretending nothing is wrong.
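A hedged sketch of what such a monitor could look like. The paper does not spell out how the three signals are combined, so the weights, thresholds, and helper names below are placeholders, not its formula:

```python
from dataclasses import dataclass

@dataclass
class AlignmentSnapshot:
    depth: int              # alignment depth D at this checkpoint
    cka_similarity: float   # CKA between current and pre-update representations, in [0, 1]
    grad_norm_ratio: float  # current / baseline gradient norm

def reversibility_score(s: AlignmentSnapshot,
                        w_depth: float = 0.5,
                        w_cka: float = 0.3,
                        w_grad: float = 0.2) -> float:
    """Heuristic combination of the three tracked signals.
    Weights and the depth cap at 10 are placeholder choices."""
    depth_term = min(s.depth, 10) / 10.0
    grad_term = 1.0 / (1.0 + abs(1.0 - s.grad_norm_ratio))
    return w_depth * depth_term + w_cka * s.cka_similarity + w_grad * grad_term

def looks_spurious(s: AlignmentSnapshot, accuracy_drop: float) -> bool:
    """Flag a drop as spurious forgetting when representations stay intact
    (high CKA) but alignment depth has collapsed."""
    return accuracy_drop > 0.05 and s.cka_similarity > 0.9 and s.depth <= 5
```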
Deep alignment training
Three interventions actually move the needle:
- Token-position weighted loss — later tokens matter again (a sketch follows this list)
- Multi-position alignment regularization — alignment stays smooth across token positions
- Sequential alignment curriculum — depth is enforced, not hoped for
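Here is a minimal sketch of the first intervention in PyTorch. The inverse-decay weighting $w_t \propto \alpha^{-t}$, capped and normalized, is one plausible reading of "token-position weighted loss", not necessarily the paper's exact schedule:

```python
import torch
import torch.nn.functional as F

def position_weighted_loss(logits: torch.Tensor,
                           targets: torch.Tensor,
                           alpha: float = 0.65,
                           max_boost: float = 10.0) -> torch.Tensor:
    """Cross-entropy where later token positions are up-weighted to counter
    the alpha**t gradient decay.
    logits: (batch, seq, vocab); targets: (batch, seq) of token ids."""
    batch, seq, vocab = logits.shape
    per_token = F.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1), reduction="none"
    ).reshape(batch, seq)
    positions = torch.arange(seq, device=logits.device, dtype=per_token.dtype)
    weights = torch.clamp(alpha ** (-positions), max=max_boost)  # w_t ~ alpha^-t, capped
    weights = weights / weights.mean()                           # keep the loss scale stable
    return (per_token * weights).mean()
```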
Results speak clearly:
| Training | Alignment Depth D | Forgetting Rate |
|---|---|---|
| Standard | ≤ 3 | 11–13% |
| Deep alignment | > 12 | 2–3% |
Adaptive mitigation
Instead of freezing everything just in case, the system reacts:
- Selective repair for spurious forgetting (50–100 samples)
- Replay only for true forgetting
- Adaptive freezing when alignment becomes brittle
Performance gains: +3.3–7.1% over fixed strategies, with ~12% overhead.
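A rough sketch of how that routing could be wired up; the thresholds and strategy names are illustrative stand-ins for the paper's detector outputs:

```python
def choose_mitigation(accuracy_drop: float,
                      cka_similarity: float,
                      alignment_depth: int) -> str:
    """Route each detected performance drop to the cheapest plausible fix."""
    if accuracy_drop < 0.02:
        return "no_action"
    if cka_similarity > 0.9 and alignment_depth <= 5:
        # Representations intact but alignment shallow: spurious forgetting.
        return "selective_repair"    # re-align with ~50-100 samples
    if cka_similarity < 0.7:
        # Representations themselves have drifted: true forgetting.
        return "replay_old_data"
    return "adaptive_freeze"         # alignment brittle but not yet broken
```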
Implications — Why this matters beyond continual learning
This framework quietly reframes several open problems:
- Fine-tuning safety: shallow alignment is why small updates can be catastrophic
- Adversarial robustness: deep alignment creates self-correcting generations
- Agent reliability: long-horizon reasoning fails when alignment dies early
In short: if your model reasons beyond five tokens, shallow alignment is technical debt.
Conclusion — Alignment depth is the real bottleneck
This paper doesn’t just extend prior work. It diagnoses a structural weakness in how we train language models.
Catastrophic forgetting is often not catastrophic. It’s shallow.
And once alignment depth becomes measurable, detectable, and trainable, pretending otherwise looks increasingly lazy.
Cognaptus: Automate the Present, Incubate the Future.