When Models Start to Forget: The Hidden Cost of Training LLMs Too Well

Duplicates are supposed to be boring.

In data engineering, duplicate records are usually treated as a hygiene problem: remove them, clean the pipeline, reduce noise, move on. In language-model training, repetition is less innocent. Repeated text can help a model learn an underrepresented domain. It can also teach the model to reproduce specific sequences too well. Somewhere between “useful exposure” and “verbatim recall,” a model stops learning only the pattern and starts carrying around the document.

That would be manageable if memorized content lived in a neat little drawer. Open the drawer, remove the offending memory, close the model, enjoy compliance. Unfortunately, modern neural networks did not get the memo about office filing.

The paper behind this article, Memorization Sinks: Isolating Memorization during LLM Training, argues that the real problem is not merely that large language models memorize repeated sequences. The sharper problem is that ordinary training tends to entangle memorization with the same internal mechanisms that support general language ability.¹ Once that happens, removing the memorized content after training is no longer a clean deletion task. It becomes surgery with a spoon.

Memorization is not just bad generalization with better branding

The easy story says that memorization is the opposite of generalization. A model generalizes when it learns reusable structure; it memorizes when it stores specific examples. Nice distinction. Very tidy. Also not quite sufficient.

In LLMs, the same training sequence can serve two roles at once. A repeated legal clause, medical note, product manual, or children’s story can teach useful domain structure: terminology, syntax, reasoning patterns, genre conventions. But repeated exposure also increases the chance that the model learns the exact sequence. Earlier work has already shown that language models can emit memorized training text, and that memorization tends to increase with model capacity, data duplication, and the length of the prompt context.² The business implication is uncomfortable: the data most worth repeating because it is useful may also be the data most likely to become difficult to remove.

This is why “just deduplicate the dataset” is only a partial answer. Deduplication can reduce memorized output dramatically and improve evaluation cleanliness.³ But not every repetition is accidental garbage. Some repetition is deliberate: upsampling rare domains, strengthening a narrow capability, or fine-tuning on proprietary examples. If every repeated item is removed, the model may lose useful signal. If every repeated item is kept, the model may retain sensitive or copyrighted sequences too literally. That is the training engineer’s version of choosing between nutrition and indigestion.

What the paper actually tests: removal after training is the weak point

The paper’s central move is to shift attention from “does the model memorize?” to “can memorization be isolated so it can later be removed without damaging the model?”

That distinction matters. Privacy discussions often focus on leakage: can an attacker extract training data? Prior extraction work shows this is not theoretical. Carlini and co-authors demonstrated training-data extraction from GPT-2, including verbatim sequences such as personal information and code, while later work showed scalable extraction from open, semi-open, and closed production-style models.⁴ Those studies make the risk visible from the outside. Memorization Sinks asks what makes the risk hard to repair from the inside.

The authors compare two types of memorized content. The first is highly atypical “canary” material: artificial sequences that stand out from ordinary language. The second is natural text repeated many times during training. This contrast is important because canaries are convenient for experiments but not always representative of real LLM memorization. A bizarre artificial string may occupy a more separable internal niche. A repeated natural paragraph, by contrast, looks like normal language. It shares features with the validation distribution. It uses the same grammar, vocabulary, and narrative structures that the model needs for general competence.

That is exactly where the trouble begins.

The paper finds that post-hoc localization methods struggle more with natural repeated sequences than with canaries. Removing neurons associated with memorization can increase loss on the memorized sequence, which means some targeted forgetting occurs. But it can also degrade validation performance, especially when the memorized sequence is natural text. In plain English: the model did not put “general English” in one area and “this repeated story” in another. It reused machinery.

This is the key correction to a common reader belief.

Reader belief	Better interpretation	Why it matters
Memorized content is like a stored file.	Memorized natural text can be entangled with general language mechanisms.	Deleting it may damage normal model behavior.
Post-training unlearning is mostly a tooling problem.	Some failure comes from how standard training organizes internal representations.	Better repair tools may still face structural limits.
Deduplication solves memorization.	Deduplication helps, but repeated data can also carry useful domain signal.	Enterprises cannot always remove all repetition without losing capability.
Canary tests represent real memorization.	Canaries are useful diagnostics, but natural repeated text can be harder to isolate.	Risk audits should not rely only on artificial triggers.

This is not a philosophical distinction. It changes where engineering effort should go.
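The neuron-removal test described in this section can be pictured with a deliberately tiny toy. The sketch below is not the paper's localization method; it is a hand-built two-neuron ReLU network (all weights, examples, and function names are assumptions chosen for illustration) in which one neuron serves both the memorized example and the validation example, while the other serves only the memorized one. Ablating each in turn shows the difference between a separable memory and an entangled one.

```python
def forward(x, hidden_w, out_w, ablated=frozenset()):
    """One-hidden-layer ReLU network; neurons listed in `ablated` are skipped."""
    total = 0.0
    for j, (w, v) in enumerate(zip(hidden_w, out_w)):
        if j in ablated:
            continue
        pre = sum(wi * xi for wi, xi in zip(w, x))
        total += v * max(0.0, pre)
    return total


def mse(examples, hidden_w, out_w, ablated=frozenset()):
    """Mean squared error over (input, target) pairs."""
    errs = [(forward(x, hidden_w, out_w, ablated) - y) ** 2 for x, y in examples]
    return sum(errs) / len(errs)


def ablation_report(memorized, validation, hidden_w, out_w, ablated):
    """Loss increase on each set after removing the `ablated` neurons."""
    ab = frozenset(ablated)
    return {
        "forget_gain": mse(memorized, hidden_w, out_w, ab)
        - mse(memorized, hidden_w, out_w),
        "validation_cost": mse(validation, hidden_w, out_w, ab)
        - mse(validation, hidden_w, out_w),
    }


# Hypothetical toy weights: neuron 0 is used by both sets, neuron 1 only by
# the memorized example.
HIDDEN_W = [(1.0, 0.0), (0.0, 1.0)]
OUT_W = [1.0, 1.0]
MEMORIZED = [((1.0, 1.0), 2.0)]
VALIDATION = [((1.0, 0.0), 1.0)]

shared = ablation_report(MEMORIZED, VALIDATION, HIDDEN_W, OUT_W, {0})
specific = ablation_report(MEMORIZED, VALIDATION, HIDDEN_W, OUT_W, {1})
```

Removing the sequence-specific neuron raises loss only on the memorized example. Removing the shared neuron raises loss on both, which is the toy version of the validation damage the paper reports for natural repeated text.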

The costly part of understanding: memorization and capability can improve together

The most interesting finding is not that memorization exists. We have known that for a while. The interesting part is that the model can become more capable while also memorizing more.

In the paper’s controlled setting, the loss on repeated natural sequences decreases alongside validation loss. That means memorization is not simply a late-stage overfitting event that appears after useful learning has finished. It can arrive during the same training phase that improves general performance.

That detail matters because many operational controls assume a sequence like this:

1. train the model;
2. observe good validation performance;
3. identify unwanted memorization;
4. remove the memorized content;
5. keep most capability intact.

The paper suggests this sequence can fail because steps 2 and 3 are not independent. If the same training dynamics that improve general performance also entangle repeated natural sequences with general mechanisms, then step 4 becomes expensive. The model is not holding a bad memory beside its capabilities. It may be holding the memory through them.

The authors also give a theoretical explanation: gradient descent can prefer entangled solutions. In their simplified setup, memorization can be implemented either by reusing a generalizing subspace or by using a separate memorization subspace. The training dynamics are biased toward reusing the already useful features. This is efficient during training. It is annoying during deletion. Naturally, the model chose the option that makes the auditor’s life worse.

Why forced localization is not enough

A tempting solution is to route repeated sequences into designated memorization components during training. Let general components learn from ordinary data. Let memorization components absorb repeated content. Later, remove the memorization components. The concept is clean enough to be suspicious.

The paper tests a version of this idea through gradient masking: repeated sequences update memorization neurons, while non-repeated sequences update generalization neurons. It fails in two ways.

First, forced localization weakens generalization. Repeated sequences may contain useful general signal. If shared components are prevented from learning from them, the model loses information it could have used to improve broad capability.

Second, even when memorization is pushed into designated components, those components can still co-adapt with the rest of the model. During training, the presence of memorization neurons affects forward passes and gradient updates. Later removal changes the system the shared neurons learned to rely on. The drawer was labelled “memorization,” but the office workflow still depended on it. Very enterprise.

This result is valuable because it rules out an oversimplified reading of the paper. The contribution is not “put memorization somewhere and delete it later.” The paper shows that naïve separation is brittle. The hard part is not naming a location for memory. The hard part is preventing the rest of the model from becoming dependent on that location.
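For readers who want to see what the gradient-masking rule amounts to, here is a minimal sketch. It assumes one scalar parameter per neuron and plain SGD, neither of which comes from the paper; the function name and arguments are hypothetical. The only thing it is meant to show is the routing rule: repeated sequences may update memorization neurons only, and everything else may update generalization neurons only.

```python
def masked_sgd_step(params, grads, is_repeated, mem_idx, lr=0.1):
    """One SGD step under forced localization.

    params, grads: lists with one value per neuron (a simplification).
    is_repeated:   True if the current sequence is a repeated one.
    mem_idx:       indices of the designated memorization neurons.
    """
    mem = set(mem_idx)
    updated = []
    for j, (p, g) in enumerate(zip(params, grads)):
        # Repeated sequences may touch only memorization neurons;
        # non-repeated sequences may touch only the remaining neurons.
        allowed = (j in mem) if is_repeated else (j not in mem)
        updated.append(p - lr * g if allowed else p)
    return updated
```

Note what the mask does not control: every neuron still participates in the forward pass, so the shared neurons are trained in the presence of the memorization neurons. That is the co-adaptation problem described above, and no amount of gradient bookkeeping removes it.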

MemSinks: isolate memorization by changing training dynamics

The proposed method, Memorization Sinks, is more subtle than ordinary post-hoc unlearning. Instead of waiting until a model has already entangled memorization with general ability, MemSinks tries to shape where memorization goes during training.

The idea is to reserve a subset of MLP neurons as memorization sink neurons. Each sequence receives an identifier, and that identifier deterministically activates a subset of sink neurons. These sink neurons are selectively active for that sequence and dropped out for others. The shared neurons remain available for general learning, while sequence-specific memorization has a more stable place to accumulate.
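A minimal sketch of the "identifier deterministically activates a subset of sink neurons" step, under stated assumptions: the mask is derived by hashing a string ID and seeding a random generator, and activations are plain Python lists. The paper's actual mask construction may differ; `sink_mask`, `mlp_hidden_with_sinks`, and their parameters are illustrative names, not the authors' API.

```python
import hashlib
import random


def sink_mask(sequence_id, n_sink, n_active):
    """Deterministic 0/1 mask over sink neurons for one sequence ID."""
    if not 0 <= n_active <= n_sink:
        raise ValueError("n_active must be between 0 and n_sink")
    # Assumption: seed a private RNG from a hash of the ID so the same
    # sequence always activates the same sink neurons.
    digest = hashlib.sha256(sequence_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    active = set(rng.sample(range(n_sink), n_active))
    return [1 if j in active else 0 for j in range(n_sink)]


def mlp_hidden_with_sinks(shared_act, sink_act, mask, drop_sinks=False):
    """Concatenate shared activations with gated sink activations.

    During training the mask gates the sink neurons per sequence.
    With drop_sinks=True every sink neuron is zeroed, as at removal time.
    """
    if drop_sinks:
        gated = [0.0] * len(sink_act)
    else:
        gated = [a * m for a, m in zip(sink_act, mask)]
    return list(shared_act) + gated
```

Because the mask is a pure function of the ID, a sequence that reappears later in training meets the same sink neurons, while most other sequences never activate them.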

The mechanism depends on a training-dynamics distinction:

generalizable signals are reinforced across many different sequences;
sequence-specific memorization signals are reinforced mainly when that sequence reappears;
unrelated sequences interfere with each other, creating learning-forgetting cycles.

MemSinks tries to reduce that interference for memorization-specific components. By shielding selected sink neurons from unrelated updates, it gives repeated sequences a place to store sequence-specific residue. By activating those sink neurons infrequently, it also reduces co-adaptation with the shared model. The point is not to stop the model from seeing repeated data. The point is to let shared components learn the useful general structure while discouraging them from carrying the verbatim payload.

A simplified diagram helps:

Standard training
Repeated sequence -> shared model components -> generalization + memorization entangled

Naïve forced localization
Repeated sequence -> memorization components only -> weaker generalization + co-adaptation risk

MemSinks
Repeated sequence -> shared components + sequence-tied sink neurons
                 -> shared components learn general signal
                 -> sink neurons absorb sequence-specific memorization
                 -> sink neurons can later be dropped

The elegance is that MemSinks treats memorization as a training-design problem rather than only a post-training cleanup problem.

What the evidence supports, and what it does not

The paper reports several empirical results that are more useful than the abstract summary alone.

In smaller TinyStories experiments, MemSinks achieves validation loss comparable to standard training while making repeated sequences less memorized after sink neurons are dropped. In the larger experiments, the authors train SmolLM-style models at 360M and 1.7B parameters on mixtures involving SlimPajama and repeated TinyStories data. They report that standard repetition can improve validation performance compared with removing repeated data, which confirms the practical tradeoff: repetition can help. But MemSinks preserves much of that generalization benefit while reducing memorization, closing at least half of the training-validation loss gap on repeated examples in the reported large-scale setting.¹

That “at least half” figure should be read carefully. It does not mean MemSinks removes all memorization. It means the method substantially reduces memorization relative to standard training while avoiding the capability cost of simple deduplication in the tested setup. That is already meaningful. It is not magic. We remain, tragically, in engineering.
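One way to make the "at least half" phrasing concrete is to write the gap as a number. The sketch below assumes the gap is measured as held-out loss minus loss on the repeated training examples, and that "closing" it means shrinking that difference relative to standard training. That reading is an interpretation, not a formula quoted from the paper, and the example values are invented for arithmetic only.

```python
def gap_closed(heldout_std, repeated_std, heldout_new, repeated_new):
    """Fraction of the memorization gap removed relative to standard training.

    gap = held-out loss - loss on repeated training examples.
    Returns 0.0 if the new method leaves the gap unchanged and 1.0 if the
    repeated examples are no easier than held-out text.
    """
    gap_std = heldout_std - repeated_std
    gap_new = heldout_new - repeated_new
    if gap_std <= 0:
        raise ValueError("standard training shows no gap to close")
    return 1.0 - gap_new / gap_std


# Invented numbers: the gap shrinks from 1.0 to 0.5, so half is closed.
example = gap_closed(heldout_std=2.0, repeated_std=1.0,
                     heldout_new=2.0, repeated_new=1.5)
```

A value of 0.5 on this scale still leaves the repeated examples noticeably easier for the model than fresh text, which is why the figure should not be read as full removal.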

The paper also studies robustness. MemSinks works across model sizes, with benefits appearing stronger as models scale. It tolerates small noise in sequence IDs, up to about 10% in the reported experiments, but degrades when IDs become highly inconsistent. This is an operationally important detail. The method assumes repeated sequences can be assigned reasonably stable identifiers. If a data pipeline cannot track repeated or near-repeated content with enough consistency, the training trick loses some of its leverage.
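Teams curious how their own pipeline would tolerate imperfect identifiers can simulate it before touching a training run. The sketch below assumes a simple noise model, replacing each ID with a uniformly random one at a given rate; the paper's exact noise procedure is not reproduced here, and the function names are hypothetical.

```python
import random


def noisy_ids(ids, noise_rate, n_ids, seed=0):
    """Replace each ID with a uniformly random ID with probability noise_rate.

    A replacement can land on the original ID by chance, so the observed
    corruption rate is slightly below noise_rate.
    """
    rng = random.Random(seed)
    out = []
    for i in ids:
        if rng.random() < noise_rate:
            out.append(rng.randrange(n_ids))
        else:
            out.append(i)
    return out


def id_consistency(clean, noisy):
    """Fraction of positions where the noisy ID still matches the clean one."""
    return sum(c == n for c, n in zip(clean, noisy)) / len(clean)
```

Measuring `id_consistency` on a real pipeline, for example by re-ingesting the same corpus twice and comparing assigned IDs, gives a rough sense of whether it sits near the tolerated range or well outside it.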

Paper result	What it supports	What it does not prove
Natural repeated text is harder to remove post-hoc than atypical canaries.	Realistic memorization can be entangled with general capability.	Every memorized item in every model is equally entangled.
Validation loss and repeated-sequence memorization can improve together.	Memorization may occur during useful learning, not only after it.	Standard validation curves alone can diagnose memorization.
Naïve gradient-masked localization hurts generalization and still causes removal sensitivity.	Localization must handle co-adaptation, not just storage location.	All architectural localization methods will fail.
MemSinks reduces memorization while preserving validation performance in tested settings.	Training-time isolation can outperform post-hoc cleanup and simple deduplication.	The method is ready for production frontier-scale deployment.
MemSinks tolerates modest sequence-ID noise.	Perfect metadata may not be required.	Metadata quality is unimportant.

This is the right way to read the paper: not as a final product announcement, but as evidence that memorization control may need to be built into training rather than bolted on afterward.

The business value is cheaper diagnosis, not just safer training

For companies building or fine-tuning models, the practical implication is not “implement MemSinks tomorrow morning before coffee.” The direct implementation burden is nontrivial. It affects training architecture, data identifiers, dropout behavior, and evaluation. Many businesses do not train billion-parameter models from scratch. They rent APIs, fine-tune open models, or run retrieval-augmented systems over private documents.

Still, the paper changes the business conversation in three ways.

First, it reframes memorization as a capacity allocation and governance issue. If repeated internal data is used for fine-tuning, the question is not only whether the dataset contains sensitive records. The question is whether training will make those records hard to remove without damaging the model. That is a lifecycle problem, not a one-time compliance checkbox.

Second, it makes data lineage more valuable. MemSinks relies on stable sequence identifiers. Even if an enterprise never uses this exact method, the underlying lesson generalizes: model governance improves when training examples can be tracked, grouped, deduplicated, repeated deliberately, and audited after the fact. “We trained on a folder somewhere” is not a governance strategy. It is a confession with cloud storage.

Third, it distinguishes three mitigation layers that are often collapsed into one.

Layer	What it controls	Typical business action
Dataset hygiene	Accidental repetition, train-test overlap, obvious sensitive records	deduplication, filtering, PII screening, copyright review
Training design	Where memorization is encouraged or discouraged inside the model	weighting, curriculum, regularization, architecture-aware methods
Post-training audit	Whether specific content can be extracted or removed	extraction tests, membership inference probes, unlearning evaluation

Most organizations over-invest rhetorically in the third layer because it sounds reassuring: audit, remove, certify. The paper’s point is that post-training audit may discover problems that training has already made expensive to fix. So the ROI is not merely safer training. It is cheaper diagnosis and cheaper remediation, because fewer memories become fused with the capability substrate in the first place.
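The data-lineage point can be made concrete with very little code. This is a minimal sketch, assuming that "the same sequence" means identical text after whitespace and case normalization; real pipelines will need a more careful rule. The function names and the 16-character truncation are illustrative choices.

```python
import hashlib
from collections import defaultdict


def stable_id(text):
    """Content-derived ID that survives re-ingestion, re-chunking of files,
    and trivial whitespace or case changes (an assumed normalization)."""
    normalized = " ".join(text.split()).casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def build_lineage(records):
    """Map each stable ID to every source it was seen in.

    records: iterable of (source_name, text) pairs.
    """
    lineage = defaultdict(list)
    for source, text in records:
        lineage[stable_id(text)].append(source)
    return dict(lineage)
```

With a registry like this, "which training examples came from that contract?" has an answer that does not involve searching a folder somewhere.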

Fine-tuning teams should pay attention, but not overread the result

The paper focuses on pretraining-style settings with repeated sequences and evaluates a particular training-time method. It is not a full theory of every fine-tuning workflow. Parameter-efficient fine-tuning, such as LoRA, may behave differently from full training. Recent work on LoRA memorization suggests that parameter-efficient updates can reduce leakage compared with full fine-tuning under some evaluation settings, though the details depend on the task, metric, and attack model.⁵

That matters for business users because many enterprise deployments are not training base models. They are adapting models using smaller private datasets. In those settings, memorization risk can come from different sources: small dataset size, repeated templates, narrow domain text, output style imitation, or retrieval systems that surface sensitive passages too directly. MemSinks does not eliminate the need for those controls.

The safer interpretation is this:

the paper directly shows that natural repeated sequences can become mechanistically entangled with general capabilities under standard training;
it directly shows that a training-time sink mechanism can reduce memorization while preserving useful repetition benefits in the tested settings;
Cognaptus infers that organizations should treat memorization governance as part of training and data-pipeline design, not only as a post-training audit task;
it remains uncertain how well the exact method transfers to frontier-scale proprietary training, messy enterprise corpora, multimodal models, and common API-based fine-tuning workflows.

That separation matters. Otherwise, every interesting paper becomes either a miracle or a nothingburger. Both are lazy. One just wears a shinier badge.

What to change in model governance

A practical takeaway from this paper is that memorization audits should be placed earlier in the model-development cycle.

Before training or fine-tuning, teams should identify which documents are repeated intentionally, which are repeated accidentally, and which should never be repeated at all. The distinction is not cosmetic. Accidental duplicates are usually candidates for removal. Intentional repetition should require a reason: domain balancing, rare skill reinforcement, format consistency, or evaluation coverage. Sensitive repetition should trigger additional controls.
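The three-way distinction above can be encoded as a pre-training report. The sketch assumes documents already carry identifiers and that someone maintains a record of approved reasons for repetition and a set of sensitive IDs; those inputs, the labels, and the function name are hypothetical.

```python
from collections import Counter


def triage_repetition(doc_ids, intentional_reasons, sensitive_ids):
    """Label every document ID that appears more than once.

    doc_ids:             one ID per training example, duplicates included.
    intentional_reasons: {doc_id: reason} for deliberately repeated items.
    sensitive_ids:       IDs that should trigger additional controls.
    Returns {doc_id: (count, label)} for repeated documents only.
    """
    report = {}
    for doc_id, n in Counter(doc_ids).items():
        if n < 2:
            continue
        # Sensitivity takes priority over any stated reason for repetition.
        if doc_id in sensitive_ids:
            label = "sensitive: add controls"
        elif doc_id in intentional_reasons:
            label = "intentional: " + intentional_reasons[doc_id]
        else:
            label = "accidental: candidate for removal"
        report[doc_id] = (n, label)
    return report
```

The design choice worth copying is the default: repetition without a recorded reason is treated as accidental.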

During training, teams should monitor more than aggregate loss. A smooth validation curve can coexist with growing memorization. For repeated or high-risk sequences, teams need targeted probes: loss-gap tracking, exposure-style metrics, extraction tests, and comparisons between repeated training examples and held-out examples from the same distribution. The point is not to panic every time the model improves. The point is to know when improvement is being purchased with brittle recall.
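Loss-gap tracking is the simplest of these probes to sketch. The code below assumes per-example losses are already logged at each checkpoint for two groups: repeated training examples and held-out examples from the same distribution. The window size and function names are arbitrary illustrative choices, not recommendations from the paper.

```python
from statistics import mean


def loss_gap(repeated_losses, heldout_losses):
    """Held-out loss minus repeated-example loss at one checkpoint.

    A larger value means the model finds the repeated examples much
    easier than comparable unseen text.
    """
    return mean(heldout_losses) - mean(repeated_losses)


def gap_is_widening(gaps, window=3):
    """True if the gap rose at each of the last `window` checkpoint steps."""
    if len(gaps) < window + 1:
        return False
    recent = gaps[-(window + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A widening gap alongside a healthy validation curve is the situation the paper warns about: the aggregate number looks fine while recall of specific sequences keeps sharpening.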

After training, unlearning should be evaluated as a capability tradeoff, not as a press release. If removing a memorized sequence damages related general performance, the problem may not be the unlearning tool alone. The problem may be that training created entanglement. That diagnosis changes the next iteration: adjust repetition, improve data identifiers, modify training design, or avoid using sensitive content as training material in the first place.

Boundary conditions: where the paper should not be stretched

The paper is a proof-of-concept, not a universal deployment recipe. Its experiments are carefully designed, but real-world corpora are messier. Documents appear in paraphrased forms, partial overlaps, translations, templates, boilerplate, code fragments, and quoted excerpts. Stable sequence IDs are harder to maintain when text is chunked, normalized, scraped, updated, and mixed across sources.
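Near-repeated content is where exact hashing stops helping. One common, simple approach is to compare word n-gram sets with Jaccard similarity; the sketch below uses it as a stand-in and is not something the paper specifies. The shingle size and any threshold applied to the score are assumptions a team would need to tune.

```python
def shingles(text, n=3):
    """Set of word n-grams; texts shorter than n words yield one short tuple."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the two texts' n-gram sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Two sentences differing in the last word share 3 of 5 distinct trigrams.
score = jaccard("the cat sat on the mat", "the cat sat on the rug")
```

Pairs above a chosen threshold could be assigned the same sequence ID, though paraphrases and translations will slip past any surface-level measure like this one.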

The method also assumes access to training internals. That is available to frontier labs and some open-model training teams, but not to most businesses using hosted APIs. For API users, the lesson is more indirect: demand stronger documentation about training-data controls, avoid sending sensitive fine-tuning examples unnecessarily, and build retrieval systems that reduce the need to bake private content into weights.

Finally, reduced memorization is not the same as no leakage. Extraction attacks have improved over time, and aligned chatbot behavior does not guarantee that memorized content is gone.⁴ A model can appear safe under ordinary prompts while remaining vulnerable under adversarial prompting. The paper helps explain how to reduce the structural problem. It does not repeal adversarial testing.
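A baseline extraction check is easy to write, which is part of why passing one proves little. The sketch below prompts with the first tokens of a known training sequence and tests whether the model returns the true continuation exactly. `generate` is a hypothetical callable standing in for whatever decoding interface a team has; the prefix and suffix lengths are arbitrary defaults.

```python
def extraction_rate(generate, sequences, prefix_len=50, suffix_len=50):
    """Fraction of sequences whose continuation is reproduced verbatim.

    generate(prefix_tokens, n) -> iterable of n generated tokens
    (hypothetical interface). Sequences shorter than
    prefix_len + suffix_len are skipped.
    """
    hits = 0
    eligible = 0
    for tokens in sequences:
        if len(tokens) < prefix_len + suffix_len:
            continue
        eligible += 1
        prefix = list(tokens[:prefix_len])
        target = list(tokens[prefix_len:prefix_len + suffix_len])
        output = list(generate(prefix, suffix_len))[:suffix_len]
        if output == target:
            hits += 1
    return hits / eligible if eligible else 0.0
```

This measures only the cooperative case: the right prefix and a single decoding pass. A rate of zero here says nothing about what a determined attacker can recover with other prompts or sampling strategies.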

Conclusion: the model did not forget; it learned where forgetting hurts

The hidden cost of training LLMs “too well” is not that models become intelligent and therefore dangerous in some cinematic sense. The more mundane problem is sharper: training can make useful capability and unwanted recall share the same internal machinery.

That is why memorization is difficult to govern. The model may not store a sensitive paragraph as a detachable object. It may encode that paragraph through the same pathways that help it understand similar text. By the time the compliance team asks for deletion, the answer may be: yes, technically, but please enjoy the performance regression.

Memorization Sinks is valuable because it points toward a different design principle. Do not wait for memorization to spread through the model and then attempt a heroic cleanup. Shape the training process so memorization has fewer places to hide and clearer places to remove.

For businesses, the message is not to chase every new training trick. The message is to stop treating memorization as an afterthought. Repetition, data lineage, training design, and post-training audits belong in the same governance conversation. Otherwise, the model will remember exactly what you wish it had forgotten — and forget only after you damage the parts you actually needed.

Cognaptus: Automate the Present, Incubate the Future.

¹ Gaurav R. Ghosal, Pratyush Maini, and Aditi Raghunathan, “Memorization Sinks: Isolating Memorization during LLM Training,” arXiv:2507.09937, 2025.
² Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramèr, and Chiyuan Zhang, “Quantifying Memorization Across Neural Language Models,” arXiv:2202.07646, 2022.
³ Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini, “Deduplicating Training Data Makes Language Models Better,” arXiv:2107.06499, 2021.
⁴ Nicholas Carlini et al., “Extracting Training Data from Large Language Models,” arXiv:2012.07805, 2020; Milad Nasr et al., “Scalable Extraction of Training Data from (Production) Language Models,” arXiv:2311.17035, 2023.
⁵ Fei Wang and Baochun Li, “Leaner Training, Lower Leakage: Revisiting Memorization in LLM Fine-Tuning with LoRA,” arXiv:2506.20856, 2025.
