Unlearning

Duplicates are supposed to be boring. In data engineering, duplicate records are usually treated as a hygiene problem: remove them, clean the pipeline, reduce noise, move on. In language-model training, repetition is less innocent. Repeated text can help a model learn an underrepresented domain. It can also teach the model to reproduce specific sequences too well. Somewhere between “useful exposure” and “verbatim recall,” a model stops learning only the pattern and starts carrying around the document. ...

Unlearning

Auditing the Illusion of Forgetting: When Unlearning Isn’t Enough

When Models Start to Forget: The Hidden Cost of Training LLMs Too Well