
When Benchmarks Forget What They Learned

The leaderboard said “learning.” The model may have heard “storage.” Benchmarks are supposed to answer a simple business question: does this model actually perform the task? That sounds clean. A model receives a test. It gives answers. Someone turns the answers into a score. Procurement teams, product managers, investors, and mildly overconfident LinkedIn commentators then convert the score into a story about intelligence. The machinery is familiar enough to feel objective. ...

February 2, 2026 · 14 min · Zelina

When Models Start to Forget: The Hidden Cost of Training LLMs Too Well

Duplicates are supposed to be boring. In data engineering, duplicate records are usually treated as a hygiene problem: remove them, clean the pipeline, reduce noise, move on. In language-model training, repetition is less innocent. Repeated text can help a model learn an underrepresented domain. It can also teach the model to reproduce specific sequences too well. Somewhere between “useful exposure” and “verbatim recall,” a model stops learning only the pattern and starts carrying around the document. ...

January 3, 2026 · 16 min · Zelina

When Models Learn to Forget: Why Memorization Isn’t the Same as Intelligence

A contract clause appears in a chatbot response. Not a summary. Not a paraphrase. The clause itself, with the same odd phrasing, the same punctuation, and the same mildly embarrassing typo that legal counsel thought nobody outside the company would ever see. The model did not “reason” its way there. It remembered. ...

December 26, 2025 · 15 min · Zelina

Map Before You Train: Data Cartography to Defuse LLM Memorization

TL;DR for operators: Training data does not become risky only after a model has memorised it. It often leaves signals while training is still happening. That is the useful idea behind Generative Data Cartography, or GenDataCarto: track how each pretraining sample behaves during early training, then use that behaviour to decide which data should be kept, up-sampled, down-weighted, or removed.¹ The method uses two signals. The first is early loss, which approximates how difficult a sample is. The second is the frequency of “forget events”, where a sample appears learned and later becomes poorly fitted again. In the paper’s framing, frequent forget events are not just training noise. They are a warning that a sample may be unusually influential, repeatedly re-entering the model’s attention like a guest who refuses to leave the meeting. ...

September 4, 2025 · 16 min · Zelina
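
For readers who want a feel for the two GenDataCarto signals described in the teaser above, here is a minimal Python sketch. It assumes you already have a per-sample loss value logged at each training checkpoint. The function names, the `learned_threshold`, the cutoffs, and especially the mapping from signals to actions are illustrative assumptions, not the paper's actual procedure.

```python
from statistics import mean


def early_loss(losses, early_epochs=3):
    """Mean loss over the first few checkpoints: a rough difficulty proxy."""
    return mean(losses[:early_epochs])


def count_forget_events(losses, learned_threshold=1.0):
    """Count transitions from 'learned' (loss <= threshold) back to 'not learned'.

    The fixed threshold is an assumption made for this sketch.
    """
    events = 0
    was_learned = False
    for loss in losses:
        is_learned = loss <= learned_threshold
        if was_learned and not is_learned:
            events += 1
        was_learned = is_learned
    return events


def suggest_action(losses, difficulty_cutoff=2.0, forget_cutoff=2):
    """Hypothetical rule of thumb; the cutoffs and actions are NOT from the paper."""
    hard = early_loss(losses) > difficulty_cutoff
    unstable = count_forget_events(losses) >= forget_cutoff
    if unstable and hard:
        return "review or remove"
    if unstable:
        return "down-weight"
    if hard:
        return "up-sample"
    return "keep"


# Toy per-sample loss histories, one value per checkpoint (made-up numbers).
history = {
    "stable_easy": [1.5, 0.9, 0.8, 0.7, 0.6, 0.5],
    "hard_steady": [4.0, 3.5, 3.0, 2.0, 1.2, 0.9],
    "flickering": [1.2, 0.8, 1.4, 0.7, 1.6, 0.6],
}
actions = {name: suggest_action(losses) for name, losses in history.items()}
```

In this toy data, the `flickering` sample crosses the threshold and falls back twice, so it is the one flagged for down-weighting. A real pipeline would need to choose thresholds from the observed loss distribution rather than hard-coding them.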