
When Models Forget on Purpose: Why Data Selection Matters More Than Data Volume

Opening — Why this matters now: The AI industry has spent the last three years chanting a single mantra: more data, bigger models. It worked—until it didn’t. Performance gains are slowing, training costs are ballooning, and regulators are starting to ask uncomfortable questions about memorization, leakage, and data provenance. The paper behind this article steps directly into this tension and makes a slightly heretical claim: what we remove from training data may matter more than what we add. ...

December 31, 2025 · 3 min · Zelina

When Models Look Back: Memory, Leakage, and the Quiet Failure Modes of LLM Training

Opening — Why this matters now: Large language models are getting better at many things—reasoning, coding, multi‑modal perception. But one capability remains quietly uncomfortable: remembering things they were never meant to remember. The paper underlying this article dissects memorization not as a moral failure or an anecdotal embarrassment, but as a structural property of modern LLM training. The uncomfortable conclusion is simple: memorization is not an edge case. It is a predictable outcome of how we scale data, objectives, and optimization. ...

December 30, 2025 · 3 min · Zelina

When Models Learn to Forget: Why Memorization Isn’t the Same as Intelligence

Opening — Why this matters now: Large language models are getting better at everything—reasoning, coding, writing, even pretending to think. Yet beneath the polished surface lies an old, uncomfortable question: are these models learning, or are they remembering? The distinction used to be academic. It no longer is. As models scale, so does the risk that they silently memorize fragments of their training data—code snippets, proprietary text, personal information—then reproduce them when prompted. Recent research forces us to confront this problem directly, not with hand-waving assurances, but with careful isolation of where memorization lives inside a model. ...

December 26, 2025 · 3 min · Zelina

Reading Between the Weights: When Models Remember Too Much

Opening — Why this matters now: For years, we have comforted ourselves with a tidy distinction: models generalize, databases memorize. Recent research quietly dismantles that boundary. As LLMs scale, memorization is no longer an edge case—it becomes a structural property. That matters if you care about data leakage, IP exposure, or regulatory surprises arriving late but billing retroactively. ...

December 23, 2025 · 2 min · Zelina

Map Before You Train: Data Cartography to Defuse LLM Memorization

Generative models leak. Not because engineers are careless, but because web-scale corpora hide rare, high-influence shards—snippets so unique that gradient descent can’t help but memorize them. A new data-first method, Generative Data Cartography (GenDataCarto), gives teams a way to see those shards in training dynamics and intervene—surgically, not bluntly—before they become liabilities. The one-slide idea: track two numbers for every pretraining sample. Difficulty (dᵢ) is the early-epoch average loss—how hard the sample was to learn initially. Memorization (mᵢ) is the fraction of epochs with forget events (the loss falls below a threshold, then pops back above it)—how often the model “refits” the same sample. Plot (dᵢ, mᵢ), set percentile thresholds, and you get a four-quadrant map that tells you what to up-sample, down-weight, or drop to reduce leakage with minimal perplexity cost. ...
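
To make the two coordinates concrete, here is a minimal NumPy sketch of how such a map could be assembled from per-sample, per-epoch loss logs. The function name, the forget-event threshold, the percentile cut, and the quadrant encoding are illustrative assumptions for exposition, not GenDataCarto's reference implementation.

```python
import numpy as np

def gendatacarto_map(losses, early_epochs=3, forget_threshold=2.0, pct=75):
    """Toy sketch of a difficulty/memorization data map.

    losses: array of shape (n_epochs, n_samples) holding each sample's
    training loss recorded once per epoch. All constants are illustrative.
    """
    losses = np.asarray(losses, dtype=float)

    # Difficulty d_i: average loss over the first few ("early") epochs.
    d = losses[:early_epochs].mean(axis=0)

    # A forget event: loss is below the threshold at epoch t, then back above it at t+1.
    below = losses < forget_threshold
    forget_events = below[:-1] & ~below[1:]      # shape (n_epochs - 1, n_samples)

    # Memorization m_i: fraction of epoch transitions showing a forget event
    # (a simple proxy for "fraction of epochs with forget events").
    m = forget_events.mean(axis=0)

    # Percentile cuts split the (d, m) plane into four quadrants. Which quadrant
    # gets up-sampled, down-weighted, or dropped is a policy choice layered on top.
    d_cut, m_cut = np.percentile(d, pct), np.percentile(m, pct)
    quadrant = (d >= d_cut).astype(int) * 2 + (m >= m_cut).astype(int)
    return d, m, quadrant   # 0 = easy/stable, 1 = easy/refit, 2 = hard/stable, 3 = hard/refit
```

With the map in hand, interventions are applied per quadrant rather than corpus-wide, which is the "surgically, not bluntly" point of the excerpt above.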

September 4, 2025 · 4 min · Zelina

What LLMs Remember—and Why: Unpacking the Entropy-Memorization Law

The best kind of privacy leak is the one you can measure. A recent paper by Huang et al. introduces a deceptively simple but powerful principle—the Entropy-Memorization Law—that allows us to do just that. It claims that the entropy of a text sequence is strongly correlated with how easily it’s memorized by a large language model (LLM). But don’t mistake this for just another alignment paper. This law has concrete implications for how we audit models, design prompts, and build privacy-aware systems. Here’s why it matters. ...
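
As a rough illustration of the quantity at the heart of the law, the sketch below estimates the empirical Shannon entropy of a token sequence. This is one plausible estimator chosen for illustration; the paper's exact entropy definition, and the strength and direction of its correlation with memorization, come from Huang et al. rather than from this snippet. In an audit, you would compute an entropy score per sequence and correlate it against a memorization measurement.

```python
import math
from collections import Counter

def empirical_entropy(tokens):
    """Empirical Shannon entropy (bits per token) of a token sequence.

    One simple estimator for illustration; the Entropy-Memorization Law paper
    may define sequence entropy differently (e.g., under a reference model).
    """
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A repetitive sequence yields lower empirical entropy than a varied one.
print(empirical_entropy("the cat sat on the mat the cat sat on the mat".split()))
print(empirical_entropy("entropy measures how surprising each token is on average".split()))
```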

July 13, 2025 · 4 min · Zelina