When Models Forget on Purpose: Why Data Selection Matters More Than Data Volume

Opening: Why this matters now

The AI industry has spent the last three years chanting a single mantra: more data, bigger models. It worked, until it didn’t. Performance gains are slowing, training costs are ballooning, and regulators are starting to ask uncomfortable questions about memorization, leakage, and data provenance. The paper discussed here steps directly into this tension and makes a slightly heretical claim: what we remove from training data may matter more than what we add. ...

December 31, 2025 · 3 min · Zelina

When Data Comes in Boxes: Why Hierarchies Beat Sample Hoarding

Opening: Why this matters now

Modern machine learning has a data problem that money can’t easily solve: abundance without discernment. Models are no longer starved for samples; they’re overwhelmed by datasets (entire repositories, institutional archives, and web-scale collections), most of which are irrelevant, redundant, or quietly harmful. Yet the industry still behaves as if data arrives as loose grains of sand. In practice, data arrives in boxes: datasets bundled by source, license, domain, and institutional origin. Selecting the right boxes is now the binding constraint. ...

December 13, 2025 · 3 min · Zelina

Weight Watchers for LLMs: Dynamic Dieting Beats Static Selection

Most large language models (LLMs) are trained as if every piece of data is equally nutritious. But just as elite athletes optimize not just what they eat but when and how they eat it, a new paper proposes that LLMs can perform better if we learn to dynamically adjust their data “diet” during training.

The Static Selection Problem

Traditional data selection for LLMs is front-loaded and fixed: you decide what data to keep before training, often using reference datasets (e.g., Wikipedia) or reference models (e.g., GPT-3.5) to prune the lowest-quality examples. While effective in reducing cost, this approach ignores a key insight: an LLM’s preference for certain types of data evolves over time. ...

July 23, 2025 · 3 min · Zelina