Pretraining

The Scaling Law Got a Data Manager

TL;DR for operators: A useful scaling law does not merely say “bigger is better.” That is not a law; that is a purchasing department with a GPU account. The paper behind this article studies whether the composition of pretraining data can change the compute-optimal balance between model size and downstream data in jet classification.[1] The answer, in this setting, is yes. Training from scratch on JetClass produces a nearly balanced scaling rule: as compute grows, the optimal model size and dataset size grow at roughly similar rates. But after pretraining on a JetClass-II corpus augmented with Beyond Standard Model resonance decays, the compute-optimal rule shifts sharply toward downstream data. More of the next compute budget should be spent processing more examples, not inflating the model. ...
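For readers who want to see what "balanced" versus "data-leaning" means in arithmetic terms, here is a minimal sketch. It assumes the common power-law form, where optimal model size grows as C^a and optimal dataset size as C^b for compute C, with a + b = 1 (which holds if compute scales roughly as model size times examples processed). The function name and the exponent values are hypothetical illustrations, not numbers from the paper.

```python
def growth_factors(compute_multiplier: float, a: float, b: float) -> tuple:
    """Return (model_growth, data_growth) for a given compute multiplier.

    Assumes N_opt ~ C**a and D_opt ~ C**b. The exponents a and b are
    illustrative placeholders, not fitted values from any paper.
    """
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    return compute_multiplier ** a, compute_multiplier ** b


# Hypothetical regimes: a balanced rule and a data-leaning rule.
regimes = {
    "balanced (a=0.5, b=0.5)": (0.5, 0.5),
    "data-leaning (a=0.3, b=0.7)": (0.3, 0.7),
}

for name, (a, b) in regimes.items():
    model_x, data_x = growth_factors(10.0, a, b)
    print(f"{name}: 10x compute -> model x{model_x:.2f}, data x{data_x:.2f}")
```

Under the balanced exponents, a tenfold compute increase is split evenly between model and data. Under the data-leaning exponents, most of the increase goes to processing more examples, which is the direction of the shift the teaser describes.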

MoA Than One Curve: Teaching FFNs to Choose Their Nonlinearity

Model architecture has a recurring habit: when something works, we freeze it into a default and move the argument elsewhere. Attention gets the drama. Routing gets the diagrams. Context windows get the product demos. Meanwhile, the feedforward network sits there, quietly holding a large share of the parameters and applying the same nonlinearity to every token, every time, as if “one curve fits all” were a law of nature rather than a convenient engineering choice. ...

Filter Bubble Bursts: When Common Crawl Beats Clean Data

Cleaning is comforting. Every serious AI team has some version of the same ritual. Remove spam. Remove repetition. Remove bad language detection. Remove low-quality pages. Remove documents that look too weird, too short, too duplicated, too uneducational, too internet. Then hope the model learns from the respectable leftovers. That instinct is not foolish. In small or compute-constrained training runs, filtering often helps. The expensive mistake is treating that local truth as a permanent law. ...

When Data Decides What Matters: The Quiet Economics of LLM Data Selection

Budgets have a charming way of making AI strategy less philosophical. In the demo room, the question is usually whether a model can reason, code, summarize, plan, and sound pleasantly harmless while doing so. In the finance room, the question becomes simpler: how many tokens, how many GPUs, how many weeks, and why exactly are we paying to teach the model another version of the same web page? ...

Bias, Baked In: Why Pretraining, Not Fine-Tuning, Shapes LLM Behavior

TL;DR for operators: Fine-tuning is not a washing machine. It may polish, redirect, or occasionally muffle a model’s behavioural tendencies, but this paper suggests that many cognitive-bias patterns are already substantially shaped before instruction tuning begins. The study separates three possible sources of observed bias in large language models: the pretrained backbone, the instruction dataset, and random variation during fine-tuning. Its main finding is that models’ bias profiles cluster more strongly by pretrained model identity than by the instruction data used later. In plainer operational language: the base model carries a behavioural signature that survives downstream training. ...