Opening — Why this matters now

The scaling law debate was supposed to be settled. Bigger models, more data, more compute—loss falls predictably. Then came the uncomfortable question: what exactly is being scaled? If power laws in natural language data are the root cause, then scaling laws might be an artifact of language itself, not of learning. This paper dismantles that comfort.

Background — Context and prior art

Classic work by Kaplan et al. and, later, the Chinchilla study framed training as a resource-allocation problem across parameters, data, and compute. A parallel theoretical literature argued that power-law loss curves merely mirror power-law structure in datasets—Zipf’s law for text, spectral decay for images. The implication was subtle but dangerous: change the data, and the laws might vanish.
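
For reference, the canonical two-term scaling form from the Chinchilla line of work, and the Zipf-style statistic that theoretical literature leans on. These are shown in their textbook forms, not as the exact fits used in the paper under review:

```latex
% Chinchilla-style loss fit: N parameters, D training tokens,
% E the irreducible loss; A, B, alpha, beta are fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Zipf's law for text: the r-th most frequent word appears with frequency
f(r) \propto r^{-s}, \qquad s \approx 1
```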

Analysis — What the paper actually does

The authors strip language down to its skeleton. They train transformers on random walks over graphs: Erdős–Rényi graphs with no scale-free structure, Barabási–Albert graphs with explicit power laws, and progressively richer synthetic datasets generated by small transformers. In each case, the task is next-token prediction. Architecture is controlled. Optimization is careful. What changes is data complexity.
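
To make the data-generation idea concrete, here is a minimal sketch, assuming networkx and numpy, with node ids standing in for tokens. The graph sizes, edge probabilities, and walk lengths are illustrative choices, not the paper's settings:

```python
# Minimal sketch of the random-walk data idea (illustrative settings).
# Nodes of a graph become token ids; a random walk becomes a "sentence"
# for next-token prediction.
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def random_walk_corpus(graph, num_walks=1000, walk_len=64):
    """Sample fixed-length random walks; each walk is a sequence of node ids."""
    walks = []
    nodes = list(graph.nodes)
    for _ in range(num_walks):
        walk = [rng.choice(nodes)]
        for _ in range(walk_len - 1):
            neighbors = list(graph.neighbors(walk[-1]))
            if not neighbors:          # dead end: restart from a random node
                neighbors = nodes
            walk.append(rng.choice(neighbors))
        walks.append(walk)
    return np.array(walks)

# Erdos-Renyi graph: uniform edge probability, roughly Poisson degrees,
# i.e. no scale-free / power-law structure in the data.
er_graph = nx.erdos_renyi_graph(n=500, p=0.02, seed=0)
# Barabasi-Albert graph: preferential attachment gives an explicit
# power-law degree distribution.
ba_graph = nx.barabasi_albert_graph(n=500, m=5, seed=0)

er_tokens = random_walk_corpus(er_graph)   # shape: (num_walks, walk_len)
ba_tokens = random_walk_corpus(ba_graph)
# Next-token prediction: inputs are walk[:, :-1], targets are walk[:, 1:].
```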

Crucially, scaling laws persist even when the data has no power-law correlations. Loss still follows clean power laws in model size and dataset size. The paper also revisits how these laws are fit, arguing that many popular two-dimensional fits are over-parameterized. One-dimensional fits are often more stable and interpretable.
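
As a concrete picture of what a one-dimensional fit looks like, here is a minimal sketch using scipy's curve_fit. The loss values below are synthetic numbers for illustration only, not measurements from the paper:

```python
# One-dimensional power-law fit, L(N) = E + A * N**(-alpha), at a fixed
# data budget -- the kind of fit the paper argues is more stable than
# joint two-dimensional fits.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, E, A, alpha):
    return E + A * n ** (-alpha)

# Hypothetical (model size, loss) pairs at a fixed data budget.
model_sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses      = np.array([4.10, 3.62, 3.21, 2.90, 2.66])

params, _ = curve_fit(
    power_law, model_sizes, losses,
    p0=[2.0, 50.0, 0.3],        # rough initial guess
    maxfev=10_000,
)
E_hat, A_hat, alpha_hat = params
print(f"irreducible loss ~ {E_hat:.2f}, exponent alpha ~ {alpha_hat:.2f}")
```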

Findings — Results at a glance

| Dataset Type | Power-law Structure in Data | Scaling Laws Observed | Robustness |
| --- | --- | --- | --- |
| Random ER graphs | None | Yes | Moderate |
| Scale-free graphs | Explicit | Yes | Slightly weaker fits |
| Transformer-generated (TnL) | Implicit | Yes | Strong |
| Natural language | Strong | Yes | Strongest |

An important side result: a shallow, two-layer transformer with short context can already reproduce language-like scaling behavior. Depth is not the primary driver here.
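
To make "shallow, two layers, short context" concrete, here is a generic minimal decoder in PyTorch. The hyperparameters (vocabulary size, width, a context of 32 tokens) are illustrative assumptions, not the configuration used in the paper:

```python
# Generic sketch of a small next-token model: two transformer layers,
# short context window. Hyperparameters are illustrative only.
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size=512, d_model=128, n_layers=2, context=32):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(context, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq <= context)
        seq = tokens.shape[1]
        pos = torch.arange(seq, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Additive causal mask: -inf strictly above the diagonal.
        causal = torch.triu(
            torch.full((seq, seq), float("-inf"), device=tokens.device), diagonal=1
        )
        x = self.blocks(x, mask=causal)        # causal attention -> next-token setup
        return self.head(x)                    # logits: (batch, seq, vocab_size)

model = TinyDecoder()
logits = model(torch.randint(0, 512, (8, 32)))  # (8, 32, 512)
```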

Implications — Why practitioners should care

If scaling laws survive even in data without power laws, then they are telling us something about optimization and representation—not just statistics of text. Bigger models may simply open wider basins of low-loss solutions, making gradient descent more efficient. For practitioners, this reframes compute planning: scaling laws are not fragile. They are structural.

It also suggests that better optimization and alternative parameterizations (µP versus standard parameterization, for example) may shift the efficiency of scaling rather than abolish it. The uncomfortable conclusion: we are probably still underestimating how far scaling can go.
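
One way to read "shift, not abolish" against the one-dimensional fit above, offered here as an illustrative framing rather than a quantitative claim from the paper: a better optimizer or parameterization plausibly moves the prefactor, not the exponent.

```latex
% Illustrative reading: optimization and parameterization choices move A
% (and how quickly E is approached), while the exponent alpha -- the law
% itself -- persists.
L(N) = E + A\,N^{-\alpha}
```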

Conclusion

This paper doesn’t kill scaling laws. It removes their alibi. Power laws in data are not the cause; they’re optional. Scaling appears to be a property of how neural networks learn, not just what they learn from.

Cognaptus: Automate the Present, Incubate the Future.