Spectral Therapy for Transformers: Predicting Divergence Before It Hurts
Opening — Why This Matters Now

Training instability in large transformers is not a theoretical inconvenience. It is a budget line item. When a 300M–7B parameter model diverges halfway through training, what disappears is not just gradient sanity — it is GPU hours, engineering time, and often, experimental momentum. Most practitioners discover instability reactively: a loss spike, an exploding norm, and then the quiet resignation of a terminated run. ...