Spectral Therapy for Transformers: Predicting Divergence Before It Hurts
Training failure has a special talent for arriving late. Not late in the philosophical sense. Late in the operational sense: after the run has already consumed GPU time, after the team has already waited, after the dashboard has already looked tolerable long enough to invite optimism. Then the loss spikes, the gradient norm goes feral, and everyone pretends this was “useful learning.” Sometimes it is. Often it is just expensive smoke. ...