AdamW and the Cost of Being Reasonable: Choosing LLM Optimizers Without Leaderboard Theater
GPU memory is the part of AI strategy that does not care about adjectives. A team can say it is building a domain LLM, a private copilot, a long-context research assistant, or a fine-tuned enterprise model. The budget spreadsheet eventually asks a colder question: what actually fits on the available hardware? Model weights need memory. Gradients need memory. Activations need memory. Checkpoints need memory. And the optimizer — the quiet machinery that decides how parameters move during training — can require multiple additional copies of the model itself. ...