AdamW and the Cost of Being Reasonable: Choosing LLM Optimizers Without Leaderboard Theater

GPU memory is the part of AI strategy that does not care about adjectives.

A team can say it is building a domain LLM, a private copilot, a long-context research assistant, or a fine-tuned enterprise model. The budget spreadsheet eventually asks a colder question: what actually fits on the available hardware? Model weights need memory. Gradients need memory. Activations need memory. Checkpoints need memory. And the optimizer — the quiet machinery that decides how parameters move during training — can require multiple additional copies of the model itself.

That is the useful business reading of Aditya Ranganath’s survey, Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers.¹ The paper is not a new optimizer paper, and it does not claim that one method has now dethroned AdamW across all LLM training. Its contribution is more sober and more useful: it organizes the optimizer landscape as a joint problem in optimization, memory, numerics, implementation, and distributed systems.

For executives and technical product owners, that matters because optimizer choice is not just a research preference. It can decide whether a training run uses a larger model, a longer context window, a larger microbatch, a full-parameter fine-tune, or a cheaper hardware configuration. In other words, the optimizer is a capital-allocation lever. A rather mathematical lever, unfortunately, but still a lever.

AdamW is not glamorous, which is why it is hard to beat

The paper starts from the fact that AdamW remains the reference optimizer for modern LLM training. This is not because researchers suffer from a tragic shortage of imagination. AdamW combines momentum, coordinate-wise adaptive scaling, and decoupled weight decay into a recipe that is robust, familiar, and supported by mature training stacks.

The usual Adam-style intuition is simple enough. The optimizer keeps a moving average of gradients and a moving average of squared gradients. The first helps smooth direction. The second rescales coordinates whose gradient magnitudes differ. AdamW then decouples weight decay from that adaptive gradient update, making regularization less entangled with the adaptive preconditioner.

The simplified update story is:

$$ \text{direction} \approx \frac{\text{first moment}}{\sqrt{\text{second moment}} + \epsilon} $$

That denominator is why Adam-style methods are often forgiving in large, heterogeneous neural networks. Transformer parameters do not all behave alike. Attention projections, feed-forward layers, embeddings, normalization parameters, and output heads have different gradient scales and training roles. A single scalar learning rate, as in plain SGD, is often too blunt.
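The update story above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard AdamW step, not code from the survey; the function name `adamw_step` and the default hyperparameters are assumptions chosen for the example.

```python
import numpy as np

def adamw_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update on a single array; returns (param, m, v).

    Illustrative sketch only: names and defaults are assumptions.
    """
    # First moment: moving average of gradients (smooths direction).
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment: moving average of squared gradients (per-coordinate scale).
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages; step counts from 1.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # Decoupled weight decay: shrink weights directly, outside the adaptive ratio.
    param = param * (1.0 - lr * weight_decay)
    # Adaptive step: first moment divided by the root of the second moment.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Gradients that differ by two orders of magnitude...
p = np.ones(3)
g = np.array([1.0, 10.0, 100.0])
p, m, v = adamw_step(p, g, np.zeros(3), np.zeros(3), step=1)
# ...produce nearly identical step sizes, because each coordinate is rescaled.
print(p)
```

Note that the function carries two extra arrays, `m` and `v`, each the same shape as the parameter. That detail is the whole memory story in miniature.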

But AdamW’s strength is also its cost. For every trainable parameter, it commonly stores two optimizer-state values: the first moment and the second moment. The paper uses a useful back-of-the-envelope number: for a one-billion-parameter model, one fp32 optimizer-state vector is roughly 4 GB, so AdamW’s two moment vectors alone require roughly 8 GB before counting model parameters, gradients, activations, fragmentation, communication buffers, checkpoints, or framework overhead.

That number is not a decorative statistic. It is the point. If optimizer state consumes the memory that could have gone to context length or batch size, the optimizer is not merely “training machinery.” It is part of the feasible product design.
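The back-of-the-envelope figure is easy to reproduce. The helper below is a hypothetical illustration (the name `state_gb` is an assumption), and it counts only dense per-parameter optimizer states at a fixed byte width, using 1 GB = 10^9 bytes.

```python
def state_gb(num_params, bytes_per_value=4, num_states=2):
    """Optimizer-state size in GB (10**9 bytes) for dense per-parameter states."""
    return num_params * bytes_per_value * num_states / 1e9

one_vector = state_gb(1_000_000_000, num_states=1)  # one fp32 state vector
adamw_1b = state_gb(1_000_000_000)                  # first + second moments
adamw_7b = state_gb(7_000_000_000)                  # same arithmetic at 7B parameters
print(one_vector, adamw_1b, adamw_7b)  # 4.0 8.0 56.0
```

The arithmetic is deliberately naive. It excludes weights, gradients, activations, and every buffer a real framework allocates, which is exactly why the real number is larger.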

The first comparison is not AdamW versus novelty; it is AdamW versus the constraint

The lazy reading of the optimizer literature is: new method beats AdamW, therefore use new method. The paper’s better reading is: first identify the binding constraint, then compare optimizers against that constraint.

Optimizer family	What it tries to improve	What it may cost	Practical question
AdamW	Robust adaptive training	High optimizer-state memory	Is the baseline already strong enough and affordable?
Adafactor / 8-bit / Adam-mini / LOMO	Memory use	Approximation, tuning, implementation details	Does memory saving enable a better training regime?
Lion / sign-based methods	Simpler updates, less state	Learning-rate and weight-decay sensitivity	Does sign information preserve enough useful gradient structure?
Sophia / Shampoo-like methods	Curvature or better conditioning	Extra compute, state, and tuning complexity	Does token efficiency survive wall-clock accounting?
GaLore / projection methods	Low-rank memory reduction	Projection error and refresh overhead	Is gradient structure stable enough to compress?
Muon / matrix-based methods	Matrix-aware update geometry	Orthogonalization cost, parameter grouping complexity	Do Transformer matrices benefit from matrix-level normalization?

This table is the article’s central comparison. The paper’s value is not that it recites optimizer names. It shows why those names correspond to different bets.

A memory-efficient optimizer says: maybe we do not need two full moment vectors everywhere. A curvature-aware optimizer says: maybe a better update direction can reduce the number of tokens or steps needed. A low-rank optimizer says: maybe updates live in a smaller effective subspace. A matrix-based optimizer says: maybe treating a Transformer weight matrix as millions of unrelated scalars is a strange habit we inherited because it was convenient.

All of these bets may be intelligent. None is free.

Memory-efficient optimizers change what can be attempted

Memory-efficient optimizers are the easiest family to translate into business language because the constraint is brutally concrete: the run fits, or it does not.

The paper discusses several design patterns. Adafactor factorizes second-moment statistics for matrix-shaped parameters, storing row and column statistics rather than a full matrix of per-coordinate values. Low-bit optimizers store optimizer states in reduced precision, preserving the Adam-like state structure but compressing it. Adam-mini reduces redundant adaptivity by grouping parameters so that not every scalar receives its own independent adaptive statistic. LOMO is aimed at full-parameter fine-tuning under limited memory: it fuses gradient computation with the parameter update, so gradients for the whole model do not have to be held in memory at the same time.
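To make the Adafactor idea concrete, here is a simplified sketch of a factored second moment for one weight matrix. It is an assumption-laden illustration, not Adafactor as published: the function name is hypothetical, and the real method adds a small stabilising constant, a decay schedule, and update clipping that are omitted here.

```python
import numpy as np

def factored_second_moment(grad, row_stat, col_stat, beta2=0.999):
    """Track row and column averages of grad**2 instead of a full matrix.

    Simplified sketch; omits the stabilising constants of real Adafactor.
    """
    sq = grad * grad
    # One statistic per row and one per column: n + m values, not n * m.
    row_stat = beta2 * row_stat + (1.0 - beta2) * sq.mean(axis=1)
    col_stat = beta2 * col_stat + (1.0 - beta2) * sq.mean(axis=0)
    # Rank-1 reconstruction of the per-coordinate second moment.
    v_approx = np.outer(row_stat, col_stat) / row_stat.mean()
    return v_approx, row_stat, col_stat

# When grad**2 happens to be rank-1, the reconstruction is exact (beta2=0 for clarity).
g = np.outer([1.0, 2.0], [1.0, 2.0, 3.0])
v_approx, r, c = factored_second_moment(g, np.zeros(2), np.zeros(3), beta2=0.0)
stored, full = r.size + c.size, g.size  # 5 values kept instead of 6
```

On a toy 2×3 matrix the saving is trivial. On a 4096×4096 projection, the same trick stores 8,192 values where a full second-moment matrix would store about 16.8 million. The price is that real squared gradients are not rank-1, so the reconstruction is an approximation.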

The important distinction is that these methods do not all “save memory” in the same sense.

Adafactor changes the statistical approximation. Low-bit optimizers change storage precision. Adam-mini changes the granularity of adaptivity. LOMO changes the fine-tuning procedure. GaLore, discussed later in the paper, projects gradients into a lower-dimensional subspace so optimizer states can be maintained in that compressed space.
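"Changing storage precision" can also be made concrete. The sketch below uses plain linear absmax quantization per block, which is a simplifying assumption: published 8-bit optimizers use more careful non-uniform schemes, and the function names here are hypothetical rather than any library's API.

```python
import numpy as np

def quantize_blockwise(state, block=64):
    """Linear absmax int8 quantisation per block; returns (codes, scales)."""
    flat = state.astype(np.float32).ravel()
    pad = (-flat.size) % block                   # pad so the length divides evenly
    blocks = np.pad(flat, (0, pad)).reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0                    # avoid dividing by zero on empty blocks
    codes = np.round(blocks / scales * 127).astype(np.int8)
    return codes, scales.astype(np.float32)

def dequantize_blockwise(codes, scales, shape):
    """Recover an fp32 approximation of the original state."""
    flat = (codes.astype(np.float32) / 127) * scales
    return flat.ravel()[: int(np.prod(shape))].reshape(shape)

rng = np.random.default_rng(0)
state = rng.normal(size=(10, 100)).astype(np.float32)
codes, scales = quantize_blockwise(state)
restored = dequantize_blockwise(codes, scales, state.shape)
max_error = float(np.abs(restored - state).max())
bytes_before, bytes_after = state.nbytes, codes.nbytes + scales.nbytes
```

The state structure is unchanged; only the number of bytes per value shrinks, at the cost of a bounded rounding error and the extra quantize/dequantize work on every step.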

So the business question should not be: which method has the smallest optimizer state on a slide? That is procurement theater with equations.

The better question is:

$$ \text{Saved memory} \rightarrow \text{changed feasible regime} \rightarrow \text{better model or cheaper run} $$

If a memory-efficient optimizer merely saves state but slows convergence enough to require substantially more training, the win may disappear. If it saves enough memory to train a larger model, use a longer context window, or run full-parameter fine-tuning where AdamW would force adapters, the value can be real even if fixed-model loss curves are unimpressive.

That is why the paper stresses both fixed-model and fixed-resource evaluation. Fixed-model evaluation asks: on the same model, same data, same token budget, does the optimizer perform well? Fixed-resource evaluation asks the question a business actually pays for: under the same hardware, memory, or wall-clock budget, which training setup produces the best usable model?

Those are not the same question. Confusing them is a good way to buy GPUs and call it strategy.

Sign-based and curvature-aware optimizers make opposite compromises

Sign-based optimizers such as Lion move away from AdamW by simplifying the update. Instead of preserving continuous gradient magnitudes and maintaining a second-moment vector, Lion uses the sign of a momentum-like quantity. The gain is simplicity and lower state memory, since only one momentum buffer is kept per parameter rather than two moment vectors. The cost is that coordinate-wise magnitude information is discarded.

This does not make Lion naïve. It makes its hypothesis clear. It asks whether the direction of a momentum-informed update is enough, and whether AdamW’s second-moment machinery is more expensive than necessary in some regimes.

The catch is scale control. In a sign-based update, many coordinates receive updates with similar magnitude, so the learning rate becomes especially consequential. Weight decay, schedule design, clipping, and momentum settings also stop being transferable from AdamW by default. A fair comparison cannot simply paste AdamW hyperparameters into Lion and then act surprised. Nor can it give Lion an elaborate tuning budget while letting AdamW stumble around in default settings. Both versions of unfairness are common enough to deserve their own tiny museum.
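A minimal sketch of a Lion-style step shows why scale control becomes the issue. The function name and defaults are illustrative assumptions; the structure (sign of an interpolated momentum, decoupled weight decay, one state buffer) follows the commonly described form of the method.

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion-style update; returns (param, m). Illustrative sketch only."""
    # Direction: only the sign of a momentum/gradient blend survives.
    update = np.sign(beta1 * m + (1.0 - beta1) * grad)
    # Every coordinate moves by exactly lr (plus decoupled weight decay).
    param = param - lr * (update + weight_decay * param)
    # The single state buffer: a moving average of gradients.
    m = beta2 * m + (1.0 - beta2) * grad
    return param, m

p = np.zeros(3)
g = np.array([0.001, -5.0, 200.0])
p, m = lion_step(p, g, np.zeros(3))
print(p)  # each entry has magnitude lr, regardless of gradient size
```

A gradient of 0.001 and a gradient of 200 produce the same step length. That is the hypothesis, and it is also the reason an AdamW learning rate cannot simply be reused.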

Curvature-aware methods make almost the opposite bet. Instead of throwing away some magnitude information, they try to incorporate richer information about the local geometry of the loss. Sophia uses diagonal Hessian-like estimates. Shampoo-style methods use structured matrix or tensor preconditioners. SOAP-like methods explore structured preconditioning between Adam-style diagonal scaling and heavier matrix preconditioning.

The appeal is obvious: if curvature information gives a better-conditioned update, the model may need fewer steps or fewer tokens to reach a target loss. The problem is equally obvious: curvature costs memory, compute, numerical care, and implementation complexity.

A curvature-aware optimizer is not better merely because its loss curve falls faster per token. The paper’s benchmark logic forces the harder comparison:

$$ \text{Fewer tokens or steps} \quad \text{versus} \quad \text{more expensive steps} $$

If the method improves token efficiency but loses wall-clock efficiency, its practical value depends on the cost structure of the training run. If the team is token-limited, the trade-off may be attractive. If the team is hardware-time-limited, it may not be. This is the kind of nuance that does not fit nicely into a leaderboard cell, which is probably why leaderboards try to avoid it.
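The trade-off reduces to simple arithmetic once both sides are measured. The helper names and every number below are hypothetical placeholders, not results from the paper; the sketch only shows how to find the per-step cost at which a method that needs fewer steps stops paying for itself.

```python
def wall_clock_hours(steps, seconds_per_step):
    """Total training time in hours."""
    return steps * seconds_per_step / 3600.0

def break_even_step_cost(baseline_steps, baseline_sec_per_step, candidate_steps):
    """Largest per-step time at which the candidate still ties the baseline."""
    return baseline_steps * baseline_sec_per_step / candidate_steps

# Placeholder scenario: the candidate reaches the target loss in 30% fewer steps.
baseline_steps, baseline_sec = 100_000, 1.0
candidate_steps = 70_000
limit = break_even_step_cost(baseline_steps, baseline_sec, candidate_steps)
print(round(limit, 2))  # 1.43: the candidate may be at most ~43% slower per step
```

If the measured step time of the candidate is above that limit on the team's own hardware, the token-efficiency win is a loss on the calendar.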

Low-rank and matrix-based optimizers stop pretending Transformer weights are just long vectors

The paper’s most interesting conceptual shift is the move from scalar-coordinate thinking toward structure-aware optimization.

AdamW treats parameters coordinate by coordinate. This is convenient, robust, and scalable. But Transformer models are dominated by large dense matrices: query, key, value, output, feed-forward up-projection, down-projection, gated projection, embeddings, and output heads. These matrices are not just bags of scalars. They are linear maps acting on representations.

Low-rank optimizers such as GaLore exploit one kind of structure. The claim is not that the model itself becomes low-rank in the same way as adapter fine-tuning. GaLore still updates the full weight matrix. The low-rank idea applies to the optimizer’s handling of gradients and states: project the gradient into a lower-dimensional subspace, maintain optimizer states there, then map the update back.

That distinction matters. LoRA-style parameter-efficient fine-tuning freezes most base weights and adds low-rank trainable adapters. GaLore-style optimization uses low-rank structure to reduce optimizer memory while still performing full-parameter updates. One is an adaptation architecture. The other is an optimizer-memory strategy. They sit near each other conceptually, but they are not the same animal. Confusing them is not fatal, but it does suggest the taxonomy has not yet done its job.

Projection methods introduce a new operational issue: subspace staleness. If the projection basis is refreshed frequently, the method adapts but pays extra compute. If it is refreshed rarely, it saves overhead but may miss changing gradient directions. The right rank and refresh interval are not minor settings. They are part of the method’s actual performance.
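The project, update, map-back loop can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the basis comes from a plain SVD of the current gradient, and the optimizer that would run inside the compact space is left out entirely.

```python
import numpy as np

def refresh_projector(grad, rank):
    """Top-`rank` left singular vectors of the current gradient (m x rank)."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]

def project(grad, projector):
    """(m x n) gradient -> (rank x n): optimizer states would live at this size."""
    return projector.T @ grad

def project_back(low_rank_update, projector):
    """(rank x n) update -> (m x n): the full weight matrix is still updated."""
    return projector @ low_rank_update

rng = np.random.default_rng(0)
grad = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))  # an exactly rank-2 gradient
P = refresh_projector(grad, rank=2)
compact = project(grad, P)
restored = project_back(compact, P)
```

Here the gradient is rank-2 by construction, so nothing is lost. In training, the gradient drifts away from the subspace spanned by `P` between refreshes, which is the staleness problem: calling `refresh_projector` more often costs an SVD each time, and calling it less often means projecting onto an outdated basis.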

Matrix-based optimizers such as Muon take another route. Instead of compressing updates into low-rank subspaces, they transform update matrices according to matrix-level geometry. Muon forms a momentum update and applies approximate orthogonalization, commonly through Newton–Schulz-style iterations. The goal is to produce matrix updates with more controlled singular-value structure.
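Approximate orthogonalization is less exotic than it sounds. The sketch below uses the classic cubic Newton-Schulz iteration, which is an assumption made for clarity: public Muon implementations reportedly use a tuned higher-order polynomial and only a handful of iterations, and the function name here is hypothetical.

```python
import numpy as np

def newton_schulz_orthogonalize(update, steps=30):
    """Push all singular values of `update` toward 1 using only matmuls.

    Classic cubic iteration; a sketch, not the tuned variant used in practice.
    """
    # Frobenius normalisation keeps every singular value at or below 1,
    # which is inside the convergence region of the iteration.
    x = update / (np.linalg.norm(update) + 1e-12)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which tends to 1.
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x

rng = np.random.default_rng(0)
momentum = rng.normal(size=(4, 6))           # stand-in for a momentum matrix
ortho = newton_schulz_orthogonalize(momentum)
gram = ortho @ ortho.T                        # close to the 4 x 4 identity
```

The iteration needs no SVD, only matrix multiplications, which is why it suits accelerators. It also shows where the overhead comes from: several extra matmuls per weight matrix per step.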

This is a different bet from AdamW, Lion, and GaLore:

Method	Main object of attention	Core bet
AdamW	Scalar coordinates	Per-coordinate adaptivity is robust enough to justify memory cost.
Lion	Coordinate-wise signs	Momentum direction matters more than full magnitude scaling.
GaLore	Low-rank projected gradients	Important update information can be represented in a smaller subspace.
Muon	Matrix-level update geometry	Transformer weight matrices benefit from orthogonalized or normalized updates.

Muon is therefore not just “AdamW but cheaper,” and GaLore is not just “LoRA but optimizer-shaped.” These methods ask different questions. That is why comparison-based reading is necessary. A name-by-name summary would flatten the actual decision space into a catalog. Useful for a glossary. Less useful for spending money.

The benchmark section is the paper’s practical spine

The survey’s strongest business contribution is its benchmarking lens. It lists the ways optimizer comparisons go wrong: under-tuned AdamW baselines, unequal hyperparameter search budgets, early-curve overclaiming, ignoring wall-clock cost, incomplete memory reporting, small-scale extrapolation, and implementation confounding.

Each problem corresponds to a familiar business failure mode.

Benchmark pitfall	Why it misleads	Business consequence
Weak AdamW baseline	Makes the new optimizer look better than it is	Premature migration from a stable training stack
Unequal tuning budget	Rewards search effort, not algorithm quality	False confidence in reported gains
Early-curve overclaiming	Speedup may vanish before final loss	Training plan underestimates total cost
Token-only reporting	Ignores slower per-step computation	“Efficient” method burns more calendar time
Incomplete memory accounting	Hides activations, buffers, sharding effects	Expected larger context or batch does not fit
Small-scale extrapolation	Proxy result may not survive LLM scale	Expensive pilot fails when scaled
Implementation confounding	Kernel quality dominates algorithm	Team buys a paper result but inherits software debt

The fixed-model versus fixed-resource distinction deserves special attention.

A fixed-model benchmark is cleaner science: same architecture, same dataset, same token budget, same precision, same hardware. It isolates optimizer behavior as much as possible.

A fixed-resource benchmark is messier but often more relevant: same hardware or memory budget, but each optimizer may enable different feasible choices. If a memory-efficient optimizer permits a longer context window or a larger batch, its advantage may only appear under fixed-resource evaluation. If it cannot convert memory savings into a better training configuration, the saved memory is mostly a nice engineering souvenir.
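A fixed-resource comparison can be expressed as a filter followed by a ranking. Everything in this sketch is hypothetical: the `Candidate` record, the run names, and especially the numbers, which are made-up placeholders rather than measurements.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    peak_memory_gb: float
    wall_clock_hours: float
    val_loss: float

def best_under_budget(candidates, memory_gb, hours):
    """Lowest validation loss among runs that actually fit the budget."""
    feasible = [c for c in candidates
                if c.peak_memory_gb <= memory_gb and c.wall_clock_hours <= hours]
    return min(feasible, key=lambda c: c.val_loss) if feasible else None

# Placeholder values for illustration only; not results from any experiment.
runs = [
    Candidate("adamw_ctx4k", 70.0, 20.0, 2.10),
    Candidate("lowmem_ctx4k", 55.0, 22.0, 2.12),
    Candidate("adamw_ctx16k", 110.0, 28.0, 2.00),
    Candidate("lowmem_ctx16k", 78.0, 30.0, 2.02),
]
winner = best_under_budget(runs, memory_gb=80.0, hours=32.0)
print(winner.name)  # lowmem_ctx16k
```

In this invented scenario the low-memory optimizer loses at equal context length and still wins the fixed-resource comparison, because it is the only one that can afford the longer context. A fixed-model table would never show that.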

The paper also insists that wall-clock time must stand beside token efficiency. This is not pedantry. Curvature-aware methods may need fewer tokens but perform more work per step. Low-rank methods may save memory but spend time computing projection bases. Matrix-based methods may benefit from accelerator-friendly matrix multiplication, or they may add orthogonalization overhead. Quantized optimizers may reduce memory but introduce quantization/dequantization work.

The only adult question is the joint one:

$$ \text{final quality} \;\big|\; \text{hardware budget, memory budget, wall-clock budget, stability constraints} $$

Anything else is a partial answer pretending to be a decision.

What Cognaptus infers for business use

The paper directly shows a taxonomy and an evaluation framework. It does not prove that a particular optimizer should replace AdamW in enterprise LLM work. The business inference is therefore conditional: optimizer choice should be treated as a resource-allocation decision tied to the training regime.

For teams doing ordinary supervised fine-tuning on manageable models, the practical answer may still be boring: use a strong AdamW baseline unless memory pressure or cost pressure is clearly binding. Boring is underrated. Boring often ships.

For teams trying full-parameter fine-tuning under limited GPU memory, memory-efficient methods become more relevant. LOMO-like approaches, Adafactor-style factorization, low-bit states, grouped adaptivity, or low-rank projection can matter because they may change what is feasible. The correct evaluation is not only final validation loss. It is whether the method enables full-parameter adaptation, longer sequences, or larger microbatches without creating instability.

For teams training or continuing pretraining domain models, the bar is higher. Long-horizon stability, throughput, and scale behavior matter more than a nice early loss curve. A method that looks attractive at small scale must be tested across token budgets and model sizes before it becomes infrastructure policy.

For long-context training, optimizer memory competes directly with activation memory. This makes memory-efficient optimizers strategically more interesting, but it also makes incomplete memory accounting more dangerous. Saving optimizer state while increasing temporary buffers may not help the context window. The memory ledger must include parameters, gradients, optimizer state, activations, temporary buffers, communication buffers, checkpointing, and sharding effects.
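One way to enforce a complete ledger is to refuse to report a total until every line item has been measured. The sketch below is a hypothetical helper, assuming each entry is a per-device figure in GB taken after sharding; the example values are placeholders, not measurements.

```python
REQUIRED_ITEMS = (
    "parameters", "gradients", "optimizer_state", "activations",
    "temporary_buffers", "communication_buffers", "checkpointing",
)

def peak_memory_gb(ledger):
    """Sum a per-device memory ledger, rejecting ledgers with missing items."""
    missing = [item for item in REQUIRED_ITEMS if item not in ledger]
    if missing:
        raise ValueError(f"incomplete memory ledger, missing: {missing}")
    return sum(ledger.values())

# Placeholder per-device figures in GB, for illustration only.
ledger = {
    "parameters": 2.0, "gradients": 2.0, "optimizer_state": 8.0,
    "activations": 10.0, "temporary_buffers": 1.5,
    "communication_buffers": 0.5, "checkpointing": 1.0,
}
total = peak_memory_gb(ledger)
print(total)  # 25.0
```

The check is crude, but it blocks the most common accounting error: quoting the optimizer-state line alone as if it were the peak.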

For teams experimenting with matrix-based methods such as Muon, the business question is not whether the phrase “matrix-aware” sounds modern. Of course it does. The question is whether the training stack can implement the extra matrix operations efficiently, whether parameter grouping is explicit, whether non-matrix parameters use a sensible companion optimizer, and whether gains persist after strong AdamW tuning.

A practical decision framework for optimizer selection

A useful internal review does not start with “Which optimizer is best?” It starts with “What is binding?”

Binding constraint	Candidate direction	Evaluation focus
GPU memory	Adafactor, low-bit states, Adam-mini, LOMO, GaLore	Peak memory, feasible context, feasible batch, checkpoint size
Wall-clock time	Strong fused AdamW, efficient low-overhead alternatives	Validation loss versus time, tokens/sec, kernel maturity
Token budget	Curvature-aware or better-conditioned methods	Loss versus tokens plus per-step overhead
Full-parameter fine-tuning feasibility	LOMO-like or low-memory full-update methods	Memory feasibility, stability, downstream performance
Transformer matrix geometry	Muon or matrix-based updates	Parameter grouping, orthogonalization cost, scale behavior
Large-batch distributed training	LAMB-style or sharding-compatible adaptive methods	Stability, communication cost, global batch behavior

This is the replacement for optimizer gossip. It is less exciting, which is usually a sign that it is closer to management reality.

The firm should also maintain a strong AdamW baseline. Not a default AdamW baseline. A strong one: tuned learning rate, warmup, decay schedule, momentum coefficients, weight decay, clipping, precision settings, batch size, gradient accumulation, and parameter-specific exclusions. Without that baseline, every new optimizer looks like a revolution because the old one was asked to run in shoes two sizes too small.
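"Parameter-specific exclusions" usually means deciding which tensors receive weight decay. The sketch below applies one common convention, assumed here rather than taken from the paper: biases and one-dimensional parameters such as normalization gains are excluded. The function name and parameter names are hypothetical, and the output only loosely mirrors the per-group dictionaries that optimizer libraries accept.

```python
def split_decay_groups(named_shapes, weight_decay=0.1):
    """Split parameter names into decay and no-decay groups by shape and name."""
    decay, no_decay = [], []
    for name, shape in named_shapes.items():
        # Assumed convention: biases and 1-D tensors get no weight decay.
        if len(shape) <= 1 or name.endswith(".bias"):
            no_decay.append(name)
        else:
            decay.append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Hypothetical parameter names and shapes for a tiny Transformer block.
shapes = {
    "attn.q_proj.weight": (64, 64),
    "attn.q_proj.bias": (64,),
    "ln_1.weight": (64,),
    "mlp.up_proj.weight": (256, 64),
}
groups = split_decay_groups(shapes)
```

A baseline that skips this step is not wrong, but it is not the baseline a new optimizer should be allowed to beat.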

Boundaries: this survey is a map, not a winner announcement

The paper is best read as a map of the optimizer landscape and a standard for judging claims. It is not a new benchmark proving that memory-efficient, curvature-aware, low-rank, sign-based, or matrix-based optimizers dominate AdamW.

That boundary matters for implementation. A team should not adopt Muon because matrix-based optimization is conceptually elegant. It should test whether Muon improves the relevant frontier on its architecture, model size, hardware, precision regime, and distributed setup. A team should not adopt GaLore because low-rank structure is plausible. It should test rank, refresh interval, projection overhead, memory savings, and downstream behavior. A team should not adopt low-bit states because the optimizer-state number shrinks. It should measure numerical stability and actual peak memory, not just theoretical state size.

The paper also implies that future optimizer design may become hybrid. Different parameter groups may deserve different update rules. Large dense matrices may benefit from matrix-aware updates. Embeddings or normalization parameters may not. Some layers may need rich adaptive states; others may tolerate cheaper approximations. Optimizer memory may become an allocatable budget rather than a fixed multiple of model size.

That is a more interesting future than “AdamW killer arrives.” Also more annoying to benchmark. Progress often has terrible ergonomics.

The real lesson: optimizer choice is infrastructure design

The obvious summary of the paper is that the LLM optimizer landscape now includes AdamW, Adafactor, LAMB, Lion, Sophia, LOMO, GaLore, Adam-mini, Muon, and several related families.

The useful summary is different: optimizer choice is becoming infrastructure design.

AdamW remains the main trail because it is robust, mature, and difficult to beat under fair tuning. The alternatives matter because each attacks a specific limitation: memory burden, coordinate-wise geometry, curvature blindness, low-rank structure, matrix structure, large-batch stability, or hardware inefficiency. Their value depends on whether the attacked limitation is actually binding in the target training regime.

For business users, the conclusion is not to chase optimizer names. The conclusion is to build an evaluation habit:

1. define the training regime;
2. identify the binding resource constraint;
3. maintain a strong AdamW baseline;
4. compare candidate optimizers under both fixed-model and fixed-resource settings;
5. report validation loss, wall-clock time, peak memory, tokens per second, stability, downstream behavior, and implementation details;
6. only then decide whether the new optimizer buys anything real.

That sounds less glamorous than “a new optimizer makes LLM training cheaper.” It is also much less likely to waste a GPU budget.

Cognaptus: Automate the Present, Incubate the Future.

¹ Aditya Ranganath, Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers, arXiv:2605.09176, 2026. https://arxiv.org/abs/2605.09176
