TL;DR for operators
If your deployment plan depends on squeezing a language model into cheap inference hardware, this paper is worth reading because it changes the timing of the quantization problem. Most quantization work asks: “How do we repair a model after training so it survives 4-bit inference?” Outlier-Safe Pre-Training (OSP) asks a more irritating question: “Why did we train a quantization-hostile model in the first place?” [1]
The authors argue that extreme activation outliers are not an unavoidable side effect of large language models. They are, at least in this experimental setting, largely products of training design: diagonal optimizers such as Adam, channel-wise scaling in RMSNorm, and embedding-layer behaviour that can concentrate activation magnitudes. OSP attacks those sources before the model is finished, rather than performing heroic post-hoc surgery later.
The framework has three parts. First, replace Adam with Muon for most model parameters, reducing the privileged-coordinate effects caused by element-wise adaptive gradient scaling. Second, replace ordinary RMSNorm’s per-channel learnable scale with Single-Scale RMSNorm, which preserves adaptive magnitude control but removes channel-specific amplification. Third, keep Adam on the embeddings for efficiency, but add learnable embedding projection matrices so embedding-originated magnitudes are redistributed rather than concentrated.
The main empirical result is unusually clean. A 1.4B-parameter LLaMA-style model trained with OSP on 1 trillion tokens shows near-zero excess kurtosis in its activations, incurs only about 2% training overhead versus Adam, and is far more robust under 4-bit quantization. Under aggressive 4-bit quantization, the OSP model reaches a 35.7 average score across ten benchmarks, compared with 26.5 for an Adam-trained scratch baseline. In the paper’s 100B-token ablation, the complete OSP configuration reaches excess kurtosis of 0.04, compared with 1818.56 for the standard Adam configuration. That is not a rounding error. That is the outlier problem falling off a cliff.
For business use, the practical takeaway is narrow but important: quantization readiness should be treated as a design objective at training time, not merely as an inference optimisation pass. OSP does not prove that every future model can be made outlier-free, nor that 7B or 70B models will behave the same way. It does show that for at least one serious 1.4B-scale run, outlier-free pre-training is feasible, efficient, and compatible with existing post-training quantization methods. In infrastructure terms: the cheapest model to deploy may be the one that was trained not to fight deployment.
The old quantization bargain was always a little suspicious
Quantization is sold as practical engineering. Take a trained model, reduce the numerical precision of its weights and activations, and enjoy lower memory use, cheaper inference, and perhaps enough efficiency to run something useful outside a data centre. Lovely. Then the model meets 4-bit arithmetic and starts behaving like a spreadsheet opened in the wrong encoding.
The usual villain is the activation outlier. Most values in a layer sit within a modest range, but a small number of channels or token positions produce very large magnitudes. Quantization then has to choose a scale wide enough to include those extremes. The consequence is ugly: ordinary values get mapped too coarsely, rounding error rises, and useful information is lost. A few numerical divas ruin the precision budget for everyone else.
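A minimal sketch of that effect, assuming symmetric absmax quantization with a single scale per tensor (a common simplification, not necessarily the scheme used in the paper). The activation values and function names are hypothetical.

```python
def quantize_absmax(values, bits=4):
    """Symmetric absmax quantization: one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit
    scale = max(abs(v) for v in values) / qmax  # step size set by the largest value
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes], scale


def mean_abs_error(original, dequantized):
    return sum(abs(a - b) for a, b in zip(original, dequantized)) / len(original)


# Hypothetical activations: eight modest values, then the same values plus one outlier.
typical = [0.9, -0.7, 0.4, -0.2, 0.6, -0.5, 0.1, 0.3]
with_outlier = typical + [60.0]

deq_clean, scale_clean = quantize_absmax(typical)
deq_dirty, scale_dirty = quantize_absmax(with_outlier)

# Error measured on the eight modest values only.
err_clean = mean_abs_error(typical, deq_clean)
err_dirty = mean_abs_error(typical, deq_dirty[:len(typical)])
```

With the outlier present, the step size grows from about 0.13 to about 8.6, so every modest value rounds to zero. The outlier itself is represented perfectly; everything else pays for it.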
The standard response has been post-training quantization, or PTQ. PTQ methods rotate activations, smooth channel scales, use calibration data, allocate special treatment to outlier channels, or optimise weight rounding. Many of these methods are clever. Some are impressively effective. But they share an implicit concession: the model is already trained, the outliers already exist, and now the engineer must clean up the mess.
OSP challenges that bargain. The paper’s central claim is not merely “we improved 4-bit quantization.” It is sharper: many outliers are not fundamental properties of transformers. They are artifacts of how transformers are trained. If that is right, PTQ is not the whole optimisation problem. It is the after-sales service department.
OSP treats outliers as a training pathology, not a deployment nuisance
The paper’s mechanism-first argument starts with a useful distinction. Outliers are not just “large values.” They are structured concentration: certain coordinates or token positions become disproportionately large, and this concentration interacts badly with low-bit quantization.
The authors focus on three mechanisms that can create or preserve such concentration:
| Mechanism | What goes wrong | OSP intervention | Operational meaning |
|---|---|---|---|
| Diagonal adaptive optimisation | Adam rescales updates element by element, encouraging privileged coordinates where some channels become special | Use Muon for most parameters | Reduce coordinate-specific amplification during training |
| Channel-wise normalization scale | Standard RMSNorm has learnable per-channel scaling, giving the model another way to amplify particular dimensions | Use Single-Scale RMSNorm | Keep global scale adaptation without channel-specific knobs |
| Embedding-layer dynamics | Embedding matrices are costly for non-diagonal optimisation, but leaving them on Adam can reintroduce concentrated magnitudes | Use Adam for embeddings plus learnable embedding projection | Preserve throughput while redistributing embedding-originated magnitudes |
This is the part of the paper that matters most for operators. OSP is not one trick. It is a three-part removal of pathways through which the model can learn quantization-hostile representations.
That also explains why the paper is best read mechanism-first rather than evidence-first. The evidence is strong, but the evidence only becomes interpretable once the reader understands the causal bet: outliers form because training dynamics create privileged bases. Remove enough privileged-basis machinery, and the quantization landscape changes.
Muon attacks Adam’s quiet coordinate bias
Adam is popular because it works. Unfortunately, “works” and “produces quantization-friendly internal geometry” are not the same sentence.
Adam uses per-parameter adaptive statistics to scale updates. This is useful for optimisation, but it also means each parameter coordinate gets its own treatment. Prior work has described this as creating a “privileged basis”: the model can lean into specific coordinates because the optimiser itself treats coordinates separately. In transformer residual streams, that can encourage some dimensions to carry unusually high magnitude.
Muon changes the update geometry. It avoids Adam-style element-wise adaptive scaling for hidden layers and uses matrix orthogonalization through Newton-Schulz iterations. The paper positions Muon as a practical middle ground: it captures some benefits associated with richer matrix-aware updates, but without the full memory and throughput penalties of heavier second-order optimizers.
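To make "matrix orthogonalization through Newton-Schulz iterations" concrete, here is a rough sketch using the textbook cubic iteration on a small matrix. It is an illustration under stated assumptions, not the paper's optimizer: practical Muon implementations may use different polynomial coefficients, far fewer steps, and a momentum buffer rather than a raw gradient. The helper names and the toy gradient are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=20):
    """Push every singular value of g toward 1 (assumes rows <= cols, full rank)."""
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]     # singular values now at most 1
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # Cubic Newton-Schulz step: X <- 1.5 X - 0.5 (X X^T) X
        x = [[1.5 * xv - 0.5 * cv for xv, cv in zip(xr, cr)]
             for xr, cr in zip(x, xxt_x)]
    return x


# Hypothetical 2x3 gradient whose first row dominates the second.
grad = [[4.0, 0.0, 3.0],
        [0.1, 0.2, 0.0]]
update = newton_schulz_orthogonalize(grad)
gram = matmul(update, transpose(update))          # close to the 2x2 identity
```

The design point is that the update depends on the matrix as a whole. No individual coordinate gets its own adaptive step size, which is the contrast with Adam that the paper leans on.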
The throughput evidence is important because many elegant training fixes die at the phrase “production scale.” In the paper’s TPU-v4 512 Pod Slice comparison, Adam processes 4.07M tokens per second, while Muon reaches 3.99M tokens per second, or 97.9% of Adam’s throughput. SOAP, by contrast, is reported with a three-hour build time, while another second-order-style alternative reaches only 75.5% of Adam’s throughput in the same table. The point is not that Muon is universally best. The point is that it is plausible at trillion-token scale. That is a more useful claim.
But the paper is careful, and the ablations justify the caution: Muon alone is not enough. At 100B tokens, Muon without the architectural OSP changes still shows high excess kurtosis, reported as 1575.12 in the ablation table. That is lower drama than Adam only in the way a smaller fire is lower drama than a larger fire. The building is still on fire.
Single-Scale RMSNorm removes a second route to channel celebrity
Even if the optimiser stops privileging individual coordinates, the architecture can still do it.
RMSNorm normally includes a learnable scale for each channel. That gives the model a direct mechanism to amplify some dimensions more than others. From a standard modelling perspective, this is flexible and useful. From a quantization perspective, it is also an invitation for a few channels to become obnoxiously important.
The obvious fix is to remove learnable scaling entirely. The paper notes prior Simple RMSNorm-style approaches that do this, but also reports practical problems: fixed scaling can suppress activation magnitudes early in training or destabilise training later. In other words, removing the knob solves one problem by creating another. Neat, but not useful enough.
Single-Scale RMSNorm is the compromise. Instead of one learnable scale per channel, it uses one shared scale. The model can still adjust the overall activation magnitude, but it cannot secretly crown a few dimensions as royalty.
This is a small architectural change with a large conceptual implication. Quantization-friendly training is not simply about making activations smaller. It is about preventing concentration. A model with well-distributed activation energy can still represent useful information. A model that hides too much information in a few large channels becomes expensive to compress.
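The difference between the two normalization variants can be sketched in a few lines. This assumes the plain form "gain times x divided by RMS(x)"; the function names, the toy input, and the gain values are hypothetical, and the paper's initialisation and placement details are not reproduced.

```python
import math


def rms_normalize(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def rmsnorm_per_channel(x, gains):
    """Standard RMSNorm: one learnable gain per channel."""
    return [g * v for g, v in zip(gains, rms_normalize(x))]


def rmsnorm_single_scale(x, gain):
    """Single-scale variant: one learnable scalar shared by every channel."""
    return [gain * v for v in rms_normalize(x)]


def peak_to_rms(x):
    """Crude concentration measure: largest magnitude relative to the RMS."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms


x = [1.0, -1.0, 1.0, -1.0]                            # evenly spread input
per_channel = rmsnorm_per_channel(x, [1.0, 1.0, 1.0, 20.0])
single = rmsnorm_single_scale(x, 20.0)
```

In this toy case the per-channel gains push the peak-to-RMS ratio from 1 to roughly 2, while the single-scale variant leaves it at 1 whatever the gain. Overall magnitude changes; concentration does not.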
Embedding projection patches the efficiency loophole
The embedding layer creates a practical annoyance. Applying Muon-style orthogonalization to large embedding matrices is expensive. The authors report that doing so adds a further 6% throughput degradation. Since vocabularies are large and embeddings are parameter-heavy, this is not a trivial footnote.
OSP therefore keeps Adam for embeddings. That is a sensible engineering choice, but it reopens the door to Adam-induced concentration. The solution is a learnable full-rank embedding projection placed around the embedding/unembedding path. Its role is to redistribute magnitudes so they do not propagate as concentrated outliers through the network.
This is not just a hack added for a cleaner table. It is what makes the framework operationally plausible. Without decoupled embedding optimisation, OSP would be more expensive. Without embedding projection, the retained Adam path could undermine the outlier-prevention story. The paper’s design is therefore more conservative than it first appears: change the parts that cause trouble, but preserve compatibility and throughput where possible.
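What "redistributing magnitudes" means can be illustrated with a toy projection. In the paper the projection matrices are learned; the sketch below substitutes a fixed, normalised 4x4 Hadamard matrix purely as a stand-in, and makes no claim about what the learned matrices look like. The embedding row and helper name are hypothetical.

```python
def project(vec, matrix):
    """Row vector times matrix."""
    return [sum(v * m for v, m in zip(vec, col)) for col in zip(*matrix)]


# Normalised Hadamard matrix: orthonormal, so it preserves the vector norm.
H = [[0.5,  0.5,  0.5,  0.5],
     [0.5, -0.5,  0.5, -0.5],
     [0.5,  0.5, -0.5, -0.5],
     [0.5, -0.5, -0.5,  0.5]]

# Hypothetical embedding row with all of its magnitude in one channel.
embedding_row = [8.0, 0.0, 0.0, 0.0]
projected = project(embedding_row, H)

norm_before = sum(v * v for v in embedding_row) ** 0.5
norm_after = sum(v * v for v in projected) ** 0.5
peak_before = max(abs(v) for v in embedding_row)
peak_after = max(abs(v) for v in projected)
```

The vector keeps its norm, but the largest single channel halves because the magnitude is now spread across all four. A full-rank linear map loses no information, which is why a step like this can sit in the embedding path without changing what the model can represent.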
And yes, this is where the “just train it differently” crowd usually forgets the bill. OSP does not forget the bill. It itemises it.
The ablation is the paper’s main diagnostic evidence
The 100B-token ablation is best read as a diagnostic test, not as the headline deployment result. Its purpose is to show whether each component contributes to outlier prevention and whether partial fixes are enough.
They are not.
The standard Adam model reports excess kurtosis of 1818.56. Muon alone reports 1575.12. Muon with Single-Scale RMSNorm drops sharply to 66.69. Muon with embedding projection gives 703.23. The complete OSP setup — Muon, Single-Scale RMSNorm, and embedding projection — reports 0.04.
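For readers who do not carry the definition around: excess kurtosis is the fourth standardised moment minus 3, so a Gaussian scores about zero and heavy tails score high. The sketch below uses the standard population formula on synthetic data. How the paper aggregates kurtosis across layers and tokens is not described in this article, so treat this as the definition only, with hypothetical sample values.

```python
import random


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2^2 - 3 (zero for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


rng = random.Random(0)                        # fixed seed for repeatability
well_behaved = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Same sample with five values replaced by large outliers (0.05% of entries).
heavy_tailed = well_behaved[:-5] + [100.0] * 5

k_clean = excess_kurtosis(well_behaved)
k_heavy = excess_kurtosis(heavy_tailed)
```

Replacing five values in ten thousand is enough to move the statistic from near zero into the thousands, which is why the metric is a sensitive detector for exactly the kind of sparse, extreme activations that break low-bit quantization.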
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Adam baseline at 100B tokens | Main diagnostic baseline | Standard training produces extreme activation heavy tails | That Adam always fails under every architecture or schedule |
| Muon alone | Ablation | Removing diagonal adaptive scaling is insufficient by itself | That Muon has no value; it looks necessary but insufficient here |
| Muon + SSNorm | Ablation | Channel-wise scaling is a major contributor | That normalization is the only remaining cause |
| Muon + EmbProj | Ablation | Embedding-originated concentration matters | That projection alone solves the problem |
| Full OSP | Main ablation result | All three interventions together prevent outlier emergence in this setting | That all future model sizes will behave identically |
This is the strongest part of the paper because the result is not merely “our final model is better.” The partial configurations behave in ways that make the proposed mechanism credible. If only the final model had been tested, one could suspect accidental hyperparameter luck. Instead, the ablation suggests a more structural story: outliers can enter through multiple channels, and leaving one route open may be enough for them to propagate.
That propagation point is operationally relevant. A deployment team cannot assume that “mostly quantization-friendly” training is enough. If one layer family or parameter group can seed extreme magnitudes, the rest of the model may inherit the problem. Quantization is not sentimental. It does not give partial credit for good intentions.
The 1T-token run turns the mechanism into a scale claim
The trillion-token experiment is the paper’s main scale evidence. The authors train a 1.4B-parameter LLaMA-style model on 1T tokens and compare it with an Adam-trained scratch baseline and several open-source models of similar size.
Under 4-bit quantization, the OSP model achieves a 35.7 average score across ten benchmarks. The Adam-trained scratch model, also 1.4B and trained on 1T tokens, scores 26.5. Many open-source baselines cluster in the mid-to-high 20s under the same aggressive quantization setup. Qwen 2 at 1.5B reaches 29.3, still below the OSP model’s 35.7.
That comparison needs careful interpretation. The paper is not claiming that OSP produces a better full-precision model than all open-source alternatives. In the appendix’s unquantized benchmark table, the OSP model scores 44.4 on average, while the Adam scratch model scores 45.9, and several open-source models score higher. SmolLM 2, for example, reports 49.7 unquantized. So the win is not general model capability in full precision.
The win is robustness under severe quantization.
That distinction matters because businesses do not buy “average benchmark score” in the abstract. They buy deployable capability under constraints. If the target environment has enough memory and bandwidth for higher precision inference, OSP may be less compelling. If the target is mobile, browser, embedded, private on-device, or cost-sensitive edge inference, the relevant question changes: how much capability survives after compression?
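One rough way to put a number on "how much capability survives" is to divide the 4-bit average by the unquantized average for each model, using the figures quoted above. This assumes the two averages cover the same benchmark set, which this article does not confirm, and it ignores any chance-level floor in the benchmarks. The dictionary layout and function name are hypothetical.

```python
# Average benchmark scores quoted in this article (unquantized vs 4-bit).
scores = {
    "OSP":  {"full_precision": 44.4, "four_bit": 35.7},
    "Adam": {"full_precision": 45.9, "four_bit": 26.5},
}


def retained_fraction(full_precision, four_bit):
    """Share of the unquantized average score that survives quantization."""
    return four_bit / full_precision


retention = {
    name: retained_fraction(s["full_precision"], s["four_bit"])
    for name, s in scores.items()
}

for name, frac in retention.items():
    print(f"{name}: {frac:.1%} of the unquantized average retained")
```

On these figures the OSP model keeps roughly 80% of its unquantized average and the Adam baseline roughly 58%, despite Adam starting 1.5 points higher in full precision.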
The paper’s answer is that OSP preserves far more capability at 4-bit than comparable training under Adam. That is a deployment result, not a leaderboard result. A rare and welcome category.
PTQ still matters; it just starts from a better patient
A lazy reading of the paper would say: “OSP eliminates outliers, therefore post-training quantization is obsolete.” The paper does not say that. In fact, its PTQ analysis points in the opposite direction.
The authors test 4-bit quantization using several post-training methods. With plain round-to-nearest quantization, the Adam model’s WikiText-2 perplexity explodes to 14475.51, while the OSP model reports 45.92. Adding a Hadamard-style transform improves Adam to 4794.00 and OSP to 19.27. Adding GPTQ improves Adam to 3723.46 and OSP to 14.29. More advanced rotation-based PTQ narrows the gap: QuaRot gives 16.62 for Adam and 14.38 for OSP; SpinQuant gives 14.94 for Adam and 13.66 for OSP.
There are two lessons here.
First, OSP is especially valuable when naïve or minimal quantization would otherwise fail catastrophically. The difference between 14475.51 and 45.92 is not subtle. It is the difference between a model that has fallen into a quantization sinkhole and one that is still recognisably functioning.
Second, strong PTQ methods still help. OSP does not replace them; it improves the starting point. Once advanced rotation methods are applied, the gap becomes smaller but does not disappear. That is exactly what one would expect if OSP removes a major source of quantization pathology while PTQ still addresses residual quantization error.
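Both lessons show up if the quoted perplexities are turned into Adam-to-OSP ratios. The numbers below are copied from the paragraph above; the labels follow this article's wording, and the ratio is our own summary statistic, not one the paper reports.

```python
# WikiText-2 perplexity at 4 bits as quoted above: (Adam, OSP). Lower is better.
perplexity = {
    "round-to-nearest": (14475.51, 45.92),
    "Hadamard":         (4794.00, 19.27),
    "GPTQ":             (3723.46, 14.29),
    "QuaRot":           (16.62, 14.38),
    "SpinQuant":        (14.94, 13.66),
}

# How many times worse the Adam model is than the OSP model under each method.
gap = {method: adam / osp for method, (adam, osp) in perplexity.items()}

for method, ratio in gap.items():
    print(f"{method:>16}: Adam perplexity is {ratio:,.2f}x OSP")
```

The ratio is about 315 for round-to-nearest and falls to about 1.09 under SpinQuant. It never reaches 1, which is the "smaller but does not disappear" point in numeric form.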
For operators, the decision is therefore not “OSP or PTQ.” It is “OSP before PTQ if we control pre-training or serious continued training; PTQ alone if we only inherit finished models.” The ownership boundary matters.
Attention sinks survive, which is more interesting than it sounds
One of the paper’s more useful conceptual corrections concerns attention sinks. Prior explanations have linked massive activations to attention sink behaviour: the model repeatedly focuses attention on special tokens, often initial or delimiter tokens, as a kind of “no-op” mechanism. The intuitive story is that attention sinks cause massive activations.
OSP complicates that story.
The authors find that OSP-trained models can still exhibit attention sink behaviour even when massive activations disappear. In Adam-trained models, sink-related query and key activations concentrate high magnitude in sparse channels. In OSP-trained models, the magnitudes are distributed more broadly. The attention pattern can persist, but the extreme channel concentration does not.
This distinction matters because it separates a behavioural pattern from one implementation strategy. A model may need a way to route attention toward unimportant or special tokens. But it does not necessarily need to implement that behaviour through massive negative logits or sparse high-magnitude channels. Softmax depends on relative differences, not metaphysical commitment to numerical melodrama.
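The softmax point is easy to check directly. In the sketch below, position 0 plays the sink token; all logit values are hypothetical and chosen only to show that the same attention pattern can come from extreme or modest logits.

```python
import math


def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention logits for one query; position 0 is the sink.
extreme = [0.0, -40.0, -40.0, -40.0]     # sink built from massive negative logits
shifted = [v + 40.0 for v in extreme]    # same gaps, different absolute values
modest = [6.0, 0.0, 0.0, 0.0]            # sink built from a moderate gap

p_extreme = softmax(extreme)
p_shifted = softmax(shifted)
p_modest = softmax(modest)
```

Shifting every logit by the same constant leaves the attention weights unchanged, and a gap of 6 already sends more than 99% of the weight to the sink. The behaviour needs a relative difference, not a large absolute magnitude.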
The paper’s interpretation is that attention sinks are not inherently the cause of outliers. Instead, outlier-prone training dynamics may push the model toward a particular implementation of sink behaviour: concentrated channels and extreme logits. OSP allows a different implementation: distributed magnitudes and more balanced logits.
This is more than interpretability trivia. It tells practitioners not to overfit the fix to the surface symptom. If attention sinks persist without catastrophic outliers, then suppressing every sink-like pattern may be unnecessary or even misguided. The enemy is not necessarily the attention pattern. The enemy is the concentrated numerical implementation that makes quantization brittle.
The business value is not “smaller models”; it is fewer deployment surprises
OSP should not be read as a generic compression slogan. The business value is more specific: it reduces the risk that a model which looks acceptable in full precision becomes unusable when pushed into a low-bit deployment regime.
That risk is expensive. It appears late, after training costs have already been paid. It creates engineering loops around calibration data, hardware kernels, fallback precision, exception handling, and model-specific quantization recipes. It also makes deployment planning harder because “1.4B parameters” tells you less than expected. Two models of similar size can behave very differently once quantized.
OSP suggests a different procurement and training checklist:
| Decision area | Old question | OSP-informed question |
|---|---|---|
| Model training | Does the model converge? | Does it converge without developing quantization-hostile activation distributions? |
| Optimizer choice | Which optimizer is fastest and stable? | Which optimizer avoids coordinate concentration at acceptable cost? |
| Architecture | Does the transformer match standard inference stacks? | Does it preserve compatibility while reducing privileged channels? |
| Quantization | Which PTQ method repairs this model? | Was the model trained to make PTQ easier in the first place? |
| Deployment planning | What is the full-precision benchmark score? | What capability survives under the target bit-width and KV-cache setting? |
The ROI argument is not that OSP makes pre-training cheaper. It adds about 2% training overhead in the paper’s 1T-token run. The ROI argument is that a small training-side overhead may reduce inference-side fragility, memory pressure, and model-specific quantization firefighting. Since inference cost dominates many production deployments over time, that trade-off is at least worth attention.
Still, this is an inference from the paper, not something the paper directly measures as enterprise cost savings. The authors measure quantization robustness, throughput, kurtosis, perplexity, and benchmark accuracy. Cognaptus infers the business relevance: if those properties hold in target model classes, training for outlier safety could lower the practical cost of reliable low-bit deployment.
What the paper directly shows versus what operators may infer
A clean separation is useful here, because otherwise every quantization paper becomes a small bonfire of overclaiming.
| Category | Claim | Status |
|---|---|---|
| Directly shown | OSP combines Muon, Single-Scale RMSNorm, and embedding projection to prevent activation outliers in the tested setup | Supported by ablation and 1T-token training dynamics |
| Directly shown | Full OSP at 100B tokens reaches near-zero excess kurtosis, while partial variants do not | Supported by ablation table and training curves |
| Directly shown | A 1.4B OSP model trained on 1T tokens performs much better than an Adam scratch baseline under aggressive 4-bit quantization | Supported by benchmark table |
| Directly shown | OSP remains compatible with PTQ and can benefit from methods such as GPTQ, QuaRot, and SpinQuant | Supported by PTQ perplexity results |
| Reasonable inference | Pre-training choices can reduce downstream quantization engineering burden | Plausible, but not directly costed |
| Reasonable inference | OSP-like recipes may be valuable for edge and on-device LLM deployment | Plausible where low-bit inference is required |
| Not yet shown | The same results hold for 3B, 7B, 13B, or frontier-scale models | Explicitly outside current evidence |
| Not yet shown | OSP is superior to all other optimizer or architecture combinations | Not established; optimizer comparison is limited |
That last row matters. The paper’s limitation section is refreshingly concrete. The authors did not exhaustively compare Muon with other second-order or matrix-aware optimizers such as Shampoo or SOAP. They also did not test across the 3B and 7B scales that are especially relevant for mobile deployment. The reported result is strong, but it is not a universal law of model training. The outlier may be a lie; scale remains a bureaucrat.
The boundary: promising at 1.4B, unproven where many buyers will care most
The most important limitation is not that the work is academic or that benchmarks are imperfect. Those are boilerplate limitations, and boilerplate is where thought goes to nap.
The real boundary is model scale and deployment target. A 1.4B model is meaningful, especially for efficient inference research, but many commercial teams evaluating on-device or private-edge LLMs care about 3B, 7B, or larger models. The paper itself names 3B and 7B as future work. Until those experiments exist, OSP should be treated as a strong candidate design pattern, not a settled recipe.
There is also a capability trade-off to watch. In the unquantized appendix results, the OSP model is competitive but not superior to the Adam scratch baseline. That may be acceptable if the deployment target is 4-bit inference, but it matters if a team wants one training recipe to dominate across all precision regimes. The right comparison is not “best full-precision model” versus “best quantized model” in the abstract. It is “best model under the actual operating constraint.”
Finally, implementation maturity matters. The authors use a distributed Muon implementation in JAX on TPU infrastructure. An organisation training in PyTorch on GPUs, using different sharding strategies, kernels, or optimiser stacks, should not assume the same overhead profile automatically. The mechanism may transfer before the engineering economics do. Annoying, but familiar.
The strategic lesson: quantization begins before quantization
The paper’s best contribution is not the phrase “outlier-free LLM,” though that will probably get the attention. The better contribution is a shift in where the quantization problem lives.
If outliers are treated as inevitable, then deployment teams inherit a fixed problem. They calibrate, rotate, smooth, split, exempt, and patch. Sometimes it works. Sometimes the model falls apart under 4-bit arithmetic and everyone pretends the real requirement was 8-bit all along.
If outliers are treated as training-induced, the design space changes. Optimizer choice, normalization structure, and embedding treatment become deployment decisions. Pre-training is no longer only about loss curves and benchmark capability. It is also about the numerical shape of the model that future inference systems must serve.
OSP does not end the need for PTQ. It does not prove every model can be made outlier-free. It does not magically turn small models into large ones, nor does it repeal memory bandwidth. The universe remains rude.
But it does make one point difficult to ignore: the outlier problem may be less like bad weather and more like bad plumbing. If you install the pipes wrong, you can still spend money on buckets. Better, perhaps, to stop the leak upstream.
Cognaptus: Automate the Present, Incubate the Future.
[1] Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, and Jaewoo Kang, “Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models,” arXiv:2506.19697, 2025.