When it comes to deploying large language models (LLMs) efficiently, few challenges are as stubborn—and misunderstood—as activation outliers. For years, engineers have treated them like a natural disaster: unpredictable but inevitable. But what if they’re more like bad habits—learned and fixable?

That’s the provocative premise behind a new framework called Outlier-Safe Pre-Training (OSP). Developed by researchers at Korea University and AIGEN Sciences, OSP proposes a simple but radical shift: instead of patching over outliers post hoc with quantization tricks, why not train the model to never form outliers in the first place?

What Are Activation Outliers and Why Do They Matter?

In quantized models, especially in the aggressively compressed 4-bit regime, activation outliers wreak havoc. A handful of extreme values stretch the dynamic range of an activation tensor, inflating the quantization scale factor so that ordinary values collapse into just a few representable levels. The result? Severe degradation in downstream task performance.
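To see the mechanics, here is a minimal sketch of symmetric absmax 4-bit quantization. The helper name and toy tensors are illustrative, not the paper's setup: a single outlier inflates the shared scale, and everything else gets crushed.

```python
import torch

def absmax_quantize_int4(x: torch.Tensor) -> torch.Tensor:
    """Symmetric absmax quantization to 4-bit integers in [-8, 7], then dequantize."""
    scale = x.abs().max() / 7.0                       # one extreme value inflates this scale
    q = torch.clamp(torch.round(x / scale), -8, 7)
    return q * scale                                  # dequantized approximation of x

torch.manual_seed(0)
x = torch.randn(8)                                    # well-behaved activations
x_outlier = torch.cat([x, torch.tensor([80.0])])      # same activations plus one outlier channel

err_clean = (absmax_quantize_int4(x) - x).abs().mean()
err_outlier = (absmax_quantize_int4(x_outlier)[:-1] - x).abs().mean()
print(f"mean error without outlier: {err_clean:.3f}")   # small: the scale fits the data
print(f"mean error with outlier:    {err_outlier:.3f}")  # large: scale is ~80/7, most values round to 0
```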

Typical mitigation strategies operate after the fact, at Post-Training Quantization (PTQ) time, and accept outliers as a fact of life. They redistribute or clip extreme values, or apply statistical band-aids like Hessian-aware rounding and Hadamard transforms. But these are reactive, not preventive.
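As a toy illustration of why rotations like the Hadamard transform help (this is the general idea, not the paper's pipeline): an orthonormal rotation spreads one extreme channel's energy across all channels, shrinking the worst-case value before quantization, while the inverse rotation can be folded into the weights.

```python
import numpy as np
from scipy.linalg import hadamard

d = 8
H = hadamard(d) / np.sqrt(d)        # orthonormal Hadamard rotation (H @ H.T == I)

np.random.seed(0)
x = np.random.randn(d)              # ordinary activations...
x[3] = 64.0                         # ...with one extreme outlier channel

x_rot = H @ x                       # rotation spreads the outlier across channels
print("max |x|     :", np.abs(x).max())      # ~64
print("max |H @ x| :", np.abs(x_rot).max())  # ~64 / sqrt(8), roughly 23
# Because H is orthogonal, absorbing H.T into the next layer's weights leaves
# the layer's output unchanged while making the activations easier to quantize.
```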

OSP flips the script: prevent the disease instead of treating the symptoms.

The OSP Trinity: Muon, SSNORM, and EMBPROJ

OSP comprises three key innovations, each addressing a distinct source of outlier formation (a code sketch of all three follows the list):

  1. Muon Optimizer – An alternative to Adam, Muon drops the diagonal (per-coordinate) preconditioning that encourages privileged outlier channels and instead orthogonalizes each weight update via Newton-Schulz iteration. Training runs at 97.9% of Adam's throughput, with far fewer outliers.

  2. Single-Scale RMSNorm (SSNORM) – Standard normalization layers learn a separate gain for every channel, reinforcing a privileged basis. SSNORM replaces that per-channel gain with a single learnable scale shared across all dimensions, removing channel-wise amplification without stalling convergence.

  3. Learnable Embedding Projection (EMBPROJ) – Embedding layers are a computational bottleneck for Muon and remain prone to outliers. OSP trains them with Adam for speed, then adds a learnable projection after the lookup that redistributes any emerging outlier values so they don’t propagate.
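To make the trio concrete, here is a rough sketch of how the three ideas could look in code. It is a reading of the descriptions above, not the authors' implementation: the names are invented, the Newton-Schulz step shown is the classic cubic variant rather than Muon's tuned polynomial, and the exact form and placement of the embedding projection are assumptions.

```python
import torch
import torch.nn as nn

def newton_schulz_orthogonalize(update: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a weight-update matrix (Muon's core move).

    Classic cubic Newton-Schulz iteration: instead of rescaling each coordinate
    separately (Adam's diagonal preconditioning), the whole update is pushed
    toward the nearest (semi-)orthogonal matrix, so no single channel dominates.
    """
    X = update / (update.norm() + 1e-7)      # shrink so the iteration converges
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X      # drives singular values toward 1
    return X

class SingleScaleRMSNorm(nn.Module):
    """RMSNorm with one shared gain instead of a per-channel gain vector,
    so normalization cannot selectively amplify individual channels."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps                             # dim kept only for drop-in parity
        self.scale = nn.Parameter(torch.ones(1))   # a single scalar gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.scale

class ProjectedEmbedding(nn.Module):
    """Token embedding followed by a learnable projection that can
    redistribute outlier coordinates before they enter the transformer."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)    # kept on Adam in OSP
        self.proj = nn.Linear(dim, dim, bias=False)   # hypothetical EMBPROJ form

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(token_ids))
```

The common thread is uniform treatment of dimensions: per-coordinate preconditioning, per-channel gains, and raw embedding rows are each replaced or followed by an operation that cannot single out one channel for special treatment.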

Together, these components yield near-zero excess kurtosis, a statistical hallmark of outlier-free activations, in a 1.4B-parameter LLM trained on a one-trillion-token corpus. That’s production scale.

Does It Work? Yes, Quantitatively and Qualitatively

The authors benchmarked OSP against a dozen open-source LLMs under 4-bit quantization. While models like TinyLlama and Qwen 2.5 crumbled to near-random accuracy on tasks like GSM8K and ARC, the OSP-trained model delivered an average score of 35.7, roughly a 35% improvement over Adam-trained baselines.

OSP also keeps perplexity under control across bit-widths, and ablation studies show that all three components, Muon, SSNORM, and EMBPROJ, are essential: removing any one of them weakens the whole framework.

Moreover, the model preserves emergent LLM behaviors, like attention sinks, but in a kinder, gentler way. Instead of relying on activation spikes to push logits toward negative infinity, it maintains balanced logits while still focusing attention. This reshapes our understanding of what causes outliers: not attention sinks themselves, but how optimizers and normalization layers react to them.
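A toy softmax comparison makes the qualitative point (the numbers are illustrative, not from the paper): attention can still concentrate on a sink token without any extreme logit.

```python
import torch

# One head attending over 4 positions; position 0 is the "sink".
spiky = torch.tensor([40.0, 0.0, 0.0, 0.0])   # extreme logit, the kind outlier activations produce
mild  = torch.tensor([3.0, 0.0, 0.0, 0.0])    # moderate logit, no activation spike required

print(torch.softmax(spiky, dim=0))  # ~[1.00, 0.00, 0.00, 0.00]
print(torch.softmax(mild,  dim=0))  # ~[0.87, 0.04, 0.04, 0.04] -- still a clear sink
```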

Why This Matters

OSP isn’t just another quantization technique. It’s a paradigm shift. By integrating outlier prevention directly into pre-training, OSP opens the door to:

  • 4-bit inference without catastrophic loss
  • Efficient edge deployment of LLMs
  • Better foundations for PTQ—OSP-trained models still benefit from Hadamard and GPTQ, but start from a stronger base

And perhaps most importantly, OSP challenges the fatalism around outliers. They’re not intrinsic to scale or transformers. They’re artifacts of design choices—and design choices can be changed.

Final Thoughts

The Outlier-Safe Pre-Training framework feels like one of those ideas that will, in hindsight, seem obvious. Why did we ever tolerate outliers in the first place?

By demonstrating that it’s possible to train LLMs not to form outliers, Jungwoo Park and collaborators have opened up new frontiers in efficient model deployment. The next time someone says “LLMs can’t work on-device,” just smile and say: maybe yours can’t.


Cognaptus: Automate the Present, Incubate the Future