Packing is easy until one object is much larger than everything else.
A warehouse can fit hundreds of ordinary boxes onto neatly spaced shelves. Add one grand piano, however, and the spacing plan becomes rather less elegant. Either the piano does not fit, or every shelf is redesigned around an object that appears once.
Scalar quantization has a similar problem. It replaces a wide range of model-weight values with a limited set of discrete levels. When a few weights are much larger than the rest, those outliers expand the range that the quantization grid must cover. The grid spends precious levels accommodating extremes while representing ordinary weights less precisely.
The OptRot paper [1] asks whether the model can be reorganized before quantization so that fewer coordinates behave like grand pianos. Its answer is a learned rotation that preserves the model’s function while redistributing extreme weight values. More unusually, OptRot learns that rotation without placing quantization itself inside the optimization loop.
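The central contribution is therefore not merely another rotation method. It is a mechanism: a rotation learned without quantization in the loop lowers weight incoherence, lower incoherence reduces GPTQ’s layerwise reconstruction error, and lower reconstruction error keeps the quantized model closer to the original.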
The mechanism works convincingly for GPTQ weight quantization. It also reveals its own boundary: a rotation that improves weight quantization can worsen activation quantization when both are pushed to four bits.
Geometry, as usual, does not offer free lunches. It merely makes the invoice easier to read.
One Outlier Can Set the Scale for an Entire Group
In uniform scalar quantization, each weight is mapped onto one of a finite number of levels. Fewer bits mean fewer available levels. A four-bit representation, for example, must describe weights using a much coarser grid than a sixteen-bit representation.
The quantizer generally scales the grid so that the largest weight in a group remains representable. That prevents clipping, but it creates a different problem: when the largest weight is far from the rest, the distance between adjacent grid points grows. Most weights are then rounded with unnecessarily large errors.
Smaller quantization groups can contain the damage, but they require more stored scale values. Outliers therefore impose a choice between precision and metadata overhead.
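The effect of a single outlier on the grid step can be seen in a small experiment. The sketch below is illustrative only: `quantize_absmax` is a hypothetical helper implementing plain symmetric round-to-nearest with an absmax scale, and the weight values are synthetic rather than taken from any model in the paper.

```python
import numpy as np

def quantize_absmax(w, bits=4):
    """Symmetric uniform round-to-nearest with an absmax scale (illustrative)."""
    qmax = 2 ** (bits - 1) - 1            # 7 for signed 4-bit
    scale = np.max(np.abs(w)) / qmax      # grid step is set by the largest weight
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale, scale

rng = np.random.default_rng(0)
ordinary = rng.normal(0.0, 0.02, size=128)   # synthetic "ordinary" weights
with_outlier = ordinary.copy()
with_outlier[0] = 1.0                        # one grand piano

deq_a, step_a = quantize_absmax(ordinary)
deq_b, step_b = quantize_absmax(with_outlier)

# Compare rounding error on the 127 weights that are identical in both groups.
mse_a = np.mean((ordinary[1:] - deq_a[1:]) ** 2)
mse_b = np.mean((with_outlier[1:] - deq_b[1:]) ** 2)
print(f"step without outlier: {step_a:.4f}, with outlier: {step_b:.4f}")
print(f"MSE on ordinary weights: {mse_a:.2e} -> {mse_b:.2e}")
```

The ordinary weights are unchanged between the two runs; only the grid they are rounded onto gets coarser.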
The paper formalizes the weight-outlier problem through weight incoherence:
$$\mu_W = \frac{\sqrt{mn}\,\max_{i,j}\lvert W_{ij}\rvert}{\lVert W \rVert_F}$$
Here, $W$ is an $m \times n$ weight matrix. The numerator captures the largest absolute weight, while the Frobenius norm in the denominator captures the matrix’s overall energy.
A low value means weight magnitude is distributed relatively evenly. A high value means a small number of coordinates dominate. The minimum possible incoherence is one, reached when all weights have the same magnitude.
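A few lines of NumPy make the two extremes concrete. This is a minimal sketch that assumes the definition $\sqrt{mn}\,\max_{i,j}\lvert W_{ij}\rvert / \lVert W \rVert_F$, which matches the description here (largest absolute weight over the Frobenius norm, with a minimum of one); the function name `weight_incoherence` is hypothetical.

```python
import numpy as np

def weight_incoherence(W):
    """sqrt(m*n) * max|W_ij| / ||W||_F; equals 1 when all |W_ij| are equal."""
    m, n = W.shape
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return np.sqrt(m * n) * np.max(np.abs(W)) / np.linalg.norm(W)

flat = np.array([[0.5, -0.5], [-0.5, 0.5]])   # equal magnitudes everywhere
spiky = np.array([[1.0, 0.0], [0.0, 0.0]])    # one coordinate holds all the energy

print(weight_incoherence(flat))    # 1.0
print(weight_incoherence(spiky))   # 2.0
```

For an $m \times n$ matrix with a single nonzero entry the value is $\sqrt{mn}$, the upper end of the range.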
This matters because the paper derives quantization-error bounds, for both round-to-nearest quantization and a theoretically modified version of GPTQ, in which the error grows with weight incoherence. The result turns an intuitive complaint about outliers into an optimization target.
The quantizer is not offended by outliers. It is simply forced to budget around them.
Rotations Change the Coordinates, Not the Model
An orthogonal rotation changes how a representation is expressed without changing the underlying information.
Consider a vector described using horizontal and vertical coordinates. Rotate the coordinate axes, and the numerical coordinates change even though the physical vector does not. In a neural network, compatible rotations can similarly be inserted into adjacent operations and then algebraically absorbed into the model’s weight matrices.
The model computes the same function, but the distribution of values inside particular matrices changes.
That distinction allows rotations to reduce outliers without retraining the original model. Some rotation methods use fixed Hadamard matrices. Others learn transformations by repeatedly quantizing the model and optimizing its output loss.
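The invariance is easy to check numerically for a single linear layer. The sketch below uses a random orthogonal matrix from a QR decomposition as a stand-in; it is not a Hadamard matrix or a rotation learned by OptRot, and the planted outlier is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.normal(size=(4, n))
W[0, 0] = 25.0                     # plant an outlier weight
x = rng.normal(size=n)

# A random orthogonal matrix from a QR decomposition (R @ R.T == I).
R, _ = np.linalg.qr(rng.normal(size=(n, n)))

W_rot = W @ R                      # rotation folded into the weights
x_rot = R.T @ x                    # matching rotation applied to the incoming activations

same_output = np.allclose(W @ x, W_rot @ x_rot)
same_energy = np.isclose(np.linalg.norm(W), np.linalg.norm(W_rot))
print(same_output, same_energy)
print(np.abs(W).max(), np.abs(W_rot).max())   # the largest entry changes
```

The layer output and the Frobenius norm are preserved, while the largest individual weight is spread across several coordinates.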
OptRot learns two types of fusible rotations used across the relevant Llama-style linear layers. Because the learned rotations can be folded into the weights, they do not themselves add inference-time computation. The paper’s main experimental setup can still include separate online rotations inherited from prior rotation-based pipelines, but an appendix test removes those online rotations and finds that OptRot retains its advantage.
The useful operational claim is therefore narrower than “rotations are free.” The rotations learned by OptRot are fusible, and its gains do not depend entirely on retaining additional online transformations.
The Paper Turns Model-Level Damage Into a Layerwise Objective
The quantity a deployment team ultimately cares about is not weight incoherence. It is whether the compressed model still behaves like the original.
The paper begins from the divergence between the output distributions of the original and quantized models. Directly optimizing that divergence while differentiating through GPTQ would be expensive. GPTQ performs sequential corrections using activation-covariance information, and its quantization operation is not conveniently differentiable.
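The authors therefore approximate model-level divergence using a sum of layerwise reconstruction errors:
$$\sum_{\ell} \operatorname{tr}\!\left((\hat{W}_\ell - W_\ell)\, H_\ell\, (\hat{W}_\ell - W_\ell)^\top\right)$$
where $\hat{W}_\ell$ is the quantized weight matrix of layer $\ell$ and
$$H_\ell = \mathbb{E}\left[x_\ell x_\ell^\top\right]$$
is the uncentered covariance of the layer’s input activations $x_\ell$, commonly described as the layer Hessian in GPTQ implementations.
The same error can be read more intuitively as:
$$\mathbb{E}\left[\lVert (\hat{W}_\ell - W_\ell)\, x_\ell \rVert_2^2\right]$$
It asks how much the quantized layer’s output differs from the original layer’s output over representative inputs.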
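The two readings of the layerwise error, a trace against the activation covariance and a mean squared output difference, agree exactly when the covariance is estimated from the same inputs. The sketch below checks this on synthetic data; the rounding step is a crude stand-in for a quantizer, not GPTQ, and the activations are random rather than real calibration samples.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, samples = 4, 6, 1000
W = rng.normal(size=(m, n))
W_hat = np.round(W * 4) / 4              # crude stand-in for a quantizer
X = rng.normal(size=(n, samples))        # synthetic calibration activations, one per column

H = X @ X.T / samples                    # uncentered covariance, an estimate of E[x x^T]
E = W_hat - W

trace_form = np.trace(E @ H @ E.T)
output_form = np.mean(np.sum((E @ X) ** 2, axis=0))   # mean squared output error
print(trace_form, output_form)
```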
From there, the paper derives upper bounds in which the rotation-sensitive part can be reduced to two broad components: a weight term and a Hessian term.
The first component measures extreme weights. The second depends on the activation covariance and the LDL structure used by GPTQ.
The exact bound contains additional dimensions, bit-width terms, and probability factors. Those terms matter to the proof, but the rotation-learning implication is simpler: reduce extreme weights, improve favorable feature correlations, or do both.
That creates two methods.
OptRot Replaces the Maximum With a Fourth-Power Penalty
Directly minimizing the largest absolute weight is awkward. The maximum changes abruptly when a different coordinate becomes the largest, producing a non-smooth optimization problem.
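OptRot substitutes a smooth proxy, the sum of fourth powers of the rotated weights:
$$\min_{R}\; \sum_{\ell}\sum_{i,j} \big(\widetilde{W}_\ell\big)_{ij}^{4}$$
The rotated matrices are denoted by $\widetilde{W}_\ell$, and $R$ stands for the learned orthogonal rotations. Raising values to the fourth power penalizes large coordinates disproportionately. Doubling any value multiplies its contribution by sixteen, so the few largest coordinates dominate the objective.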
Because an orthogonal rotation preserves overall matrix energy, optimization cannot simply shrink every weight. It must redistribute that energy away from the most extreme coordinates.
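The redistribution can be illustrated without any learning. The sketch below evaluates an element-wise fourth-power penalty before and after a fixed Hadamard rotation on a synthetic matrix with one extreme weight. It is not the OptRot optimizer, which learns the rotation; the helper names `hadamard` and `fourth_power` are hypothetical, and the paper’s exact objective may include normalization that is omitted here.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fourth_power(W):
    """Sum of element-wise fourth powers, a smooth stand-in for the maximum."""
    return np.sum(W ** 4)

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(16, 64))
W[3, 10] = 1.0                            # a single extreme weight

R = hadamard(64)
W_rot = W @ R

print(np.linalg.norm(W), np.linalg.norm(W_rot))   # energy is unchanged
print(fourth_power(W), fourth_power(W_rot))       # the penalty drops
```

The Frobenius norm is identical before and after, so the lower penalty comes entirely from spreading the outlier’s energy over the row.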
The choice of a fourth-power objective is not merely aesthetic. Higher powers showed no improvement in the authors’ preliminary experiments, while the fourth power is closely related to kurtosis, a familiar measure of distributional tails.
This is what makes OptRot data-free: learning the rotations does not require a calibration dataset or repeated quantization of the model.
That phrase needs careful handling. OptRot still requires access to the model weights, and the downstream GPTQ stage in the paper still uses calibration data. “Data-free” describes rotation learning, not the complete compression pipeline. A closed-weight API remains stubbornly closed, regardless of how elegant the objective is.
OptRot can also learn rotations using only the 50 weight matrices with the largest objective values. This cheaper variant generally remains competitive, suggesting that the worst-behaved matrices carry much of the available improvement.
OptRot+ Uses the Hessian Mostly to Decide Where Improvement Matters
OptRot deliberately ignores the activation-dependent part of the error bound. OptRot+ restores it.
The data-dependent method combines a smooth weight-outlier penalty with a rotation-sensitive upper bound derived from the activation covariance. In principle, this should favor rotations that both reduce weight extremes and create a covariance structure that GPTQ can quantize more effectively.
The method is more expensive. It requires calibration data, Hessian computation, and additional optimization over Hessian-related objectives. The authors avoid repeatedly differentiating through a full Cholesky decomposition by using a cheaper, parallelizable upper bound.
The experimental interpretation is more interesting than the method description.
The paper finds that most Hessian-related metrics are nearly identical across the rotation methods; only the unrotated model stands apart. Yet OptRot+ still produces modest downstream improvements. The authors’ likely explanation is that the data-dependent term functions primarily as a layer-importance score: it tells the optimizer where reducing weight incoherence matters most, rather than discovering radically better feature correlations within every layer.
That distinction changes the business interpretation. OptRot+ is not evidence that more elaborate covariance optimization unlocks an entirely different compression regime. It is evidence that modestly better prioritization can improve a strong weight-focused objective.
Useful, yes. Revolutionary, no. The Hessian has survived.
The Experiments Validate the Mechanism Before the Leaderboard
The paper evaluates Llama-3, Qwen3, and additional Qwen2 models. Weights are quantized using GPTQ, while activations—when included—are quantized using round-to-nearest. Evaluation covers WikiText perplexity, average accuracy across six zero-shot commonsense benchmarks, and KL divergence between the quantized and original models.
The tests serve different purposes:
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Layerwise incoherence and signal-to-noise ratio | Mechanism validation | Lower weight incoherence generally corresponds to lower GPTQ reconstruction error | That incoherence alone determines end-to-end model quality |
| Four-bit weight-only GPTQ on Llama models | Main evidence | OptRot improves practical GPTQ compression against major rotation baselines | Universal dominance on every metric |
| Weight-only GPTQ on Qwen models | Architecture-family robustness | The effect is not confined to the tested Llama architecture | Generalization to arbitrary model families |
| Three-bit quantization and removal of online rotations | Stress and robustness tests | Gains remain under harder compression and a leaner rotation setup | Measured production latency or cost savings |
| W4A8 and W4A4 activation tests | Boundary identification | Weight-focused rotations behave differently as activation precision falls | A complete solution for joint weight-and-activation quantization |
| RTN experiments with SpinQuant W4 | Objective-alignment ablation | End-to-end quantization-aware learning can exploit interactions missed by layerwise proxies | That OptRot is the best method for RTN |
This hierarchy matters. The appendix is not a second thesis. It tests where the main mechanism survives and where it stops being sufficient.
Lower incoherence becomes higher layerwise signal-to-noise
The layerwise plots provide the cleanest evidence for the proposed mechanism. Hadamard rotations improve weight incoherence over no rotation. SpinQuant, which is primarily trained around activation quantization, generally does not reduce weight incoherence further. OptRot and OptRot+ achieve the lowest incoherence across most layer types.
The down-projection layers are especially informative because they are known to contain substantial weight outliers. OptRot consistently lowers incoherence there and produces higher signal-to-noise ratios after GPTQ quantization.
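This connects the theory to practical GPTQ behavior: lower weight incoherence corresponds to lower reconstruction error, and lower reconstruction error shows up as higher layerwise signal-to-noise.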
The plots do not prove that every reduction in incoherence must improve every downstream metric. They do show that the proxy controls the quantity it was designed to control.
The strongest consistent end-to-end result is lower KL divergence
For four-bit weight-only GPTQ, OptRot consistently improves KL divergence relative to SpinQuant and QuaRot. OptRot+ generally improves it further.
Selected Llama results illustrate the pattern:
| Model | SpinQuant KL ↓ | OptRot KL ↓ | Best OptRot-family KL ↓ |
|---|---|---|---|
| Llama-3.2-1B | 0.137 | 0.125 | 0.114 |
| Llama-3.2-3B | 0.096 | 0.093 | 0.086 |
| Llama-3.1-8B | 0.0925 | 0.0866 | 0.080 |
KL divergence measures how closely the quantized model’s output distribution follows the original model. It is the metric most directly aligned with the paper’s starting objective, and it is where the OptRot family shows its most consistent advantage.
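For readers who want to reproduce this kind of measurement, the sketch below computes a token-averaged KL divergence from two sets of logits. The direction (reference relative to quantized), the averaging over tokens, and the use of nats are assumptions; the paper’s exact evaluation protocol may differ. The logits are synthetic, and `mean_kl` is a hypothetical helper name.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(ref_logits, quant_logits):
    """Mean over tokens of KL(p_ref || p_quant), in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1))

rng = np.random.default_rng(0)
ref = rng.normal(size=(5, 100))                       # synthetic (tokens, vocab) logits
noisy = ref + rng.normal(scale=0.3, size=ref.shape)   # stand-in for quantization error

print(mean_kl(ref, ref))     # 0.0
print(mean_kl(ref, noisy))   # positive
```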
Accuracy and perplexity also improve in many cases, but leadership varies by model and variant. On Llama-3.1-8B, for example, OptRot+ raises average accuracy from SpinQuant’s 69.46 to 70.16, close to the FP16 model’s 70.29. On other models, a competing method may retain slightly better task accuracy despite having worse KL divergence.
The correct reading is not “OptRot wins every column.” It is that reducing the proposed error proxy consistently produces a quantized model that more closely resembles the original, with downstream gains that are usually—but not perfectly—aligned.
Qwen results show transfer, not universality
On Qwen3 models, OptRot achieves lower KL divergence than both QuaRot and SpinQuant at every tested size:
| Model | QuaRot KL ↓ | SpinQuant KL ↓ | OptRot KL ↓ |
|---|---|---|---|
| Qwen3-1.7B | 0.089 | 0.091 | 0.083 |
| Qwen3-4B | 0.064 | 0.064 | 0.059 |
| Qwen3-8B | 0.043 | 0.041 | 0.039 |
The additional Qwen2 appendix results broadly reinforce the finding. This supports the claim that the objective is not tied exclusively to one Llama implementation.
It remains evidence across related transformer families, not permission to declare the problem solved for every architecture, quantizer, and deployment stack.
The difficult settings strengthen the weight-only case
The three-bit weight-only tests are a useful stress test. Under more aggressive compression, OptRot often creates a larger advantage over SpinQuant, reducing the gap between the compressed model and FP16.
Removing the online rotations also preserves OptRot’s advantage over SpinQuant and QuaRot across the tested Llama models. This matters operationally because online rotations can introduce inference overhead. The result suggests that the learned fusible rotations are doing substantive work rather than merely decorating a stronger inherited pipeline.
Neither test reports production throughput, memory usage, or wall-clock rotation-learning cost. They establish robustness of model quality, not an ROI spreadsheet.
At W4A4, Better Weights Can Produce a Worse System
A tempting interpretation is that reducing weight outliers should improve every low-bit deployment. The activation-quantization experiments reject it.
At W4A8—four-bit weights and eight-bit activations—OptRot remains competitive with SpinQuant. It achieves lower WikiText perplexity across all three tested Llama models and generally improves KL divergence.
At W4A4, the pattern reverses. OptRot often performs worse than SpinQuant and, on several measures, worse than fixed Hadamard rotations.
The KL divergence results make the trade-off visible:
| Model | SpinQuant W4A4 KL ↓ | OptRot W4A4 KL ↓ |
|---|---|---|
| Llama-3.2-1B | 0.393 | 0.430 |
| Llama-3.2-3B | 0.308 | 0.362 |
| Llama-3.1-8B | 0.273 | 0.348 |
Why does a better weight rotation harm the complete system?
A rotation changes both the weights and the activation coordinates that interact with them. OptRot explicitly optimizes weight outliers. When activations remain at eight bits, activation error is limited enough that improving the weights still helps overall performance. At four bits, activation outliers become dominant. A rotation favorable to weight quantization can make the activation distribution harder to represent.
The failure is therefore not an inconvenient benchmark exception. It follows from the objective.
OptRot asks: Which equivalent coordinate system makes the weights easiest for GPTQ to quantize?
W4A4 asks a different question: Which coordinate system jointly balances weight and activation error under extreme compression?
Those questions can have different answers. Optimizing one side harder is not automatically a compromise. Sometimes it is merely choosing a side.
Where the Proof Stops and Practical GPTQ Begins
The paper’s theoretical story is useful, but it should not be made tidier than it is.
The rigorous GPTQ error bound applies to a modified method the authors call GPTQS. GPTQS uses a constrained LDL decomposition and stochastic rounding so that corrected weights remain inside the quantization range with high probability.
Practical GPTQ instead uses the ordinary LDL structure and clamps corrected values. The appendix shows a substantial gap between the constrained LDL assumed in the theory and the true LDL used in practice. The authors explicitly note that substituting one for the other does not preserve the theoretical bound.
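As a reference point for what “ordinary LDL” means, the sketch below factors a small synthetic Hessian as $H = L D L^\top$ with a unit lower-triangular $L$, obtained from NumPy’s Cholesky factor. The lower-triangular convention and the damping constant are assumptions made for illustration; GPTQ implementations and the paper may use a different ordering or an upper-triangular form, and the constrained decomposition used by GPTQS is not shown.

```python
import numpy as np

def ldl_from_cholesky(H):
    """Return unit lower-triangular L and diagonal d with H = L @ diag(d) @ L.T."""
    C = np.linalg.cholesky(H)     # H = C @ C.T with C lower triangular
    c = np.diag(C)
    L = C / c                     # divide column j by C[j, j] so the diagonal becomes 1
    d = c ** 2
    return L, d

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 200))                 # synthetic activations
H = X @ X.T / 200 + 1e-3 * np.eye(6)          # damping keeps H positive definite

L, d = ldl_from_cholesky(H)
print(np.allclose(L @ np.diag(d) @ L.T, H))   # True
print(np.allclose(np.diag(L), 1.0))           # True
```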
There is another approximation earlier in the chain. Model-level KL divergence is replaced by a sum of layerwise reconstruction errors using second-order, block-diagonal, and factorization assumptions.
The resulting logical structure is therefore:
- The theory identifies weight incoherence as a principled driver of quantization error under analyzable conditions.
- OptRot turns that insight into a cheap proxy objective.
- Layerwise experiments show that the proxy lowers incoherence and practical GPTQ reconstruction error.
- End-to-end experiments show that those improvements generally reduce KL divergence and preserve task performance.
The proof motivates the method. The practical experiments validate its transfer to ordinary GPTQ. They are related claims, not identical ones.
The RTN appendix reinforces this boundary. OptRot often achieves lower incoherence and better layerwise SNR than an RTN-trained SpinQuant variant. Yet SpinQuant W4 produces better downstream results because its end-to-end objective can learn compensating errors across weights and layers.
A layerwise proxy cannot capture every useful interaction. It is attractive because it is cheap, stable, and aligned with GPTQ—not because proxies have finally defeated reality.
OptRot Is a Deployment Decision, Not a Leaderboard Trophy
The paper directly establishes improved model-quality results in several tested GPTQ settings. It does not measure serving throughput, end-to-end compression cost, energy consumption, or monetary ROI.
The business interpretation must therefore separate demonstrated results from operational inference.
What the paper directly shows
- OptRot learns weight-focused fusible rotations without calibration data during rotation learning.
- It generally lowers weight incoherence and improves practical GPTQ layerwise SNR.
- It consistently lowers KL divergence against major rotation baselines in tested four-bit weight-only Llama and Qwen models.
- It remains competitive at W4A8.
- It is not a reliable default at W4A4.
- OptRot+ offers modest additional gains at additional computational and data cost.
What Cognaptus infers for deployment teams
- A model-serving team already using GPTQ can treat OptRot as a comparatively lean preprocessing candidate before adopting a heavier quantization-aware rotation-learning pipeline.
- Because the learned rotations are fusible, their benefits need not introduce additional inference computation.
- The top-50 variant and successful no-online-rotation test suggest possible ways to limit preparation cost, although the paper does not quantify those savings.
- Lower KL divergence can be valuable when compressed models must remain behaviorally close to an approved reference model, even when a small benchmark-accuracy difference is operationally irrelevant.
A practical selection framework
| Intended deployment | Reasonable first candidate | Reason | Boundary |
|---|---|---|---|
| Four-bit or three-bit weight-only GPTQ | OptRot | Strongest and most consistent evidence; data-free rotation learning | Validate on domain tasks and the production GPTQ implementation |
| W4A8 GPTQ deployment | Benchmark OptRot against SpinQuant | OptRot remains competitive and often improves perplexity or KL | Joint weight-and-activation behavior is less predictable |
| W4A4 deployment | Activation-aware method rather than OptRot by default | Activation outliers dominate and OptRot often degrades results | Requires configuration-specific evaluation |
| Weight-only RTN with sufficient training budget | Quantization-aware SpinQuant W4 | End-to-end objective outperforms OptRot downstream in the paper | RTN remains weaker than GPTQ overall in the tested results |
| Quality-sensitive GPTQ pipeline with available Hessian resources | OptRot+ | Modest additional KL improvements | Extra computation may not justify the incremental gain |
For most tested GPTQ weight-only deployments, the authors recommend the simpler OptRot rather than OptRot+. That recommendation is sensible. OptRot+ is the method to test when a small quality improvement matters enough to justify Hessian-dependent rotation learning—not merely because a plus sign has appeared in the methods table.
The Remaining Boundaries Are Specific, Not Ceremonial
Several limitations materially affect adoption.
First, the theory does not directly bound ordinary GPTQ with its true LDL and clamping behavior. Practical effectiveness is supported empirically.
Second, the evaluation concentrates on Llama and Qwen transformer families, scalar GPTQ and RTN quantization, and a limited set of bit-width configurations. Other architectures and quantization systems may respond differently.
Third, “data-free” does not eliminate calibration data from GPTQ itself. It removes data from the rotation-learning stage.
Fourth, the paper reports model-quality metrics rather than systems metrics. It does not establish how long rotation learning takes relative to alternatives, how much GPU memory is saved in practice, or whether lower KL divergence produces measurable serving-cost reductions.
Finally, the results show sensitivity to the exact proxy. OptRot-v2, which more closely mirrors the squared term in the theoretical bound, performs poorly on some configurations, including Llama-3.1-8B. Matching a bound more literally is not automatically better optimization. Mathematics occasionally declines to reward enthusiasm.
These boundaries do not erase the contribution. They identify where engineering validation must begin.
Rotate the Problem Before Paying to Solve It
OptRot’s most useful lesson is not that every model should be rotated before quantization. The W4A4 results make that interpretation untenable.
The stronger lesson is that expensive deployment problems often contain a cheaper intermediate variable worth optimizing.
Instead of repeatedly running GPTQ inside rotation learning, OptRot identifies a geometric property—weight incoherence—that can be improved more directly. Instead of treating the largest weights as unavoidable facts, it changes the coordinate system in which “largest” is measured. Instead of assuming more calibration and more simulation must produce better compression, it tests whether a principled proxy is sufficient.
For GPTQ weight quantization, it often is.
The grand piano has not disappeared. It has been rotated until it fits through the door.
Cognaptus: Automate the Present, Incubate the Future.
[1] Advait Gadhikar, Riccardo Grazzi, and James Hensman, “OptRot: Mitigating Weight Outliers via Data-Free Rotations for Post-Training Quantization,” arXiv:2512.24124.