TL;DR for operators

If your fine-tuned model gets better on the training task while quietly becoming worse outside it, the problem may not be that the model “lost intelligence”. It may have rotated its useful internal directions away from broadly generalizable behaviour.

The paper behind this article studies SFT followed by PPO-style RL on two open LLMs using a controlled arithmetic benchmark, then inspects the weight matrices through singular-value decomposition.[1] The pattern is clean enough to be operationally interesting: OOD performance peaks early during SFT, falls as SFT continues, and can be substantially restored by RL when the SFT checkpoint is only moderately degraded. But if SFT pushes the model too far into a specialized regime, RL is no longer a reliable rescue crew. Apparently even reinforcement learning has limits. Who knew.

The mechanism is the useful part. The authors find that singular values barely move; singular vectors rotate. In less algebraic English: the model’s representational capacity is not obviously collapsing. The directions through which that capacity is expressed are being reoriented.

For post-training teams, the immediate playbook is not “run more PPO”. It is:

  1. track ID and OOD probes separately during SFT;
  2. checkpoint around the OOD peak, not only at the best ID score;
  3. inspect directional drift before committing to RL;
  4. try targeted singular-vector restoration by layer or rank as a cheap diagnostic;
  5. treat RL as a restoration and rebalancing stage, not a generic capability generator.

The boundary: this is not a universal law of all reasoning, all RLHF, or all domains. The evidence is strongest for the paper’s controlled 24-point arithmetic setting, two model families, and full-parameter SFT/RL pipeline. It is still a sharper mental model than the usual “SFT memorizes, RL generalizes” slogan, which is true-ish in the way a weather forecast saying “outside happens” is true-ish.

The expensive part is not RL; it is knowing what RL is fixing

The industry likes a simple post-training story. First you supervise the model with demonstrations. Then you reinforce it with rewards. The first stage teaches the format; the second stage teaches reasoning. Lovely. Clean. Board-slide friendly.

The paper complicates that story. It does not deny that RL can improve OOD performance. In the authors’ experiments, it clearly can. What it disputes is the interpretation that RL is primarily discovering brand-new reasoning ability after SFT has done its clerical warm-up. The better reading is more conservative and more useful: SFT can damage general-purpose behaviour while improving target-task fit, and RL can partially undo that damage by moving the model’s feature directions back into a more useful arrangement.

That distinction matters because it changes the operating question.

The naive question is:

How much RL do we need to add after SFT?

The better question is:

What did SFT rotate out of alignment, and can we repair that before spending a full RL budget?

This is why the paper’s spectral analysis is not decorative maths. It is the difference between treating PPO as a ritual and treating post-training as a diagnosable engineering process.

SFT first helps generalization, then starts selling it for ID accuracy

The experimental setting is deliberately narrow. The authors use GeneralPoints, a 24-point card-game environment where the model must form an equation that reaches a target number using four card values. The OOD variant changes how face cards are interpreted: instead of treating J, Q, and K as 10, it evaluates them as 11, 12, and 13. That is not “the whole of reasoning”, but it is a useful controlled stress test because the model cannot simply lean on the exact training convention.
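To make the shift concrete, here is a minimal sketch of an answer checker for this kind of task. It is not the GeneralPoints implementation: the names `ID_RULE`, `OOD_RULE`, `card_value`, and `is_correct` are hypothetical, and it assumes number cards arrive as digit strings and that each card must be used exactly once.

```python
import ast
import operator
from fractions import Fraction

# Face-card conventions described in the text; these names are hypothetical.
ID_RULE = {"J": 10, "Q": 10, "K": 10}    # training convention
OOD_RULE = {"J": 11, "Q": 12, "K": 13}   # shifted evaluation convention

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def card_value(card, rule):
    """Map a card label such as '7' or 'K' to its numeric value."""
    return rule[card] if card in rule else int(card)


def _evaluate(node, used):
    """Evaluate an arithmetic AST exactly, recording every integer literal."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def is_correct(cards, expression, rule, target=24):
    """True if the expression uses each card value once and hits the target."""
    used = []
    try:
        result = _evaluate(ast.parse(expression, mode="eval").body, used)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False
    expected = sorted(card_value(c, rule) for c in cards)
    return sorted(used) == expected and result == target


hand = ["J", "4", "2", "2"]
id_ok = is_correct(hand, "(10+2)*(4-2)", ID_RULE)     # J read as 10
ood_stale = is_correct(hand, "(10+2)*(4-2)", OOD_RULE)  # same answer, J is now 11
ood_ok = is_correct(hand, "11*2+4-2", OOD_RULE)
```

The same hand and the same memorized answer pass under one rule and fail under the other, which is the whole point of the OOD variant.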

The two-stage pipeline is standard enough: start with base models, run SFT, then apply PPO-style RL from selected SFT checkpoints. The models are Llama-3.2-11B and a Qwen-2.5 model that the paper labels Qwen-2.5-7B in the result figures and Qwen-2.5-8B in one setup sentence. I will follow the result figures and call it Qwen-2.5-7B.

The first main evidence is the checkpoint comparison.

| Model | Stage | ID accuracy | OOD accuracy | Interpretation |
| --- | --- | --- | --- | --- |
| Qwen-2.5-7B | Intermediate SFT | 26.07% | 19.67% | Early SFT improves OOD before specialization dominates |
| Qwen-2.5-7B | Full SFT | 43.59% | 17.09% | ID improves while OOD falls |
| Qwen-2.5-7B | RL after SFT | 40.60% | 19.66% | RL nearly restores the early OOD peak, with some ID trade-off |
| Llama-3.2-11B | Intermediate SFT | 24.79% | 17.52% | Early SFT again produces the OOD peak |
| Llama-3.2-11B | Full SFT | 68.38% | 8.97% | Strong ID specialization coincides with severe OOD loss |
| Llama-3.2-11B | RL after SFT | 71.79% | 16.24% | RL recovers most OOD loss; ID does not fall in this plotted result |

Two details are worth separating.

First, the central pattern is not just “SFT bad, RL good”. SFT initially helps OOD performance. The problem appears later, when continued SFT keeps improving the in-distribution task while the OOD variant deteriorates. In Figure 2, the same pattern appears in loss space: ID train and test loss decline smoothly, while OOD loss drops early and then rises.

Second, the ID trade-off is not uniform in the plotted numbers. For Qwen, RL restores OOD while ID falls from the full-SFT peak. For Llama, the figure shows ID increasing after RL. So the safest conclusion is not “RL always sacrifices ID”. It is that ID and OOD are not moving under the same incentives, and an ID-only SFT curve is a poor instrument panel. Flying a model by that dashboard is how teams discover “generalization regression” after deployment and then pretend this was unforeseeable.
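A two-needle instrument panel is not hard to build. The sketch below is illustrative only: `Checkpoint` and `pick_checkpoints` are hypothetical names, and the two data points are the Qwen-2.5-7B SFT rows from the table above rather than a full training history.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    name: str
    id_acc: float   # in-distribution accuracy, percent
    ood_acc: float  # out-of-distribution accuracy, percent


def pick_checkpoints(history):
    """Return (best-ID checkpoint, best-OOD checkpoint) from an SFT run."""
    best_id = max(history, key=lambda c: c.id_acc)
    best_ood = max(history, key=lambda c: c.ood_acc)
    return best_id, best_ood


# Qwen-2.5-7B SFT rows from the table above.
qwen_sft = [
    Checkpoint("intermediate SFT", 26.07, 19.67),
    Checkpoint("full SFT", 43.59, 17.09),
]
best_id, best_ood = pick_checkpoints(qwen_sft)
print(best_id.name, "|", best_ood.name)  # prints: full SFT | intermediate SFT
```

When the two selections disagree, that disagreement is the signal: keep both checkpoints and decide later which one RL should inherit.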

The mechanism: singular values stay home while singular vectors wander

The paper’s mechanism-first claim is spectral.

For a weight matrix $W$, singular-value decomposition writes:

$$ W = U \Sigma V^\top $$

Here, $\Sigma$ contains singular values: the magnitudes or “stretching strengths” of the matrix along principal directions. The columns of $U$ and $V$ are singular vectors: the directions themselves.

A crude but useful analogy: singular values tell you how powerful the channels are; singular vectors tell you where those channels point. The paper argues that SFT and RL mostly alter where the channels point, not how powerful they are.
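The analogy can be checked on a toy matrix. This is a minimal NumPy sketch, not the paper's analysis code: `W` is a random stand-in for a weight matrix, and `R` is an arbitrary orthogonal rotation applied to its input side.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))          # stand-in for a weight matrix

# Thin SVD: W = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Build a random orthogonal matrix R via QR and rotate the input side of W.
R, _ = np.linalg.qr(rng.standard_normal((4, 4)))
W_rot = W @ R
U2, S2, Vt2 = np.linalg.svd(W_rot, full_matrices=False)

# Channel strengths are unchanged by the rotation...
same_strengths = np.allclose(S, S2)
# ...but the right singular vectors point elsewhere (abs() ignores sign flips).
moved_directions = not np.allclose(np.abs(Vt), np.abs(Vt2))
```

Both flags come out true: the rotated matrix has an identical spectrum and different directions, which is the situation the paper reports at much larger scale.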

The authors examine key matrices in transformer blocks, especially attention projections such as $W_q$, $W_k$, and $W_v$, and also discuss embedding and output-head geometry. In the plotted singular-value analysis for the first self-attention layer of Llama-3.2-11B, the “before” and “after” singular-value curves are nearly indistinguishable across both SFT and RL. The reported singular-value differences are small, roughly zero-centred, and on the order of 0.005.

That is the non-event that becomes the event. If performance changes materially while singular values barely move, then magnitude is unlikely to be the main story.

The authors then measure principal angles between singular-vector subspaces before and after fine-tuning. These angles are a directional drift measure: near zero means the direction is preserved; near $90^\circ$ means strong misalignment. Their principal-angle plots show leading singular directions starting around 25–30 degrees and tail directions rising toward 90 degrees. SFT and RL produce strikingly similar rotation profiles.
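For teams that want to reproduce this kind of drift measure, a minimal sketch follows. It uses the standard definition of principal angles between subspaces (the arccosines of the singular values of $Q_a^\top Q_b$). Whether the paper computes exactly this quantity, and over which ranks, is an assumption here; `principal_angles_deg` and `top_k_drift` are hypothetical names.

```python
import numpy as np


def principal_angles_deg(A, B):
    """Principal angles (degrees, ascending) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)   # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)   # orthonormal basis for span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))


def top_k_drift(W_before, W_after, k):
    """Angles between the top-k left singular subspaces of two matrices."""
    U0, _, _ = np.linalg.svd(W_before, full_matrices=False)
    U1, _, _ = np.linalg.svd(W_after, full_matrices=False)
    return principal_angles_deg(U0[:, :k], U1[:, :k])
```

Calling `top_k_drift(W_base, W_sft, k)` on matching matrices from two checkpoints gives one angle profile per matrix. An unchanged matrix returns angles near zero, and mutually orthogonal subspaces return 90 degrees.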

So the mechanism is not “SFT crushes the spectrum and RL inflates it again”. The mechanism is closer to:

SFT and RL move the model through rotations of representational directions, while much of the spectral magnitude structure remains intact.

That has an important practical implication. If the model’s broad capacity has not disappeared, recovery may not require relearning everything. It may require reorientation.

The paper’s tests are not all making the same kind of claim

The distinction matters because a business interpretation should not flatten every figure into “proof”. The paper uses different tests for different purposes.

| Test or section | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Figure 1 checkpoint comparison | Main evidence | OOD peaks early during SFT, drops after continued SFT, and can be restored by RL | That RL universally creates new reasoning capability |
| Figure 2 ID/OOD loss curves | Supporting diagnostic | ID and OOD trajectories diverge during SFT; loss proxies reveal forgetting dynamics cheaply | Full rollout accuracy at every step |
| Figure 3 singular-value comparison | Mechanism measurement | Singular values in selected Q/K/V matrices remain highly stable after SFT and RL | That no other matrix or layer changes matter |
| Principal-angle analysis | Mechanism measurement | Singular-vector directions rotate substantially while values remain stable | The full causal explanation for why SFT and RL share similar rotation profiles |
| Figure 5 layer-wise restoration | Intervention / ablation | Restoring singular-vector directions in selected layers changes ID/OOD behaviour differently | A production-ready universal reset recipe |
| Figure 6 rank-wise restoration | Intervention / ablation | Top-rank singular directions appear especially important for OOD recovery | That all OOD knowledge lives only in top ranks |
| Figure 7 imposing SFT directions on RL model | Causal validation | Forcing RL models back toward poorer SFT directions damages OOD performance | That RL’s entire benefit is exhausted by this one geometric operation |
| Appendix A rotation argument | Theoretical sketch / explanatory support | Orthogonal rotations can preserve singular values while changing vectors at low parameter cost | A complete proof of real transformer training dynamics |

The last row deserves attention. The appendix gives a toy argument for why optimizers may “prefer” rotations: under small-step updates, a pair of matrices can be transformed by approximately orthogonal changes that preserve the composed function to first order, keep singular values stable, and rotate singular vectors. This is a useful explanatory sketch. It is not the same as proving that all useful fine-tuning dynamics in large transformers are reducible to orthogonal gauge drift. The paper itself treats the shared SFT/RL rotation profile as an open question. Good. Humility occasionally survives peer review.
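The flavour of that toy argument can be written in three lines. What follows is an illustration under simplifying assumptions (two consecutive linear maps $W_1$ and $W_2$ with no nonlinearity between them, and an orthogonal $R$), not a reproduction of the appendix. Inserting $R$ and its transpose between the two maps leaves their product untouched:

$$ (W_2 R^\top)(R W_1) = W_2 (R^\top R) W_1 = W_2 W_1 $$

If $W_1 = U \Sigma V^\top$, then the rotated factor keeps its singular values and only its left singular vectors move:

$$ R W_1 = (R U)\, \Sigma\, V^\top $$

For a small step $R \approx I + \epsilon A$ with $A^\top = -A$, orthogonality holds to first order, which is where the "approximately orthogonal" wording comes from:

$$ (I + \epsilon A)^\top (I + \epsilon A) = I + \epsilon (A^\top + A) + \epsilon^2 A^\top A = I + \epsilon^2 A^\top A $$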

Directional restoration turns the mechanism into an intervention

The strongest part of the paper is not that it observes rotation. Observation is nice; intervention pays rent.

The authors perform “directional re-alignment” experiments: they replace the post-SFT singular-vector directions with directions from the original pretrained model while keeping SFT-learned singular values. In effect, they ask whether undoing SFT-induced rotations can recover OOD performance.
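For a single weight matrix, the operation described above can be sketched in a few lines of NumPy. This is one reading of the description, not the authors' code: `restore_directions` is a hypothetical name, and pairing the i-th SFT singular value with the i-th base direction is an assumption.

```python
import numpy as np


def restore_directions(W_base, W_sft):
    """Combine base-model singular vectors with SFT-learned singular values."""
    U_base, _, Vt_base = np.linalg.svd(W_base, full_matrices=False)
    S_sft = np.linalg.svd(W_sft, compute_uv=False)
    # Broadcasting scales column i of U_base by S_sft[i],
    # i.e. U_base @ diag(S_sft) @ Vt_base.
    return (U_base * S_sft) @ Vt_base
```

If fine-tuning had only rotated and rescaled a matrix, this would hand back the base matrix with the new scale. On real checkpoints the result is a hybrid, and the interesting question is what it does to ID and OOD accuracy.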

Layer-wise restoration produces a useful split. Restoring intermediate layers, such as layers 10–20 for Llama and 10–15 for Qwen, is most damaging to ID performance. That suggests SFT’s task-specific adaptations concentrate heavily in the middle of the model. Restoring shallower and deeper layers has a stronger effect on OOD behaviour, suggesting those regions may be more important for preserving broader functional alignment.

Rank-wise restoration sharpens the point. Restoring top singular directions has the largest positive impact on OOD accuracy. For Qwen, OOD performance peaks after restoring roughly 512 directions and then stabilizes. Llama is noisier but follows the broad trend that more directional restoration can improve OOD performance.
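A rank-wise sweep is a natural extension of the same idea. The sketch below is a hedged guess at how such a sweep could be run: `restore_top_k` and `sweep_ranks` are hypothetical names, `score_fn` stands in for whatever OOD probe a team already has, and mixing bases by column is an assumption rather than the paper's stated procedure.

```python
import numpy as np


def restore_top_k(W_base, W_sft, k):
    """Swap only the k leading singular directions back to the base model's."""
    U_b, _, Vt_b = np.linalg.svd(W_base, full_matrices=False)
    U_s, S_s, Vt_s = np.linalg.svd(W_sft, full_matrices=False)
    U_mix, Vt_mix = U_s.copy(), Vt_s.copy()
    U_mix[:, :k] = U_b[:, :k]     # leading left directions from the base model
    Vt_mix[:k, :] = Vt_b[:k, :]   # leading right directions from the base model
    # The mixed bases are no longer exactly orthonormal; this is a diagnostic merge.
    return (U_mix * S_s) @ Vt_mix


def sweep_ranks(W_base, W_sft, ranks, score_fn):
    """Score the partially restored matrix at each rank in `ranks`."""
    return {k: score_fn(restore_top_k(W_base, W_sft, k)) for k in ranks}
```

With `k=0` the function returns the SFT matrix unchanged, and with `k` equal to the full rank it swaps every direction. A plateau in the sweep, like the one reported for Qwen around 512 directions, marks the point where restoring further ranks stops paying.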

Then comes the useful causal check. The authors take a high-performing RL-tuned model and impose the poorer SFT feature directions on it. For Llama-3.2-11B, OOD accuracy drops from 16.2% to 10.6%. For Qwen-2.5-7B, the drop is smaller, from 19.8% to 19.0%. The contrast is important: the Llama result strongly supports the directional-recovery story; the Qwen result suggests model-specific geometry matters.

This is where the article’s title earns its rent. RL fine-tuning, in this evidence, is less inventor than spin doctor. It does not necessarily write new capabilities into the model from blank space. It often rotates the representational story until old capabilities become usable again.

The business value is cheaper diagnosis, not cheaper mythology

For companies fine-tuning LLMs, the default workflow is often expensive in the wrong places. Teams spend heavily on RL infrastructure while underinvesting in checkpoint diagnostics. They monitor target-task loss, declare victory, and later discover that the model has become brittle outside the supervised distribution. Then PPO is summoned like an exorcist.

A better workflow follows directly from the paper.

| Operational problem | Better diagnostic | Lower-cost intervention before RL | Business meaning |
| --- | --- | --- | --- |
| ID accuracy improves while external tasks regress | Maintain an OOD probe during SFT | Stop or branch from the early OOD peak | Avoid paying RL to repair damage you could have avoided |
| OOD regression appears after full SFT | Compare singular-vector drift from base checkpoint | Test UV restoration on selected layers/ranks | Determine whether the issue is reorientation rather than missing data |
| RL restores some OOD but not enough | Inspect whether the SFT checkpoint is already too specialized | Restart RL from an earlier SFT checkpoint | Better SFT in, better RL out |
| Fine-tuning budget is constrained | Use loss proxies and spectral checks before rollout-heavy evaluation | Try shallow or low-rank restoration as triage | Reserve PPO for cases where cheap repair fails |
| Domain deployment requires reliability | Track ID and OOD separately through the pipeline | Keep multiple SFT checkpoints, not just the final one | Prevent the “best checkpoint” from meaning “best overfit checkpoint” |

The ROI argument is not that SVD repair will replace RL. That would be a neat overstatement, and neat overstatements are how engineering teams become archaeology departments. The argument is narrower: spectral diagnostics can tell you whether RL is likely repairing directional drift, and directional interventions can serve as a fast pre-RL test.

If a shallow reset or top-rank UV merge recovers much of the OOD loss, you have learned something valuable. You may still run RL, but now you are running it with a hypothesis. That is cheaper than running it with vibes.
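"Much of the OOD loss" is easier to argue about as a number. The helper below is a hypothetical convenience, not a metric from the paper: it reports what fraction of the drop from the early OOD peak a repair step wins back. Since this article does not quote accuracies for the restoration experiments, the example plugs in the RL results from the checkpoint table as the "repaired" values.

```python
def ood_recovery(peak, degraded, repaired):
    """Fraction of the OOD drop (peak -> degraded) won back by a repair step."""
    lost = peak - degraded
    if lost <= 0:
        raise ValueError("no OOD degradation to recover")
    return (repaired - degraded) / lost


# OOD accuracies (percent) from the checkpoint table: early peak, full SFT, after RL.
llama = ood_recovery(peak=17.52, degraded=8.97, repaired=16.24)
qwen = ood_recovery(peak=19.67, degraded=17.09, repaired=19.66)
print(f"Llama: {llama:.1%}, Qwen: {qwen:.1%}")  # prints: Llama: 85.0%, Qwen: 99.6%
```

Running the same calculation on a cheap restoration experiment gives a direct comparison: if a shallow reset already reaches a similar fraction, the RL budget has a hypothesis to test rather than a hope to fund.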

What this changes in the SFT-versus-RL debate

The paper sits in a live argument: does SFT memorize while RL generalizes? The answer from this work is not “no”. It is “that slogan skips the mechanism”.

SFT can improve OOD early because it teaches the model task format and useful reasoning behaviours. Then continued SFT can over-specialize, pulling the model into directions that fit the supervised distribution but degrade the OOD variant. RL can counteract some of that drift, especially when the SFT checkpoint remains close enough to a recoverable representation.

That means RL’s advantage may depend less on mystical exploration and more on initialization geometry. Start from a decent SFT checkpoint and RL has something to restore. Start from a heavily overfit checkpoint and RL may spend its budget pushing on a representation that has already moved into a less useful regime.

For business teams, the lesson is uncomfortable but useful: the quality of the SFT checkpoint is not a prelude to the “real” RL stage. It is one of the main determinants of whether RL will work.

This also reframes KL control. KL regularization is often described as keeping the policy from drifting too far from the reference model. In the paper’s geometry, that sounds less like bureaucratic caution and more like preserving useful directions. Staying near the base model is not always cowardice. Sometimes it is asset preservation.
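For reference, the generic KL-regularized objective that this paragraph has in mind is usually written as follows. This is the standard textbook form, with reward $r$, policy $\pi_\theta$, reference policy $\pi_{\mathrm{ref}}$, and penalty weight $\beta$. The paper's exact objective and coefficient are not restated here, so treat the notation as an assumption.

$$ \max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \Big] $$

A larger $\beta$ keeps the policy closer to the reference model, which in the geometric reading above amounts to a stricter budget on how far the useful directions are allowed to rotate.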

Boundaries: useful evidence, not a universal post-training law

The paper is valuable because it is mechanistic. It is not universal because it is narrow.

The task is a controlled arithmetic environment, not open-ended software engineering, legal drafting, multi-turn tool use, or financial analysis. The OOD shift is crisp: J/Q/K change from 10 to 11/12/13. That makes the experiment interpretable, but it also means the results should not be casually promoted into a grand theory of all reasoning models.

The model coverage is also limited. Two open models are enough to show an interesting pattern; they are not enough to establish invariance across architectures, scales, data mixtures, reward models, and RL algorithms. The RL method is PPO-style, so conclusions should not be blindly transferred to every preference-optimization method wearing an RL-adjacent jacket.

The manuscript itself is also clearly early. It contains internal TODO comments, inconsistent Qwen naming, and open-ended future-work statements about validating head/tail singular-value roles across more tasks. That does not invalidate the result. It does mean the right managerial response is “add this diagnostic to the experiment plan”, not “rewrite the post-training strategy manual in permanent ink”.

The most important limitation is causal scope. The intervention experiments strongly suggest singular-vector directions matter. They do not prove that singular-vector rotation is the only mechanism behind RL’s gains. In a transformer, “only mechanism” is usually a phrase said immediately before a model embarrasses you.

The conclusion: train less blindly, rotate more deliberately

The paper’s best contribution is a replacement mental model.

Old model:

SFT teaches the model. RL makes it reason.

Better model:

SFT changes both task fit and representation geometry. If pushed too far, it can rotate general-purpose directions away from useful behaviour. RL can partially restore those directions, but its success depends heavily on the checkpoint it inherits.

That replacement matters. It turns post-training from a two-stage faith ritual into a measurable process: track OOD early, preserve intermediate checkpoints, inspect spectral drift, test directional repair, and only then decide whether PPO deserves the compute.

RL fine-tuning is not a panacea, and it is not a mirage. It is often a very expensive way to rotate the model back toward capabilities it already had before SFT got overenthusiastic. That is not as glamorous as “RL invents reasoning”. It is more useful, which is a terrible fate for a slogan but a decent outcome for an engineering team.

Cognaptus: Automate the Present, Incubate the Future.


  1. Hangzhan Jin, Sicheng Lv, Sifan Wu, and Mohammad Hamdaqa, “RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs,” arXiv:2508.16546, 2025. https://arxiv.org/abs/2508.16546