Fine-Tuning Isn’t Just Supervised: Why SFT Is Really RL in Disguise

TL;DR for operators

Fine-tuning on curated examples is usually sold as the boring, stable cousin of reinforcement learning. The paper behind this article says that is too neat. When a team filters examples into “good” and “not good,” it has already created a sparse reward function. Standard supervised fine-tuning on the surviving examples is therefore not outside reinforcement learning; it is optimising a lower bound on an RL objective, only without admitting it at the meeting.

The useful part is not the philosophical label. It is the engineering consequence. If ordinary SFT is a loose RL approximation, then better SFT should tighten the approximation. Qin and Springenberg propose two small variants: SFT(Q), which samples examples according to quality scores, and importance-weighted SFT (iw-SFT), which gives more weight to trajectories that matter more under the evolving policy.¹ With Qwen2.5-32B-Instruct fine-tuned on the S1.1K reasoning dataset, iw-SFT reaches 66.7% on AIME 2024 and 64.1% on GPQA Diamond, improving over comparable SFT baselines while avoiding budget forcing at inference.

For business teams that post-train domain models, the message is practical: curated data is not just “clean data.” It is a policy signal. Keep the reference model or a defensible approximation. Preserve quality scores where possible. Treat filtering decisions as reward design. Then consider importance weighting before building a full RL stack, a reward model, and a shrine to distributed debugging.

The boundary is equally practical. This does not make RL obsolete. The paper tests narrow reasoning and control settings, uses a 32B model trained on 8 H100 GPUs for under 18 hours, and still does not beat the strongest closed or heavily RL-trained models. The method is best read as a cheaper post-training lever for teams with strong curated trajectories, not as a universal replacement for reinforcement learning.

The familiar fine-tuning workflow is already making reward decisions

A normal enterprise post-training workflow looks harmless enough. Collect examples. Filter out the bad ones. Keep the high-quality completions. Fine-tune the model to imitate those completions. Call it supervised learning, because nobody wants to explain policy gradients before lunch.

But the filtering step is not neutral. It says: this trajectory gets kept; that one gets discarded. In reinforcement-learning language, this is a sparse binary reward. A trajectory that passes the filter receives reward 1. A trajectory that fails receives reward 0. The reward may come from a human judge, a benchmark checker, a code unit test, a mathematical answer verifier, or a domain expert’s slightly weary spreadsheet. The source varies. The structure is the same.
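A minimal sketch of that structure, assuming a toy `Trajectory` type and a hypothetical `answer_is_42` checker (neither comes from the paper): the keep/discard decision is literally a reward of 1 or 0, and SFT only ever sees the 1s.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    prompt: str
    completion: str


def sparse_reward(traj: Trajectory, passes_filter: Callable[[Trajectory], bool]) -> int:
    """Binary reward implied by a keep/discard decision: 1 if kept, 0 otherwise."""
    return 1 if passes_filter(traj) else 0


def curate(
    trajs: List[Trajectory], passes_filter: Callable[[Trajectory], bool]
) -> Tuple[List[Trajectory], List[int]]:
    """Return (kept, rewards). SFT trains on `kept`; `rewards` is what usually gets thrown away."""
    rewards = [sparse_reward(t, passes_filter) for t in trajs]
    kept = [t for t, r in zip(trajs, rewards) if r == 1]
    return kept, rewards


def answer_is_42(traj: Trajectory) -> bool:
    """Hypothetical verifier: accept completions that end with the known answer."""
    return traj.completion.strip().endswith("42")


candidates = [
    Trajectory("6*7?", "6*7 = 42"),
    Trajectory("6*7?", "6*7 = 36"),
    Trajectory("6*7?", "The answer is 42"),
]
kept, rewards = curate(candidates, answer_is_42)
print(rewards, len(kept))
```

Swapping the checker for a unit-test runner or a reviewer's verdict changes the source of the reward, not its shape.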

That is the paper’s central reframing. Supervised fine-tuning on curated data is not just imitation. It is reinforcement learning seen through a filtered dataset. The model is trained only on examples that satisfy the reward condition, so ordinary maximum-likelihood training becomes a special case of reward-weighted regression under sparse rewards.

The difference is that standard SFT hides this reward structure. It turns “these examples succeeded” into “imitate these examples,” and then pretends the missing failures have vanished from the problem rather than from the dataset. As usual, the accounting trick is not the same thing as the economics.

SFT optimises a lower bound, not the full RL objective

The reinforcement-learning objective is to maximise expected return over trajectories:

$$ J_{\mathrm{RL}}(\theta)=\mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)] $$

For language models, a trajectory is simply the generated token sequence. For a robot policy, it is a sequence of states and actions. The paper treats both under the same mathematical frame.

Ordinary SFT on a filtered dataset instead maximises the likelihood of successful trajectories:

$$ J_{\mathrm{SFT}}(\theta)=\mathbb{E}_{\tau \in \mathcal{D}^{+}}[\log p_\theta(\tau)] $$

where $\mathcal{D}^{+}$ is the curated set of “good” trajectories.

The bridge between the two objectives comes from expressing the RL objective using trajectories sampled from a reference policy, then applying a standard lower-bound inequality. In the sparse-reward case, where reward is essentially a success indicator, the reward-weighted regression objective collapses into maximum-likelihood learning on filtered examples. That is the elegant part. SFT is not merely adjacent to RL. Under this construction, it is optimising a lower bound on the RL objective.
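One standard route to such a bound can be written in three lines. This is a sketch, not a transcription: $\pi_{\mathrm{ref}}$ is an assumed symbol for the reference policy, and the paper's own notation and intermediate steps may differ. It uses importance sampling, then the inequality $x \geq 1 + \log x$ for $x > 0$, which can be applied inside the expectation because $R(\tau) \geq 0$:

$$ J_{\mathrm{RL}}(\theta)=\mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\left[\frac{p_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)}R(\tau)\right] \geq \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\left[R(\tau)\left(1+\log\frac{p_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)}\right)\right] = \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\left[R(\tau)\log p_\theta(\tau)\right] + C $$

where $C=\mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\left[R(\tau)\left(1-\log \pi_{\mathrm{ref}}(\tau)\right)\right]$ does not depend on $\theta$. With a binary reward $R(\tau)\in\{0,1\}$, only successful trajectories contribute to the remaining term:

$$ \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\left[R(\tau)\log p_\theta(\tau)\right] = P_{\pi_{\mathrm{ref}}}(R=1)\,\mathbb{E}_{\tau \in \mathcal{D}^{+}}[\log p_\theta(\tau)] = P_{\pi_{\mathrm{ref}}}(R=1)\,J_{\mathrm{SFT}}(\theta) $$

Since $P_{\pi_{\mathrm{ref}}}(R=1)$ is a positive constant, maximising $J_{\mathrm{SFT}}$ maximises the bound. The inequality is tight only where $p_\theta(\tau)=\pi_{\mathrm{ref}}(\tau)$.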

The annoying part, because papers are allowed to contain one, is that the bound gets looser as training progresses.

At the start, the trained model is close to the reference policy that produced or approximates the curated dataset. As fine-tuning moves the policy away from that reference, the lower bound becomes a weaker proxy for the real RL objective. The model can keep getting better at imitating the curated set while becoming less directly connected to the reward-maximisation problem that made the set valuable in the first place.
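The loosening can be seen numerically in a two-action problem. The sketch below assumes hypothetical success probabilities of 0.4 and 0.8, a uniform reference policy, and a bound of the form E_ref[R(1 + log(p/ref))]. That form is one standard way to lower-bound the objective and may not match the paper's exact expression.

```python
import math


def rl_objective(policy, success_prob):
    """Expected reward of a policy over discrete actions."""
    return sum(p * r for p, r in zip(policy, success_prob))


def lower_bound(policy, ref, success_prob):
    """E_ref[R * (1 + log(policy / ref))], using x >= 1 + log x on the importance ratio."""
    return sum(q * r * (1.0 + math.log(p / q)) for p, q, r in zip(policy, ref, success_prob))


success_prob = [0.4, 0.8]  # hypothetical per-arm success rates
ref = [0.5, 0.5]           # uniform reference policy

# Policies that move progressively further from the reference.
policies = [[0.5, 0.5], [1 / 3, 2 / 3], [0.1, 0.9]]
gaps = [rl_objective(p, success_prob) - lower_bound(p, ref, success_prob) for p in policies]

for p, gap in zip(policies, gaps):
    print(f"P(better arm)={p[1]:.3f}  objective - bound = {gap:.4f}")
```

The gap is zero at the reference policy and grows as the policy moves away from it, even though the true objective is improving the whole time.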

That is the paper’s most useful conceptual move. The problem with ordinary SFT is not that it is too simple. The problem is that it is simple in a way that stops updating its view of which examples matter as the policy changes.

Importance weighting changes who gets heard during fine-tuning

The proposed fix is not to abandon SFT. It is to make SFT behave more like the RL objective it is already approximating.

The paper introduces an auxiliary distribution $q(\tau)$ and uses importance weighting to tighten the lower bound. At a high level, the training objective becomes a weighted version of maximum likelihood:

$$ J_{\mathrm{iw\text{-}SFT}}(\theta) \approx \mathbb{E}_{\tau \in \mathcal{D}^{+}} \left[ w(\tau)\log p_\theta(\tau) \right] $$

where the weight $w(\tau)$ depends on the ratio between the current policy and the auxiliary or reference distribution. The exact formulation matters for implementation, but the operational intuition is simple: not every successful example should be treated equally throughout training.
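As a rough illustration of the weighted objective, the NumPy sketch below takes the weight to be w = q / ref, computed from full-sequence log-probabilities. The function name, the argument names, and the choice of q (for example, a frozen snapshot of the policy) are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np


def iw_sft_loss(logp_theta, logp_q, logp_ref):
    """Importance-weighted negative log-likelihood over curated sequences.

    Each argument is an array of shape (batch,) holding full-sequence
    log-probabilities. The weight w = q / ref is built from fixed numbers,
    so it acts as a constant multiplier on each example's log-likelihood.
    """
    logp_theta = np.asarray(logp_theta, dtype=float)
    log_ratio = np.asarray(logp_q, dtype=float) - np.asarray(logp_ref, dtype=float)
    weights = np.exp(log_ratio)
    loss = -np.mean(weights * logp_theta)
    return loss, weights


# Hypothetical sequence log-probabilities for three curated examples.
logp_theta = [-10.0, -12.0, -8.0]
logp_ref = [-10.0, -12.0, -8.0]
logp_q = [-9.0, -12.0, -9.0]

plain_loss, plain_w = iw_sft_loss(logp_theta, logp_ref, logp_ref)  # q == ref gives w = 1
iw_loss, iw_w = iw_sft_loss(logp_theta, logp_q, logp_ref)
print(plain_loss, iw_w)
```

When q equals the reference, every weight is 1 and the loss reduces to ordinary SFT. In a real training loop the weights would be excluded from the gradient, and left unbounded like this they can grow very large.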

Some examples become more informative as the model shifts. Some are merely typical. Some are technically successful but lead the policy toward mediocre behaviour. Ordinary SFT gives them equal voting rights. iw-SFT adjusts the vote.

The paper also proposes SFT(Q), where examples are sampled or weighted according to quality scores. This matters because many real datasets are not merely pass/fail. A mathematical solution can be correct but clumsy. A customer-support answer can resolve the issue but sound like it was assembled from printer manuals. A code patch can pass tests while introducing maintenance debt, the software equivalent of moving mess from the desk to the cupboard.

Where quality scores exist, SFT(Q) gives the training process more information than binary filtering. Where only curated successes exist, iw-SFT tries to recover some of the missing reward structure through policy-dependent reweighting.
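One simple way to realise quality-aware sampling is to draw training examples with probability proportional to their score. The sketch below assumes exactly that; the proportional rule, the function name, and the example scores are illustrative choices, and the paper's sampling scheme may differ.

```python
import random
from collections import Counter


def sample_by_quality(examples, scores, k, seed=0):
    """Draw k examples with probability proportional to non-negative quality scores."""
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    if any(s < 0 for s in scores) or sum(scores) <= 0:
        raise ValueError("scores must be non-negative with a positive sum")
    rng = random.Random(seed)
    return rng.choices(examples, weights=scores, k=k)


# Two hypothetical solutions that both pass the filter, with graded quality.
examples = ["correct_but_clumsy", "correct_and_clean"]
scores = [1.0, 3.0]

batch = sample_by_quality(examples, scores, k=10_000)
counts = Counter(batch)
print(counts)
```

Under binary filtering both solutions would be seen equally often. With graded scores the cleaner one is expected to appear about three times as often.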

The toy bandit explains why filtering can throw away useful information

The paper’s toy example is small but important because it exposes a failure mode that is easy to miss in LLM discussions.

Imagine a two-arm bandit. One arm, the right one in the paper’s construction, succeeds more often than the other. A reference policy explores both arms uniformly. After collecting data, we filter for successful outcomes. Because the better arm succeeds more often, the filtered dataset contains more examples of it. Standard SFT learns this skew and improves over the reference policy.

But it does not necessarily learn the optimal policy. In the paper’s toy construction, filtered data contains twice as many right-arm actions as left-arm actions. SFT therefore learns to choose the better action two-thirds of the time. Better than random, yes. Optimal, no.

The reason is subtle. By training only on successes, SFT sees successful left-arm outcomes and successful right-arm outcomes, but it does not fully use the information contained in failures. The filtered dataset says “the right arm is more common among successes.” It does not directly say “the left arm fails more often.”

iw-SFT can recover the optimal policy in this toy setting by adaptively placing greater weight on the better action as training proceeds. That does not mean iw-SFT magically learns from data it never saw. It means the importance weights let the model use the relationship between the evolving policy and the reference distribution to sharpen the learning signal.
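The dynamics can be reproduced in a few lines under a deliberate simplification: the auxiliary distribution is taken to be the previous policy iterate, and each step solves the weighted maximum-likelihood problem in closed form. This is an illustration of the mechanism, not the paper's exact update rule.

```python
def normalise(values):
    total = sum(values)
    return [v / total for v in values]


def iw_step(policy, filtered_freq, ref):
    """One reweighted maximum-likelihood step on the filtered data.

    Maximising sum_a D+(a) * w(a) * log p(a) with w(a) = policy(a) / ref(a)
    has the closed-form solution p(a) proportional to D+(a) * w(a).
    """
    return normalise([d * (p / q) for d, p, q in zip(filtered_freq, policy, ref)])


ref = [0.5, 0.5]                    # uniform reference over (left, right)
filtered_freq = normalise([1, 2])   # twice as many right-arm successes survive the filter

policy = list(ref)
history = []                        # P(right arm) after each step
for _ in range(10):
    policy = iw_step(policy, filtered_freq, ref)
    history.append(policy[1])

print([round(h, 4) for h in history])
```

The first step has unit weights, so it reproduces plain SFT and lands on two-thirds. Each later step multiplies the odds in favour of the right arm by two, so the policy approaches the optimal arm without ever seeing a failure.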

There is a warning embedded in the same example. Uncontrolled importance weights can collapse the policy. In the toy case, collapse to the right arm is correct. In a high-dimensional language model, collapse can mean overfitting, brittle reasoning style, reduced diversity, or a model that has learned to shout the one pattern that looked good in the curated set. The paper handles this through clipping and smoothing. Operators should translate that as: weighting is not seasoning. You cannot just add more and call it cuisine.
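To make the stabilisation concrete, here is one plausible form it could take: temper the log-ratio with a temperature, then clip the result into a fixed range. The temperature, the clip bounds, and the function name are all assumptions; the paper's specific clipping and smoothing choices are not reproduced here.

```python
import numpy as np


def stabilise_weights(log_ratio, temperature=2.0, clip_min=0.2, clip_max=5.0):
    """Temper importance weights (exp(log_ratio / T)) and clip them to [clip_min, clip_max]."""
    log_ratio = np.asarray(log_ratio, dtype=float)
    smoothed = np.exp(log_ratio / temperature)
    return np.clip(smoothed, clip_min, clip_max)


# Hypothetical log-ratios, including two extreme values.
log_ratio = [-6.0, 0.0, 1.0, 8.0]
raw = np.exp(log_ratio)
stable = stabilise_weights(log_ratio)
print(raw.round(3), stable.round(3))
```

The untreated weights span roughly six orders of magnitude here, which would let one trajectory dominate a batch. After tempering and clipping, the ordering is preserved but no example counts for more than 25 times another.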

The LLM experiment tests whether weighting extracts more reasoning signal from the same curated traces

The main language-model experiment fine-tunes Qwen2.5-32B-Instruct on S1.1K, a heavily curated reasoning dataset. The dataset began from roughly 59K candidate question-answer pairs, used Gemini Flash to generate reasoning traces, then filtered down to about 1K examples by quality and diversity, with decontamination against test datasets.

This is exactly the setting where the paper’s thesis should matter. The dataset is small. The examples are high quality. The filtering process is doing real work. If curated SFT is secretly sparse-reward RL, then a better approximation to that RL objective should extract more value from the same 1K traces.

The results support that claim.

Model / method	Training examples	AIME 2024 (%)	MATH 500 (%)	GPQA Diamond (%)
Qwen2.5-32B-Instruct	N/A	26.7	84.0	49.0
s1 without budget forcing	1K	50.0	92.6	56.6
s1 with “Wait” budget forcing 4x	1K	56.7	93.0	59.6
s1.1 without budget forcing	1K	56.7	94.4	60.6
s1.1 with “Wait” budget forcing 2x	1K	56.7	95.4	63.6
iw-SFT, per-step importance weighting	1K	63.3	95.2	60.6
iw-SFT, sequence-level weighting	1K	66.7	94.8	64.1

The strongest iw-SFT run reaches 66.7% on AIME 2024, compared with 56.7% for s1.1 both with and without budget forcing. On GPQA Diamond, iw-SFT reaches 64.1%, slightly above s1.1 with budget forcing at 63.6%. On MATH 500, it is broadly comparable rather than clearly better.
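It helps to convert those percentages back into question counts. The sketch assumes the standard benchmark sizes (30 problems for AIME 2024, 198 questions for GPQA Diamond) and a single scored attempt per question. The evaluation protocol is not stated in this article, so treat the counts as indicative.

```python
AIME_2024_PROBLEMS = 30        # AIME I and II, 15 problems each
GPQA_DIAMOND_QUESTIONS = 198   # size of the Diamond subset


def solved(percent, total):
    """Nearest whole number of questions implied by a reported percentage."""
    return round(percent / 100 * total)


aime_pct = {"s1.1": 56.7, "iw-SFT per-step": 63.3, "iw-SFT sequence-level": 66.7}
gpqa_pct = {"s1.1 + budget forcing": 63.6, "iw-SFT sequence-level": 64.1}

aime_counts = {name: solved(p, AIME_2024_PROBLEMS) for name, p in aime_pct.items()}
gpqa_counts = {name: solved(p, GPQA_DIAMOND_QUESTIONS) for name, p in gpqa_pct.items()}
print(aime_counts, gpqa_counts)
```

Under those assumptions the AIME gain is three additional problems out of thirty, and the GPQA Diamond edge over budget-forced s1.1 is a single question. “Slightly above” is the right phrasing for the latter.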

That pattern matters. The claim is not “iw-SFT improves everything.” The claim is narrower and more useful: in this curated reasoning setup, importance weighting extracts more signal on the hardest reasoning benchmark and remains competitive elsewhere, without relying on test-time budget forcing.

Budget forcing is itself an interesting comparison. The s1 line of work used “Wait” injections to make the model think longer at inference time. iw-SFT’s best reported result does not need that extra inference-time manipulation. In business terms, this shifts some value from serving-time tricks into training-time objective design. That is attractive because inference-time complexity repeats every time the model is used. Training-time complexity is painful once. Usually.

The paper also reports a per-step importance-weighting ablation. It performs well on AIME and MATH 500 but underperforms full sequence-level weighting on GPQA Diamond. The likely purpose is not to create a second method family, but to test whether the weighting should operate over complete trajectories rather than isolated tokens. For reasoning traces, sequence-level treatment makes conceptual sense: the value of a solution is rarely contained in a single token. It lives in the trajectory.
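The distinction can be sketched as follows, assuming “per-step” means one ratio per token and “sequence-level” means one ratio for the whole trajectory. The paper's exact per-step definition is not given in this article, so this is an interpretation, and the token log-probabilities are made up.

```python
import numpy as np


def sequence_level_weight(tok_logp_q, tok_logp_ref):
    """One weight per trajectory: the ratio of full-sequence probabilities."""
    return float(np.exp(np.sum(tok_logp_q) - np.sum(tok_logp_ref)))


def per_step_weights(tok_logp_q, tok_logp_ref):
    """One weight per token: the ratio of that token's conditional probabilities."""
    return np.exp(np.asarray(tok_logp_q, dtype=float) - np.asarray(tok_logp_ref, dtype=float))


# Hypothetical per-token log-probabilities for a three-token trajectory.
tok_logp_q = [-0.5, -1.0, -0.2]
tok_logp_ref = [-0.7, -0.9, -0.6]

seq_w = sequence_level_weight(tok_logp_q, tok_logp_ref)
step_w = per_step_weights(tok_logp_q, tok_logp_ref)
print(seq_w, step_w)
```

The sequence-level weight is the product of the per-step ratios, so it scales every token in the trajectory by the same factor. Per-step weighting lets individual tokens be emphasised or muted independently of how the rest of the solution went.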

The control experiments test generality, not LLM glamour

The authors then move outside language models into offline reinforcement learning for continuous-control tasks. This is not decorative. It tests whether the mechanism is specific to token prediction or whether it transfers to policies over actions.

They evaluate D4RL locomotion environments, including halfcheetah, hopper, walker2d, and ant-maze. Because these datasets contain reward information, the experiments can use quality-scored variants: SFT(Q) and iw-SFT(Q). The model is a simple three-layer MLP policy, not a giant transformer wearing a GPU budget as jewellery.

The results are mixed in the useful way. SFT(Q) clearly improves over plain SFT and behaviour cloning in many tasks. iw-SFT(Q) often improves slightly over SFT(Q), but the improvement is less dramatic than in the LLM reasoning case.

Experiment	Likely purpose	What it supports	What it does not prove
Qwen2.5-32B on S1.1K	Main evidence	iw-SFT can extract more reasoning performance from a small curated dataset than comparable SFT baselines	That iw-SFT beats all RL-trained or closed models
Per-step vs sequence-level iw-SFT	Ablation	Complete-trajectory weighting is useful, especially for GPQA Diamond	That per-token weighting is useless
D4RL locomotion tasks	Generality test	Quality sampling and iw-SFT(Q) can be competitive with offline RL baselines in control settings	That iw-SFT(Q) dominates specialised offline RL methods
Franka Kitchen	Low-data and noisy-quality extension	Fine-tuning with quality and importance weighting can recover much of the performance lost by using small filtered data	That the method is robust in real robotics deployments
Budget forcing appendix	Robustness/sensitivity check	Extra forced thinking does not appear to improve the reported iw-SFT result	That inference-time scaling is generally unnecessary

The Franka Kitchen experiment is especially relevant for applied teams. The authors pre-train a behaviour-cloning policy on partial data, then fine-tune on a small filtered set from more complete or mixed trajectories. Training only on the filtered data fails. Fine-tuning helps. SFT(Q) helps more. iw-SFT(Q) recovers performance close to behaviour cloning on complete data.

Translated out of robotics: if you have a broad base of mediocre data and a small set of high-quality examples, do not assume the small set can stand alone. Pre-train or initialise on the broader behavioural distribution, then fine-tune using quality-aware weighting. The reference distribution still matters. The “good” examples are not self-sufficient little saints.

The operational lesson is to treat curation as reward design

The business relevance of this paper is not that everyone should implement iw-SFT tomorrow morning. The relevance is that it changes how post-training teams should think about their data pipeline.

Most organisations already curate. They select good support replies, good legal summaries, good research answers, good workflow plans, good SQL queries, good tool-use traces. They often treat this as data cleaning. The paper suggests it is closer to reward engineering.

That shift changes the checklist.

Technical idea	Operational consequence	Business relevance
Filtering induces sparse reward	Keep records of why examples were accepted or rejected	Curation becomes auditable model steering, not mere dataset hygiene
SFT lower-bounds RL	Interpret SFT gains as approximate policy optimisation	Teams can reason about when SFT is enough and when full RL may be justified
The bound loosens as policy drifts	Preserve or approximate the reference policy	Post-training should include reference-model management, not just final checkpoint storage
Quality scores tighten the signal	Store graded labels, not only pass/fail labels	Expert review becomes more valuable when it captures degrees of quality
Importance weighting changes trajectory influence	Weight examples dynamically rather than treating all curated examples equally	Small high-quality datasets may yield more value before costly RL infrastructure is needed
Clipping and smoothing control variance	Treat weight design as a stability parameter	Prevents the model from over-amplifying narrow patterns in curated data

The most immediate use case is domain LLM post-training where full RL is too expensive, too unstable, or organisationally too ambitious. Finance research assistants, legal drafting models, coding copilots, compliance triage agents, medical-adjacent administrative systems, and industrial troubleshooting bots all share the same practical problem: high-quality expert traces are scarce, expensive, and politically annoying to collect.

If the organisation already has curated traces, iw-SFT offers a plausible middle path. It does not require building a full online RL loop. It does not require a separate reward model in the basic curated-data setting. It does require careful reference-policy approximation and implementation of importance weights. That is a smaller beast, though still a beast.

What the paper directly shows, and what Cognaptus infers

The paper directly shows three things.

First, under a sparse-reward interpretation, SFT on filtered data can be derived as a lower bound on the RL objective. This is not merely metaphorical. It follows from the reward-weighted regression view and the lower-bound derivation.

Second, importance weighting and quality sampling can tighten or improve that training process. The strongest evidence is the Qwen2.5-32B reasoning result on S1.1K, where iw-SFT improves AIME 2024 performance over the SFT-based s1 and s1.1 baselines.

Third, the same family of ideas is not restricted to language tokens. The offline-control experiments show that SFT(Q) and iw-SFT(Q) can be competitive with stronger offline RL algorithms in several D4RL settings, though not uniformly superior.

Cognaptus infers a practical post-training rule from this: if your team is already curating examples, you should design the curation pipeline as if it were reward infrastructure. That means tracking the source policy, keeping quality scores, avoiding destructive filtering that erases useful negative information, and testing whether importance weighting improves downstream behaviour before escalating to more complex RL.

What remains uncertain is scale and breadth. The LLM experiment is compelling but narrow. It uses one main open model family, one small curated reasoning dataset, and benchmarks centred on mathematical and scientific reasoning. The control results broaden the mechanism but do not prove enterprise robustness. Nobody should read this as “replace RLHF with iw-SFT and go home early.” Tempting, but no.

The reference model is not an implementation detail

One of the paper’s most operationally important constraints hides in what might look like engineering housekeeping: importance weighting needs a reference distribution.

In the S1.1K experiment, the reasoning traces were generated with Gemini Flash, but the authors did not have access to Gemini Flash weights or its original training distribution. They therefore approximated the reference distribution using the starting Qwen2.5-32B model after testing likelihoods across candidate 32B models. This is a reasonable workaround, but it is still a workaround.
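A selection step of this kind might look like the sketch below. It assumes the criterion is the highest mean per-token log-likelihood on the curated traces; the candidate names and numbers are placeholders, and the paper's actual selection procedure may differ.

```python
def pick_reference(candidate_logps):
    """Choose the candidate model that assigns the curated traces the highest mean log-likelihood.

    candidate_logps maps a model name to a list of per-token-normalised
    log-likelihoods, one entry per curated trace.
    """
    means = {name: sum(vals) / len(vals) for name, vals in candidate_logps.items()}
    best = max(means, key=means.get)
    return best, means


# Placeholder scores for three hypothetical candidates on the same four traces.
candidate_logps = {
    "candidate_a": [-1.20, -1.10, -1.35, -1.25],
    "candidate_b": [-0.95, -1.00, -1.05, -0.90],
    "candidate_c": [-1.60, -1.45, -1.50, -1.55],
}

best, means = pick_reference(candidate_logps)
print(best, {k: round(v, 3) for k, v in means.items()})
```

Normalising by trace length before averaging keeps long traces from dominating the comparison.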

For companies, this matters because curated datasets often come from a messy mixture of sources: human experts, older models, outsourced annotators, production logs, vendor models, and “temporary” scripts that become permanent faster than anyone admits. If the source distribution is unknown, importance weighting becomes more approximate.

This does not kill the method. It changes the governance requirement. A serious post-training pipeline should record:

which model or human process generated each trajectory;
which filter accepted it;
whether the label is binary or graded;
whether the example passed automated checks, expert review, or both;
which reference checkpoint is appropriate for weighting.
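The checklist above maps naturally onto a per-trajectory record. The field names and validation rules below are hypothetical, meant as a starting schema rather than a standard.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class TrajectoryRecord:
    trajectory_id: str
    generator: str                  # model or human process that produced the trajectory
    accepted_by: str                # filter that kept it
    label_type: str                 # "binary" or "graded"
    quality_score: Optional[float]  # required when label_type == "graded"
    passed_automated_checks: bool
    passed_expert_review: bool
    reference_checkpoint: str       # checkpoint to use when computing importance weights

    def __post_init__(self):
        if self.label_type not in ("binary", "graded"):
            raise ValueError("label_type must be 'binary' or 'graded'")
        if self.label_type == "graded" and self.quality_score is None:
            raise ValueError("graded labels need a quality_score")


record = TrajectoryRecord(
    trajectory_id="traj-0001",
    generator="vendor-model-v3",            # placeholder name
    accepted_by="unit-tests+expert-review",
    label_type="graded",
    quality_score=0.8,
    passed_automated_checks=True,
    passed_expert_review=True,
    reference_checkpoint="base-checkpoint-2025-01",  # placeholder name
)
print(asdict(record))
```

Rejecting a graded label with no score at write time is cheaper than discovering the gap when someone tries to compute weights six months later.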

Without this metadata, the team may still run SFT. It just should not pretend the result is a clean approximation to anything in particular. The model may improve. So does a lucky spreadsheet. That is not a methodology.

The real ROI is cheaper signal extraction, not magical alignment

There is a lazy version of this article that says: “SFT is RL, therefore enterprises can get RL benefits cheaply.” That version should be taken outside and made to read appendices.

The better interpretation is narrower. iw-SFT may improve the return on already-expensive curated data. It is a signal-extraction method. If expert traces cost money, and if the organisation has too few of them for brute-force scaling, then squeezing more performance out of each trace is economically meaningful.

The cost profile is also different from full RL. The reported LLM training runs use a single machine with 8 H100 GPUs and 192GB of host RAM, completing within about 18 hours. That is not pocket change. But it is far less operationally complex than maintaining online RL infrastructure, reward-model training loops, rollout generation, distributed evaluation, and the usual stack of YAML files quietly plotting against civilisation.

The method’s practical value is therefore strongest in three cases:

The team has a small but high-quality curated dataset.
The task has clear success or graded quality signals.
Full RL is too expensive, unstable, or unnecessary for the deployment stage.

It is weaker when the data is low quality, the curation criteria are vague, the reference distribution is unknowable, or the task requires exploration beyond existing trajectories. iw-SFT can reweight what you have. It cannot discover what your data never expressed.

Boundaries that matter before adoption

The paper’s limitations are not cosmetic. They directly affect whether an operator should use the method.

The first boundary is domain narrowness. The strongest LLM evidence is for reasoning benchmarks. That is valuable, but reasoning benchmarks are not the same as customer operations, legal drafting, clinical administration, or enterprise tool use. Transfer is plausible. It is not shown.

The second boundary is dependence on curation quality. If the filtered data encodes superficial preferences, shortcut reasoning, or hidden benchmark artefacts, iw-SFT may amplify those. The method can make a good signal sharper. It can also make a bad signal more efficiently bad. Adversarial or careless curation is not rescued by a clever objective.

The third boundary is variance. Importance weights can become extreme as the trained policy moves away from the reference distribution. The paper addresses this through clipping and smoothing. In deployment terms, this means the weighting scheme is part of the safety and reliability design, not a hyperparameter to be tuned by vibes.
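A generic way to monitor this, not taken from the paper, is the effective sample size used in importance sampling: (sum of weights) squared divided by the sum of squared weights. The sketch below applies it to two hypothetical batches.

```python
import numpy as np


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2). Equals n for uniform weights."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (w ** 2).sum())


uniform = [1.0, 1.0, 1.0, 1.0]        # every curated example counts equally
collapsed = [100.0, 0.01, 0.01, 0.01]  # one example dominates the batch

ess_uniform = effective_sample_size(uniform)
ess_collapsed = effective_sample_size(collapsed)
print(ess_uniform, round(ess_collapsed, 4))
```

A batch of four examples with an effective size near one is, in practice, a batch of one. Logging this number during training gives an early signal that the weights need tighter clipping.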

The fourth boundary is competitive positioning. iw-SFT reaches strong open-model, open-data results, but the comparison table still shows stronger closed or heavily RL-trained systems. DeepSeek-R1 and top API models remain ahead on several benchmarks. The paper does not claim otherwise. The contribution is efficiency and conceptual clarity, not overthrowing the entire post-training regime before breakfast.

The better mental model: curated data is a compressed policy search

The cleanest takeaway is this: curated SFT is compressed policy search.

A team samples behaviour from a reference policy. It filters for success. It trains the model to imitate what survived. That process already contains the bones of reinforcement learning. What ordinary SFT lacks is a way to keep the optimisation tightly connected to the RL objective as the policy changes.

iw-SFT supplies one such mechanism. Quality sampling supplies another. Neither requires abandoning the stability of supervised training. Both require admitting that the curation pipeline is doing more than producing examples. It is defining what the model is being rewarded to become.

For operators, that is the useful provocation. Stop treating post-training data as a pile of approved answers. Treat it as a reward system with missing entries, hidden assumptions, and a reference distribution. Then optimise accordingly.

Fine-tuning was never just supervised. It was reinforcement learning in office clothes. This paper simply checks the pockets and finds the reward function.

Cognaptus: Automate the Present, Incubate the Future.

¹ Chongli Qin and Jost Tobias Springenberg, “Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved),” arXiv:2507.12856, 2025, https://arxiv.org/abs/2507.12856
