Opening — Why this matters now
Chain-of-thought (CoT) has quietly become the default crutch of modern LLM training. When models fail, we add more reasoning steps; when benchmarks stagnate, we stretch the explanations even further. The assumption is implicit and rarely questioned: better thinking inevitably leads to better answers.
The paper “Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy” challenges that assumption with a refreshingly blunt observation: in supervised fine-tuning, the answer itself is often the shortest—and most under-optimized—part of the output.
In a world where enterprise users judge correctness rather than grading reasoning traces, this imbalance is not academic. It’s operationally expensive.
Background — How CoT quietly hijacked SFT
Standard supervised fine-tuning (SFT) optimizes every output token equally. That was sensible when outputs were short. But modern instruction datasets frequently contain:
- Long, verbose reasoning chains
- A tiny final answer span
- Uniform loss applied across both
The result is predictable: gradients are dominated by reasoning tokens simply because there are more of them. Accuracy becomes a side effect rather than a target.
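To see the imbalance concretely, here is a back-of-the-envelope sketch; the token counts are invented for illustration, not taken from the paper, but they show how little of a uniform loss the answer span actually receives:

```python
# Illustrative only: under uniform token-level loss, the share of loss terms
# coming from the final answer shrinks as the reasoning trace grows.
# These token counts are hypothetical, not figures from the paper.
reasoning_tokens = 300   # a verbose chain-of-thought
answer_tokens = 5        # e.g. "The answer is 42."

answer_share = answer_tokens / (reasoning_tokens + answer_tokens)
print(f"Answer tokens contribute {answer_share:.1%} of the loss terms")
# -> Answer tokens contribute 1.6% of the loss terms
```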
Prior attempts to fix this—token reweighting, importance grouping, reasoning compression—add complexity, hyperparameters, or auxiliary models. Elegant in theory, fragile in practice.
This paper takes a more pragmatic route.
Analysis — What the paper actually does
The authors propose SFTKey, a deliberately simple two-stage fine-tuning scheme:
Step 1: Learn how to talk
First, the model is trained with conventional SFT, but with explicit structural tags:
… …
This stage ensures:
- Correct output format
- Stable reasoning behavior
- No loss of instruction-following capability
Think of it as teaching the model how to respond.
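As a rough illustration, a Stage 1 training example might look like the snippet below. The `<reasoning>` and `<answer>` tag names are hypothetical stand-ins, since the paper’s exact tag format is not reproduced here:

```python
# Hypothetical Stage-1 training target; the paper's actual tag names may differ.
# The point is that explicit tags make the answer span easy to locate later.
example = {
    "prompt": "Q: A train travels 60 km in 45 minutes. What is its speed in km/h?",
    "target": (
        "<reasoning>45 minutes is 0.75 hours, so speed = 60 / 0.75 = 80 km/h.</reasoning>"
        "<answer>80 km/h</answer>"
    ),
}
# Stage 1 applies ordinary SFT loss to every token of `target`.
```

What matters is that the tags give Stage 2 an unambiguous way to find the answer span.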
Step 2: Learn what matters
In the second stage, the same data is used—but only the answer tokens contribute to the loss. Reasoning tokens remain as context but are excluded from gradient updates.
This shifts optimization pressure squarely onto the part users—and benchmarks—actually care about.
The authors call the combined approach SFTKey-Tag.
No token importance heuristics. No auxiliary judge model. No clever weighting tricks. Just a change in where gradients are allowed to land.
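In practice, “where gradients are allowed to land” is usually controlled through label masking. Here is a minimal sketch, assuming a Hugging Face/PyTorch-style setup in which label `-100` is ignored by the cross-entropy loss; the helper name and span indices are illustrative, not the authors’ code:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss

def mask_non_answer_labels(input_ids: torch.Tensor, answer_start: int, answer_end: int) -> torch.Tensor:
    """Supervise only the answer span [answer_start, answer_end).

    Reasoning tokens stay in input_ids, so the model still conditions on them,
    but they carry no gradient because their labels are IGNORE_INDEX.
    """
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    labels[answer_start:answer_end] = input_ids[answer_start:answer_end]
    return labels

# Toy usage: positions 0-9 hold the prompt and reasoning trace, 10-12 the answer.
# In a real pipeline the span would be located by finding the Stage-1 answer tags.
input_ids = torch.arange(13)
labels = mask_non_answer_labels(input_ids, answer_start=10, answer_end=13)
print(labels)  # tensor([-100, -100, ..., -100, 10, 11, 12])
```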
Findings — The numbers that matter
Across five models and four reasoning-heavy benchmarks, the results are unambiguous.
Composite performance gains
| Model | Avg. Improvement vs SFT |
|---|---|
| Qwen3-8B | +10.0% |
| Qwen2.5-7B | +6.1% |
| SmolLM3-3B | +2.1% |
| Qwen2.5-3B | +2.5% |
| Qwen2.5-1.5B | +4.1% |
The largest models benefit most, which is unsurprising: they have enough capacity to separate reasoning fluency from answer precision. The trend is not strictly monotonic at the small end, though, where the 1.5B model gains more than either 3B model.
Accuracy vs. format: a controlled trade-off
One-stage Key-only training improves accuracy but catastrophically breaks output structure. Models forget how to close tags. Some forget tags entirely.
The two-stage design fixes this:
| Method | Accuracy | Format Adherence |
|---|---|---|
| SFT | Medium | Medium |
| Key-Tag | High | Poor |
| SFTKey-Tag | High | High |
In other words: correctness without chaos.
Loss dynamics tell the real story
Answer-level loss curves show something subtle but important:
- Early training: Key-focused loss is worse
- Later training: it consistently undercuts standard SFT
This suggests SFT doesn’t fail because it can’t learn correct answers, but because it never prioritizes them for long enough.
Implications — Why this should change how teams fine-tune
1. Reasoning verbosity is not free
Longer CoT improves interpretability, but it silently taxes accuracy under uniform loss. If you care about correctness, you must counterbalance it.
2. Token-level cleverness is overrated
The paper’s biggest strength is what it doesn’t introduce:
- No importance classifiers
- No learned token weights
- No extra models
This makes SFTKey practical for real pipelines.
3. Two-stage optimization is underused
Most fine-tuning treats training as a single, monolithic objective. This work shows that sequencing objectives—first structure, then correctness—can dominate clever loss shaping.
For enterprise AI teams, this is a useful design pattern.
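As a sketch of that pattern (a toy stand-in model, not the authors’ training code), the two stages can differ in nothing but which labels carry loss:

```python
import torch
from torch import nn

# Toy illustration of objective sequencing, not the authors' implementation:
# the same sequence is trained in two stages that differ only in the label mask.
vocab, seq_len, answer_start = 100, 16, 12
model = nn.Linear(vocab, vocab)               # stand-in for a language model
optim = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

tokens = torch.randint(0, vocab, (seq_len,))
inputs = nn.functional.one_hot(tokens, vocab).float()

def run_stage(labels: torch.Tensor, steps: int) -> None:
    for _ in range(steps):
        loss = loss_fn(model(inputs), labels)
        optim.zero_grad()
        loss.backward()
        optim.step()

# Stage 1 (structure): every token is supervised.
run_stage(tokens.clone(), steps=10)

# Stage 2 (correctness): only the answer span is supervised.
answer_only = torch.full_like(tokens, -100)
answer_only[answer_start:] = tokens[answer_start:]
run_stage(answer_only, steps=10)
```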
Conclusion — Less thinking, better answers
This paper doesn’t argue against chain-of-thought. It argues against letting reasoning monopolize learning.
By explicitly re-centering optimization on answer tokens—after structure is learned—SFTKey-Tag delivers consistent accuracy gains without sacrificing reliability.
Sometimes progress isn’t about making models think harder. It’s about reminding them what they’re being judged on.
Cognaptus: Automate the Present, Incubate the Future.