Opening — Why this matters now

Chain-of-thought (CoT) has quietly become the default crutch of modern LLM training. When models fail, we add more reasoning steps; when benchmarks stagnate, we stretch the explanations even further. The assumption is implicit and rarely questioned: better thinking inevitably leads to better answers.

The paper “Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy” challenges that assumption with a refreshingly blunt observation: in supervised fine-tuning, the answer itself is often the shortest—and most under-optimized—part of the output.

In a world where enterprise users don’t grade reasoning traces but judge correctness, this imbalance is not academic. It’s operationally expensive.

Background — How CoT quietly hijacked SFT

Standard supervised fine-tuning (SFT) optimizes every output token equally. That was sensible when outputs were short. But modern instruction-tuning setups typically combine:

  • Long, verbose reasoning chains
  • A tiny final answer span
  • Uniform loss applied across both

The result is predictable: gradients are dominated by reasoning tokens simply because there are more of them. Accuracy becomes a side effect rather than a target.
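
A back-of-the-envelope sketch makes the imbalance concrete. The token counts below are hypothetical, but the arithmetic holds for any uniform, token-averaged cross-entropy loss:

```python
# Hypothetical token counts for one training example (illustrative only).
reasoning_tokens = 380   # verbose chain-of-thought span
answer_tokens = 20       # the final answer span

# Standard SFT averages the loss uniformly over all output tokens, so each
# span's share of the gradient signal is simply its share of the tokens.
answer_share = answer_tokens / (reasoning_tokens + answer_tokens)
print(f"Answer tokens carry {answer_share:.0%} of the loss")  # -> 5%
```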

Prior attempts to fix this—token reweighting, importance grouping, reasoning compression—add complexity, hyperparameters, or auxiliary models. Elegant in theory, fragile in practice.

This paper takes a more pragmatic route.

Analysis — What the paper actually does

The authors propose SFTKey, a deliberately simple two-stage fine-tuning scheme:

Step 1: Learn how to talk

First, the model is trained with conventional SFT, but on targets wrapped in explicit structural tags that separate the reasoning span from the answer span.
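
A hypothetical Stage 1 target, with illustrative tag names (the paper's exact markup may differ), could look like this:

```python
# Illustrative Stage 1 training target; the <reasoning>/<answer> tag names are
# assumed for this sketch, not necessarily the paper's exact tokens.
target = (
    "<reasoning>12 apples split among 3 people is 12 / 3 = 4 each.</reasoning>"
    "<answer>4</answer>"
)
```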


This stage ensures:

  • Correct output format
  • Stable reasoning behavior
  • No loss of instruction-following capability

Think of it as teaching the model how to respond.

Step 2: Learn what matters

In the second stage, the same data is used—but only the answer tokens contribute to the loss. Reasoning tokens remain as context but are excluded from gradient updates.

This shifts optimization pressure squarely onto the part users—and benchmarks—actually care about.
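
A minimal sketch of what that masking could look like in a PyTorch / Hugging Face-style pipeline (my reconstruction, not the authors' code; `answer_start` is an assumed index marking where the answer span begins):

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def answer_only_labels(input_ids: torch.Tensor, answer_start: int) -> torch.Tensor:
    """Stage 2 labels: reasoning tokens stay in the input as context,
    but only the answer span contributes to the loss and its gradients."""
    labels = input_ids.clone()
    labels[:answer_start] = IGNORE_INDEX  # prompt + reasoning: context only
    return labels

# Example: a 12-token sequence whose answer span starts at position 9.
ids = torch.arange(12)
print(answer_only_labels(ids, answer_start=9))
# nine -100 entries, then 9, 10, 11 remain supervised
```

Because the reasoning tokens still appear in the input, the model keeps conditioning on them; they simply stop receiving gradient.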

The authors call the combined approach SFTKey-Tag.

No token importance heuristics. No auxiliary judge model. No clever weighting tricks. Just a change in where gradients are allowed to land.

Findings — The numbers that matter

Across five models and four reasoning-heavy benchmarks, the results are unambiguous.

Composite performance gains

| Model | Avg. improvement vs. SFT |
| --- | --- |
| Qwen3-8B | +10.0% |
| Qwen2.5-7B | +6.1% |
| SmolLM3-3B | +2.1% |
| Qwen2.5-3B | +2.5% |
| Qwen2.5-1.5B | +4.1% |

The largest models benefit most, which is unsurprising: they have enough capacity to separate reasoning fluency from answer precision. The trend is not strictly monotonic, though; the 1.5B model gains more than either 3B model.

Accuracy vs. format: a controlled trade-off

One-stage Key-only training improves accuracy but catastrophically breaks output structure. Models forget how to close tags. Some forget tags entirely.

The two-stage design fixes this:

| Method | Accuracy | Format adherence |
| --- | --- | --- |
| SFT | Medium | Medium |
| Key-Tag | High | Poor |
| SFTKey-Tag | High | High |

In other words: correctness without chaos.

Loss dynamics tell the real story

Answer-level loss curves show something subtle but important:

  • Early in training, the answer-token loss of the key-focused objective is higher
  • Later in training, it falls consistently below the answer loss of standard SFT

This suggests SFT doesn't fail because it can't learn correct answers; it fails because it never prioritizes them for long enough.

Implications — Why this should change how teams fine-tune

1. Reasoning verbosity is not free

Longer CoT improves interpretability, but it silently taxes accuracy under uniform loss. If you care about correctness, you must counterbalance it.

2. Token-level cleverness is overrated

The paper’s biggest strength is what it doesn’t introduce:

  • No importance classifiers
  • No learned token weights
  • No extra models

This makes SFTKey practical for real pipelines.

3. Two-stage optimization is underused

Most fine-tuning treats training as a single, monolithic objective. This work shows that sequencing objectives, first structure and then correctness, can outperform clever loss shaping.
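
As a sketch of that pattern (the stage names and the `answer_only_loss` switch are hypothetical, not the paper's configuration):

```python
# A sequenced schedule rather than a single blended objective (illustrative).
training_schedule = [
    # Stage 1: conventional SFT on the full tagged output (format + reasoning).
    {"stage": "format_and_reasoning", "answer_only_loss": False},
    # Stage 2: same data, but only the answer span carries loss.
    {"stage": "answer_precision", "answer_only_loss": True},
]
```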

For enterprise AI teams, this is a useful design pattern.

Conclusion — Less thinking, better answers

This paper doesn’t argue against chain-of-thought. It argues against letting reasoning monopolize learning.

By explicitly re-centering optimization on answer tokens—after structure is learned—SFTKey-Tag delivers consistent accuracy gains without sacrificing reliability.

Sometimes progress isn’t about making models think harder. It’s about reminding them what they’re being judged on.

Cognaptus: Automate the Present, Incubate the Future.