When the Answer Matters More Than the Thinking

Answer.

In most business systems, that is the part users actually care about. The approval decision. The risk label. The final invoice category. The recommended next action. The tidy little field that decides whether the workflow moves forward or someone opens a Slack thread titled “Why did the AI say this?”

Yet much of modern LLM fine-tuning treats that answer as just another slice of text. Worse, when supervised examples include long chain-of-thought explanations, the final answer may become the shortest and least dominant part of the training objective. The model learns to produce a convincing trail of reasoning, but the tiny destination at the end receives comparatively little optimization pressure. Very elegant. Also slightly absurd.

The paper “Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy” makes that imbalance explicit and proposes SFTKey-Tag, a two-stage supervised fine-tuning method that first teaches the model to generate structured reasoning-plus-answer outputs, then fine-tunes only the final answer tokens.¹ The result is not a war against reasoning. It is a reminder that reasoning text and final answers play different roles, and training objectives should stop pretending otherwise.

The useful business lesson is not “chain-of-thought is bad.” That would be a lazy reading, and we have enough lazy readings already. The better lesson is this: when a model’s output contains both a long explanation and a short decision field, accuracy and format become separate training targets. If they are optimized as one undifferentiated text block, the training process may spend most of its effort polishing the narrative while under-training the decision.

Uniform SFT spends most of its learning budget where the output is longest

Standard supervised fine-tuning minimizes the negative log-likelihood over the entire target response. In simplified form, each output token contributes to the loss:

$$ L_{SFT}(\theta) = -\sum_i \sum_t \log P(y_{i,t} \mid x_i, y_{i,<t}; \theta) $$

This is normal, clean, and mathematically boring in the best possible way. The problem appears when the target response is not a flat object.

A typical reasoning-style training example may contain two parts:

| Output segment | Typical length | Business role | Training treatment under standard SFT |
| --- | --- | --- | --- |
| Reasoning / chain-of-thought | Long | Explains or structures the process | Many token-level loss terms |
| Final answer / key field | Short | Determines correctness and evaluation success | Few token-level loss terms |

The model is not told that the final answer is the business-critical field. It simply sees a sequence. If the reasoning span is much longer than the answer span, the reasoning span contributes more loss terms. The final answer may be the thing that decides whether the model is right, but it is not necessarily the thing that dominates training.
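
The imbalance is easy to quantify. The sketch below counts what share of token-level loss terms comes from the answer span when every target token contributes one term. The function name and the token counts are illustrative assumptions, not figures from the paper.

```python
def answer_loss_share(reasoning_tokens: int, answer_tokens: int) -> float:
    """Fraction of token-level loss terms that come from the answer span
    when every target token contributes exactly one term (uniform SFT)."""
    total = reasoning_tokens + answer_tokens
    if total <= 0:
        raise ValueError("target must contain at least one token")
    return answer_tokens / total


# Hypothetical example: 190 reasoning tokens followed by a 10-token answer.
share = answer_loss_share(reasoning_tokens=190, answer_tokens=10)
print(f"answer share of loss terms: {share:.1%}")
```

In this made-up case the answer accounts for 5% of the loss terms. That is a count of terms, not a measurement of gradient magnitude, so it should be read as a rough indicator of where the objective spends its attention.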

This is the paper’s central mechanism. The issue is not that reasoning tokens are useless. They often help with difficult tasks. The issue is that token count and task value are not the same thing.

That distinction matters in enterprise fine-tuning because many production outputs are structured in exactly this way: a rationale plus a decision, an explanation plus a classification, a summary plus a recommendation, a compliance note plus a final risk flag. If all tokens are treated equally, the model may learn the style of the workflow better than the decision criterion inside the workflow.

The misconception: more reasoning supervision does not automatically mean better answers

The common assumption is simple: if chain-of-thought improves reasoning, then more supervised reasoning should improve final answers.

The paper pushes back against that assumption without rejecting chain-of-thought itself. The correction is narrower and more useful.

| Reader belief | Correction from the paper | Why it matters |
| --- | --- | --- |
| More reasoning tokens give the model more useful supervision. | More reasoning tokens also occupy more of the loss objective. | Long rationales can crowd out the short answer span. |
| Tags around reasoning and answers should naturally improve accuracy. | Tags alone have inconsistent effects across models and datasets. | Structure is not magic; it must be tied to the right optimization target. |
| Answer-only training should solve accuracy. | Answer-only training can damage output format. | Correct answers are not enough if the system cannot reliably emit usable structure. |
| One fine-tuning objective should handle everything. | The paper shows a case for sequential objectives: structure first, answer precision second. | Fine-tuning pipelines may need stages, not just bigger datasets. |

This is why the paper is best read mechanism-first. The benchmark tables are interesting, but the benchmark gains are not the deepest part of the paper. The deeper part is the separation of two objectives that are often bundled together: learning how to respond and learning what exact answer to produce.

SFTKey-Tag separates format learning from answer optimization

The paper compares four training strategies.

First, standard SFT trains on the full response without special reasoning-answer tags. This is the baseline.

Second, SFT-Tag adds explicit structural markers: <Thinking>...</Thinking> for reasoning and <Answer>...</Answer> for the final answer. The model is still trained on the full sequence, but it now sees a structured output format.

Third, Key-Tag keeps the tagged structure but trains only on the answer segment. The reasoning remains part of the input context, but only the answer tokens contribute to the loss.

Fourth, SFTKey-Tag combines the two ideas sequentially:

| Stage | What the model sees | Which tokens contribute to loss | Intended role |
| --- | --- | --- | --- |
| Stage 1: SFT-Tag | Reasoning plus answer, with tags | Reasoning tokens, answer tokens, and tags | Learn the structured output pattern |
| Stage 2: Key-focused fine-tuning | Reasoning plus answer, with tags | Answer segment only | Improve final-answer correctness |
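
Written as a loss, stage 2 keeps the same per-token log-likelihood as full-sequence SFT and only narrows the set of positions that are summed. A sketch, where $A_i$ (the set of token positions inside the answer tags of example $i$) is a symbol introduced here for illustration and may differ from the paper's notation:

$$ L_{Key}(\theta) = -\sum_i \sum_{t \in A_i} \log P(y_{i,t} \mid x_i, y_{i,<t}; \theta) $$

The conditioning term $y_{i,<t}$ still includes the reasoning tokens, so the reasoning remains visible as context while only the answer positions contribute loss terms.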

The second stage is the important move. The model still has access to the reasoning context, but the gradient is concentrated on the answer span. In plain English: the model can read the thinking, but training rewards the answer.

That is a practical design. It avoids training an extra rewrite model. It avoids token-importance classifiers. It avoids asking another model to decide which tokens matter during training. The “key” tokens are identified by a simple structural convention: they are inside the answer tags. Sophisticated? Not especially. Useful? Quite possibly.
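
A minimal sketch of that structural convention, assuming a toy whitespace tokenizer and tags separated from their content by spaces. The helper names are hypothetical, and whether the tags themselves belong in the loss is an assumption here (they are excluded); the paper's exact choice may differ.

```python
import re

# Matches the text strictly between the answer tags.
ANSWER_SPAN = re.compile(r"<Answer>(.*?)</Answer>", re.DOTALL)


def whitespace_tokens(text):
    """Toy tokenizer: (token, start, end) for each whitespace-delimited chunk."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def answer_loss_mask(text):
    """Return (tokens, mask); mask[i] is True only for tokens inside the answer tags."""
    match = ANSWER_SPAN.search(text)
    if match is None:
        raise ValueError("no <Answer>...</Answer> span found")
    start, end = match.span(1)  # character range strictly between the tags
    tokens, mask = [], []
    for token, tok_start, tok_end in whitespace_tokens(text):
        tokens.append(token)
        mask.append(start <= tok_start and tok_end <= end)
    return tokens, mask


example = "<Thinking> 3 + 4 = 7 </Thinking> <Answer> 7 </Answer>"
tokens, mask = answer_loss_mask(example)
print([tok for tok, keep in zip(tokens, mask) if keep])
```

Only the `7` inside the answer tags is kept; the identical `7` in the reasoning span is masked out. In a real pipeline the same idea would be applied to a subword tokenizer's character offsets, and in PyTorch-style training the masked positions are commonly given the label `-100`, which `CrossEntropyLoss` ignores by default.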

One caveat: the paper does use Meta-Llama-3-70B-Instruct as a reference judge for semantic answer evaluation. That is an evaluation choice, not a training-time token-selection mechanism. This distinction matters because the method’s operational simplicity comes from not needing a separate judge model to decide token importance during fine-tuning.

The main result is a balanced score, not a pure accuracy trophy

The paper evaluates five open models: Qwen3-8B-Base, Qwen2.5-7B, SmolLM3-3B-Base, Qwen2.5-3B, and Qwen2.5-1.5B. The benchmark set covers GSM8K, OpenR1-Math-220K, OpenBookQA, and CoT-Collection.

The main table reports a composite score combining answer accuracy and format adherence:

$$ Score = \alpha \cdot Acc + (1-\alpha) \cdot Fmt $$

The paper sets $\alpha = 0.7$, meaning answer accuracy receives 70% of the score and format adherence receives 30%. That choice is reasonable for a research benchmark focused on answer quality, but it is not universal. A production JSON pipeline might care much more about format. A human-facing tutoring product might care more about explanation quality. A compliance workflow might care about both, and then ask legal to ruin everyone’s afternoon.
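
To see how much the weighting matters, here is a small sketch of the composite formula. The function name and the sample accuracy and format values are illustrative assumptions, not results from the paper.

```python
def composite_score(acc: float, fmt: float, alpha: float = 0.7) -> float:
    """Score = alpha * Acc + (1 - alpha) * Fmt, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * acc + (1.0 - alpha) * fmt


# Hypothetical system: accurate answers, unreliable structure.
print(f"{composite_score(0.80, 0.50):.4f}")             # paper's weighting, alpha = 0.7
print(f"{composite_score(0.80, 0.50, alpha=0.3):.4f}")  # a format-heavy weighting
```

The same hypothetical system scores 0.71 under the paper's weighting and 0.59 when format carries 70% of the weight. Note also that with $\alpha = 0.7$, a model with perfect accuracy and zero format adherence cannot score above 0.7.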

Against standard SFT, SFTKey-Tag improves the average composite score across all five models:

| Model | SFT avg. score | SFTKey-Tag avg. score | Relative improvement vs SFT |
| --- | --- | --- | --- |
| Qwen3-8B-Base | 0.7670 | 0.8441 | +10.05% |
| Qwen2.5-7B | 0.7586 | 0.8048 | +6.07% |
| SmolLM3-3B-Base | 0.6863 | 0.7005 | +2.06% |
| Qwen2.5-3B | 0.4176 | 0.4280 | +2.49% |
| Qwen2.5-1.5B | 0.3468 | 0.3880 | +4.12% |

Averaged across these reported relative improvements, the gain is roughly 5%. That is meaningful, especially because the method is simple. It does not require a new model architecture or a heroic new inference scaffold.

But the table should be read carefully. SFTKey-Tag is not the top-scoring method in every model row when compared with every variant. For the smaller Qwen2.5-3B and Qwen2.5-1.5B models, Key-Tag obtains higher composite scores than SFTKey-Tag in the main table. The authors attribute this to the smaller models' limited base capabilities and weaker format scores, which make the accuracy-heavy composite score behave differently. In other words, “SFTKey-Tag beats SFT” is strongly supported; “SFTKey-Tag is always the best possible variant” is not what the table cleanly says.

That nuance is not a weakness. It is the interesting part.

The method’s strongest claim is that a two-stage approach can improve answer correctness while preserving usable output structure, especially in models with enough capacity to benefit from both objectives. For enterprise teams, that is often the relevant target: not the most aggressive accuracy hack, but the best accuracy improvement that does not break the output contract.

The ablations show why one-stage fixes are unstable

The paper’s ablation studies are more important than the headline average. They explain why the two-stage design exists.

| Evidence | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Main composite table | Main evidence | SFTKey-Tag improves over standard SFT across five models on the reported composite metric. | It does not prove SFTKey-Tag is best against all variants for all model sizes. |
| SFT vs SFT-Tag accuracy comparison | Ablation | Adding tags alone has inconsistent accuracy effects across datasets and models. | It does not prove tags are useless; tags still help structure. |
| Key-Tag vs SFT-Tag accuracy and loss curves | Mechanism support / ablation | Answer-focused training can lower answer loss and improve answer accuracy. | It does not show Key-Tag is deployable by itself. |
| Format adherence comparison | Failure-mode evidence | Key-focused training can damage output structure severely. | It does not imply all answer-focused training must fail structurally; the failure is tied to one-stage Key-Tag. |
| One-stage vs two-stage comparison | Ablation validating sequencing | SFTKey-Tag combines much of the answer benefit with much better format preservation. | It does not validate larger frontier-scale models or specialized enterprise datasets. |
| Accuracy and format appendix tables | Decomposition / robustness detail | The composite score can be inspected through its separate components. | The appendix is not a second thesis; it supports interpretation of the main tradeoff. |

The first ablation asks whether tags alone improve accuracy. The answer is: not reliably. SFT-Tag helps in some cases and hurts in others. For example, Qwen3-8B’s average accuracy rises from 0.7069 under SFT to 0.7296 under SFT-Tag, while Qwen2.5-7B slightly drops from 0.6722 to 0.6696. SmolLM3-3B also drops on average, despite improving on some datasets. Tags are useful for separating structure, but they are not accuracy fairy dust.

The second ablation compares SFT-Tag and Key-Tag under the same tagged format. Here the paper reports that Key-Tag delivers clear accuracy improvements in four of five model cases, with the remaining model showing comparable performance. The loss curve on GSM8K for Qwen2.5-7B supports the mechanism: Key-Tag initially has higher answer loss, but as training progresses, its answer loss falls below that of standard SFT.

That pattern is exactly what the mechanism predicts. If training pressure is focused on the answer span, answer-level loss should eventually improve. The model may need time to adapt to the tagged structure, but the endpoint is better answer specialization.

Then comes the inconvenient part: Key-Tag can wreck format.

For Qwen2.5-7B, format adherence under Key-Tag is reported as 0.0000 across the evaluated datasets, while SFT-Tag averages about 0.9601 and SFTKey-Tag reaches 0.9632. SmolLM3-3B shows a similar failure pattern: Key-Tag averages 0.0679 in format adherence, while SFTKey-Tag reaches 0.9084. Qwen3-8B is less catastrophic, but Key-Tag still trails badly: 0.7512 versus 0.9959 for SFTKey-Tag.

That is the paper’s most business-relevant tradeoff. If a system returns the right answer but fails the required output structure, it may still fail the workflow. A malformed answer tag is not just ugly formatting; it can break parsers, downstream automation, audit trails, and user trust. The model is not being charmingly informal. It is failing the contract.
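
What "failing the contract" means can be made concrete with a parser. The sketch below assumes a strict contract of exactly one `<Thinking>` block followed by exactly one `<Answer>` block and nothing else. That definition, and the function names, are assumptions for illustration; the paper's own format-adherence criterion may be looser or stricter.

```python
import re

TAGS = ("<Thinking>", "</Thinking>", "<Answer>", "</Answer>")

# Assumed contract: one reasoning block, then one answer block, nothing else.
CONTRACT = re.compile(
    r"\s*<Thinking>(?P<thinking>.*?)</Thinking>\s*<Answer>(?P<answer>.*?)</Answer>\s*",
    re.DOTALL,
)


def parse_tagged_output(text):
    """Return the answer string if the output honors the contract, else None."""
    if any(text.count(tag) != 1 for tag in TAGS):
        return None  # missing, duplicated, or unclosed tag
    match = CONTRACT.fullmatch(text)
    if match is None:
        return None  # tags present but in the wrong order or with stray text
    answer = match.group("answer").strip()
    return answer or None  # an empty answer also counts as a failure


def format_adherence(outputs):
    """Share of outputs that parse under the assumed contract."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(parse_tagged_output(o) is not None for o in outputs) / len(outputs)
```

A downstream system built on a parser like this never sees the answer in a malformed output, regardless of whether the answer text happened to be correct. That is why format adherence behaves like a gate rather than a style preference.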

SFTKey-Tag addresses this by sequencing the objectives. First, learn the contract. Then, optimize the answer inside the contract.

The real contribution is objective sequencing

A weaker version of this article would say: “The paper improves SFT by focusing on answer tokens.” True, but incomplete.

The better formulation is: the paper shows that output structure and answer correctness may need to be optimized in sequence, because each one can undermine the other if trained alone.

SFT-Tag is good at teaching the model to produce structured responses. It is less direct at improving the answer span because the answer remains only part of the full loss.

Key-Tag is good at emphasizing final answers. It is dangerous because the model may stop learning, or stop preserving, the full response format.

SFTKey-Tag is the compromise: teach the output grammar first, then specialize the decision field. This makes the method less like a clever token trick and more like a training curriculum.
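
The curriculum can be sketched as a change of loss mask between stages while the data stays the same. The per-token loss values below are invented for illustration, and averaging over kept tokens is an assumption made to keep the numbers readable; the article does not specify how the paper normalizes its loss.

```python
def stage_mask(stage, is_answer):
    """Stage 1 trains on every target token; stage 2 only on answer tokens."""
    if stage == 1:
        return [True] * len(is_answer)
    if stage == 2:
        return list(is_answer)
    raise ValueError("stage must be 1 or 2")


def masked_mean_nll(token_nll, mask):
    """Average negative log-likelihood over the tokens the mask keeps."""
    kept = [nll for nll, keep in zip(token_nll, mask) if keep]
    if not kept:
        raise ValueError("mask removed every token")
    return sum(kept) / len(kept)


# Hypothetical example: four well-fit reasoning tokens, one poorly-fit answer token.
token_nll = [0.2, 0.1, 0.3, 0.2, 1.2]
is_answer = [False, False, False, False, True]

for stage in (1, 2):
    loss = masked_mean_nll(token_nll, stage_mask(stage, is_answer))
    print(f"stage {stage} loss: {loss:.2f}")
```

Under the stage 1 mask the weak answer token is averaged together with four easy reasoning tokens; under the stage 2 mask it is the whole objective. The training examples are unchanged between stages, which is what makes this a curriculum rather than a new dataset.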

That curriculum view is useful because it generalizes beyond the exact tags used in the paper. Many enterprise outputs have a similar shape:

| Enterprise output pattern | “Thinking-like” component | “Answer-like” component |
| --- | --- | --- |
| Credit memo | Rationale, supporting factors | Approve / reject / review |
| Customer support automation | Case summary, policy explanation | Refund / escalate / deny |
| Legal document triage | Clause reasoning, extracted context | Risk category |
| Finance operations | Transaction explanation | Account code or exception flag |
| Sales operations | Lead notes, qualification reasoning | Next-best action |

The paper does not test these domains. That boundary is important. But the mechanism maps cleanly: when long explanatory fields and short decision fields coexist, the short field may deserve its own optimization stage.

Business teams should treat final fields as product targets, not text leftovers

For enterprise LLM teams, the immediate practical takeaway is not “copy these exact tags.” It is to redesign fine-tuning data around output roles.

A useful pipeline would look like this:

| Step | Practical action | Why it matters |
| --- | --- | --- |
| 1. Segment outputs by role | Separate reasoning, explanation, citations, final answer, classification, and action fields. | The model cannot optimize field importance if the data treats everything as one blob. |
| 2. Train structure first | Use full-output SFT to teach the response contract. | This protects format adherence and downstream compatibility. |
| 3. Fine-tune key fields second | Apply answer- or decision-focused loss on the fields that determine task success. | This aligns training pressure with business value. |
| 4. Evaluate components separately | Measure answer accuracy, format validity, explanation quality, and parser success independently. | A single score can hide operational failure modes. |
| 5. Tune the composite metric to the workflow | Adjust the accuracy-format tradeoff based on production cost. | A chatbot and an automated claims pipeline should not use the same metric weights. |

This is where the paper becomes operationally interesting. It suggests that some fine-tuning failures are not data-volume failures. They are objective-design failures.

If a model gives beautiful rationales and wrong final classifications, adding more rationale examples may not help. If a model gives correct classifications but invalid JSON, answer-only training may make production worse. The fix is not necessarily a larger model, more prompts, or a ritual sacrifice to the benchmark gods. The fix may be a staged objective that matches the structure of the output.

The ROI is cheaper diagnosis, not just higher benchmark scores

The reported gain of roughly 5% over standard SFT is useful, but the more durable value is diagnostic.

SFTKey-Tag gives teams a way to ask: which part of the output is failing?

If full SFT produces valid structure but weak answers, the answer field may be under-optimized. If answer-only fine-tuning improves accuracy but breaks structure, the system needs a structure-preserving first stage. If tags alone do not help, the issue is not merely output demarcation. If a smaller model behaves differently from a larger one, capacity may be limiting the model’s ability to maintain both format and answer precision.

That is a better engineering conversation than “let’s fine-tune again and see what happens.”

The method also has a practical advantage over more elaborate token-importance strategies. It does not require a model to discover which tokens matter. The important field is specified by the schema. For business systems, that is often how importance actually works. The final decision field matters because the workflow says it matters, not because a token-ranking algorithm had a spiritual awakening.

Boundaries: where the paper should not be overread

The evidence is promising, but the boundary conditions are clear.

First, the models are in the 1.5B to 8B range. The paper does not test 14B, 32B, or frontier-scale models. It is plausible that larger models respond differently, either because they better preserve format under answer-focused training or because they already allocate learning capacity more effectively. The paper does not settle that question.

Second, the tasks are public reasoning and QA benchmarks: GSM8K, OpenR1-Math-220K, OpenBookQA, and CoT-Collection. These are useful tests, but they are not the same as legal review, medical triage, financial compliance, procurement classification, or internal enterprise knowledge workflows. Domain-specific outputs may have messier schemas, noisier labels, and less clean separation between reasoning and answer.

Third, the method assumes the training data can be segmented into reasoning and answer spans. The paper manually divides each dataset into reasoning and answer segments before recombining them with tags. Many business datasets will not arrive so politely dressed. Teams may need preprocessing rules, annotation, or LLM-assisted segmentation before this method becomes usable.
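
For data that already carries a final-answer delimiter, a preprocessing rule can be very small. The sketch below assumes GSM8K-style solutions, where the final answer follows a `####` marker, and wraps the two parts in the tags described earlier. The function name and the sample text are hypothetical, and this is not the paper's segmentation procedure, which the article describes as manual.

```python
def to_tagged_example(solution: str, delimiter: str = "####") -> str:
    """Split a solution at its last delimiter and wrap the parts in tags."""
    reasoning, sep, answer = solution.rpartition(delimiter)
    if not sep or not answer.strip():
        # No delimiter or nothing after it: this record needs manual
        # or model-assisted segmentation instead of a rule.
        raise ValueError("no final-answer delimiter found")
    return (
        f"<Thinking>{reasoning.strip()}</Thinking>\n"
        f"<Answer>{answer.strip()}</Answer>"
    )


# Made-up sample written in the GSM8K answer style.
sample = "She sold 48 / 2 = 24 clips in May.\n48 + 24 = 72 clips in total.\n#### 72"
print(to_tagged_example(sample))
```

Records that raise the error are the interesting ones: they show how much of a dataset lacks a clean boundary between reasoning and answer, which is a useful number to know before committing to a two-stage pipeline.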

Fourth, the composite score uses $\alpha = 0.7$, weighting accuracy more heavily than format. That is a design choice, not a universal truth. If a production system depends on strict schema adherence, format failures may deserve a much heavier penalty. In some workflows, a malformed output is not 30% of the problem. It is the whole problem.

Fifth, the method adds a second training stage. The paper notes that SFTKey-Tag requires longer training time than conventional SFT. The reported compute budget is manageable for research-scale open models, but enterprise teams still need to evaluate whether the added training cost is justified by the accuracy and reliability gain.

Finally, the evaluation uses an external LLM judge to assess semantic answer equivalence. That can reduce brittle string-matching errors, but judge-based evaluation introduces its own uncertainty. For production systems, especially high-stakes ones, teams should pair semantic judging with task-specific validation, human review, or deterministic checks where possible.

What this changes in how we should read chain-of-thought fine-tuning

The fashionable story says reasoning is the magic. The paper’s quieter story is that reasoning is part of the interface, not always the final product.

A model can reason fluently and answer incorrectly. A model can answer correctly and break the required format. A model can learn tags without improving accuracy. These are not contradictions. They are different failure modes exposed by separating the output into parts.

That separation is the paper’s real contribution. It makes the training objective less naïve about what the output is for.

For Cognaptus readers building automation systems, the lesson is straightforward: do not let the longest part of the output automatically become the most trained part of the output. If the business value sits in a short final field, that field deserves explicit optimization. If the workflow depends on structure, that structure deserves its own training stage. And if both matter, train them in the right order.

The answer may be shorter than the thinking. It may also be more expensive to get wrong.

Cognaptus: Automate the Present, Incubate the Future.

¹ Xiaofeng Shi, Qian Kou, Yuduo Li, and Hua Zhou, “Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy,” arXiv:2512.21017, 2025, https://arxiv.org/abs/2512.21017.
