
When the Answer Matters More Than the Thinking

Chain-of-thought (CoT) has quietly become the default crutch of modern LLM training. When models fail, we add more reasoning steps; when benchmarks stagnate, we stretch the explanations even further. The assumption is implicit and rarely questioned: better thinking inevitably leads to better answers. The paper “Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy” challenges that assumption with a refreshingly blunt observation: in supervised fine-tuning, the answer itself is often the shortest, and most under-optimized, part of the output. ...
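
The mechanics are easy to picture. Below is a minimal sketch of an answer-weighted SFT loss in PyTorch, assuming the caller supplies an `answer_mask` marking which target tokens belong to the final answer; the function name, mask, and weight value are illustrative, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def weighted_sft_loss(logits, labels, answer_mask, answer_weight=2.0):
    """Token-level cross-entropy where answer tokens count more than CoT tokens.

    logits:        (batch, seq, vocab) model outputs, already aligned with labels
                   (any causal shift is assumed to be done by the caller)
    labels:        (batch, seq) target token ids, -100 for ignored positions
    answer_mask:   (batch, seq) 1.0 where the token is part of the final answer,
                   0.0 for prompt / chain-of-thought tokens
    answer_weight: how much more an answer token counts (illustrative value)
    """
    vocab = logits.size(-1)
    # Per-token cross-entropy; ignored positions contribute zero loss.
    per_token = F.cross_entropy(
        logits.reshape(-1, vocab), labels.reshape(-1),
        reduction="none", ignore_index=-100,
    ).view(labels.shape)

    # Reasoning tokens keep weight 1.0; answer tokens are up-weighted.
    weights = 1.0 + (answer_weight - 1.0) * answer_mask
    valid = (labels != -100).float()
    return (per_token * weights * valid).sum() / (weights * valid).sum()
```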

December 26, 2025 · 4 min · Zelina

Fine-Tuning Isn’t Just Supervised: Why SFT Is Really RL in Disguise

In the arms race to align large language models (LLMs), supervised fine-tuning (SFT) and reinforcement learning (RL) are often painted as competing paradigms. SFT is praised for its stability and simplicity; RL is heralded for its theoretical soundness and alignment fidelity. But what if this dichotomy is an illusion? A recent preprint from Chongli Qin and Jost Tobias Springenberg makes a bold and elegant claim: SFT on curated data is not merely supervised learning—it is actually optimizing a lower bound on the RL objective. ...
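
The flavour of the argument can be sketched with Jensen’s inequality, assuming a binary reward r, a behaviour policy q that generated the raw data, and a curated distribution p obtained by keeping only the rewarded samples; the notation here is mine, not necessarily the preprint’s.

```latex
% RL objective rewritten over the data-generating policy q via importance
% sampling, then lower-bounded with Jensen's inequality.
\begin{align*}
\log J(\theta)
  &= \log \mathbb{E}_{y \sim q}\!\left[ r(y)\, \frac{\pi_\theta(y)}{q(y)} \right] \\
  &= \log Z + \log \mathbb{E}_{y \sim p}\!\left[ \frac{\pi_\theta(y)}{q(y)} \right]
     \qquad p(y) \propto r(y)\, q(y),\quad Z = \mathbb{E}_{q}[r] \\
  &\ge \log Z + \mathbb{E}_{y \sim p}\!\left[ \log \pi_\theta(y) \right]
     - \mathbb{E}_{y \sim p}\!\left[ \log q(y) \right].
\end{align*}
% The only theta-dependent term in the bound is E_p[log pi_theta(y)]:
% the ordinary SFT log-likelihood on the curated (reward-filtered) data.
```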

July 18, 2025 · 4 min · Zelina