Supervised Fine-Tuning

When AI Learns the Trick First: Why Insight Beats Brute Force in Theorem Proving

The trick usually comes before the proof. That is not how most AI demos are staged, of course. The demo asks a model a difficult question, the model produces a long answer, and everyone pretends length is evidence of thought. Mathematics is less polite. A proof can be long, fluent, and wrong. It can also be short because the solver noticed the one move that makes the rest almost mechanical. ...

When the Answer Matters More Than the Thinking

Answer. In most business systems, that is the part users actually care about. The approval decision. The risk label. The final invoice category. The recommended next action. The tidy little field that decides whether the workflow moves forward or someone opens a Slack thread titled “Why did the AI say this?” Yet much of modern LLM fine-tuning treats that answer as just another slice of text. Worse, when supervised examples include long chain-of-thought explanations, the final answer may become the shortest and least dominant part of the training objective. The model learns to produce a convincing trail of reasoning, but the tiny destination at the end receives comparatively little optimization pressure. Very elegant. Also slightly absurd. ...
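To make the imbalance concrete, here is a minimal sketch of how small the answer's share of a token-averaged loss can be. It assumes every token contributes roughly the same loss on average, which is a simplification; the function name `answer_loss_share`, the `answer_weight` parameter, and the token counts are hypothetical and not taken from the article.

```python
def answer_loss_share(n_reasoning_tokens, n_answer_tokens, answer_weight=1.0):
    """Fraction of a token-weighted loss that falls on the answer span.

    Assumption (illustrative only): each token carries the same average loss,
    so the share depends only on token counts and the weight on answer tokens.
    """
    if n_reasoning_tokens < 0 or n_answer_tokens < 0 or answer_weight < 0:
        raise ValueError("token counts and weight must be non-negative")
    weighted_answer = answer_weight * n_answer_tokens
    total = n_reasoning_tokens + weighted_answer
    if total == 0:
        raise ValueError("at least one token must carry weight")
    return weighted_answer / total


# Hypothetical example: 400 reasoning tokens followed by a 4-token answer.
share_uniform = answer_loss_share(400, 4)
# Same example with answer tokens up-weighted 100x.
share_weighted = answer_loss_share(400, 4, answer_weight=100.0)

print(f"uniform weighting:  {share_uniform:.4f}")
print(f"answer up-weighted: {share_weighted:.4f}")
```

Under these assumed numbers the answer receives about 1% of the objective with uniform weighting and half of it once answer tokens are up-weighted. Whether up-weighting is the right remedy is a separate question; the sketch only illustrates the proportion.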

From Building Blocks to Breakthroughs: Why RL Finally Teaches Models to Think

Training an AI model is often sold like a kitchen renovation: add more data, add reinforcement learning, install the shiny reasoning countertop, and suddenly the whole thing looks expensive enough to be intelligent. This paper is useful because it ruins that brochure. The authors of “Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies” ask a deceptively simple question: does reinforcement learning create new reasoning ability, or does it only increase the probability of behaviors the model could already produce? [1] Their answer is not the clean slogan either camp wants. RL can synthesize new compositional reasoning, but only when the model has already learned the right underlying atomic skills. Without that foundation, RL mostly polishes whatever behavior already exists. Sometimes that is reasoning. Sometimes it is just a better-trained shortcut wearing a lab coat. ...

Eight Arms, One Mind: How OctoMed Turns Data Recipes into Medical Reasoning Power

Recipe sounds like a small word for an expensive problem. In medical AI, the usual boardroom story is simple: buy a bigger model, add more compute, sprinkle in reinforcement learning, and wait for clinical intelligence to appear. Very elegant. Also very convenient for anyone selling compute. ...

Fine-Tuning Isn’t Just Supervised: Why SFT Is Really RL in Disguise

TL;DR for operators: Fine-tuning on curated examples is usually sold as the boring, stable cousin of reinforcement learning. The paper behind this article says that is too neat. When a team filters examples into “good” and “not good,” it has already created a sparse reward function. Standard supervised fine-tuning on the surviving examples is therefore not outside reinforcement learning; it is optimising a lower bound on an RL objective, only without admitting it at the meeting. ...
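One standard way to see the lower-bound claim is sketched below. The notation is an assumption made for illustration and is not taken from the article: $r(\tau) \in \{0, 1\}$ is the keep/discard filter, $q$ is whatever distribution produced the candidate examples, and $\pi_\theta$ is the model being fine-tuned. The sketch also assumes $q(\tau) > 0$ wherever $\pi_\theta(\tau) > 0$, and uses the inequality $x \ge 1 + \log x$ for $x > 0$.

```latex
\begin{align*}
J(\theta)
  &= \mathbb{E}_{\tau \sim \pi_\theta}\big[ r(\tau) \big]
   = \mathbb{E}_{\tau \sim q}\!\left[ \frac{\pi_\theta(\tau)}{q(\tau)}\, r(\tau) \right] \\
  &\ge \mathbb{E}_{\tau \sim q}\!\left[ \Big( 1 + \log \frac{\pi_\theta(\tau)}{q(\tau)} \Big)\, r(\tau) \right] \\
  &= \underbrace{\mathbb{E}_{\tau \sim q}\big[ r(\tau) \log \pi_\theta(\tau) \big]}_{\text{SFT log-likelihood on kept examples}}
     \;+\; \underbrace{\mathbb{E}_{\tau \sim q}\big[ r(\tau)\,\big(1 - \log q(\tau)\big) \big]}_{\text{independent of } \theta}
\end{align*}
```

Because $r(\tau)$ is zero for every discarded example, the first term is the ordinary log-likelihood over the surviving examples, up to a constant factor. Maximising it therefore raises a lower bound on the expected reward $J(\theta)$. How tight that bound is depends on how far $\pi_\theta$ sits from $q$, which this sketch does not address.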