Fine-Tuning Isn’t Just Supervised: Why SFT Is Really RL in Disguise

In the arms race to align large language models (LLMs), supervised fine-tuning (SFT) and reinforcement learning (RL) are often painted as competing paradigms. SFT is praised for its stability and simplicity; RL is heralded for its theoretical soundness and alignment fidelity. But what if this dichotomy is an illusion? A recent preprint from Chongli Qin and Jost Tobias Springenberg makes a bold and elegant claim: SFT on curated data is not merely supervised learning—it is actually optimizing a lower bound on the RL objective. ...
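
A rough sketch of why such a bound can arise, in my own notation rather than necessarily the paper's exact derivation: if the curated data is viewed as samples from a reference policy $\pi_{\mathrm{ref}}$ reweighted by a reward $R$, then Jensen's inequality turns the log of the RL objective into an SFT-style log-likelihood.

$$
\log J(\theta) \;=\; \log \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\left[ R(x, y)\, \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right] \;\ge\; \mathbb{E}_{y \sim q(\cdot \mid x)}\!\left[ \log \pi_\theta(y \mid x) \right] \;+\; \mathrm{const},
$$

where $q(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\, R(x, y)$ stands in for the reward-filtered (curated) data distribution and the constant collects terms independent of $\theta$. Maximizing the SFT log-likelihood on such data therefore pushes up a lower bound on the log RL objective.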

July 18, 2025 · 4 min · Zelina

Train of Thought: How Long-Haul RL Unlocks LLM Reasoning Diversity

In the race to make Large Language Models (LLMs) reason like humans—or better—most researchers obsess over one thing: prompting. Chain-of-thought prompts, few-shot demos, scratchpads, tools. But a new study from NVIDIA suggests something even more fundamental: it’s not just how you prompt them—it’s how long you train them. Their paper, Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training, explores how stretching reinforcement learning (RL) over time unlocks broader, more stable, and more versatile reasoning in LLMs. This isn’t just about incremental gains—it’s about escaping reasoning ruts. ...

July 18, 2025 · 3 min · Zelina

The Sink That Remembers: Solving LLM Memorization Without Forgetting Everything Else

When large language models (LLMs) memorize repeated content during training—be it a phone number, a copyrighted paragraph, or a user’s personal story—the implications go beyond benign repetition. They touch the very core of AI safety, privacy, and trust. And yet, removing this memorized content after training has proven to be a devil’s bargain: anything you subtract tends to weaken the model’s overall capabilities. In their recent ICML 2025 paper, Ghosal et al. propose an elegant reframing of this problem. Rather than performing painful post-hoc surgery on a trained model, they suggest we prepare the model from the outset to isolate memorization into removable compartments—which they call Memorization Sinks (MemSinks). ...

July 15, 2025 · 4 min · Zelina