Clipped, Grouped, and Decoupled: Why RL Fine-Tuning Still Behaves Like a Negotiation With Chaos

Training a reasoning model sounds wonderfully modern until the model discovers that “being correct” and “looking correct enough to satisfy the reward” are not the same career path. That is the quiet problem behind reinforcement learning fine-tuning for large language models. The research conversation often treats methods like PPO, GRPO, and DAPO as a sequence of upgrades: first the classic algorithm, then the critic-free group method, then the decoupled-and-dynamically-sampled variant with a nicer acronym. Very tidy. Unfortunately, models do not read product positioning decks. ...

December 9, 2025 · 17 min · Zelina

Policies with Purpose: How PPO Powers Smart Business Decisions

TL;DR for operators: The paper is about air-purifying booth placement in Delhi, but the useful business lesson is broader: optimisation is rarely about chasing the loudest metric. In the study, a greedy strategy that targets the highest-AQI cells achieves the highest overall AQI improvement, at 25.76%. The PPO-based strategy is slightly lower on that headline number, at 25.39%, but much stronger on population impact and traffic impact, with zero green-space violations. ...

May 5, 2025 · 16 min · Zelina