GRPO | Cognaptus

The Reward Model Was Confident. That Was the Bug.

TL;DR for operators Reward models should not be treated as little oracles that hand down one clean number from the alignment heavens. In the paper’s diagnosis, the problem is more mundane and therefore more dangerous: a reward model can be wrong, uncertain, and numerically confident-looking at the same time. GRPO then standardizes those rewards inside a rollout group, giving extreme scores large influence even when the reward model is least reliable. Excellent. The pipeline has discovered a way to launder uncertainty into policy updates. ...

Rewarding Bad Physics Habits: What VLMs Learn When You Pay Them to Reason

A factory camera sees a pressure gauge. The AI reads the image, explains the mechanism, applies the formula, and recommends an action. Everyone in the meeting relaxes, because the model has produced a neat chain of reasoning. That is usually the moment to become nervous. The dangerous part is not that a vision-language model can be wrong. We know that. The more interesting problem is that a model can become wrong in a very specific way because we trained it to chase the wrong reward. Pay it for clean formatting, and it learns to look organized. Pay it for final answers, and it may sacrifice the reasoning path. Pay it to stare at the image, and it may do better on spatial problems while forgetting that physics also contains formulas. Apparently, “look harder” is not a complete theory of mechanics. ...

When Reasoning Pays (and When It Cheats): Fixing RL Signals in LLM Training

Scorecards are useful until people learn how the scorecard works. That is not a cynical observation. It is basic management. Sales teams optimize for commission rules. Customer-service teams optimize for handle-time dashboards. Students optimize for exams. And language models, with their charming lack of shame, optimize whatever reward function we put in front of them. ...

When Right Meets Wrong: Teaching LLMs by Letting Their Mistakes Talk

Training a reasoning model is often treated like running a classroom with a very impatient teacher: give the model a problem, let it produce several answers, mark each answer right or wrong, and push the policy toward the winners. That is already useful. It is also slightly wasteful. Because in a real classroom, the wrong answers are not just trash to be swept off the floor. They reveal what the student misunderstood. They show which shortcuts are tempting, which algebra step keeps breaking, and which false pattern looks suspiciously persuasive. A good teacher does not only praise the correct solution. A good teacher puts the correct and incorrect attempts side by side and asks: what exactly changed? ...

Reasoning Is Optional. Optimization Is Not: Rethinking VLA Training with NORD

Driving teams do not pay for reasoning tokens because they enjoy watching a model narrate its inner life. They pay for them because, at least in current VLA training culture, reasoning traces are treated as a bridge between perception and action. The bridge is expensive. A typical reasoning-heavy Vision-Language-Action pipeline for autonomous driving collects large driving datasets, generates dense chain-of-thought-style annotations, supervised-fine-tunes the model, and then applies reinforcement learning to improve driving metrics. It is a respectable pipeline. It is also the kind of pipeline that quietly converts every research win into an invoice. ...

Drafts, Then Do Better: Teaching LLMs to Outgrow Their Own Reasoning

Most office work has a draft problem. A junior analyst writes a first version of a financial memo. A lawyer marks up an argument. A consultant turns messy meeting notes into a client-ready recommendation. The first attempt is rarely useless. It is usually half-right, locally clever, and globally flawed. The expensive part is not starting from zero. The expensive part is learning how to improve a decent draft without being hypnotized by it. ...

When Tokens Become Actions: A Policy Gradient Built for Transformers

Tool calls are not tokens. Neither are paragraphs, reasoning blocks, spreadsheet edits, web searches, code executions, or the awkward little detours an agent takes before finally answering the user. Yet much of reinforcement learning for language models still behaves as if it must choose between two unsatisfying extremes. At one end, every token is treated as a tiny action. At the other, the whole answer is treated as one indivisible action. The first view is mathematically tidy and operationally noisy. The second is practical for verifiable tasks, but it compresses an entire reasoning process into one final score, which is a bit like reviewing an employee only by checking whether the office building is still standing. ...

Clipped, Grouped, and Decoupled: Why RL Fine-Tuning Still Behaves Like a Negotiation With Chaos

Training a reasoning model sounds wonderfully modern until the model discovers that “being correct” and “looking correct enough to satisfy the reward” are not the same career path. That is the quiet problem behind reinforcement learning fine-tuning for large language models. The research conversation often treats methods like PPO, GRPO, and DAPO as a sequence of upgrades: first the classic algorithm, then the critic-free group method, then the decoupled-and-dynamically-sampled variant with a nicer acronym. Very tidy. Unfortunately, models do not read product positioning decks. ...

Plan>Then>Profit: Reinforcement Learning That Teaches LLMs to Outline Before They Think

Planning is usually the part of work everybody claims to value and nobody wants to inspect. The deck has a roadmap. The project has a strategy. The model has a chain of thought. Splendid. Now, does the plan actually make the execution better, or is it just theatre with bullet points? That is the useful question behind Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning, which introduces PTA-GRPO, a reinforcement-learning method that trains language models to generate an explicit analytic plan before detailed reasoning and then rewards the quality of that plan, not merely the final answer.1 ...

Branching Out of the Box: Tree‑OPO Turns MCTS Traces into Better RL for Reasoning

Branching Out of the Box: Tree-OPO Turns MCTS Traces into Better RL for Reasoning A search tree is expensive to build. Once you have paid for it, using only the final answers is a little like buying an aircraft engine and admiring the packaging. That is the useful instinct behind Tree-OPO, a paper that asks whether Monte Carlo Tree Search traces from a stronger teacher model can be reused not merely as demonstrations, but as a structured curriculum for training a smaller reasoning policy.1 The idea is not to run MCTS at inference time and call that progress. Nor is it to imitate a teacher’s logits until the student develops the personality of a photocopier. The paper’s more interesting move is subtler: take the partial reasoning states produced by search, let the student complete from those prefixes, and compute advantages in a way that respects where each prefix sits in the tree. ...