GRPO | Cognaptus

Plan>Then>Profit: Reinforcement Learning That Teaches LLMs to Outline Before They Think

TL;DR Most LLMs reason token‑by‑token and get lost in the weeds. PTA‑GRPO is a two‑stage method that (1) distills short, high‑level plans from a stronger teacher and (2) reinforces both the final answer and the plan’s quality. Across math benchmarks, it reliably outperforms GRPO/DAPO while producing shorter, cleaner solutions. For AI builders, the principle is simple: force an outline, then reward it. Why this paper matters for builders (not just benchmark chasers) From local greed to global guidance. Traditional CoT is myopic: it optimizes each next token. PTA‑GRPO adds a global outline that trims detours and reduces reasoning drift. Aligns with how teams actually work. Great analysts draft an outline before the memo; great agents should too. PTA‑GRPO operationalizes that habit. Product leverage: If your agents make multi‑step decisions (pricing, triage, troubleshooting), rewarding plan quality prevents hallucinated subgoals and makes reasoning auditable. Compute sanity: Instead of expensive tree search at inference, PTA‑GRPO trains planning skill so you can keep runtime simple. The core idea in one picture (words) Plan → Think → Answer. ...

Branching Out of the Box: Tree‑OPO Turns MCTS Traces into Better RL for Reasoning

The punchline Tree‑OPO takes something many labs already produce—MCTS rollouts from a stronger teacher—and treats them not just as answers but as a curriculum of prefixes. It then optimizes a student with GRPO-like updates, but with staged, tree-aware advantages instead of a flat group mean. The result in math reasoning (GSM8K) is a modest but consistent bump over standard GRPO while keeping memory/complexity low. Why this matters for practitioners: you can get more out of your expensive searches (or teacher traces) without training a value model or lugging around teacher logits during student training. ...

Train Long, Think Short: How Curriculum Learning Makes LLMs Think Smarter, Not Longer

When it comes to reasoning, bigger isn’t always better. Large language models (LLMs) often produce unnecessarily long chains of thought, burning through tokens — and budgets — even for simple problems. While fixed token limits during training can force brevity, they also rob models of the chance to first explore and then compress their reasoning. A new study, Train Long, Think Short, proposes a smarter path: curriculum learning for length control. Instead of a one-size-fits-all cap, the model starts with a generous token budget, learns robust reasoning strategies, and then gradually adapts to shorter limits over time. The result is a model that solves complex tasks with fewer tokens, without losing accuracy. ...

The Joy of Many Minds: How JoyAgents-R1 Unleashes the Power of Multi-LLM Reinforcement Learning

When it comes to language model agents, more minds may not always mean merrier results. Multi-agent reinforcement learning (MARL) promises a flexible path for decomposing and solving complex tasks, but coordinating multiple large language models (LLMs) remains riddled with instability, inefficiency, and memory fragmentation. Enter JoyAgents-R1, a novel framework that proposes an elegant, scalable solution for jointly evolving heterogeneous LLM agents using Group Relative Policy Optimization (GRPO). Developed by researchers at JD.com, JoyAgents-R1 combines memory evolution, policy optimization, and clever sampling strategies to form a resilient multi-agent architecture capable of matching the performance of larger SOTA models with far fewer parameters. ...