GRPO | Cognaptus

Train Long, Think Short: How Curriculum Learning Makes LLMs Think Smarter, Not Longer

TL;DR for operators: The paper behind this article proposes Curriculum GRPO: a reinforcement-learning training method that starts a reasoning model with a larger token budget, then gradually shrinks that budget until the model learns to solve problems in shorter traces.[1] The important point is not “ask the model to be brief.” We have tried that. It works roughly as well as asking a committee to be concise, which is to say: occasionally, under duress. The paper instead changes the training trajectory. The model is first allowed to explore longer reasoning paths, then is forced to compress successful strategies into a tighter token budget. ...

Train of Thought: How Long-Haul RL Unlocks LLM Reasoning Diversity

TL;DR for operators: NVIDIA’s paper is not saying “train longer and reasoning magically appears.” That would be comforting, simple, and wrong: a classic enterprise AI trifecta. The practical lesson is more surgical: prolonged reinforcement learning can keep improving a small reasoning model, but only when the training loop actively prevents collapse. The model needs verifiable rewards, diverse tasks, enough rollout diversity, careful clipping, a small KL penalty, reward shaping when behaviour goes off the rails, and periodic resets of both the reference policy and optimiser state. In other words, long-horizon RL behaves less like a single training job and more like operating a live system under stress. ...

The Joy of Many Minds: How JoyAgents-R1 Unleashes the Power of Multi-LLM Reinforcement Learning

TL;DR for operators: A naming note before the machinery starts: the existing Cognaptus title says JoyAgents-R1, but the arXiv paper itself names the benchmark HiMA-Ecom and the training method HiMA-R1. This revision uses the paper’s terminology, because accuracy is not decorative trim. The paper is useful for operators because it does not simply say “use more agents.” That slogan is old, cheap, and usually followed by a demo in which three chatbots politely agree with one another until the invoice arrives. The real contribution is more specific: the authors build a hierarchical e-commerce assistant benchmark, then train the master agent and specialised sub-agents jointly with reinforcement learning instead of optimising them as isolated prompt puppets.[1] ...