No Prompt Left Behind: How Shopee’s CompassMax Reinvents RL for Giant MoE Models
Why This Matters Now
Large reasoning models are entering their awkward adolescence. They’ve grown enormous (hundred-billion-parameter MoE giants with 30k-token rollouts), but their training pipelines still behave like fragile prototypes. Reinforcement learning, supposedly the engine that turns raw scale into actual reasoning capability, too often collapses: unstable gradients, wasted rollouts, unreliable reward models, and a stubborn mismatch between training and inference behavior. ...