Why This Matters Now

Large reasoning models are entering their awkward adolescence. They’ve grown enormous—hundred-billion‑parameter MoE giants with 30k‑token rollouts—but their training pipelines still behave like fragile prototypes. Reinforcement learning, supposedly the engine that turns raw scale into actual reasoning capability, too often collapses: unstable gradients, wasted rollouts, unreliable reward models, and a stubborn mismatch between training and inference behavior.

Shopee’s CompassMax‑V3‑Thinking paper tackles these problems with unusual bluntness: every prompt must matter, or the RL budget is being lit on fire. Their solution is not a shiny new algorithm, but a unified RL system—algorithmic, architectural, and infrastructural—engineered to eliminate waste from end to end.

For enterprises building domain‑adapted reasoning models or agentic systems, CompassMax is a preview of where industrial‑grade RL is heading next: precision, compression, and ruthless efficiency.

Background — Why RL Hits a Wall at MoE Scale

Existing RL pipelines excel at small and mid‑sized models, but when scaled to large MoE systems, several chronic ailments flare up:

  • Zero‑variance prompts: groups of rollouts where every answer earns the same reward—perfectly useless for learning.
  • Importance sampling instability: token‑level signals diverge wildly across 32k‑token sequences.
  • Training–inference mismatch: MoE routing decisions differ between rollout engines and training engines.
  • Reward model inversion: naïve Bradley–Terry preference models flip advantage signs near the mean.
  • Rollout bottlenecks: most time is spent simply generating text.

CompassMax’s insight is simple: these issues are not isolated. They interact. Fixing only one leaves the others to sabotage the system.
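
To see why a zero‑variance prompt is dead weight, consider GRPO‑style group‑relative advantages, sketched below. When every rollout in a group earns the same reward, the normalized advantages collapse to zero and the prompt contributes no gradient at all. (Minimal sketch; the normalization form is the standard group‑relative recipe, not necessarily the paper's exact variant.)

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# A mixed group produces a usable learning signal...
print(group_advantages([1.0, 0.0, 1.0, 0.0]))   # non-zero advantages

# ...but a zero-variance group (all rollouts correct, or all wrong)
# collapses to all-zero advantages: every token's gradient vanishes.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))   # [0. 0. 0. 0.]
```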

Analysis — What the Paper Actually Does

The authors build a holistic training pipeline (illustrated on pages 3–4 of the paper) that moves from SFT → Model Merge → Two‑Stage RL. But the real novelty lies in the RL engineering.

1. Multi‑Stage Zero‑Variance Elimination (ZVE)

A three‑layer defense against useless rollouts:

  • Rollout filtering: detect prompts that always produce perfect (or uniformly terrible) rewards.
  • Reward shaping: length penalties, repetition penalties, and diverse pass‑rate measures to widen the reward distribution.
  • Advantage smoothing: RL‑ZVP injects controlled stochasticity to preserve gradient signals.

The result: 17% reduction in zero‑variance groups, and meaningful gradients even when the model becomes strong.
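
A minimal sketch of the first two layers is below, with placeholder penalty coefficients and thresholds; the paper's exact shaping terms are not reproduced here.

```python
import numpy as np

def shape_reward(base_reward, response_tokens, max_len=32768,
                 length_coef=0.05, repeat_coef=0.1):
    """Illustrative reward shaping: penalize overlong and repetitive
    responses so rewards within a group are less likely to collapse
    onto a single value. Coefficients are placeholders."""
    length_pen = length_coef * min(len(response_tokens) / max_len, 1.0)
    distinct_frac = len(set(response_tokens)) / max(len(response_tokens), 1)
    repeat_pen = repeat_coef * (1.0 - distinct_frac)
    return base_reward - length_pen - repeat_pen

def filter_zero_variance(groups, tol=1e-8):
    """Drop prompt groups whose shaped rewards are (near-)identical,
    since they contribute no group-relative advantage. Each group is a
    list of (response_tokens, reward) pairs."""
    return [g for g in groups if np.std([r for _, r in g]) > tol]
```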

2. ESPO — Entropy‑Adaptive Importance Sampling

Where GRPO treats every token identically, ESPO clusters tokens by entropy and assigns group‑wise importance ratios. High‑entropy segments, where reasoning actually happens, receive proportionally more weight.

This directly addresses the 80/20 effect of reasoning tokens (a small minority of high‑entropy tokens carries most of the learning signal) and avoids instability in long‑horizon updates.
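
The paper's exact formulation isn't reproduced here, but the core idea can be sketched: bucket tokens by entropy, pool the importance ratio within each bucket, and scale by the bucket's relative entropy so high‑entropy segments carry more of the update. The grouping rule, clipping, and scaling below are illustrative assumptions, as are the tensor names.

```python
import torch

def espo_token_weights(new_logp, old_logp, entropy, n_groups=4, clip=0.2):
    """Entropy-grouped importance weights (sketch). Inputs are per-token
    tensors of shape [T]: log-probs under the new and old policies, and
    the policy entropy at each position."""
    log_ratio = new_logp - old_logp
    # Assign each token to an entropy-quantile bucket.
    qs = torch.linspace(0, 1, n_groups + 1,
                        dtype=entropy.dtype, device=entropy.device)
    boundaries = torch.quantile(entropy, qs)[1:-1]
    bucket = torch.bucketize(entropy, boundaries)      # values in {0..n_groups-1}
    weights = torch.empty_like(log_ratio)
    for g in range(n_groups):
        mask = bucket == g
        if mask.any():
            # Shared ratio for the bucket (geometric-mean-style pooling),
            # clipped to keep long-horizon updates stable.
            group_ratio = log_ratio[mask].mean().exp().clamp(1 - clip, 1 + clip)
            # Up-weight high-entropy buckets (assumed proportional scheme).
            ent_scale = entropy[mask].mean() / (entropy.mean() + 1e-6)
            weights[mask] = group_ratio * ent_scale
    return weights
```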

3. Router Replay to Align Training & Inference

MoE routing drift is a spectacular way to detonate RL. CompassMax logs routing decisions during inference (vLLM) and replays them during training (Megatron). This eliminates the biggest source of log‑probability mismatch.

The discrepancy drops by an order of magnitude—from 1e‑3 to 1e‑4.
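
A simplified sketch of the replay mechanism follows, assuming a standard top‑k gated router; the plumbing between vLLM's logging and Megatron's forward pass is abstracted away, and the interface is illustrative rather than the paper's actual API.

```python
import torch
import torch.nn as nn

class ReplayableRouter(nn.Module):
    """Router replay (sketch): during rollout we record the top-k expert
    indices each token was routed to; during the training forward pass we
    reuse those indices instead of re-selecting, so training and inference
    see the same expert assignment. Gate probabilities still come from the
    training-side router so gradients flow."""

    def __init__(self, hidden, n_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(hidden, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x, replay_indices=None):
        logits = self.gate(x)                       # [tokens, n_experts]
        probs = logits.softmax(dim=-1)
        if replay_indices is None:
            # Inference path: select experts and return the decision for logging.
            weights, indices = torch.topk(probs, self.top_k, dim=-1)
        else:
            # Training path: replay the logged expert indices.
            indices = replay_indices
            weights = probs.gather(-1, indices)
        return weights, indices
```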

4. Generative Reward Model (GenRM)

Instead of brittle pairwise scoring, GenRM:

  • performs CoT‑based reasoning before judging,
  • supports ternary outcomes: better / tie / worse,
  • avoids advantage flipping around the reward mean.

Performance: 84.3% GPT‑4 agreement, with 98.8% tie recognition.
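
Conceptually, the judge reasons first and then emits a ternary verdict that maps onto a symmetric reward, so near‑equal responses no longer straddle a scalar mean and flip signs. The prompt format and parsing below are illustrative assumptions, not the paper's spec.

```python
from enum import Enum

class Verdict(Enum):
    BETTER = "better"
    TIE = "tie"
    WORSE = "worse"

def genrm_reward(verdict: Verdict) -> float:
    """Map the judge's ternary verdict to a reward. Ties map to exactly 0,
    so two roughly equal responses receive identical (not sign-flipped)
    advantage contributions."""
    return {Verdict.BETTER: 1.0, Verdict.TIE: 0.0, Verdict.WORSE: -1.0}[verdict]

def parse_verdict(judge_output: str) -> Verdict:
    """Naive parse of the final line of the judge's chain-of-thought output
    (hypothetical format: reasoning first, verdict keyword on the last line)."""
    last = judge_output.strip().splitlines()[-1].lower()
    for v in Verdict:
        if v.value in last:
            return v
    return Verdict.TIE  # default to a tie when the verdict is ambiguous
```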

5. High‑Throughput RL System

The profiling diagram on page 12 shows rollout dominating the loop, consuming 61% of total time. Their system‑level upgrades include:

  • FP8 quantized rollout → 30% faster generation.
  • Length‑aware load balancing → reduces stragglers.
  • Multi‑detokenization → parallel CPU decoding.
  • Reward overlap → compute rewards while generation continues.

Combined speedup: 1.66× overall throughput.
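
Of these, length‑aware load balancing is the most self‑contained to illustrate: a greedy longest‑first assignment of prompts to rollout engines by predicted output length, sketched below. The length predictor itself, and the function names, are assumptions.

```python
import heapq

def length_balanced_assignment(prompts, predicted_lengths, n_engines):
    """Length-aware load balancing (sketch): longest-processing-time-first
    greedy assignment of prompts to rollout engines, so no single engine
    becomes a straggler stuck with several very long rollouts."""
    # Min-heap of (total_assigned_length, engine_id).
    heap = [(0, e) for e in range(n_engines)]
    heapq.heapify(heap)
    assignment = {e: [] for e in range(n_engines)}
    # Longest prompts first, each to the currently least-loaded engine.
    for prompt, length in sorted(zip(prompts, predicted_lengths),
                                 key=lambda pair: -pair[1]):
        load, engine = heapq.heappop(heap)
        assignment[engine].append(prompt)
        heapq.heappush(heap, (load + length, engine))
    return assignment
```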

Findings — The Pipeline’s Impact

CompassMax‑V3‑Thinking posts strong results across all evaluations, especially in e‑commerce tasks where domain grounding matters.

CompassMax vs Frontier Models (Selected Metrics)

| Domain | CompassMax‑V3‑Thinking | GPT‑5‑Thinking (medium) | DeepSeek‑R1 |
| --- | --- | --- | --- |
| E‑com QA (avg) | 85.79 | 76.29 | 79.10 |
| SEA Multilingual Avg | 86.41 | 86.64 | 84.11 |
| ARC Reasoning (AIME24) | 83.30 | 80.94 | 79.80 |
| HumanEval Pass@1 | 98.17 | 85.08 | 96.95 |

The strength is not just raw accuracy—it’s stability across tasks, a signature of well‑behaved RL.

System Efficiency Improvements

| Optimization | Speedup (Cumulative) |
| --- | --- |
| Baseline | 1.00× |
| + Detokenization Parallelism | 1.16× |
| + Reward Overlap | 1.17× |
| + FP8 Rollout | 1.52× |
| + Length‑Balancing | 1.66× |

A rare case where systems engineering makes RL faster even as the model gets larger.

Implications — Why Enterprises Should Care

CompassMax’s contribution is not confined to Shopee’s e‑commerce stack. It speaks to a broader shift:

1. RL for reasoning is becoming an engineering discipline, not a research experiment.

The lesson is clear: scaling RL requires equally scaled infrastructure, debugging, and system design.

2. Domain‑adaptive models will rely on modular reward systems.

Shopee’s Compass‑Gym blends code verifiers, instruction‑following validators, tool‑use rewards, and GenRM assessments. Future enterprise LLMs will need similar mixed‑signal pipelines.
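
In practice that means routing each sample to whichever reward source can actually verify it, roughly as sketched below; the task tags, source names, and placeholder verifiers are illustrative, not Compass‑Gym's actual interface.

```python
from typing import Callable, Dict

# Each reward source maps (prompt, response) -> float.
RewardFn = Callable[[str, str], float]

def route_reward(task: str, prompt: str, response: str,
                 sources: Dict[str, RewardFn]) -> float:
    """Mixed-signal reward routing (sketch): verifiable tasks go to
    rule-based or programmatic checkers; everything else falls back to
    the generative judge."""
    return sources.get(task, sources["genrm"])(prompt, response)

# Usage with placeholder verifiers standing in for the real components:
sources: Dict[str, RewardFn] = {
    "code": lambda p, r: 1.0,         # stand-in for a unit-test verifier
    "instruction": lambda p, r: 0.5,  # stand-in for a constraint validator
    "tool_use": lambda p, r: 0.0,     # stand-in for a tool-call checker
    "genrm": lambda p, r: 0.3,        # stand-in for the generative judge
}
score = route_reward("code", "example prompt", "example response", sources)
```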

3. Importance sampling and advantage estimation must evolve beyond classic PPO.

Entropy‑adaptive grouping is a preview of more heterogeneous, context‑aware optimization methods.

4. Model merging is no longer a hack—it is a core capability.

Particularly for industries (like e‑commerce) where compute budgets lag behind domain complexity.

5. High‑throughput RL must become standard for agentic systems.

Massive reasoning agents can’t be trained slowly. Efficiency is now an accuracy multiplier.

Conclusion

Shopee’s CompassMax‑V3‑Thinking demonstrates that RL at MoE scale is feasible—not by inventing a new algorithm, but by rebuilding the entire pipeline around one principle: each prompt must matter.

The result is a reasoning system that is faster, more stable, more interpretable, and more domain‑adaptable. For enterprises building their own domain LLMs or agent systems, CompassMax provides a template: unify reward modeling, optimization, and system engineering into a cohesive whole.

Cognaptus: Automate the Present, Incubate the Future.
