Rollouts are expensive little creatures. They consume GPU time, produce long reasoning traces, wait for reward computation, and then—if the reward signal is flat—contribute exactly nothing to learning. The GPU was busy. The training dashboard looked serious. The model learned no usable distinction. Very productive, in the same way a meeting with twelve people and no decision is productive.

Shopee’s paper on CompassMax-V3-Thinking is built around a simple operating principle: each prompt must matter [1]. The phrase sounds like a motivational poster for reinforcement learning engineers, but the technical argument is more concrete. At hundred-billion-scale Mixture-of-Experts (MoE), the problem is not merely that RL is expensive. The problem is that expensive RL can silently waste itself through uninformative prompt groups, unstable importance sampling, reward-model distortions, train-inference mismatch, and rollout-system bottlenecks.

That is the useful way to read this paper. Not as “Shopee trained a stronger reasoning model,” although it did. Not as “another company joins the thinking-model parade,” because we have enough parades and not enough traffic control. The more interesting contribution is an integrated RL pipeline for making large MoE reasoning training less wasteful and less fragile.

The paper’s headline result is CompassMax-V3-Thinking, a hundred-billion-scale MoE reasoning model trained through cold-start Long-CoT supervised fine-tuning, model merging, and two-phase large-scale RL. The model performs strongly on Shopee’s internal e-commerce benchmark, multilingual Southeast Asian evaluation sets, internal general-capability tests, and a public ARC-style benchmark suite. But the deeper lesson is operational: scaling RL is an efficiency problem before it is a hero-model problem.

The real bottleneck is not “more RL”; it is whether RL has anything to learn from

The common business interpretation of RL for LLMs is still too linear: more rollouts, more reward, more training, better reasoning. That belief is directionally comforting and operationally dangerous.

Group-based RL methods such as GRPO depend on comparing multiple sampled responses for the same prompt. If the group contains variation in reward, the model can learn which behaviors to strengthen and which to suppress. If all samples receive the same reward, the advantage becomes zero. No contrast, no gradient, no learning signal.
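To make the zero-advantage claim concrete, here is a minimal sketch of group-normalized advantages in the GRPO style. The function name `group_advantages` and the `eps` constant are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed outcomes give positive and negative advantages to learn from.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform outcomes (all pass or all fail) give every sample an advantage of zero.
all_pass = group_advantages([1.0, 1.0, 1.0, 1.0])
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])
```

In the uniform groups every numerator is zero, so the policy-gradient term for that prompt vanishes no matter how many tokens the rollouts cost.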

Shopee calls this the zero-variance prompt problem. It appears in two opposite situations. Early in training, the model may be too weak, so all responses fail. Later, the model may be strong enough that all responses pass. In both cases, the prompt becomes uninformative for group-based optimization. The system has generated responses, scored them, and paid the compute bill, but the model receives no effective update.

This is the first mechanism in the paper’s chain: RL compute is only valuable when the prompt creates reward variance.

Shopee’s response is not simply to discard zero-variance samples. Existing dynamic sampling can remove bad groups, but at large MoE scale that means throwing away costly rollouts after they have already been produced. The paper instead uses a multi-stage strategy:

| Stage | Mechanism | What it tries to save |
| --- | --- | --- |
| Rollout stage | Filter overly easy samples and expand exploration through larger sampling size $N$ | Avoid generating groups that are uniformly correct or uniformly wrong |
| Reward stage | Use pass-rate-style rewards plus length and repetition penalties | Turn otherwise flat outcomes into more informative negative and positive signals |
| Actor update stage | Apply RL-ZVP-style advantage reshaping for remaining zero-variance groups | Recover some learning signal even when reward variance remains low |

The rollout-stage experiment is a useful sensitivity test, not merely decoration. As Shopee increases sampling size $N$, the zero-variance rate declines while pass@$N$ improves. The paper reports that expanding the exploration space reduced the overall zero-variance rate by 17%. The figure also shows the trade-off: advantage gained per data point rises with larger $N$, but advantage per unit of compute declines. That is exactly the kind of result a production team should care about. The correct sampling size is not “as large as possible”; it is where the marginal learning signal still justifies the marginal compute.
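The shape of that trade-off can be reproduced with a toy model. The sketch below assumes binary pass/fail rewards and independent samples with a fixed pass probability `p`. Both are simplifying assumptions of this illustration, not the paper's setup, and the helper names are hypothetical.

```python
def zero_variance_prob(p, n):
    """Chance that n i.i.d. pass/fail samples all agree (all pass or all fail)."""
    return p ** n + (1 - p) ** n

def pass_at_n(p, n):
    """Chance that at least one of n samples passes."""
    return 1 - (1 - p) ** n

def informative_groups_per_rollout(p, n):
    """Expected share of informative groups, divided by the n rollouts they cost."""
    return (1 - zero_variance_prob(p, n)) / n

# Hypothetical hard prompt: each sample passes with probability 0.1.
for n in (2, 4, 8, 16):
    print(n,
          round(zero_variance_prob(0.1, n), 3),
          round(pass_at_n(0.1, n), 3),
          round(informative_groups_per_rollout(0.1, n), 3))
```

Under these assumptions the zero-variance probability falls and pass@$N$ rises as $N$ grows, while the informative share per rollout shrinks. It is the same qualitative pattern as the paper's figure, in miniature.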

For business readers, the point is not to copy the exact $N$. The point is to track the right failure mode. A training pipeline that reports only aggregate reward curves may hide the fact that too many prompts have become informational dead weight. A useful RL dashboard should expose the zero-variance rate, reward distribution, advantage magnitude, and compute-normalized learning signal. Otherwise, the organization is just staring at a loss curve and hoping it has a personality.
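As a rough illustration of such a dashboard, the sketch below computes three of those quantities from a batch of scored rollouts. The metric names, the use of centered reward as a stand-in for advantage, and the choice of generated tokens as the compute proxy are all assumptions of this example.

```python
def rollout_diagnostics(groups):
    """groups: one list per prompt, each item a (reward, tokens_generated) pair."""
    zero_variance = 0
    abs_signal = 0.0
    rollouts = 0
    tokens = 0
    for group in groups:
        rewards = [r for r, _ in group]
        mean_reward = sum(rewards) / len(rewards)
        if max(rewards) == min(rewards):
            zero_variance += 1  # every sample scored the same: no contrast
        abs_signal += sum(abs(r - mean_reward) for r in rewards)
        rollouts += len(group)
        tokens += sum(t for _, t in group)
    return {
        "zero_variance_rate": zero_variance / len(groups),
        "mean_abs_centered_reward": abs_signal / rollouts,
        "signal_per_1k_tokens": 1000 * abs_signal / tokens,
    }

batch = [
    [(1.0, 100), (0.0, 300)],  # informative: rewards differ
    [(1.0, 200), (1.0, 200)],  # zero-variance: both samples pass
]
report = rollout_diagnostics(batch)
```

Tracking the last metric over training steps shows whether the same token budget is buying less contrast over time, which an aggregate reward curve would not reveal.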

CompassMax begins before RL, because RL cannot repair every missing capability

The paper’s mechanism-first story starts with RL efficiency, but CompassMax-V3-Thinking is not trained by RL alone. Shopee first performs Long-CoT supervised fine-tuning, then merges capabilities from an e-commerce-specialized CompassMax-V3 checkpoint, and only then runs large-scale RL.

This matters because a careless summary might treat the final gains as a pure RL victory. The paper is more pragmatic than that. It recognizes that RL is not a magical importer of domain knowledge. If a model lacks the right base reasoning traces or e-commerce concepts, reinforcement learning may spend a long time pushing on weak internal representations.

The cold-start stage uses Long-CoT data distilled from high-capacity models, with filtering for duplication, test-set overlap, and harmful content. This creates a model capable of explicit long reasoning across domains such as mathematics, code, e-commerce, dialogue, and multilingual understanding. After that, Shopee uses model merging to integrate e-commerce capability from CompassMax-V3 into the thinking model. The paper evaluates several merging methods and reports that TIES produced the most consistent performance.

The practical interpretation is straightforward: RL works best after the model already has something coherent to optimize. In enterprise settings, this often means the training plan should separate three questions:

  1. Does the base model have the domain vocabulary and task schema?
  2. Does it have reasoning traces that make verifier-based training possible?
  3. Does RL provide contrastive reward signals that can improve behavior?

Treating all three as “fine-tuning” is how teams end up with expensive training jobs and vague explanations.

ESPO targets the unstable middle between token-level and sequence-level updates

Once the model is generating informative rollouts, the next problem is how to update it without destabilizing long-horizon reasoning. The paper positions ESPO—Entropy Importance Sampling Policy Optimization—as a response to a tension between token-level and sequence-level importance sampling.

Token-level importance sampling can be too volatile in long outputs. A few token-probability shifts may produce unstable ratios. Sequence-level methods such as GSPO smooth this by applying one ratio across an entire sequence, but that can over-flatten the learning signal. Not every token in a long reasoning trace carries the same uncertainty. Some parts are routine; others are where the model is genuinely exploring.

ESPO’s mechanism is to group tokens by entropy and assign independent importance-sampling behavior to entropy-coherent token groups. High-entropy regions receive more targeted treatment instead of being averaged away with low-entropy regions. The clipping threshold is also entropy-adaptive.

In plain language: ESPO tries to avoid both panic and numbness. Token-level updates can panic over local probability changes. Sequence-level updates can become numb to the important uncertain parts of a reasoning chain. ESPO attempts to keep the update sensitive where the model is uncertain and stable where it is not.
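The grouping idea can be sketched in a few lines. This is not the paper's objective. It is a simplified illustration that assumes equal-size entropy buckets, one shared ratio per bucket (the geometric mean of token ratios), and a made-up rule that widens the clip range with bucket entropy. The function name and the parameters `num_groups`, `base_eps`, and `eps_scale` are hypothetical.

```python
import math

def entropy_grouped_ratios(logp_new, logp_old, entropies,
                           num_groups=2, base_eps=0.2, eps_scale=0.5):
    """Return a (ratio, clip_eps) pair per token, shared within each entropy bucket."""
    n = len(entropies)
    order = sorted(range(n), key=lambda i: entropies[i])  # low to high entropy
    result = [None] * n
    for g in range(num_groups):
        idx = order[g * n // num_groups:(g + 1) * n // num_groups]
        if not idx:
            continue
        # Geometric mean of token-level ratios inside this bucket.
        mean_log_ratio = sum(logp_new[i] - logp_old[i] for i in idx) / len(idx)
        mean_entropy = sum(entropies[i] for i in idx) / len(idx)
        ratio = math.exp(mean_log_ratio)
        # Hypothetical adaptive rule: more uncertain buckets get a wider clip range.
        clip_eps = base_eps * (1 + eps_scale * mean_entropy)
        for i in idx:
            result[i] = (ratio, clip_eps)
    return result

logp_old = [-0.1, -0.1, -1.0, -1.2]
logp_new = [-0.1, -0.1, -1.5, -1.0]
entropies = [0.05, 0.02, 1.5, 1.2]
per_token = entropy_grouped_ratios(logp_new, logp_old, entropies)
```

Here the two routine tokens keep a ratio of exactly 1, while the two uncertain tokens share one smoothed ratio instead of two volatile ones. That is the middle ground between token-level and sequence-level treatment that the section describes.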

The paper’s ESPO section is primarily a methodological contribution. It gives the objective and explains the entropy grouping mechanism, but it does not isolate ESPO in a clean public ablation table against every alternative across all downstream benchmarks. That boundary matters. We can say the method is part of the pipeline that produced the final model. We should be more careful before saying ESPO alone caused a specific benchmark gain.

For practitioners, the broader lesson is still valuable: long reasoning traces are not uniform objects. A training system that treats every token in a chain of thought as equally informative is making a convenience assumption, not a law of nature.

Reward models can invert the lesson if they cannot recognize “about equal”

The third weak link is reward modeling. Shopee argues that standard Bradley–Terry-style or pointwise reward models can produce nonlinear reward curves near the group mean. When outputs are similar in quality, the reward model may impose an artificial ranking. In group-based advantage computation, that can flip the direction of the advantage: the model is nudged away from a response that should not have been penalized, or toward one that should not have been rewarded.

This is a particularly nasty failure mode because the training system still appears to be functioning. Rewards exist. Gradients flow. The model updates. The only problem is that the learning signal is occasionally pointing the wrong way. Tiny detail.

Shopee’s response is a Generative Reward Model with chain-of-thought reasoning and ternary comparison. Instead of forcing every pair into “A is better than B” or “B is better than A,” the GenRM can also output a tie. The model generates structured reasoning fields and then returns one of three labels: A, B, or TIE.
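A small sketch shows why the third label matters once verdicts feed a group-based advantage. The score mapping (win 1.0, tie 0.5, loss 0.0) and the convention that A is the candidate and B the reference are assumptions made for illustration. The paper's exact reward mapping may differ.

```python
VERDICT_SCORE = {"A": 1.0, "TIE": 0.5, "B": 0.0}  # hypothetical mapping

def rewards_from_verdicts(verdicts):
    """Map judge verdicts (candidate = A, reference = B) to scalar rewards."""
    return [VERDICT_SCORE[v] for v in verdicts]

def centered(rewards):
    """Subtract the group mean, the core of a group-based advantage."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

# Four candidates for one prompt; the first, second, and fourth are about as good
# as the reference, and only the third is clearly better.
ternary = centered(rewards_from_verdicts(["TIE", "TIE", "A", "TIE"]))

# A judge without a tie option must pick a side for the near-equal ones.
forced_binary = centered(rewards_from_verdicts(["A", "B", "A", "B"]))
```

With ties available, the near-equal candidates receive identical signals. With a forced choice, two of them are pushed in opposite directions on what amounts to noise.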

The reported GenRM results are one of the paper’s more concrete ablations. On Shopee’s in-house evaluation set, the GenRM reaches 83.8% overall accuracy versus 64.3% for the Skywork-v2-based 8B ORM used in CompassMax-V3. Its agreement rate with GPT-4 judgments reaches 84.3%. Most strikingly, tie recognition reaches 98.8%, compared with 51.2% for the ORM baseline.

That tie-recognition number is not a side note. It explains why the ternary design matters. In RL for writing, product descriptions, search relevance, and multilingual e-commerce tasks, many outputs are not cleanly ordered. Forcing near-equal answers into a binary preference is a recipe for fake precision. Fake precision then becomes real gradient. Real gradient then becomes model behavior. This is how small evaluation mistakes learn to wear a lab coat.

For business use, the implication is broader than reward-model architecture. If a company uses AI judges to improve domain assistants, the judge needs a calibrated “no meaningful difference” option. Otherwise, optimization will chase noise and call it alignment.

Router Replay fixes a MoE-specific mismatch that ordinary dashboards may miss

Large MoE models add another complication: routing. In MoE architectures, different tokens may be sent to different expert subnetworks. In modern RL systems, rollout generation and training often run through different infrastructure. The paper uses vLLM for rollout generation and Megatron for training. Because the two systems can differ in implementation details and numerical behavior, the token probability distribution observed during rollout can diverge from the one recomputed during training.

Shopee reports that the discrepancy becomes especially pronounced after the MoE router. Prior approaches have tried importance-sampling corrections, batch-invariant kernels, or precision changes. Shopee argues that these can delay instability or introduce overhead, and that FP16 is not a good solution for extremely large models because of overflow.

The proposed fix is Router Replay. During rollout, the system records router decisions for all tokens. During training, when Megatron recomputes log probabilities, it reuses those routing decisions. This aligns the expert path used in training with the expert path used during inference. The paper reports that this reduces training-inference log-probability discrepancy from the order of $10^{-3}$ to $10^{-4}$ and stabilizes RL training.
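The mechanism is easy to see on a toy layer. The sketch below is a hypothetical four-expert MoE with top-2 routing, not Shopee's implementation. It assumes that only the expert indices are replayed and that gate weights are recomputed from the training-side logits, which is a choice of this example.

```python
import math

def top_k(logits, k):
    """Indices of the k largest router logits."""
    return sorted(range(len(logits)), key=lambda e: -logits[e])[:k]

def moe_output(x, router_logits, experts, k=2, replay=None):
    """Toy MoE layer: softmax-weighted mix of the chosen experts' outputs.

    If `replay` is given, those expert indices are reused instead of a fresh top-k.
    """
    chosen = replay if replay is not None else top_k(router_logits, k)
    weights = [math.exp(router_logits[e]) for e in chosen]
    total = sum(weights)
    y = sum(w / total * experts[e](x) for w, e in zip(weights, chosen))
    return y, chosen

experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: 0.5 * x]
rollout_logits = [1.00, 0.50, 0.49, -1.0]
train_logits = [1.00, 0.49, 0.50, -1.0]  # tiny numeric drift swaps experts 1 and 2

y_rollout, routed = moe_output(3.0, rollout_logits, experts)
y_free, free_routed = moe_output(3.0, train_logits, experts)
y_replay, replay_routed = moe_output(3.0, train_logits, experts, replay=routed)
```

A drift of 0.01 in two router logits sends the unconstrained training pass to a different expert and changes the output substantially. The replayed pass stays close to the rollout value because only the gate weights move.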

This is an implementation detail with strategic consequences. For dense models, a business leader might reasonably think of train-inference mismatch as a numerical issue. For giant MoE models, it becomes a routing issue. The model is not just producing different probabilities; it may be consulting a slightly different internal committee.

The practical lesson is that MoE training quality depends on observability below the usual metric layer. Accuracy, reward, and loss may tell you that something went wrong. Router-level diagnostics can tell you why.

Compass-Gym turns business tasks into rewardable training problems

The paper’s reward system, Compass-Gym, is where the business relevance becomes most explicit. Shopee builds a multi-domain reward framework covering code, instruction following, tool use, GenRM judgment, and e-commerce-specific tasks.

The e-commerce reward system combines three types of signals:

| Task shape | Reward mechanism | Example use |
| --- | --- | --- |
| Discrete selection | Multilingual keyword-based verifier | Brand, category, label, or item choice |
| Structured extraction | JSON schema and field-level correctness checks | Attribute extraction, address parsing, product fields |
| Open-ended text | GenRM comparison against reference answers | Title rewriting, selling-point extraction, unstructured assistance |

This is a useful design pattern because e-commerce workflows are not one task. They are a bundle of classification, extraction, retrieval, recommendation, generation, and judgment. A single reward type would either be too rigid for open-ended tasks or too subjective for structured tasks.
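A minimal sketch of the two verifiable reward types might look like the following. The function names, the substring-matching rule, and the exact-match field scoring are assumptions for illustration. The open-ended branch is deliberately left out because it needs a judge model.

```python
import json

def keyword_reward(response, accepted_keywords):
    """1.0 if any accepted keyword (in any language variant) appears, else 0.0."""
    text = response.casefold()
    return 1.0 if any(k.casefold() in text for k in accepted_keywords) else 0.0

def json_field_reward(response, expected):
    """Fraction of expected fields reproduced exactly; 0.0 if not a JSON object.

    Assumes `expected` is a non-empty dict of field name to gold value.
    """
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    hits = sum(1 for key, value in expected.items()
               if key in parsed and parsed[key] == value)
    return hits / len(expected)

def ecommerce_reward(task_type, response, target):
    """Route each task shape to the verifier that fits it."""
    if task_type == "selection":
        return keyword_reward(response, target)
    if task_type == "extraction":
        return json_field_reward(response, target)
    raise ValueError("open-ended tasks need a judge model, omitted in this sketch")

brand_score = ecommerce_reward("selection", "The brand is Samsung.", ["samsung", "三星"])
fields_score = ecommerce_reward(
    "extraction",
    '{"brand": "Acme", "color": "red"}',
    {"brand": "Acme", "color": "blue"},
)
```

The partial credit in the extraction branch matters for RL: a response with one wrong field scores differently from malformed output, which gives the group something to contrast.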

The paper also describes tool-use reward as the sum of format and correctness terms. Format reward checks whether the trace is well-formed and fields appear in the required order. Correctness reward scores tool-name overlap, parameter-key overlap, and parameter-value matches, then uses optimal assignment to compare predicted and ground-truth tool calls. This is not glamorous, but it is exactly the kind of engineering that makes tool-using agents trainable. Agents do not improve just because someone tells them to “use tools better.” They improve when the training system can score what “better” means at the tool-name, argument-schema, and value levels.
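The correctness term can be sketched as follows. The equal weighting of the three components and the normalization by the larger call count are assumptions of this example. It also swaps in a brute-force search over permutations for the optimal assignment, which is only reasonable for a handful of calls; a production system would more likely use a Hungarian-style solver.

```python
from itertools import permutations

def call_score(pred, gold):
    """Score one predicted call against one gold call, in [0, 1]."""
    name = 1.0 if pred["name"] == gold["name"] else 0.0
    pred_keys, gold_keys = set(pred["args"]), set(gold["args"])
    union = pred_keys | gold_keys
    shared = pred_keys & gold_keys
    keys = len(shared) / len(union) if union else 1.0
    matched = sum(1 for k in shared if pred["args"][k] == gold["args"][k])
    values = matched / len(gold_keys) if gold_keys else 1.0
    return (name + keys + values) / 3

def correctness_reward(pred_calls, gold_calls):
    """Best one-to-one matching of predicted to gold calls, found by brute force."""
    if not pred_calls and not gold_calls:
        return 1.0
    size = max(len(pred_calls), len(gold_calls))
    padded = pred_calls + [None] * (size - len(pred_calls))  # None = missing call
    best = 0.0
    for perm in permutations(range(size), len(gold_calls)):
        total = sum(call_score(padded[p], g)
                    for p, g in zip(perm, gold_calls) if padded[p] is not None)
        best = max(best, total)
    return best / size  # missing or extra calls lower the score

gold = [
    {"name": "search", "args": {"q": "shoes", "limit": 5}},
    {"name": "get_price", "args": {"item_id": "42"}},
]
pred = [
    {"name": "get_price", "args": {"item_id": "42"}},
    {"name": "search", "args": {"q": "shoes", "limit": 10}},
]
score = correctness_reward(pred, gold)
```

Matching by assignment rather than by position means the reordered prediction above is penalized only for the wrong `limit` value, not for the order of its calls.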

For Cognaptus-style business automation, this is the strongest transferable idea in the paper. Most companies do not need to train hundred-billion-scale MoE models. But many do need to convert business workflows into rewardable units: valid JSON, correct field extraction, correct tool call, acceptable reasoning, multilingual consistency, and task-specific judgment.

The model may be huge. The lesson scales down.

The system results show that rollout engineering is not plumbing; it is training strategy

Algorithmic efficiency only matters if the system can execute it. Shopee profiles RL training using 256 H100 GPUs, 1k-token input length, 32k-token output length, batch size 512, and $n=8$. The rollout stage consumes 61% of step time, followed by actor update at 15.4%, reward computation at 13.5%, recompute log-probability at 9.5%, and weight resharding at 0.4%.

That profile tells the story. Long-context RL is dominated by generation. If rollout is slow, everything else waits. If reward computation starts only after all rollouts finish, the pipeline inherits the worst-case latency of the longest samples. If workers receive uneven generation lengths, those holding short samples sit idle while the others finish long traces. The bottleneck is not a single villain. It is a queue.
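The profile also says where optimization effort pays off. The sketch below applies a simple Amdahl-style calculation to the reported stage shares. The stage keys and the two example reductions are illustrative assumptions, and the calculation ignores any overlap between stages.

```python
# Stage shares of one RL step, as reported in the paper's profile.
STEP_PROFILE = {
    "rollout": 0.61,
    "actor_update": 0.154,
    "reward": 0.135,
    "recompute_logprob": 0.095,
    "reshard": 0.004,
}

def step_speedup(profile, stage, time_reduction):
    """Overall speedup if `stage` takes `time_reduction` (0 to 1) less time."""
    total = sum(profile.values())
    new_total = total - profile[stage] * time_reduction
    return total / new_total

# Hypothetical what-ifs: trim rollout by 30% versus halving the actor update.
rollout_gain = step_speedup(STEP_PROFILE, "rollout", 0.30)
actor_gain = step_speedup(STEP_PROFILE, "actor_update", 0.50)
```

Under this model a 30% cut in rollout time yields roughly a 1.22× overall speedup, while halving the actor update yields under 1.1×. A modest improvement to the dominant stage beats a dramatic one anywhere else.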

Shopee reports cumulative speedups from several system optimizations:

| Optimization step | Reported overall speedup |
| --- | --- |
| Baseline | 1.00× |
| Multi-detokenization parallelism | 1.16× |
| Reward computation overlap | 1.17× |
| FP8-quantized rollout | 1.52× |
| Length-based load balancing | 1.66× |

The individual details are also useful. Length-based load balancing uses short draft responses to estimate generation length, then distributes samples so workers receive roughly equal expected decoding loads. The paper reports about 8% rollout speed improvement and more than 12% reduction in GPU idle ratio. FP8-quantized rollout reduces rollout time by 30% for 32k generation length and cuts end-to-end training time by nearly 20%. Multi-detokenization improves reward labeling throughput by 14%. Reward computation overlap reduces time spent in the reward stage by about 85%.
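The load-balancing step can be illustrated with a standard greedy heuristic: sort samples by estimated length and always hand the next one to the least-loaded worker. The paper does not spell out its scheduling algorithm, so this longest-first greedy rule is a stand-in. The length estimates are taken as given here rather than derived from draft responses, and the sample values are made up.

```python
import heapq

def balance_by_length(estimated_lengths, num_workers):
    """Assign sample indices to workers, longest first, least-loaded worker next."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]
    by_length = sorted(range(len(estimated_lengths)),
                       key=lambda i: -estimated_lengths[i])
    for i in by_length:
        load, w = heapq.heappop(heap)
        assignment[w].append(i)
        heapq.heappush(heap, (load + estimated_lengths[i], w))
    return assignment

def max_load(assignment, lengths):
    """Decoding load of the busiest worker, which sets the rollout finish time."""
    return max(sum(lengths[i] for i in worker) for worker in assignment)

# Hypothetical estimated output lengths (tokens) for eight prompts, two workers.
lengths = [32000, 1000, 30000, 2000, 16000, 15000, 500, 500]
round_robin = [[0, 2, 4, 6], [1, 3, 5, 7]]
balanced = balance_by_length(lengths, 2)
```

On this example round-robin leaves one worker with 78,500 tokens of decoding while the other has 18,500. The greedy split gives each worker 48,500, so neither waits on the other.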

These are implementation results, but they are not secondary. At this scale, system optimization changes what is economically trainable. A method that is elegant but slow may never receive enough training steps to matter. A pipeline that reduces idle GPU time can convert the same hardware budget into more useful experiments.

The business translation is blunt: AI training infrastructure is no longer back-office plumbing. It is part of model strategy. The team that understands rollout latency, reward overlap, quantized inference, and worker load balancing may outperform the team with a better-sounding algorithm and worse utilization.

The evidence is strong, but not all tests support the same claim

The paper’s evaluation section reports four major evidence categories: in-house e-commerce, in-house multilingual, in-house general ability, and public ARC benchmarks. These should not be read as one undifferentiated scoreboard. Each serves a different purpose.

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Zero-variance and sampling-size analysis | Sensitivity / ablation-style diagnostic | Larger exploration can reduce zero-variance prompts, but compute-normalized gain must be managed | Exact sampling choice transfers to every model or domain |
| GenRM vs ORM comparison | Reward-model ablation | Ternary generative reward modeling improves agreement and tie handling on Shopee’s judge benchmark | GenRM is universally better for all reward tasks |
| Router Replay log-prob discrepancy | Implementation diagnostic | MoE train-inference mismatch can be reduced by replaying router decisions | Router Replay alone explains final benchmark gains |
| System speedup table | Implementation and throughput evidence | Rollout-system engineering materially improves RL training efficiency | Same speedups appear on non-H100 or smaller-model setups |
| In-house e-commerce benchmark | Main domain evidence | CompassMax-V3-Thinking performs strongly on Shopee-relevant workflows | Public generalization beyond Shopee-style e-commerce |
| In-house multilingual benchmark | Deployment-relevance evidence | The model is balanced across seven SEA languages | It dominates all frontier models in every language |
| In-house general benchmark | Auxiliary capability evidence | Thinking training improves general internal capability versus CompassMax-V3 | It is the best general model overall |
| Public ARC benchmark | Comparison with prior work / public sanity check | Large gains over CompassMax-V3 on math, code, and some agent tasks | Uniform dominance over DeepSeek-R1 across all tasks |

This distinction matters because the results are nuanced.

On Shopee’s in-house e-commerce benchmark, CompassMax-V3-Thinking reaches an average score of 85.79, slightly above CompassMax-V3 at 85.14 and clearly above several external models on this internal benchmark. It performs especially strongly on tasks such as shopping guide, after-sales issue handling, product recommendation, and product name rewriting. Product recommendation reaches 94.58. After-sales remains very high at 98.48, though CompassMax-V3 and DeepSeek-R1 are also extremely strong on that row.

The correct interpretation is not “RL transformed every e-commerce task.” The stronger interpretation is narrower and better: CompassMax-V3-Thinking preserves or improves a strong e-commerce base while adding reasoning-oriented capability. That is commercially important because domain adaptation often creates trade-offs. A model can become better at long reasoning and worse at the operational tasks that paid for the project. Shopee’s evaluation suggests the pipeline avoided that failure.

On the multilingual benchmark, CompassMax-V3-Thinking scores a SEA average of 86.41 across English, Traditional Chinese, Indonesian, Malay, Portuguese, Thai, and Vietnamese. This is competitive, but not the highest in the table: GPT-5-Thinking scores 86.64 and Gemini-2.5-Pro scores 87.27. The value is not outright domination. The value is balanced cross-language behavior in markets where e-commerce operations need consistency more than leaderboard theater.

On the internal general benchmark, the improvement over CompassMax-V3 is large: 76.01 average versus 64.49. Gains appear across creative generation, reasoning, code, safety, and knowledge/comprehension. Here the paper’s claim is more convincing: the thinking pipeline substantially improves the prior Shopee model’s broader capabilities.

The public ARC benchmark shows the most interesting mixed profile. CompassMax-V3-Thinking strongly improves over CompassMax-V3 on AIME24, AIME25, HMMT, Zebralogic, HumanEval, and multi-turn BFCL. It is highly competitive with DeepSeek-R1 on GPQA and MMLU-Redux, beats it on AIME24/25 and HumanEval, but trails it on MBPP and some BFCL variants. Again, this is not a universal victory lap. It is a credible public sanity check showing that the internal pipeline produced real reasoning gains beyond Shopee’s private benchmark environment.

The business value is diagnosis before scale

For companies building domain-specific AI systems, the paper’s most useful message is not “train your own hundred-billion-scale MoE.” Please do not put that sentence into a board deck without adult supervision.

The better business takeaway is that RL training should be managed as an efficiency-and-stability pipeline. Shopee’s contribution is a chain of diagnostic repairs:

| Failure mode | Observable symptom | Shopee-style response | Business meaning |
| --- | --- | --- | --- |
| Zero-variance prompts | Rollouts produce identical rewards and zero advantage | Filter easy prompts, expand exploration, reshape rewards and advantages | Do not pay for rollouts that cannot teach |
| Brittle importance sampling | Long-horizon updates fluctuate or over-smooth | Entropy-grouped ESPO | Treat uncertain reasoning regions differently from routine tokens |
| Reward inversion | Near-equal outputs receive misleading preference direction | Ternary GenRM with tie recognition | Avoid optimizing fake distinctions |
| MoE train-inference mismatch | Recomputed log probabilities diverge after routing | Router Replay | Diagnose routing as part of training stability |
| Rollout bottleneck | GPUs wait for long generations or delayed rewards | FP8 rollout, reward overlap, load balancing, detokenization parallelism | Infrastructure determines the practical cost of learning |
| Domain reward sparsity | Business tasks lack clean public benchmarks | Compass-Gym with verifiers and GenRM | Convert workflows into measurable training signals |

This framework travels better than the exact model. A bank, retailer, logistics provider, or SaaS company may not run 256 H100s for Long-CoT MoE RL. But it can still ask: Which prompts provide signal? Which workflow steps are verifiable? Where do judges hallucinate preference differences? Which tool calls can be scored deterministically? Which languages or markets are under-tested? Which latency bottleneck makes experimentation too slow?

That is the real ROI pathway. Not “RL makes models smarter,” but “better measurement makes RL less wasteful.”

Where the paper’s boundary sits

The paper is strong as an industrial systems-and-methods report, but its boundaries are important.

First, much of the most business-relevant evidence is internal. The e-commerce benchmark, multilingual benchmark, and general benchmark are constructed by Shopee. That is not a flaw—public benchmarks often miss real e-commerce operations—but it limits independent comparability. The results are credible as evidence for Shopee’s deployment context; they are weaker as proof of universal superiority.

Second, several methods are evaluated as parts of an integrated pipeline rather than isolated interventions. The paper explains why zero-variance elimination, ESPO, GenRM, Router Replay, and system optimizations matter, and it provides specific diagnostics for some components. But the final benchmark gains should not be mechanically attributed to one technique at a time.

Third, the infrastructure assumptions are high-end. The system profile uses 256 H100 GPUs and long 32k-token rollouts. FP8 rollout and MoE Router Replay are highly relevant to that environment. Smaller dense models, API-only workflows, and low-budget fine-tuning setups will face different bottlenecks.

Fourth, the e-commerce reward system is deeply tied to Shopee’s workflow. Keyword verifiers, JSON schema checks, and GenRM judging are broadly transferable patterns. The exact task mix—after-sales, product recommendation, brand extraction, address parsing, title rewriting, query-item relevance—reflects Shopee’s business environment.

These boundaries do not weaken the paper. They keep it useful. A result that says “this is what worked under these production constraints” is more actionable than a vague claim about general intelligence floating politely above the ground.

A better mental model for RL at scale

CompassMax-V3-Thinking is best understood as a model built by closing leaks. Prompt selection leaks compute. Poor reward design leaks direction. Binary judges leak nuance. MoE routing mismatch leaks consistency. Rollout systems leak time. Multilingual and e-commerce evaluations leak realism if they are not grounded in actual markets.

Shopee’s paper matters because it treats those leaks as one system. That is why a mechanism-first reading works better than a feature list. Multi-stage zero-variance elimination, ESPO, GenRM, Router Replay, Compass-Gym, FP8 rollout, reward overlap, and length-aware scheduling are not isolated tricks. They are repairs to different points in the same learning circuit.

For business leaders, the sober conclusion is this: the next advantage in domain AI may not come from asking for “more reasoning” in the abstract. It may come from building training and evaluation systems where every prompt, every reward, every tool call, and every GPU minute has a job.

No prompt left behind, yes. But also no reward left unexamined, no router left drifting, and no rollout left pretending to be useful just because it was expensive.

Cognaptus: Automate the Present, Incubate the Future.


  1. Shopee LLM Team, “Each Prompt Matters: Scaling Reinforcement Learning Without Wasting Rollouts on Hundred-Billion-Scale MoE,” arXiv:2512.07710, 2025.