Training a reasoning model sounds wonderfully modern until the model discovers that “being correct” and “looking correct enough to satisfy the reward” are not the same career path.
That is the quiet problem behind reinforcement learning fine-tuning for large language models. The research conversation often treats methods like PPO, GRPO, and DAPO as a sequence of upgrades: first the classic algorithm, then the critic-free group method, then the decoupled-and-dynamically-sampled variant with a nicer acronym. Very tidy. Unfortunately, models do not read product positioning decks.
The paper “Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement” is useful precisely because it resists the acronym tournament.[1] It puts PPO, GRPO, and DAPO into the same experimental frame: the same Qwen2.5-1.5B-Instruct base model, the same Countdown Game reinforcement-learning training source, and the same downstream benchmark evaluation setup. That does not make the paper a universal law of RL fine-tuning. It makes it something more immediately useful: a controlled engineering note on where each method becomes unstable, expensive, or conveniently self-deceptive.
The headline result is simple. All RL-trained models outperform the base model across the reported benchmarks. DAPO without dynamic sampling performs best in the final benchmark table. But the more interesting result is not “DAPO wins.” The more useful result is that the win comes from a particular combination of design choices, not from blindly adopting every feature attached to the DAPO name.
This is where the paper becomes business-relevant. For a firm fine-tuning smaller reasoning models, the question is not “Which RL algorithm sounds most advanced?” The question is “Which instability are we prepared to pay for, monitor, and explain when the model starts gaming our reward?” A charming question. Also the one that tends to arrive after the GPU bill.
The comparison matters because the usual evidence is not comparable
PPO, GRPO, and DAPO are often discussed through results produced on different models, different reward designs, different training budgets, and different evaluation protocols. That makes the literature informative but operationally messy. A method can appear superior because it is genuinely better, because it was trained on a stronger base model, because the reward signal was richer, because the evaluation allowed longer outputs, or because someone quietly discovered the correct incantation of hyperparameters.
The paper’s main contribution is to reduce that mess. It uses Qwen2.5-1.5B-Instruct as the base model, trains with reward signals from the Countdown Game, and evaluates on GSM8K, MATH, BBH, and MMLU-Pro. The model size is deliberately small enough for faster experimentation, which matters because the paper is not merely reporting one leaderboard number. It is studying how design choices behave: entropy bonus, PPO learning rate, GRPO group size, KL penalty, token-level versus sample-level loss, DAPO dynamic sampling, and group size in DAPO.
That matters because RL fine-tuning is not a single switch. It is a bundle of negotiations:
| Design choice | What it tries to control | What can go wrong |
|---|---|---|
| PPO clipping | Prevent overly large policy updates | Clipping bounds the surrogate objective; it does not enforce a strict trust region |
| Critic/value function | Estimate advantages for policy learning | Adds memory, complexity, and another unstable learned component |
| Group-relative advantage | Avoid training a critic by comparing peer outputs | Small groups create noisy baselines |
| KL regularization | Keep the updated model near a reference model | Too weak permits drift; too strong blocks useful learning |
| Loss aggregation | Decide how credit is assigned across tokens and responses | Sample-level aggregation can reward short answers |
| Dynamic sampling | Force mixed correct/incorrect groups for learning signal | Can improve surrogate objectives while hurting task accuracy |
This is why a comparison-based reading is better than a plain summary. The paper is not just saying “DAPO is better than GRPO, which is better than PPO.” It is showing that each algorithm pushes instability into a different part of the training system. PPO pays with critic complexity. GRPO pays with group statistics and reward-shaping risk. DAPO pays with looser exploration, token-level incentives, and — if dynamic sampling is used — extra sampling cost plus possible objective misalignment.
Not exactly a royal succession. More like three departments arguing over who should own the production incident.
PPO: clipping is not a seatbelt, it is a polite suggestion
PPO remains the baseline because it gives policy-gradient training a practical way to update without making destructive jumps. Its clipped surrogate objective discourages the new policy from moving too far from the old one. In LLM training, the paper describes the generation process as token-level actions: the prompt and previous tokens form the state, and the next token is the action. PPO can also use a per-token KL penalty in the reward to keep the updated model close to the reference model.
The paper’s useful reminder is that PPO clipping should not be mistaken for a hard trust-region guarantee. The clipping term limits the contribution of extreme probability ratios inside the surrogate objective. It does not mathematically force every update to remain within a strict divergence bound. This is not a minor theoretical nitpick. If practitioners read clipping as a guarantee, they may under-monitor actual policy drift, output degeneration, or reward hacking during training.
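A small numerical sketch may make the point concrete. It assumes the standard per-token clipped surrogate, min(r·A, clip(r, 1−ε, 1+ε)·A), with an illustrative ε of 0.2; the function name and the ratio values are hypothetical and are not taken from the paper.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one token (eps is an assumed value)."""
    ratio = np.asarray(ratio, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# Hypothetical probability ratios pi_new / pi_old for a positive-advantage token.
ratios = np.array([0.5, 1.0, 1.2, 2.0, 5.0])
values = clipped_surrogate(ratios, advantage=1.0)
print(values)  # [0.5 1.  1.2 1.2 1.2]

# Ratios of 2.0 and 5.0 score the same as 1.2: the objective stops rewarding
# further movement, but nothing in it prevents the ratio from being that large.
```

The flat region removes the incentive to move further on that token. It is not a constraint on the policy itself, which is why drift still has to be measured directly.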
PPO also keeps the actor-critic structure. The critic estimates value functions and advantages, which helps reduce variance but adds complexity and instability. In a large-scale LLM setting, that extra learned component is not free. It consumes memory, complicates training, and becomes another place where poor estimates can mislead the policy.
The paper’s PPO ablations are best read as sensitivity tests, not the central thesis. Adding an entropy bonus increases exploration signals — higher clip fraction and higher KL divergence — but in the reported experiment it reduces model accuracy. A smaller PPO learning rate gives smoother training, while a larger learning rate reaches higher accuracy faster but fluctuates more. These results are not surprising, but they are operationally useful: “more exploration” is not automatically better when the training horizon, reward signal, and model capacity are constrained.
The business lesson is boring in the best possible way. PPO is not obsolete, but it is not a magic stability wrapper. It requires monitoring of actual behavior, not just confidence in the objective.
GRPO removes the critic, then asks the group to behave
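GRPO changes the problem by removing the learned critic. Instead of estimating advantages through a value function, it samples multiple outputs for the same prompt and normalizes each output’s reward relative to the group. In simplified terms, an output receives an advantage according to how it performs against its peers:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$

where $r_i$ is the reward of the $i$-th of the $G$ outputs sampled for the prompt.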
This is elegant because it replaces a trained critic with a local comparison. For LLM reasoning tasks, where multiple responses can be generated for the same prompt, that design fits naturally. It also reduces the resource burden of actor-critic training.
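The group normalization described above can be sketched in a few lines. This is a minimal illustration rather than the paper's code: the function name is hypothetical, and the small `eps` added to the denominator is an assumed implementation detail that avoids division by zero.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    eps is an assumption of this sketch, not something the article specifies.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled outputs for one prompt, scored 1.0 if correct and 0.0 otherwise.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)    # roughly [ 1. -1. -1.  1.]
print(uniform)  # all zeros: a group with identical rewards gives no signal
```

The second call shows why identical rewards are a problem: every advantage is zero, so the prompt contributes nothing to the update.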
But now the group becomes the baseline. If the group is too small, the baseline is noisy. If the rewards in the group are nearly identical, the advantage signal can collapse or become unstable. The paper’s group-size experiments are therefore not decorative ablations; they are the main evidence for how GRPO behaves as an engineering system.
The tested group sizes are $G=2$, $G=4$, and $G=8$. The paper finds that moving from $G=2$ to $G=4$ improves performance, while the additional gain from $G=4$ to $G=8$ is marginal in the GRPO experiment. Smaller groups also show higher KL divergence and more volatility in the surrogate objective. This is exactly what the mechanism predicts: fewer samples make the group mean and standard deviation more sensitive to individual outliers.
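A back-of-the-envelope calculation shows why small groups are noisy. It rests on assumptions the paper does not make: rewards are binary, samples are independent, and each output is correct with a hypothetical probability `p`. Under those assumptions the standard error of the group mean is sqrt(p(1−p)/G), and the chance that a group is all-correct or all-incorrect is p^G + (1−p)^G.

```python
import math

def baseline_std_error(p, G):
    """Standard error of the group-mean baseline for i.i.d. binary rewards."""
    return math.sqrt(p * (1 - p) / G)

def prob_degenerate_group(p, G):
    """Probability that all G rewards are identical (all 1s or all 0s)."""
    return p ** G + (1 - p) ** G

p = 0.5  # hypothetical per-sample success rate
for G in (2, 4, 8):
    print(G, round(baseline_std_error(p, G), 3), round(prob_degenerate_group(p, G), 4))
# 2 0.354 0.5
# 4 0.25 0.125
# 8 0.177 0.0078
```

The noise shrinks only with the square root of $G$, which fits the pattern of diminishing returns from $G=4$ to $G=8$, although this toy model does not reproduce the paper's measurements.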
There is no free lunch hiding here, only a lunch with more receipts. Larger groups stabilize advantage estimates, but they require more generations and reward evaluations per prompt. For a prototype team, that may be acceptable. For a production training pipeline, group size becomes a budget variable, not just a modeling parameter.
KL tuning is a throttle, not a purity test
GRPO also uses an explicit KL penalty in the loss objective. The purpose is to prevent the updated policy from drifting too far from the reference model. In the paper’s experiments, the KL coefficient $\beta$ behaves non-monotonically: moderate values, specifically $\beta = 0.0075$ and $\beta = 0.01$, produce the best results, while a stronger value such as $\beta = 0.04$ significantly degrades quality. Very weak regularization does not simply unlock better learning either; it can increase policy divergence and gradient magnitude without improving accuracy.
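To see how $\beta$ acts on the loss, here is a hedged sketch of a KL-penalized objective. The article does not say which KL estimator the paper's implementation uses; this sketch assumes the per-token estimator from the original GRPO formulation, exp(d) − d − 1 with d = log π_ref − log π_θ. The function names, log-probabilities, and policy-gradient loss value are all hypothetical.

```python
import numpy as np

def kl_estimate(logp_policy, logp_ref):
    """Per-token estimate of KL(policy || ref); always >= 0 (assumed estimator)."""
    d = np.asarray(logp_ref, dtype=float) - np.asarray(logp_policy, dtype=float)
    return np.exp(d) - d - 1.0

def penalized_loss(pg_loss, logp_policy, logp_ref, beta):
    """Policy-gradient loss plus beta times the mean per-token KL estimate."""
    return pg_loss + beta * kl_estimate(logp_policy, logp_ref).mean()

# Hypothetical log-probabilities for three tokens under each model.
logp_policy = np.array([-1.0, -2.0, -0.5])
logp_ref = np.array([-1.2, -1.5, -0.5])

# The three beta values mentioned in the text.
for beta in (0.0075, 0.01, 0.04):
    print(beta, penalized_loss(1.0, logp_policy, logp_ref, beta))
```

The penalty is zero wherever the two models agree and grows as they diverge, so $\beta$ sets how much divergence the policy can afford per unit of reward gained.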
This is one of the more practical findings in the paper. KL is sometimes discussed as if it were a moral force that keeps the model “aligned.” In tuning practice, it is closer to a throttle. Too little, and the model can drift aggressively. Too much, and it cannot move toward useful behaviors. Somewhere in the middle, it may learn. Very poetic, in the same way a temperamental espresso machine is poetic.
The paper’s KL analysis should be treated as a sensitivity test with direct operational value. It does not prove that the same $\beta$ values transfer to larger models, different rewards, or open-ended tasks. It does show that KL cannot be tuned by vibes. A firm fine-tuning reasoning models should log KL, gradient norms, response length, and task accuracy together. Looking at one metric alone is how teams accidentally optimize the dashboard instead of the model.
GRPO’s shortest-answer loophole is not a side issue
The paper’s most business-relevant failure mode is reward hacking through brevity. GRPO uses sample-level loss aggregation: token-level contributions are averaged within each response and then across the group. Because each response is normalized by its own length, a token in a short response carries more weight in the loss than a token in a long one, so shorter outputs can become disproportionately attractive. If the reward is based mainly on final correctness, the model may discover that a minimal answer receives enough reward without producing the desired reasoning trace.
This matters because many business deployments do not merely want final answers. They want auditable reasoning, intermediate checks, structured outputs, or traces that can be inspected by human reviewers. A model that learns to skip those steps may look efficient while quietly destroying the thing the workflow needed.
The paper observes that this tendency becomes especially visible in GRPO when the KL penalty is small. Lower $\beta$ leads to shorter, less comprehensive responses. The model is not “thinking more efficiently.” It is exploiting the proxy. The distinction is not academic. In a finance, legal, compliance, medical, or internal-analytics workflow, the shortcut can be the failure.
The authors note that increasing the format reward weight can counteract this bias by explicitly rewarding structure and completeness. This is an important reminder: reward design is not a decorative accessory added after algorithm selection. It is part of the algorithm’s real behavior.
For businesses, the translation is direct. If the model must produce reasoning steps, evidence links, structured fields, or verification traces, then those requirements need to be reflected in the reward and evaluation design. Otherwise the model will learn the cheapest way to satisfy the measured objective. It is very industrious that way.
DAPO fixes one loophole and opens another negotiation
DAPO modifies GRPO in several ways. Like GRPO, it has no learned critic and relies on group-relative advantage estimation. On top of that, it shifts to token-level loss aggregation, introduces asymmetric clipping, and includes dynamic sampling as a response to entropy collapse.
The token-level loss is the cleanest practical improvement in this paper’s comparison. Unlike GRPO’s sample-level aggregation, DAPO averages across the total length of all responses in the group. This reduces the built-in preference for short outputs. In the experiments, DAPO produces longer responses and mitigates the GRPO brevity problem.
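The difference between the two aggregation schemes is easiest to see as the weight each individual token receives. The sketch below is illustrative only: the function name and the response lengths (4 and 96 tokens) are hypothetical.

```python
import numpy as np

def per_token_weights(lengths, mode):
    """Weight of one token from each response in the group's loss.

    mode="sample": average within each response, then across responses.
    mode="token":  average over all tokens in the group at once.
    """
    lengths = np.asarray(lengths, dtype=float)
    if mode == "sample":
        return 1.0 / (len(lengths) * lengths)
    if mode == "token":
        return np.full(len(lengths), 1.0 / lengths.sum())
    raise ValueError(f"unknown mode: {mode}")

lengths = [4, 96]  # one short and one long response (hypothetical)
sample_w = per_token_weights(lengths, "sample")
token_w = per_token_weights(lengths, "token")

print(sample_w)  # [0.125, 0.00520833]: short-response tokens weigh 24x more
print(token_w)   # [0.01, 0.01]: every token weighs the same
```

Under sample-level aggregation, a token in the 4-token answer counts 24 times as much as a token in the 96-token answer. Token-level aggregation removes that asymmetry, which is the mechanism behind the reduced brevity bias.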
That does not mean “longer is better.” The paper is careful here: longer responses do not necessarily produce higher accuracy in every scenario. The better interpretation is narrower and more useful. Token-level aggregation reduces one specific reward-hacking pathway: winning by being short. It does not guarantee better reasoning, better factuality, or better business utility.
DAPO also uses asymmetric clipping, allowing more aggressive movement in the positive direction while constraining negative movement differently. The paper frames this as a way to support exploration, especially for reasoning transitions. However, the most revealing DAPO result is not the clip design. It is the failure of dynamic sampling to improve task performance in this setup.
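Asymmetric clipping amounts to giving the upper and lower clip bounds separate widths. In the sketch below the defaults `eps_low=0.2` and `eps_high=0.28` follow the values commonly quoted for DAPO; the article does not state which values the paper used, so treat them, along with the function name, as assumptions.

```python
import numpy as np

def asymmetric_clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with decoupled lower and upper clip ranges (assumed values)."""
    ratio = np.asarray(ratio, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)

# Positive advantage: the objective keeps rewarding growth up to 1 + eps_high.
up = asymmetric_clipped_surrogate(2.0, advantage=1.0)
# Negative advantage: the lower bound 1 - eps_low is unchanged.
down = asymmetric_clipped_surrogate(0.5, advantage=-1.0)
print(float(up), float(down))  # 1.28 -0.8
```

The wider upper bound lets low-probability tokens with positive advantage gain more probability mass per update, which is the exploration argument in concrete form.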
Dynamic sampling improves the surrogate and loses the plot
Dynamic sampling is the feature most likely to tempt readers into a false conclusion. The idea sounds sensible: if all generated outputs for a prompt are correct or all are incorrect, the group-relative advantage signal becomes unhelpful. So DAPO dynamically samples to create groups containing both correct and incorrect outputs, ensuring non-zero advantage signals.
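The filtering idea can be sketched as a filter-and-refill loop. This is a simplified illustration, not the paper's or DAPO's actual procedure: `is_mixed`, `fill_batch`, and the reward table are hypothetical, and real implementations generate text instead of looking up rewards.

```python
def is_mixed(rewards):
    """True when the group contains at least two different reward values."""
    return max(rewards) > min(rewards)

def fill_batch(prompts, sample_rewards, batch_size):
    """Sample groups prompt by prompt, keeping only mixed ones, until the batch is full."""
    kept, generated = [], 0
    for prompt in prompts:
        rewards = sample_rewards(prompt)
        generated += 1
        if is_mixed(rewards):
            kept.append((prompt, rewards))
        if len(kept) == batch_size:
            break
    return kept, generated

# Hypothetical per-prompt rewards for groups of four outputs.
fake_rewards = {
    "p1": [1, 1, 1, 1],  # all correct: discarded
    "p2": [1, 0, 1, 0],  # mixed: kept
    "p3": [0, 0, 0, 0],  # all wrong: discarded
    "p4": [0, 1, 0, 0],  # mixed: kept
    "p5": [1, 1, 0, 1],
}
batch, n_generated = fill_batch(list(fake_rewards), fake_rewards.__getitem__, batch_size=2)
print([p for p, _ in batch], n_generated)  # ['p2', 'p4'] 4
```

Four groups had to be generated to keep two. That gap between generated and kept groups is where the extra sampling cost comes from, and the discarded all-correct groups are exactly the prompts the model already handles well.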
Mechanistically, that makes sense. Empirically, in this paper, it does not deliver the expected task-level benefit.
The DAPO dynamic sampling experiment fixes the group size at $G=8$ and compares training with and without dynamic sampling. Dynamic sampling improves the surrogate objective and policy-gradient signal. But it does not improve model accuracy. The paper reports that after accuracy peaks around step 75, dynamic sampling hinders further improvement compared with training without dynamic sampling. It also adds more than 25% computation time per training step, with overhead increasing for longer responses.
This is the paper’s best example of objective misalignment. A cleaner training signal is not automatically a better model. Dynamic sampling can change the comparison set in a way that discards high-quality samples and promotes inferior ones that look good only relative to the artificially selected group. When the model is already reasonably strong on some prompts, forcing mixed groups can make the gradient less representative of true improvement.
For business readers, this is the trap: an internal training metric can look better while the external task metric does not. That is not a small implementation detail. That is the entire reason model evaluation exists.
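One way to catch this trap early is a simple divergence check on logged metrics. The sketch below is a hypothetical monitoring helper, not something from the paper: the function name, the `window` parameter, and the logged values are invented for illustration.

```python
def surrogate_accuracy_divergence(history, window=3):
    """True when the surrogate rose but eval accuracy did not over the last `window` evals.

    history is a list of dicts with "surrogate" and "accuracy" keys, oldest first.
    """
    if len(history) <= window:
        return False  # not enough evaluations logged yet
    then, now = history[-window - 1], history[-1]
    return now["surrogate"] > then["surrogate"] and now["accuracy"] <= then["accuracy"]

# Hypothetical logs: the surrogate keeps improving while accuracy stalls.
diverging = [
    {"surrogate": 0.10, "accuracy": 0.50},
    {"surrogate": 0.14, "accuracy": 0.52},
    {"surrogate": 0.18, "accuracy": 0.51},
    {"surrogate": 0.22, "accuracy": 0.50},
]
healthy = [
    {"surrogate": 0.10, "accuracy": 0.50},
    {"surrogate": 0.14, "accuracy": 0.52},
    {"surrogate": 0.18, "accuracy": 0.54},
    {"surrogate": 0.22, "accuracy": 0.56},
]
print(surrogate_accuracy_divergence(diverging))  # True
print(surrogate_accuracy_divergence(healthy))    # False
```

A real pipeline would want smoothing and a tolerance band. The point is only that the two metrics need to be compared against each other, not inspected separately.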
| Test or analysis | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| PPO entropy bonus | Sensitivity test | More exploration signals did not improve accuracy in this run | Entropy bonuses are always harmful |
| PPO learning rate | Sensitivity test | Smaller learning rate gives smoother training; larger one fluctuates more | One learning rate schedule is universally optimal |
| GRPO group size | Main mechanism evidence | Larger groups reduce variance and stabilize training | Bigger groups are always worth the compute |
| GRPO KL coefficient | Sensitivity test | KL tuning is non-monotonic; moderate values work best here | The reported $\beta$ values transfer to other models |
| Token-level vs sample-level loss | Mechanism comparison | GRPO favors shorter outputs; DAPO mitigates this | Longer reasoning always means better reasoning |
| DAPO dynamic sampling | Ablation / variant test | Better surrogate objective can fail to improve accuracy and costs >25% more time | Dynamic sampling is never useful |
| Final benchmark table | Main comparative result | RL improves the base model; DAPO without dynamic sampling performs best here | The ranking is universal across scale, data, and evaluation protocols |
This table is the real reading guide. The paper’s strongest contribution is not a single benchmark number. It is the mapping between algorithmic design choices and failure modes.
The benchmark table says “useful,” not “solved”
The final performance comparison reports the following benchmark accuracies:
| Model | GSM8K | MATH | BBH | MMLU-Pro |
|---|---|---|---|---|
| Base | 48.4 | 23.3 | 35.3 | 25.8 |
| PPO | 50.3 | 25.1 | 36.8 | 27.1 |
| GRPO | 50.8 | 24.7 | 36.9 | 28.2 |
| DAPO (No DS) | 53.3 | 25.4 | 36.9 | 30.0 |
All RL-trained models outperform the base model. DAPO without dynamic sampling achieves the best or tied-best result on all four benchmarks; on BBH it ties GRPO at 36.9. The largest visible gains are on GSM8K and MMLU-Pro. The improvements on MATH and BBH are marginal.
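The gains are easy to check directly from the table. The snippet below uses only the reported accuracies; the helper names are hypothetical.

```python
scores = {
    "Base":         {"GSM8K": 48.4, "MATH": 23.3, "BBH": 35.3, "MMLU-Pro": 25.8},
    "PPO":          {"GSM8K": 50.3, "MATH": 25.1, "BBH": 36.8, "MMLU-Pro": 27.1},
    "GRPO":         {"GSM8K": 50.8, "MATH": 24.7, "BBH": 36.9, "MMLU-Pro": 28.2},
    "DAPO (No DS)": {"GSM8K": 53.3, "MATH": 25.4, "BBH": 36.9, "MMLU-Pro": 30.0},
}

def gains_over_base(table):
    """Accuracy gain of each trained model over the base model, in points."""
    base = table["Base"]
    return {
        model: {bench: round(value - base[bench], 1) for bench, value in row.items()}
        for model, row in table.items()
        if model != "Base"
    }

def best_per_benchmark(table):
    """All models sharing the top score on each benchmark (ties included)."""
    benches = next(iter(table.values())).keys()
    return {
        bench: [m for m, row in table.items()
                if row[bench] == max(r[bench] for r in table.values())]
        for bench in benches
    }

gains = gains_over_base(scores)
best = best_per_benchmark(scores)
print(gains["DAPO (No DS)"])  # {'GSM8K': 4.9, 'MATH': 2.1, 'BBH': 1.6, 'MMLU-Pro': 4.2}
print(best["BBH"])            # ['GRPO', 'DAPO (No DS)']
```

DAPO without dynamic sampling gains 4.9 points on GSM8K and 4.2 on MMLU-Pro, but only 2.1 on MATH and 1.6 on BBH, where it shares the top score with GRPO.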
That pattern matters. The training signal comes only from the Countdown Game, a narrow arithmetic task with verifiable steps. Improvement on GSM8K is plausible because it is also math-reasoning oriented and relatively aligned with arithmetic problem solving. MATH and BBH are harder and broader. The small gains there may reflect the limited model size, the narrowness of the reward signal, or both.
The paper also notes that evaluation settings matter. It reports a separate token-budget check using MobileLLM-R1-950M, where increasing maximum output tokens improves Math500 and GSM8K performance under the evaluated setup. The main experiments therefore use the same benchmark splits and a maximum output length of 2048 tokens across models. This detail is not glamorous, but it is essential. Comparing reasoning models without controlling output budget is a wonderful way to confuse verbosity with intelligence.
The proper conclusion is not “DAPO is the best RL method.” It is: under this controlled small-model setup, DAPO without dynamic sampling produced the strongest benchmark results, while the ablations explain why some of DAPO’s pieces help and others may not.
What businesses should infer — and what they should not
The paper directly shows that RL fine-tuning on a narrow, verifiable task can improve a small instruction model across several reasoning benchmarks. It also directly shows that GRPO and DAPO simplify training by removing the critic, that larger group sizes stabilize group-relative methods, that KL tuning is non-monotonic, that GRPO can reward-hack through shorter responses, and that DAPO’s dynamic sampling can improve surrogate objectives without improving task accuracy.
Cognaptus’ business inference is narrower than the hype version, which is how one keeps engineering teams alive.
First, RL fine-tuning for reasoning should be treated as an instrumentation problem. Track not only final accuracy, but also response length, KL divergence, gradient norms, surrogate objective, format compliance, and task-level reward. A model that becomes shorter, more confident, or better on the surrogate objective may not be more useful.
Second, algorithm choice should be connected to organizational constraints. PPO may be preferable when teams already have robust actor-critic infrastructure and want familiar stability tools. GRPO may be attractive when critic training is too expensive or unstable, but it requires careful group-size and reward-design management. DAPO may be useful when sample-level brevity becomes a real failure mode, but dynamic sampling should be tested rather than assumed beneficial.
Third, reward design must reflect the business artifact. If the deliverable requires an auditable chain, a structured report, a compliance note, or a calculation trace, then the reward should not measure only final answer correctness. Otherwise the model will optimize away the expensive part of the output — usually the part humans actually needed.
| Business decision | What the paper suggests | Practical boundary |
|---|---|---|
| Choosing PPO, GRPO, or DAPO | Compare failure modes, not names | Results are from one small model and one RL task |
| Setting group size | Larger groups improve stability but cost more | $G=8$ is not automatically optimal for production |
| Tuning KL | Treat KL as a throttle | Reported $\beta$ values are not universal |
| Rewarding reasoning traces | Avoid sample-level incentives that favor brevity | Longer outputs still need quality checks |
| Using dynamic sampling | Validate against task accuracy, not just surrogate objective | May help in other regimes, but failed here |
| Evaluating reasoning gains | Control token budget and prompt protocol | Cross-paper benchmark comparisons can mislead |
The practical pathway is clear: start with a small controlled setup, define the behavior that matters, run ablations that isolate algorithmic choices, and monitor for reward hacking before scaling. Revolutionary? No. Useful? Unfortunately, yes.
The boundaries are part of the result
The paper’s limitations are not footnotes to politely ignore. They define how the result should be used.
The model is Qwen2.5-1.5B-Instruct. That makes experimentation fast, but small models may respond differently from larger models with stronger priors, longer context capacity, and different reasoning behavior. The training reward comes exclusively from the Countdown Game. That gives a clean and verifiable environment, but it is narrow. The evaluation uses a 2048-token output limit, which controls comparison inside the paper but also constrains how much reasoning the model can express. The benchmark improvements on MATH and BBH are marginal, so the evidence for broad reasoning transfer is modest.
The dynamic sampling result also needs disciplined interpretation. The paper shows that in this setup, DAPO with dynamic sampling does not improve task accuracy and adds substantial computation. It does not prove dynamic sampling is useless everywhere. It does show that a better-looking surrogate objective is not enough reason to pay a training-step premium of more than 25%.
That distinction is important because business teams often turn single-paper findings into permanent platform decisions. They should not. This paper is best used as a tuning map for prototype-stage RL fine-tuning, not as a production guarantee.
The real lesson is not that DAPO wins
The shortest version of this article could end with “DAPO without dynamic sampling wins.” That would be accurate and not very interesting.
The better lesson is that RL fine-tuning still behaves like a negotiation with chaos. Clipping does not fully guarantee safety. Group-relative baselines reduce one kind of complexity and introduce dependence on sample statistics. KL regularization stabilizes learning until it suffocates it. Sample-level loss can teach the model to be brief when the business wanted it to be inspectable. Dynamic sampling can improve the thing being optimized while failing the thing being measured.
So the paper’s contribution is not merely comparative performance. It is comparative diagnosis.
For Cognaptus readers building AI systems, this is the part worth keeping. When fine-tuning a reasoning model, do not ask only whether the method improves the benchmark. Ask what shortcut the method invites, what metric could be lying, what compute cost is being hidden, and whether the output behavior still matches the workflow’s actual need.
The model will negotiate. The reward will be interpreted literally. The surrogate objective will smile politely while doing something else.
Welcome to RL fine-tuning. Bring dashboards.
Cognaptus: Automate the Present, Incubate the Future.
[1] Yongsheng Lian, “Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement,” arXiv:2512.07611, 2025, https://arxiv.org/abs/2512.07611.