When training Large Language Models (LLMs) to reason, reinforcement learning has proven to be a powerful yet blunt instrument. Most methods reduce the entire model output to a single pass/fail reward, applying that verdict to every token—regardless of whether it contributed to success or failure. This makes credit assignment vague, verifiability weak, and learning inefficient. Enter CAPO (Credit Assignment Policy Optimization), a method that shifts the paradigm: it brings verifiable, fine-grained credit assignment to the token level, using LLMs themselves as judgment agents.

The Core Problem: Coarse and Unverifiable Signals

Most current Reinforcement Learning with Verifiable Rewards (RLVR) methods use binary rule-based outcomes—rewarding correct answers, punishing wrong ones. However, they treat entire model outputs as monolithic actions. This not only dilutes learning signals but also makes it impossible to pinpoint why a response failed or succeeded.

A correct answer with flawed logic gets the same reward as one with flawless reasoning. An incorrect answer with one minor misstep is punished just as harshly as one that’s entirely off-track.

These undifferentiated signals prevent LLMs from refining their reasoning processes. Methods like PPO try to recover per-token credit through a learned critic's value estimates, but those estimates are inherently noisy and unverifiable, especially with limited samples.
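
To make the problem concrete, here is a tiny, purely illustrative sketch in plain Python (the step texts and numbers are invented, and this is not any particular training library's API) of how a single pass/fail verdict gets broadcast across a whole response, versus the per-step signal that fine-grained credit assignment is after.

```python
# Illustrative only: a coarse RLVR-style reward broadcasts one scalar verdict
# to every step/token, so a flawed step is reinforced exactly like a sound one.
steps = [
    "Step 1: set up the equation",        # sound
    "Step 2: drop the negative root",     # flawed, but the answer still comes out right
    "Step 3: report the final answer",    # sound
]
outcome_reward = 1.0  # the final answer happened to be correct

coarse_credit = [outcome_reward] * len(steps)   # [1.0, 1.0, 1.0]

# What fine-grained credit assignment aims for instead: a signal that can
# single out the flawed step (these values are hypothetical).
fine_credit = [1.0, -0.5, 1.0]

for step, coarse, fine in zip(steps, coarse_credit, fine_credit):
    print(f"{step:38s}  coarse={coarse:+.1f}  fine={fine:+.1f}")
```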

CAPO’s Innovation: LLM-as-Judge for Step-Wise Feedback

CAPO introduces a structured yet efficient solution:

  1. Use LLM-as-GenPRM: CAPO leverages a general-purpose, frozen LLM (e.g. Llama-3-70B or Qwen-2.5-72B) to generate token-level, step-wise critiques in a single pass. It doesn’t need supervised process labels or multiple rollouts.

  2. Voting for Verifiability: By prompting the GenPRM multiple times per response and aggregating critiques via majority, intersection, or average vote, CAPO ensures that feedback is not only detailed but robust and verifiable (a minimal sketch of this voting step follows the list).

  3. Asymmetric Reward Formulation: CAPO distinguishes between outcome rewards (did the model reach the correct answer?) and process rewards (did it reason correctly?). It penalizes tokens in wrong steps more lightly than it rewards correct outcomes, ensuring outcome alignment remains primary while process feedback helps fine-tune the path.
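
As a rough illustration of points 1 and 2, here is a minimal sketch in plain Python (the helper names, such as `judge_once`, are hypothetical and not the paper's code): a frozen judge model labels every reasoning step in a single pass, the judge is queried a few times, and the critiques are merged by a vote. The average vote mentioned above would be a soft variant of the same aggregation.

```python
from typing import Callable, List

def aggregate_votes(critiques: List[List[bool]], mode: str = "majority") -> List[bool]:
    """Merge K independent step-wise critiques into one verdict per step.

    Each critique is a list of booleans, one per reasoning step
    (True = step judged correct)."""
    n_steps = len(critiques[0])
    verdicts = []
    for i in range(n_steps):
        votes = [c[i] for c in critiques]
        if mode == "majority":        # correct if most critiques agree
            verdicts.append(sum(votes) > len(votes) / 2)
        elif mode == "intersection":  # conservative: every critique must call it correct
            verdicts.append(all(votes))
        elif mode == "union":         # lenient: one approving critique is enough
            verdicts.append(any(votes))
        else:
            raise ValueError(f"unknown voting mode: {mode}")
    return verdicts

def critique_response(judge_once: Callable[[str], List[bool]],
                      response: str, k: int = 4, mode: str = "majority") -> List[bool]:
    """judge_once stands in for one single-pass generation from a frozen
    LLM-as-GenPRM that returns a correct/flawed label for every step."""
    critiques = [judge_once(response) for _ in range(k)]  # k independent critiques
    return aggregate_votes(critiques, mode)
```

The voted verdicts then feed the asymmetric reward of point 3; the table below summarizes how outcome and process signals combine.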

| Scenario | Outcome Reward | Process Penalty | Total Reward |
|---|---|---|---|
| Correct answer, correct process | +2 | 0 | +2 |
| Correct answer, flawed process | +2 | -1 | +1 |
| Wrong answer, good process | 0 | 0 | 0 |
| Wrong answer, flawed process | 0 | -1 | -1 |

This transforms learning into a multi-objective optimization problem—teaching the model to reach the right conclusion and to do so via valid logical steps.
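
One simple way to realize this combination additively, using the table's illustrative magnitudes, is sketched below; the paper's actual reward shaping and exactly where penalties land on tokens may differ.

```python
from typing import Dict, List

# Illustrative magnitudes taken from the table above.
OUTCOME_REWARD = 2.0    # granted only when the final answer is correct
PROCESS_PENALTY = -1.0  # light penalty when the reasoning contains flawed steps

def capo_style_reward(answer_correct: bool, step_verdicts: List[bool]) -> Dict[str, object]:
    """Combine outcome and process signals, matching the table's totals.

    step_verdicts comes from the voted critiques (True = step judged correct)."""
    outcome = OUTCOME_REWARD if answer_correct else 0.0
    flawed = [i for i, ok in enumerate(step_verdicts) if not ok]
    process = PROCESS_PENALTY if flawed else 0.0

    # One possible per-step view: spread the penalty over the flawed steps and
    # credit the outcome reward at the final step.
    per_step = [0.0] * len(step_verdicts)
    for i in flawed:
        per_step[i] = process / len(flawed)
    per_step[-1] += outcome

    return {"total": outcome + process, "per_step": per_step}

# Reproduces the table: correct answer with one flawed step -> total +1.
print(capo_style_reward(True,  [True, False, True])["total"])   # 1.0
print(capo_style_reward(False, [True, True,  True])["total"])   # 0.0
```

Keeping the penalty smaller than the outcome reward preserves the asymmetry described above: a flawed path to a correct answer is still worth more than a wrong answer.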

Why This Matters: Verifiable Precision at Scale

Unlike previous densification methods that rely on fuzzy attention weights or Shapley values, CAPO is:

  • Verifiable — thanks to objective, interpretable critiques from LLM-as-GenPRM
  • Token-level — enables pinpoint corrections without punishing the whole sequence
  • Efficient — requires only a single pass per critique, with 2–8 critiques per response

This balance of granularity, verifiability, and efficiency solves a long-standing problem in LLM alignment: how to ensure reasoning quality, not just answer accuracy.

Experimental Gains: Stronger Reasoning, Better Generalization

Across six math benchmarks and three general reasoning datasets, CAPO consistently outperforms both supervised fine-tuning and prior RL-based methods like GRPO (Group Relative Policy Optimization). For instance:

  • On Qwen2.5-7B, CAPO-Intersection scored +2.9% higher in math and +1.8% higher overall vs GRPO.
  • On Llama3-3B, CAPO-MajorVote outperformed all other variants, striking a balance between strictness and flexibility.

Interestingly, the optimal voting mechanism depends on model scale:

  • 🔍 Smaller models benefit from conservative Intersection voting — high-precision, low-noise signals
  • 🧠 Larger models prefer Majority or Union voting — allowing exploratory steps while still correcting obvious flaws

Case Study: Same Answer, Different Reasoning

Consider two responses that both arrive at the correct answer.

  • One follows a direct and rigorous logical path.
  • The other wanders with mistakes and lucky guesses.

RLVR gives both a thumbs-up. CAPO sees through the illusion — rewarding the former and gently penalizing flawed reasoning in the latter. This kind of differentiated feedback is the backbone of meaningful reasoning improvement.

Beyond CAPO: A Shift in Reward Modeling Philosophy

CAPO doesn’t just offer a better tool — it represents a broader shift toward process-oriented reward modeling. While outcome-based signals are still essential for goal alignment, LLMs now operate in domains (e.g. math, code, law) where how they reach the answer is as important as what they answer.

This philosophy also aligns with emerging directions in LLM alignment:

  • Generative Verifiers as Next-token Predictors
  • Critique-first Reward Models
  • Multi-agent Judging and Voting

The CAPO framework—simple, scalable, and verifiable—offers a blueprint for implementing these ideas in practice.

Final Thoughts: Rewarding Reasoning, Not Just Results

As LLMs move from fluent generators to robust solvers, the blunt force of binary feedback is no longer sufficient. CAPO’s elegance lies in its balanced reward structure, efficient use of judgment LLMs, and scalable design. It gives LLMs the credit they deserve—precisely where it’s due.

Cognaptus: Automate the Present, Incubate the Future.