Thinking About Thinking: When LLMs Start Writing Their Own Report Cards

Report cards are usually written by teachers, managers, examiners, auditors, or other people with the institutional privilege of saying, “Nice effort, but no.”

The paper Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics asks a stranger question: what if the model helps write the report card for its own reasoning process?¹

That sounds like the kind of governance idea that would make a compliance officer reach for coffee. A model evaluating itself is not automatically trustworthy. Sometimes it is self-reflection. Sometimes it is theatre with JSON brackets.

But the paper is more specific than that. It is not asking a model to vaguely “judge whether it reasoned well.” It proposes RLCER: Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics. The method trains one policy model under two roles. One role solves the problem. The other role proposes rubrics for evaluating the reasoning. Those rubrics survive only if satisfying them correlates with final-answer correctness across multiple sampled solutions.

That last clause is the useful part. The rubric is not accepted because it sounds wise. It is accepted because, across rollouts, it appears to separate reasoning paths that lead to correct answers from those that do not. A tiny outbreak of empiricism. We should encourage it.

The business implication is not that models can now supervise themselves in every domain. They cannot. The paper still depends on verifiable-answer tasks, a frozen verifier model, cold-start training data, and extra rollout computation. The more careful interpretation is narrower and more valuable: for domains where correctness can be checked, self-generated rubrics may reduce the cost of process supervision.

In plain language, RLCER tries to move reinforcement learning from:

“Did the model get the answer right?”

toward:

“Which reasoning habits made correct answers more likely?”

That is not just a better reward function. It is a different operating model for training reasoning systems.

The usual RLVR bargain: cheap correctness, expensive reasoning

Reinforcement Learning with Verifiable Rewards, or RLVR, has become a standard recipe for improving reasoning models. The model generates a solution, produces a final answer, and receives a reward based on whether that answer matches the ground truth. Math, coding, puzzle solving, and some scientific QA tasks fit this pattern because the final answer is checkable.

This is attractive because final-answer verification is cheap compared with human annotation of every intermediate reasoning step. You do not need an expert to grade the entire solution path. You only need to know whether the answer is right.

That bargain has a cost.

If two reasoning traces end with the same correct answer, RLVR treats them similarly even if one is systematic and the other is a lucky mess. If a model learns a brittle shortcut that works on the training distribution, the reward may still approve. If the model’s reasoning style shifts during training, a static process reward model may become less useful. This is the familiar problem of rewarding outcomes while pretending the process is somebody else’s department.
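A minimal sketch of that blind spot, in Python. The function name `outcome_reward`, the exact-match comparison, and the three toy traces are illustrative assumptions, not the paper's implementation; real RLVR pipelines usually normalize answers more carefully before comparing.

```python
def outcome_reward(final_answer: str, ground_truth: str) -> float:
    """Return 1.0 when the stripped final answer equals the ground truth, else 0.0."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0


# Hypothetical traces: one systematic, one lucky, one wrong.
traces = [
    {"reasoning": "Group years by digit set, then count permutations.", "answer": "149"},
    {"reasoning": "Guess a number that feels about right.", "answer": "149"},
    {"reasoning": "List years by hand and miss a few.", "answer": "131"},
]

rewards = [outcome_reward(t["answer"], "149") for t in traces]
print(rewards)  # [1.0, 1.0, 0.0]
```

The first two traces receive identical rewards even though only one of them describes a method. The reward never reads the `reasoning` field at all.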

For businesses, the weakness is obvious. In consumer chat, final answer quality may be enough most of the time. In finance, compliance, legal review, medical triage, engineering analysis, or operational planning, the reasoning path matters because it affects auditability, robustness, and failure diagnosis. A model that occasionally gets the right answer for the wrong reason is not a reasoning system. It is a spreadsheet with confidence issues.

The hard part is process supervision. Human-written process labels are expensive. Reward models trained on those labels can become stale. Static rubrics may stop matching model behavior as the model improves or shifts.

RLCER enters exactly here: it keeps the verifiable-answer backbone of RLVR, but adds an evolving layer of process-level supervision.

The mechanism: one model, two roles, one feedback loop

RLCER assigns a single policy model two prompted roles, the reasoner and the rubricator, and pairs them with a separate frozen verifier:

Role	What it does	What it is rewarded for
Reasoner	Generates the reasoning trace and final answer	Final-answer correctness plus satisfaction of valid CoT rubrics
Rubricator	Generates candidate rubrics for evaluating the reasoning trace	Producing rubrics that are valid, discriminative, and correctly formatted
Frozen verifier	Checks whether a reasoning trace satisfies each rubric	Not trained in the RLCER loop

The reasoner and rubricator share the same underlying policy parameters, but they operate under different prompts. This matters because the method is not simply bolting a separate judge model onto a solver model. It is training the same model to play two complementary roles: solve problems and generate useful criteria for judging reasoning.

The loop works roughly like this:

1. The reasoner samples multiple reasoning traces and final answers for a question.
2. The rubricator proposes natural-language rubrics for evaluating the reasoning.
3. A frozen verifier checks whether each reasoning trace satisfies each rubric.
4. The system checks whether satisfying a rubric correlates with final-answer correctness across rollouts.
5. Valid rubrics contribute to the reasoner’s CoT reward.
6. The rubricator is rewarded for producing a higher fraction of valid rubrics.
7. The shared policy is updated under both roles.

The paper’s key idea is therefore not “the model writes rubrics.” That alone would be mildly interesting and operationally dangerous. The key idea is:

A rubric becomes useful only when its satisfaction is empirically aligned with correctness and remains discriminative across different reasoning traces.

That converts rubrics from decorative assessment language into candidate reward features.

What makes a rubric valid is correlation, not eloquence

A natural-language rubric can sound impressive while being useless. “Demonstrates rigorous analytical clarity” is the kind of phrase that can survive in a corporate strategy deck for years without doing measurable work.

RLCER uses a more concrete validity rule. A rubric is considered valid when its satisfaction indicator is positively correlated with answer correctness across sampled rollouts for the same question, and when it is discriminative rather than satisfied by everything. The paper sets the default correlation threshold at 0.2.

The reason is straightforward. Suppose the model samples several reasoning traces for a math problem. Some reach the correct answer; others do not. A candidate rubric might say the solution should “implement edge-case validation” or “avoid ad-hoc manual listing when a systematic counting method is needed.” If traces satisfying that rubric are more likely to reach the correct answer, the rubric contains useful process signal. If every trace satisfies it, or if satisfaction has no relationship with correctness, the rubric is weak supervision wearing a blazer.
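A small Python sketch of that filter, assuming satisfaction and correctness are both recorded as 0/1 per rollout and that "correlation" means the ordinary Pearson coefficient. The function name `rubric_is_valid`, the strict `>` comparison, and the sample data are assumptions for illustration; only the 0.2 default comes from the text above.

```python
from math import sqrt


def rubric_is_valid(satisfied, correct, threshold=0.2):
    """True when rubric satisfaction is positively correlated with correctness."""
    n = len(satisfied)
    if n < 2 or n != len(correct):
        return False
    mean_s = sum(satisfied) / n
    mean_c = sum(correct) / n
    var_s = sum((s - mean_s) ** 2 for s in satisfied)
    var_c = sum((c - mean_c) ** 2 for c in correct)
    # A rubric that every trace satisfies (or none does) has zero variance,
    # so it cannot discriminate; the same holds if all answers agree.
    if var_s == 0 or var_c == 0:
        return False
    cov = sum((s - mean_s) * (c - mean_c) for s, c in zip(satisfied, correct))
    return cov / sqrt(var_s * var_c) > threshold


# Six hypothetical rollouts for one question: 1 = correct final answer.
correct = [1, 1, 0, 0, 1, 0]

edge_case_rubric = [1, 1, 0, 0, 1, 1]   # mostly tracks correctness
always_satisfied = [1, 1, 1, 1, 1, 1]   # the blazer
anti_correlated = [1, 0, 1, 0, 0, 1]    # satisfied mostly by wrong traces

print(rubric_is_valid(edge_case_rubric, correct))  # True
print(rubric_is_valid(always_satisfied, correct))  # False
print(rubric_is_valid(anti_correlated, correct))   # False
```

The zero-variance guard is what makes "always satisfied" fail: without variation in satisfaction there is no correlation to compute, which is the numerical version of "provides no discrimination."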

This gives RLCER a useful internal discipline:

Rubric behavior	Training meaning
Correlates with correctness	Can reward reasoning habits associated with success
Does not correlate with correctness	Should not guide the reasoner
Always satisfied	Provides no discrimination
Never satisfied	Provides little practical reward signal
Becomes easier over time	Risks reward saturation
Evolves toward harder valid criteria	Can continue shaping reasoning

That last row is where self-evolution enters.

Self-evolving rubrics prevent the report card from becoming too easy

A static rubric can become stale. At the beginning of training, “checks boundary cases” may separate good reasoning from poor reasoning. Later, if almost every rollout checks boundary cases, the rubric stops adding signal. The model has learned to pass that part of the exam. Congratulations. Now the exam is too easy.

RLCER addresses this by rewarding the rubricator for producing valid rubrics. The rubricator’s quality reward is based on the fraction of proposed rubrics that pass the validity test. It also receives a format reward so the rubrics remain parseable. This is not glamorous, but parseability is where many elegant AI systems go to die.
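One way the two rubricator terms might be combined, sketched in Python. The JSON-list output format, the `format_weight` value, the additive combination, and the choice to zero the reward on unparseable output are all assumptions made for this example; the article states only that there is a validity-fraction term and a format term.

```python
import json


def format_reward(raw_output: str) -> float:
    """1.0 if the output parses as a non-empty JSON list of non-blank strings."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    ok = (
        isinstance(parsed, list)
        and len(parsed) > 0
        and all(isinstance(r, str) and r.strip() for r in parsed)
    )
    return 1.0 if ok else 0.0


def rubricator_reward(raw_output, validity_flags, format_weight=0.1):
    """Fraction of valid rubrics plus a small format bonus (assumed combination)."""
    fmt = format_reward(raw_output)
    if fmt == 0.0:
        return 0.0  # nothing can be scored if the rubrics cannot be parsed
    quality = sum(validity_flags) / len(validity_flags) if validity_flags else 0.0
    return quality + format_weight * fmt


good = '["Uses systematic counting", "Checks leading zeros", "Is polite", "Is long"]'
print(rubricator_reward(good, [True, True, False, False]))  # 0.6
print(rubricator_reward("Here are my rubrics: ...", []))     # 0.0
```

The validity flags would come from whatever filter decides which rubrics track correctness; here they are supplied by hand.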

The result is an adaptive supervision loop. The reasoner learns to satisfy useful rubrics. As those rubrics become saturated, the rubricator is pushed to generate new rubrics that better separate correct and incorrect reasoning traces. In theory, the process becomes a moving curriculum: not a fixed checklist, but a training signal that evolves as the model’s reasoning behavior changes.

This is why the mechanism-first interpretation is more useful than a benchmark-first summary. The benchmark gains matter, but the paper’s deeper contribution is the reward loop:

Layer	What is being optimized	Why it matters
Outcome reward	Whether the final answer is correct	Keeps training anchored to verifiable success
CoT rubric reward	Whether the reasoning satisfies valid process criteria	Adds process-level guidance without human-written labels
Rubricator validity reward	Whether proposed rubrics predict correctness and discriminate among traces	Prevents rubric supervision from becoming stale or ornamental

The method is not just asking the model to think. It is asking the model to generate better tests of thinking, then rewarding those tests when they actually predict successful answers.

No, this does not make the model wise. It makes the supervision loop less blind. Different achievement.

The evidence: RLCER improves over outcome-only RLVR, but the pattern is uneven

The authors evaluate Qwen3 4B and 8B models. They first cold-start the base models with supervised fine-tuning because the small models did not reliably follow the required reasoner and rubricator formats out of the box. Training then uses DAPO-Math-17K, and evaluation covers the math benchmarks AIME2024, AIME2025, and AMC2023, plus general reasoning benchmarks such as GPQA-Diamond and SuperGPQA subsets.

The headline result is that RLCER generally improves over outcome-only RLVR.

For the 8B model, RLCER beats RLVR on six of the seven benchmarks and underperforms on SuperGPQA-Med:

Benchmark	SFT	+ RLVR	+ RLCER	RLCER vs RLVR
AIME2024	22.29	34.79	37.50	+2.71
AIME2025	23.75	32.50	33.33	+0.83
AMC2023	66.41	84.53	86.41	+1.88
GPQA-Diamond	31.72	46.56	48.77	+2.21
SuperGPQA-Eng	36.00	42.94	45.00	+2.06
SuperGPQA-Med	33.75	38.31	36.50	-1.81
SuperGPQA-Sci	35.19	48.81	50.25	+1.44

For the 4B model, RLCER beats RLVR on five benchmarks, ties on AIME2024, and underperforms on SuperGPQA-Sci:

Benchmark	+ RLVR	+ RLCER	RLCER vs RLVR
AIME2024	29.38	29.38	0.00
AIME2025	30.21	30.63	+0.42
AMC2023	79.53	81.88	+2.35
GPQA-Diamond	44.16	44.91	+0.75
SuperGPQA-Eng	39.75	40.19	+0.44
SuperGPQA-Med	28.50	31.63	+3.13
SuperGPQA-Sci	42.88	41.81	-1.07

Two readings are important.

First, the method is not producing cartoonishly large gains over RLVR. The improvement is incremental but broad. On the reported averages across the seven listed benchmarks, RLCER improves over RLVR by about 0.86 points for 4B and about 1.33 points for 8B. The larger 8B model benefits more, consistent with the paper’s observation.
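Those averages can be reproduced from the two tables above with a few lines of Python. The scores are copied from the tables; the dictionary layout and the `mean_gain` helper are just a convenient way to do the arithmetic.

```python
# (RLVR, RLCER) scores per benchmark, copied from the tables above.
scores = {
    "4B": {
        "AIME2024": (29.38, 29.38),
        "AIME2025": (30.21, 30.63),
        "AMC2023": (79.53, 81.88),
        "GPQA-Diamond": (44.16, 44.91),
        "SuperGPQA-Eng": (39.75, 40.19),
        "SuperGPQA-Med": (28.50, 31.63),
        "SuperGPQA-Sci": (42.88, 41.81),
    },
    "8B": {
        "AIME2024": (34.79, 37.50),
        "AIME2025": (32.50, 33.33),
        "AMC2023": (84.53, 86.41),
        "GPQA-Diamond": (46.56, 48.77),
        "SuperGPQA-Eng": (42.94, 45.00),
        "SuperGPQA-Med": (38.31, 36.50),
        "SuperGPQA-Sci": (48.81, 50.25),
    },
}


def mean_gain(results):
    """Unweighted mean of (RLCER - RLVR) across benchmarks."""
    gains = [rlcer - rlvr for rlvr, rlcer in results.values()]
    return sum(gains) / len(gains)


avg_4b = mean_gain(scores["4B"])
avg_8b = mean_gain(scores["8B"])
print(round(avg_4b, 2), round(avg_8b, 2))  # 0.86 1.33
```

This is an unweighted mean over benchmarks of very different sizes, so it is a rough summary rather than a pooled accuracy figure.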

Second, RLCER’s value is not just “higher scores.” If the only claim were a one-point benchmark improvement, the business interpretation would be modest. The more important evidence is that self-generated rubrics appear to provide usable process-level reward signals, and that those signals can evolve during training.

That is the claim worth paying attention to.

The outcome-free test asks whether the rubrics contain real signal

The paper includes a useful preliminary experiment: remove the outcome reward and train using only rubric-based CoT rewards.

This is not the main business deployment setting. A company would not normally throw away known correctness labels if it has them. The point of the experiment is diagnostic. It asks whether self-proposed rubrics contain learning signal independent of final-answer reward.

According to the paper, rubric-only training improves reasoning performance over the course of training, while random rubric rewards fail to improve performance and can cause sharp drops. That is the cleanest test in the paper for the basic premise: the rubrics are not merely structured noise.

This experiment is best read as the main evidence that the rubric signal is reliable, not as a recommendation to train business systems without outcome verification.

Test	Likely purpose	What it supports	What it does not prove
Rubric-only reward	Main evidence for rubric signal	Self-proposed rubrics can guide learning even without outcome reward	That outcome reward is unnecessary in production
Random rubric reward	Negative control	Improvement is not caused by arbitrary dense rewards	That all valid rubrics are semantically correct
RLCER vs RLVR	Main effectiveness test	Adding CoT rubric rewards improves over outcome-only training on average	That gains transfer to every task or domain
Removing rubricator evolving reward	Ablation	Self-evolving rubrics improve stability and informativeness	That the evolving rule is optimal
Rubrics as prompt hints	Exploratory extension	Generated rubrics can guide inference-time reasoning	That prompting with rubrics is always cost-effective

This distinction matters because AI papers often contain several experiments that look equally important in a result table but serve different argumentative functions. Here, the rubric-only test checks whether the supervision signal exists. The RLVR comparison checks whether the full method improves training. The ablation checks whether self-evolution matters. The prompt-hint test explores whether rubrics are portable beyond training.

Treating all of them as “the method works” would flatten the paper into marketing sludge. We have enough of that already.

The ablation shows why evolving rubrics matter

The authors also test what happens when the rubricator no longer receives the evolving reward and is rewarded only for format. This ablation is important because a two-role system could appear useful simply because rubrics add extra dense reward. The question is whether making the rubricator better over time matters.

The paper reports that full RLCER produces a more stable learning curve and eventually outperforms the non-evolving ablation. It also reports that the average correlation between rubric satisfaction and final-answer correctness increases during RLCER training, while the ablation’s correlation remains largely unchanged. Meanwhile, the CoT reward under full RLCER becomes harder to obtain over time, whereas the non-evolving rubrics become easier to satisfy.

That pattern supports the paper’s mechanism. It suggests that self-evolving rubrics are not just adding more text to the reward pipeline. They are changing the difficulty and informativeness of the process signal.

The business translation is simple: a static checklist can be useful at first, but once the model learns to satisfy it, the checklist loses diagnostic power. An evolving rubric mechanism tries to keep the checklist annoying enough to remain useful. Annoying, in evaluation design, is often a compliment.

The case study makes the mechanism concrete

The appendix includes a case study about counting years after 2013 and before 10000 that consist of four consecutive digits. The ground truth is 149; the model predicts 131.
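The stated ground truth can be checked by brute force. The sketch below assumes the problem means years whose four digits are consecutive integers in some order (as in 2013, which uses 0, 1, 2, 3); that reading is an interpretation of the one-line description above, and the helper name is made up for this example.

```python
def has_consecutive_digits(year: int) -> bool:
    """True when the sorted digits of the year increase by exactly 1 each step."""
    digits = sorted(int(d) for d in str(year))
    return all(b - a == 1 for a, b in zip(digits, digits[1:]))


# Years strictly after 2013 and strictly before 10000.
count = sum(has_consecutive_digits(y) for y in range(2014, 10000))
print(count)  # 149
```

Under that reading the count agrees with the ground truth of 149: 11 qualifying years from the digit set {0, 1, 2, 3}, 18 from {1, 2, 3, 4}, and 24 from each of the five remaining sets.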

The generated rubrics identify problems such as using ad-hoc manual listing instead of systematic permutation formulas, failing to validate edge cases, and not organizing digit-set ranges and leading-zero constraints upfront. They also include positive criteria such as categorizing sets by minimum permutation to avoid redundant counting.

This example is small, but it clarifies what a self-proposed rubric can do. It does not merely say “wrong answer.” It points toward the reasoning habits that likely caused the wrong answer:

Failure mode	Rubric-style diagnosis	Practical value
Manual enumeration	Penalize ad-hoc listing when a systematic formula is available	Reduces arithmetic and omission errors
Missing edge checks	Penalize failure to validate leading-zero and boundary cases	Improves reliability under constraints
Poor decomposition	Reward explicit digit-set ranges and boundary handling	Makes the solution auditable
Redundant work	Reward categorization that avoids repeated case analysis	Improves efficiency without skipping logic

For enterprise AI, this distinction matters. The value is not just that the system knows a response is wrong. The value is that it can generate reusable, task-specific diagnostic criteria: the kind of criteria that can be turned into monitoring rules, test suites, prompt hints, or fine-tuning signals.

Again, within limits. The case study is illustrative, not proof of broad deployment readiness.

Rubrics as prompt hints: training artifacts become deployment guidance

One of the paper’s more interesting extensions is using generated rubrics as in-prompt hints during inference. The authors report that adding rubrics to the prompt improves reasoning performance, and that best-of-16 performance improves further on AIME datasets.

This is not the core training contribution, but it is operationally interesting. It suggests that rubrics learned or generated during training may become portable reasoning guidance at deployment time.
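Mechanically, a prompt hint is just text prepended to the question. A minimal Python sketch, where the function name `build_hinted_prompt`, the prompt wording, and the `max_hints` cap are all hypothetical choices rather than the paper's template:

```python
def build_hinted_prompt(question, rubrics, max_hints=3):
    """Assemble a plain-text prompt that lists a few rubrics as reasoning hints."""
    hints = rubrics[:max_hints]
    lines = [f"Question: {question}", ""]
    if hints:
        lines.append("While reasoning, keep these criteria in mind:")
        lines.extend(f"{i}. {hint}" for i, hint in enumerate(hints, start=1))
        lines.append("")
    lines.append("Show your reasoning, then give the final answer.")
    return "\n".join(lines)


prompt = build_hinted_prompt(
    "Count the qualifying years.",
    [
        "Use a systematic counting method instead of ad-hoc listing.",
        "Validate edge cases such as leading zeros.",
    ],
)
print(prompt)
```

Capping the number of hints is one blunt way to keep prompt overhead bounded; whether the extra tokens pay for themselves is an empirical question for each task.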

That creates a possible workflow for companies building specialized reasoning systems:

1. Train or adapt a model on verifiable tasks.
2. Generate rubrics that correlate with correct answers.
3. Filter rubrics for validity and discriminative power.
4. Reuse high-quality rubrics as inference-time guidance.
5. Monitor whether rubric satisfaction continues to predict correctness.

This is not just “prompt engineering,” because the rubrics are not written by someone guessing what sounds helpful. They are selected based on empirical relationship to correctness during rollouts.

The difference is small in implementation and large in governance. A human prompt engineer might say, “Always check edge cases.” A rubric-evolution pipeline can ask, “Does checking edge cases actually predict correctness on this task distribution, and does it still discriminate among model outputs?”
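A rough sketch of how that question could be asked of logged outputs. This is a monitoring heuristic invented for illustration, not the paper's validity rule: it compares accuracy among traces that satisfy a rubric with accuracy among those that do not, and the `saturation_rate` and `min_gap` cutoffs are arbitrary assumptions.

```python
def rubric_health(satisfied, correct, saturation_rate=0.95, min_gap=0.1):
    """Classify a rubric from parallel 0/1 lists of satisfaction and correctness."""
    if not satisfied or len(satisfied) != len(correct):
        raise ValueError("need two non-empty lists of equal length")
    n_sat = sum(satisfied)
    rate = n_sat / len(satisfied)
    if rate >= saturation_rate:
        return "saturated"        # nearly everything passes: no discrimination
    if n_sat == 0:
        return "never satisfied"  # nothing passes: no usable signal
    acc_sat = sum(c for s, c in zip(satisfied, correct) if s) / n_sat
    n_unsat = len(satisfied) - n_sat
    acc_unsat = sum(c for s, c in zip(satisfied, correct) if not s) / n_unsat
    return "predictive" if acc_sat - acc_unsat >= min_gap else "not predictive"


# Hypothetical logs for "checks edge cases" at two points in time.
early = rubric_health([1, 1, 0, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0, 0, 1])
late = rubric_health([1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 0, 1, 1, 0, 1, 1])
print(early, late)  # predictive saturated
```

In the second batch every output checks edge cases, yet two answers are still wrong, so the rubric has stopped explaining anything.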

That second question is where evaluation becomes less decorative.

What the paper directly shows, and what Cognaptus infers

The paper’s direct claims are strongest in RLVR-like settings: tasks with verifiable answers, multiple sampled rollouts, and enough structure for rubric satisfaction to be checked by a verifier.

The business implications are plausible but should be separated from the evidence.

Layer	What the paper directly shows	Cognaptus inference	What remains uncertain
Training signal	Self-proposed rubrics can provide meaningful CoT reward signals in the tested setup	Process supervision may be made cheaper for answer-verifiable domains	Whether this works in messy open-ended enterprise tasks
Performance	RLCER generally beats outcome-only RLVR across reported benchmarks	Adding process rewards can improve reasoning beyond final-answer checking	Whether the incremental gain justifies extra compute in every case
Rubric evolution	Rewarding valid rubric generation improves correlation and reduces saturation	Adaptive evaluation criteria may be better than static checklists	Whether the validity rule is robust under distribution shift
Prompt hints	Generated rubrics can improve inference-time performance	Training-time diagnostics may become deployment-time guidance	Whether prompt overhead and latency are worth it
Human labeling	No human annotation is needed for CoT rubrics in the paper’s method	Manual process-labeling costs may fall in some training pipelines	The method still uses cold-start data and a verifier model

For Cognaptus readers, the most realistic near-term opportunity is not “fully autonomous self-improving agents.” That phrase should be placed in a locked drawer until further notice.

The realistic opportunity is cheaper process diagnosis.

If a business has tasks where final correctness can be checked — code generation, math-heavy analytics, structured financial calculations, compliance QA with known rule outcomes, document extraction with ground-truth labels — then a rubric-evolution pipeline could help discover which reasoning behaviors predict success. Those rubrics could then be used to train, evaluate, or guide models.

That is less flashy than “AI teaches itself to think.”

It is also more likely to survive contact with procurement.

The boundary: this is not free supervision

The paper uses the phrase “free-lunch” for improving RL performance with self-proposed rubrics. I understand the point. I also distrust free lunch in machine learning. It usually arrives with a compute invoice.

RLCER reduces one kind of cost: human annotation of process-level reasoning labels. But it adds or preserves several other dependencies:

Dependency	Why it matters
Verifiable final answers	Rubric validity depends on correlation with correctness
Multiple rollouts	Correlation and discriminativeness require sampled variation
Frozen verifier model	Rubric satisfaction still needs to be judged
Cold-start SFT	The paper uses 40k cold-start trajectories because small models needed format and role-following support
Extra rubricator computation	The rubricator role increases rollout burden and training time
Domain structure	Rubrics must be meaningful and checkable enough to guide learning

The most important limitation is the verifiable-domain boundary. RLCER is naturally suited to tasks where correctness can be observed. Extending it to open-ended domains such as strategy writing, investment thesis generation, negotiation advice, or executive decision support is not automatic.

In those domains, there may be no clean final answer. If correctness is delayed, subjective, or multi-objective, then rubric validity becomes harder to define. You can still use rubrics, but the correlation anchor weakens. At that point, the method needs additional validation design: expert review, outcome proxies, retrospective scoring, simulation environments, or human-in-the-loop audits.

The second limitation is verifier quality. The verifier checks whether a reasoning trace satisfies each rubric. If the verifier is weak, biased, or too literal, the reward signal can become distorted. RLCER avoids human-written CoT labels, but it does not avoid the need for reliable evaluation infrastructure.

The third limitation is economics. More roles and more rollouts mean more compute. The right business question is not “Does RLCER improve benchmarks?” The right question is:

Does the process-quality gain justify the extra training and inference cost for this task class?

For high-stakes reasoning systems, the answer may be yes. For low-margin content generation, probably not. For another AI startup promising autonomous compliance transformation by next Tuesday, the answer is “please send the audit log.”

Where this could matter first

RLCER is most relevant where three conditions hold:

1. The task has checkable outcomes.
2. The reasoning path affects reliability.
3. Human process labeling is too expensive to scale.

That points to several practical domains:

Domain	Possible use	Why RLCER-like rubrics help
Code generation	Reward debugging strategy, test coverage, and error localization	Final tests verify correctness, rubrics shape process
Financial analytics	Reward decomposition, assumption checks, and reconciliation steps	Numeric outputs can often be checked
Compliance QA	Reward rule mapping, exception handling, and evidence citation	Outcomes can be validated against policy cases
Education technology	Generate adaptive grading rubrics for solution paths	Student answers provide natural correctness signals
Data extraction	Reward cross-field consistency checks and boundary handling	Ground-truth extraction labels are available
Scientific reasoning benchmarks	Reward method selection and validation behavior	Some tasks have verifiable answers or known references

The common pattern is not “AI replaces experts.” The pattern is that expert supervision shifts upstream. Humans design the task environment, verifier, and outcome labels; the model helps discover process rubrics that predict success within that environment.

That is a more sober form of automation. It does not eliminate governance. It gives governance better instruments.

The strategic lesson: reward what survives contact with evidence

The strongest idea in RLCER is not self-evaluation. Self-evaluation is cheap. Every model can be prompted to produce a critique. Most of those critiques are politely formatted mist.

The stronger idea is evidence-filtered self-evaluation.

A rubric matters only if satisfying it predicts correctness and separates good traces from bad ones. That principle generalizes beyond this paper. Whether a company is evaluating LLM reasoning, sales-call summaries, compliance reviews, trading explanations, or customer-support decisions, the question should not be, “Does this evaluation criterion sound reasonable?”

The question should be:

Does this criterion predict the outcome we care about, and does it still discriminate after the system adapts?

That is the difference between a checklist and an evaluation system.

RLCER is not the final answer to process supervision. It is bounded, computationally heavier than vanilla RLVR, and still tied to verifiable-answer settings. But it points toward a better training philosophy: use outcome verification not only to reward answers, but to discover which reasoning behaviors deserve reward.

The model writes the report card. The data decides whether the report card is worth reading.

For once, that is a meeting I might attend.

Cognaptus: Automate the Present, Incubate the Future.

¹ Leheng Sheng, Wenchang Ma, Ruixin Hong, Xiang Wang, An Zhang, and Tat-Seng Chua, “Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics,” arXiv:2602.10885, February 11, 2026. https://arxiv.org/html/2602.10885