Opening — Why This Matters Now

For the past two years, reinforcement learning has been the quiet architect behind the reasoning leap of large language models (LLMs). We reward them when they land the right answer. They get better at landing the right answer.

Efficient. Scalable. And slightly naive.

Because if you only reward the final answer, you are implicitly saying: “I don’t care how you think — just get it right.”

The recent paper “Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics” (RLCER) challenges this outcome obsession. Instead of merely asking whether the answer is correct, it asks a more strategic question:

Can models learn to evaluate — and improve — the quality of their own reasoning without human-written reward models?

If the answer is yes, we are no longer optimizing outputs. We are optimizing cognition.

That’s a different category of capability.


Background — The Limits of Outcome-Centric RL

Most large reasoning models today rely on Reinforcement Learning with Verifiable Rewards (RLVR).

The core idea is simple:

  • Generate a chain-of-thought (CoT)
  • Produce a final answer
  • Compare it to ground truth
  • Reward correctness

Formally, the reward is:

$$ r = \psi(I(A, \hat{A})) $$

Where $I(A, \hat{A})$ indicates whether the predicted answer matches the true answer.

This works well in math and coding — domains where answers are checkable.
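A minimal sketch of that reward, assuming a plain string-match verifier (the `outcome_reward` name and the identity choice for $\psi$ are my simplifications; real verifiers normalize answers or run unit tests):

```python
# Minimal RLVR-style outcome reward: the chain-of-thought never enters the picture.
def outcome_reward(predicted_answer: str, ground_truth: str) -> float:
    matches = predicted_answer.strip() == ground_truth.strip()  # I(A, A_hat)
    return 1.0 if matches else 0.0                              # r = psi(I(A, A_hat)), psi = identity
```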

But it introduces a structural blind spot:

| Problem | Why It Matters |
| --- | --- |
| Multiple reasoning paths lead to the same answer | Model may learn shortcuts |
| No direct CoT supervision | Reasoning quality drifts |
| Static reward models | Break under distribution shift |
| Heavy human annotation | Expensive and non-scalable |

In other words, we reward results, not process.

And that creates underconstrained optimization.

For business applications — finance, healthcare, compliance, strategy — reasoning robustness matters more than isolated accuracy spikes.

Which brings us to RLCER.


Analysis — What RLCER Actually Does

RLCER introduces a surprisingly elegant mechanism:

Let the model generate its own reasoning evaluation criteria — and evolve them over time.

Yes. The model writes its own report card.

Two Roles, One Policy

The framework instantiates a single policy model under two roles:

| Role | Function |
| --- | --- |
| Reasoner | Generates CoT and final answer |
| Rubricator | Generates evaluation rubrics for the CoT |

These rubrics are structured natural-language criteria like:

  • “Avoids tangential exploration”
  • “Uses systematic decomposition”
  • “Implements edge-case validation”

Each rubric has a score (positive or negative).
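As a rough mental model, a rubric can be thought of as a criterion paired with a signed score; the field names below are my own, not the paper's schema:

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    criterion: str   # natural-language criterion, e.g. "Uses systematic decomposition"
    score: float     # positive for desirable behavior, negative for a penalty

rubrics = [
    Rubric("Uses systematic decomposition", +1.0),
    Rubric("Implements edge-case validation", +0.5),
    # A negatively scored rubric would penalize an undesirable reasoning pattern.
]
```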

Then comes the key filter.

What Makes a Rubric “Valid”?

A rubric is considered valid only if:

  1. Its satisfaction correlates positively with answer correctness
  2. It is discriminative across rollouts

Formally, for rubric $k$:

$$ \text{corr}(v_k, z) > \alpha $$

Where:

  • $v_k$ = rubric satisfaction vector across rollouts
  • $z$ = answer correctness vector
  • $\alpha$ = correlation threshold (0.2 in the paper)

If satisfying the rubric statistically aligns with correctness, it survives.

If not — it dies.

No human in the loop.
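A minimal sketch of that filter, assuming binary satisfaction and correctness vectors and Pearson correlation; treating "discriminative" as a zero-variance guard is my interpretation:

```python
import numpy as np

def is_valid_rubric(v_k: np.ndarray, z: np.ndarray, alpha: float = 0.2) -> bool:
    """v_k: per-rollout rubric satisfaction (0/1); z: per-rollout correctness (0/1)."""
    if v_k.std() == 0 or z.std() == 0:                 # constant vectors are not discriminative
        return False
    return float(np.corrcoef(v_k, z)[0, 1]) > alpha    # keep only if corr(v_k, z) > alpha
```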


Reward Composition

The reasoner receives:

$$ r^{Rea} = r^{outcome} + r^{cot} $$

Where:

  • $r^{outcome}$ = answer correctness reward
  • $r^{cot}$ = aggregated rubric satisfaction score

Meanwhile, the rubricator receives reward proportional to the fraction of valid rubrics:

$$ r^{Rub}_{evolving} = \frac{K_{valid}}{K} $$

Translation:

If your rubrics meaningfully predict correctness, you get rewarded.

That’s self-evolution through correlation pressure.
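In code, the two signals compose roughly as follows; summing satisfied rubric scores into $r^{cot}$ is my simplification of "aggregated rubric satisfaction":

```python
def reasoner_reward(r_outcome: float, satisfied_rubric_scores: list[float]) -> float:
    r_cot = sum(satisfied_rubric_scores)   # aggregated rubric satisfaction
    return r_outcome + r_cot               # r^Rea = r^outcome + r^cot

def rubricator_reward(num_valid: int, num_total: int) -> float:
    return num_valid / num_total           # r^Rub_evolving = K_valid / K
```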


Findings — Does It Work?

The empirical results are not subtle.

Performance Gains (8B Model)

| Model | AIME2024 | AIME2025 | AMC2023 | GPQA-Diamond |
| --- | --- | --- | --- | --- |
| SFT | 22.29 | 23.75 | 66.41 | 31.72 |
| + RLVR | 34.79 | 32.50 | 84.53 | 46.56 |
| + RLCER | 37.50 | 33.33 | 86.41 | 48.77 |

Key observations:

  1. RLCER consistently outperforms vanilla RLVR.
  2. Gains are larger for bigger models (8B > 4B).
  3. Improvements generalize beyond math into graduate-level reasoning (GPQA).

Now the more interesting part.

Outcome-Free Training Still Improves Reasoning

The authors run a striking experiment:

They remove outcome rewards entirely.

Only rubric-based CoT rewards remain.

Performance still improves.

That means the self-generated rubrics contain real signal — not just noise dressed as structure.

When the rubric scores are replaced with random values, performance collapses.

Correlation matters.


Self-Evolving Dynamics

As training progresses:

| Metric | Trend with RLCER |
| --- | --- |
| Rubric–correctness correlation | Increases |
| CoT reward saturation | Decreases |
| Final performance | Stabilizes higher |

Interpretation:

  • Rubrics become more aligned with true reasoning quality
  • They get harder to satisfy
  • The model must genuinely improve reasoning to earn rewards

That’s curriculum learning — but generated internally.


Rubrics as Inference-Time Hints

Even more interesting:

When these generated rubrics are inserted as prompt hints at inference time, accuracy increases further.

In other words:

The rubrics are not just training scaffolds. They encode reasoning priors.

This opens a strategic avenue:

Learned evaluation criteria can become deployment-time guidance.

That’s portable cognition.
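The paper does not spell out the hint template, so the following is only a guess at the mechanism: learned rubric criteria prepended to the problem prompt at inference time.

```python
def prompt_with_rubric_hints(question: str, rubric_criteria: list[str]) -> str:
    hints = "\n".join(f"- {c}" for c in rubric_criteria)
    return (
        "While reasoning, try to satisfy these criteria:\n"
        f"{hints}\n\n"
        f"Problem: {question}"
    )

print(prompt_with_rubric_hints(
    "How many integers between 1 and 100 are divisible by 3 or 5?",
    ["Uses systematic decomposition", "Implements edge-case validation"],
))
```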


Implications — Why This Matters for Real Systems

Let’s move beyond benchmarks.

1. Reduced Human Labeling Cost

Traditional process reward models require dense human annotation. RLCER removes that requirement.

For enterprises building domain-specific reasoning systems (legal, financial, regulatory), this is significant.

2. Adaptive Supervision Under Distribution Shift

Static reward models decay as policy behavior shifts.

Self-evolving rubrics adapt because their survival depends on correlation with correctness.

That’s dynamic governance.

3. Toward Self-Improving Agents

The architecture mirrors multi-agent reinforcement learning — but within one model.

The system:

  • Generates
  • Evaluates
  • Correlates
  • Refines

That’s the blueprint of autonomous capability refinement.

4. Business Use Case Potential

| Domain | Application |
| --- | --- |
| Quant finance | Self-refining trade thesis validation |
| Compliance | Evolving risk-detection heuristics |
| Healthcare | Structured reasoning quality checks |
| Education | Auto-generated adaptive grading rubrics |

The meta-lesson is powerful:

If your model can define what good reasoning looks like — and revise that definition — you reduce external supervision dependence.

Which reduces marginal scaling cost.


Limitations — Reality Check

No paper escapes gravity.

  1. Compute Overhead — Rubricator rollouts increase training burden.
  2. Still Verifiable-Domain Focused — Correlation requires correctness labels.
  3. Verifier Dependency — A frozen verifier model judges rubric satisfaction.

So we are not yet in fully self-verifying territory.

But we are closer.


Conclusion — From Answer Optimization to Cognitive Optimization

Outcome-centric RL makes models better at being right.

RLCER makes models better at thinking.

The distinction is subtle but foundational.

If this paradigm scales:

  • Reward models become internal
  • Supervision becomes adaptive
  • Reasoning quality becomes explicitly optimized

And reinforcement learning stops being about what the model outputs.

It becomes about how the model reasons about reasoning itself.

That’s when systems stop being tools — and start being agents.

Quietly. Iteratively. Without asking for permission.

Cognaptus: Automate the Present, Incubate the Future.