Opening — Why This Matters Now

Chain-of-Thought (CoT) prompting has become the default ritual of modern LLM usage. If the model struggles, we simply ask it to “think step by step.” Performance improves. Benchmarks climb. Investors nod approvingly.

But here’s the uncomfortable question: what exactly inside that long reasoning trace is doing the work?

Is it genuine structured reasoning? Extra compute? A lucky guess wrapped in polite algebra? Or something subtler?

The paper “The Potential of CoT for Reasoning: A Closer Look at Trace Dynamics” (Apple, Feb 2026) offers one of the most precise instruments we’ve seen for dissecting this question. Instead of debating whether CoT is “real reasoning,” the authors introduce a measurable quantity: potential—the probability that a partial reasoning trace will ultimately lead to the correct answer.

For operators building AI systems, this is more than academic curiosity. It reframes how we think about:

  • Interpretability
  • Reliability metrics (e.g., pass@k inflation)
  • Credit assignment in RL training
  • Transfer learning between models

And perhaps most importantly: how much of CoT is signal versus ceremony.


Background — The Myth of Smooth Reasoning

CoT has been credited with breakthroughs in math, coding, and competitive benchmarks. The intuitive story is elegant:

  1. Generate intermediate steps.
  2. Decompose the problem.
  3. Accumulate evidence.
  4. Arrive at the answer.

If that were fully true, potential should increase smoothly over time. Each token should push us closer to correctness—like climbing a hill.

The authors formalize this intuition. Given a prompt $x$ and a partial reasoning prefix $c_{<t}$, they define:

$$ \mathrm{pot}(c_{<t}; x) = \mathbb{P}\left(y = y^{*} \mid c_{<t}, x\right) $$

In plain language: If we freeze the reasoning at token $t$ and resample many completions, how often do we end up correct?
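
A minimal Monte Carlo sketch of that definition, assuming hypothetical hooks for sampling completions and grading answers (the names and the default sample count are illustrative, not the paper's code):

```python
def estimate_potential(prompt, prefix, sample_completion, is_correct, n_samples=32):
    """Monte Carlo estimate of pot(c_<t; x): freeze the reasoning prefix,
    resample completions, and count how often the final answer is correct.

    sample_completion(prompt, prefix) and is_correct(answer) are hypothetical
    hooks for the model and the grader; n_samples trades cost for variance.
    """
    hits = 0
    for _ in range(n_samples):
        answer = sample_completion(prompt, prefix)  # continue from the frozen prefix
        hits += int(is_correct(answer))
    return hits / n_samples
```

The estimator is noisy: its variance is roughly $p(1-p)/n$ per checkpoint, so tracing fine-grained potential curves is sampling-heavy.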

If CoT works like evidence accumulation, potential should rise steadily.

It doesn’t.


Analysis — What the Potential Reveals

Across AIME math problems, coding benchmarks, and GPQA reasoning tasks, the authors observe four recurring patterns:

1. Reasoning Insights (Sharp Upward Spikes)

Large jumps in potential over a short token window—often tied to discovering symmetry, a key substitution, or a structural reformulation.

These resemble human “aha” moments.

2. Reasoning Tangents (Sharp Drops)

Segments that look promising but reduce the probability of success. Overthinking. Exploring elegant but irrelevant paths.

Reasoning models exhibit this more often than non-reasoning ones.

3. Reasoning Jumps (Unintuitive Spikes)

Sudden improvements triggered by seemingly trivial tokens. For example, inserting a word like “corresponding” forces the model to instantiate concrete values, unlocking the solution.

These are hard to interpret from a human perspective.

4. Late Spikes (Guessing)

Potential remains flat near zero until the very end—then suddenly jumps when the final answer is emitted.

Translation: the reasoning didn’t help. The model guessed correctly.
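
To make these four patterns concrete, here is a rough heuristic that labels a discrete potential curve. The thresholds and the late-window cutoff are illustrative assumptions, not the paper's values, and insights cannot be separated from jumps by curve shape alone, since the difference lies in what triggered them:

```python
def classify_trace(potential, spike=0.3, flat=0.1, late=0.9):
    """Heuristically label a potential curve pot(c_<t; x) sampled along a trace.

    potential: list of estimated potentials, one per checkpoint.
    spike / flat / late are illustrative thresholds, not the paper's values.
    """
    deltas = [b - a for a, b in zip(potential, potential[1:])]
    n = max(len(deltas), 1)
    labels = set()
    for i, d in enumerate(deltas):
        if d >= spike:
            # a big jump at the very end of a flat, near-zero curve looks like guessing
            if (i + 1) / n >= late and max(potential[: i + 1]) <= flat:
                labels.add("late_spike")
            else:
                labels.add("insight_or_jump")
        elif d <= -spike:
            labels.add("tangent")
    if all(d >= 0 for d in deltas):
        labels.add("monotonic")
    return labels

# flat near zero, then a final jump when the answer is emitted: {'late_spike'}
print(classify_trace([0.02, 0.03, 0.02, 0.95]))
```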


Findings — The Shape of CoT in Numbers

On AIME-2024, the authors summarize behavior across models:

| Model | Insights ↑ | Tangents ↓ | Late Spike | Monotonicity |
|---|---|---|---|---|
| Qwen2.5-1.5B | 40% | 5% | 20% | 45% |
| Qwen2.5-7B | 62% | 9.5% | 14% | 42% |
| Llama3.1-8B | 46% | 33% | 6% | 15% |
| Qwen3-0.6B (Reasoning) | 55% | 41% | 10% | 15% |
| Qwen3-32B (Reasoning) | 36% | 18% | 0% | 36% |

Three uncomfortable takeaways:

  1. Only about half of correct CoTs are monotonic.
  2. Reasoning models go off on tangents more often (overthinking is real).
  3. Smaller models guess more often (late spikes).

Inflated pass@k

Because benchmarks typically count a problem as solved if any of $k$ sampled attempts is correct:

$$ \text{pass@}k = \frac{1}{P} \sum_{i=1}^{P} \mathbf{1}\left\{ y^{*} \in \{ y_i^{(1)}, \dots, y_i^{(k)} \} \right\} $$

Late guessing artificially boosts pass@k.
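
To see the mechanism, here is the count-based form above in plain code (a sketch; some benchmarks instead use an unbiased estimator over a larger sample pool):

```python
def pass_at_k(results):
    """Count-based pass@k: a problem counts as solved if any of its k attempts is correct.

    results: list of P problems, each a list of k booleans marking whether that
    sampled attempt reached the correct answer y*.
    """
    return sum(any(attempts) for attempts in results) / len(results)

# 3 problems, k = 4 attempts each; one lucky late guess counts the same as a clean solve
print(pass_at_k([
    [False, False, False, True],   # flat potential, last-token guess: still "solved"
    [False, False, False, False],
    [True, True, False, True],
]))  # -> 0.666...
```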

In business terms: you may be measuring lottery tickets, not reasoning.


Optimized CoT — Forcing Monotonic Progress

The authors propose a greedy search over reasoning chunks (sketched in code after the list):

  1. Sample multiple candidate next segments.
  2. Estimate potential.
  3. Keep the chunk that increases it most.
  4. Repeat.
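
In code, that loop might look roughly like the following sketch; the chunk proposer, the potential estimator, and all constants are assumptions rather than the paper's implementation:

```python
def greedy_cot_search(prompt, propose_chunk, potential, width=8, max_steps=40, target=0.95):
    """Greedily extend a reasoning trace, keeping the chunk that raises potential most.

    propose_chunk(prompt, prefix) samples one candidate next segment;
    potential(prompt, prefix) estimates pot(prefix; prompt), e.g. the Monte Carlo
    sketch above with its sampler and grader already bound. All hooks and
    constants are illustrative assumptions.
    """
    prefix, current = "", potential(prompt, "")
    for _ in range(max_steps):
        candidates = [prefix + propose_chunk(prompt, prefix) for _ in range(width)]
        scored = [(potential(prompt, c), c) for c in candidates]
        best_pot, best = max(scored, key=lambda sc: sc[0])
        if best_pot <= current:   # no candidate improves potential: stop early
            break
        prefix, current = best, best_pot
        if current >= target:     # confident enough to commit to an answer
            break
    return prefix, current
```

Each step costs roughly `width` times the rollouts behind one potential estimate, which is exactly the expense flagged below.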

Result: near-monotonic potential curves.

But at a cost:

  • Computationally expensive
  • Sequential dependency (limited batching)
  • Risk of degenerating into “optimized guessing” if search width grows too large

This is not a production recipe. It is a proof of concept that models admit cleaner reasoning paths than the ones they naturally emit.

For system designers, that’s fascinating.


Transferability — The 20% Unlock Effect

The most operationally relevant insight is CoT transferability.

The setup: a weaker model is given a partial reasoning trace from a stronger one, then asked to complete the solution.

Result: even 20% of the stronger model’s CoT can unlock problems the weaker model previously could not solve.
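
A minimal sketch of that setup, with hypothetical generation and grading hooks; the whitespace split and the 20% fraction are simplifications for illustration:

```python
def transfer_solve_rate(problem, strong_trace, weak_generate, is_correct,
                        fraction=0.2, n_samples=16):
    """Condition a weaker model on the first `fraction` of a stronger model's CoT
    and measure how often it now reaches the correct answer.

    weak_generate(problem, prefix) and is_correct(answer) are hypothetical hooks;
    the whitespace split is a crude stand-in for proper tokenization.
    """
    tokens = strong_trace.split()
    prefix = " ".join(tokens[: int(len(tokens) * fraction)])
    hits = sum(int(is_correct(weak_generate(problem, prefix))) for _ in range(n_samples))
    return hits / n_samples
```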

This holds:

  • Within the same model family
  • Across families (e.g., Qwen3-0.6B benefiting from GPT-OSS-20B traces)
  • For reasoning and non-reasoning models

Interpretation:

  • Key reasoning steps are not fully model-specific.
  • Insights behave like reusable cognitive scaffolding.

For multi-agent systems, distillation pipelines, and RL training:

This is strategic gold.


Implications — What This Means for AI Builders

1. Interpretability Is Token-Local, Not Trace-Global

Most tokens do nothing. A few matter enormously.

We should analyze reasoning at the marginal contribution level, not the narrative level.

2. RL Credit Assignment Can Be Refined

Potential resembles a Monte-Carlo value function. It can serve as a fine-grained reward signal for training.

Instead of rewarding only final correctness, we can reward intermediate states with rising potential.
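
One way to operationalize that, sketched under the assumption that per-chunk potentials have already been estimated: reward each chunk by its potential increment, which telescopes to the final outcome minus the prompt-only baseline while spreading credit across the trace. This is an interpretation of the idea, not the paper's training recipe:

```python
def potential_shaped_rewards(potentials, final_correct):
    """Turn a potential curve into per-chunk rewards for RL credit assignment.

    potentials[0] is the prompt-only baseline and potentials[i] the estimate after
    chunk i. Summing the increments telescopes to (final outcome - baseline), so the
    shaping preserves the overall objective while giving dense per-chunk credit.
    """
    rewards = [b - a for a, b in zip(potentials, potentials[1:])]
    rewards[-1] = float(final_correct) - potentials[-2]  # anchor the last step to the real outcome
    return rewards

# the "insight" chunk (0.10 -> 0.60) earns most of the credit
print(potential_shaped_rewards([0.05, 0.10, 0.60, 0.65], final_correct=True))
# -> approximately [0.05, 0.5, 0.4]
```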

3. Benchmark Metrics Need Scrutiny

If late spikes drive success, pass@k may exaggerate competence.

High-stakes domains (finance, law, medical AI) cannot rely on such inflated signals.

4. Hybrid Systems Become More Plausible

Stronger model → extract insight segments → feed weaker agents.

This suggests architectural opportunities:

| Architecture | Benefit |
|---|---|
| Insight distillation | Compress reasoning gains |
| Multi-agent collaboration | Partial-CoT bootstrapping |
| RL with potential reward | Better credit assignment |
| CoT pruning | Remove low-value tokens |

Conclusion — CoT Is Not a Story. It’s a Landscape.

Chain-of-Thought reasoning is not a smooth logical staircase.

It is a jagged terrain of:

  • Spikes (insights)
  • Pits (tangents)
  • Cliffs (jumps)
  • And occasionally, a lucky parachute drop

The potential framework does something subtle but powerful:

It turns interpretability from philosophy into measurement.

For those building autonomous agents, governance layers, or reasoning systems at scale, this shift matters.

Not every token earns its place.

Some move the world.

Most merely speak.

Cognaptus: Automate the Present, Incubate the Future.