Potential Energy: What Chain-of-Thought Is Really Doing Inside Your LLM

The familiar ritual: ask it to think longer

When an LLM gives a weak answer, the standard reflex is now almost ceremonial: ask it to think step by step.

The model writes more. The answer often improves. The benchmark number rises. Everyone feels temporarily reassured.

This habit has become so normal that many teams treat chain-of-thought as if it were a small reasoning engine bolted onto the model: more intermediate steps, more deliberate thought, more correctness. A comforting story. Also, like many comforting stories in AI, not quite what the evidence says.

The paper “The Potential of CoT for Reasoning: A Closer Look at Trace Dynamics” gives us a sharper instrument for looking inside this ritual.¹ Instead of asking whether a chain-of-thought trace “looks reasonable,” the authors ask a colder question:

At each point in the trace, how much closer is the model to producing the correct answer?

That shift matters. A reasoning trace is not judged as a literary object. It becomes a sequence of states. Some states make success more likely. Some do almost nothing. Some make success less likely. And occasionally, the model wanders through a long mathematical monologue only to jump to the correct answer at the end, like a tourist accidentally arriving at the airport.

The paper’s main contribution is not merely that chain-of-thought can be messy. We already knew that, or at least suspected it after reading enough model explanations with suspicious confidence. The contribution is a way to measure where the mess matters.

Potential turns a reasoning trace into a probability curve

The central concept is CoT potential.

Given a prompt and a partial chain-of-thought prefix, the potential is the probability that the model will complete the answer correctly when conditioned on that prefix. In simplified form:

$$ \text{potential}(c_{<t}) = \Pr\left(y = y^\ast \mid x, c_{<t}\right) $$

Here, $x$ is the input problem, $c_{<t}$ is the reasoning trace generated before step $t$, $y$ is the model’s final answer, and $y^\ast$ is the ground-truth answer.

The practical interpretation is simple. Freeze the model’s reasoning at some prefix. Let it complete the rest many times. If many completions reach the right answer, the prefix has high potential. If few do, the prefix has low potential.
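A minimal sketch of that resampling procedure, in Python. `sample_answer` is a hypothetical stand-in for whatever call returns a model's final answer given the problem and a partial trace; it is not an API from the paper or from any particular library. The exact-match comparison is also an assumption, and it only suits short answers such as AIME integers.

```python
from typing import Callable


def estimate_potential(
    sample_answer: Callable[[str, str], str],  # hypothetical: (problem, prefix) -> final answer
    problem: str,
    prefix: str,
    correct_answer: str,
    n_samples: int = 64,
) -> float:
    """Monte Carlo estimate: share of completions from `prefix` that end correctly."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    hits = 0
    for _ in range(n_samples):
        answer = sample_answer(problem, prefix)
        if answer.strip() == correct_answer.strip():
            hits += 1
    return hits / n_samples


def toy_sampler(problem: str, prefix: str) -> str:
    # Deterministic toy "model": it succeeds only once the key step is in the prefix.
    return "42" if "substitute" in prefix else "7"


if __name__ == "__main__":
    print(estimate_potential(toy_sampler, "toy problem", "Let us expand.", "42"))
    print(estimate_potential(toy_sampler, "toy problem", "Let us substitute u = x + 1.", "42"))
```

The estimate is a sample mean, so its noise shrinks only with the square root of `n_samples`. That is the root of the cost problem: every prefix you want to score needs its own batch of completions.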

This reframes chain-of-thought from a readable explanation into something closer to a value function in reinforcement learning. Each prefix is a state. Its value is not whether the text sounds intelligent, but whether it improves the odds of ending correctly.

That distinction is the whole article.

A polished reasoning trace can be low-value. An ugly-looking fragment can be highly valuable. A trivial word can force the model into a useful computational path. A sophisticated-looking tangent can reduce the chance of success. The model’s reasoning is not always aligned with our sense of what should be difficult.

Human readers like narrative coherence. Potential measures conditional usefulness. Those are different currencies.

The expected story is smooth progress. The observed story is jagged progress

If chain-of-thought worked like textbook reasoning, the potential curve would usually climb.

The model would start uncertain. Each useful step would narrow the search space. A correct sub-result would raise the probability of success. A reformulation would raise it again. By the end, the answer would be almost inevitable.

The authors formalize a version of this intuition. Conditional on a full chain-of-thought eventually producing the correct answer, potential should improve on average. That theoretical result gives the paper its baseline: if reasoning is evidence accumulation, the curve should be broadly monotonic.
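One standard way to see why a result of this shape holds, offered as a sketch and not as the paper's own statement or proof. Write $V_t = \text{potential}(c_{<t})$, let $c_t$ be the next step so that $c_{<t+1} = (c_{<t}, c_t)$, and assume the trace is sampled from the same model whose completions define the potential, with $V_t > 0$. Without conditioning on the outcome, the law of total probability makes potential flat on average:

$$ \mathbb{E}\left[V_{t+1} \mid x, c_{<t}\right] = V_t $$

Conditioning on eventual success reweights each next step by Bayes' rule, $\Pr(c_t \mid x, c_{<t}, y = y^\ast) = \Pr(c_t \mid x, c_{<t}) \, V_{t+1} / V_t$, which gives

$$ \mathbb{E}\left[V_{t+1} \mid x, c_{<t}, y = y^\ast\right] = \frac{\mathbb{E}\left[V_{t+1}^2 \mid x, c_{<t}\right]}{V_t} \ge \frac{\left(\mathbb{E}\left[V_{t+1} \mid x, c_{<t}\right]\right)^2}{V_t} = V_t $$

The inequality is Jensen's. Note what this does and does not promise: an improvement averaged over successful traces, not a guarantee for any single one.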

Then the empirical curves ruin the neat picture.

Across competition mathematics tasks such as AIME, and also in extensions to MATH-500, GPQA-Diamond, and HumanEval, the authors observe that many traces are not smooth climbs. They contain spikes, drops, flat regions, and late jumps. Correct traces can include damaging segments. Incorrect traces can contain locally useful segments. Reasoning models can overthink their way into lower potential. Smaller or weaker models can sometimes guess correctly without the preceding reasoning doing much work.

The important point is not “LLMs are bad at reasoning.” That would be too lazy. The more useful point is that CoT progress is unevenly distributed.

Some tokens carry real information. Some tokens are cheap scaffolding. Some tokens are active liabilities. The trace is not one uniform asset. It is a portfolio, and several holdings are underperforming.

Four trace dynamics explain most of the paper

The paper’s qualitative contribution is easiest to understand through four recurring trace dynamics.

Trace dynamic	What happens to potential	What it means	Business reading
Reasoning insight	Sharp upward movement	The model reaches a genuinely useful intermediate step	Find and preserve these segments
Reasoning tangent	Sharp downward movement	The model explores a path that looks plausible but hurts success	Detect overthinking before it spreads
Reasoning jump	Sharp upward movement for a non-obvious reason	A small or strange token unlocks completion for model-specific reasons	Do not assume human-perceived difficulty matches model difficulty
Late spike / guessing	Potential stays low until the final answer	The trace did not justify the answer; correctness arrives at the end	Final-answer metrics may overstate reasoning reliability

The first category, reasoning insight, is the most familiar. In a math problem, the model may discover a symmetry, identify the right substitution, or reduce a complicated expression to a tractable form. Potential rises because the hard part has been crossed. This looks like human problem solving. It is the part of CoT that people want to believe in.

The second category, reasoning tangent, is more awkward. A model may introduce an approach that appears mathematically respectable but does not help this problem. In one of the paper’s examples, the model explores an inequality-based path that looks promising but reduces potential because it is not tight enough for the actual solution. The trace still sounds intelligent. That is precisely the problem. Bad reasoning does not always announce itself in comic sans.

The third category, reasoning jump, is stranger. The model’s potential may rise sharply after a token or phrase that looks trivial to a human reader. In one example discussed in the appendix, a word such as “corresponding” appears to push the model toward outputting concrete values instead of continuing unnecessary calculation. The token is not a grand mathematical insight. It is more like a switch in the model’s internal behavior.

The fourth category, late spike, is the most dangerous for evaluation. Potential remains low through almost the entire trace and then jumps when the final answer appears. The model has not reasoned its way there in any robust sense. It has guessed, sometimes aided by the format of the benchmark. If the benchmark only checks the final answer, the guess looks like success.

That is how a model can appear more capable than it is. Not by being malicious. Just by being stochastic and lucky at scale. AI safety by casino arithmetic: elegant, if one enjoys bad incentives.
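Given a potential curve (one estimate per prefix), the four dynamics can be roughly flagged in code. This is a sketch under stated assumptions: the thresholds `jump`, `low`, and `tail_fraction` are illustrative choices, not values from the paper. The curve alone also cannot tell an insight from a jump; both show up as a sharp rise, and telling them apart means reading the text at that step.

```python
from typing import List


def label_segments(potentials: List[float], jump: float = 0.2) -> List[str]:
    """Label each step between consecutive prefixes as 'rise', 'drop', or 'flat'."""
    labels = []
    for before, after in zip(potentials, potentials[1:]):
        delta = after - before
        if delta >= jump:
            labels.append("rise")   # candidate insight or jump
        elif delta <= -jump:
            labels.append("drop")   # candidate tangent
        else:
            labels.append("flat")
    return labels


def is_late_spike(
    potentials: List[float], low: float = 0.2, tail_fraction: float = 0.1
) -> bool:
    """True if potential stays below `low` until the tail of the trace, then ends high."""
    if len(potentials) < 2:
        return False
    tail = max(1, int(len(potentials) * tail_fraction))
    body = potentials[:-tail]
    return bool(body) and max(body) < low and potentials[-1] >= 1.0 - low


curve = [0.1, 0.15, 0.6, 0.35, 0.9, 1.0]
print(label_segments(curve))  # ['flat', 'rise', 'drop', 'rise', 'flat']
print(is_late_spike(curve))   # False
```

A trace that sits near zero and only reaches 1.0 on its final step would return `True` from `is_late_spike`, which is the pattern final-answer metrics cannot see.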

The quantitative table is a failure-mode map, not a leaderboard

The paper summarizes potential behaviors across several models on AIME-2024. The exact values matter less than the pattern, but the pattern is worth reading carefully.

Model	Reasoning model	Insights	Tangents	Late spike	Monotonicity
Qwen2.5-1.5B	No	40%	5%	20%	45%
Qwen2.5-7B	No	62%	9.5%	14%	42%
Llama3.1-8B	No	46%	33%	6%	15%
Qwen3-0.6B	Yes	55%	41%	10%	15%
Qwen3-32B	Yes	36%	18%	0%	36%

This is not a clean “bigger is always smoother” story. Nor is it “reasoning models always reason better.” The table is more useful as a map of failure modes.

First, insights appear across models. This supports the paper’s claim that many hard tasks contain concentrated bottlenecks. A small number of steps may carry much of the value. For AI system design, this suggests that reasoning traces should be mined for high-contribution segments rather than treated as indivisible text.

Second, tangents are not rare, especially in some reasoning models. This fits a familiar operational problem: models can continue reasoning after they already have enough information, then degrade the answer by exploring unnecessary alternatives. “More thinking” is not always more control. Sometimes it is just more surface area for mistakes.

Third, late spikes appear more strongly in smaller non-reasoning models. That does not mean small models are useless. It means their occasional correct answers can be less supported by the trace. For benchmark interpretation, this matters because metrics that reward any correct sample among many attempts can inflate apparent capability.

Fourth, monotonicity is limited. If only a fraction of correct traces show smooth progress, then trace length is a weak proxy for reasoning quality. A long answer may contain one crucial insight surrounded by filler, or worse, one insight followed by a damaging tangent.

That is the uncomfortable business lesson: the unit of evaluation should not be “the answer was correct” or “the reasoning was long.” It should be “which intermediate states increased the probability of correctness?”
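For an offline audit, "how many traces show smooth progress" can be approximated with a few lines. This is a sketch, not the paper's definition of monotonicity: the `tolerance` parameter, which forgives small dips caused by sampling noise, is an assumption introduced here.

```python
from typing import List, Sequence


def is_monotone(potentials: Sequence[float], tolerance: float = 0.05) -> bool:
    """True if potential never falls by more than `tolerance` between consecutive prefixes."""
    return all(after >= before - tolerance
               for before, after in zip(potentials, potentials[1:]))


def monotone_share(curves: List[Sequence[float]], tolerance: float = 0.05) -> float:
    """Fraction of curves that count as monotone under the given tolerance."""
    if not curves:
        raise ValueError("need at least one curve")
    return sum(is_monotone(c, tolerance) for c in curves) / len(curves)


curves = [
    [0.1, 0.4, 0.9, 1.0],   # steady climb
    [0.2, 0.7, 0.3, 1.0],   # insight followed by a damaging tangent
    [0.1, 0.1, 0.08, 1.0],  # within tolerance, although it is really a late spike
]
print(monotone_share(curves))
```

The third curve is a useful caution: a trace can pass a monotonicity check and still be an unsupported late spike, so the two checks answer different questions.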

Pass@k can reward lottery tickets disguised as reasoning

The paper’s discussion of late spikes has direct implications for benchmark metrics.

Many evaluations, pass@k among them, use a success-if-any-sample-succeeds logic. Generate multiple attempts. If at least one reaches the correct answer, count the problem as solved. This style of metric is useful when we want to know whether a model has some route to the answer. It becomes less useful when correct answers can emerge from weakly justified guesses.

The issue is especially visible in mathematical benchmarks where the final answer is easy to check but the reasoning path is not evaluated. A model can produce many completions. One completion lands on the right integer. The metric smiles. The potential curve, less easily impressed, shows that the preceding trace did not move much toward correctness.
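The arithmetic behind that smile is short. Assuming samples are independent and each one is correct with probability `p`, the chance that at least one of `k` samples succeeds is `1 - (1 - p)^k`. The 5% figure below is a made-up illustration, not a number from the paper.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct,
    given a per-sample success probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Hypothetical model that guesses the right answer 5% of the time.
    for k in (1, 4, 16, 64):
        print(k, round(pass_at_k(0.05, k), 3))
```

Under these assumptions, a model that is right one time in twenty clears the problem on roughly 56% of runs at k = 16 and on about 96% at k = 64. Nothing about its reasoning improved; only the number of lottery tickets did.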

For research, this is a measurement concern. For business, it is a deployment concern.

A customer-support agent that guesses one refund policy correctly after five attempts is not “capable.” A legal summarization workflow that occasionally lands on the right clause after wandering is not “reliable.” A financial analysis agent that produces one correct forecast rationale among many stochastic attempts is not demonstrating stable competence. It is demonstrating that enough sampling can sometimes look like insight.

Potential is valuable because it separates arriving from earning the arrival.

Optimized CoT proves cleaner reasoning exists, but it is not a production recipe

One of the paper’s more interesting experiments constructs optimized chain-of-thought traces.

The idea is greedy and expensive. At each step, sample candidate chunks of reasoning, estimate their potential, keep the chunk that increases potential most, and repeat. The resulting traces are more monotonic. They avoid some tangents. They show that models can admit cleaner reasoning paths than the ones they naturally produce.

This is an important proof of concept, but it should not be misread.

The paper itself treats optimized CoT as costly. Potential estimation requires repeated completions from prefixes. Greedy optimization adds sequential dependency, limiting the batching advantages available in ordinary inference. There is also a theoretical concern: with excessive search, an “optimized” trace could degenerate toward lucky guessing if the objective is not regularized properly.
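The greedy loop and its cost can be sketched as follows. `propose` and `potential` are hypothetical callables standing in for "sample candidate chunks" and "estimate potential of a prefix"; the stopping `target` and the toy stand-ins are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, List, Tuple


def greedy_optimize(
    propose: Callable[[str], List[str]],  # hypothetical: candidate next chunks for a prefix
    potential: Callable[[str], float],    # hypothetical: estimated potential of a prefix
    max_steps: int = 10,
    target: float = 0.95,
) -> Tuple[str, List[float], int]:
    """Extend the trace one chunk at a time, always keeping the candidate
    with the highest estimated potential. Returns (trace, curve, evaluations)."""
    prefix = ""
    curve = [potential(prefix)]
    evaluations = 1
    for _ in range(max_steps):
        candidates = propose(prefix)
        if not candidates:
            break
        scored = [(potential(prefix + chunk), chunk) for chunk in candidates]
        evaluations += len(scored)
        best_score, best_chunk = max(scored, key=lambda pair: pair[0])
        prefix += best_chunk  # the next step depends on this choice
        curve.append(best_score)
        if best_score >= target:
            break
    return prefix, curve, evaluations


# Toy stand-ins: potential grows with the number of "a" chunks in the trace.
def toy_propose(prefix: str) -> List[str]:
    return ["a", "b"]


def toy_potential(prefix: str) -> float:
    return min(1.0, prefix.count("a") / 3)


trace, curve, evaluations = greedy_optimize(toy_propose, toy_potential)
print(trace, curve, evaluations)
```

Even this toy needs seven potential evaluations for a three-chunk trace. With a real model, each evaluation is itself a batch of sampled completions, and each step has to wait for the previous choice, which is where the sequential dependency comes from.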

So the business implication is not: “run potential-optimized CoT in production tomorrow.” Please do not turn every invoice classifier into a Monte Carlo research cluster.

The better implication is diagnostic. If optimized traces can be smoother than natural traces, then failures in ordinary CoT are not inevitable properties of the task. They are partly properties of decoding, prompting, model behavior, and trace dynamics. That opens a practical design space:

Design question	Potential-based interpretation
Which parts of a trace should be retained?	Segments that raise downstream success probability
Which parts should be pruned?	Segments with flat or negative marginal contribution
When should reasoning stop?	When additional tokens no longer raise potential or begin to reduce it
Which examples should train smaller models?	Prefixes that transfer high-value insights, not merely polished explanations
How should reward be assigned?	Reward intermediate states that improve future correctness, not only final answers

This is where the paper becomes useful for builders. It does not hand over a cheap algorithm. It hands over a better target.
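The stopping question in the table above can be made concrete with a patience rule, borrowed from early stopping in model training. This is one possible design, not a method from the paper; `patience` and `min_gain` are assumed parameters, and in a live system the potentials would have to come from a cheaper proxy than full resampling.

```python
from typing import Sequence


def stop_index(
    potentials: Sequence[float], patience: int = 2, min_gain: float = 0.02
) -> int:
    """Index of the prefix to keep: the best prefix seen before `patience`
    consecutive steps fail to beat it by at least `min_gain`."""
    if not potentials:
        raise ValueError("need at least one potential value")
    best_idx = 0
    stale = 0
    for i in range(1, len(potentials)):
        if potentials[i] >= potentials[best_idx] + min_gain:
            best_idx, stale = i, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_idx


curve = [0.2, 0.5, 0.8, 0.79, 0.6, 0.9]
print(stop_index(curve, patience=2))  # 2: stop after the dip, keep the 0.8 prefix
print(stop_index(curve, patience=3))  # 5: more patience rides out the tangent
```

The two calls show the trade-off: a short patience cuts tangents early but can miss a later recovery, while a long one pays for every token of the detour.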

The 20% transfer result is the most operationally suggestive finding

The transferability section asks whether partial reasoning from a stronger model can help a weaker model solve problems it previously could not solve.

The answer is yes, with important boundaries.

The authors test cases where weaker models receive partial CoT traces from stronger models. In reasoning-model setups, Qwen3-0.6B receives partial traces from Qwen3-32B and also from GPT-OSS-20B. In non-reasoning setups, Qwen2.5 models receive partial CoT summaries generated from a stronger Qwen3 model. The paper reports that performance can improve quickly, including with as little as 20% of the stronger model’s CoT.

This matters because it suggests that at least some reasoning content is not purely model-private. A useful intermediate step can travel. A weaker model may fail not because it cannot complete the entire solution, but because it cannot discover the key prefix by itself.

For enterprise AI, this is more interesting than the usual “use a bigger model for hard tasks” advice.

A stronger model might be used to generate insight prefixes, scaffolds, or partial reasoning templates. A weaker model might then complete routine parts. This points toward hybrid architectures where expensive reasoning is concentrated at bottlenecks rather than sprayed across every token.
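A hand-off of this kind can be sketched in a few lines. Everything here is an assumption made for illustration: the trace is taken to be already split into steps, the 20% cut is rounded up to whole steps, and the prompt wording is invented. The paper's own truncation scheme may differ.

```python
import math
from typing import List


def take_prefix(steps: List[str], fraction: float) -> List[str]:
    """Keep the first `fraction` of a trace, rounded up to whole steps."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    # round() guards against floating-point noise before taking the ceiling.
    keep = math.ceil(round(len(steps) * fraction, 9))
    return steps[:keep]


def build_handoff_prompt(
    problem: str, strong_steps: List[str], fraction: float = 0.2
) -> str:
    """Prompt for the weaker model: the problem plus a partial trace to continue."""
    prefix = "\n".join(take_prefix(strong_steps, fraction))
    return (
        f"Problem:\n{problem}\n\n"
        f"Partial reasoning:\n{prefix}\n\n"
        "Continue the reasoning and give the final answer."
    )


steps = [f"step {i}" for i in range(1, 11)]
print(build_handoff_prompt("toy problem", steps))
```

With ten steps and the default fraction, only the first two travel to the weaker model. Whether those two carry the bottleneck insight is exactly what a potential estimate on the receiving model would have to confirm.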

But the interpretation needs discipline. Transferability does not prove that the weaker model “understands” the problem in the same way the stronger model does. It proves that conditioning on certain partial traces changes the probability of correct completion. In some cases, the prefix may provide genuine conceptual scaffolding. In others, it may leak enough structure to narrow the answer space. In still others, the receiver model’s own competence determines whether the prefix is useful.

So the correct business inference is not “small models can reason like large models if we give them 20% of the answer.” That is the LinkedIn version, and therefore should be handled with gloves.

The better inference is: high-value reasoning prefixes can be reusable operational assets.

What this changes for enterprise LLM evaluation

Most enterprise evaluation still operates at the level of final outputs: accuracy, preference score, human rating, pass/fail, escalation rate, resolution rate. These are necessary metrics. They are not enough for reasoning-heavy workflows.

The paper suggests a more granular evaluation layer.

Evaluation layer	Traditional question	Potential-style question
Final answer	Did the model get it right?	Was correctness supported by rising probability through the trace?
Reasoning trace	Does the explanation look plausible?	Which segments increased or decreased success probability?
Model comparison	Which model has higher aggregate accuracy?	Which model produces fewer tangents and fewer unsupported late spikes?
Prompt design	Which prompt gives better answers?	Which prompt creates more monotonic potential curves?
Agent orchestration	Which agent should solve the task?	Which agent should discover bottleneck insights, and which should complete routine steps?

For business systems, the most immediate use is not full potential estimation on every live query. That would usually be too expensive. The more realistic path is offline diagnostic auditing.

Take representative tasks. Generate reasoning traces. Estimate prefix usefulness through resampling or cheaper approximations. Identify where successful traces actually gain value. Then use those findings to redesign prompts, routing, stopping rules, training examples, and escalation policies.
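The "identify where successful traces actually gain value" step might look like this once potential curves exist. The report fields and the task identifiers are hypothetical, and the curves are toy data rather than results.

```python
from typing import Dict, List, Tuple


def largest_gain(potentials: List[float]) -> Tuple[int, float]:
    """Index of the step with the biggest rise in potential, and the size of that rise."""
    if len(potentials) < 2:
        raise ValueError("need at least two potential values")
    deltas = [after - before for before, after in zip(potentials, potentials[1:])]
    idx = max(range(len(deltas)), key=deltas.__getitem__)
    return idx, deltas[idx]


def audit(curves: Dict[str, List[float]]) -> Dict[str, dict]:
    """Per task: which step carried the most value and how far into the trace it sits."""
    report = {}
    for task_id, curve in curves.items():
        idx, gain = largest_gain(curve)
        report[task_id] = {
            "step": idx,
            "position": (idx + 1) / (len(curve) - 1),  # 1.0 means the very last step
            "gain": gain,
        }
    return report


# Toy curves for two hypothetical tasks.
curves = {
    "task-1": [0.1, 0.2, 0.9, 1.0],  # value arrives mid-trace
    "task-2": [0.1, 0.1, 0.1, 1.0],  # value arrives only at the end
}
for task_id, row in audit(curves).items():
    print(task_id, row)
```

A report where most `position` values sit at 1.0 is a sign that the traces are decorating the answer rather than producing it, which is worth knowing before anyone pays for longer reasoning.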

In other words, potential is less like a production metric dashboard and more like a microscope. You do not stare through a microscope while driving. You use it to understand what you are building before you put it on the road.

The business value is cheaper diagnosis, not magical reasoning

The paper’s findings are especially relevant to three kinds of AI systems.

First, multi-agent workflows. In agentic systems, one model may plan, another may retrieve, another may verify, and another may produce the final output. Potential suggests that the key question is not which agent writes the longest plan. It is which agent moves the workflow into a higher-value state. A planning agent that creates elegant but low-potential subtasks is not helping. It is decorating the queue.

Second, distillation and model cascades. If partial CoT transfers, then expensive models may be used selectively to generate insight-bearing prefixes, while cheaper models complete more mechanical portions. The ROI question becomes: which prefixes are worth buying from the expensive model? Potential gives a way to define “worth” more precisely.

Third, governance and assurance. In regulated or high-stakes workflows, final correctness is not enough. Teams need to know whether the answer emerged from stable reasoning or from a fragile path. A late spike should be treated differently from a gradual potential climb. Both may produce the same answer. They should not receive the same trust.

This is where the paper is most useful for Cognaptus-style automation. It gives a language for separating reasoning theater from reasoning contribution. That separation is essential if companies want agents that can be audited, improved, and trusted under operational pressure.

Where the paper stops, and where business interpretation begins

The paper is careful about its own scope, and the business reader should be equally careful.

Most of the main qualitative analysis focuses on competition-level mathematics, especially AIME. The appendix extends the analysis to MATH-500, HumanEval, and GPQA-Diamond, which helps show that the patterns are not purely one-benchmark artifacts. Still, enterprise tasks are different. A procurement workflow, legal review, customer service escalation, or financial compliance check may have different trace dynamics.

Potential itself is also expensive to estimate directly. It depends on repeated conditional sampling from partial prefixes. That is feasible for research and offline audits. It is not automatically feasible for every production request.

There is also an interpretation boundary. A rising potential curve tells us that a prefix improves the probability of correctness. It does not fully explain the internal computation that caused the improvement. Potential is a behavioral diagnostic, not a complete mechanistic explanation of the model.

Finally, human interpretability remains imperfect. Some insights are readable. Some jumps are not. The model may find certain tokens operationally useful for reasons that do not map cleanly onto human reasoning categories. That is not a defect in the metric. It is a reminder that LLM cognition is not obligated to flatter our metaphors.

The practical framework: audit traces as state transitions

A useful way to apply the paper is to treat every reasoning workflow as a sequence of state transitions.

Step	Audit question	Action if weak
Initial prompt	Is the starting potential already high enough?	Skip CoT or use direct answering
Early reasoning	Does the first segment raise potential?	Improve task framing or retrieval
Middle reasoning	Are there tangents or flat regions?	Add stopping rules, pruning, or verifier checkpoints
Key insight	Which segment creates the largest upward movement?	Preserve it as a reusable scaffold
Final answer	Did correctness emerge gradually or via late spike?	Lower confidence if unsupported
Cross-model transfer	Can a cheaper model use the same prefix?	Build cascade or distillation pipeline

This framework is not a replacement for standard evaluation. It is an additional lens for reasoning-heavy tasks where the path matters.

For many enterprise uses, the most valuable discovery may be negative: the model’s explanation is not doing much. That is still useful. It tells the team not to pay for long reasoning, not to show it to users as evidence, and not to mistake verbosity for reliability.

A model that says less but moves potential more is better than a model that writes an essay and arrives by luck. This should not be controversial. Somehow, in AI product design, it still often is.

Conclusion: CoT is not a story; it is a landscape

Chain-of-thought is often presented as a transparent reasoning narrative. The paper makes that view harder to maintain.

A trace is not a smooth staircase. It is a landscape with ridges, pits, shortcuts, dead ends, and occasional trapdoors. Some segments are genuine insights. Some are tangents. Some are strange model-specific switches. Some are just pre-answer fog before a lucky guess.

The potential framework matters because it moves the discussion from aesthetics to measurement. It asks whether each partial trace actually improves the odds of correctness. That is the question AI builders should have been asking all along, but the industry was busy admiring long answers in a nice font.

For enterprise systems, the lesson is straightforward:

Do not trust reasoning because it is long.

Do not trust reasoning because it sounds coherent.

Do not trust reasoning because the final answer passed once.

Audit which parts of the reasoning change the probability of being right.

That is where the value is. Not in the ceremony of thinking step by step, but in identifying the steps that actually think.

Cognaptus: Automate the Present, Incubate the Future.

¹ Gregor Bachmann, Yichen Jiang, Seyed Mohsen Moosavi Dezfooli, and Moin Nabi, “The Potential of CoT for Reasoning: A Closer Look at Trace Dynamics”, arXiv:2602.14903, 2026.
