Thinking is expensive.

That sounds obvious when the thinker is a human consultant billing by the hour. It sounds less obvious when the thinker is a large reasoning model producing long chains of thought, checking itself, trying another route, doubting the first answer, then generously spending another few thousand tokens to arrive at the same wrong place with better punctuation.

For the last year, the AI industry has treated longer reasoning as a sign of seriousness. More steps. More reflection. More alternatives. More “aha moments.” The implicit theory is simple enough: if a model thinks longer, it should make fewer mistakes. That theory is attractive because it resembles human problem-solving. It is also operationally convenient, because it lets teams buy reliability with inference tokens instead of redesigning systems.

The paper behind this article, “FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models,” is a useful slap on the wrist.[1] Its central claim is not merely that long reasoning costs more. Everyone with an API bill already knows that. The sharper claim is that, in large reasoning models, later alternative solutions often fail to repair wrong first answers and can even corrupt correct first answers. The model does not always deliberate its way out of error. Sometimes it cultivates error.

The authors call this pattern “The First is The Best.” They then explain it through a mechanism called the Forest of Errors: errors in a reasoning trace do not behave like isolated slips. They branch, reproduce, and become reused artifacts inside later reasoning. Once a wrong assumption or wrong intermediate value enters the context, the model may build an entire little bureaucracy around it. And like most bureaucracies, it becomes harder to remove once it has forms, subforms, and a committee.

This article uses the paper’s own mechanism-first logic. The interesting question is not “does RED improve benchmark scores?” It does. The interesting question is why extra reasoning becomes a liability in the first place, and what that means for enterprise AI systems that currently treat “think step by step” as a universal reliability button.

The first answer is not always right, but later answers are often worse

The paper begins with a deceptively simple empirical question: when a large reasoning model generates multiple solutions in one reasoning trace, do the later solutions actually help?

The answer, across the paper’s reported settings, is mostly no.

The authors distinguish between the model’s First solution and its later Subs solutions. They examine how Subs affect the answer relative to First. The resulting categories are operationally important:

| Relationship between First and Subs | What it means | Why it matters |
| --- | --- | --- |
| Correct First → incorrect final answer | Later reasoning misleads the model | Extra thinking actively damages a good trajectory |
| Incorrect First → still incorrect later | Later reasoning fails to repair the mistake | Extra thinking burns tokens without reliability gain |
| Incorrect First → correct later | Later reasoning successfully repairs the mistake | The hoped-for benefit of alternative exploration |
| Correct First → correct later, but unstable | Later solution lands on the right answer but has fragile reasoning | Accuracy hides latent risk |
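To make the categories concrete, here is a minimal Python sketch of how a set of traces could be tallied into them. It is not the authors' evaluation code: the function names, the category labels, and the `stable` flag (a stand-in for whatever stability test is applied to a correct later solution) are all assumptions made for illustration.

```python
from collections import Counter

def classify(first_correct: bool, final_correct: bool, stable: bool = True) -> str:
    """Map one trace onto the First/Subs relationships in the table (labels are hypothetical)."""
    if first_correct and not final_correct:
        return "misled"        # Correct First -> incorrect final answer
    if not first_correct and not final_correct:
        return "unrepaired"    # Incorrect First -> still incorrect later
    if not first_correct and final_correct:
        return "repaired"      # Incorrect First -> correct later
    return "correct_stable" if stable else "correct_unstable"

def transition_rates(traces):
    """Share of each category, conditioned on whether First was correct.

    `traces` is an iterable of (first_correct, final_correct, stable) tuples.
    """
    counts, totals = Counter(), Counter()
    for first_ok, final_ok, stable in traces:
        counts[(first_ok, classify(first_ok, final_ok, stable))] += 1
        totals[first_ok] += 1
    return {label: n / totals[first_ok] for (first_ok, label), n in counts.items()}
```

Conditioning on First matters here: a repair rate only means something relative to the traces where First was wrong, and a misleading rate only relative to the traces where First was right.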

The striking result is the asymmetry. When First is wrong, Subs usually do not save it. The paper reports that an incorrect First remains incorrect through Subs with a probability in the range of 75.4% to 82.8% across the reported model-dataset combinations. Successful repair is much rarer, with maximum success rates reported around 2.0% to 7.1%.

The more uncomfortable result is that Subs can damage a correct First. The paper reports cases where later reasoning misleads the model from a correct first solution to an incorrect answer, at rates of up to 21.2% in the reported table; the main text describes the same effect as reaching 18.8% in the observed settings. The exact figure depends on the dataset-model view being discussed, but the conclusion is stable: later reasoning is not a neutral add-on. It can be adversarial to the model’s own earlier correctness.

This is where the paper becomes more useful than another “LLMs overthink” complaint. The authors do not stop at the visible failure. They ask why later reasoning becomes worse.

That mechanism is the Forest of Errors.

A reasoning mistake is not a dot; it is a branching structure

In normal product discussions, model errors are often described as if they were independent events: one hallucinated citation, one wrong number, one missed condition. That framing is convenient, because independent errors are easy to count.

The paper argues that reasoning errors behave differently. In a multi-step trace, an early wrong artifact can become the parent of later wrong artifacts. A mistaken assumption can induce a wrong formula. The wrong formula can produce a wrong intermediate value. That value can then be reused in a later subproblem. The result is not a list of errors but a forest: multiple trees, each with root errors, branches, and leaves.

The authors formalize this by identifying error nodes in chronological order and assigning parent-child relationships between them. A later error becomes a child of an earlier error if it reuses or depends on the earlier wrong artifact. If no earlier error explains it, it becomes a new root node. In other words, the model’s reasoning trace is treated as an error-propagation graph, not just a text transcript.

That is the paper’s key conceptual move.

| Error concept | Plain meaning | Operational interpretation |
| --- | --- | --- |
| Root error | A new wrong artifact that is not caused by a prior error | The mistake that should be caught early |
| Child error | A later error induced by a previous wrong artifact | The cost of failing to repair the root |
| Forest size | Number of error trees | How many independent wrong structures exist |
| Nodes per tree | Average number of errors within each tree | How much each root error reproduces |
| Depth per tree | How many layers error propagation reaches | How long wrong artifacts remain influential |
| Reproduction rate | How quickly error nodes generate child errors | How aggressively errors expand during reasoning |
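The structural metrics above can be illustrated with a small Python sketch. It assumes the parent of each error has already been decided (in the paper that step relies on a scoring protocol, which is not reproduced here), and it counts depth as the number of layers including the root, which is an assumption about the convention. Reproduction rate is left out because this article does not state its exact formula. The names `ErrorNode`, `build_forest`, `tree_stats`, and `forest_metrics` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ErrorNode:
    idx: int                      # chronological position among the errors
    parent: Optional[int] = None  # idx of the earlier error it reuses, or None for a root
    children: list = field(default_factory=list)

def build_forest(parents):
    """parents[i] is the index of the earlier error that error i depends on, or None."""
    nodes = [ErrorNode(i, p) for i, p in enumerate(parents)]
    for node in nodes:
        if node.parent is not None:
            if node.parent >= node.idx:
                raise ValueError("a parent must precede its child chronologically")
            nodes[node.parent].children.append(node.idx)
    return nodes

def tree_stats(nodes, root):
    """Return (node count, depth in layers) for the tree rooted at `root`."""
    size, depth, stack = 0, 0, [(root, 1)]
    while stack:
        idx, level = stack.pop()
        size += 1
        depth = max(depth, level)
        stack.extend((child, level + 1) for child in nodes[idx].children)
    return size, depth

def forest_metrics(parents):
    nodes = build_forest(parents)
    roots = [n.idx for n in nodes if n.parent is None]
    if not roots:
        return {"forest_size": 0, "nodes_per_tree": 0.0, "depth_per_tree": 0.0}
    stats = [tree_stats(nodes, r) for r in roots]
    return {
        "forest_size": len(roots),
        "nodes_per_tree": sum(s for s, _ in stats) / len(roots),
        "depth_per_tree": sum(d for _, d in stats) / len(roots),
    }
```

For example, `forest_metrics([None, 0, 1, None, 0])` describes two trees: one root with three descendants across three layers, and one isolated root.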

The most important managerial lesson is almost embarrassingly old-fashioned: fixing symptoms is not the same as fixing causes.

The paper’s manual correction experiments show that correcting child or grandchild errors does not stop the parent error from spawning new downstream mistakes. Correcting the root error, even after it has already produced children, significantly slows later error generation. That matters because many “self-correction” workflows in LLM applications are closer to child correction than root correction. They ask the model to review its final answer, adjust a surface inconsistency, or produce another route. The root assumption may remain untouched, still sitting there like a bad cell reference in a financial model.

This explains why later reasoning can look reflective while remaining structurally trapped.

Subs are not just longer; they grow larger error forests

The paper then measures how the Forest of Errors differs between First and Subs.

On Qwen3-8B-thinking across AIME25, MATH500, GSM8K, and GPQA, the authors report that Subs have larger and more reproductive error structures than First. For example, in the aggregated table, AIME25 shows First/Subs forest size of 6.9/8.1, nodes per tree of 7.1/8.4, depth per tree of 4.9/5.7, and reproduction rate of 0.084/0.126. Similar patterns appear on MATH500, GSM8K, and GPQA, where the reproduction-rate increase for Subs is especially visible.

The paper interprets this as evidence that later reasoning does not simply add more opportunity for correction. It adds more opportunity for error reproduction.

A useful business analogy is a project review meeting. The first analyst makes a wrong assumption. The second analyst does not identify the assumption; instead, they build an alternative spreadsheet using the same hidden premise. The third analyst checks the totals and writes a clearer explanation. The deck now looks more reviewed, more mature, and more expensive. It is also wrong in a more institutionalized way.

That is the difference between accuracy checking and error ecology.

The paper’s rollback-sampling experiment makes this point sharper. The authors take reasoning trajectories that originally ended with correct answers, roll back the generation state at different percentages of the trace, and resample the continuation 100 times. First remains highly stable, with reported error rates below 1% across rollback points, averaging about 0.69%. Subs are far more fragile: Solution 2 averages about 14.85%, Solution 3 about 21.76%, and Solution 4 about 18.0%. Aggregated Subs average about 18.20%.

This is an important distinction. A later solution may be correct in a single run, but its intermediate state may be less locked onto the correct answer. It is correct in the way a coin can land on its edge in a marketing demo: possible, impressive, and not something to build a production workflow around.
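The rollback-sampling procedure described above can be sketched in a few lines of Python. This is an outline under stated assumptions, not the authors' harness: `resample` and `is_correct` are hypothetical callables standing in for a model continuation call and an answer checker, and the toy sampler at the bottom exists only so the sketch runs without a model.

```python
import random

def rollback_error_rate(trace, fractions, resample, is_correct, n_samples=100):
    """For each rollback fraction, keep that share of the trace, resample the
    continuation n_samples times, and report the share of wrong final answers."""
    rates = {}
    for frac in fractions:
        prefix = trace[: int(len(trace) * frac)]
        wrong = sum(not is_correct(resample(prefix)) for _ in range(n_samples))
        rates[frac] = wrong / n_samples
    return rates

# Toy stand-in for a model call (an assumption, not a real API): the longer
# the retained prefix, the less likely the continuation drifts to a wrong answer.
rng = random.Random(0)

def toy_resample(prefix):
    p_fail = 0.3 * (1 - len(prefix) / 10)
    return prefix + ["wrong" if rng.random() < p_fail else "right"]

rates = rollback_error_rate(
    list("abcdefghij"), [0.2, 0.5, 0.8], toy_resample, lambda t: t[-1] == "right"
)
```

The quantity of interest is the error rate per rollback point for traces that originally ended correctly. A solution that is correct but fragile shows up as a high rate here even though its single-run accuracy looks fine.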

Entropy is not enough; entropy plus variance marks dangerous moments

The next question is where root errors come from.

The paper investigates token-level uncertainty using entropy and entropy variance. Entropy alone measures uncertainty over the next token distribution. But high entropy by itself does not necessarily mean the model is making an error. Sometimes the model simply has multiple valid ways to express the next step. Entropy variance alone is also insufficient; valid reasoning can involve transitions between different modes of expression or calculation.

The paper’s useful claim is about the combination. Root errors are most associated with moments where both entropy and entropy variance are high.

The authors divide events into four regions:

| Region | Interpretation | Paper’s finding |
| --- | --- | --- |
| Low entropy / low variance | Stable continuation | Lowest root-trigger risk |
| High entropy only | Many possible next tokens | More errors, often shallow |
| High variance only | Unstable local transition | Deeper error positioning, not maximal root creation |
| High entropy / high variance | Uncertain and unstable transition | Highest root-trigger rate |
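As a rough illustration of the four regions, the Python sketch below computes token entropy, tracks its mean and variance over a trailing window, and assigns a region. Several details are assumptions rather than the paper's specification: entropy is measured in nats, the window summary uses the mean entropy and the population variance, and the thresholds `h_thresh` and `v_thresh` are placeholders that would need to be tuned.

```python
import math
from statistics import pvariance

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def window_signals(entropies, window=15):
    """(mean entropy, entropy variance) over the trailing window at each step."""
    signals = []
    for t in range(len(entropies)):
        recent = entropies[max(0, t - window + 1): t + 1]
        signals.append((sum(recent) / len(recent), pvariance(recent)))
    return signals

def region(mean_h, var_h, h_thresh, v_thresh):
    """Assign one of the four regions; both thresholds are hypothetical."""
    high_h, high_v = mean_h > h_thresh, var_h > v_thresh
    if high_h and high_v:
        return "high entropy / high variance"
    if high_h:
        return "high entropy only"
    if high_v:
        return "high variance only"
    return "low entropy / low variance"
```

In a monitoring setting, only steps that land in the high/high region would be candidates for intervention; the other three regions pass through untouched.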

In the appendix, the paper reports node-generation statistics for the default window length $L = 15$. The high-high region has the highest root-trigger rate for both First and Subs: 0.187 for First and 0.264 for Subs. A predictive ablation also supports the joint signal: using entropy alone gives AUC 0.642/0.658 for First/Subs, variance alone gives 0.615/0.631, entropy plus variance gives 0.724/0.742, and adding the interaction term raises AUC to 0.816/0.835.

The practical interpretation is not “monitor entropy and panic.” The paper is more specific than that. The danger zone is a local reasoning state where the model is both uncertain and unstable. That is the point where a root error is more likely to appear, and root errors are precisely the expensive ones because they can seed downstream branches.

For enterprise systems, this suggests a different form of reasoning orchestration. Instead of letting the model produce a long trace and then asking for a final self-check, the system should monitor the reasoning process near likely root-error moments. That does not require exposing private chain-of-thought to end users. It does require the system designer to treat decoding dynamics as reliability signals, not just as invisible plumbing.

Self-reflection often edits the story, not the structure

A natural objection is that reasoning models are trained to reflect. If they make a mistake, should they not catch it later?

The paper tests this at two levels.

At the intra-solution level, the authors measure reflection frequency, completeness, and depth. Frequency asks how often the model initiates reflection. Completeness asks whether the reflection actually performs the corrective actions implied by the situation. Depth asks whether the model reaches the underlying cause rather than touching a surface symptom.

Their reported pattern is grim but familiar: reflection degrades in later solutions. For Qwen3-8B-thinking, comparing the first to the last solution shows a 62.5% reduction in reflection frequency, plus large decreases in completeness and depth. In plain language, later reasoning may contain less useful self-correction precisely when the error forest has had more time to grow.

At the inter-solution level, the authors manually inject prompts that signal errors and observe the model’s correction behavior. They classify responses into true correction, refusal to correct, and fake correction. In First, true correction dominates at 67.1%, while fake correction and refusal are 32.4% and 0.5%. In Subs, the pattern nearly reverses: fake correction reaches 64.2% and refusal reaches 31.1%, which leaves only about 4.7% for true correction if the three categories are exhaustive.

Fake correction is the dangerous category. The model appears to revise itself, but later reasoning continues to depend on the same wrong artifact. It changes the costume, not the actor.

This has a direct implication for AI governance. A review step that merely asks the same model to “check your answer carefully” may create a false assurance layer. The output becomes more fluent, more apologetic, and more review-shaped. But if the root error remains embedded in the context, the review is mostly theater. The model has performed epistemic compliance. Very moving. Not very useful.

RED works because it attacks both the root and the surplus reasoning

The paper’s proposed method, RED, stands for Refine First and Discard Subs. The name is unusually literal, which is refreshing. No mythical animal required.

RED has two components.

The first component, Refining First, intervenes during the first solution at points that look prone to root-error generation. It maintains a sliding entropy window and monitors entropy variance plus top-$K$ entropy behavior. When the trigger condition is met, RED appends a short negative prompt such as “No, I made a mistake” after the stored KV cache to generate a negative sampling branch. It then subtracts the logits associated with that likely erroneous direction.

In simplified terms, the system asks: “What direction would the model take if it were about to go wrong?” Then it steers away from that direction.
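The steering idea can be shown with a toy Python sketch. The article does not give RED's exact update rule, so the form used here, subtracting a scaled copy of the negative-branch logits from the main-branch logits, is an assumption, as are the function name `steer_away` and the weight `alpha`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steer_away(main_logits, negative_logits, alpha=0.5):
    """Down-weight tokens that the negative (likely erroneous) branch favors.

    alpha is a hypothetical strength parameter, not a value from the paper.
    """
    return [m - alpha * n for m, n in zip(main_logits, negative_logits)]

# Toy three-token vocabulary: the main branch slightly prefers token 0,
# but the negative branch strongly prefers it, marking it as the risky direction.
main = [2.0, 1.9, 0.1]
negative = [3.0, 0.5, 0.1]
adjusted = steer_away(main, negative)
probs = softmax(adjusted)
```

After the subtraction the highest-scoring token moves from index 0 to index 1: the direction the negative branch favored is penalized, and a nearly tied alternative takes over.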

The second component, Discarding Subs, tries to stop before later alternative solutions begin contaminating the answer. RED periodically probes the current hidden state using multiple short prompt templates. It exits only when two conditions hold:

  1. within each prompt template, a dominant answer appears with sufficient frequency;
  2. across templates, the dominant answers agree.

This dual-consistency mechanism is important because a single probe can be prompt-sensitive. The paper uses four semantically equivalent but stylistically different answer-extraction prompts, then requires cross-prompt agreement. That makes the early stop less likely to be an artifact of one phrasing.
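The two exit conditions translate naturally into code. The following Python sketch assumes each prompt template has already been used to sample several candidate answers; the data layout, the function names, and the `min_share` threshold are illustrative assumptions rather than values from the paper.

```python
from collections import Counter

def dominant_answer(answers, min_share):
    """Return the most frequent answer if its share reaches min_share, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= min_share else None

def should_exit(probe_results, min_share=0.8):
    """probe_results maps each prompt template to the answers sampled under it.

    Returns the agreed answer when both conditions hold, otherwise None.
    """
    dominants = [dominant_answer(a, min_share) for a in probe_results.values()]
    # Condition 1: every template must have a sufficiently dominant answer.
    if not dominants or any(d is None for d in dominants):
        return None
    # Condition 2: the dominant answers must agree across templates.
    if len(set(dominants)) != 1:
        return None
    return dominants[0]
```

Returning the agreed answer rather than a boolean lets the caller stop generation and emit the result in one step. A `None` result simply means reasoning continues until the next probe.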

RED’s logic can be summarized as follows:

| Problem found in the paper | RED response | Business interpretation |
| --- | --- | --- |
| Root errors drive later branches | Intervene during First near high-risk entropy states | Prevent expensive downstream repair |
| Subs often fail to correct First | Stop before generating unnecessary alternatives | Reduce token burn from low-value deliberation |
| Subs can corrupt correct First | Discard later solution paths once answer is stable | Protect good early trajectories |
| Single answer probes may be brittle | Require internal and cross-prompt consistency | Avoid premature stopping from prompt artifacts |

This is why the mechanism-first reading matters. RED is not just a shorter-output trick. It is a control system built around the paper’s diagnosis of error growth.

The experimental evidence is strongest when read as mechanism evidence

The main results show RED improving both accuracy and efficiency across six backbone models and five benchmarks: AIME 2024, AIME 2025, MATH500, GSM8K, and GPQA-Diamond. The reported models include Qwen3-thinking models, DeepSeek-R1-Distill-Qwen variants, and DeepSeek-R1-Distill-Llama variants.

The headline result is that RED improves Pass@1 by 0.3 to 5.6 points, corresponding to 3.2% to 19.0% relative gains over vanilla in the authors’ summary, while reducing token consumption by 37.7% to 70.4%.

Those numbers are good. But the more interesting evidence is not just the benchmark table. It is the FoE metric table.

On AIME25, RED reduces FoE metrics by 41.0% to 68.0% across reported backbones. On DeepSeek-R1-Distill-Qwen-32B, for example, RED improves Pass@1 from 58.9 to 63.3, while reducing forest size from 7.0 to 3.2, nodes per tree from 7.8 to 4.6, depth from 5.8 to 3.6, and reproduction rate from 0.095 to 0.040.

That matters because a method can improve accuracy by luck, task fit, or benchmark artifacts. Reducing the internal error-forest metrics supports the authors’ mechanism: RED improves outcomes because it suppresses the growth structure that made later reasoning risky.

The appendix extends this interpretation. On MATH500, RED improves Pass@1 by +1.1 to +2.1 while reducing FoE metrics by 37.1% to 68.0%. On GPQA-Diamond, it improves Pass@1 by +1.0 to +1.7 and reduces FoE metrics by 38.6% to 68.1%. These are not second theses; they are robustness and domain-shift checks. They show that the FoE-pruning pattern is not confined to one math benchmark.

The ablation study also matters. Removing Discarding Subs hurts RED because later solutions can still contaminate the improved first trajectory and increase token length. Removing Refining First while keeping Discarding Subs often helps relative to vanilla, which supports the idea that stopping harmful later reasoning is already useful. But full RED performs best because it does both: improves the first trajectory and prevents later deterioration.

In deployment language: do not merely cap the reasoning length. First improve the path you keep, then stop the paths you do not need.

The ROI is not “shorter answers”; it is fewer expensive wrong trajectories

For businesses, the lazy takeaway would be: “Use fewer tokens.”

That is not wrong, but it is incomplete. Token reduction is only one surface benefit. The deeper business point is that long reasoning can create expensive wrong trajectories: outputs that consume more inference budget, take longer to return, and may become less reliable because the model has had more room to reuse its own mistakes.

The paper points toward a better operating model for reasoning systems:

| System design question | Common default | Better RED-inspired design |
| --- | --- | --- |
| When should the model reason longer? | Whenever the task seems complex | Only while answer stability has not emerged |
| How should uncertainty be handled? | Ask for another solution | Monitor high-risk uncertainty states and intervene early |
| How should self-checking work? | Final review prompt after full reasoning | Root-focused correction before error branches expand |
| How should cost be optimized? | Reduce max tokens globally | Stop dynamically when dual consistency is reached |
| How should reliability be evaluated? | Final answer accuracy | Accuracy plus stability, token cost, and error propagation risk |

This is especially relevant for agentic workflows. Agents often loop: plan, execute, reflect, replan, critique, revise. That loop looks like intelligence when it works. It looks like a committee rewriting the same mistaken assumption when it fails.

The paper does not prove that every agentic loop is bad. It does suggest that open-ended reflection should not be treated as automatically corrective. In workflows involving financial analysis, compliance review, scientific extraction, legal summarization, or operational planning, a system should ask whether each extra reasoning segment is improving the answer or merely giving existing errors more descendants.

A practical enterprise stack might therefore separate reasoning into three layers:

  1. Routing: decide whether the task needs reasoning at all.
  2. Monitored reasoning: during reasoning, detect high-risk uncertainty/variance states where root errors may emerge.
  3. Stability-based stopping: stop when independent answer probes converge, rather than waiting for the model to finish narrating its intellectual autobiography.

That last phrase is not a technical term. It should be.
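For readers who think in control flow, the three layers above can be sketched as a single loop. This is a structural outline only: every callable passed to `run_task` is a hypothetical hook that a real system would have to implement against its own model-serving stack, and nothing here comes from the paper's code.

```python
def run_task(task, needs_reasoning, direct_answer, reasoning_steps,
             is_risky, intervene, probe):
    """Return (answer, number of reasoning steps consumed).

    All hooks are placeholders:
      needs_reasoning(task) -> bool          routing decision
      direct_answer(task)   -> answer        cheap path without reasoning
      reasoning_steps(task) -> iterable      yields reasoning segments
      is_risky(step)        -> bool          uncertainty/variance monitor
      intervene(step)       -> step          correction applied at risky points
      probe(history)        -> answer|None   stability check over steps so far
    """
    # Layer 1: routing. Skip reasoning entirely when the task does not need it.
    if not needs_reasoning(task):
        return direct_answer(task), 0

    history = []
    for step in reasoning_steps(task):
        # Layer 2: monitored reasoning. Intervene where root errors may emerge.
        if is_risky(step):
            step = intervene(step)
        history.append(step)
        # Layer 3: stability-based stopping. Exit once the probe converges.
        answer = probe(history)
        if answer is not None:
            return answer, len(history)
    return None, len(history)
```

The design choice worth noting is that the stop condition is checked inside the loop rather than after it, so the remaining reasoning segments are never generated once the answer has stabilized.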

What the paper directly shows, and what business users should not overclaim

The paper’s results are valuable, but they are not a universal law of cognition.

First, the evidence is concentrated on mathematical and scientific reasoning benchmarks: AIME, MATH500, GSM8K, and GPQA-Diamond. These are useful stress tests for structured reasoning, but they are not the same as enterprise document review, customer support, sales forecasting, or policy analysis. The mechanism may transfer, but the paper does not prove transfer across all business tasks.

Second, the evaluated backbones are open-source LRMs from the Qwen, DeepSeek-R1-Distill-Qwen, and DeepSeek-R1-Distill-Llama families. The results should not be automatically extended to every proprietary reasoning model or every future architecture. Model-specific training can change how reflection, uncertainty, and answer convergence behave.

Third, FoE construction depends on error identification and parent-child scoring. The paper makes a serious effort to validate the LLM-based parent-child scoring protocol against human judgments, reporting strong agreement, including overall Spearman correlation of 0.88, MAE of 0.28, and threshold F1 of 0.92. That supports the method, but it does not remove all judgment dependence. Error genealogy is harder than counting final answers.

Fourth, RED introduces extra operations during decoding. The authors report a worst-case latency overhead of about 4.6% when early stopping is disabled, and argue that active early exit more than offsets that overhead. This is plausible in their setting, especially given the large token reductions. But production latency depends on batching, serving architecture, hardware, model size, and whether intermediate probing can be implemented efficiently.

So the correct business inference is not “always use RED exactly as written.” The correct inference is: reasoning length should be governed by monitored stability and root-error risk, not by the hope that more reflection will magically become more truth.

Reasoning should become a controlled resource, not a status symbol

The paper’s deeper value is that it changes how we should talk about reasoning.

The industry often treats reasoning as a capability tier. A model that thinks longer feels more advanced. A model that shows more intermediate work feels more trustworthy. A model that tries multiple solutions feels more diligent.

But diligence is not the same as reliability. In the paper’s evidence, later alternatives often fail to repair wrong first answers, sometimes corrupt correct first answers, and frequently carry larger error forests. The failure mode is not merely verbosity. It is structured error propagation.

For Cognaptus-style business automation, this points to a design principle:

Use reasoning where it changes the decision, control it where it creates risk, and stop it when stability has already arrived.

That principle is less glamorous than “give every workflow an autonomous reasoning agent.” It is also more likely to survive contact with latency budgets, audit requirements, and users who do not enjoy waiting for a model to dramatically rediscover arithmetic.

The first answer is not sacred. But the paper shows that later answers are not innocent. In large reasoning models, extra thinking can be a reliability mechanism, a cost center, or an error amplifier. The difference depends on whether the system understands where errors grow.

And if it does not, the model may not be thinking harder.

It may just be giving the mistake a forest.

Cognaptus: Automate the Present, Incubate the Future.


[1] Kehan Jiang, Haonan Dong, Zhaolu Kang, Zhengzhou Zhu, and Guojie Song, “FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models,” arXiv:2604.02967, 2026. https://arxiv.org/pdf/2604.02967