A manager asks an AI system for a risk assessment. It gives a plausible answer. The manager asks again with a slightly different prompt. Another plausible answer appears, with different reasoning. Ask five more times and the system scatters clues across the attempts like a consultant who has read the documents but refuses to assemble the memo in one draft.
The obvious response is to sample more and pick the best. That is the comforting lottery view of LLM reasoning: buy enough tickets, one answer will be right. The less comforting view is that hard reasoning is often not a lottery. It is a construction process. The correct solution may be distributed across several flawed attempts, and the system needs a procedure for combining useful fragments without drowning them in noise.
That is the real point of Algorithmic Thinking Theory by MohammadHossein Bateni, Vincent Cohen-Addad, Yuzhou Gu, Silvio Lattanzi, Simon Meierhans, and Christopher Mohri.[1] The paper is not just another “LLMs reason better when they think longer” essay. We have enough of those; the shelf is full. Its contribution is more specific: it asks what kind of algorithm is being run when we call a model repeatedly, feed previous answers back into context, and ask it to synthesize a better answer.
The answer is surprisingly business-relevant. If enterprise AI systems are moving from one-shot chatbots to multi-step agents, then inference is no longer a single model call. It is a workflow design problem.
The failure of best-of-k is the clue
The paper begins from a now-familiar puzzle in advanced reasoning. A model’s pass@1 accuracy on a hard problem may be low, while pass@k can be much higher. That means the model can sometimes produce a correct answer, but not reliably on the first attempt. So far, nothing shocking.
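As a rough illustration of why pass@k can sit far above pass@1, assume each attempt on a given problem succeeds independently with the same probability `p`. That is an idealization for this sketch, not a claim from the paper, and the function name `pass_at_k` is illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k attempts is correct.

    ASSUMPTION: attempts are independent and each succeeds with
    probability p. Real model samples are usually correlated.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be >= 1")
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # A low single-shot success rate still leaves room for a high pass@k.
    for k in (1, 4, 32):
        print(f"p=0.05, k={k}: {pass_at_k(0.05, k):.3f}")
```

Note what this quantity does not say: it only measures whether a correct answer appears somewhere among the `k` attempts, not whether anything downstream can find or use it.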
The interesting part is what happens next. Simply sampling many attempts and selecting the best does not fully unlock the model’s capability on very hard tasks. The paper cites prior work where best-of-32 on IMO 2025 problems reached only 31.6–38.1% accuracy for leading models, while a more structured verification-and-refinement pipeline reached a reported 85.7% accuracy, solving 5 out of 6 problems. More attempts do not close that gap. Orchestration does.
That distinction matters because many production AI teams still treat inference-time scaling as a larger sampling budget. Run the model more times. Vote. Ask a judge model. Maybe add a reflection step. Then hope the dashboard improves.
The paper’s correction is sharper: repeated model calls should be treated as an algorithm over a probabilistic reasoning oracle. Each call can depend on previous calls. The structure of those dependencies determines whether the process improves, stagnates, or collapses into correlated overthinking. Yes, “correlated overthinking” is a useful phrase. Also a reasonable description of several corporate strategy meetings.
The oracle: a model call whose accuracy depends on the context it sees
The paper formalizes a model as a reasoning oracle, written as $A$. Given a question $Q$, the oracle generates a candidate solution $s$ from a solution space $S$. The algorithm may call the oracle with no context, or with a context $C$ containing previous solutions.
The central object is the transfer function $F$. It describes how the quality of the output depends on the quality of the solutions in the context. In the paper’s main setting, each solution is binary: correct or incorrect. That sounds brutally simplified, but it makes the theory analyzable.
The decaying model captures two assumptions, and the mechanism is easy to state. First, if the context contains a correct solution, the model is more likely to generate a correct output. Second, as the context grows, especially with wrong solutions, that benefit decays.
That is the paper’s useful abstraction. It turns “context helps” into a conditional statement. Context helps when it carries signal. Context hurts when signal is diluted, when the model is forced to synthesize from a pile of mutually inconsistent attempts, or when the system keeps feeding it its own recent errors.
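To make the abstraction concrete, here is a minimal simulated oracle over binary solutions. Everything in it is an assumption made for illustration: the names `make_toy_oracle` and `example_gain`, the specific decay curve, and the simplification that a non-empty context with no correct solution always produces a wrong output. It is a toy stand-in for a transfer function, not the paper's definition.

```python
import random
from typing import Callable, Sequence


def make_toy_oracle(
    p0: float,
    gain: Callable[[int], float],
) -> Callable[[Sequence[bool], random.Random], bool]:
    """Build a simulated oracle over binary solutions (True = correct).

    p0      -- success probability with an empty context (pass@1).
    gain(k) -- success probability when a size-k context holds at least
               one correct solution; intended to be non-increasing in k.
    ASSUMPTION: a non-empty context with no correct solution always
    yields an incorrect output.
    """
    def oracle(context: Sequence[bool], rng: random.Random) -> bool:
        if len(context) == 0:
            return rng.random() < p0
        if any(context):
            return rng.random() < gain(len(context))
        return False

    return oracle


def example_gain(k: int) -> float:
    """Hypothetical decay curve: 0.9 at k=1, shrinking as context grows."""
    return 0.9 / (1.0 + 0.1 * (k - 1))


if __name__ == "__main__":
    rng = random.Random(0)
    oracle = make_toy_oracle(p0=0.2, gain=example_gain)
    # Compare: no context, one correct, one correct buried in noise, all wrong.
    for context in ([], [True], [True] + [False] * 9, [False] * 3):
        hits = sum(oracle(context, rng) for _ in range(2_000))
        print(len(context), any(context), hits / 2_000)
```

The third case in the demo is the one worth watching: the same correct solution is present, but it is surrounded by nine wrong ones, so the simulated benefit is smaller.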
The immediate business translation is not “use longer context windows.” It is almost the opposite: use controlled context. Curate the intermediate outputs that the model sees. Keep diversity. Avoid stuffing the prompt with every previous attempt just because the token budget allows it. The context window is not a filing cabinet. It is an operating surface.
Three reasoning algorithms, three workflow designs
The paper studies three algorithms: branching, genetic, and random sampling. These are not product names. They are abstract forms of inference orchestration.
| Algorithm | Mechanism in the paper | Operational translation | Main trade-off |
|---|---|---|---|
| Branching | Generate independent solutions, combine them in tree-like groups, and recursively synthesize upward. | Parallel candidate generation followed by staged synthesis. | Strong theoretical behavior, expensive call growth. |
| Genetic | Maintain populations of solutions; generate new candidates by sampling from the previous layer. | Population-based refinement, similar to recursive self-aggregation. | More efficient than pure branching, but depends on population size. |
| Random sampling | At each step, sample from all previously generated solutions and use that subset as context. | A simpler memory-pool strategy for iterative synthesis. | Can converge well, but context quality and independence remain crucial. |
The branching algorithm is the cleanest theoretical object. It starts with independent pass@1 solutions, then combines groups of them into higher-level solutions. Because the branches are independent, the analysis is tractable. The paper shows that for decaying models, branching can achieve the maximum possible success probability. It is also optimal among fixed-depth algorithms.
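The trade-off between success probability and call count can be sketched with a toy recurrence. Assume, purely for illustration, that a synthesis call over `k` independent child solutions succeeds with probability `gain` when at least one child is correct and fails otherwise. Then the success probability at depth `d` is `gain * (1 - (1 - p_{d-1})^k)`, and a full `k`-ary tree of depth `d` needs `(k^(d+1) - 1) / (k - 1)` oracle calls. The function name and the transfer rule are assumptions, not the paper's exact model.

```python
def branching_profile(p0: float, k: int, depth: int, gain: float):
    """Return [(depth, success_probability, total_oracle_calls), ...]
    for a full k-ary branching tree under a toy transfer rule.

    ASSUMPTION: a synthesis call succeeds with probability `gain` if at
    least one of its k independent children is correct, else it fails.
    """
    if k < 2:
        raise ValueError("k must be >= 2")
    rows = []
    p = p0  # depth 0 is a single context-free call (pass@1)
    for d in range(depth + 1):
        if d > 0:
            # Children are independent, so P(at least one correct) = 1-(1-p)^k.
            p = gain * (1.0 - (1.0 - p) ** k)
        calls = (k ** (d + 1) - 1) // (k - 1)  # nodes in a full k-ary tree
        rows.append((d, p, calls))
    return rows


if __name__ == "__main__":
    for d, p, calls in branching_profile(p0=0.2, k=3, depth=4, gain=0.9):
        print(f"depth={d}  success={p:.3f}  calls={calls}")
```

In this toy setting the probability levels off after a few layers while the call count keeps multiplying by roughly `k` per layer, which is the cost side of the table's "expensive call growth".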
This is not a recommendation to implement exponential branching in production. If every reasoning workflow becomes a massive tree of model calls, the CFO will discover AI governance very quickly. The point is theoretical: independence is valuable. A good reasoning pipeline should not merely recycle the latest output. It should preserve independent paths long enough for synthesis to be meaningful.
The genetic algorithm is the practical sibling. Instead of building a full tree, it keeps a population at each layer and samples from that population to produce the next one. As population sizes grow, it approaches the branching algorithm. In product terms, this resembles a system that keeps several candidate analyses alive, periodically recombines them, and avoids betting everything on one draft too early.
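A population-based version can be simulated in a few lines. This is a sketch under stated assumptions: the `toy_oracle` below is a made-up binary oracle (success probability `p0` with no context, `gain` when the context contains a correct solution, zero otherwise), and each new candidate draws `k` distinct solutions from the previous layer. The paper's sampling scheme may differ in detail.

```python
import random


def toy_oracle(context, rng, p0=0.2, gain=0.9):
    """Simulated oracle over binary solutions (True = correct).
    ASSUMPTION: an all-wrong, non-empty context always fails."""
    if not context:
        return rng.random() < p0
    return any(context) and rng.random() < gain


def genetic(pop_size, k, layers, rng):
    """Return the fraction of correct solutions in each layer."""
    if k > pop_size:
        raise ValueError("k must not exceed pop_size")
    # Layer 0: independent context-free attempts.
    population = [toy_oracle([], rng) for _ in range(pop_size)]
    history = [sum(population) / pop_size]
    for _ in range(layers):
        # Each new candidate sees k solutions drawn from the previous layer only.
        population = [
            toy_oracle(rng.sample(population, k), rng) for _ in range(pop_size)
        ]
        history.append(sum(population) / pop_size)
    return history


if __name__ == "__main__":
    rng = random.Random(0)
    for size in (4, 40, 400):
        print(size, [round(x, 2) for x in genetic(size, 3, 6, rng)])
```

Running it with a very small `pop_size` is instructive: a layer can end up with no correct solutions at all, after which this toy process cannot recover. Larger populations make that collapse much less likely.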
Random sampling is even simpler. It samples from all previous outputs rather than only the previous layer. The paper shows it can also reach optimal success probability for decaying models, with convergence behavior analyzed through stochastic approximation. This matters because many real agent systems will not maintain neat layers. They will maintain memory pools, scratchpads, retrieved prior attempts, and intermediate artifacts. Random sampling gives a theoretical foothold for thinking about those messier designs.
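The memory-pool idea can be sketched the same way. The code below is an illustrative simulation, not the paper's algorithm verbatim: `toy_oracle` is a hypothetical binary oracle, the pool is seeded with a few context-free attempts, and every later call draws `k` distinct solutions uniformly from the whole pool.

```python
import random


def toy_oracle(context, rng, p0=0.2, gain=0.9):
    """Simulated oracle over binary solutions (True = correct).
    ASSUMPTION: an all-wrong, non-empty context always fails."""
    if not context:
        return rng.random() < p0
    return any(context) and rng.random() < gain


def random_sampling(k, seed_calls, steps, rng):
    """Grow a pool of solutions; each new call samples k from the whole pool."""
    if k > seed_calls:
        raise ValueError("need at least k seed solutions")
    # Seed the pool with independent context-free attempts.
    pool = [toy_oracle([], rng) for _ in range(seed_calls)]
    for _ in range(steps):
        context = rng.sample(pool, k)  # uniform over ALL earlier outputs
        pool.append(toy_oracle(context, rng))
    return pool


if __name__ == "__main__":
    rng = random.Random(0)
    pool = random_sampling(k=3, seed_calls=50, steps=2000, rng=rng)
    early, late = pool[:50], pool[-500:]
    print("seed accuracy:", sum(early) / len(early))
    print("late accuracy:", sum(late) / len(late))
```

Because early wrong answers stay in the pool forever, improvement here is gradual: the pool's composition shifts only as better outputs accumulate. That slow averaging is one reason pool management matters in practice.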
The fixed point is the ceiling, not a dashboard metric
The key theoretical result is a fixed-point characterization of the maximum achievable success probability. For a fixed context size $k$, the success probability ceiling is tied to the largest solution of a fixed-point equation in the success probability itself.
A less algebraic reading: the best you can do depends on the chance that at least one useful solution appears in the context, and on how strongly the oracle can exploit that useful solution after decay.
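One way to make that reading tangible is a toy fixed-point equation of the form `p = gain * (1 - (1 - p)^k)`, where `1 - (1 - p)^k` is the chance that at least one of `k` independent context solutions is correct and `gain` is how often the oracle exploits it. This form is an assumption chosen to match the verbal description, not a transcription of the paper's equation. Since `p = 0` always satisfies it, the quantity of interest is the largest solution.

```python
def transfer(p: float, k: int, gain: float) -> float:
    """Toy transfer map: success probability of a call whose context holds
    k independent solutions, each correct with probability p.
    ASSUMPTION: illustrative form, not the paper's exact equation."""
    return gain * (1.0 - (1.0 - p) ** k)


def largest_fixed_point(k: int, gain: float,
                        tol: float = 1e-12, max_iter: int = 100_000) -> float:
    """Iterate the map downward from p = 1.

    The map is increasing in p and transfer(1) <= 1, so the iterates
    decrease monotonically and converge to the largest fixed point.
    """
    p = 1.0
    for _ in range(max_iter):
        nxt = transfer(p, k, gain)
        if abs(nxt - p) < tol:
            return nxt
        p = nxt
    return p


if __name__ == "__main__":
    for gain in (0.15, 0.5, 0.9):
        print(f"k=5, gain={gain}: ceiling ~ {largest_fixed_point(5, gain):.4f}")
```

In this toy form the ceiling drops to zero whenever `gain * k` is below 1: an oracle that exploits correct context too weakly cannot sustain improvement, regardless of how the calls are arranged.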
This is where the paper moves beyond “iterative reasoning works.” It gives a way to reason about limits. If the oracle cannot benefit much from correct context, no orchestration trick will magically fix it. If context decay is severe, large synthesis prompts become counterproductive. If initial pass@1 probability is tiny, you need enough independent exploration to seed the process.
For enterprise systems, this fixed-point view is more useful than it first appears. It suggests that an AI workflow has an accuracy ceiling determined by three things:
- the quality of independent first-pass generation;
- the model’s ability to recognize and reuse correct material in context;
- the rate at which context noise degrades that ability.
That is not something you learn from a single benchmark score. You learn it by testing the workflow as a system.
The appendix experiments are model-grounding, not the main theorem
The paper’s appendix uses Gemini 2.5 Pro on AIME 2025 to motivate the modeling assumptions. This is important, but it should be interpreted carefully. The experiments are not a broad empirical benchmark of reasoning algorithms. They are experimental grounding for the transfer-function assumptions.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Figure 1: AIME 2025 accuracy across 30 questions using 780 model calls per question | Baseline mapping and question selection | The model is strong on many questions but struggles on some, making questions 10 and 12 useful for context experiments. | It does not compare orchestration algorithms. |
| Appendix A.1: one correct solution plus 0–12 incorrect solutions | Sensitivity test for context decay | Accuracy generally decays as more incorrect solutions surround one correct solution. | It does not quantify a universal decay function for all tasks. |
| Appendix A.2: five total solutions, varying correct versus incorrect count | Modeling support for monotonicity and verification behavior | Accuracy increases smoothly as the number of correct solutions rises; sampling five solutions and using them as context improves average accuracy over base performance on the two tested questions. | It does not establish that all models are strong verifiers, or that five is generally optimal. |
The most useful empirical detail is Figure 3. For the two selected AIME questions, the model’s weighted average correctness when sampling five solutions and using them as context is about 0.78 for question 10 and 0.89 for question 12, while the base accuracies shown are about 0.30 and 0.49. The paper describes the gap as about 40% for those two questions; given these figures (differences of roughly 0.48 and 0.40), that is best read as percentage points of accuracy rather than a relative improvement.
This supports the intuition that the model can act as a strong verifier or synthesizer when given multiple candidate solutions. But the sample is intentionally narrow: two math questions, one model, thirty calls per configuration in the context experiments. The appendix makes the theory plausible; it does not make it universal.
That boundary matters. A business reader should not walk away thinking, “Use five answers, get 40% improvement.” That would be delightfully easy and probably wrong by Tuesday. The useful lesson is procedural: test whether your model improves when exposed to curated candidate outputs, and test how quickly that benefit degrades as incorrect or irrelevant material is added.
Sliding windows fail because recent does not mean independent
One of the paper’s most practically relevant results concerns a tempting shortcut: always use the most recent $k$ outputs as context. This “sliding window” approach sounds reasonable. It is cheap, simple, and naturally sequential. It is also suboptimal in the uniform model.
Why? Because recent outputs are correlated. If the process drifts into a bad local pattern, the next call receives a context dominated by related mistakes. In the decaying models studied later, the paper notes that a sliding-window process can eventually contain only wrong solutions, after which the oracle may never recover. That provides a clean theoretical lens on overthinking: not simply “too many tokens,” but repeated reuse of correlated low-value context.
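The "never recover" behavior is easy to reproduce in a toy simulation. The sketch below assumes a hypothetical binary oracle in which an all-wrong context always yields a wrong output, so a window of `k` consecutive wrong outputs is an absorbing state. The function names and parameters are illustrative.

```python
import random


def toy_oracle(context, rng, p0=0.2, gain=0.9):
    """Simulated oracle over binary solutions (True = correct).
    ASSUMPTION: an all-wrong, non-empty context always fails."""
    if not context:
        return rng.random() < p0
    return any(context) and rng.random() < gain


def sliding_window(k, steps, rng, seed_outputs=None):
    """Always feed the k most recent outputs back in as context."""
    if seed_outputs is None:
        outputs = [toy_oracle([], rng) for _ in range(k)]
    else:
        outputs = list(seed_outputs)
    for _ in range(steps):
        outputs.append(toy_oracle(outputs[-k:], rng))
    return outputs


def absorbed_fraction(k, steps, trials, seed=0):
    """Fraction of runs whose final window contains only wrong solutions."""
    rng = random.Random(seed)
    absorbed = 0
    for _ in range(trials):
        if not any(sliding_window(k, steps, rng)[-k:]):
            absorbed += 1
    return absorbed / trials


if __name__ == "__main__":
    for steps in (10, 1_000, 20_000):
        print(steps, absorbed_fraction(k=3, steps=steps, trials=50))
```

Even with a 90% success rate whenever the window contains a correct solution, a run of `k` misses eventually happens, and in this toy model that run is permanent. Sampling context from a broader pool, rather than only the most recent outputs, is what removes the trap.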
This is the part enterprise agent builders should tattoo on their orchestration diagrams, preferably somewhere near “memory.”
A memory system that blindly carries forward recent thoughts can amplify errors. A multi-agent workflow that passes every intermediate output downstream may look transparent but behave like a rumor chain. A long-context agent that rereads its entire scratchpad may become more confident while less grounded.
The replacement principle is simple: preserve useful independence. Generate diverse attempts. Evaluate them. Select context deliberately. Recombine the best signals without forcing the model to digest every wrong turn.
Branches, populations, and pools map cleanly to enterprise AI patterns
The paper’s theoretical algorithms map naturally to three enterprise workflow patterns.
| Enterprise pattern | Corresponding algorithmic idea | Where it fits |
|---|---|---|
| Parallel expert drafts | Branching | High-stakes analysis where latency is acceptable and independence matters. |
| Iterative analyst room | Genetic algorithm | Strategy, legal review, research synthesis, or complex planning where candidate solutions evolve. |
| Memory-backed reasoning pool | Random sampling | Agent systems with accumulated intermediate outputs, retrieved attempts, and reusable reasoning artifacts. |
For example, in a compliance review workflow, branching would mean asking several independent model instances to analyze a policy document from different angles, then synthesizing selected outputs. A genetic approach would maintain a population of draft findings, refine them through multiple rounds, and combine partial strengths. A random-sampling approach would build a larger pool of findings and draw controlled subsets into later synthesis calls.
The mistake would be to treat all three as equally cheap versions of the same thing. They are not.
Branching buys independence at high compute cost. Genetic reuse buys efficiency but risks population collapse if diversity is not maintained. Random sampling is flexible but requires careful pool management. The choice is not a model choice. It is an inference architecture choice.
What Cognaptus infers for business use
The paper directly shows that, under its decaying-model assumptions, certain reasoning algorithms can reach optimal success probability, while correlated sliding-window reuse can be suboptimal. It also provides small-scale experimental evidence that correct context helps and wrong context dilutes that help.
Cognaptus’ business inference is this: enterprise AI reliability will increasingly depend on inference orchestration, not only model selection.
That inference leads to several design rules.
First, do not evaluate an AI workflow only by one-shot accuracy. If the workflow is naturally iterative (contract analysis, portfolio risk review, supply chain exception diagnosis, due diligence, incident response), then benchmark the complete reasoning process. The model call is not the product. The pipeline is the product.
Second, treat context as a scarce resource even when tokens are abundant. The cost of context is not only price and latency. It is contamination. Wrong intermediate outputs can reduce the value of correct ones.
Third, separate generation, verification, and synthesis. A model may be weak at producing a correct answer on the first attempt but strong at recognizing or integrating correct fragments once they appear. The appendix experiments point in this direction for the tested Gemini/AIME setting. Whether the same is true in your workflow is an empirical question, not a vibe.
Fourth, preserve diversity across attempts. If every candidate answer comes from the same prompt, same retrieval bundle, same examples, and same failure mode, then “multi-agent” may just mean “one agent wearing different hats.” Very fashionable, not very helpful.
Fifth, measure the decay curve. Add correct and incorrect candidate outputs in controlled combinations. Observe when synthesis improves and when it degrades. This is the practical version of estimating the transfer function.
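A measurement harness for this can be quite small. The sketch below is one possible shape, with every name hypothetical: `synthesize` stands in for whatever model call turns a list of candidate outputs into a final answer, `is_correct` stands in for your grader, and `majority_vote_stub` is only a placeholder so the example runs without a model.

```python
import random
from collections import Counter
from typing import Callable, Dict, Sequence, Tuple


def measure_decay_curve(
    synthesize: Callable[[Sequence[str]], str],
    is_correct: Callable[[str], bool],
    correct_pool: Sequence[str],
    incorrect_pool: Sequence[str],
    grid: Sequence[Tuple[int, int]],
    trials: int,
    seed: int = 0,
) -> Dict[Tuple[int, int], float]:
    """Estimate synthesis accuracy for each (n_correct, n_incorrect) mix."""
    rng = random.Random(seed)
    results = {}
    for n_correct, n_incorrect in grid:
        hits = 0
        for _ in range(trials):
            context = (rng.sample(list(correct_pool), n_correct)
                       + rng.sample(list(incorrect_pool), n_incorrect))
            rng.shuffle(context)  # avoid rewarding a fixed position
            hits += is_correct(synthesize(context))
        results[(n_correct, n_incorrect)] = hits / trials
    return results


def majority_vote_stub(context: Sequence[str]) -> str:
    """Placeholder for a real model call: returns the most common candidate."""
    return Counter(context).most_common(1)[0][0]


if __name__ == "__main__":
    curve = measure_decay_curve(
        synthesize=majority_vote_stub,
        is_correct=lambda answer: answer == "42",
        correct_pool=["42"] * 5,
        incorrect_pool=["40", "41", "43", "44", "45", "46"],
        grid=[(1, 0), (1, 2), (1, 4), (3, 2)],
        trials=200,
    )
    for mix, accuracy in curve.items():
        print(mix, round(accuracy, 2))
```

Holding the number of correct candidates fixed while increasing the incorrect ones traces the dilution effect; holding the total fixed while varying the mix traces how much correct material the synthesizer needs. Both require pre-graded candidate pools, which is usually the expensive part.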
The boundary: binary correctness is useful, but business work is rarely binary
The paper is careful about its own limits. Its main model treats solutions as correct or incorrect. That simplification enables clean results, but many enterprise tasks do not behave that way.
A legal memo can be partially right. A market forecast can be directionally useful but numerically weak. A compliance finding can identify the right clause while misreading the operational implication. In these settings, the binary score function is too coarse.
The authors explicitly point toward richer score functions and diversity measures. That is exactly where business relevance would expand. In real workflows, two partially correct answers are useful only if their errors are different and their strengths complement each other. Combining five near-duplicate answers does not create intelligence. It creates a very confident paragraph.
Another boundary is evaluation. The theory assumes a score exists, even if the reasoning algorithm does not access it directly. In production, scoring is often the hard part. You may not know whether an answer is correct until a human expert reviews it, a transaction settles, or a regulator disagrees with you in writing. Slightly late, but educational.
Finally, the experiments focus on math reasoning with Gemini 2.5 Pro and AIME 2025. That is a valuable controlled setting, not a universal deployment guide. The right conclusion is not “this algorithm will improve every enterprise agent.” The right conclusion is “agent workflows need measurable reasoning dynamics.”
From prompt engineering to inference engineering
The most important shift in this paper is not mathematical decoration. It is vocabulary.
Prompt engineering asks: what should we put in the prompt?
Inference engineering asks: how should we allocate model calls, preserve independent attempts, select context, combine partial solutions, and stop before reuse becomes self-contamination?
That is a more serious discipline. It is also where enterprise AI is heading. One-shot assistants are easy to demo and hard to trust. Multi-step systems are harder to design, but they allow reliability to come from process rather than personality. The model does not need to become magically consistent in one call if the system around it can generate, compare, verify, and synthesize intelligently.
The paper does not solve enterprise AI reliability. It gives us a theory-shaped handle on one part of it: why structured reasoning pipelines can outperform naive sampling, and why more context is not automatically better.
The practical lesson is almost annoyingly simple. Do not ask whether the model can think. Ask what algorithm your system forces it to run.
Sometimes the answer is a branching process. Sometimes it is a population. Sometimes it is a memory pool. And sometimes, regrettably, it is just a very expensive way to repeat yesterday’s mistake with better formatting.
Notes
[1] MohammadHossein Bateni, Vincent Cohen-Addad, Yuzhou Gu, Silvio Lattanzi, Simon Meierhans, and Christopher Mohri, “Algorithmic Thinking Theory,” arXiv:2512.04923, 2025. https://arxiv.org/abs/2512.04923