Thinking Isn’t Free: Why Chain-of-Thought Hits a Hard Wall

Reasoning budgets look harmless until they become a line item.

A user asks an AI system to reconcile a long contract, inspect a transaction trail, trace dependencies in a knowledge graph, or verify whether one operational event can lead to another. The model “thinks.” The answer improves. The invoice also improves, in the less charming direction. The usual response is to ask for shorter reasoning: compress the chain of thought, use fewer tokens, impose a budget, maybe add a prompt that says “be concise,” because apparently invoices can be negotiated with adjectives.

That works for some tasks. It does not work for all tasks.

A new paper, Reasoning About Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs, gives a theoretical reason why.¹ The authors extend the Bounded Attention Prefix Oracle, or BAPO, model into a BAPO-CoT model for chain-of-thought reasoning. Their central result is simple enough to be uncomfortable: for several canonical problems, any constant-bandwidth BAPO-CoT needs linearly many reasoning tokens as input size grows.

Not “current models happen to use too many tokens.” Not “we have not found the right compression trick yet.” The result says that under this abstraction, some tasks have a minimum reasoning length that scales with the size of the problem.

That matters because reasoning tokens are not decorative commentary. In many hard tasks, they are the computation.

The real bottleneck is information flow, not verbosity

Most business discussions about chain-of-thought reasoning start at the wrong layer. They treat long reasoning as a writing-style problem. The model is too verbose. The prompt is loose. The scratchpad is inefficient. A better instruction should produce the same result with fewer words.

Sometimes, yes. Models overthink trivial problems. They repeat themselves. They spend tokens performing ritual seriousness. Nobody needs a 900-token meditation on $2 + 3$.

But the paper is not about those cases. It studies problems where the answer depends on information distributed across the input. The model must move enough information from earlier tokens to later computation steps to produce the right output. This is where BAPO becomes useful.

BAPO models a transformer-like system as communication-constrained. For any split between a prefix and a suffix, only limited information can cross the split. There are two channels:

Channel	What it roughly represents	Why it matters
Prefix bandwidth	A bounded summary of information already processed in the prefix	The model cannot pack arbitrary global state into a tiny hidden summary
Attention bandwidth	A bounded number of prefix tokens that the suffix can effectively attend to	The model cannot reliably pull unlimited exact details through attention at one step

A single token-generation step is therefore not magical. It is a bounded communication event.

Chain-of-thought helps because it gives the model more events. Instead of solving a difficult global problem in one step, the model generates intermediate tokens. Each token can carry forward some state. The scratchpad becomes a serialized computation trace.

That is the mechanism. CoT does not remove the bottleneck. It amortizes the bottleneck over more steps.

This is the core idea behind BAPO-CoT. The model starts with the input, generates reasoning tokens one at a time, and eventually emits the answer. Each generated token is modeled as another bounded BAPO step. If a task requires a lot of information to be moved and updated, then the number of steps cannot remain constant.

In plain language: if each reasoning token can carry only so much useful computation, some problems require many tokens for the same reason a narrow pipe needs more time to move more water. A more poetic prompt will not widen the pipe.

Three toy tasks expose a non-toy constraint

The paper proves lower bounds for three canonical BAPO-hard tasks:

Task	What the model must decide	Lower bound shown
Majority	Whether a binary string has more 1s than 0s	$\Omega(n)$ CoT tokens
MATCH3$_n$	Whether two earlier modular values combine with the final value to satisfy a modular condition	$\Omega(n)$ CoT tokens
Reachability	Whether a target node is reachable from a source node in a directed graph	$\Omega(m)$ CoT tokens, where $m$ is the number of edges

These are not chosen because businesses wake up each morning desperate to solve modular triplet matching. They are chosen because they isolate different forms of global information flow.

Majority is the cleanest example. To know whether 1s outnumber 0s, the system needs enough information about the whole sequence. A model can try to “just know” the count, but a bounded communication abstraction does not allow an arbitrary exact count to appear from nowhere.

MATCH3$_n$ is closer to a search-over-combinations problem. The model must detect whether a relationship exists among values spread across the input. A compact answer requires checking enough candidates or maintaining enough information about possible complements.

Reachability is the graph case. The answer is not local to one edge. It depends on how edges compose into paths. In business language, this resembles dependency tracing: can this event reach that downstream state through a chain of relations?

The proof strategy is based on indistinguishability. The authors construct pairs of inputs that must lead to different answers, while ensuring that a too-short BAPO-CoT sees the same bounded information at every reasoning step. If the system cannot distinguish the two cases, it must produce the same reasoning sequence and the same answer. Since the correct answers differ, the short reasoning process fails.

This is the important part: the lower bound is not saying the model is lazy. It is saying that with bounded per-step communication, a reasoning chain below a certain length cannot carry enough distinguishing information.

Upper bounds keep the result honest

A lower bound without an upper bound can be dramatic but operationally vague. It says a task is hard, but not whether the proposed abstraction has anything sensible to say about actual reasoning procedures.

The paper therefore complements the lower bounds with matching or near-matching upper-bound constructions.

Problem	Lower bound	Constructive upper bound under cBAPO	Interpretation
Majority	$\Omega(n)$	$O(n \log n)$	Count bit by bit; the log factor comes from writing counts in a small alphabet
MATCH3$_n$	$\Omega(n)$	$O(n)$	Reduce to a sequence of easier pair-matching checks
Reachability	$\Omega(m)$	$O(n^2)$	Perform a graph search, tracking explored and visited nodes
Pointer chasing	BAPO-hard for growing chain length	$O(\kappa(n))$	Hardness does not always imply linear-in-input CoT length

The cBAPO part matters. The original BAPO model has a strange loophole in the chain-of-thought setting: with enough freedom in the prefix oracle, repeating the input can make unrealistic upper bounds possible. In effect, the model can exploit an oracle-like copy trick. That is not how real causal transformers behave.

The authors introduce self-consistent BAPO, or cBAPO, to make upper bounds more meaningful. The prefix computation must itself be computable in a BAPO-like way. This preserves the idea that the model’s intermediate state has to be built through bounded operations rather than smuggled in by an omniscient prefix function.

For readers who do not live inside theoretical computer science, the practical value is this: the paper is not merely shouting “hard!” from a hilltop. It also sketches the kinds of serialized algorithms that can solve the tasks. Majority becomes counting. MATCH3$_n$ becomes repeated complement checking. Reachability becomes graph search.

That is exactly the business lesson. When reasoning tokens scale linearly, they are often acting as an implicit algorithm. The model is not “explaining” the answer after the fact. It is spending tokens to perform the work.

The experiments are sanity checks, not the main theorem

The paper’s main contribution is theoretical. The experiments are not there to prove the theorem. They are there to ask whether frontier reasoning models behave in a way consistent with the theory.

The authors test GPT-5.2 and Gemini 2.5 Pro on Majority, MATCH3$_n$, and Reachability across input sizes $n = 5, 11, 25, 51, 75, 101$. For Reachability, they generate graphs with $m \leq 3n$ edges so input size remains linear in the number of nodes. For each size and model, they generate 250 positive and 250 negative instances.

The evidence can be read as three different tests:

Test	Likely purpose	What it supports	What it does not prove
Internal reasoning on BAPO-hard tasks	Main empirical sanity check	GPT-5.2 uses approximately linear reasoning tokens while maintaining high accuracy; Gemini uses much larger budgets on some tasks	That all real business tasks follow these exact bounds
Fixed or limited external CoT	Budget sensitivity test	Too-small reasoning budgets degrade performance as input size grows	That every fixed budget fails on every task
BAPO-easy comparison tasks	Contrast / robustness check	Token usage is much lower and often plateaus on easier tasks	That BAPO fully captures all transformer behavior

The results are directionally sharp. With no reasoning, accuracy quickly degrades toward random guessing on the hard tasks. With reasoning enabled, GPT-5.2 achieves near-perfect accuracy, but reasoning token use scales approximately linearly with problem size. Gemini 2.5 Pro also benefits from reasoning, but in the paper’s reported experiments it uses far more tokens for MATCH3$_n$ and Majority, reaching roughly 8,000–9,000 tokens at $n = 101$ for those tasks, and its Majority accuracy degrades to about 0.75 at larger $n$.

The external-CoT experiments are especially useful for interpreting “compression.” The authors disable internal reasoning for GPT-5.2 and test three prompting styles: a plain “think step by step” prompt, a fixed word-budget prompt, and an algorithmic prompt that explicitly asks for the relevant procedure—counting, complement search, or BFS/DFS.

For MATCH3$_n$ and Reachability, plain and algorithmic CoT again show roughly linear completion-token growth and better accuracy. Fixed 50- and 100-word budgets only partially close the gap. A 200-word budget appears sufficient up to $n = 101$ for those two tasks, which is an important nuance: real models may be inefficient by large constant factors even when sublinear scaling is impossible.

That distinction is worth keeping. The paper does not say compression is useless. It says compression has a ceiling when the task has linear token complexity. Cutting waste is possible. Compressing below the scaling law is not.

The Majority case adds another useful wrinkle. GPT-5.2 sometimes refuses to count step by step or jumps to a final count, which hurts external-CoT performance. This is not a contradiction of the theory. It is a reminder that the existence of a clean decomposition does not guarantee that an LLM will execute it when instructed. The model may know the relevant procedure and still choose a shortcut, because training and instruction-following behavior interfere with the desired algorithm.

That is a quietly brutal finding for prompt-only automation. “Use the right algorithm” is not the same as “the model will reliably execute the right algorithm.”

The misconception: shorter reasoning is not always smarter reasoning

The tempting belief is that long reasoning is mostly waste. If the chain is long, the model is rambling. If the model is rambling, compression should preserve accuracy. Therefore, every reasoning-heavy workflow can be optimized by forcing the model to think less.

This paper suggests a better distinction:

Case	What shorter reasoning means	Operational response
Redundant reasoning	The model repeats, over-explains, or overthinks a simple task	Compress, distill, tune prompts, lower reasoning effort
Inefficient reasoning	The model uses the right scaling pattern but too many tokens per unit of work	Optimize constants, provide algorithmic structure, improve representations
Irreducible reasoning	The task requires information flow that grows with input size	Route to tools, explicit algorithms, retrieval, search, or architecture changes

Most enterprise AI cost management treats these as one category. That is convenient and wrong.

A contract review task may contain easy local extraction, medium-difficulty classification, and hard global consistency checking. A transaction-monitoring workflow may include simple threshold checks, pattern matching across records, and graph-style relationship tracing. A customer-support agent may answer a local policy question cheaply but require expensive reasoning when policy clauses interact with long account history.

If all of this is sent to one “reason harder” endpoint, cost will look mysterious. It is not mysterious. It is mixed task complexity wearing a product UI.

Business implication: route reasoning by information structure

The business value of this paper is not that managers should learn BAPO proofs. They have suffered enough.

The value is that it gives a cleaner diagnostic question: does this task require global information flow that grows with input size?

If the answer is no, reasoning compression may be a straightforward win. If the answer is yes, fixed budgets become dangerous. The system may look fine in small demos, then fail when the input grows.

A useful operational framework is:

Workflow pattern	CoT risk	Better design choice
Local extraction from a short document segment	Low	Use cheap models, small reasoning budget, structured output
Classification using a few salient fields	Low to medium	Prompt optimization and calibration are usually enough
Multi-document consistency checking	Medium to high	Chunk, retrieve, compare explicitly, maintain evidence tables
Dependency tracing across processes, contracts, or graphs	High	Use graph algorithms or tool calls; let the LLM explain results
Long audit trail reconstruction	High	Build deterministic state machines or database queries; use LLM for summarization and exception narratives
Search over many candidate combinations	High	Use explicit search or constraint-solving tools before asking for natural-language interpretation

The point is not to remove the LLM. The point is to stop pretending that the LLM’s scratchpad is the cheapest place to run every algorithm.

For Cognaptus-style business automation, the practical architecture should separate four layers:

Data selection: retrieve the relevant records, documents, or graph neighborhoods.
Deterministic computation: run counting, matching, search, joins, constraint checks, or graph traversal outside the model when possible.
LLM reasoning: use the model where ambiguity, policy interpretation, explanation, and judgment matter.
Audit narrative: ask the model to explain what was checked, what evidence was used, and where uncertainty remains.

That architecture does not worship deterministic code. It merely gives arithmetic, search, and graph traversal back to the machines that were already good at them before everyone decided a chatbot should become a database, calculator, paralegal, analyst, and intern with anxiety.

Compression should be benchmarked by scaling, not percentage savings

A major editorial strength of the paper is its attack on percentage-based reasoning compression.

Many compression studies report results like “30% shorter chains with similar accuracy.” That can be useful. It is also incomplete. A constant-factor reduction says little about what happens as input size grows.

Suppose a task needs at least $cn$ reasoning tokens. A method that reduces the constant from $10n$ to $4n$ is valuable. It may cut cost by more than half. But it has not changed the scaling. When $n$ becomes large enough, the chain is still long.

Conversely, if a task’s true minimal reasoning grows sublinearly but the model uses superlinear chains, aggressive compression may be possible. The paper’s pointer-chasing example is important here. It shows that BAPO-hard does not automatically mean linear-in-input token complexity. The necessary reasoning length can scale with the relevant dependency chain $\kappa(n)$ rather than the entire input size.

That gives a more useful evaluation principle:

Do not ask only, “Can we shorten reasoning by 30%?”

Ask:

How does required reasoning length grow with input size?
What part of the input actually determines the answer?
Is the model using tokens to do necessary serial work or to compensate for poor decomposition?
Does accuracy collapse at a task-specific critical budget?
Can explicit tools change the scaling, not just the constant?

This is where many AI pilots mislead. A small sample proves that the model can reason. It does not prove that the reasoning budget scales acceptably.

What the paper directly shows, and what we infer from it

The practical interpretation should stay disciplined.

Level	Statement
Directly shown by the paper	Under the BAPO-CoT abstraction, Majority, MATCH3$_n$, and Reachability require linear CoT token lower bounds for constant-bandwidth BAPO-CoTs. The paper also gives matching or near-matching constructive upper bounds and reports experiments where frontier reasoning models show broadly consistent scaling behavior.
Reasonable business inference	Some enterprise tasks with global information flow will not be made cheap merely by asking for shorter reasoning. As input grows, fixed token budgets can produce brittle failures. Systems should route counting, search, graph traversal, and large candidate matching to explicit tools where possible.
Still uncertain	BAPO-CoT is an abstraction, not a full model of every production LLM system. The experiments use synthetic canonical tasks and limited model access. Tool use, retrieval, specialized training, different architectures, and task-specific data structures can change practical performance.

The middle row is where the money is. The paper does not directly study procurement workflows, audit review, legal analysis, customer-support routing, or financial anomaly investigation. But it gives a theoretical language for a failure pattern those workflows already display: reasoning cost grows when the model must carry global state through a narrow sequential interface.

The right conclusion is not “CoT is bad.” The right conclusion is “CoT is an expensive substrate for some kinds of computation.”

The boundary: this is not a universal anti-reasoning argument

The cleanest way to misuse this paper is to turn it into a slogan: chain-of-thought hits a hard wall, therefore reasoning models are doomed.

No.

The paper shows lower bounds for specific problem families under a specific abstraction. BAPO-CoT isolates a communication bottleneck, but it does not model every feature of modern AI systems. It does not fully capture external tool use, retrieval, database operations, code execution, specialized memory, or architectures that alter the communication pattern.

The experiments are also limited. They use synthetic tasks with clean scaling parameters. That is exactly why they are useful for theory, but it is not the same as measuring a messy enterprise workflow.

There are also two different optimization targets:

Changing constants: reducing unnecessary reasoning tokens while preserving the same asymptotic scaling.
Changing the computational substrate: moving the hard part into tools, algorithms, retrieval structures, or architectures with better scaling.

Most prompt engineering changes constants. Product architecture can change the substrate.

That is the boundary businesses should care about.

The end of unpriced thinking

The paper’s most useful contribution is not that it proves three lower bounds. It is that it changes the question.

The old question was: “Can we make the model reason more efficiently?”

The better question is: “What kind of computation are we asking reasoning tokens to perform?”

If the model is using chain-of-thought to clarify a judgment, refine an explanation, or structure a small ambiguous decision, shorter reasoning may work. If it is using chain-of-thought to count, search, traverse, match, or carry global state across a growing input, then the token budget is not fluff. It is the runtime.

That is the hard wall. Not that reasoning stops working. It keeps working—by spending tokens.

For enterprise AI, this should push design away from monolithic reasoning calls and toward routed systems: LLMs for language and judgment, tools for explicit computation, retrieval for evidence, graphs for dependencies, and audit layers for accountability.

Thinking is useful. It is also not free.

And any AI system that prices thinking as if it were free is not intelligent. It is just temporarily subsidized.

Cognaptus: Automate the Present, Incubate the Future.

Kiran Tomlinson, Tobias Schnabel, Adith Swaminathan, and Jennifer Neville, “Reasoning About Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs,” arXiv:2602.02909, 2026. ↩︎

The real bottleneck is information flow, not verbosity#

Three toy tasks expose a non-toy constraint#

Upper bounds keep the result honest#

The experiments are sanity checks, not the main theorem#

The misconception: shorter reasoning is not always smarter reasoning#

Business implication: route reasoning by information structure#

Compression should be benchmarked by scaling, not percentage savings#

What the paper directly shows, and what we infer from it#

The boundary: this is not a universal anti-reasoning argument#

The end of unpriced thinking#