Thinking Out Loud — Why LLMs Might *Need* Chain‑of‑Thought

Audit trails are boring until something goes wrong.

In ordinary business operations, this is not controversial. If a payment approval, legal review, procurement decision, or trading order leaves intermediate records, people can reconstruct what happened. If the whole decision is buried inside a black-box system that simply outputs “approved,” “rejected,” or “buy now,” the audit team has a less glamorous job: guessing which invisible machinery produced the visible answer. Charming, in the way dental surgery is charming.

The same issue sits inside today’s debate about chain-of-thought reasoning in language models. Many people talk about chain-of-thought as if it were mainly a prompting style: ask the model to explain itself, and maybe it becomes easier to monitor. That is only half the story. The more serious question is architectural: how much sequential reasoning can the model perform before it is forced to pass through something humans can read?

That is the question Jonah Brown-Cohen, David Lindner, and Rohin Shah address in “Quantifying the Necessity of Chain of Thought through Opaque Serial Depth.”¹ The paper does not claim that every chain-of-thought trace is honest, complete, or safe. It makes a narrower and more useful move: it formalizes the amount of serial computation a model can perform between interpretable bottlenecks, such as visible tokens. The authors call this quantity opaque serial depth.

The key implication is simple enough to remember and uncomfortable enough to matter:

Chain-of-thought monitorability is not just a behavioral property. It is partly an architectural property.

That changes how businesses should evaluate AI systems. The question is no longer only “Does the model show its reasoning?” It becomes “Does the architecture leave the model with any other place to put long sequential reasoning?”

Opaque serial depth measures hidden sequential work between visible checkpoints

The paper starts from a familiar intuition: some tasks require serial reasoning. Not everything can be solved by doing many operations in parallel. Planning, multi-step deduction, and certain forms of search require one step to depend on the result of a previous step.

Transformers are powerful partly because they perform vast parallel computation over tokens. But a standard autoregressive Transformer has an important constraint: after it produces a token, that token becomes part of the next input. If the model needs to carry information from later layers in one forward pass back to earlier layers in another forward pass, that information must pass through the generated token stream.

That token stream may be visible to humans. So, for sufficiently long serial reasoning, the model may need to “think out loud.” Not because it has suddenly developed a fondness for transparency. The architecture is simply making visible text a bottleneck.

Opaque serial depth gives this bottleneck a quantitative language. Informally:

Opaque serial depth is the maximum amount of serial computation that can happen without passing through an interpretable intermediate node.

The paper treats visible input and output tokens as interpretable nodes. It also allows other nodes to be marked interpretable if there is a defensible reason to do so. The computation between those interpretable nodes is the “opaque” part.

This is a useful framing because it avoids an easy but bad shortcut: counting layers. A layer count tells us something, but not enough. A normalization operation, a linear projection, a softmax, a multi-head attention block, and a feed-forward block do not all contribute the same kind of serial computation. Saying “this model has 48 layers” is not wrong. It is just too blunt to settle an auditability question.

The authors instead borrow a ruler from computational complexity theory: circuit depth.
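
The informal definition above can be made concrete on a toy graph. The sketch below is only an illustration under stated assumptions: the graph, the node names, and the convention that every non-source node costs one unit of depth are all hypothetical, not taken from the paper.

```python
from functools import lru_cache

# Hypothetical toy computation graph: node -> list of nodes it reads from.
# Assumption: every non-source node costs one unit of serial depth.
GRAPH = {
    "tok_in": [],
    "h1": ["tok_in"],
    "h2": ["h1"],
    "h3": ["h2"],
    "tok_out": ["h3"],
    "g1": ["tok_out"],
    "g2": ["g1"],
    "answer": ["g2"],
}


def opaque_serial_depth(graph, interpretable):
    """Longest path that never passes through an interpretable node."""

    @lru_cache(maxsize=None)
    def depth_into(node):
        # Length of the longest opaque path that ends at `node`.
        best = 0
        for pred in graph[node]:
            # An interpretable predecessor is a checkpoint: paths restart there.
            start = 0 if pred in interpretable else depth_into(pred)
            best = max(best, start + 1)
        return best

    return max(depth_into(node) for node in graph)


with_checkpoint = opaque_serial_depth(GRAPH, frozenset({"tok_in", "tok_out", "answer"}))
without_checkpoint = opaque_serial_depth(GRAPH, frozenset({"tok_in", "answer"}))
print(with_checkpoint, without_checkpoint)  # 4 7
```

The total amount of computation is the same in both calls. Marking `tok_out` as interpretable is what splits one path of length 7 into two shorter opaque segments.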

Circuit depth is the paper’s ruler, not a decorative math accessory

Circuit depth measures the length of the longest path through a computational circuit. A shallow circuit can do many things in parallel, but it cannot perform much sequential computation. A deep circuit can support longer chains of dependency.

The paper defines the depth of a neural network function by asking: among polynomial-size circuits that compute the same function, what is the minimum possible maximum path length?

In simplified form:

$$ \text{Depth}(f_\theta) = \min_{\text{poly}(S)\text{-size circuits }C} \max_{\text{paths }P\text{ in }C} \text{Length}(P) $$

Here, $f_\theta$ is the function computed by the neural network, and $S$ is the number of parameters. The polynomial-size restriction matters because, without it, one could “compute” arbitrary functions with giant lookup tables. That would be a wonderfully useless way to declare victory.

The paper does not pretend that exact minimum depth is easy to calculate. It is not. Instead, the authors compute upper bounds by constructing valid circuits corresponding to actual neural network computations and calculating their depth. This is already valuable because it gives a principled way to compare architectures.
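
A small sketch may help show why a constructed circuit gives an upper bound rather than the exact depth. It assumes a hypothetical gate-list format and two hand-built circuits for the same function (XOR of four bits); neither is from the paper.

```python
def circuit_depth(gates):
    """Length of the longest input-to-output path.

    `gates` maps each gate name to the names it reads, listed in
    topological order. Names that are not gates are circuit inputs (depth 0).
    """
    depth = {}
    for gate, inputs in gates.items():
        depth[gate] = 1 + max(depth.get(name, 0) for name in inputs)
    return max(depth.values())


# Two hypothetical circuits for the same function: XOR of four input bits.
CHAIN = {"a": ["x0", "x1"], "b": ["a", "x2"], "out": ["b", "x3"]}
TREE = {"a": ["x0", "x1"], "b": ["x2", "x3"], "out": ["a", "b"]}

# Any valid circuit bounds the true minimum from above; keep the best one found.
upper_bound = min(circuit_depth(CHAIN), circuit_depth(TREE))
print(circuit_depth(CHAIN), circuit_depth(TREE), upper_bound)  # 3 2 2
```

Finding a shallower equivalent circuit tightens the bound. Failing to find one proves nothing about the true minimum, which is why the reported numbers are upper bounds.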

The mechanism is worth spelling out:

Concept	What it means	Why it matters
Serial depth	Longest chain of dependent computation	Captures how much step-by-step reasoning can occur
Interpretable node	A checkpoint humans can plausibly inspect, usually text	Creates a visibility boundary
Opaque serial depth	Serial depth between interpretable checkpoints	Estimates hidden reasoning capacity
Upper bound	A calculated maximum from a known circuit representation	Useful for comparison, not a certificate of exact behavior

This is the paper’s central contribution. It transforms “the model might hide reasoning” from a vague concern into a question about how computation flows through the architecture.

Tokens become audit checkpoints in standard autoregressive Transformers

For a standard autoregressive Transformer, each output token can be treated as an interpretable node. The model produces one token, appends it to the context, and then uses the expanded context to produce the next token. If the model is doing long serial reasoning across multiple generated tokens, the reasoning has to repeatedly pass through the token stream.

This does not mean the token stream is perfectly faithful. The paper is careful about this. But it does mean the architecture gives us a natural monitorable bottleneck.

That is the business-relevant insight. The visibility of reasoning is not merely an interface decision. It is partly a consequence of where the model’s serial computation can go.
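
The loop described in this section can be sketched in a few lines. This is a toy under stated assumptions: `toy_forward_pass` is a hypothetical stand-in for a model (it applies one Collatz step to the last token), chosen only because the overall computation is inherently step-by-step.

```python
def generate(forward_pass, prompt_tokens, num_new_tokens):
    """Autoregressive decoding where tokens are the only state carried between passes."""
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        # Opaque part: one bounded-depth forward pass over the visible context.
        next_token = forward_pass(tuple(tokens))
        # Interpretable checkpoint: the result is written out as a token.
        tokens.append(next_token)
    return tokens


def toy_forward_pass(tokens):
    """Hypothetical stand-in for a model: one Collatz step on the last token."""
    n = tokens[-1]
    return n // 2 if n % 2 == 0 else 3 * n + 1


trace = generate(toy_forward_pass, [6], 8)
print(trace)  # [6, 3, 10, 5, 16, 8, 4, 2, 1]
```

Each call to `toy_forward_pass` does very little serial work, yet the eight-step chain still gets computed. It simply shows up in `trace`, where it can be read.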

In a standard Transformer, the longest opaque path for one generated token is limited to the forward pass that produces that token. The paper’s asymptotic result expresses this as:

$$ O(L(\log T + \log D)) $$

where $L$ is the number of layers, $T$ is sequence length, and $D$ is activation dimension.

The logarithmic dependence on sequence length is important. Attention looks across many tokens, but summing or aggregating over many values can be organized in a tree-like parallel structure. That creates logarithmic depth rather than linear depth. The model can see a long context, but seeing many things in parallel is not the same as performing a long serial computation hidden inside one forward pass.

This is the point many casual discussions miss. More context does not automatically mean proportionally more hidden sequential thought. The architecture matters.
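
The tree-like aggregation mentioned above can be illustrated with a pairwise sum that counts its own serial rounds. This is a sketch of the general idea only; real attention also involves softmax and other operations that it does not model.

```python
def tree_sum(values):
    """Sum by pairwise rounds; return (total, number_of_serial_rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Every addition in a round is independent, so a round costs one serial step.
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd element carries over unchanged
        level = paired
        rounds += 1
    return (level[0] if level else 0), rounds


for n in (1_024, 131_072):
    total, rounds = tree_sum(range(n))
    print(n, total == sum(range(n)), rounds)
```

Going from 1,024 to 131,072 values multiplies the input by 128 but raises the round count only from 10 to 17.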

The Gemma 3 calculations put scale on the mechanism

The paper then applies the method to the Gemma 3 model family, treating input and output tokens as interpretable nodes. These calculations are the main worked evidence that the definition can be used on real architectures, not just whiteboard diagrams.

The upper-bound depths at maximum sequence length are:

Model	Final depth formula	Total depth at maximum sequence length
Gemma 3 1B	$4370 + 8 \cdot \log_2 T$	4,490
Gemma 3 4B	$6036 + 10 \cdot \log_2 T$	6,206
Gemma 3 12B	$8482 + 16 \cdot \log_2 T$	8,754
Gemma 3 27B	$11322 + 20 \cdot \log_2 T$	11,662

These are not tiny numbers. A forward pass can still contain thousands of units of serial depth under the paper’s circuit accounting. So the result is not “Transformers cannot think unless every step is written in English.” That would be too neat, and therefore suspicious.

The more precise interpretation is this: standard Transformer architectures bound opaque serial reasoning between visible token checkpoints. If a task needs serial depth beyond that bound, the model must either use visible tokens, fail, or rely on some other mechanism not captured by the standard architecture.

Magnitude matters here. The Gemma 3 27B upper bound of 11,662 is much larger than the 1B model’s 4,490. Larger models and deeper architectures can support more hidden computation between tokens. But the dependence on sequence length remains logarithmic in the formula. The main driver is not simply “longer context”; it is how the architecture composes operations.
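
The totals in the Gemma 3 table can be reproduced from the formulas, which also shows how small the sequence-length term is. The constants and coefficients below come from the table; the maximum sequence lengths (32,768 for 1B and 131,072 for the others) are an assumption inferred from the reported totals.

```python
import math

# (constant term, coefficient on log2 T, assumed maximum sequence length T)
GEMMA3_BOUNDS = {
    "1B": (4370, 8, 32_768),
    "4B": (6036, 10, 131_072),
    "12B": (8482, 16, 131_072),
    "27B": (11322, 20, 131_072),
}


def depth_bound(constant, log_coeff, seq_len):
    """Evaluate constant + log_coeff * log2(seq_len)."""
    return constant + log_coeff * math.log2(seq_len)


for name, (constant, log_coeff, seq_len) in GEMMA3_BOUNDS.items():
    total = depth_bound(constant, log_coeff, seq_len)
    log_share = log_coeff * math.log2(seq_len) / total
    print(f"Gemma 3 {name}: total={total:.0f}, log-term share={log_share:.1%}")
```

Under these assumptions the logarithmic term contributes only a few percent of each total. Almost all of the bound comes from the constant term, which reflects the layer stack rather than the context length.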

For enterprise model evaluation, this suggests a different kind of due diligence question. Instead of asking only whether a model has a reasoning mode, ask how much opaque serial computation the architecture permits before reasoning is forced through an inspectable representation.

Architecture changes the audit boundary

The most useful section of the paper compares standard Transformers with alternative architectures. This is where opaque serial depth becomes a governance concept rather than a technical curiosity.

The comparison is not an empirical benchmark. It is a theoretical architecture analysis. Its purpose is to show how design choices alter the amount of reasoning that can occur away from interpretable checkpoints.

Architecture	Opaque serial depth result	What changes	Governance interpretation
Autoregressive Transformer with visible tokens	$O(L(\log T + \log D))$	Tokens act as repeated interpretable bottlenecks	Chain-of-thought monitoring has architectural support
Recurrent network blocks	$O((L + T)\log D)$	Hidden state can carry serial computation across positions	More reasoning can stay inside activations
Continuous latent chain-of-thought, if latent states are not interpretable	$O(LT_{\text{cot}}(\log T + \log D))$	“Thoughts” bypass discrete visible tokens	Reasoning capacity may grow while monitorability shrinks
Persistent black-box memory	Cannot be bounded in terms of $L$, $T$, and $D$ alone	Hidden memory can accumulate computation over invocations	Auditability can degrade across sessions
Text diffusion with interpretable intermediate tokens	$\text{Depth}(f_\theta)$ per diffusion step checkpoint	Intermediate token states may be visible	Low opaque depth does not automatically explain the algorithm

The standard Transformer case is the cleanest. Tokens form checkpoints. Opaque paths can go “up” through layers and “right” through attention over positions, but the serial depth stays bounded in the way described above.

Recurrent networks change the picture. If information can be carried forward through hidden state, then serial computation can move across positions without being externalized as natural language. That creates a linear dependence on $T$ rather than only a logarithmic one. In plain English: recurrence gives the model a private corridor.

Continuous latent chain-of-thought pushes the issue further. If a model passes hidden vectors as “thoughts” instead of discrete tokens, and those vectors are not interpretable, then multiple reasoning steps can occur without leaving readable traces. The paper notes that such vectors may or may not be interpretable depending on how they are constructed and trained. That conditional matters. A continuous representation based on token embeddings might preserve some interpretability; a representation optimized only for solving hard tasks probably should not receive that benefit of the doubt.
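
To get a feel for how differently these expressions scale, the sketch below evaluates the growth terms from the comparison table with all constants dropped. The values of $L$, $D$, and $T_{\text{cot}}$ are hypothetical, and because big-O notation hides constant factors, the outputs are growth trends rather than depth estimates.

```python
import math


def transformer_growth(L, T, D):
    """Growth term for the standard Transformer: L * (log T + log D)."""
    return L * (math.log2(T) + math.log2(D))


def recurrent_growth(L, T, D):
    """Growth term for recurrent blocks: (L + T) * log D."""
    return (L + T) * math.log2(D)


def latent_cot_growth(L, T, D, T_cot):
    """Growth term for uninterpretable latent thoughts: L * T_cot * (log T + log D)."""
    return L * T_cot * (math.log2(T) + math.log2(D))


L, D, T_COT = 32, 4096, 100  # hypothetical sizes, for illustration only
for T in (1_024, 131_072):
    print(
        T,
        transformer_growth(L, T, D),
        recurrent_growth(L, T, D),
        latent_cot_growth(L, T, D, T_COT),
    )
```

With these made-up sizes, lengthening the context 128-fold moves the Transformer term from 704 to 928, while the recurrent term grows by more than a hundredfold. The latent chain-of-thought term is the Transformer term multiplied by the number of hidden thought steps.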

Persistent black-box memory is the sharpest warning. If a model can read from and write to an uninterpretable memory across invocations, opaque serial computation can accumulate indefinitely. The paper says the depth cannot be bounded solely in terms of layers, tokens, and activation dimension. For businesses building long-running agents, this is the part worth underlining. A system that remembers through hidden activation-like memory may become less auditable over time even if each individual response looks harmless.

Text diffusion is more subtle. If intermediate diffusion outputs are token-like and interpretable, opaque serial depth may remain bounded by the depth of one neural network step. But the paper also notes a distinction between understanding an intermediate state and understanding the algorithm that produced the transition. Seeing “the answer is 8” change into “the answer is 7” tells us that the model revised its answer. It does not necessarily tell us why. Useful, but not magic. AI governance keeps trying to buy magic and receiving invoices for engineering.

The automated calculator is implementation evidence, not a second thesis

Hand calculations are possible, but tedious. The paper’s appendix provides detailed by-hand calculations for Gemma 3, including architectural details such as sliding-window attention, global attention layers, pre- and post-attention normalization, QK normalization, RMSNorm, feed-forward blocks, and output decoding.

The authors also implement a JAX-based automated depth calculator. It works by applying the opaque serial depth algorithm to the jaxpr intermediate representation of a model. The implementation hard-codes depth formulas for JAX operations: roughly 75 jaxpr operations are enough for the full Gemma 3 architecture, with most being depth-0 wiring operations or depth-1 coordinate-wise operations. The authors also implement additional cases for associative operations and recursive or special-case operations.
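
The general shape of such a calculator can be sketched without JAX. Everything below is a toy under stated assumptions: the program format is a hypothetical list of operations rather than a real jaxpr, the operation names are borrowed only for flavor, and the logarithmic rule for reductions is an assumption based on the tree-style aggregation discussed earlier.

```python
import math

WIRING_OPS = {"reshape", "transpose", "broadcast"}  # depth 0
ELEMENTWISE_OPS = {"add", "mul", "exp", "max"}      # depth 1
REDUCTION_OPS = {"reduce_sum", "reduce_max"}        # assumed depth ceil(log2 n)


def op_depth(op, reduce_size=1):
    """Hard-coded depth rule for one operation."""
    if op in WIRING_OPS:
        return 0
    if op in ELEMENTWISE_OPS:
        return 1
    if op in REDUCTION_OPS:
        return math.ceil(math.log2(reduce_size)) if reduce_size > 1 else 0
    raise KeyError(f"no depth rule for op {op!r}; a hand-derived formula is needed")


def program_depth(program):
    """program: list of (out_var, op, in_vars, reduce_size) in execution order."""
    depth = {}
    for out_var, op, in_vars, reduce_size in program:
        # Variables never produced by the program are inputs with depth 0.
        start = max((depth.get(var, 0) for var in in_vars), default=0)
        depth[out_var] = start + op_depth(op, reduce_size)
    return max(depth.values(), default=0)


# Toy program: y = sum_i(w_i * x_i) + b over 1,024 coordinates.
DOT_PLUS_BIAS = [
    ("wx", "mul", ["w", "x"], 1),
    ("s", "reduce_sum", ["wx"], 1024),
    ("y", "add", ["s", "b"], 1),
]
print(program_depth(DOT_PLUS_BIAS))  # 12
```

The unknown-operation error mirrors a practical constraint of this design: a rule table only covers the operations someone has already analyzed.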

This part of the paper should be interpreted as implementation validation. It is not a new theory of reasoning. It asks whether an automated tool can produce useful upper bounds without a human spending hours tracing every architectural component.

The answer is yes, with overestimation.

Model	By-hand upper bound	JAX calculator upper bound	Approximate JAX excess
Gemma 3 1B	4,490	5,728	27.6%
Gemma 3 4B	6,206	7,958	28.2%
Gemma 3 12B	8,754	11,268	28.7%
Gemma 3 27B	11,662	14,856	27.4%
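
The excess column is simple arithmetic on the two bounds, shown here so the percentages can be checked directly. The figures are the ones in the table; the variable names are illustrative.

```python
# (by-hand upper bound, JAX calculator upper bound), taken from the table
BOUNDS = {
    "1B": (4_490, 5_728),
    "4B": (6_206, 7_958),
    "12B": (8_754, 11_268),
    "27B": (11_662, 14_856),
}


def excess_pct(by_hand, automated):
    """Percentage by which the automated bound exceeds the by-hand bound."""
    return 100 * (automated - by_hand) / by_hand


for name, (by_hand, automated) in BOUNDS.items():
    print(f"Gemma 3 {name}: {excess_pct(by_hand, automated):.1f}%")
```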

The overestimation comes from missed circuit optimizations. One example is a dot product followed by bias addition. A naïve intermediate representation may treat the bias addition as another serial step, while a human can fold it into the dot-product sum and reduce depth.

For governance use, that trade-off is attractive. The automated calculator is less tight, but it is fast and works from implementation. Manual calculation is tighter, but slower and easier to get wrong. In business language, this is the usual choice between a repeatable diagnostic and a bespoke expert ceremony. The ceremony may be better; the diagnostic scales.

The paper also uses the calculator for a proof-of-concept Mixture-of-Experts comparison. It implements a vanilla MoE Transformer with 91.32B total parameters, 11.42B active parameters, 28 layers, 64 experts per layer, 8 experts per token, and sequence length 512. The JAX-calculated depth is 4,096. That is far lower than the JAX-calculated 11,268 for Gemma 3 12B and lower than the 8,754 by-hand Gemma 3 12B bound.

This is best read as an exploratory architecture comparison. It suggests MoE may reduce serial depth relative to dense models because only a subset of experts is active for each token. It does not prove every MoE design is more monitorable, nor does it say MoE models are safer. The result is a clue, not a halo.

The misconception: monitorability is not mainly a prompting problem

The common mistake here is easy to understand. People see chain-of-thought as something the model chooses to provide, or something the prompt engineer extracts with sufficient politeness. “Please explain step by step” becomes the audit strategy. There are worse strategies, but that is a low bar. The floor is not a place to live.

The paper replaces that belief with a stronger one:

Reader belief	Correction	Why it matters
Chain-of-thought monitorability is mostly about prompting	Architecture determines whether long serial reasoning must pass through visible checkpoints	Prompting cannot recover reasoning that never enters an interpretable channel
Larger context automatically means more hidden reasoning	Standard Transformer depth grows logarithmically with sequence length in this analysis	Context size and hidden serial depth are different dimensions
Latent reasoning is just a performance optimization	Latent or recurrent computation may increase opaque serial depth	Efficiency gains may come with auditability costs
Memory improves agents without changing transparency	Black-box persistent memory can accumulate unbounded opaque computation	Long-running agents need memory governance, not just response review
A readable intermediate state explains the algorithm	A node may expose information content without revealing the reasoning process	Monitorability is not full interpretability

For business readers, this is the article’s central translation: chain-of-thought should not be evaluated only as a user-facing feature. It should be treated as part of the system’s control architecture.

What businesses can do with opaque serial depth

The paper directly provides a metric and a calculation method. Cognaptus’ practical inference is that opaque serial depth can become part of AI architecture due diligence, especially for models used in regulated or high-stakes workflows.

This does not mean every procurement team now needs to perform circuit-depth analysis before buying a chatbot. Please do not turn every software purchase into a PhD qualifying exam. But for systems that make decisions, trigger actions, use memory, or operate as agents, the metric points to concrete questions.

Due diligence question	Good answer	Riskier answer
What are the interpretable checkpoints in the system?	Natural-language tokens, structured logs, auditable tool calls, inspectable memory records	Hidden vectors, undocumented latent states, opaque memory
Does reasoning pass through visible tokens or structured traces?	Yes, for major planning and decision steps	No, reasoning is compressed into latent state
Does the architecture include recurrence or persistent black-box memory?	No, or memory is interpretable and versioned	Yes, memory stores opaque activations or learned state
Can the vendor estimate opaque serial depth or provide architectural details?	Provides calculation or enough architecture to compute one	Treats architecture as magic sauce, served with confidence
Are intermediate states optimized mainly for human-readable semantics?	Pretraining, supervised tuning, or human-evaluated traces dominate	Strong task reward optimizes traces as private control codes

The direct business value is not that opaque serial depth “certifies” a model. It does not. The value is cheaper diagnosis.

A risk team can use the concept to classify model architectures before deployment. A product team can decide whether a latent reasoning feature is acceptable for a customer-support bot but unacceptable for compliance review. A model governance function can require stronger logging when recurrence or persistent memory is introduced. An enterprise AI architect can treat hidden memory as a material design change rather than a harmless product enhancement.
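
One way a governance team might encode the due diligence questions is as a simple architecture profile with review flags. This is a hypothetical sketch: the class, field names, and flag wording are invented for illustration and do not come from the paper or from any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchitectureProfile:
    """Hypothetical record of the architectural facts a reviewer would collect."""

    reasoning_in_visible_tokens: bool
    uses_recurrence: bool
    uses_latent_cot: bool
    has_opaque_persistent_memory: bool


def review_flags(profile):
    """Return the reasons visible chain-of-thought monitoring may be insufficient."""
    flags = []
    if not profile.reasoning_in_visible_tokens:
        flags.append("major reasoning steps do not pass through visible tokens")
    if profile.uses_recurrence:
        flags.append("recurrent hidden state can carry serial computation")
    if profile.uses_latent_cot:
        flags.append("latent thoughts may bypass readable tokens")
    if profile.has_opaque_persistent_memory:
        flags.append("opaque memory can accumulate computation across invocations")
    return flags


standard_llm = ArchitectureProfile(True, False, False, False)
memory_agent = ArchitectureProfile(True, False, False, True)
print(review_flags(standard_llm))  # []
print(review_flags(memory_agent))
```

An empty list is not a safety certificate. It only means none of the architectural warning signs discussed here were recorded.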

The practical rule is:

The more the system can reason through non-interpretable intermediate states, the less confidence we should place in visible chain-of-thought monitoring alone.

That is an inference from the paper, not a theorem the paper directly proves for every deployed system. But it is a sensible operational reading.

What the paper shows, what we infer, and what remains uncertain

The paper is careful about its boundaries, and the business interpretation should be equally disciplined.

Layer	Statement
What the paper directly shows	Opaque serial depth can be formalized using circuit depth between interpretable nodes; upper bounds can be calculated for real architectures; architecture choices substantially alter these bounds.
What the Gemma 3 calculations show	Gemma 3 models have computable upper bounds at maximum sequence length, ranging from 4,490 for 1B to 11,662 for 27B in the by-hand analysis.
What the JAX calculator shows	Automated calculation gives fast, implementation-based upper bounds that overestimate by roughly 27–29% on Gemma 3 compared with by-hand calculations.
What the MoE example suggests	A vanilla MoE Transformer may have lower serial depth than a dense model of comparable active scale, but this is exploratory evidence.
What Cognaptus infers for businesses	Opaque serial depth can be used as a technical risk metric for architecture review, especially in agentic, memory-using, or high-stakes AI systems.
What remains uncertain	The metric is an upper bound, depends on judgment about interpretability, and does not prove that visible reasoning is faithful or sufficient.

This distinction matters because AI governance often fails in two opposite ways. One side treats every new metric as a compliance trophy. The other dismisses any imperfect metric as useless. Both are convenient. Neither is serious.

Opaque serial depth is useful because it gives architecture review a sharper vocabulary. It is not useful as a slogan.

The boundaries: upper bounds, interpretability judgment, and the missing algorithm

The first limitation is mathematical. The calculated depths are upper bounds, not exact depths. The true minimum circuit depth could be lower. The JAX calculator can overestimate because implementation representations are not optimized for shallow equivalent circuits. Manual calculations can tighten bounds, but at the cost of expert effort and possible human error.

The second limitation is interpretability. Opaque serial depth depends heavily on which nodes count as interpretable. Visible natural-language tokens are plausible candidates because models are trained on human text and often shaped by human feedback. But the paper does not pretend that “interpretable” has a clean formal definition.

The appendix discusses two broad ways to assess interpretability. One is question-answering: can a human or proxy evaluator answer relevant questions about the computation from the intermediate node? Monitorability evaluations fit here. The other is human-understandable information content: was the intermediate representation optimized to imitate or satisfy human-readable semantics, with limited pressure from other incentives?

This is where reasoning models become interesting. Chain-of-thought in instruction-tuned models is likely interpretable because it inherits a strong natural-language prior. But in reasoning models, reinforcement learning adds pressure for the trace to be useful for solving tasks. Today, the paper expects pretraining to remain the stronger force. If reinforcement learning is scaled substantially, that assumption may weaken. The trace may remain fluent while becoming less human-semantic. A beautifully written private code is still private code, only with better grammar.

The third limitation is that readable intermediate states do not necessarily expose the algorithm. A chain-of-thought trace may explain both content and reasoning because it resembles written human reasoning. But other systems, such as text diffusion models, may expose readable intermediate states without revealing the process that caused one state to become the next.

The fourth limitation is implementation coverage. The automated calculator supports a subset of JAX operations. New operations require hand-derived depth formulas. That is not a fatal flaw, but it means the tool is a starting point for infrastructure, not a universal scanner.

For business users, the practical boundary is this: opaque serial depth can inform architecture risk, but it should sit alongside behavioral evaluations, adversarial tests, monitoring performance, memory audits, and deployment controls.

The real lesson: visible reasoning is load-bearing infrastructure

The paper’s most important contribution is not the Gemma table, although the numbers are useful. It is not the JAX calculator, although repeatable tooling matters. It is the mechanism:

Model architecture determines how much serial reasoning can happen before the system must pass through an interpretable checkpoint.

That makes chain-of-thought less like a decorative explanation and more like an audit boundary. In standard autoregressive Transformers, visible tokens are not merely output. They are part of the control surface. Change the architecture, and that surface can move. Add recurrence, latent thoughts, or black-box memory, and the model may gain private routes for serial computation.

For companies adopting AI agents, this should shape the next generation of governance questions. The issue is not whether a model can produce a convincing explanation after the fact. Many systems can do that. The issue is whether the system’s actual reasoning had to pass through something inspectable while it was being constructed.

That difference is small in wording and large in consequence.

The future of AI monitoring will not be won by asking models to “be transparent” and hoping they take the corporate values workshop seriously. It will depend on building systems where the easiest way to solve hard tasks is also the easiest way to inspect them.

That is what opaque serial depth helps us measure.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jonah Brown-Cohen, David Lindner, and Rohin Shah, “Quantifying the Necessity of Chain of Thought through Opaque Serial Depth,” arXiv:2603.09786v1, 10 March 2026, https://arxiv.org/abs/2603.09786.
