How Sparse is Your Thought? Cracking the Inner Logic of Chain-of-Thought Prompts

TL;DR for operators

Chain-of-thought prompting is often sold as a window into model reasoning. This paper is more useful because it treats CoT as something less mystical and more testable: a prompt-induced change in internal representations.¹

The researchers train sparse autoencoders on hidden activations from two Pythia models solving GSM8K math problems under CoT and NoCoT prompts. They then patch CoT-derived sparse features into NoCoT runs and ask a sharper question: does inserting those internal features increase the log-probability of the correct answer?

For Pythia-2.8B, yes. CoT-derived features improve correct-answer confidence when transferred into NoCoT trajectories. For Pythia-70M, no reliable benefit appears, and some interventions degrade performance. The implication is not “CoT works.” That sentence is doing far too little work. The better interpretation is: CoT can create causally useful internal structure, but only when the model has enough representational capacity to organise and exploit that structure.

The paper also finds that the useful CoT signal is not concentrated only in the most activated features. In the larger model, random sets of CoT-activated features can outperform the Top-K features selected by activation difference. That is operationally interesting because it suggests CoT-related reasoning may be distributed across many moderately active features, not stored neatly in a heroic little reasoning neuron. Naturally, the model declined to organise itself for human convenience.

For product teams, the lesson is straightforward: use CoT as a design variable, not as evidence of truth. A serious deployment path looks like prompt design, internal or behavioural audit, causal or counterfactual validation where feasible, and then controlled rollout. CoT output may be useful, but it is not automatically a faithful transcript of reasoning. It is a mechanism to inspect, not a confession to believe.

The useful question is not whether CoT sounds reasonable

A familiar business pattern: an AI system is asked to make a recommendation, it gives a tidy answer, then it explains itself step by step. The explanation feels reassuring. It has numbered logic. It uses arithmetic. It may even contain the word “therefore,” that tiny bureaucratic stamp of intellectual respectability.

The obvious mistake is to treat the written chain as the model’s actual reasoning process.

The less obvious mistake is to reject CoT entirely because it can be unfaithful. That is too blunt. The more interesting question is whether CoT changes the model’s internal computation in ways that matter causally. In other words: even if the words are not a perfect transcript, does the prompt format push the model into a different, more useful internal state?

That is the question this paper tries to answer. It does not simply compare final accuracy under CoT and NoCoT. It looks inside the model’s activations, decomposes them into sparse features, swaps selected features between prompt conditions, and observes whether the model becomes more confident in the correct answer.

This is why the paper is best read mechanism-first. The headline is not “chain-of-thought improves reasoning.” We have already had enough of that slogan, thank you. The real story is how a prompting format appears to reorganise internal features in a larger model, and why the same procedure does not reliably help a much smaller one.

The mechanism: turn hidden activations into patchable features

The paper combines two interpretability tools: sparse autoencoders and activation patching.

A sparse autoencoder, in this setting, learns to reconstruct a model activation using a sparse set of latent features. The objective is roughly:

$$ L_{\text{total}} = L_{\text{recon}} + \lambda \lVert h \rVert_1 $$

Here h is the vector of latent feature activations and λ is the weight on the sparsity penalty. The reconstruction term rewards the autoencoder for preserving the original activation. The sparsity penalty encourages it to represent the activation using relatively few active latent dimensions. The hope is that these sparse dimensions correspond to more interpretable features than raw neurons, which are often polysemantic: a single neuron can respond to several unrelated concepts.
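To make the objective concrete, here is a minimal sketch of the loss for a single activation vector. The ReLU encoder, the squared-error reconstruction term, the untied weights, and every name and number below are illustrative assumptions; the article only gives the objective "roughly", so this should not be read as the authors' implementation.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, lam):
    """Return (total, recon, l1, h) for one activation vector x."""
    # Encoder: ReLU(W_enc x + b_enc). ReLU is an assumption, not from the paper.
    h = [max(0.0, z + b) for z, b in zip(matvec(W_enc, x), b_enc)]
    # Decoder: linear map from the latent vector back to activation space.
    x_hat = [z + b for z, b in zip(matvec(W_dec, h), b_dec)]
    # Reconstruction term: squared error (an assumed form of L_recon).
    recon = sum((a - r) ** 2 for a, r in zip(x, x_hat))
    # Sparsity term: L1 norm of the latent vector h.
    l1 = sum(abs(v) for v in h)
    return recon + lam * l1, recon, l1, h


# Toy numbers, invented purely for illustration.
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 latent features, 2 input dims
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # maps 3 latents back to 2 dims
b_dec = [0.0, 0.0]

total, recon, l1, h = sae_loss(x, W_enc, b_enc, W_dec, b_dec, lam=0.5)
print(h, recon, l1, total)
```

On these toy numbers only the first latent fires, so the reconstruction error is 4.0, the L1 term is 1.0, and the total is 4.0 + 0.5 × 1.0 = 4.5. Raising λ makes each active latent more expensive, which is the lever that pushes the dictionary toward fewer active features.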

The researchers train separate sparse autoencoders for CoT and NoCoT activations. They use two frozen EleutherAI Pythia models: Pythia-70M and Pythia-2.8B. Both are tested on GSM8K math word problems. The CoT condition uses three fixed few-shot examples with step-by-step solutions; the NoCoT condition uses only the current problem. The activations are extracted from the residual stream at layer 2, at the final token position.

That last detail matters. This is not a full movie of reasoning unfolding token by token. It is a snapshot from a specific layer and token position. Useful, but not omniscient. Interpretability work often begins with a flashlight and is then accused of not being the sun.

Once the sparse features are extracted, the key intervention is patching. For the same math problem under CoT and NoCoT conditions, the researchers replace selected NoCoT sparse feature values with their CoT counterparts, decode the patched sparse vector back into activation space, and continue the model forward. They then measure the change in log-probability assigned to the correct answer:

$$ \Delta \log p = \log p_{\text{patched}}(y^\ast) - \log p_{\text{original}}(y^\ast) $$

If inserting CoT-derived features into a NoCoT run increases the correct answer’s log-probability, then those features are not merely decorative. They carry causal influence over the model’s answer confidence.
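The patch-and-measure loop can be sketched in a few lines. This is a toy under loud assumptions: `readout` is a hypothetical stand-in for "the rest of the model" after the patched layer, the decoder is an identity matrix, and all values are invented. It shows the shape of the intervention, not the paper's code.

```python
import math


def patch_features(h_nocot, h_cot, indices):
    """Copy the NoCoT sparse vector, overwriting selected entries with CoT values."""
    patched = list(h_nocot)
    for i in indices:
        patched[i] = h_cot[i]
    return patched


def decode(h, W_dec, b_dec):
    """Map a sparse feature vector back into activation space."""
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W_dec, b_dec)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def delta_log_p(h_nocot, h_cot, indices, W_dec, b_dec, readout, target):
    """log p_patched(y*) - log p_original(y*) for the answer index `target`."""
    def log_p(h):
        return log_softmax(readout(decode(h, W_dec, b_dec)))[target]
    return log_p(patch_features(h_nocot, h_cot, indices)) - log_p(h_nocot)


# Toy setup: 3 sparse features, identity decoder, and a hypothetical readout
# that treats the decoded activation directly as answer logits.
h_nocot = [0.0, 1.0, 0.0]
h_cot = [2.0, 1.0, 0.5]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b_dec = [0.0, 0.0, 0.0]
readout = lambda activation: activation

gain = delta_log_p(h_nocot, h_cot, [0], W_dec, b_dec, readout, target=0)
print(round(gain, 4))
```

A positive value means the patched run assigns more probability to the correct answer than the unpatched NoCoT run. Patching an empty index set returns exactly zero, which is a cheap sanity check worth keeping in any real harness.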

What each experiment is really for

The paper contains several tests that can blur together if read as a normal results section. They should not be read that way. Each test has a different job.

Test	Likely purpose	What it supports	What it does not prove
Feature explanation scores	Semantic interpretability check	Whether SAE features under CoT are easier to describe consistently	That the natural-language CoT is faithful
CoT-to-NoCoT activation patching	Main causal evidence	Whether CoT-derived features improve answer confidence when inserted into NoCoT runs	That all reasoning steps are causally traced
NoCoT-to-CoT reverse patching	Directionality check	Whether the useful signal is asymmetric across prompt modes	That CoT has a single clean causal pathway
Top-K patch curves	Sensitivity and accumulation test	How performance changes as more high-difference features are patched	That the most activated features are always the most important
Random-K patching	Control and distribution test	Whether useful CoT information is broadly distributed across features	That random features are universally better
Activation sparsity analysis	Structural explanation	Whether CoT creates more selective internal activation patterns	That sparsity alone guarantees correctness

The main evidence is the patching result. The interpretability and sparsity tests explain why that result may occur. The Random-K result is not a cute appendix detail; it changes the interpretation of where the CoT signal lives.

Larger models turn CoT into interpretable features; smaller models mostly shrug

The feature interpretation results are the first clue that scale matters.

For Pythia-70M, CoT and NoCoT explanation scores are nearly indistinguishable. The mean explanation score is 0.018 under CoT and 0.016 under NoCoT, with a t-statistic of 0.082 and a p-value of 0.935. The paper also notes that the NoCoT median looks slightly better in the box plot. In plain English: there is no meaningful evidence that CoT makes sparse features more interpretable in the 70M model.

For Pythia-2.8B, the picture changes. The CoT mean explanation score is 0.056, compared with -0.013 under NoCoT. The reported t-statistic is 2.96 with p = 0.004. The distribution under CoT also includes features with substantially higher explanation scores, reaching around 0.6.

Model	CoT mean	NoCoT mean	Statistical result	Interpretation
Pythia-70M	0.018	0.016	t = 0.082, p = 0.935	No meaningful CoT interpretability advantage
Pythia-2.8B	0.056	-0.013	t = 2.96, p = 0.004	CoT features are more semantically interpretable

This does not mean the larger model’s written chain-of-thought is a reliable explanation. It means that, under this setup, CoT prompts are associated with sparse internal features that are easier to label semantically. That is weaker than “faithful reasoning,” but much stronger than “the explanation sounds nice.”

For operators, the distinction matters. A visible rationale can be polished nonsense. A causally useful internal feature is a different object entirely.
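For readers who want to see what sits behind a reported t-statistic, the sketch below computes a two-sample statistic from two lists of per-feature explanation scores. It assumes Welch's unequal-variance form; the article does not say which variant the paper used, and the scores here are invented, not the paper's data.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's two-sample t-statistic for samples a and b."""
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Invented explanation scores for illustration only.
cot_scores = [0.10, 0.02, 0.30, -0.05, 0.08]
nocot_scores = [0.01, -0.03, 0.02, -0.04, 0.00]

print(welch_t(cot_scores, nocot_scores))
```

The statistic grows when the gap between the means is large relative to the spread within each group. That is why a visibly higher CoT mean in the small model can still produce a t near zero: the difference is small next to the noise.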

The causal test: patch CoT features into NoCoT runs

The most important experiment is activation patching.

The researchers patch selected CoT-derived features into NoCoT forward passes and measure whether the correct answer becomes more likely. They also test the reverse direction, replacing CoT features with NoCoT features.

In Pythia-2.8B, CoT-to-NoCoT patching consistently improves the log-probability of the correct answer. The distributions shift positive under both dictionary ratios tested. The patch curves show that only a small number of CoT features can produce a large gain. Under dictionary ratio 4, the CoT-to-NoCoT curve jumps above +2.5 log-prob at K = 1 and then settles around +1.8. Under dictionary ratio 8, the gain exceeds +3.2 at K = 2 and stabilises around +2.4.

That is the paper’s strongest evidence that CoT induces causally useful internal features in the larger model.

Pythia-70M behaves differently. CoT-to-NoCoT patching is unstable and often harmful. Under dictionary ratio 4, the curve declines after K = 1 and eventually reaches around -8 log-prob. Under ratio 8, the decline is less severe, around -3 at its lower point, but the direction is still not beneficial. The distribution of effects is high-variance, with both large gains and losses. The small model appears unable to encode CoT features in a form that transfers reliably into NoCoT trajectories.

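The mechanism therefore has a scale threshold. Not a magical one, and certainly not a universal one, but a threshold in this experiment: 2.8B shows stable causal benefit; 70M does not. With only two model sizes tested, the experiment cannot say where between them the transition sits.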

The surprise: random CoT features can beat the obvious Top-K choice

A normal feature-selection instinct says: find the features with the largest activation difference between CoT and NoCoT, patch those, and expect the strongest effect. The paper calls this Top-K patching.

Then Random-K misbehaves.

In Pythia-2.8B, randomly sampled CoT-activated features often outperform the Top-K features. The paper reports one example where correct-answer confidence improves from 1.2 to 4.3. That finding matters because it changes the mental model of CoT.

If Top-K dominated, the story would be simple: CoT activates a small set of high-impact reasoning features, and the analyst’s job is to find them. But Random-K doing better suggests a more distributed structure. Useful CoT information is spread across many moderately activated features. The most visibly different features are not necessarily the most causally complete set.

This is a useful irritant for AI governance. Many audit workflows implicitly assume that importance is concentrated: find the most salient token, the largest activation, the highest attribution, the dominant component. The paper suggests that, at least here, CoT’s causal signal is not neatly stored in the brightest lights. It is distributed across the wiring.

That has two implications.

First, internal audits should not rely too heavily on a single ranking method. Top activation difference may miss supporting features that matter in combination. Second, prompt-induced reasoning may be more compositional than dashboard metrics imply. A model can distribute useful signal broadly enough that random subsets capture more of the reasoning state than a brittle “top features only” strategy.

Annoying, yes. Also plausible. Neural networks have never shown much interest in making org charts.
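The two selection rules discussed in this section differ in only a few lines of code. In the sketch below, ranking by absolute activation difference and treating any feature with a positive CoT activation as "CoT-activated" are both assumptions; the article does not spell out the paper's exact criteria, and the vectors are invented.

```python
import random


def top_k_features(h_cot, h_nocot, k):
    """Indices of the k features with the largest |CoT - NoCoT| activation gap."""
    gaps = [abs(c - n) for c, n in zip(h_cot, h_nocot)]
    ranked = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return ranked[:k]


def random_k_features(h_cot, k, rng, eps=0.0):
    """k indices sampled uniformly from features active under CoT (activation > eps)."""
    active = [i for i, c in enumerate(h_cot) if c > eps]
    return sorted(rng.sample(active, min(k, len(active))))


# Invented sparse activations for one problem under each prompt condition.
h_cot = [0.0, 3.0, 0.2, 0.0, 1.0]
h_nocot = [0.0, 0.5, 0.1, 0.4, 0.9]

rng = random.Random(0)  # fixed seed so the draw is repeatable
print(top_k_features(h_cot, h_nocot, 2))
print(random_k_features(h_cot, 2, rng))
```

Top-K always returns the same high-contrast features. Random-K draws from everything CoT switched on, including features whose activation barely differs from the NoCoT run. If the second rule wins, the useful signal was not confined to the high-contrast set.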

Sparsity explains why the random result is not just noise

The sparsity analysis is best understood as structural explanation, not as a separate thesis.

The researchers compare residual activation sparsity under CoT and NoCoT prompts. CoT leads to significantly sparser residual activations in both models: most neurons stay near zero, with a smaller subset strongly activated. The effect is more pronounced in Pythia-2.8B.

Then they examine activated neuron counts per SAE feature. Under NoCoT, features tend to involve broader neuron activation. Under CoT, each feature tends to activate fewer neurons. In the larger model, the pattern becomes more extreme: many CoT features are supported by only a handful of neurons, and there is higher variation across features.

That combination matters. CoT does not merely suppress everything. In the larger model, it appears to create structured sparsity: some features are highly concentrated, while others recruit broader support. The paper interprets this as more strategic representational allocation. One can quibble with the word “strategic,” since the model is not sitting there with a resource allocation spreadsheet, but the functional pattern is clear enough: CoT changes how activation mass is distributed.

This helps explain Random-K. If useful CoT signal is spread across many moderately active features, and the model uses a sparse but distributed internal representation, then selecting only the largest activation differences can overfit to local peaks. Random sampling may capture a broader causal bundle.

In the 70M model, CoT also increases sparsity to some extent, but the useful structure does not emerge strongly enough. The small model’s CoT and NoCoT feature distributions remain more similar, and the patched features are noisy or mismatched. Sparse does not automatically mean useful. A clean-looking drawer can still contain junk.
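The residual-sparsity comparison described above reduces to counting how many activation entries sit near zero. The tolerance below and the toy activations are assumptions for illustration; the article does not state the paper's exact sparsity metric.

```python
def near_zero_fraction(activation, tol=1e-3):
    """Fraction of entries whose magnitude is at most tol."""
    return sum(1 for a in activation if abs(a) <= tol) / len(activation)


def mean_near_zero_fraction(activations, tol=1e-3):
    """Average near-zero fraction over a collection of activation vectors."""
    return sum(near_zero_fraction(a, tol) for a in activations) / len(activations)


# Invented residual activations, two examples per prompt condition.
cot_acts = [[0.0, 0.0, 2.1, 0.0], [0.0, -1.7, 0.0, 0.0004]]
nocot_acts = [[0.3, -0.2, 0.5, 0.0], [0.1, 0.4, -0.6, 0.2]]

print(mean_near_zero_fraction(cot_acts))
print(mean_near_zero_fraction(nocot_acts))
```

A higher number means a sparser representation. On its own it says nothing about usefulness, which is the point of the 70M result: the count can go up while the patching benefit stays absent.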

What the paper directly shows, and what operators can infer

The business value of this paper is not that companies should force every model to “think step by step.” That advice aged badly before it was fully printed. The real value is a better diagnostic frame.

Level	Statement	Status
Direct paper result	CoT-derived SAE features improve NoCoT correct-answer log-probabilities in Pythia-2.8B on GSM8K.	Directly shown
Direct paper result	The same intervention does not reliably help Pythia-70M and can degrade performance.	Directly shown
Direct paper result	CoT features in Pythia-2.8B have higher explanation scores than NoCoT features.	Directly shown
Direct paper result	CoT induces sparser residual and feature-level activation patterns, especially in Pythia-2.8B.	Directly shown
Cognaptus inference	CoT should be treated as a structured prompting control, not as a faithful explanation by default.	Operational inference
Cognaptus inference	Prompting strategy should be validated with behavioural, causal, or counterfactual tests before being used in high-stakes workflows.	Operational inference
Still uncertain	Whether the same mechanisms hold in frontier proprietary models, other tasks, other layers, or token-level reasoning traces.	Open boundary

The practical workflow looks like this:

Prompt design: Use CoT, scratchpads, or structured reasoning prompts as experimental treatments, not as proof of cognition.
Behavioural evaluation: Check whether they improve task outcomes under realistic data conditions.
Internal or counterfactual audit: Where feasible, test whether the prompt changes internal representations or causal behaviours, not just the visible rationale.
Robustness checks: Vary prompt templates, model sizes, feature-selection methods, and task distributions.
Deployment control: Treat rationales as user-interface artefacts unless validated as causally connected to outputs.

This is especially relevant for regulated or semi-regulated workflows: credit review, medical triage support, contract analysis, compliance screening, and investment research. In these settings, a neat explanation can become a liability if it persuades reviewers without reflecting the actual decision path.

The paper does not give operators a plug-and-play audit kit. It gives them a warning label and a research direction: prompt-induced reasoning needs to be tested as an internal mechanism, not admired as prose.
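The behavioural-evaluation and robustness steps in the workflow above can start from something very small. The sketch below scores several prompt treatments on the same examples and counts paired flips between a baseline and a treatment. The `treatments` callables are hypothetical stand-ins for real model calls, and the examples are invented.

```python
def accuracy_by_treatment(examples, treatments):
    """examples: list of (question, gold); treatments: name -> callable(question) -> answer."""
    results = {}
    for name, answer_fn in treatments.items():
        correct = sum(1 for question, gold in examples if answer_fn(question) == gold)
        results[name] = correct / len(examples)
    return results


def paired_flips(examples, baseline_fn, treatment_fn):
    """Count examples the treatment fixed (gained) and examples it broke (lost)."""
    gained = lost = 0
    for question, gold in examples:
        base_ok = baseline_fn(question) == gold
        treat_ok = treatment_fn(question) == gold
        if treat_ok and not base_ok:
            gained += 1
        elif base_ok and not treat_ok:
            lost += 1
    return gained, lost


# Hypothetical stand-ins: a lookup table plays the role of a CoT-prompted model,
# and a constant answer plays the role of a NoCoT-prompted model.
examples = [("2+2", "4"), ("3+3", "6"), ("5+1", "6")]
lookup = {"2+2": "4", "3+3": "6", "5+1": "6"}
treatments = {"nocot": lambda q: "4", "cot": lambda q: lookup[q]}

print(accuracy_by_treatment(examples, treatments))
print(paired_flips(examples, treatments["nocot"], treatments["cot"]))
```

Reporting gained and lost separately matters more than the net accuracy change. A treatment that fixes ten cases and breaks eight has a very different risk profile from one that fixes two and breaks none.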

The uncomfortable part: CoT may help without being a faithful transcript

The likely misconception is simple: if the model writes a chain of thought, the chain is what the model used to think.

This paper argues for a subtler replacement belief: CoT can reshape internal computation in ways that make reasoning more structured and causally useful, but the written explanation is still not guaranteed to be a faithful transcript.

That distinction is uncomfortable because it removes two easy positions.

The first easy position is naive trust: “The model explained itself, so we understand it.” No. You understand a generated explanation, which may or may not correspond to the decision process.

The second easy position is cynical dismissal: “CoT is just theatre.” Also no. In Pythia-2.8B, CoT-induced features have measurable causal effects when patched into NoCoT runs. Theatre does not usually increase correct-answer log-probability after internal feature transfer.

The better position is operationally less elegant but more accurate: CoT is a prompt-induced computational regime. Sometimes it organises internal representations. Sometimes it does not. The difference depends on model capacity, task structure, and the measurement lens.

That is less tweetable. A pity. It is also more useful.

Boundary conditions that materially affect interpretation

The limitations here are not decorative. They directly shape how far the result can travel.

First, the intervention targets residual activations at layer 2 and the final token position. That gives a snapshot, not a causal trace through every reasoning step. If a business team cares about whether a model followed a particular multi-step procedure, this paper does not prove that token-level chain.

Second, the experiments use Pythia-70M and Pythia-2.8B, not LLaMA-scale, GPT-scale, Claude-scale, or domain-fine-tuned enterprise models. The result establishes a scale-sensitive pattern inside this setup. It does not license universal claims about all large models.

Third, the dataset is GSM8K, and the analysis uses the training split for activation collection and evaluation. Math word problems are useful because they have clear answers, but enterprise reasoning tasks often involve ambiguity, retrieval, tool calls, policy constraints, and messy documents. Welcome to reality, where datasets go to lose their manners.

Fourth, the semantic interpretation scores rely on LLM-generated explanations of SAE features. That is useful but indirect. The paper’s stronger evidence is patching, not the mere fact that features can be described.

Fifth, SAE-based analysis itself has biases. Sparse autoencoders can miss distributed or entangled representations, and not every interpretable feature is causally important. The authors acknowledge this. Operators should too.

These boundaries do not make the paper weak. They make it readable. The danger would be converting a careful mechanistic result into a general claim that “CoT makes models faithful.” That would be exactly the sort of reasoning failure the paper is trying to help us avoid.

The business value is diagnosis, not decoration

For AI product teams, the immediate takeaway is not “show more reasoning to users.” In many contexts, showing hidden reasoning may be unnecessary, confusing, or even undesirable. The deeper value is diagnostic.

CoT can be used as a controlled intervention: change the prompt format, measure behaviour, inspect representations where possible, and test whether the change is causally connected to better outputs. That is a much more mature posture than treating prompt engineering as folklore with version control.

A useful internal evaluation might ask:

Does CoT improve performance on the task distribution that actually matters?
Does it improve calibration, or merely verbosity?
Does the benefit survive prompt variation?
Does it depend strongly on model size?
Do counterfactual or patching-style tests suggest that the reasoning scaffold changes causal behaviour?
Are explanations being used as user persuasion, model debugging, or compliance evidence? These are not the same job.

The paper’s scale result is especially important for cost-sensitive deployments. Smaller models may not benefit from CoT in the same way larger models do. Worse, forcing CoT-like structures into a small model’s internal trajectory may introduce representational conflict. If a team is using a compact model for latency or cost reasons, it should not assume that CoT prompting imports the same reasoning machinery observed in larger systems.

The operational lesson is not anti-CoT. It is anti-laziness.

Conclusion: sparse thought is still not transparent thought

The paper gives us a more disciplined way to talk about chain-of-thought prompting. CoT is not automatically a faithful explanation, but neither is it merely decorative text. In the larger Pythia model studied here, CoT induces sparse, more interpretable, and causally useful internal features. In the smaller model, those benefits largely fail to materialise.

That is the useful shape of the finding: mechanism, scale, causality, boundary.

For business readers, the lesson is to stop treating rationales as evidence by default. A written chain may improve user trust, but trust is not the same as faithfulness. CoT should be evaluated as a structured control over model computation. Sometimes it helps organise the model’s internal features. Sometimes it is just a verbose passenger.

The good news is that papers like this make the question testable. The bad news is that testing is work. As usual, reality has declined the invitation to become a dashboard metric.

Cognaptus: Automate the Present, Incubate the Future.

¹ Xi Chen, Aske Plaat, and Niki van Stein, “How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding,” arXiv:2507.22928, 2025. https://arxiv.org/abs/2507.22928
