TL;DR for operators
Chain-of-thought prompting is often sold as a window into model reasoning. This paper is more useful because it treats CoT as something less mystical and more testable: a prompt-induced change in internal representations.[1]
The researchers train sparse autoencoders on hidden activations from two Pythia models solving GSM8K math problems under CoT and NoCoT prompts. They then patch CoT-derived sparse features into NoCoT runs and ask a sharper question: does inserting those internal features increase the log-probability of the correct answer?
For Pythia-2.8B, yes. CoT-derived features improve correct-answer confidence when transferred into NoCoT trajectories. For Pythia-70M, no reliable benefit appears, and some interventions degrade performance. The implication is not “CoT works.” That sentence is doing far too little work. The better interpretation is: CoT can create causally useful internal structure, but only when the model has enough representational capacity to organise and exploit that structure.
The paper also finds that the useful CoT signal is not concentrated only in the most activated features. In the larger model, random sets of CoT-activated features can outperform the Top-K features selected by activation difference. That is operationally interesting because it suggests CoT-related reasoning may be distributed across many moderately active features, not stored neatly in a heroic little reasoning neuron. Naturally, the model declined to organise itself for human convenience.
For product teams, the lesson is straightforward: use CoT as a design variable, not as evidence of truth. A serious deployment path looks like prompt design, internal or behavioural audit, causal or counterfactual validation where feasible, and then controlled rollout. CoT output may be useful, but it is not automatically a faithful transcript of reasoning. It is a mechanism to inspect, not a confession to believe.
The useful question is not whether CoT sounds reasonable
A familiar business pattern: an AI system is asked to make a recommendation, it gives a tidy answer, then it explains itself step by step. The explanation feels reassuring. It has numbered logic. It uses arithmetic. It may even contain the word “therefore,” that tiny bureaucratic stamp of intellectual respectability.
The obvious mistake is to treat the written chain as the model’s actual reasoning process.
The less obvious mistake is to reject CoT entirely because it can be unfaithful. That is too blunt. The more interesting question is whether CoT changes the model’s internal computation in ways that matter causally. In other words: even if the words are not a perfect transcript, does the prompt format push the model into a different, more useful internal state?
That is the question this paper tries to answer. It does not simply compare final accuracy under CoT and NoCoT. It looks inside the model’s activations, decomposes them into sparse features, swaps selected features between prompt conditions, and observes whether the model becomes more confident in the correct answer.
This is why the paper is best read mechanism-first. The headline is not “chain-of-thought improves reasoning.” We have already had enough of that slogan, thank you. The real story is how a prompting format appears to reorganise internal features in a larger model, and why the same procedure does not reliably help a much smaller one.
The mechanism: turn hidden activations into patchable features
The paper combines two interpretability tools: sparse autoencoders and activation patching.
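A sparse autoencoder, in this setting, learns to reconstruct a model activation using a sparse set of latent features. The objective is roughly:

$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1$$

Here $x$ is the original activation, $\hat{x}$ is its reconstruction, $z$ is the vector of latent features, and $\lambda$ weights the sparsity penalty.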
The reconstruction term rewards the autoencoder for preserving the original activation. The sparsity penalty encourages it to represent the activation using relatively few active latent dimensions. The hope is that these sparse dimensions correspond to more interpretable features than raw neurons, which are often polysemantic.
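The two terms described above can be made concrete with a toy example. This is a minimal sketch, not the paper's training code: it assumes a ReLU encoder, a linear decoder, a squared-error reconstruction term, and an L1 sparsity penalty, and every name in it (`sae_loss`, `W_enc`, `W_dec`, `l1_weight`) is hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_weight):
    """Return (total, reconstruction, sparsity, z) for one activation x."""
    # Encoder (assumed form): affine map followed by ReLU, so latents are >= 0.
    z = [max(0.0, p + b) for p, b in zip(matvec(W_enc, x), b_enc)]
    # Decoder (assumed form): linear map from latents back to activation space.
    x_hat = [p + b for p, b in zip(matvec(W_dec, z), b_dec)]
    # Reconstruction term: squared error between activation and reconstruction.
    reconstruction = sum((a - r) ** 2 for a, r in zip(x, x_hat))
    # Sparsity term: L1 norm of the latent vector.
    sparsity = sum(abs(v) for v in z)
    return reconstruction + l1_weight * sparsity, reconstruction, sparsity, z


# Toy setup: a 2-dimensional activation and 4 latent features.
W_enc = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1, -1, 0, 0], [0, 0, 1, -1]]
b_dec = [0.0, 0.0]

total, reconstruction, sparsity, z = sae_loss(
    [0.5, -2.0], W_enc, b_enc, W_dec, b_dec, l1_weight=0.1
)
print(z, reconstruction, sparsity, total)
```

In this toy case only two of the four latents are active and the reconstruction is exact, so the whole loss comes from the sparsity penalty. Raising `l1_weight` would push a trained autoencoder toward even fewer active latents at the cost of reconstruction quality.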
The researchers train separate sparse autoencoders for CoT and NoCoT activations. They use two frozen EleutherAI Pythia models: Pythia-70M and Pythia-2.8B. Both are tested on GSM8K math word problems. The CoT condition uses three fixed few-shot examples with step-by-step solutions; the NoCoT condition uses only the current problem. The activations are extracted from the residual stream at layer 2, at the final token position.
That last detail matters. This is not a full movie of reasoning unfolding token by token. It is a snapshot from a specific layer and token position. Useful, but not omniscient. Interpretability work often begins with a flashlight and is then accused of not being the sun.
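Once the sparse features are extracted, the key intervention is patching. For the same math problem under CoT and NoCoT conditions, the researchers replace selected NoCoT sparse feature values with their CoT counterparts, decode the patched sparse vector back into activation space, and continue the model forward. They then measure the change in log-probability assigned to the correct answer:

$$\Delta \log P = \log P_{\text{patched}}(y^{*}) - \log P_{\text{NoCoT}}(y^{*})$$

Here $y^{*}$ is the correct answer, $P_{\text{patched}}$ is the model's output distribution after patching, and $P_{\text{NoCoT}}$ is the distribution from the unpatched NoCoT run.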
If inserting CoT-derived features into a NoCoT run increases the correct answer’s log-probability, then those features are not merely decorative. They carry causal influence over the model’s answer confidence.
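The patching procedure can be sketched on a toy model. This is an illustration under stated assumptions, not the authors' implementation: the `head` function is a hypothetical stand-in for all the layers after the patched position, the baseline is taken to be the unpatched SAE reconstruction, and every name here is invented for the example.

```python
import math


def decode(z, W_dec):
    """Map a sparse feature vector back into activation space."""
    return [sum(w * v for w, v in zip(row, z)) for row in W_dec]


def log_prob(logits, index):
    """Log-softmax probability of one class, computed stably."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[index] - log_norm


def patch_features(z_nocot, z_cot, indices):
    """Copy the NoCoT features, overwriting the chosen ones with CoT values."""
    patched = list(z_nocot)
    for i in indices:
        patched[i] = z_cot[i]
    return patched


def delta_log_prob(z_nocot, z_cot, indices, W_dec, head, correct):
    """Change in correct-answer log-probability caused by the patch."""
    def run(z):
        return log_prob(head(decode(z, W_dec)), correct)
    return run(patch_features(z_nocot, z_cot, indices)) - run(z_nocot)


# Toy setup: 3 sparse features, a 2-dimensional activation, 2 answer classes.
W_dec = [[1, 0, 0], [0, 1, 1]]
head = lambda activation: activation  # hypothetical: logits equal the activation
z_nocot = [1.0, 0.0, 0.0]
z_cot = [1.0, 2.0, 0.5]

gain = delta_log_prob(z_nocot, z_cot, [1], W_dec, head, correct=1)
no_patch = delta_log_prob(z_nocot, z_cot, [], W_dec, head, correct=1)
print(gain, no_patch)
```

A positive value means the patched features made the correct answer more likely; patching nothing gives exactly zero, which is a useful sanity check before running any real intervention.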
What each experiment is really for
The paper contains several tests that can blur together if read as a normal results section. They should not. Each test has a different job.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Feature explanation scores | Semantic interpretability check | Whether SAE features under CoT are easier to describe consistently | That the natural-language CoT is faithful |
| CoT-to-NoCoT activation patching | Main causal evidence | Whether CoT-derived features improve answer confidence when inserted into NoCoT runs | That all reasoning steps are causally traced |
| NoCoT-to-CoT reverse patching | Directionality check | Whether the useful signal is asymmetric across prompt modes | That CoT has a single clean causal pathway |
| Top-K patch curves | Sensitivity and accumulation test | How performance changes as more high-difference features are patched | That the most activated features are always the most important |
| Random-K patching | Control and distribution test | Whether useful CoT information is broadly distributed across features | That random features are universally better |
| Activation sparsity analysis | Structural explanation | Whether CoT creates more selective internal activation patterns | That sparsity alone guarantees correctness |
The main evidence is the patching result. The interpretability and sparsity tests explain why that result may occur. The Random-K result is not a cute appendix detail; it changes the interpretation of where the CoT signal lives.
Larger models turn CoT into interpretable features; smaller models mostly shrug
The feature interpretation results are the first clue that scale matters.
For Pythia-70M, CoT and NoCoT explanation scores are nearly indistinguishable. The mean explanation score is 0.018 under CoT and 0.016 under NoCoT, with a t-statistic of 0.082 and a p-value of 0.935. The paper also notes that the NoCoT median looks slightly better in the box plot. In plain English: there is no meaningful evidence that CoT makes sparse features more interpretable in the 70M model.
For Pythia-2.8B, the picture changes. The CoT mean explanation score is 0.056, compared with -0.013 under NoCoT. The reported t-statistic is 2.96 with p = 0.004. The distribution under CoT also includes features with substantially higher explanation scores, reaching around 0.6.
| Model | CoT mean | NoCoT mean | Statistical result | Interpretation |
|---|---|---|---|---|
| Pythia-70M | 0.018 | 0.016 | t = 0.082, p = 0.935 | No meaningful CoT interpretability advantage |
| Pythia-2.8B | 0.056 | -0.013 | t = 2.96, p = 0.004 | CoT features are more semantically interpretable |
This does not mean the larger model’s written chain-of-thought is a reliable explanation. It means that, under this setup, CoT prompts are associated with sparse internal features that are easier to label semantically. That is weaker than “faithful reasoning,” but much stronger than “the explanation sounds nice.”
For operators, the distinction matters. A visible rationale can be polished nonsense. A causally useful internal feature is a different object entirely.
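For readers who want to see what sits behind a t-statistic like the ones reported above, here is a minimal sketch of a two-sample comparison of mean scores. It assumes Welch's unequal-variance form, which the source does not confirm, and the sample values are made up purely for illustration; they are not the paper's explanation scores.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t-statistic for the difference between two sample means."""
    # Standard error of the difference, using each sample's own variance.
    standard_error = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Made-up scores for two conditions (illustrative only).
scores_a = [1, 2, 3, 4, 5]
scores_b = [2, 3, 4, 5, 6]
t_stat = welch_t(scores_a, scores_b)
print(t_stat)
```

The statistic grows when the means are far apart relative to the spread within each group. Turning it into a p-value additionally requires the t-distribution, which is outside the standard library and is left out of this sketch.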
The causal test: patch CoT features into NoCoT runs
The most important experiment is activation patching.
The researchers patch selected CoT-derived features into NoCoT forward passes and measure whether the correct answer becomes more likely. They also test the reverse direction, replacing CoT features with NoCoT features.
In Pythia-2.8B, CoT-to-NoCoT patching consistently improves the log-probability of the correct answer. The distributions shift positive under both dictionary ratios tested. The patch curves show that even a small number of CoT features can produce a large gain. Under dictionary ratio 4, the CoT-to-NoCoT curve jumps above +2.5 log-prob at K = 1 and then settles around +1.8. Under dictionary ratio 8, the gain exceeds +3.2 at K = 2 and stabilises around +2.4.
That is the paper’s strongest evidence that CoT induces causally useful internal features in the larger model.
Pythia-70M behaves differently. CoT-to-NoCoT patching is unstable and often harmful. Under dictionary ratio 4, the curve declines after K = 1 and eventually reaches around -8 log-prob. Under ratio 8, the decline is less severe, around -3 at its lower point, but the direction is still not beneficial. The distribution of effects is high-variance, with both large gains and losses. The small model appears unable to encode CoT features in a form that transfers reliably into NoCoT trajectories.
The mechanism therefore has a scale threshold. Not a magical one, and certainly not a universal one, but a threshold in this experiment: 2.8B shows stable causal benefit; 70M does not.
The surprise: random CoT features can beat the obvious Top-K choice
A normal feature-selection instinct says: find the features with the largest activation difference between CoT and NoCoT, patch those, and expect the strongest effect. The paper calls this Top-K patching.
Then Random-K misbehaves.
In Pythia-2.8B, randomly sampled CoT-activated features often outperform the Top-K features. The paper reports one example where correct-answer confidence improves from 1.2 to 4.3. That finding matters because it changes the mental model of CoT.
If Top-K dominated, the story would be simple: CoT activates a small set of high-impact reasoning features, and the analyst’s job is to find them. But Random-K doing better suggests a more distributed structure. Useful CoT information appears to be spread across many moderately activated features. The most visibly different features are not necessarily the most causally complete set.
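The two selection strategies differ only in how the feature indices are chosen. The sketch below is one plausible reading, with assumptions stated plainly: Top-K ranks features by absolute CoT-minus-NoCoT activation difference, and Random-K samples uniformly from features that are active (above a threshold) under CoT. The function names, the threshold, and the seed are all hypothetical.

```python
import random


def top_k_by_difference(z_cot, z_nocot, k):
    """Indices of the k features with the largest |CoT - NoCoT| difference."""
    diffs = [abs(c - n) for c, n in zip(z_cot, z_nocot)]
    order = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return order[:k]


def random_k_active(z_cot, k, seed=0, threshold=0.0):
    """Indices of k features sampled from those active under CoT."""
    active = [i for i, v in enumerate(z_cot) if v > threshold]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return sorted(rng.sample(active, min(k, len(active))))


# Toy feature vectors for one problem under the two prompt conditions.
z_cot = [0.0, 3.0, 0.2, 1.5, 0.0, 0.4]
z_nocot = [0.1, 0.5, 0.2, 1.4, 0.0, 0.0]

top_indices = top_k_by_difference(z_cot, z_nocot, k=2)
random_indices = random_k_active(z_cot, k=2)
print(top_indices, random_indices)
```

Either index list can then be fed to a patching routine. Comparing the two over many seeds and several values of K is what separates "a few dominant features" from "a distributed signal".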
This is a useful irritant for AI governance. Many audit workflows implicitly assume that importance is concentrated: find the most salient token, the largest activation, the highest attribution, the dominant component. The paper suggests that, at least here, CoT’s causal signal is not neatly stored in the brightest lights. It is distributed across the wiring.
That has two implications.
First, internal audits should not rely too heavily on a single ranking method. Top activation difference may miss supporting features that matter in combination. Second, prompt-induced reasoning may be more compositional than dashboard metrics imply. A model can distribute useful signal broadly enough that random subsets capture more of the reasoning state than a brittle “top features only” strategy.
Annoying, yes. Also plausible. Neural networks have never shown much interest in making org charts.
Sparsity explains why the random result is not just noise
The sparsity analysis is best understood as structural explanation, not as a separate thesis.
The researchers compare residual activation sparsity under CoT and NoCoT prompts. CoT leads to significantly sparser residual activations in both models: most neurons stay near zero, with a smaller subset strongly activated. The effect is more pronounced in Pythia-2.8B.
Then they examine activated neuron counts per SAE feature. Under NoCoT, features tend to involve broader neuron activation. Under CoT, each feature tends to activate fewer neurons. In the larger model, the pattern becomes more extreme: many CoT features are supported by only a handful of neurons, and there is higher variation across features.
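Both sparsity measures reduce to counting values above or below a threshold. The sketch below is a rough illustration only: the threshold `eps` is arbitrary, and counting a feature's "activated neurons" from the magnitude of its decoder direction is an assumption made for this example, since the source does not spell out the paper's exact procedure.

```python
def near_zero_fraction(activation, eps=1e-2):
    """Fraction of neurons whose activation magnitude is below eps."""
    return sum(1 for a in activation if abs(a) < eps) / len(activation)


def active_neuron_count(feature_direction, eps=1e-2):
    """Number of neurons a feature touches (assumed: weights at or above eps)."""
    return sum(1 for w in feature_direction if abs(w) >= eps)


# Toy residual activations for the same problem under two prompt conditions.
residual_cot = [0.0, 0.001, 2.0, -0.005]
residual_nocot = [0.3, -0.2, 0.5, 0.001]

# Toy decoder directions for two hypothetical SAE features.
concentrated_feature = [0.0, 0.0, 1.2, 0.0]
broad_feature = [0.4, -0.3, 0.5, 0.2]

print(near_zero_fraction(residual_cot), near_zero_fraction(residual_nocot))
print(active_neuron_count(concentrated_feature), active_neuron_count(broad_feature))
```

In this toy data the CoT vector is sparser than the NoCoT one, and the two features differ in how many neurons support them, which mirrors the qualitative pattern described above without reproducing any of the paper's numbers.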
That combination matters. CoT does not merely suppress everything. In the larger model, it appears to create structured sparsity: some features are highly concentrated, while others recruit broader support. The paper interprets this as more strategic representational allocation. One can quibble with the word “strategic,” since the model is not sitting there with a resource allocation spreadsheet, but the functional pattern is clear enough: CoT changes how activation mass is distributed.
This helps explain Random-K. If useful CoT signal is spread across many moderately active features, and the model uses a sparse but distributed internal representation, then selecting only the largest activation differences can overfit to local peaks. Random sampling may capture a broader causal bundle.
In the 70M model, CoT also increases sparsity to some extent, but the useful structure does not emerge strongly enough. The small model’s CoT and NoCoT feature distributions remain more similar, and the patched features are noisy or mismatched. Sparse does not automatically mean useful. A clean-looking drawer can still contain junk.
What the paper directly shows, and what operators can infer
The business value of this paper is not that companies should force every model to “think step by step.” That advice aged badly before it was fully printed. The real value is a better diagnostic frame.
| Level | Statement | Status |
|---|---|---|
| Direct paper result | CoT-derived SAE features improve NoCoT correct-answer log-probabilities in Pythia-2.8B on GSM8K. | Directly shown |
| Direct paper result | The same intervention does not reliably help Pythia-70M and can degrade performance. | Directly shown |
| Direct paper result | CoT features in Pythia-2.8B have higher explanation scores than NoCoT features. | Directly shown |
| Direct paper result | CoT induces sparser residual and feature-level activation patterns, especially in Pythia-2.8B. | Directly shown |
| Cognaptus inference | CoT should be treated as a structured prompting control, not as a faithful explanation by default. | Operational inference |
| Cognaptus inference | Prompting strategy should be validated with behavioural, causal, or counterfactual tests before being used in high-stakes workflows. | Operational inference |
| Still uncertain | Whether the same mechanisms hold in frontier proprietary models, other tasks, other layers, or token-level reasoning traces. | Open boundary |
The practical workflow looks like this:
- Prompt design: Use CoT, scratchpads, or structured reasoning prompts as experimental treatments, not as proof of cognition.
- Behavioural evaluation: Check whether they improve task outcomes under realistic data conditions.
- Internal or counterfactual audit: Where feasible, test whether the prompt changes internal representations or causal behaviours, not just the visible rationale.
- Robustness checks: Vary prompt templates, model sizes, feature-selection methods, and task distributions.
- Deployment control: Treat rationales as user-interface artefacts unless validated as causally connected to outputs.
This is especially relevant for regulated or semi-regulated workflows: credit review, medical triage support, contract analysis, compliance screening, and investment research. In these settings, a neat explanation can become a liability if it persuades reviewers without reflecting the actual decision path.
The paper does not give operators a plug-and-play audit kit. It gives them a warning label and a research direction: prompt-induced reasoning needs to be tested as an internal mechanism, not admired as prose.
The uncomfortable part: CoT may help without being a faithful transcript
The likely misconception is simple: if the model writes a chain of thought, the chain is what the model used to think.
This paper argues for a subtler replacement belief: CoT can reshape internal computation in ways that make reasoning more structured and causally useful, but the written explanation is still not guaranteed to be a faithful transcript.
That distinction is uncomfortable because it removes two easy positions.
The first easy position is naive trust: “The model explained itself, so we understand it.” No. You understand a generated explanation, which may or may not correspond to the decision process.
The second easy position is cynical dismissal: “CoT is just theatre.” Also no. In Pythia-2.8B, CoT-induced features have measurable causal effects when patched into NoCoT runs. Theatre does not usually increase correct-answer log-probability after internal feature transfer.
The better position is operationally less elegant but more accurate: CoT is a prompt-induced computational regime. Sometimes it organises internal representations. Sometimes it does not. The difference depends on model capacity, task structure, and the measurement lens.
That is less tweetable. A pity. It is also more useful.
Boundary conditions that materially affect interpretation
The limitations here are not decorative. They directly shape how far the result can travel.
First, the intervention targets residual activations at layer 2 and the final token position. That gives a snapshot, not a causal trace through every reasoning step. If a business team cares about whether a model followed a particular multi-step procedure, this paper does not prove that token-level chain.
Second, the experiments use Pythia-70M and Pythia-2.8B, not LLaMA-scale, GPT-scale, Claude-scale, or domain-fine-tuned enterprise models. The result establishes a scale-sensitive pattern inside this setup. It does not license universal claims about all large models.
Third, the dataset is GSM8K, and the analysis uses the training split for activation collection and evaluation. Math word problems are useful because they have clear answers, but enterprise reasoning tasks often involve ambiguity, retrieval, tool calls, policy constraints, and messy documents. Welcome to reality, where datasets go to lose their manners.
Fourth, the semantic interpretation scores rely on LLM-generated explanations of SAE features. That is useful but indirect. The paper’s stronger evidence is patching, not the mere fact that features can be described.
Fifth, SAE-based analysis itself has biases. Sparse autoencoders can miss distributed or entangled representations, and not every interpretable feature is causally important. The authors acknowledge this. Operators should too.
These boundaries do not make the paper weak. They make it readable. The danger would be converting a careful mechanistic result into a general claim that “CoT makes models faithful.” That would be exactly the sort of reasoning failure the paper is trying to help us avoid.
The business value is diagnosis, not decoration
For AI product teams, the immediate takeaway is not “show more reasoning to users.” In many contexts, showing hidden reasoning may be unnecessary, confusing, or even undesirable. The deeper value is diagnostic.
CoT can be used as a controlled intervention: change the prompt format, measure behaviour, inspect representations where possible, and test whether the change is causally connected to better outputs. That is a much more mature posture than treating prompt engineering as folklore with version control.
A useful internal evaluation might ask:
- Does CoT improve performance on the task distribution that actually matters?
- Does it improve calibration, or merely verbosity?
- Does the benefit survive prompt variation?
- Does it depend strongly on model size?
- Do counterfactual or patching-style tests suggest that the reasoning scaffold changes causal behaviour?
- Are explanations being used as user persuasion, model debugging, or compliance evidence? These are not the same job.
The paper’s scale result is especially important for cost-sensitive deployments. Smaller models may not benefit from CoT in the same way larger models do. Worse, forcing CoT-like structures into a small model’s internal trajectory may introduce representational conflict. If a team is using a compact model for latency or cost reasons, it should not assume that CoT prompting imports the same reasoning machinery observed in larger systems.
The operational lesson is not anti-CoT. It is anti-laziness.
Conclusion: sparse thought is still not transparent thought
The paper gives us a more disciplined way to talk about chain-of-thought prompting. CoT is not automatically a faithful explanation, but neither is it merely decorative text. In the larger Pythia model studied here, CoT induces sparse, more interpretable, and causally useful internal features. In the smaller model, those benefits largely fail to materialise.
That is the useful shape of the finding: mechanism, scale, causality, boundary.
For business readers, the lesson is to stop treating rationales as evidence by default. A written chain may improve user trust, but trust is not the same as faithfulness. CoT should be evaluated as a structured control over model computation. Sometimes it helps organise the model’s internal features. Sometimes it is just a verbose passenger.
The good news is that papers like this make the question testable. The bad news is that testing is work. As usual, reality has declined the invitation to become a dashboard metric.
Cognaptus: Automate the Present, Incubate the Future.
[1] Xi Chen, Aske Plaat, and Niki van Stein, “How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding,” arXiv:2507.22928, 2025. https://arxiv.org/abs/2507.22928