In the world of large language models (LLMs), answers often emerge from an intricate internal dialogue. But what if we could locate the few sentences within that stream of thought that disproportionately steer the outcome, like anchors stabilizing a drifting ship? That is exactly what Paul Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy set out to do in their new work, “Thought Anchors: Which LLM Reasoning Steps Matter?”. The study presents three complementary methods for tracing which steps truly drive an LLM's reasoning.

Why Sentences, Not Tokens?

Autoregressive LLMs generate thoughts token by token, but their reasoning is more meaningfully chunked at the sentence level. Prior interpretability techniques focused either too narrowly (on token saliency) or too broadly (on document-level logic). Sentence-level analysis offers the sweet spot: granular enough to distinguish function (planning, checking, backtracking), yet abstract enough to trace ideas rather than syntax.

The authors build on an 8-class taxonomy of sentence roles—from Plan Generation to Uncertainty Management—to uncover which ones act as thought anchors, the pivotal steps that guide and constrain downstream reasoning.

Three Views Into a Model’s Mind

1. Black-Box Resampling: Counterfactual Influence

The first method is deceptively simple: for each sentence, remove it and let the model resample the reasoning from that point onward (100 rollouts per sentence), then measure how much the distribution of final answers shifts, paying special attention to rollouts where the replacement sentence is semantically different. The result is a counterfactual importance score for every reasoning step.
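To make the loop concrete, here is a minimal sketch of that resampling procedure. The helpers `sample_continuation(prefix)` (resamples a full reasoning trace from a prefix) and `answer_of(trace)` (extracts the final answer) are hypothetical stand-ins for model calls, and the sketch omits the paper's semantic-similarity filter on replacement sentences.

```python
import collections

def counterfactual_importance(sentences, sample_continuation, answer_of,
                              n_rollouts=100):
    """Sketch of black-box resampling importance (assumed helper API).

    For each sentence i, compare rollouts that keep sentences[:i+1]
    against rollouts that branch just before sentence i, so the model
    is free to replace it. The more the final-answer distribution
    shifts, the more that sentence anchors the outcome.
    """
    scores = []
    for i in range(len(sentences)):
        keep_prefix = " ".join(sentences[: i + 1])  # sentence i kept
        drop_prefix = " ".join(sentences[:i])       # sentence i resampled
        keep = collections.Counter(
            answer_of(sample_continuation(keep_prefix))
            for _ in range(n_rollouts))
        drop = collections.Counter(
            answer_of(sample_continuation(drop_prefix))
            for _ in range(n_rollouts))
        # Total-variation distance between the two answer distributions.
        answers = set(keep) | set(drop)
        tv = 0.5 * sum(abs(keep[a] - drop[a]) for a in answers) / n_rollouts
        scores.append(tv)
    return scores
```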

Key finding: Sentences involving plan generation or uncertainty management have far higher counterfactual impact than active computations or factual recall. These are not just transitional steps—they’re turning points.

2. White-Box Analysis: Receiver Attention Heads

Here, the authors transform token-level attention into sentence-to-sentence matrices, identifying attention heads that disproportionately focus downstream reasoning on specific past sentences. These are dubbed receiver heads.
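A sketch of that aggregation step, assuming we already have one head's token-level attention matrix and the token span of each sentence (the function and argument names are illustrative, and the paper's exact aggregation may differ):

```python
import numpy as np

def sentence_attention(attn, sent_spans):
    """Collapse a (seq_len, seq_len) token attention matrix for one head
    into sentence-to-sentence scores. `sent_spans` lists each sentence's
    (start, end) token range. Averaging within each block estimates how
    much sentence j attends back to sentence i."""
    n = len(sent_spans)
    out = np.zeros((n, n))
    for j, (js, je) in enumerate(sent_spans):       # attending sentence
        for i, (is_, ie) in enumerate(sent_spans):  # attended-to sentence
            if i <= j:  # causal model: only earlier (or same) sentences
                out[j, i] = attn[js:je, is_:ie].mean()
    return out
```

Heads whose columns concentrate mass on a few earlier sentences are candidates for receiver heads; a peakedness statistic such as kurtosis over the column averages is one way to score that concentration.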

These receiver heads act as cognitive floodlights, illuminating key past thoughts across multiple reasoning hops. Planning and re-checking sentences dominate here too.

Further, ablating these heads degrades reasoning accuracy more than ablating randomly chosen heads, empirically confirming their functional importance.
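As a rough illustration of how such an ablation can be wired up in PyTorch, the pre-hook below zeroes one head's slice of the attention output projection's input, silencing that head's contribution. The module path, layer and head indices, and head dimension are assumptions for a LLaMA-style HuggingFace model and will differ across architectures.

```python
import torch  # hooks operate on torch.Tensor inputs

def make_head_ablation_hook(head: int, head_dim: int):
    """Pre-hook for an attention output projection (e.g., o_proj):
    zero the input slice carrying one head's output. Assumes the
    projection input is laid out as (..., n_heads * head_dim)."""
    def hook(module, args):
        hidden = args[0].clone()
        hidden[..., head * head_dim:(head + 1) * head_dim] = 0.0
        return (hidden,) + args[1:]
    return hook

# Hypothetical usage (layer 20, head 7, head_dim 128 are examples):
# o_proj = model.model.layers[20].self_attn.o_proj
# handle = o_proj.register_forward_pre_hook(
#     make_head_ablation_hook(head=7, head_dim=128))
# ...evaluate reasoning accuracy with the head silenced...
# handle.remove()
```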

3. Causal Suppression: Direct Influence Paths

The third method injects a causal twist: what happens if we block all attention to a sentence? The intervention is out of distribution, but it isolates each sentence's direct effect on the tokens that follow. The authors correlate these suppression effects with the resampling results, and the agreement between the two helps validate both methods.
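Conceptually, the intervention amounts to masking one sentence's key positions before the softmax, as in this sketch; in practice it has to be patched into every layer's attention computation (for example via hooks or a modified forward pass), and the function here is illustrative rather than the paper's code.

```python
import torch

def suppress_sentence(scores: torch.Tensor, span: tuple) -> torch.Tensor:
    """Block all attention to one sentence. `scores` holds raw pre-softmax
    attention logits of shape (batch, heads, q_len, k_len); `span` is the
    sentence's (start, end) token range along the key axis. Setting those
    logits to -inf gives the sentence zero weight after the softmax."""
    start, end = span
    scores = scores.clone()
    scores[..., :, start:end] = float("-inf")
    return torch.softmax(scores, dim=-1)
```

The size of the intervention's effect can then be read off from how much the distributions over later tokens move, for instance via a KL divergence against the unsuppressed run.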

Result: Many key transitions (e.g., switching from a wrong answer to a correct one) occur immediately after a high-suppression-impact sentence—further confirming the idea of “pivot points” in reasoning.

A Model’s Internal Arc of Realization

The paper’s case study is poetic in its structure: a model initially concludes that a hexadecimal number has 20 bits, but midway through its reasoning it realizes the mistake. Sentence 13 (“Maybe I should convert the number to decimal first…”) is identified by all three methods as a thought anchor, redirecting the flow. Later sentences about backtracking, rechecking, and leading-zero logic form a web of interdependent realizations.

In the paper's visualizations, these anchors appear as dense nodes in the directed acyclic graph (DAG) of reasoning, with heavy incoming and outgoing edges. They're not endpoints; they're decision forks.

From Interpretability to Safety and Training

This sentence-level lens could become a cornerstone for:

  • Faithfulness audits: Are the reasons the model gives actually the ones that guide its answer?
  • Debugging: Identify missing or misleading anchors in incorrect traces.
  • Training feedback: Reinforce valuable anchors; suppress distractors.

Final Thoughts

This paper offers a compelling middle path between neural circuits and symbolic logic. It doesn’t just ask, What did the model say? but rather, Which sentence changed its mind? And in a world where LLMs increasingly act as autonomous agents, knowing where the mind pivots is key.


Cognaptus: Automate the Present, Incubate the Future.