Policy rules are boring until a chatbot applies the wrong one.
A customer asks whether they qualify for a refund. The rule says refunds require purchase within 30 days, unused condition, and no prior replacement claim. The model answers confidently. It even writes a neat step-by-step explanation. Wonderful. The explanation looks like reasoning. It may even be correct.
The uncomfortable question is whether the model actually followed the rule chain, or whether it produced a plausible-looking legal-office cosplay of reasoning. That difference matters. A business process cannot be audited by vibes, however well-formatted the vibes may be.
The paper *Revealing Algorithmic Deductive Circuits for Logical Reasoning* by Nguyen, Dang, and Inoue is useful because it does not ask the usual surface question: “Did the model get the final answer right?” It asks a more mechanical question: when a language model performs a multi-step deductive reasoning task, which internal attention heads appear to carry the work of reading facts, reading rules, selecting premises, selecting rules, deciding when to stop selecting premises, and coordinating the overall traversal strategy? [1]
That sounds narrow. It is narrow. That is precisely why it is interesting.
For business users, the paper should not be read as a ready-made diagnostic tool for production systems. It is not a plug-in that says, “Congratulations, your compliance bot used head 17, therefore the answer is safe.” That would be a charmingly efficient way to misunderstand the paper. Its real value is conceptual and operational: it encourages us to treat reasoning as a routed process with separable stages, not as a final answer with a decorative chain-of-thought attached.
The paper turns deductive reasoning into a graph traversal problem
The authors work in a symbolic-aided Chain-of-Thought setting. Instead of asking the model to reason freely over messy natural language, they represent deductive reasoning problems using facts and rules. A reasoning chain is written as a sequence of inference steps: the model keeps a knowledge base of proven facts, selects premises from that knowledge base, selects a rule, derives a new fact, updates the knowledge base, and eventually validates the query.
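The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: the `Rule` type, the `forward_chain` function, and the refund facts are hypothetical names borrowed from the opening example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    conditions: frozenset  # facts that must all be proven
    conclusion: str        # fact derived when the rule fires


def forward_chain(facts, rules, query):
    """Derive facts step by step and return (query_proven, trace)."""
    kb = set(facts)   # knowledge base of proven facts
    trace = []        # one entry per inference step
    changed = True
    while changed and query not in kb:
        changed = False
        for rule in rules:
            # A rule may fire only when every condition is already in the KB.
            if rule.conditions <= kb and rule.conclusion not in kb:
                kb.add(rule.conclusion)  # fact derivation and KB update
                trace.append((sorted(rule.conditions), rule.name, rule.conclusion))
                changed = True
                break
    return query in kb, trace


# Hypothetical refund policy, loosely based on the opening example.
rules = [
    Rule("R1", frozenset({"within_30_days", "unused"}), "returnable"),
    Rule("R2", frozenset({"returnable", "no_prior_claim"}), "refund_eligible"),
]
facts = {"within_30_days", "unused", "no_prior_claim"}

proven, trace = forward_chain(facts, rules, "refund_eligible")
for premises, rule_name, derived in trace:
    print(premises, "--", rule_name, "->", derived)
print("query proven:", proven)
```

Each trace entry records the selected premises, the selected rule, and the derived fact, which are the same pieces the paper's reasoning chains make explicit.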
In business language, this resembles a controlled rule workflow. There are input facts, rule definitions, intermediate states, and a final eligibility or truth decision. The model is not merely asked to “think carefully”; it is asked to follow a structured procedure.
The paper frames the problem as traversal over an inference graph. Facts and derived conclusions are nodes. Rules create possible transitions. A correct reasoning process is not only about reaching the right endpoint; it is also about taking a valid path. In the paper’s examples, the demonstrations can imply a traversal strategy such as breadth-first search or depth-first search. This detail matters because two different reasoning paths can be locally valid, while only one follows the demonstrated algorithmic pattern.
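To make the traversal point concrete, here is a small sketch of breadth-first versus depth-first derivation order over the same inference graph. The graph, the fact names, and the `traverse` helper are hypothetical, and every rule is simplified to a single premise.

```python
from collections import deque

# Hypothetical inference graph: each edge is a single-premise rule
# "if <key> then <value>". Multi-premise rules are left out for brevity.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}


def traverse(start, strategy):
    """Return the order in which facts are reached from `start`."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # BFS takes the oldest frontier node, DFS the newest.
        node = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(node)
        children = graph[node]
        # Reverse for DFS so the first-listed rule is still explored first.
        for child in (children if strategy == "bfs" else reversed(children)):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order


print(traverse("A", "bfs"))  # ['A', 'B', 'C', 'D', 'E']
print(traverse("A", "dfs"))  # ['A', 'B', 'D', 'C', 'E']
```

Both orders contain only valid steps. Only one of them matches whatever pattern the demonstrations imply.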
That is where the paper becomes more interesting than another “LLMs still struggle with reasoning” report. The authors are not merely checking whether the model can output “True” or “False.” They ask which token positions actually steer the reasoning path.
They divide the generated reasoning chain into several component types:
| Reasoning component | Plain meaning | Why it matters operationally |
|---|---|---|
| Premise in knowledge base | The model repeats or references already-proven facts | Checks whether the model is using the current state, not inventing a premise |
| Premise selection | The model chooses which proven fact to use next | Controls the next branch of the reasoning process |
| Premise selection termination | The model decides whether enough premises have been selected | Prevents under-specified or over-eager rule application |
| Rule selection | The model chooses which rule to apply | Determines whether the inference is legally/logically valid |
| Fact derivation | The model states the derived conclusion | Updates the reasoning state |
| Syntax/template tokens | Formatting and procedural scaffolding | Keeps the explanation readable but does not guarantee reasoning quality |
The distinction between syntax and steering is important. A model can be very good at maintaining the format of a reasoning trace while failing at the points where the trace actually branches. This is not a small footnote. It is the whole enterprise-software problem wearing a lab coat.
The steering tokens are where the reasoning actually becomes expensive
The preliminary experiment in the paper looks for low-confidence tokens in the generated reasoning chain. The authors define “uncertain tokens” as tokens whose probabilities fall below 0.8, then group those tokens by reasoning component. Across Llama-3.1-8B-Instruct, Qwen3-8B, Phi-4, and Qwen3-4B, the same pattern appears: uncertainty concentrates in three components — premise selection, premise selection termination, and rule selection.
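One plausible way to tabulate this is sketched below. The 0.8 threshold comes from the paper as described above; the token records, probabilities, component labels, and the `uncertain_share` helper are invented for illustration, and the paper's exact tabulation may differ.

```python
from collections import Counter

UNCERTAINTY_THRESHOLD = 0.8  # threshold used in the preliminary experiment

# Hypothetical per-token records: (token, probability, component label).
# The values are made up for illustration, not taken from the paper.
tokens = [
    ("KB", 0.99, "syntax"),
    ("=", 0.99, "syntax"),
    ("F1", 0.62, "premise_selection"),
    (",", 0.71, "premise_termination"),
    ("F2", 0.93, "premise_selection"),
    ("R3", 0.55, "rule_selection"),
    ("F7", 0.97, "fact_derivation"),
]


def uncertain_share(records, threshold=UNCERTAINTY_THRESHOLD):
    """Fraction of each component's tokens whose probability is below threshold."""
    total = Counter(component for _, _, component in records)
    uncertain = Counter(
        component for _, prob, component in records if prob < threshold
    )
    return {component: uncertain[component] / total[component] for component in total}


shares = uncertain_share(tokens)
for component, share in shares.items():
    print(f"{component}: {share:.2f}")
```

In the paper's measurements, the three steering components are the ones that dominate this kind of tally.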
That result is intuitive once the task is unpacked.
A syntax token is cheap. After `KB =`, the next symbols are mostly format continuation. A fact derivation token may also become relatively constrained after the premise and rule have already been chosen. But premise selection is different. To choose the next premise, the model has to satisfy several constraints at once: the premise must already be proven, it must be relevant to at least one applicable rule, and it may need to follow the traversal pattern demonstrated in the prompt.
Rule selection is similarly loaded. A rule is not selected because it looks familiar. It is selected because its conditions match the currently available premises and because the reasoning process has reached the point where that rule is the appropriate transition.
Premise selection termination is the quiet killer. The model has to decide whether one premise is enough or whether another premise is required before applying a rule. In business settings, this is exactly where many logic failures hide. The model applies the “discount eligibility” rule after finding one condition, forgetting that the rule requires three. It has not misunderstood English. It has stopped too early.
The paper reports that Qwen3-8B and Llama-3.1-8B-Instruct achieve about 65% and 50% inference-step accuracy, respectively, on the synthesized dataset. That is not final-answer accuracy; it is closer to a step-level view of reasoning. The difference matters because final answers can be lucky, especially in binary tasks. Step-level evaluation asks whether the machine walked through the rule system correctly, not whether it stumbled into the right room.
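The gap between the two views is easy to see on a toy evaluation set. The records below are invented, and the three metric functions are hypothetical helpers; the paper's exact definition of inference-step accuracy may differ from this simple per-step average.

```python
# Hypothetical evaluation records, invented for illustration.
# Each record holds per-step correctness flags and the final label check.
records = [
    {"steps_correct": [True, True, True], "final_correct": True},
    {"steps_correct": [True, False, False], "final_correct": True},  # lucky guess
    {"steps_correct": [False, False], "final_correct": False},
    {"steps_correct": [True, True], "final_correct": True},
]


def final_answer_accuracy(records):
    """Share of examples whose final label is correct."""
    return sum(r["final_correct"] for r in records) / len(records)


def step_accuracy(records):
    """Share of individual inference steps that are correct."""
    steps = [flag for r in records for flag in r["steps_correct"]]
    return sum(steps) / len(steps)


def fully_valid_rate(records):
    """Share of examples with a correct answer and every step correct."""
    valid = [r["final_correct"] and all(r["steps_correct"]) for r in records]
    return sum(valid) / len(records)


print("final-answer accuracy:", final_answer_accuracy(records))  # 0.75
print("step accuracy:        ", step_accuracy(records))          # 0.6
print("fully valid rate:     ", fully_valid_rate(records))       # 0.5
```

The second record is the one to watch: the label is right, the route is wrong, and only the step-level metrics notice.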
The mechanism is split between reading heads and decision heads
After identifying the difficult steering positions, the authors use causal mediation analysis, especially activation patching and path patching, to identify attention heads that causally affect those positions.
The experimental design uses clean and corrupted prompt pairs. A clean prompt preserves the original facts, rules, demonstrations, and expected reasoning. A corrupted prompt changes a causal element in a controlled way: a fact may be altered, a rule condition may be changed, a rule’s content may be modified, or demonstrations may be switched to imply a different traversal strategy. The point is to isolate what information must be read and what decision must change.
Activation patching then asks a targeted question. If the model sees a corrupted prompt, and we restore the activation of a particular attention head from the clean run, does the model recover the clean behavior at the target token? If yes, that head is treated as causally important for that component.
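A toy version of that question is sketched below. Everything in it is an assumption made for illustration: the three scalar "heads", the `run` function, and the prompts are hypothetical, and the normalized recovery score is a common convention in activation-patching work rather than a confirmed copy of the paper's metric.

```python
import math

# Toy "model": three heads, each mapping a prompt to a scalar contribution.
# Real attention heads write vectors into a residual stream; this is a cartoon.
HEADS = {
    "head_fact": lambda p: 2.0 * p["fact"],  # depends on the fact
    "head_rule": lambda p: 1.0 * p["rule"],  # depends on the rule
    "head_other": lambda p: 0.1,             # ignores both
}


def run(prompt, patch=None):
    """Return (target-token probability, per-head activations).

    `patch` maps head names to activations that override the computed ones.
    """
    patch = patch or {}
    acts = {name: patch.get(name, fn(prompt)) for name, fn in HEADS.items()}
    logit = sum(acts.values()) - 1.5
    return 1.0 / (1.0 + math.exp(-logit)), acts


clean = {"fact": 1.0, "rule": 1.0}
corrupted = {"fact": 0.0, "rule": 1.0}  # only the fact is altered

p_clean, clean_acts = run(clean)
p_corrupt, _ = run(corrupted)

effects = {}
for name in HEADS:
    # Restore one head's clean activation inside the corrupted run.
    p_patched, _ = run(corrupted, patch={name: clean_acts[name]})
    effects[name] = (p_patched - p_corrupt) / (p_clean - p_corrupt)
    print(f"{name}: recovers {effects[name]:.2f} of the clean-vs-corrupted gap")
```

Only the head that actually reads the altered fact recovers the clean behavior. Patching the other two changes nothing, which is the signal used to separate causally relevant heads from bystanders.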
The paper separates two broad roles:
| Head role | What it appears to do | Example in the paper’s setup |
|---|---|---|
| Reading heads | Retrieve relevant causal information from facts, rules, rule conditions, or demonstrations | Read fact; read rule; read rule condition; read traversal algorithm |
| Decision heads | Use routed information to choose the next reasoning action | Select premise; match rule condition; select rule; implement traversal algorithm |
This split is the paper’s central mechanism. Early-to-middle layers are more associated with retrieving factual and rule-based information. Deeper layers are more associated with integrating that information into reasoning decisions. The paper reports that information-reading heads tend to appear earlier than decision-making heads across evaluated models.
That does not mean the model contains a neat little legal department in layer 12 and a tidy compliance committee in layer 28. Transformer internals are not org charts, mercifully. But the paper does suggest a temporal organization: read relevant information first, integrate it later, then emit the token that steers the next inference step.
The authors also report sparsity. For rule selection, a small number of heads can account for a large share of the causal effect. In Llama-3.1-8B-Instruct, the highest average indirect effect for a rule-selection head exceeds 30%; in the Qwen models, the peak exceeds 12%. This does not prove that one head “does reasoning” by itself. It does suggest that some reasoning decisions are not evenly distributed across the model. A small set of routing components may carry disproportionate responsibility.
For enterprise AI design, this is the first major lesson: reasoning failures are unlikely to be evenly smeared across the whole model. They may concentrate around specific process transitions — choosing a premise, matching a rule, deciding whether the evidence set is complete, or selecting the next rule. The practical response is not mystical interpretability theater. It is stage-level testing.
The circuit network shows reasoning as interaction, not isolated tricks
The paper then uses path patching to study how the discovered heads communicate. Instead of merely listing important heads, the authors examine causal effects between pairs of heads. This produces a circuit network: reading heads pass information to decision heads, and decision heads integrate information for later reasoning actions.
The evidence here is not just “these heads light up.” The intended interpretation is about information flow. In the rule condition matching circuit, for example, rule-condition information is transferred from reading heads to decision heads. For rule selection and premise selection, information is integrated in deeper-layer decision heads. The paper’s figures show this as networks among top-scoring attention heads.
Two points are worth separating.
First, there is specialization. Different heads are associated with different reasoning roles: reading facts, reading rules, matching rule conditions, selecting premises, selecting rules, and implementing traversal behavior.
Second, there is polysemantic reuse. Some attention heads appear to participate in multiple sub-tasks. The paper notes that Llama-3.1-8B-Instruct tends to share heads more among causal reading roles, while Qwen models show more sharing across decision-making roles. This should make readers slightly suspicious of overly clean “one head equals one function” storytelling. The model is organized enough to analyze, but not polite enough to obey our labels perfectly.
This is the right level of claim for a business reading of the paper. We should not say, “The model has a rule-selection module that can now be certified.” We should say, “The internal evidence is consistent with a staged routing process, and business systems should be evaluated as staged routing processes too.”
The ablations are the receipt: remove the heads, and reasoning breaks
Circuit discovery can easily become interpretability jazz: elegant, technical, and hard to verify from the audience’s seat. The ablation experiments are therefore crucial. The authors knock out the identified logical-reasoning heads and compare the result with random head ablation.
The setup is deliberately small in percentage terms. For Phi-4, they use top-$k$ with $k=8$; for Qwen and Llama models, $k=5$. The configurations include random ablation of about 3% of heads; separate ablations of the heads associated with rule selection, premise selection, and premise selection termination; and a combined /3Roles condition that ablates all three sets at once, totaling about 3% of heads.
The tests have different purposes:
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Synthesized dataset | Main evidence and controlled mechanism test | The identified heads matter for the specific symbolic reasoning process used to discover them | Full generalization to messy enterprise text |
| ProntoQA | Out-of-discovery logical reasoning benchmark | The heads are not merely artifacts of one synthetic dataset | Robustness across all logical reasoning formats |
| ProofWriter | Out-of-discovery logical reasoning benchmark with three-answer structure | Similar degradation appears in another rule-based reasoning task | Production reliability in legal, finance, or policy workflows |
| MMLU | General-knowledge comparison | Deductive heads have smaller or mixed effects on broad knowledge tasks, with larger impact when all roles are removed | That MMLU is a precise reasoning diagnostic |
| Random ablation | Control condition | Targeted heads are more important than arbitrary heads | That only these heads matter |
On the synthesized data, ablating the targeted logical-reasoning heads causes much sharper degradation than random ablation. The strongest result is the combined /3Roles condition: removing the three important reasoning-component sets causes reasoning ability to collapse to nearly zero across the evaluated models.
On ProntoQA and ProofWriter, the pattern remains severe. The paper explicitly notes that after ablation, remaining final-answer accuracies such as 44.6% on ProntoQA and 29.2% on ProofWriter for Phi-4 largely reflect random guessing after the reasoning chain has already become incorrect. That sentence deserves attention. It means the model can preserve the outer format of reasoning while the meaningful reasoning process has failed. The suit is still pressed; the person inside has left the building.
MMLU behaves differently. Individual component ablations produce drops comparable to random ablation, while removing all three reasoning roles together causes a larger degradation than random removal of the same overall proportion. The authors interpret this as evidence that deductive reasoning heads are also recruited in broader knowledge retrieval and application, but less cleanly than in explicit logical reasoning tasks.
For practical readers, the MMLU result should be treated as a boundary, not a victory lap. The paper’s strongest evidence concerns explicit deductive reasoning. The broader claim is suggestive: general tasks may use some of the same routing machinery. But the business implication should remain conservative. If your use case is contract rule checking, policy eligibility, claims adjudication, regulatory mapping, or audit workflow automation, this paper is highly relevant as a mechanism guide. If your use case is “general assistant answers everything,” the connection is looser.
The real business lesson is process control, not interpretability decoration
The paper does not directly show how to build a safer enterprise agent. It shows that, under a controlled symbolic-aided reasoning format, LLM deductive reasoning appears to rely on sparse, interacting attention-head circuits associated with separable reasoning functions.
Cognaptus’ practical inference is this: business LLM systems should be designed and evaluated around reasoning stages, not merely final answers.
A rule-following workflow can be decomposed like this:
| Business reasoning stage | Model-side analogue from the paper | Practical control to add |
|---|---|---|
| Extract relevant facts | Read fact | Evidence extraction with source spans and structured state |
| Identify applicable rule conditions | Read rule condition / match rule condition | Rule parser, condition checklist, and deterministic validation where possible |
| Select which proven facts to use | Premise selection | Explicit premise table; forbid unsupported premises |
| Decide whether enough premises are collected | Premise selection termination | Completeness gate before rule application |
| Choose the rule to apply | Rule selection | Rule ID selection with justification and conflict checks |
| Derive and store intermediate conclusions | Fact derivation | State update log with audit trail |
| Produce final answer | Query validation | Final answer generated only after trace validation |
This is not glamorous. That is good. Glamour is a poor substitute for control.
The paper’s mechanism-first contribution supports a boring but powerful architecture: make the agent expose its intermediate state, separate rule matching from answer generation, and evaluate the stages where branching decisions occur. Do not only test whether the final answer sounds correct. Test whether the system selected the right premise, waited until the premise set was complete, selected the right rule, and updated the state correctly.
This also changes evaluation design. A benchmark that only asks for the final decision will miss failures masked by lucky guessing. For binary decisions, random survival can look embarrassingly competent. A stage-level benchmark should score at least five artifacts: extracted facts, applicable rules, selected premises, intermediate derivations, and final decision. The final answer should be the last score, not the entire score.
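A stage-level scorer along these lines might look like the sketch below. The stage names follow the five artifacts listed above; the `score_stages` function, the exact-match comparison, and the refund case are hypothetical simplifications, and a real benchmark would likely need partial credit and fuzzier matching.

```python
STAGES = [
    "extracted_facts",
    "applicable_rules",
    "selected_premises",
    "intermediate_derivations",
    "final_decision",
]


def score_stages(predicted, expected):
    """Compare each artifact to its reference; return a per-stage pass/fail dict."""
    return {stage: predicted.get(stage) == expected[stage] for stage in STAGES}


# Hypothetical refund case: the final decision is right, the route is not.
expected = {
    "extracted_facts": {"within_30_days", "unused", "no_prior_claim"},
    "applicable_rules": {"R_refund"},
    "selected_premises": {"within_30_days", "unused", "no_prior_claim"},
    "intermediate_derivations": ["refund_eligible"],
    "final_decision": "approve",
}
predicted = dict(expected, selected_premises={"within_30_days"})

scores = score_stages(predicted, expected)
print(scores)
print("final answer only:", scores["final_decision"])  # True
print("all stages pass:  ", all(scores.values()))      # False
```

A final-answer-only benchmark would mark this case as a pass. The stage-level view flags the incomplete premise set.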
For operations teams, this offers a practical ROI pathway. The value is not that mechanistic interpretability will let a business debug production models head by head next quarter. The value is cheaper diagnosis at the workflow level. When a compliance assistant fails, the team should know whether it failed because it missed a fact, matched the wrong condition, applied a rule too early, selected the wrong rule, or guessed the final label. That is already a significant improvement over “the model hallucinated,” the universal bucket where serious debugging goes to retire.
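Workflow-level diagnosis of this kind can be sketched as a check over one logged inference step. The rule table, the failure labels, and the `diagnose_step` function are all hypothetical; the labels roughly mirror the failure types named above.

```python
# Hypothetical rule table: rule id -> (required premises, conclusion).
RULES = {
    "R_refund": ({"within_30_days", "unused", "no_prior_claim"}, "refund_eligible"),
    "R_exchange": ({"within_30_days", "unused"}, "exchange_eligible"),
}


def diagnose_step(known_facts, premises, rule_id, derived):
    """Return a failure label for one logged inference step, or 'ok'."""
    premises = set(premises)
    if not premises <= set(known_facts):
        return "unsupported_premise"     # used a fact that was never established
    if rule_id not in RULES:
        return "unknown_rule"
    required, conclusion = RULES[rule_id]
    if premises < required:
        return "rule_applied_too_early"  # stopped selecting premises too soon
    if premises != required:
        return "wrong_rule"              # premises do not fit this rule
    if derived != conclusion:
        return "wrong_derivation"
    return "ok"


known = {"within_30_days", "unused", "no_prior_claim"}
print(diagnose_step(known, {"within_30_days"}, "R_refund", "refund_eligible"))
print(diagnose_step(known, known, "R_exchange", "exchange_eligible"))
print(diagnose_step(known, known, "R_refund", "refund_eligible"))
```

The first call reproduces the early-stopping failure described earlier: one condition found, rule applied anyway.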
Where the paper should not be overread
The paper has clear boundaries, and they matter.
First, the main experiments use synthesized datasets with explicit symbolic structures. This is a strength for mechanism discovery because it allows clean and corrupted prompt pairs. It is also a limitation because real business documents are noisy, incomplete, and often written by committees that appear to have declared war on clarity.
Second, the reasoning format is symbolic-aided CoT. The model is guided into a structured representation of facts, rules, and inference steps. That is appropriate for analyzing deductive circuits, but it does not automatically describe implicit reasoning in free-form chat.
Third, the study focuses on attention heads. The authors explicitly note that other components, such as MLP layers, may also contribute to reasoning. So the correct conclusion is not “attention is the whole story.” The correct conclusion is “attention-head routing carries enough causal signal to reveal a meaningful part of the story.”
Fourth, the circuits are model- and dataset-dependent. The paper tests several open models and finds recurring patterns, but it does not certify that the same heads, layers, or exact structures hold across every architecture, model size, domain, or fine-tuning regime.
Finally, the paper’s ablation results are strongest as causal evidence in the studied setting. They are not a production monitoring method. You cannot ask a closed commercial model to reveal the same head-level circuits during a live refund decision. At least, not unless your vendor has become unusually generous with internal activations, which would be a refreshing plot twist.
These boundaries do not weaken the article’s practical lesson. They sharpen it. The right business takeaway is not to copy the mechanistic method directly. It is to copy the decomposition discipline.
Correct answers are not enough; the route matters
The useful misconception to retire is simple: a correct answer plus a fluent chain-of-thought does not prove that the model performed the intended reasoning process.
Nguyen, Dang, and Inoue’s paper gives that claim a more precise shape. In their controlled setting, deductive reasoning is not one vague capability. It is a routed process involving information reading, condition matching, premise selection, stopping decisions, rule selection, and integration across steps. Sparse groups of attention heads appear to mediate these functions. Remove them, and explicit logical reasoning degrades sharply; remove the combined set, and reasoning can collapse even when the answer format remains alive.
For Cognaptus readers building LLM systems for operations, finance, compliance, procurement, insurance, HR policy, customer support, or internal knowledge workflows, the message is not “trust the model less.” That is too easy and not very useful.
The message is: trust a controlled process more than a polished answer.
A dependable business reasoning system should not merely ask the model to explain itself. It should force the model to expose its working state, separate evidence from rules, validate each branch point, and score the reasoning trace before releasing the final recommendation. The final answer is only the visible endpoint. The route is where reliability is won or lost.
Correct outputs are nice. Controlled reasoning is better. The former impresses demos; the latter survives audits.
Cognaptus: Automate the Present, Incubate the Future.
[1] Phuong Minh Nguyen, Tien Huu Dang, and Naoya Inoue, “Revealing Algorithmic Deductive Circuits for Logical Reasoning,” arXiv:2605.27824v1, 27 May 2026, https://arxiv.org/abs/2605.27824.