When Tokens Remember: Graphing the Ghosts in LLM Reasoning

Audit is easy when the answer is a single lookup.

A customer asks, “What is your refund policy?” The model quotes the policy paragraph. We check whether the quoted paragraph came from the right source. Very civilized. Everyone goes home early.

But real enterprise LLM work is rarely that tidy. A compliance assistant reads a contract, extracts obligations, compares them with internal policy, reasons through exceptions, and writes a recommendation. A research assistant reads multiple sources, builds an intermediate summary, then answers a question from that summary. A support agent reads a user history, infers the likely issue, then proposes the next action. In these cases, the final sentence may depend on prompt evidence and on earlier generated text.

That is where ordinary context attribution starts to look suspiciously flat. It can tell us which prompt tokens seem directly associated with the answer. What it often fails to show is the route: how the model’s earlier generated steps carried evidence forward into the final output.

The paper “Explaining the Reasoning of Large Language Models Using Attribution Graphs” introduces CAGE, or Context Attribution via Graph Explanations, to address exactly this gap.¹ Its central claim is not that LLM reasoning has suddenly become transparent. Please, let us remain adults. The claim is narrower and more useful: if autoregressive models generate text step by step, then attribution should respect that stepwise structure instead of pretending every output token has a clean, direct line back to the prompt.

The paper’s central comparison is therefore the right lens for reading it: row attribution audits isolated rows; CAGE audits paths.

Row attribution grades the destination; CAGE audits the route

Current context-attribution methods begin from a sensible idea. For each generated token or sentence, apply a base attribution method and ask: which earlier inputs influenced this generation?

The problem appears when those local attribution rows are converted into a prompt-level explanation. Existing “row attribution” approaches usually keep only direct prompt influence and then sum those rows. In practical terms, they ask:

Which prompt sentences directly influenced the selected output sentences?

That sounds reasonable until the output is the end of a reasoning chain. Suppose a model solves a math problem in several generated steps. The final answer may directly depend on the immediately previous calculation. That previous calculation depends on an earlier generated step. That earlier step depends on the original prompt. If attribution only looks for direct prompt-to-answer influence, it can miss the prompt facts that shaped the answer through intermediate reasoning.

CAGE changes the unit of explanation. Instead of flattening each generation into a prompt-only row, it builds an attribution graph across both prompt tokens and generated tokens. Edges point forward in generation order: earlier prompt or generated tokens can influence later generated tokens. The graph is then used to propagate influence from the prompt through intermediate generated text to the output being explained.
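To see why the graph view changes the answer, here is a minimal pure-Python sketch of influence flowing forward through a tiny generation graph. The node names, the edge weights, and both function names are invented for illustration; this is a toy under those assumptions, not the paper's implementation.

```python
from collections import defaultdict

# Hypothetical toy graph: every name and weight here is an illustrative assumption.
ORDER = ["p1", "p2", "g1", "g2", "answer"]  # prompt sentences first, then generation order
PROMPT = {"p1", "p2"}

# EDGES[target] maps each earlier node to its normalized influence on target.
EDGES = {
    "g1": {"p1": 1.0},
    "g2": {"p2": 0.5, "g1": 0.5},
    "answer": {"g2": 1.0},  # the answer leans directly only on the previous step
}

def direct_prompt_influence(target):
    """Row-style view: keep only edges that run straight from the prompt."""
    return {s: w for s, w in EDGES.get(target, {}).items() if s in PROMPT}

def propagated_prompt_influence(target):
    """Graph-style view: sum prompt influence over every forward path."""
    total = {}
    for node in ORDER:  # visiting in generation order guarantees sources are ready
        if node in PROMPT:
            total[node] = {node: 1.0}
            continue
        acc = defaultdict(float)
        for src, weight in EDGES.get(node, {}).items():
            for prompt_node, value in total[src].items():
                acc[prompt_node] += weight * value
        total[node] = dict(acc)
    return total[target]

direct = direct_prompt_influence("answer")
propagated = propagated_prompt_influence("answer")
```

In this toy, the direct view finds no prompt evidence for the answer at all, while the propagated view splits the credit evenly between `p1` and `p2`, because both reach the answer through `g2`.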

The distinction is small enough to sound technical, but large enough to change the audit meaning.

| Question | Row attribution | CAGE |
| --- | --- | --- |
| What is being explained? | Selected generated output | Selected generated output |
| What evidence is considered? | Direct prompt-to-output attribution | Prompt and inter-generation influence |
| What structure is preserved? | Mostly a prompt-output table | A directed acyclic influence graph |
| What gets lost? | Intermediate generated steps | Low-weight edges may be pruned, but causal order is retained |
| Business translation | “Which source paragraph touched the answer?” | “Which source evidence traveled through the reasoning path?” |

This is why the paper matters for applied AI. Many enterprise failures do not look like an obviously wrong answer. They look like a plausible answer with a broken dependency chain. The model cited the right document, used the wrong clause, skipped a condition, and still produced something polished enough to pass a casual glance. Lovely. Dangerous, but lovely.

The technical move: turn token influence into a graph before summarizing it

CAGE starts with a base attribution method. The paper tests several: perturbation, Context Length Probing, ReAGent, Integrated Gradients, and attention multiplied by Integrated Gradients. This is important because CAGE is not presented as a replacement for every attribution method. It is a framework that reorganizes the attribution outputs into a structure that better matches autoregressive generation.

The method builds an attribution table where each generated token or sentence has attribution scores over prior prompt and generated content. Existing row approaches mostly use the prompt columns and ignore the lower-triangular region that represents inter-generation influence. CAGE keeps that region.
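A small NumPy sketch of that table layout may help. The sizes, the values, and the helper name `row_attribution` are all assumptions made for illustration; the point is only to show which block a row-style summary reads and which block it leaves on the floor.

```python
import numpy as np

N_PROMPT = 3  # hypothetical: 3 prompt sentences followed by 3 generated sentences

# Rows = generated sentences g0..g2; columns = [p0, p1, p2, g0, g1, g2].
# Values are invented. Entries at or after a row's own position are zero,
# since a sentence cannot be influenced by itself or by later text.
T = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0],  # g0
    [0.0, 0.2, 0.1, 0.7, 0.0, 0.0],  # g1
    [0.0, 0.0, 0.1, 0.1, 0.8, 0.0],  # g2
])

def row_attribution(table, n_prompt, selected_rows):
    """Sum only the prompt columns of the selected generated rows."""
    return table[selected_rows, :n_prompt].sum(axis=0)

prompt_block = T[:, :N_PROMPT]  # what a row-style summary reads
gen_block = T[:, N_PROMPT:]     # inter-generation influence, strictly lower-triangular

scores = row_attribution(T, N_PROMPT, [2])  # explain g2 only
ignored_mass = gen_block[2].sum()           # g2's attribution to earlier generated text
```

Here the row-style score for `g2` credits only `p2`, with a weight of 0.1, while 0.9 of that row's attribution sits in the generated columns and never reaches the summary.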

Then it constructs an adjacency matrix $A$ for a directed graph. The paper enforces two properties:

Causality: influence only points forward in time, from earlier tokens to later generated tokens.
Row stochasticity: incoming influence weights for each generated token are non-negative and sum to one.

The normalization step is roughly:

$$ A_{i,j} = \frac{\Phi(T_{i,j})}{\sum_{k} \Phi(T_{i,k})} $$

where $\Phi(x)=\max(x,0)$ removes negative values before normalization.
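The two properties and the normalization can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name `build_adjacency`, the table layout (one row per generated unit, columns for prompt units followed by generated units), and the handling of rows with no positive mass are choices made here, not details taken from the paper.

```python
import numpy as np

def build_adjacency(table, n_prompt):
    """Turn a raw attribution table into a causal, row-stochastic matrix.

    table: shape (n_gen, n_prompt + n_gen), raw scores (layout is an assumption).
    Returns A of shape (n, n) where A[i, j] is the normalized influence of
    node j on node i. Prompt rows stay zero: nothing precedes the prompt.
    """
    n_gen, n = table.shape
    if n != n_prompt + n_gen:
        raise ValueError("table must have n_prompt + n_gen columns")

    clipped = np.maximum(table, 0.0)  # Phi(x) = max(x, 0)

    # Causality: generated row i is node n_prompt + i and may only
    # receive influence from strictly earlier nodes.
    cols = np.arange(n)
    rows = n_prompt + np.arange(n_gen)
    clipped = clipped * (cols[None, :] < rows[:, None])

    # Row normalization. Assumption: a row with no positive mass is left
    # as zeros rather than divided by zero.
    sums = clipped.sum(axis=1, keepdims=True)
    normalized = np.divide(clipped, sums, out=np.zeros_like(clipped), where=sums > 0)

    A = np.zeros((n, n))
    A[n_prompt:, :] = normalized
    return A

# Invented scores: 2 prompt units, 2 generated units, one negative entry.
T = np.array([
    [2.0, -1.0, 0.0, 0.0],
    [1.0, 0.0, 3.0, 0.0],
])
A = build_adjacency(T, n_prompt=2)
```

In this example the negative score is clipped away before normalization, so the first generated unit ends up depending entirely on the first prompt unit, and the second splits its weight 0.25 to 0.75 between the first prompt unit and the first generated unit.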

That choice is not just mathematical housekeeping. It defines the interpretation of the graph. Each generated token receives a bounded distribution of influence from earlier tokens. Without that constraint, propagated influence can explode, flip signs, or cancel itself into nonsense. The appendix tests this directly, which we will come back to, because the appendix is doing real work here rather than decorative paper furniture.

Once the graph exists, CAGE computes prompt-level context attribution by marginalizing over influence paths. The paper expresses the full propagated influence matrix as:

$$ Y = A(I-A)^{-1} $$

Then, for any selected output $O$, CAGE sums the corresponding rows of $Y$ and reports the prompt-token or prompt-sentence components. In plain English: it traces how influence from the prompt reaches the output through all intermediate generated steps, not only through direct contact.
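Why does that closed form count every path? A short derivation, offered as standard linear algebra rather than a quotation from the paper, and assuming the index convention that $A_{i,j}$ is the influence of node $j$ on node $i$. Entry $(i,j)$ of $A^k$ sums the products of edge weights over all paths of length $k$ from $j$ to $i$. Causality makes $A$ strictly lower-triangular in generation order, so with $n$ nodes $A^n = 0$ and the series is finite:

$$ Y = \sum_{k=1}^{n-1} A^k = A \sum_{k=0}^{n-1} A^k = A(I-A)^{-1} $$

For a three-node chain $p \to g_1 \to g_2$ with both edge weights equal to one, the direct entry is $A_{g_2,p} = 0$, but $(A^2)_{g_2,p} = A_{g_2,g_1} A_{g_1,p} = 1$, so $Y_{g_2,p} = 1$. The prompt reaches the final step entirely through the intermediate one.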

This is the central mechanism. CAGE does not ask the model to explain itself. It does not trust a verbal rationale. It builds an external attribution structure over the generation process. For audit work, that distinction matters.
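The propagation step itself is short enough to sketch. The helper names `propagate` and `context_attribution`, the node indexing, and the toy weights below are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def propagate(A):
    """Y = A (I - A)^{-1}: influence summed over all forward paths."""
    n = A.shape[0]
    # Solve Y (I - A) = A via the transposed system, avoiding an explicit inverse.
    return np.linalg.solve((np.eye(n) - A).T, A.T).T

def context_attribution(A, n_prompt, output_nodes):
    """Sum the rows of Y for the selected outputs and keep the prompt columns."""
    Y = propagate(A)
    return Y[output_nodes, :n_prompt].sum(axis=0)

# Hypothetical graph, node indices: p0=0, p1=1, g0=2, g1=3, g2=4.
# p0 -> g0 -> g1 -> g2, plus p1 -> g1. Rows are already row-stochastic.
A = np.zeros((5, 5))
A[2, 0] = 1.0
A[3, 1] = 0.5
A[3, 2] = 0.5
A[4, 3] = 1.0

direct = A[[4], :2].sum(axis=0)             # prompt columns of g2's own row
via_paths = context_attribution(A, 2, [4])  # prompt influence over all paths
```

In this toy, `g2` has no direct edge from the prompt, so the direct view reports zeros, while the path view credits `p0` and `p1` with 0.5 each. Because every generated row sums to one, the prompt-level attribution for a single output also sums to one here.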

The qualitative examples show the missing middle

The paper’s qualitative examples are useful because they reveal the exact failure mode row attribution has trouble with.

In the Facts dataset, the model is asked to list several facts from a prompt. The audit goal is to explain selected generated facts. Both row attribution and CAGE can identify the direct source facts for the selected outputs. But CAGE also attributes a previously generated fact because the model appears to be tracking which facts it has already used. That “reuse tracking” is not directly part of the final selected fact, but it shapes why the model selected what it selected.

Row attribution misses that inter-generation behavior. It sees the selected output and the source fact. It does not see the model’s memory of its own earlier output.

The Math example is even cleaner. A model solves a multi-step word problem with distractor sentences inserted between relevant prompt sentences. The final answer depends on several prompt facts: the initial situation, the numerical values, and the question. CAGE attributes all of the relevant prompt sentences, which sit at the odd-numbered positions because the distractors are interleaved between them. Row attribution misses critical prompt sentences, including early context needed to solve the problem.

This is not a small visualization preference. It changes the story the explanation tells.

A row-attribution explanation can say, “The answer was influenced by the final calculation.” CAGE can say, “The answer was influenced by a path that began with the original facts, passed through intermediate generated calculations, and ended at the final answer.” For a human reviewer, the second explanation is closer to the audit question we actually care about.

The quantitative tests separate coverage from faithfulness

The evaluation is structured around two different questions, and the distinction is worth preserving.

The first question is coverage: does the attribution identify all important prompt sentences? This is tested on the Math dataset, where the authors know which prompt sentences are ground truth because distractor sentences are inserted. A good explanation should concentrate attribution on the relevant prompt sentences and distribute it across them so that important pieces are not missed.

The second question is faithfulness: if we remove highly attributed prompt sentences, does the model’s probability of producing the original output fall? This is tested using perturbation metrics, RISE and MAS. Lower area under the perturbation curve means the attribution better identified inputs that mattered to the model’s output.
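The logic of a deletion-style faithfulness test can be sketched generically. This is not the RISE or MAS metric as defined in the literature, which have their own specifics; it is a plain deletion curve under stated assumptions, with a toy scoring function (`toy_score`, invented here) standing in for the model's probability of the original output.

```python
import numpy as np

def deletion_auc(score_fn, attribution):
    """Remove sentences from most to least attributed and integrate the score.

    score_fn(kept_mask) is a stand-in for the model's probability of the
    original output given only the kept prompt sentences (an assumption).
    """
    attribution = np.asarray(attribution, dtype=float)
    order = np.argsort(-attribution, kind="stable")
    kept = np.ones(len(attribution), dtype=bool)
    curve = [score_fn(kept)]
    for idx in order:
        kept[idx] = False
        curve.append(score_fn(kept))
    curve = np.asarray(curve, dtype=float)
    # Trapezoidal area with the x axis rescaled to unit length.
    return float(((curve[:-1] + curve[1:]) / 2).mean())

def toy_score(kept):
    """Toy model: the output only needs sentences 0 and 2."""
    return 0.5 * float(kept[0]) + 0.5 * float(kept[2])

good = deletion_auc(toy_score, [0.6, 0.0, 0.4, 0.0])  # ranks the needed sentences first
bad = deletion_auc(toy_score, [0.0, 0.6, 0.0, 0.4])   # ranks the distractors first
```

The attribution that ranks the needed sentences first makes the score collapse early and yields the smaller area (0.25 versus 0.75 in this toy), which is the sense in which lower is better.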

Those are not the same thing. Coverage asks whether the method finds the known relevant evidence. Faithfulness asks whether the method’s rankings align with actual model sensitivity. CAGE performs well on both, but the numbers should not be blended into one vague “better interpretability” smoothie.

| Evidence block | Likely purpose | What the paper reports | What it supports | What it does not prove |
| --- | --- | --- | --- | --- |
| Facts qualitative examples | Main illustrative evidence | CAGE captures source attribution plus reuse tracking; row attribution misses inter-generation influence | The graph representation exposes dependencies row attribution discards | That every visible edge corresponds to a true transformer mechanism |
| Math qualitative example | Main illustrative evidence | CAGE attributes all important prompt regions; row attribution misses critical context | Multi-step reasoning benefits from propagated attribution | That CAGE solves all reasoning interpretability |
| Math attribution coverage | Main quantitative evidence | CAGE wins 17/20 comparisons; max AC improvement 134%, average 40% | CAGE improves prompt-evidence coverage in multi-step math settings | That all improvements are large for every base method and model |
| MorehopQA/Facts faithfulness | Main quantitative evidence | On Llama 3 8B and Qwen 3 8B, CAGE wins 40/40 faithfulness tests; max improvement 30%, average 11% | The graph-based attribution better identifies inputs whose removal affects output probability | That perturbation metrics perfectly measure human-legible reasoning |
| Math faithfulness | Main quantitative evidence | CAGE wins another 40/40 faithfulness tests; max improvement 37%, average 16% | The same faithfulness advantage appears in chain-of-thought math | That the method’s explanations are complete causal proofs |
| Smaller-model appendix | Robustness extension | CAGE wins all 40 additional faithfulness tests on smaller Llama 3 3B and Qwen 3 4B for MorehopQA/Facts; max improvement 28%, average 11% | The pattern is not limited to the larger evaluated models | That scale effects are fully characterized |
| Normalization ablations | Ablation and sensitivity test | Removing row normalization or allowing raw signed values degrades stability; examples show value explosion and sign-flip erasure | Non-negativity and row-stochasticity are not optional cosmetics | That these are the only possible stable normalization choices |

The coverage results are especially revealing. CAGE wins 17 of 20 Math attribution-coverage comparisons across four model settings and five base methods. The largest gains appear for perturbation-style methods such as ReAGent and Perturbation. For example, on Qwen 3 4B, ReAGent improves from 0.196 to 0.458 AC, and Perturbation improves from 0.191 to 0.442 AC. Those are not microscopic formatting wins.

There are losses. For Llama 3 8B with Integrated Gradients, CAGE is slightly below row attribution: 0.151 versus 0.160. For Qwen 3 8B, IG and attention-IG also show small declines. The paper is honest enough to report them, and the losses are small relative to the large gains elsewhere. This matters because the correct interpretation is not “CAGE magically dominates every cell.” It is “CAGE usually improves coverage, especially where inter-generation influence is important and the base attribution benefits from path propagation.”
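For scale, the relative changes implied by the AC values quoted above work out as follows. This assumes the headline percentages are relative changes over row attribution; the article does not spell out the formula.

$$ \frac{0.458 - 0.196}{0.196} \approx +133.7\%, \qquad \frac{0.442 - 0.191}{0.191} \approx +131.4\%, \qquad \frac{0.151 - 0.160}{0.160} \approx -5.6\% $$

Under that assumption, the ReAGent result on the smaller Qwen model is consistent with the reported 134% maximum, and the Integrated Gradients loss is roughly one twenty-fourth of that magnitude.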

The faithfulness results are more uniform. On MorehopQA and Facts, using Llama 3 8B and Qwen 3 8B, CAGE wins every reported RISE and MAS comparison across all five base methods. On Math, across Llama 3 3B, Llama 3 8B, Qwen 3 4B, and Qwen 3 8B, CAGE again wins all 40 reported faithfulness tests. The average gains are more modest than the headline coverage gain: 11% on MorehopQA/Facts for the larger models, 16% on Math, and 11% in the smaller-model appendix.

That pattern is healthy. Interpretability papers sometimes lean on a giant headline number while the rest of the evidence quietly coughs into a napkin. Here, the larger coverage gains and smaller faithfulness gains tell a coherent story. CAGE is particularly strong at recovering missing prompt evidence in multi-step reasoning, while its sensitivity-based faithfulness improvements are consistent but not absurdly large. Consistent and not absurd is underrated.

The appendix is a design defense, not a second thesis

The most important appendix section is the normalization ablation. The authors test three configurations:

The proposed CAGE: non-negative and row-stochastic.
A non-negative variant without row normalization.
A fully unnormalized variant using raw attribution values, including negative values.

This is an ablation because it asks whether CAGE’s design choices are necessary, not whether some unrelated feature also works. The answer is clear: influence propagation helps in general, but the proposed normalization is what keeps the graph interpretable and stable.

The qualitative ablations are blunt. Without row-stochastic normalization, values can explode, and a non-target generated sentence can dominate the attribution. Without both non-negativity and row normalization, negative values can propagate as sign flips, erasing influence on key prompt sentences while magnitudes explode elsewhere.
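Both failure modes are easy to reproduce on a toy table. The scores, the sizes, and the helper names below are invented for illustration, and the three variants only mirror the ablation's configurations in spirit; this is not the paper's experiment.

```python
import numpy as np

def adjacency(table, n_prompt, clip, normalize):
    """Build A from a raw table, optionally clipping negatives and row-normalizing."""
    n_gen, n = table.shape
    vals = np.maximum(table, 0.0) if clip else table.astype(float)
    if normalize:
        # Toy assumption: every row has positive mass, so no division by zero.
        vals = vals / vals.sum(axis=1, keepdims=True)
    A = np.zeros((n, n))
    A[n_prompt:] = vals
    return A

def prompt_influence(A, n_prompt, node):
    """Prompt columns of row `node` in Y = A (I - A)^{-1}."""
    n = A.shape[0]
    Y = A @ np.linalg.inv(np.eye(n) - A)
    return Y[node, :n_prompt]

# Invented raw scores. Columns: [p0, p1, g0, g1, g2]; rows: g0, g1, g2.
T = np.array([
    [3.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 4.0, 0.0, 0.0],
    [1.0, 0.0, -5.0, 1.0, 0.0],  # one inhibitory (negative) entry
])

proposed = prompt_influence(adjacency(T, 2, clip=True, normalize=True), 2, 4)
unnormalized = prompt_influence(adjacency(T, 2, clip=True, normalize=False), 2, 4)
raw = prompt_influence(adjacency(T, 2, clip=False, normalize=False), 2, 4)
```

On this toy, the clipped and normalized variant returns a bounded split of 0.8 and 0.2 across the two prompt sentences. Dropping row normalization inflates the same attribution to 13 and 5. Propagating the raw signed values yields -2 and 0: one prompt sentence flips sign and the other is cancelled outright.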

This matters for business use because audit tools must not create explanations that are more dramatic than the model behavior they claim to explain. An unstable attribution graph is worse than no graph. It gives reviewers a false map with very confident arrows. Corporate governance already has enough PowerPoint-shaped fiction; it does not need more.

The ablation also defines the method’s boundary. CAGE intentionally discards negative attribution values during graph construction. Negative values can represent inhibitory effects, so this is not free. The paper argues that propagating signed values through the graph causes instability and cancellation, so the method chooses stable mediated influence over preserving every possible sign. That is a defensible engineering choice, but it should be understood as a modeling assumption, not a revelation from the transformer gods.

The business value is audit-path diagnosis, not prettier heatmaps

For business readers, CAGE should not be understood as “a better visualization.” Heatmaps are cheap. Explanatory discipline is expensive.

The practical value is in workflows where the final answer is not the only object worth auditing. Consider four common cases:

| Workflow | What row attribution can check | What graph attribution can add |
| --- | --- | --- |
| Customer-support agent | Which policy passage influenced the final reply | Whether earlier generated diagnosis steps carried the right user facts into the final recommendation |
| Contract-review assistant | Which clause influenced the final risk label | Whether the conclusion passed through the right obligation, exception, and condition chain |
| Research assistant | Which source paragraph influenced the summary answer | Whether intermediate synthesis steps preserved the right evidence instead of drifting |
| Compliance copilot | Which regulation text influenced the output | Whether the final action recommendation depends on the right procedural reasoning path |

The difference is operational. In a production system, the question is not merely, “Can we cite a source?” It is:

Did the model use the source in the right step?
Did it carry the right intermediate conclusion forward?
Did it ignore distractor material?
Did it reuse, skip, or overwrite information during generation?
Can a reviewer inspect the path without reading the entire raw transcript?

CAGE gives a possible mechanism for building those diagnostics. It could support audit dashboards that show not only source attribution but also generated-step dependency. It could help identify when a chain-of-thought-like output is decorative rather than functionally connected to the answer. It could also help compare prompts, retrieval settings, and agent designs by asking whether the model’s dependency paths become cleaner or noisier.
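As one illustration of such a diagnostic, the propagated matrix can be read in the other direction: how much does each generated step feed the final answer? The sketch below is hypothetical throughout. The function names, the 0.05 threshold, and the toy graph are assumptions, and the paper does not propose this check.

```python
import numpy as np

def step_influence_on_output(A, n_prompt, output_node):
    """Total path influence of each earlier generated step on output_node."""
    n = A.shape[0]
    Y = A @ np.linalg.inv(np.eye(n) - A)
    return Y[output_node, n_prompt:output_node]

def decorative_steps(A, n_prompt, output_node, threshold=0.05):
    """Flag generated steps whose influence on the output falls below an
    arbitrary threshold (0.05 is a placeholder, not a recommended value)."""
    influence = step_influence_on_output(A, n_prompt, output_node)
    return [n_prompt + i for i, value in enumerate(influence) if value < threshold]

# Hypothetical graph, node indices: p0=0, p1=1, g0=2, g1=3, answer=4.
A = np.zeros((5, 5))
A[2, 0] = 1.0  # g0 is built from p0
A[3, 1] = 1.0  # g1 restates p1 ...
A[4, 2] = 0.9  # ... but the answer draws on g0
A[4, 1] = 0.1  # and a little on p1 directly, never on g1

influence = step_influence_on_output(A, 2, 4)
flagged = decorative_steps(A, 2, 4)
```

In this toy, step `g1` (node 3) is flagged: it appears in the transcript, but no attribution path connects it to the answer. Whether a low score means “decorative” in a real system would still need validation against the kinds of checks discussed below.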

That is Cognaptus’ business inference, not the paper’s direct experimental claim. The paper shows improved attribution coverage and faithfulness under controlled tasks. The enterprise implication is that graph-based attribution may be useful for diagnosing reasoning-heavy automation. The uncertain part is how well the method transfers to real workflows with tool calls, retrieval systems, long documents, structured data, and policy constraints.

What this paper directly shows, and what it does not

The paper directly shows that CAGE improves context attribution over row-based alternatives across the evaluated settings. The evidence is strongest where outputs require multi-step retrieval or chain-of-thought-style reasoning: Math, MorehopQA, and Facts. It also shows that the method is modular across several base attribution methods and that its non-negative, row-stochastic graph construction is important for stability.

The paper does not show that CAGE reveals the true internal algorithm of the transformer. The authors explicitly describe it as a structured abstraction for tracing influence propagation, not a direct interpretation of model internals. That sentence deserves to be taken seriously. CAGE is not mechanistic interpretability in the circuit-tracing sense. It does not open the model and identify the exact internal computational circuit responsible for an answer.

The paper also does not solve the problem of whether the model’s reasoning text is semantically faithful in the human sense. CAGE can show that prompt influence propagates through generated steps according to the attribution structure. It cannot, by itself, prove that the generated reasoning is logically valid, normatively acceptable, or sufficient for a regulated decision.

Finally, the evaluation is centered on sentence-level attribution. That is practical and consistent with prior work, but enterprise documents often mix tables, clauses, metadata, code, and retrieved snippets. Moving from sentence-level paper tasks to messy production evidence units will require interface design, segmentation rules, and cost controls. The model may be autoregressive, but the invoice will also be autoregressive if someone runs attribution on everything without restraint.

How to use the idea without overbuying it

A sensible enterprise interpretation of CAGE is not “deploy attribution graphs everywhere.” It is more selective.

Use graph-based attribution where the answer is expensive to trust blindly and where intermediate generated steps matter. That includes compliance explanations, financial research notes, legal triage, safety-critical support flows, and multi-source analytical reports. Do not waste it on trivial single-hop answers where direct citation is enough.

Pair it with other checks. CAGE can help identify evidence paths, but it should sit beside retrieval validation, rule checks, answer verification, and human review thresholds. In a mature system, attribution is not the judge. It is a witness with a useful memory.

Also separate two audit layers:

| Audit layer | Question | CAGE relevance |
| --- | --- | --- |
| Source grounding | Did the answer use the right prompt or retrieved material? | Useful, especially when evidence travels through intermediate steps |
| Reasoning validity | Was the conclusion logically or procedurally correct? | Helpful for diagnosis, but not sufficient alone |
| Policy compliance | Is the final recommendation allowed under business rules? | Indirect; needs rule-based or expert validation |
| Model debugging | Where did the dependency chain drift? | Strong potential use case |
| User-facing explanation | What should we show the user? | Must be simplified; raw graphs may confuse rather than clarify |

This is the less glamorous but more useful reading. CAGE is a candidate diagnostic layer for reasoning-heavy systems, not a magic transparency sticker for the product page.

The real lesson: reasoning audits need memory

The paper’s strongest contribution is conceptual. It forces attribution to remember that autoregressive generation has a past.

Row attribution treats the final answer like a static object: inspect the output, look backward to the prompt, sum what touches. CAGE treats the answer like the endpoint of a process: inspect the output, trace the generated steps, propagate influence through the path.

That difference is exactly what enterprise AI governance has been missing. Many deployed LLM systems are already process systems. They retrieve, summarize, infer, compare, draft, revise, and recommend. Yet their audits often remain object audits: final answer, final citation, final confidence score. That is not enough when the failure lives in the middle.

CAGE does not make LLM reasoning transparent. It does something more modest and more useful: it gives the ghosts in the generation a graph. It shows where earlier tokens still haunt later answers.

For business use, that is the point. The answer is not only what the model says at the end. It is what the model remembered along the way.

Cognaptus: Automate the Present, Incubate the Future.

¹ Chase Walker and Rickard Ewetz, “Explaining the Reasoning of Large Language Models Using Attribution Graphs,” arXiv:2512.15663, 2025. https://arxiv.org/abs/2512.15663
