Opening — Why this matters now

Large language models don’t think—but they do accumulate influence. And that accumulation is exactly where most explainability methods quietly give up.

As LLMs move from single-shot text generators to multi-step reasoners, agents, and decision-making systems, we increasingly care about why an answer emerged, not just which prompt word each output token attended to. Yet most attribution tools still behave as if each generation step lived in isolation. That assumption is no longer just naïve; it is actively misleading.

The paper Explaining the Reasoning of Large Language Models Using Attribution Graphs introduces CAGE—Context Attribution via Graph Explanations—as a direct response to this failure mode. Its claim is simple, almost embarrassingly so: reasoning is sequential, so explanations should be too.

Background — Context attribution, but with amnesia

Attribution methods have a long and respectable history in vision and classification tasks. When adapted to autoregressive language models, they usually follow a predictable recipe:

  1. Treat each generated token as a classification step.
  2. Attribute that token to the prompt tokens.
  3. Sum attributions across selected outputs.

This approach—often called row attribution—sounds reasonable until you notice what it discards: inter-generational influence. In other words, how earlier generated tokens shape later ones.
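
To make the recipe concrete, here is a minimal sketch of what row attribution computes, written against a toy per-step attribution array rather than any real model or library; the variable names are mine, not the paper's.

```python
import numpy as np

# Toy setup: 4 prompt tokens, 3 generated tokens.
n_prompt, n_gen = 4, 3
rng = np.random.default_rng(0)

# Row attribution: for each generation step, keep only the scores on the
# prompt tokens and sum them across steps.
row_attr = np.zeros(n_prompt)
for t in range(n_gen):
    # Attribution scores for step t over everything that precedes it:
    # the prompt tokens plus the t earlier generated tokens.
    scores = rng.random(n_prompt + t)
    row_attr += scores[:n_prompt]   # kept: prompt-to-output influence
    # scores[n_prompt:] is silently dropped: that is the influence of
    # earlier generated tokens on this step, i.e. the reasoning trail.

print(row_attr)  # one importance score per prompt token, nothing else
```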

In chain-of-thought reasoning, this omission is fatal. Intermediate steps don’t merely decorate the answer; they construct it. Ignoring them is like explaining a proof by highlighting only the axioms and the final theorem, while pretending the lemmas never existed.

Analysis — What CAGE actually does

CAGE reframes attribution as a graph problem.

The attribution graph

Each prompt token and generated token becomes a node. Directed edges point forward in time, capturing how earlier tokens influence later generations. The resulting structure is a directed acyclic graph with two enforced properties:

  • Causality: influence only flows forward in generation order.
  • Row stochasticity: incoming influences to each generated token are non-negative and sum to one.

This is not an architectural claim about transformers—it’s a modeling abstraction. A disciplined one.
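
As a rough illustration (my own sketch, not the paper's code), both properties can be enforced while assembling the adjacency matrix from per-step attribution scores: clip to non-negative values, normalize each generated token's row, and only allow columns that precede it.

```python
import numpy as np

def build_adjacency(step_scores, n_prompt):
    """Assemble a causal, row-stochastic adjacency matrix A.

    step_scores[t]: attribution scores of generated token t over the
    n_prompt prompt tokens plus the t earlier generated tokens.
    """
    n = n_prompt + len(step_scores)
    A = np.zeros((n, n))
    for t, scores in enumerate(step_scores):
        row = n_prompt + t                    # node index of generated token t
        scores = np.clip(scores, 0.0, None)   # non-negativity
        A[row, :n_prompt + t] = scores / scores.sum()   # sums to one, causal
    return A   # strictly lower triangular: influence only flows forward

```

Prompt rows are left at zero by design: prompt tokens receive no incoming influence, only generated tokens do.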

From tables to propagation

Instead of summing attribution rows and hoping for the best, CAGE:

  1. Collects token-to-token attributions for every generation step.
  2. Normalizes them into an adjacency matrix.
  3. Propagates influence through the graph, marginalizing over all causal paths from prompt to output.

Mathematically, the total influence matrix is:

$$ Y = A (I - A)^{-1} $$
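
One way to unpack this closed form (my reading of the propagation step, not a quotation from the paper): because edges only point forward in generation order, A is strictly lower triangular and therefore nilpotent, so summing influence over causal paths of every length is a finite series with a clean closed form (here $n$ is the number of nodes, and nilpotency guarantees that $(I - A)$ is invertible):

$$ Y \;=\; \sum_{k \geq 1} A^{k} \;=\; A + A^{2} + \cdots + A^{n-1} \;=\; A\,(I - A)^{-1} $$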

Once computed, any subset of outputs can be explained by summing the corresponding rows and projecting back onto the prompt tokens. No re-running attribution. No special casing.
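
A compact end-to-end sketch of that read-out step, with a hand-built adjacency matrix standing in for real attribution scores; the function names and numbers are illustrative, not the paper's implementation.

```python
import numpy as np

def total_influence(A):
    """Y = A (I - A)^{-1}: influence summed over causal paths of every length."""
    n = A.shape[0]
    return A @ np.linalg.inv(np.eye(n) - A)

def explain_outputs(A, output_rows, n_prompt):
    """Sum the chosen output rows of Y, then keep only the prompt columns."""
    Y = total_influence(A)
    return Y[output_rows].sum(axis=0)[:n_prompt]

# Toy graph: 3 prompt tokens (nodes 0-2), 2 generated tokens (nodes 3-4).
A = np.zeros((5, 5))
A[3, :3] = [0.5, 0.3, 0.2]        # step 1 draws only on the prompt
A[4, :4] = [0.1, 0.1, 0.2, 0.6]   # step 2 leans heavily on step 1

print(explain_outputs(A, output_rows=[4], n_prompt=3))
# Roughly [0.40, 0.28, 0.32]: the 0.6 credit that step 2 gave to step 1
# has been propagated back onto the prompt. Plain row attribution would
# have reported only [0.1, 0.1, 0.2] and thrown the rest away.
```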

This single equation quietly fixes what years of incremental attribution tweaks could not.

Findings — What improves, and by how much

The authors evaluate CAGE across multiple models (Llama 3, Qwen 3), datasets (Facts, Math, MorehopQA), and base attribution methods (perturbation-based, Integrated Gradients (IG), and attention-based).

Coverage (did we catch the right evidence?)

| Method | Avg. Attribution Coverage Gain |
| --- | --- |
| IG-based | +15–20% |
| Perturbation-based | Up to 134% |
| Overall average | ~40% |

CAGE consistently recovers all required prompt elements in multi-step reasoning tasks—something row attribution routinely fails to do.

Faithfulness (does removing important context hurt the model?)

Across every perturbation-based faithfulness test, CAGE wins. Not sometimes. Not conditionally. Always.

That is a rare result in interpretability research—and a telling one.

Implications — Why this matters beyond explainability

CAGE is not just a better attribution trick. It has broader consequences:

  • Agent auditing: When agents act over long horizons, influence graphs provide a traceable reasoning spine.
  • Safety and compliance: Regulators care about causal accountability, not heatmaps.
  • Model debugging: Misreasoning can now be localized to specific generative steps, not just prompts.

Equally important is what CAGE doesn’t claim. It does not pretend to reveal the transformer’s true internal computation. It offers a stable, interpretable surrogate—and is honest about the abstraction.

In explainability, restraint is a virtue.

Conclusion — Reasoning deserves structure

Row attribution treats LLM outputs like a bag of tokens. CAGE treats them like a process.

That shift—from accumulation to propagation, from tables to graphs—is the paper’s real contribution. Everything else follows.

If we want to trust reasoning models, we first need to admit that reasoning leaves a trail. CAGE finally gives us a way to follow it.

Cognaptus: Automate the Present, Incubate the Future.