## Opening — Why this matters now
Explainable AI has always promised clarity. For years, that promise was delivered—at least partially—through feature attributions, saliency maps, and tidy bar charts explaining why a model predicted this instead of that. Then AI stopped predicting and started acting.
Tool-using agents now book flights, browse the web, recover from errors, and occasionally fail in slow, complicated, deeply inconvenient ways. When that happens, nobody asks which token mattered most. They ask: where did the agent go wrong—and how did it get there?
The paper behind this article makes a quiet but consequential argument: most explainability methods we rely on were designed for the wrong unit of intelligence.
## Background — Explainability before agents
Traditional explainable AI (XAI) lives in a static world. Models implement a fixed mapping from input to output, and explanations are attached to that single decision. Techniques like SHAP, LIME, and saliency maps excel here. They answer a well-posed question:
> Which input features influenced this prediction?
In domains like credit scoring or text classification, that framing mostly holds. Perturb the input, observe stable feature rankings, and you get explanations that are at least internally consistent.
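To make that framing concrete, here is a toy, hand-rolled version of the perturbation idea: replace one feature at a time with a baseline value and record how the predicted probability moves. The data and model below are invented for illustration; real methods such as SHAP and LIME estimate these influences far more carefully than this crude sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Invented tabular data standing in for a credit-scoring-style task.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                      # four made-up features
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)      # synthetic labels
model = GradientBoostingClassifier().fit(X, y)

def occlusion_attribution(model, x, baseline):
    """Score each feature by how much swapping it for a baseline value
    shifts the predicted probability (a crude local attribution)."""
    base_prob = model.predict_proba(x[None, :])[0, 1]
    scores = []
    for i in range(len(x)):
        x_perturbed = x.copy()
        x_perturbed[i] = baseline[i]
        scores.append(base_prob - model.predict_proba(x_perturbed[None, :])[0, 1])
    return np.array(scores)

print(occlusion_attribution(model, X[0], baseline=X.mean(axis=0)))
```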
But consistency is not the same as usefulness—especially once decision-making stretches across time.
## Analysis — When prediction becomes behavior
Agentic AI systems invert the old assumptions. Behavior is no longer a single inference but a trajectory:
$$ \tau = (s_0, a_0, o_0, s_1, a_1, o_1, \ldots, s_T) $$
State, action, observation—repeated until success or failure emerges. In this setting, asking for feature attributions is like explaining a chess match by highlighting the importance of the opening pawn move.
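To make the unit of analysis concrete, here is a minimal sketch of a trajectory in that $(s, a, o)$ form, written as plain Python records; the field names and the booking example are illustrative assumptions, not a schema from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    state: dict        # what the agent believes at time t (s_t)
    action: str        # tool call or message it emits (a_t)
    observation: str   # what the environment returns (o_t)

@dataclass
class Trajectory:
    goal: str
    steps: list[Step] = field(default_factory=list)

# A failure is a property of the whole run, not of any single step.
run = Trajectory(goal="Book SFO to JFK for Friday")
run.steps.append(Step(state={"leg": None},
                      action="search_flights(SFO, JFK)",
                      observation="12 results returned"))
```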
The paper formalizes this gap by contrasting static explainability with agentic explainability, then testing both on real benchmarks:
- A static text classification task
- Two agentic benchmarks: airline booking and web-based assistants
The result is not subtle. Attribution methods remain stable in static tasks (Spearman $\rho \approx 0.86$), but collapse as diagnostic tools once failures arise from multi-step execution.
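That stability number is a rank-correlation statement: compute attributions on an input and on a perturbed copy, then compare the two feature rankings. A minimal sketch of that check, with invented attribution scores rather than the paper's data:

```python
from scipy.stats import spearmanr

# Invented attribution scores for the same five features, computed on an
# original input and on a lightly perturbed copy (two features swap ranks).
attr_original  = [0.42, 0.31, 0.15, 0.08, 0.04]
attr_perturbed = [0.40, 0.28, 0.09, 0.18, 0.05]

rho, _ = spearmanr(attr_original, attr_perturbed)
print(f"rank stability: rho = {rho:.2f}")  # prints 0.90; high rho means stable rankings
```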
## Findings — What actually breaks agents
Instead of attributions, the authors evaluate agents using trace-grounded behavioral rubrics: explicit checks over execution logs. These rubrics ask questions humans actually care about (a code sketch of such checks follows the table):
| Rubric | What it checks |
|---|---|
| Intent Alignment | Are actions consistent with the stated goal? |
| Tool Correctness | Are tools called correctly and with valid parameters? |
| Tool Choice Accuracy | Was the right tool chosen? |
| State Consistency | Does internal state drift or stay coherent? |
| Error Recovery | Does the agent detect and recover from mistakes? |
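Here is a minimal sketch of how two of these rubrics could be checked against an execution trace, assuming a simplified trace format and a hypothetical tool schema; it illustrates the idea, not the paper's implementation.

```python
# Simplified trace: one dict per step, recording the tool called, the
# argument names supplied, and the state the agent believed at that point.
TOOL_SCHEMAS = {"search_flights": {"origin", "destination", "date"}}  # hypothetical

def tool_correctness(trace):
    """Flag steps whose tool is unknown or is missing required parameters."""
    violations = []
    for i, step in enumerate(trace):
        required = TOOL_SCHEMAS.get(step["tool"])
        if required is None or not required.issubset(step["args"]):
            violations.append(i)
    return violations

def state_consistency(trace):
    """Flag steps where a previously committed fact silently changes."""
    committed, violations = {}, []
    for i, step in enumerate(trace):
        for key, value in step["state"].items():
            if key in committed and committed[key] != value:
                violations.append((i, key))
            committed[key] = value
    return violations

trace = [
    {"tool": "search_flights", "args": {"origin", "destination", "date"},
     "state": {"destination": "JFK"}},
    {"tool": "search_flights", "args": {"origin", "destination"},
     "state": {"destination": "LGA"}},   # date dropped, destination drifted
]
print(tool_correctness(trace), state_consistency(trace))  # [1] [(1, 'destination')]
```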
Across benchmarks, one pattern dominates: state inconsistency.
- In airline-booking agents, state-tracking failures were 2.7× more common in failed runs than in successful ones.
- When state consistency was violated, success probability dropped by nearly 50%.
These are not errors you can see in a single step. They accumulate quietly—small misalignments compounding until recovery becomes impossible.
By contrast, attribution methods could only say which types of behavior correlated with success in aggregate. They could not explain why this run failed.
## Bridging experiment — Where SHAP still fits
To be fair, the paper doesn’t discard attribution outright. Instead, it compresses agent trajectories into low-dimensional behavioral features (the rubrics above), then applies SHAP to a surrogate model.
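In outline, that pipeline is: score each run on the behavioral rubrics, fit a surrogate model that predicts success from those scores, then attribute the surrogate. Here is a minimal sketch under those assumptions, with synthetic rubric scores and an arbitrary surrogate choice rather than the paper's exact setup.

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic rubric scores per run (rows) and a synthetic success label.
rng = np.random.default_rng(0)
features = ["intent_alignment", "state_consistency", "tool_correctness"]
X = rng.uniform(size=(200, 3))                    # rubric scores in [0, 1]
y = (X @ np.array([1.2, 1.0, 0.9])
     + rng.normal(scale=0.3, size=200) > 1.6).astype(int)

# Surrogate: predict run success from the behavioral features alone.
surrogate = GradientBoostingClassifier().fit(X, y)

# Attribute the surrogate, then summarize as mean |SHAP| per feature.
shap_values = shap.TreeExplainer(surrogate).shap_values(X)
for name, importance in zip(features, np.abs(shap_values).mean(axis=0)):
    print(f"{name}: mean |SHAP| = {importance:.3f}")
```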
The result is revealing:
| Behavioral Feature | Mean \|SHAP\| |
|---|---|
| Intent Alignment | 0.473 |
| State Consistency | 0.422 |
| Tool Correctness | 0.415 |
Attribution works again—but only after behavior has been abstracted into human-meaningful dimensions. Even then, it explains what matters overall, not what broke where.
This is the paper’s most pragmatic takeaway: attribution is a summary tool, not a diagnostic one.
## Implications — A new explainability contract
The authors introduce the Minimal Explanation Packet (MEP) to formalize what agentic explainability requires (a sketch of the packet follows this list):
- Explanation artifact — traces, tool calls, reasoning steps
- Linked evidence — logs, observations, retrieved documents
- Verification signals — replay checks, rubric violations, consistency tests
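One way to picture an MEP is as a single container bundling those three parts. The sketch below uses assumed field names for illustration, not a schema defined in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class MinimalExplanationPacket:
    # Explanation artifact: the behavioral story of the run.
    trace: list[dict]                  # states, actions, observations
    reasoning_steps: list[str]

    # Linked evidence: raw material the explanation points back to.
    logs: list[str] = field(default_factory=list)
    retrieved_documents: list[str] = field(default_factory=list)

    # Verification signals: checks a third party can re-run.
    rubric_violations: dict[str, list] = field(default_factory=dict)
    replay_passed: bool | None = None
```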
Explainability becomes less about persuasion and more about auditability. Less narrative, more forensics.
For regulated or safety-critical domains—finance, healthcare, enterprise automation—this is not academic hair-splitting. It determines whether explanations can support debugging, accountability, and trust.
## Conclusion — Explanations that move with time
Static XAI is not wrong. It is just narrowly scoped.
As AI systems move from predicting outcomes to executing plans, explanations must follow. Feature attributions explain decisions. Trajectories explain behavior. Confusing the two leads to comforting visuals—and very little understanding.
Agentic AI doesn’t fail loudly. It fails slowly, structurally, and across time. If we want explanations that matter, we need to stop asking why the model answered—and start asking how the agent behaved.
Cognaptus: Automate the Present, Incubate the Future.