A customer-service agent rebooks a flight, checks a policy, calls an API, updates the passenger record, apologizes politely, and still gets the outcome wrong.

The old explainability question would be: which input tokens influenced the final answer?

That question is not useless. It is just late to the crime scene.

When an AI system only predicts, explanation can focus on a single input-output decision. When an AI system acts, explanation has to follow the behavior across time: the state it maintained, the tool it selected, the observations it received, the recovery move it attempted, and the point where the run quietly became unrecoverable. A nice feature-importance chart does not tell you that. It tells you what mattered to a prediction, not how a workflow failed.

That is the central shift in “From Features to Actions: Explainability in Traditional and Agentic AI Systems,” an arXiv paper by Chaduvula and co-authors [1]. The paper’s useful contribution is not that it declares traditional explainable AI obsolete. That would be fashionable, dramatic, and mostly wrong. Its sharper point is that traditional XAI was built around the wrong unit of analysis for agents.

Static AI explains predictions. Agentic AI needs explanations of trajectories.

The unit of explanation changes when the model starts acting

Traditional explainable AI grew up in a relatively clean world. A model receives an input $x$, produces an output $y$, and the explanation tries to identify which parts of $x$ mattered most. SHAP, LIME, saliency maps, partial dependence plots, and similar tools fit naturally into this framing.

In that world, the question is well-posed:

Why did this model produce this prediction for this input?

For many business systems, this remains useful. A credit model can highlight income, repayment history, and debt ratio. A text classifier can show that “Python,” “AWS,” and “Kubernetes” pushed a job posting toward an IT category. The explanation is not perfect, but the object being explained is clear.

Agentic systems break that neat geometry. An LLM-based agent does not merely map input to output. It moves through a sequence:

$$ \tau = (s_0, a_0, o_0, s_1, a_1, o_1, \ldots, s_T) $$

Here, $s$ is state, $a$ is action, and $o$ is observation. In plain English: the agent remembers something, does something, sees something, updates its understanding, and repeats the loop until it succeeds, fails, times out, or produces something that looks plausible enough to pass casual inspection. Enterprise software departments may recognize the pattern.
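The trajectory notation maps naturally onto a small data structure. The following is a minimal sketch, not anything taken from the paper: the class names `Step` and `Trajectory`, the field names, and the toy rebooking values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Step:
    """One (s_t, a_t, o_t) triple from the trajectory."""
    state: dict[str, Any]   # s_t: what the agent believes before acting
    action: str             # a_t: for example a tool name or a reply
    observation: Any        # o_t: what came back from the environment


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_state: dict[str, Any] = field(default_factory=dict)  # s_T

    def flatten(self) -> list[Any]:
        """Return (s_0, a_0, o_0, ..., s_T) in the order used by tau."""
        out: list[Any] = []
        for step in self.steps:
            out.extend([step.state, step.action, step.observation])
        out.append(self.final_state)
        return out


# Toy two-step run with hypothetical tool names.
tau = Trajectory(
    steps=[
        Step({"goal": "rebook"}, "lookup_booking", {"booking": "found"}),
        Step({"goal": "rebook", "booking": "found"}, "search_flights", {"options": 2}),
    ],
    final_state={"goal": "rebook", "booking": "found", "options": 2},
)
```

Keeping state, action, and observation as separate fields matters more than the exact schema: it is what later lets a check point at a specific step rather than at the run as a whole.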

This changes the explanation problem. The business user does not only need to know why the final response looked reasonable. They need answers to four questions, and each answer looks different once the system acts rather than predicts:

| Question | Static prediction framing | Agentic workflow framing |
| --- | --- | --- |
| What is being explained? | One prediction | A multi-step trajectory |
| Where can failure happen? | Input processing or model decision | Planning, tool choice, tool arguments, state updates, retrieval, recovery |
| What evidence matters? | Input features, tokens, coefficients, saliency | Tool logs, observations, state snapshots, retrieved evidence, action sequence |
| What makes the explanation useful? | Feature influence and stability | Failure localization, replayability, auditability, verification |

That is the mechanism-first reading of the paper: once behavior unfolds across steps, explanation must move from feature influence to trajectory integrity.

A token attribution cannot tell you that an airline agent forgot the passenger’s constraint after the third tool call. A saliency map cannot tell you that a web agent selected the wrong interaction affordance and then wasted the remaining budget. A chain-of-thought trace may describe what the agent said it was doing, but unless it is linked to actual tool calls, observations, and state changes, it is still closer to a witness statement than an audit trail.

Useful, perhaps. Sufficient, no.

The paper’s framework: explanations need artifact, context, and verification

The paper introduces a compact framework called the Minimal Explanation Packet, or MEP. The name is a little bureaucratic, but the idea is practical.

An explanation should not be treated as a standalone artifact. It should be packaged with the context that grounds it and the verification signals that make it trustworthy.

| MEP component | Static example | Agentic example | Business interpretation |
| --- | --- | --- | --- |
| Explanation artifact | SHAP values, LIME explanation, saliency map | Reasoning/action trace, tool-call summary, trajectory diagnosis | The human-readable explanation |
| Linked context | Input text, predicted label, model confidence | User request, tool arguments, tool outputs, observations, retrieved documents, state updates | The evidence needed to inspect the explanation |
| Verification signals | Perturbation stability, rank consistency | Rubric flags, replay checks, state-consistency checks, tool correctness checks | The reason the explanation is not merely decorative |

The important move is the third column. In agentic systems, context is not optional. Without logs, tool outputs, and state snapshots, the explanation becomes a narrative over a missing process. That may look professional in a dashboard, but so does a smoke detector without batteries.

The MEP framing also separates three ideas that businesses often mix together:

  1. Transparency: Can we see what happened?
  2. Diagnosis: Can we identify where the failure emerged?
  3. Verification: Can we test whether the explanation is grounded in the actual run?

Most AI governance discussions are comfortable with the first. Production operations require the second and third.
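The three-part packet described in this section can be sketched as a small record type. This is an illustration under stated assumptions, not the paper's schema: the class name `ExplanationPacket`, the `is_grounded` rule, and the required context keys are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExplanationPacket:
    artifact: str                                                 # human-readable explanation
    context: dict[str, Any] = field(default_factory=dict)         # linked evidence
    verification: dict[str, bool] = field(default_factory=dict)   # check name -> passed?

    def is_grounded(self, required_context: set[str]) -> bool:
        """True only if all required evidence is attached and every check passed."""
        has_context = required_context.issubset(self.context)
        has_checks = bool(self.verification) and all(self.verification.values())
        return has_context and has_checks


# Hypothetical minimum evidence for an agentic run.
REQUIRED = {"user_request", "tool_calls", "tool_outputs", "state_snapshots"}

narrative_only = ExplanationPacket(artifact="Agent rebooked the flight per policy.")

full_packet = ExplanationPacket(
    artifact="Baggage constraint dropped from state after step 3.",
    context={
        "user_request": "Rebook, keep one checked bag",
        "tool_calls": ["get_reservation", "update_flight"],
        "tool_outputs": ["ok", "ok"],
        "state_snapshots": [{"bags": 1}, {}],
    },
    verification={"replay_check": True, "rubric_flag_matches_trace": True},
)
```

Here `narrative_only.is_grounded(REQUIRED)` is false and `full_packet.is_grounded(REQUIRED)` is true. The design choice worth noting is that an artifact with no verification signals is treated as ungrounded by default, which mirrors the point that transparency alone is not diagnosis.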

The static experiment shows attribution still works where its assumptions hold

The paper first uses a static text classification task as a calibration baseline. The task is binary classification of online job postings into IT versus non-IT categories. The authors compare a TF-IDF plus Logistic Regression model with a Text CNN baseline and evaluate explanation stability using Spearman rank correlation under perturbations.

The result is straightforward:

| Static model | Explanation stability (Spearman rank correlation) |
| --- | --- |
| TF-IDF + Logistic Regression | 0.8577 |
| Text CNN | 0.6127 |

The interpretation is not “linear models are good and neural networks are bad.” That would be too easy, and therefore suspicious.

The better interpretation is that attribution methods are most stable when the task structure and model representation are aligned with semantically meaningful features. In a sparse text classification task, words and phrases can carry interpretable signals. SHAP and LIME can surface those signals in a way that is stable enough to be useful.
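To make the stability idea concrete, here is a minimal sketch of rank-based stability: compare the attribution vector for an input with the attribution vectors obtained after perturbing it, using Spearman rank correlation. The perturbation scheme and the averaging step are assumptions, since the exact protocol is not reproduced here, and the attribution values below are toy numbers rather than results from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation, computed as Pearson correlation of the rank vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


def stability(original, perturbed_runs):
    """Mean Spearman correlation between original and perturbed attributions."""
    scores = [spearman(original, p) for p in perturbed_runs]
    return sum(scores) / len(scores)


# Toy attributions for four tokens, before and after a perturbation.
original = [0.9, 0.5, 0.3, 0.1]
swapped = [0.9, 0.3, 0.5, 0.1]   # two middle tokens trade places
```

With these toy vectors, `spearman(original, swapped)` is 0.8: one adjacent swap among four features is enough to cost a fifth of the score, which gives a feel for what separates 0.86 from 0.61.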

This matters because the paper is not throwing traditional XAI into the recycling bin. It shows that traditional XAI works reasonably well under the conditions it was designed for: fixed input-output mapping, meaningful features, and explanation targets that live at the prediction level.

The problem begins when we ask those tools to diagnose a sequence of decisions.

Agent failures are not just wrong answers; they are damaged trajectories

For the agentic setting, the paper evaluates two tool-use benchmarks:

| Benchmark | Agent/model setting reported in the paper | Tasks | Successes | Failures | Accuracy |
| --- | --- | --- | --- | --- | --- |
| TAU-bench Airline | Tool-calling agent using o4-mini-2025-04-16 | 50 | 28 | 22 | 56.00% |
| AssistantBench | Browser agent using GPT-4.1 | 33 | 2 | 31 | 17.39% |

Accuracy already tells us that the two environments are very different. (The accuracy column is quoted as reported; for AssistantBench it is not the raw success fraction, since 2 of 33 is about 6%, so it presumably reflects the benchmark’s own scoring rather than a simple pass count.) TAU-bench Airline is structured around airline customer-service tasks and API-mediated actions. AssistantBench involves web-based assistance tasks requiring navigation and information gathering. One is a constrained tool environment; the other is a messier open-web task environment. The web remains undefeated as a machine for converting simple goals into strange procedural suffering.

But the more useful part of the paper is not the accuracy table. Accuracy tells us that agents failed. It does not tell us how.

To diagnose the failures, the authors use trace-grounded behavioral rubrics. The rubrics include:

| Rubric | What it checks |
| --- | --- |
| Intent Alignment | Whether actions remain consistent with the user’s goal |
| Plan Adherence | Whether the agent maintains a coherent multi-step plan |
| Tool Correctness | Whether tool calls use valid tools and parameters |
| Tool-Choice Accuracy | Whether the selected tool is appropriate for the subtask |
| State Consistency | Whether the agent maintains coherent state across steps |
| Error Recovery | Whether the agent detects and recovers from failures |

These rubrics are applied to execution traces. That detail matters. The judge is not supposed to look at the final outcome and then invent a plausible explanation backward. It evaluates observable trajectory evidence: actions, tool calls, observations, and intermediate state. The method is still post-hoc and still relies on an LLM judge, but it is grounded in the run rather than in a final answer alone.
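What "grounded in the run" means is easiest to see with the one rubric that can be partly checked without a judge. The sketch below is a deterministic stand-in for Tool Correctness, not the paper's LLM-judge method: it checks logged tool calls against a schema and reports the step index of each violation. The tool names and required arguments are hypothetical and are not the benchmark's actual API.

```python
# Hypothetical tool registry: tool name -> required argument names.
TOOL_SCHEMAS = {
    "get_reservation": {"reservation_id"},
    "update_flight": {"reservation_id", "flight_number"},
}


def tool_correctness_violations(trace):
    """Return (step_index, reason) for each tool call that breaks its schema."""
    violations = []
    for i, step in enumerate(trace):
        tool, args = step["tool"], step["args"]
        if tool not in TOOL_SCHEMAS:
            violations.append((i, f"unknown tool: {tool}"))
            continue
        missing = TOOL_SCHEMAS[tool] - set(args)
        if missing:
            violations.append((i, f"missing args: {sorted(missing)}"))
    return violations


# Toy execution trace: only the logged calls, no final outcome.
trace = [
    {"tool": "get_reservation", "args": {"reservation_id": "R1"}},
    {"tool": "update_flight", "args": {"reservation_id": "R1"}},
    {"tool": "refund_all", "args": {}},
]

violations = tool_correctness_violations(trace)
```

The check never looks at whether the task succeeded. It sees only what the agent did, which is the property the trace-only rubric evaluation is meant to preserve; rubrics such as intent alignment need judgment that a schema check cannot supply.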

This is where the paper’s argument becomes operationally useful. It lets us distinguish “slow failures” from “fast failures.”

State drift is the slow failure pattern

In TAU-bench Airline, the strongest failure signal is state tracking consistency. The paper reports that state tracking violations are much more common in failed runs than successful runs: 0.526 in failures versus 0.194 in successes, a ratio of 2.719.

That is the paper’s most business-relevant number.

State inconsistency is not glamorous. It does not sound like a dramatic reasoning failure. It sounds like bookkeeping. Unfortunately, much of enterprise automation is bookkeeping with consequences.

An airline agent may begin with the correct passenger intent, retrieve the correct policy, and call plausible tools. But if it mis-tracks a constraint, forgets a previous observation, or carries forward a stale assumption, the trajectory starts to diverge from reality. Early steps may still look reasonable. The final failure appears later, often after the system has accumulated enough small inconsistencies to make recovery difficult.

The paper’s reliability analysis sharpens this point. In TAU-bench Airline, when state tracking consistency is violated, the success rate is 0.375; when it is not violated, the success rate is 0.735. That is a 36-percentage-point drop, with a relative ratio of 0.51.
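The comparison above is a conditional success rate: split runs by whether a rubric was violated, then compute the success rate in each group. The sketch below shows that computation on toy run records (the record layout and the four toy runs are assumptions, not the paper's data) and then redoes the arithmetic on the two rates quoted above.

```python
def success_rate_by_flag(runs, rubric):
    """Success rate among runs that violated `rubric` and among runs that did not."""
    def rate(group):
        return sum(r["success"] for r in group) / len(group) if group else float("nan")

    violated = [r for r in runs if r["violations"].get(rubric, False)]
    clean = [r for r in runs if not r["violations"].get(rubric, False)]
    return rate(violated), rate(clean)


# Toy run records, purely illustrative.
toy_runs = [
    {"success": True, "violations": {"state_consistency": False}},
    {"success": False, "violations": {"state_consistency": True}},
    {"success": True, "violations": {"state_consistency": True}},
    {"success": True, "violations": {}},
]
violated_success, clean_success = success_rate_by_flag(toy_runs, "state_consistency")

# Arithmetic on the rates quoted in the text.
reported_violated, reported_clean = 0.375, 0.735
drop_in_points = round((reported_clean - reported_violated) * 100)  # percentage points
relative_ratio = round(reported_violated / reported_clean, 2)
```

On the quoted rates, `drop_in_points` is 36 and `relative_ratio` is 0.51, matching the figures in the text. The same function can be run per rubric to rank which violations are most associated with failure, with the usual caveat that association is all it measures.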

This does not prove that state inconsistency causes every failure. The paper’s method is correlational. But it strongly suggests that state tracking is not a minor logging detail. It is a core reliability variable.

For business deployment, the inference is direct: if an agent is expected to complete multi-step tasks, state consistency should be monitored as a first-class operational metric. It should not be buried inside a generic “agent quality” score.

Tool choice is the fast failure pattern

AssistantBench shows a different pattern. The benchmark has a very low success rate in the reported setup: only 2 successful runs out of 33. Under such a low-success regime, many metrics become hard to interpret because almost everything is associated with failure. Still, the paper highlights tool-choice accuracy and plan adherence as sparse but decisive blockers.

In AssistantBench, tool-choice accuracy violations appear only in unsuccessful runs in the failure-prevalence table. In the reliability table, tool-choice accuracy and plan adherence violations correspond to zero success. The sample is small and the success rate is low, so this should not be overread as a universal law of browser agents. But the pattern is intuitive.

In open-ended web tasks, one wrong branch can collapse the run. The agent chooses the wrong page interaction, follows the wrong evidence path, or commits to a bad subtask decomposition. Unlike state drift, which accumulates gradually, tool-choice failure can be immediate. The run may not degrade; it may simply take the wrong exit and never return.

So the two benchmarks illustrate two different operational failure modes:

| Failure pattern | Benchmark signal | What happens | Operational response |
| --- | --- | --- | --- |
| Slow failure | TAU-bench Airline state inconsistency | Small state errors accumulate across steps | State snapshots, invariant checks, memory reconciliation, replay |
| Fast failure | AssistantBench tool-choice or plan-adherence violations | A wrong branch blocks the task early | Tool-selection audits, affordance checks, step-budget-aware recovery |

This distinction is more useful than a single agent accuracy score. Accuracy tells you how often the system failed. Trace diagnostics tell you which engineering team should lose sleep first.
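For the slow failure pattern, an invariant check over state snapshots is one of the cheaper responses. The following is a minimal sketch, assuming the agent's state is logged as a dictionary at each step and that the user's constraints can be written as fixed key-value pairs; the function name, field names, and values are all hypothetical.

```python
def first_drift_step(snapshots, invariants):
    """Return (step_index, key) for the first snapshot that breaks an invariant, else None."""
    for i, state in enumerate(snapshots):
        for key, expected in invariants.items():
            if state.get(key) != expected:
                return i, key
    return None


# The user's constraints, fixed at the start of the run (toy values).
invariants = {"passenger": "A. Lee", "max_price": 400}

snapshots = [
    {"passenger": "A. Lee", "max_price": 400},
    {"passenger": "A. Lee", "max_price": 400, "candidate": "F102"},
    {"passenger": "A. Lee", "candidate": "F102"},                  # max_price silently dropped
    {"passenger": "A. Lee", "candidate": "F102", "booked": True},
]

drift = first_drift_step(snapshots, invariants)
```

Here `drift` is `(2, "max_price")`: the run looks healthy for two steps, loses the budget constraint at the third, and only books at the fourth. The check localizes the drift to a step, which a run-level pass/fail label cannot do.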

The bridging experiment shows where SHAP still belongs

The paper then runs a clever bridging experiment. Instead of applying attribution directly to raw agent trajectories, the authors compress trajectories into rubric-level binary features. Each feature represents whether a behavioral constraint was satisfied or violated. They then train a logistic regression model to predict task success and use SHAP to estimate which rubric features most influence the surrogate prediction.

The reported global SHAP values are:

| Rubric attribute | Mean absolute SHAP value |
| --- | --- |
| Intent Alignment | 0.473 |
| State Tracking Consistency | 0.422 |
| Tool Correctness | 0.415 |
| Tool Choice Accuracy | 0.122 |
| Error Awareness & Recovery | 0.115 |
| Plan Adherence | 0.090 |

This result is easy to misread. It does not mean SHAP has become a full explanation method for agents. It means SHAP becomes useful again after agent behavior has already been translated into human-meaningful behavioral features.

That is a very different claim.

The attribution method can summarize which rubric dimensions are globally important for predicting success in the surrogate model. It can say, roughly, “Intent alignment and state tracking matter a lot across these runs.” That is useful for portfolio-level analysis, dashboard design, and prioritizing evaluation categories.
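The shape of the bridging experiment can be sketched in a few lines: binary rubric features, a logistic regression surrogate, and a global mean absolute SHAP value per feature. This is an illustration only. The rubric matrix is toy data, the gradient-descent fit stands in for whatever training setup the authors used, and the closed-form attribution below (valid for linear models in log-odds space under a feature-independence assumption) is swapped in for the `shap` library so the block needs no dependencies.

```python
import math

FEATURES = ["intent_alignment", "state_consistency", "tool_correctness"]

# Toy rubric matrix: 1 = constraint satisfied, 0 = violated. Not the paper's data.
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
     [1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 0]]
y = [1, 1, 0, 0, 0, 0, 0, 0]   # toy rule: success needs intent and state both satisfied


def fit_logistic(X, y, lr=0.5, epochs=2000, l2=0.05):
    """Plain batch gradient descent on L2-regularised log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = 1 / (1 + math.exp(-(b + sum(wj * xj for wj, xj in zip(w, xi)))))
            for j in range(d):
                grad_w[j] += (p - yi) * xi[j] / n
            grad_b += (p - yi) / n
        w = [wj - lr * (gj + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def linear_shap(w, X):
    """Per-run attributions for a linear model: phi_ij = w_j * (x_ij - mean_j)."""
    means = [sum(col) / len(col) for col in zip(*X)]
    return [[wj * (xj - mj) for wj, xj, mj in zip(w, row, means)] for row in X]


w, b = fit_logistic(X, y)
phi = linear_shap(w, X)
mean_abs_shap = {
    name: sum(abs(row[j]) for row in phi) / len(phi)
    for j, name in enumerate(FEATURES)
}
```

Because the toy success rule ignores tool correctness, that feature's global importance comes out far smaller than the other two. Note what the output is: one number per rubric across all runs. Nothing in `mean_abs_shap` identifies a step inside any single run.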

But it still does not answer the operational question:

In this failed run, which step broke the task?

For that, the system needs the trace. It needs the tool call. It needs the observation. It needs the state update. It needs the moment where the plan and reality separated.

So the practical conclusion is not “replace SHAP with rubrics.” It is more precise:

Use attribution for aggregate importance once behavior has been abstracted. Use trace-grounded diagnostics for per-run failure analysis.

That is less catchy than “SHAP is dead,” but it has the advantage of being true.

Why chain-of-thought is not enough

A likely misconception around agentic explainability is that chain-of-thought or reasoning traces solve the problem. After all, if the model writes down its reasoning, do we not have an explanation?

Not quite.

Reasoning traces are useful because they make the process legible. But they are self-reported. They can omit tool effects, misrepresent causal drivers, or present a tidy rationale after the model has already drifted. In a tool-using agent, behavior is not only generated by internal text. It is shaped by retrieved documents, tool outputs, state updates, API failures, interface constraints, and recovery attempts.

The paper’s framing is helpful here: reasoning traces become more reliable when they are aligned with interaction logs. A trace should connect:

  1. Stated intent;
  2. Chosen action;
  3. Tool call and arguments;
  4. Tool output or environment observation;
  5. State update;
  6. Verification signal.

Without those links, the trace may be readable but not auditable. It becomes an explanation-shaped object.
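The six links listed above translate into a simple completeness check on each logged step. This is a minimal sketch with hypothetical field names; a real logging schema would differ, and presence of a field says nothing yet about whether its content is correct.

```python
REQUIRED_LINKS = (
    "stated_intent", "action", "tool_call",
    "observation", "state_update", "verification",
)


def missing_links(step_record):
    """Names of required links that are absent (or None) in one step record."""
    return [k for k in REQUIRED_LINKS if step_record.get(k) is None]


def is_auditable(trace):
    """In this sketch, a trace is auditable only if every step carries all six links."""
    return all(not missing_links(step) for step in trace)


# A step that reads well but links to nothing the environment actually did.
readable_only = {"stated_intent": "Find a cheaper flight", "action": "search"}

# A step with every link present (toy values).
linked = {
    "stated_intent": "Find a cheaper flight",
    "action": "search",
    "tool_call": {"name": "search_flights", "args": {"max_price": 400}},
    "observation": {"results": ["F102"]},
    "state_update": {"candidate": "F102"},
    "verification": {"args_match_constraint": True},
}
```

For `readable_only`, `missing_links` returns the four absent links: tool call, observation, state update, and verification. That is the gap between a readable transcript and an auditable one, expressed as a list.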

For business teams, this matters because many agent demos already show verbose reasoning, intermediate steps, and “thought process” panels. These can create the feeling of transparency while still failing to support diagnosis. The problem is not that the interface is dishonest. The problem is that a readable transcript is not the same as a verified execution account.

A proper agent explanation should let an operator answer:

  • Did the agent choose the right tool?
  • Did it pass the right arguments?
  • Did it interpret the tool output correctly?
  • Did it preserve the user’s constraint across later steps?
  • Did it detect the error when the environment pushed back?
  • Did the final action follow from the observed state?

That is not a narrative feature. It is a logging and evaluation architecture.

What businesses should actually build from this

The paper directly shows that trace-grounded rubrics provide more diagnostic visibility than attribution methods for the studied agentic benchmarks. It does not prove that every production agent should use the exact same rubrics, the exact same judge, or the exact same benchmark design.

The business inference is therefore architectural, not prescriptive.

Firms deploying LLM agents should treat explainability as an operational packet rather than a visualization layer. For each meaningful agent run, the system should preserve enough evidence to reconstruct the trajectory and enough evaluation signals to localize failure.

A practical agentic MEP could look like this:

| Layer | Minimum implementation | Better implementation | ROI relevance |
| --- | --- | --- | --- |
| Trace capture | Store prompts, actions, tool calls, tool outputs, final response | Add structured state snapshots, retrieved evidence IDs, environment observations, timestamps | Reduces debugging time |
| Behavioral rubrics | Score runs for intent, tool correctness, state consistency, recovery | Customize rubrics by workflow and risk class | Prioritizes reliability work |
| Verification | Check schema validity and tool-call success | Add replay tests, invariant checks, counterfactual tests, human spot reviews | Supports audit and compliance |
| Failure localization | Label failed runs globally | Link violations to steps and evidence | Turns “agent failed” into fixable work |
| Reporting | Aggregate pass/fail rates | Separate slow failures, fast blockers, recoverable errors, and fatal errors | Better engineering allocation |

The ROI pathway is not mysterious. Better explanations reduce the cost of diagnosis. They help teams decide whether a failure came from prompt design, tool schema, retrieval quality, memory/state tracking, missing guardrails, or impossible task setup. In mature deployments, that can matter more than a marginal gain in benchmark accuracy.

A slightly uncomfortable implication follows: many agent products are currently easier to demo than to inspect. The paper gives a vocabulary for closing that gap.

What the paper shows, what Cognaptus infers, and what remains uncertain

It is worth separating the evidence from the practical extrapolation.

| Level | Claim | Status |
| --- | --- | --- |
| Directly shown in the paper | Static attribution methods are stable in the studied static text classification setup, especially for TF-IDF plus Logistic Regression | Supported by the reported stability experiment |
| Directly shown in the paper | In the studied agent benchmarks, trace-grounded rubrics identify behavior-level failure patterns such as state inconsistency and tool-choice errors | Supported by rubric analysis over TAU-bench Airline and AssistantBench |
| Directly shown in the paper | SHAP can summarize global importance over rubric-derived features but does not localize specific trajectory failures | Supported by the bridging experiment |
| Cognaptus inference | Production agent monitoring should include trace logging, state checks, tool-call audits, rubric labels, and replayable explanation packets | Reasonable operational extension, not directly proven as an ROI study |
| Still uncertain | Whether the same rubric categories and judge setup generalize to all agent architectures, domains, and long-memory systems | Open question |

This distinction matters. The paper is not a universal production manual. It is a research argument supported by two benchmarks, a static baseline, trace-derived rubric evaluation, and a bridging experiment.

That is enough to motivate a better explainability playbook. It is not enough to declare a final standard.

The limits are mostly about scope, judges, and causality

The paper’s limitations are practical rather than cosmetic.

First, the agentic evidence comes from TAU-bench Airline and AssistantBench. These are useful benchmarks, but they do not cover every kind of agent. Embodied agents, multi-agent systems, long-term memory agents, online-learning agents, and enterprise agents wired into messy internal systems may produce different failure modes.

Second, the rubric labels are generated by an LLM judge through Docent. The paper reduces outcome leakage by using trace-only evaluation, but LLM judging still introduces subjectivity. Fixed prompts help consistency; they do not turn judgment into physics.

Third, the method depends on trace completeness. If logs omit state updates, hide tool outputs, or fail to capture retrieval context, then the explanation packet is built on partial evidence. A beautiful rubric over incomplete logs is still incomplete. It may simply fail with better formatting.

Fourth, the results are mostly correlational. A state-consistency violation being associated with failure does not prove a clean causal pathway in every run. The next step for this line of work would be stronger counterfactual testing: replay the trajectory with corrected state, altered tool choices, or controlled observations and test whether outcomes change.
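The counterfactual test suggested above can be sketched with a toy deterministic environment: replay the same actions from the drifted state, then replay them again with the state corrected at one step, and compare outcomes. Everything here is an assumption made for illustration, including the fares, the action names, and the transition rule; real agents are rarely this deterministic, which is part of why such replays are hard.

```python
FLIGHTS = {"F101": 520, "F102": 380}   # toy fares, hypothetical


def step(state, action):
    """Toy deterministic transition: return the next state."""
    state = dict(state)
    if action == "pick_flight":
        # The agent picks the first flight that fits its current budget belief.
        budget = state.get("max_price", float("inf"))
        state["flight"] = next(f for f, fare in FLIGHTS.items() if fare <= budget)
    elif action == "book":
        state["booked"] = state["flight"]
    return state


def replay(state, actions, patch_at=None, patch=None):
    """Re-run `actions` from `state`, optionally patching the state before one step."""
    for i, action in enumerate(actions):
        if i == patch_at:
            state = {**state, **patch}
        state = step(state, action)
    return state


def succeeded(state, true_budget=400):
    """Success means the booked fare respects the user's real budget."""
    return FLIGHTS.get(state.get("booked"), float("inf")) <= true_budget


drifted_start = {"passenger": "A. Lee"}   # max_price was lost earlier in the run
actions = ["pick_flight", "book"]

factual = replay(drifted_start, actions)
counterfactual = replay(drifted_start, actions, patch_at=0, patch={"max_price": 400})
```

In this toy, the factual replay books F101 and fails, while the patched replay books F102 and succeeds. An outcome that flips when only the state is corrected is stronger evidence that the state violation mattered than a correlation across runs.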

These limitations do not weaken the core mechanism. They define where the mechanism should be applied carefully.

The new explainability contract is forensic, not decorative

The old explainability contract was built around a prediction: show which features mattered, provide a chart, and make the decision feel inspectable.

Agentic AI needs a different contract. The system must preserve the run, ground the explanation in evidence, and verify that the account reflects actual behavior. It must answer not only “what mattered?” but also “what happened, where did it diverge, and what evidence proves that?”

That is a more demanding standard. It is also the standard required if businesses want agents to do more than perform well in demos and fail mysteriously in production.

Static XAI is not dead. It is scoped. Feature attribution remains useful when the object is a prediction. For agents, the object is behavior across time. Once the object changes, the explanation must change with it.

The lesson is simple enough to be annoying: if your AI system acts like a workflow, explain it like a workflow.

Not like a bar chart.

Cognaptus: Automate the Present, Incubate the Future.


  1. Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Ahmed Y. Radwan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, and Shaina Raza, “From Features to Actions: Explainability in Traditional and Agentic AI Systems,” arXiv:2602.06841, https://arxiv.org/abs/2602.06841