Blame Isn’t a Bug: Turning Agent ‘Whodunits’ into Fixable Systems

TL;DR for operators

A bad agent incident rarely starts with one dramatic mistake. It usually forms as a chain.

The system may be predisposed to fail because of training data, feedback, system prompts, or scaffolding. The environment may then trigger the failure through unclear tasks, insecure information, unavailable tools, excessive permissions, or malicious inputs. Finally, the agent may commit a visible cognitive error: it overlooks something, misunderstands a command, chooses the wrong goal, or executes an action badly.

That is the useful lens in Incident Analysis for AI Agents, a paper by Carson Ezell, Xavier Roberts-Gaal, and Alan Chan.¹ The paper is not offering a new benchmark, a detection model, or a magic safety wrapper. It is doing something less glamorous and more useful: specifying what investigators would need to know after an AI-agent incident if they actually wanted to understand it.

The business implication is simple and slightly inconvenient. If your agent can browse, code, email, buy, search, retrieve records, call APIs, or touch customer data, your incident plan cannot begin after something goes wrong. You need logs, model and system versioning, tool-action records, runtime settings, scaffolding traces, permission histories, and change logs already in place. Otherwise the postmortem becomes theatre: everyone points at “the model,” “the prompt injection,” or “the vendor,” and nobody can prove which part of the chain broke.

The paper’s EchoLeak discussion makes the point concrete. Public reports say Microsoft 365 Copilot was vulnerable to a zero-click indirect prompt-injection attack that could exfiltrate confidential data. But public information cannot fully show whether the failure came from scaffolding, prompt-injection defences, tool access, model prioritisation, hidden instructions, or some combination. That is exactly the problem. The interesting part of an agent incident is often inside the logs nobody publishes.

This matters for operators because “agent safety” is not just model selection. It is evidence architecture.

The incident starts before the agent “makes a mistake”

The tempting story is that agent incidents are model incidents. The model hallucinated. The model obeyed a malicious instruction. The model misunderstood the user. The model went rogue, or at least wandered off with the corporate credit card and a suspicious amount of confidence.

That story is emotionally satisfying because it has a culprit. It is also usually too shallow.

The paper borrows from systems-safety thinking, especially the idea that accidents emerge from chains of causes rather than single bad moments. In aviation, healthcare, industrial operations, and other safety-critical domains, incident analysis does not stop at “the pilot made an error” or “the machine behaved incorrectly.” It asks what organisational processes, environmental conditions, supervision gaps, interface failures, and cognitive breakdowns made that error likely.

For AI agents, the authors propose a similar causal chain:

System factors → Contextual factors → Cognitive errors → Incident or hazard

That sequence is the paper’s main mechanism. It is not a statistical finding; it is a diagnostic structure. Its purpose is to stop investigators from flattening every failure into one label.

A prompt injection, for example, is not only “bad external text.” It may also involve weak input sanitisation, excessive tool access, unclear instruction hierarchy, unsafe scaffolding, missing monitoring, and a model that treats untrusted content as authoritative. The injected text is the trigger. It is not necessarily the full cause.

This is where the paper becomes useful for businesses. Most enterprise AI governance still treats agent risk as a pre-deployment checklist: model approved, vendor reviewed, privacy policy signed, hallucination warning added, off we go. The framework says that is not enough. Once an agent is operating in the world, risk is distributed across the model, the wrapper, the tools, the task, the data environment, and the runtime trace.

The failure is a system property. Naturally, it will be blamed on the nearest chatbot window.

System factors are the conditions that make whole classes of failures more likely

System factors are the stable design and development choices that travel with the agent across contexts. The paper highlights four broad kinds: training and feedback data, learning methods, system prompts, and scaffolding.

This is the slow-burn part of the causal chain. These factors may not explain why one particular incident happened at 3:17 p.m. on a Tuesday, but they can explain why a system was vulnerable to that class of incident in the first place.

The paper’s examples are telling. Training data can contain sensitive information, misleading information, poisoned data, or misuse-enabling material. Feedback can reward the wrong proxy, such as agreeableness over correctness. Learning methods can introduce side effects, including performance degradation, memorisation problems, or altered behavioural trade-offs. System prompts can be underspecified or accidentally steer behaviour in undesirable directions. Scaffolding can fail to filter malicious inputs, format tool calls correctly, expose relevant logs, or mediate tool use safely.

For operators, the key point is that system factors are often invisible from the final output. A chatbot answer does not tell you whether the relevant issue was RLHF feedback, a system-prompt update, a model-version change, a parser bug, a guardrail miss, or a tool-routing failure. You need system documentation and change history to distinguish those hypotheses.

That distinction matters because different causes imply different fixes.

Incident symptom	Weak diagnosis	Better causal question	Likely operational fix
Agent leaked sensitive information	“The model followed a prompt injection”	Did scaffolding fail to classify untrusted input, did the model mis-rank instruction priority, or did the tool expose too much data?	Input isolation, permission redesign, output filtering, retrieval boundaries, instruction hierarchy
Agent gave harmful advice	“The model hallucinated”	Was the task underspecified, was context missing, was training feedback misaligned, or did the system prompt encourage overconfident completion?	Task templates, uncertainty handling, escalation rules, prompt revisions, domain constraints
Agent took the wrong action through an API	“The agent made a bad tool call”	Did it observe the tool state, understand the API contract, choose the right goal, and monitor execution errors?	Tool documentation, dry-run mode, action confirmation, tool-state logging, rollback design
Agent changed behaviour after an update	“The new model is worse”	Which model, prompt, scaffolding, retrieval, or runtime setting changed?	Version pinning, change logs, staged rollout, regression tests

The paper does not claim these categories are exhaustive. That restraint is important. Agent systems are still mutating quickly, which is a polite way of saying the industry keeps inventing fresh ways for wrappers, tools, and models to disappoint one another. The value of the taxonomy is not finality. It is disciplined hypothesis generation.

Contextual factors are the trigger zone

If system factors create predisposition, contextual factors create the situation in which the predisposition becomes harmful.

The paper groups contextual factors into task definition, tools, and information. This is where agent risk starts looking very different from ordinary chatbot risk.

A chatbot mostly produces text. An agent interprets tasks, uses tools, reads external information, and may act through browsers, APIs, code environments, databases, payment systems, calendars, email accounts, or enterprise applications. Every extra interface expands the incident surface.

The task itself can be vague, resource-constrained, or internally conflicting. “Summarise this concisely” is a softer instruction than “produce three bullet points under 120 words.” “Help this customer” becomes riskier when the customer asks for something that conflicts with policy. “Optimise the workflow” becomes interesting, in the ominous sense, when the agent has broad permissions and no cost boundary.

Tools add another layer. The paper lists several tool-related failure modes: tools may be unavailable, rate-limited, outdated, insecure, excessive for the task, hard to monitor, or vulnerable through their descriptions and interfaces. A coding agent that cannot see full terminal logs may fail because it never observes the error. A browser agent with too much access may become a prompt-injection courier with executive privileges. Charming.

Information is the third contextual factor. External information may be inaccessible, low-quality, misleading, outdated, sensitive, or actively hostile. Prompt injection lives here, but the broader point is not just “malicious text is bad.” It is that agents treat information as operational input. Once external content enters the context window, it can influence goals, plans, and tool calls.

This is why the framework resists the casual phrase “the prompt caused it.” In agent systems, “the prompt” can include a user instruction, a system instruction, retrieved documents, webpage content, email text, tool output, memory, API responses, and scaffolding-inserted context. The incident may depend on how those sources were combined and ranked.

For a business, the practical lesson is that context needs governance. Not just data governance in the warehouse sense. Runtime context governance: what the agent can see, which sources are trusted, how instructions are separated, which tools are exposed, how credentials are scoped, and what happens when context contains conflicting commands.

That is not a philosophical issue. It is an architecture diagram with liability attached.

Cognitive errors are useful labels, not claims about robot psychology

The paper’s third category is cognitive errors. This could sound anthropomorphic, so the authors are careful: they are not claiming agents have human-like mental states. “Cognitive error” is an analytic label for observable breakdowns in agent function.

The paper divides these errors into four stages:

Observation: the agent fails to detect, attend to, or retain relevant input.
Understanding: the agent misinterprets the significance of information.
Decision-making: the agent selects or prioritises the wrong goal or action.
Action execution: the agent fails to carry out the chosen action properly.

This is a deceptively useful breakdown. It gives incident investigators a way to describe the proximate failure without pretending that the proximate failure is the root cause.

Consider a coding agent asked to integrate a payment system. If it fails to notice an existing endpoint, that is an observation failure. If it misunderstands recurring subscriptions, that is an understanding failure. If it chooses a custom card form instead of safer existing payment components, that is a decision-making failure. If it writes the API call incorrectly or ignores an error message, that is an action-execution failure.

Those labels help because they point to different evidence and different repairs. Observation failures suggest logging, interface visibility, context-window design, retrieval quality, or tool-output presentation. Understanding failures suggest evaluation data, instruction clarity, domain grounding, or semantic robustness. Decision failures suggest goal hierarchy, policy constraints, planning methods, and approval gates. Execution failures suggest tool schemas, error monitoring, sandboxing, and rollback controls.

The paper also notes an important boundary: reasoning traces can help investigators, but they may not be fully faithful to the model’s actual internal process. That matters for businesses now experimenting with “agent thoughts” as audit trails. A reasoning trace can be evidence. It should not be treated as a sworn confession.

The best use of cognitive-error labels is therefore modest: they organise the observable failure so that investigators can connect it back to system and contextual causes. They are the front door into the incident, not the whole house.

The paper’s “evidence” is a map of what must be retained

Because this is a framework paper, there are no experiments, ablations, benchmark deltas, or robustness tests to interpret. The core evidence is structural: the authors build a taxonomy of causes and then map those causes to the information investigators would need.

That mapping is the operational heart of the paper.

The authors identify three information categories: activity logs, system documentation and access, and tool information. Table 1 in the paper functions as an evidence-requirements matrix. Its purpose is implementation detail, not empirical proof. It tells operators what data helps test which causal hypotheses.

Information category	What it includes	What it helps diagnose	Business translation
Activity logs	System prompts, user prompts, external information, scaffolding logs, model reasoning traces, model actions, metadata	Contextual triggers, cognitive errors, some system effects	Keep enough runtime trace to reconstruct what the agent saw, decided, and did
System documentation and access	Model/system docs, change logs, system artifacts, runtime configuration, model or system access	System factors and reconstruction	Maintain version control for prompts, models, scaffolding, tools, settings, and patches
Tool information	Tool identity, version, enabled actions, access requirements, state, errors, personalization, usage instructions	Contextual factors involving tool use	Treat tools as part of the incident surface, not neutral plumbing

This is where many enterprise AI programmes are underbuilt. They focus on pre-release approval and user-facing disclaimers, then discover after an incident that they lack the evidence to answer basic questions:

Which model version was running?
Which system prompt was active?
What retrieved documents entered the context?
What tool calls were attempted?
What credentials or permissions were available?
What did the tool return?
What did the scaffolding filter, transform, or suppress?
What changed in the previous release?
Was the incident reproducible under the same runtime settings?

Without that evidence, “root-cause analysis” becomes a corporate ritual. The report will have a timeline, a severity rating, a mitigation plan, and a reassuring paragraph about continuous improvement. It may still not know what happened.

EchoLeak shows why public incident reports are too thin

The paper’s EchoLeak section is a case study. Its purpose is not to prove the framework statistically. It demonstrates the gap between public incident descriptions and causal diagnosis.

EchoLeak, identified as CVE-2025-32711, was a Microsoft 365 Copilot vulnerability that reportedly allowed attackers to exfiltrate confidential data without user interaction. The attack involved malicious instructions hidden in an email, leading the underlying language model to reveal private information through an indirect prompt-injection path.

Public information supports a rough diagnosis. The system’s scaffolding may not have detected or filtered the malicious input. It may not have blocked harmful outputs. The contextual trigger was a malicious email in an inbox Copilot could access. The cognitive error may have involved failing to distinguish hidden attacker commands from legitimate instructions, or recognising them but still complying.

But that is the point: “may have.”

To know more, investigators would need the actual email content, intermediate prompts, agent outputs, reasoning traces if available, system documentation, prompt-injection defence details, and change logs around the patch. Public advisories usually do not include that material, for understandable reasons. It may contain user data, proprietary defences, sensitive system internals, or attack details that could help copycats.

The result is a familiar accountability problem. The public sees the incident category; the vendor sees the operational trace; regulators may or may not see enough to validate the fix; customers receive reassurance with limited inspectability.

From a business perspective, EchoLeak is not merely a Microsoft story. It is a preview of what happens when agents sit inside email, documents, calendars, CRM systems, codebases, and financial workflows. The dangerous input does not need to arrive as a user prompt. It can arrive as “content.”

Once the agent can read that content and act on nearby data, the boundary between document, instruction, and attack surface becomes annoyingly porous.

Existing incident databases classify harms; they do not reconstruct causes

The paper also compares institutional arrangements for incident reporting. Table 2 is a comparison with prior practice, not a secondary thesis. Its purpose is to show why current public databases and voluntary reporting systems are useful but insufficient for agent root-cause analysis.

Public incident databases can aggregate across organisations and help identify broad patterns. They are valuable for awareness, trend analysis, and high-level classification. But they mostly rely on public reports, news articles, or voluntary submissions. They generally do not contain full activity logs, system internals, tool states, runtime settings, or proprietary change histories.

Internal developer reports can be much richer. Developers can often link user feedback to conversation IDs, metadata, logs, model versions, and system documentation. But internal analysis is siloed. External stakeholders cannot easily validate it, compare it across developers, or use it for public accountability.

Regulation-based reporting may bridge part of the gap. The paper points to emerging requirements such as serious-incident reporting under the EU AI Act and draft code language about reporting chains of events and root-cause analysis. The authors’ framework can inform what those requirements should actually ask for.

This matters because “report the incident” is too vague. A regulator asking for an incident report without specifying evidence requirements may receive a polished narrative instead of a causal analysis. The paper effectively says: if you want root-cause analysis, ask for the artefacts that make root-cause analysis possible.

That is a governance lesson businesses can adopt before regulators force the issue.

What Cognaptus infers for enterprise agent governance

The paper directly gives us a taxonomy and an information-needs framework. It does not give a complete enterprise implementation standard. That next step is an inference.

Here is the practical translation.

First, agent deployments need an incident evidence model. For each agent, define what must be logged by default, what is retained only under elevated-risk conditions, what is redacted for reports, and what can be disclosed to investigators under controlled access.

Second, tool governance should become part of AI governance. Tool identity, version, permissions, enabled actions, credentials, state, error logs, and personalization data all matter. If the agent can act through a tool, the tool is not peripheral. It is part of the agent.

Third, model governance must include scaffolding governance. Many failures will not come from model weights alone. They may come from prompt templates, parsers, filters, retrieval logic, routing code, tool schemas, memory systems, output validators, or orchestration frameworks. The wrapper is not “just integration.” The wrapper is behaviour.

Fourth, versioning should be treated as forensic infrastructure. A change log that says “updated prompt” or “improved guardrails” is not enough. Investigators need to know what changed, when, why, and which incidents map to which versions. Otherwise every patch becomes a fog machine.

Fifth, retention policy should be risk-based. The paper notes the tension between complete logs and privacy or storage constraints. A consumer chatbot may justify shorter retention. An agent making financial decisions, handling sensitive documents, executing code, or using payment tools may justify longer or conditional retention. Zero-retention contracts may be commercially attractive, but they reduce investigability. That trade-off should be explicit, not discovered during litigation.

A simple enterprise checklist follows from the framework:

Governance area	Minimum useful question	Failure mode if ignored
Logging	Can we reconstruct what the agent saw, planned, and did?	Incident analysis becomes guesswork
Versioning	Can we identify the exact model, prompt, scaffolding, tool, and runtime configuration?	Patches cannot be linked to causes
Tool access	Do we know which actions and data sources were available?	Excessive permissions become invisible
Context control	Do we separate trusted instructions from untrusted content?	Retrieved content becomes command surface
Retention	Do we keep enough evidence for severe or high-risk incidents?	Privacy-friendly design becomes investigation-hostile design
Disclosure	Can sensitive evidence be shared securely with auditors or regulators?	Accountability depends on vendor self-description

None of this is glamorous. It is also the difference between running agents as products and running them as expensive improvisation engines.

The real ROI is cheaper diagnosis, not perfect prevention

The instinctive business question is: will this framework prevent incidents?

Not directly.

Its more realistic value is lowering the cost and ambiguity of diagnosis. In mature operational environments, incident response is not only about preventing every failure. It is about detecting failures early, reconstructing them accurately, learning from them, and proving that fixes address the actual cause.

That has several forms of ROI.

One is engineering ROI. If a failure is traced to a tool-state visibility problem, the fix is different from a model fine-tune or prompt rewrite. Good diagnosis prevents teams from shipping theatrical mitigations: more disclaimers, stricter refusals, or a new system prompt with the emotional range of a compliance seminar.

Another is vendor-management ROI. Customers using third-party agent platforms need evidence to assess whether a vendor’s patch is adequate. Without structured incident data, buyers are left with trust statements. Trust statements are lovely. They are also not logs.

A third is regulatory ROI. As serious-incident reporting matures, companies that already maintain activity logs, change histories, and tool records will be better positioned to respond. Companies that do not will have to reconstruct governance under pressure, which is the corporate equivalent of assembling a parachute after admiring the clouds.

A fourth is product ROI. Incident analysis can reveal recurring design weaknesses: agents failing when information is behind paywalls, over-trusting tool outputs, ignoring execution errors, or prioritising user prompts over system constraints. Those patterns can guide product hardening.

The paper’s framework is therefore less like a shield and more like a black box recorder. The black box does not stop the crash. It makes denial less productive afterward.

Boundaries: the framework is necessary, not sufficient

The paper is disciplined about its limits, and those limits matter for implementation.

First, the taxonomy is not exhaustive. Domain-specific agents will have domain-specific causes. A medical triage agent, trading assistant, legal research agent, software engineering agent, and procurement agent will not fail in identical ways. The framework gives general categories; industry-specific reporting standards still need to fill in the details.

Second, the cognitive-error categories are high-level. They are useful for analysing observable behaviour, but the science of agent cognition is still immature. Future research may refine how agents attend, plan, represent tasks, use memory, or select actions. Treat the categories as working diagnostic labels, not metaphysics.

Third, implementation requires infrastructure that often does not exist. Secure incident repositories, confidential reporting channels, controlled investigator access, partial public disclosure protocols, and technical capacity inside oversight bodies are still underdeveloped. A beautiful reporting framework without institutions becomes an academic filing cabinet.

Fourth, privacy and legal constraints are real. Logs may contain user prompts, personal data, business secrets, credentials, confidential documents, and proprietary system details. Storing and sharing that material creates risk. The answer is not “log everything forever.” The answer is risk-tiered retention, minimisation, redaction, access control, and clear legal process.

Fifth, zero-retention and short-retention policies create a hard trade-off. They can protect user privacy and satisfy enterprise procurement requirements. They also make post-incident reconstruction much harder. A company cannot both refuse to retain traces and promise deep causal analysis later. Reality remains stubbornly unbundled.

Finally, the framework is conceptual. It has not been validated through a large corpus of agent incidents showing that these categories improve investigation quality. That does not make it weak. It means the next step is operational testing: apply it to real incidents, refine the taxonomy, and measure whether it improves diagnosis, remediation, and accountability.

The practical lesson: design the postmortem before the incident

The old software cliché says “move fast and break things.” Agentic AI adds a new clause: “then try to remember which thing broke, through which tool, under which prompt, after which update, using which credentials, while reading which poisoned email.”

That sentence is unpleasant because it is the job.

The paper’s value is that it gives operators a more precise way to think about agent incidents. Do not ask only what the model said. Ask what system design made the behaviour likely, what context triggered it, what cognitive function failed, and what evidence would distinguish one causal story from another.

For companies, the message is blunt. If your AI agent has meaningful autonomy, your logging, documentation, tool governance, and retention policy are part of the product. They are not back-office compliance accessories. They are what make the difference between a fixable system and a whodunit with a dashboard.

Blame is cheap. Causal evidence is expensive. The paper’s contribution is showing where to start paying.

Cognaptus: Automate the Present, Incubate the Future.

Carson Ezell, Xavier Roberts-Gaal, and Alan Chan, “Incident Analysis for AI Agents,” arXiv:2508.14231, 2025. https://arxiv.org/abs/2508.14231 ↩︎

TL;DR for operators#

The incident starts before the agent “makes a mistake”#

System factors are the conditions that make whole classes of failures more likely#

Contextual factors are the trigger zone#

Cognitive errors are useful labels, not claims about robot psychology#

The paper’s “evidence” is a map of what must be retained#

EchoLeak shows why public incident reports are too thin#

Existing incident databases classify harms; they do not reconstruct causes#

What Cognaptus infers for enterprise agent governance#

The real ROI is cheaper diagnosis, not perfect prevention#

Boundaries: the framework is necessary, not sufficient#

The practical lesson: design the postmortem before the incident#