A cloud incident does not arrive politely.

It does not say, “Hello, I am a memory leak in service X, beginning at 14:03, propagating through service Y, and pretending to be a latency spike somewhere else.” That would be useful. Naturally, production systems prefer theatre.

So when companies imagine AI agents taking over cloud Root Cause Analysis (RCA), the promise sounds almost unfairly attractive. Give the agent logs, metrics, traces, a Python executor, and a large enough model. Let it inspect the evidence, reason through the causal chain, and return the faulty component, incident time, and failure reason before the human on-call engineer has finished the second coffee.

The paper behind today’s article asks why this promise keeps breaking.[1] Its answer is more useful than the usual “LLMs hallucinate” shrug. The problem is not simply that the model is weak. The problem is that the diagnostic system loses truth at three interfaces: inside the reasoning agent, between collaborating agents, and inside the execution environment.

That distinction matters. If the bottleneck is model intelligence, the obvious response is to buy a stronger model. If the bottleneck is system architecture, the cure is less glamorous: expose intermediate artifacts, verify interpretations against raw telemetry, constrain execution state, and stop pretending that natural-language summaries are a safe interface for operational diagnosis.

AIOps, meet plumbing. It was always going to happen.

The paper studies the diagnostic process, not just the final answer

The paper uses OpenRCA, a benchmark for cloud RCA agents. OpenRCA contains 335 failure incidents across three service domains: Telecom, Bank, and Market. Each incident has a ground-truth root cause: the faulty component, the incident time, and the failure reason. The telemetry is not a toy dataset: it spans metrics, logs, and distributed traces, totaling 68.5 GB and more than 523 million lines.

The baseline OpenRCA agent uses a Controller–Executor design:

| Role | What it does | Where failure can enter |
|---|---|---|
| Controller | Reasons in natural language, decides what to investigate | Misreads results, narrows search too early, mistakes symptoms for causes |
| Executor | Converts instructions into Python, runs code on telemetry | Writes mismatched code, returns partial or misleading output |
| Runtime environment | Stores data, executes code, preserves kernel state | Accumulates memory, exhausts step budget, crashes |

This setup is sensible. One agent reasons; another executes. It is also exactly where the trouble starts.

The paper runs the full benchmark across five models: Gemini 2.5 Pro, GPT-5 mini, GPT-OSS 120B, Solar Pro 2, and Claude Sonnet 4. That produces 1,675 agent runs. The researchers then classify failures into 12 pitfall types across three architectural categories: intra-agent reasoning, inter-agent communication, and agent-environment interaction.

That process-level focus is the paper’s real contribution. Many agent benchmarks tell us whether the final answer was correct. This paper asks where the diagnostic chain broke. In business terms, it moves the question from “Did the agent solve the ticket?” to “Which part of the operating model corrupted the investigation?”

That is the more expensive question, which usually means it is also the more useful one.

Stronger models still fail because the shared framework keeps losing evidence

The baseline accuracy is low enough to be impolite.

The best model, Gemini 2.5 Pro, reaches 12.5% perfect detection across all 335 incidents. GPT-5 mini reaches 8.4%. GPT-OSS 120B reaches 6.9%. Solar Pro 2 reaches 5.7%. Claude Sonnet 4 reaches 3.9%.

A perfect detection means all three required elements are correct: component, time, and reason. Partial detection is higher, but partial RCA is only partly comforting. Knowing that “something near payments looked bad sometime yesterday” is not exactly the triumph of autonomous operations.
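To make the scoring rule concrete, here is a minimal sketch of an all-three-or-nothing check. The `RootCause` class, the one-minute time tolerance, and the definition of "partial" as "at least one element correct" are assumptions for illustration, not the benchmark's actual scoring code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RootCause:
    # Hypothetical container for the three required elements.
    component: str
    time: datetime
    reason: str


def score(predicted: RootCause, truth: RootCause,
          tolerance: timedelta = timedelta(minutes=1)) -> dict:
    """Compare a prediction to ground truth, element by element."""
    hits = {
        "component": predicted.component == truth.component,
        # Assumed tolerance: times within one minute count as a match.
        "time": abs(predicted.time - truth.time) <= tolerance,
        "reason": predicted.reason == truth.reason,
    }
    # "perfect" needs all three; "partial" (as assumed here) needs at least one.
    return {**hits, "perfect": all(hits.values()), "partial": any(hits.values())}


truth = RootCause("payments-db", datetime(2024, 1, 1, 14, 3), "memory leak")
guess = RootCause("payments-db", datetime(2024, 1, 1, 14, 3), "high CPU")
result = score(guess, truth)
```

In this toy case the agent names the right component at the right time but the wrong reason, so `result["partial"]` is true and `result["perfect"]` is false. That is the gap the partial numbers hide.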

The important point is not merely that the numbers are low. It is that the dominant failure modes persist across models with different providers, capabilities, and cost profiles.

| Dominant pitfall | Overall rate across 1,675 runs | What it means |
|---|---|---|
| Hallucination in Interpretation | 71.2% | The Controller imposes a plausible story that does not match the returned data |
| Incomplete Exploration | 63.9% | The agent skips relevant components or KPI families |
| Symptom-as-Cause | 39.9% | The agent stops at the first anomaly instead of tracing upstream |
| Code Generation Error | 27.2% | The generated code fails or behaves incorrectly |
| Limited Telemetry Coverage | 26.9% | The agent relies too heavily on one source, often metrics |
| Timestamp Error | 23.3% | The analysis uses the wrong time window, often due to timezone issues |
| No Cross-Validation | 18.6% | The agent accepts a single evidence source without checking alternatives |
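The Timestamp Error row is easy to reproduce by hand. The sketch below assumes, purely for illustration, that the incident time is given as a local wall-clock string at UTC+8 and that the telemetry is indexed by Unix epoch seconds; the offset and the `window_epoch` helper are hypothetical, not taken from the paper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: incident times are reported in local time at UTC+8,
# while telemetry rows carry Unix epoch seconds.
LOCAL_TZ = timezone(timedelta(hours=8))


def window_epoch(local_start: str, minutes: int, tz: timezone = LOCAL_TZ):
    """Turn a wall-clock start time into an (start, end) epoch-second window."""
    start = datetime.strptime(local_start, "%Y-%m-%d %H:%M").replace(tzinfo=tz)
    end = start + timedelta(minutes=minutes)
    return int(start.timestamp()), int(end.timestamp())


# Correct: interpret the string in the timezone it was written in.
correct = window_epoch("2024-01-01 14:00", 30)
# The bug: read the same string as if it were UTC.
naive = window_epoch("2024-01-01 14:00", 30, tz=timezone.utc)
shift_hours = (naive[0] - correct[0]) / 3600
```

Under these assumptions the naive window lands eight hours away from the incident. Every query that follows is well-formed and returns data, which is what makes this pitfall quiet.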

Two numbers deserve special attention: 71.2% and 63.9%.

Hallucination in Interpretation appears in more than seven out of ten runs. Incomplete Exploration appears in almost two-thirds. These are not rare edge cases. They are the normal behavior of the system under pressure.

The paper’s key interpretive move is to separate model-specific weaknesses from framework-level bottlenecks. Code generation errors vary sharply by model: Gemini 2.5 Pro has only 1.8%, while Claude Sonnet 4 has 65.5%. That looks like a model capability issue. But Hallucination in Interpretation remains above 66% for every tested model, and Incomplete Exploration remains above 53% for every tested model.

That pattern says something uncomfortable. Some failure modes can be reduced by choosing a better model. Others survive model choice because the shared agent framework keeps creating the same conditions for failure.

For businesses, this is the difference between procurement and engineering. Procurement asks which model to buy. Engineering asks why the same diagnostic blind spots appear after the model has changed.

The first interface breaks inside the Controller: the agent sees data, then tells a story

The first failure mechanism is intra-agent reasoning.

The Controller receives outputs from the Executor and turns them into diagnostic conclusions. This is where the agent should behave like a disciplined engineer: inspect the returned values, compare hypotheses, test causal direction, and avoid overclaiming.

Instead, the dominant pattern is narrative fabrication. The Controller sees data and then gives the data a meaning it does not support.

This is more dangerous than a random hallucination. In a cloud RCA setting, a bad interpretation often sounds operationally plausible. CPU pressure could cause latency. A memory spike could point to a leak. A downstream timeout could suggest a faulty service. The words fit. The numbers may not.

That is why the paper’s term “Hallucination in Interpretation” is useful. The agent is not merely inventing entities from nowhere. It is assigning the wrong causal meaning to retrieved evidence.

Incomplete Exploration compounds the problem. The agent may never inspect the relevant KPI family or component. Even when the prompt provides candidate components and KPI categories, agents routinely skip whole categories. The paper gives the example of agents focusing on CPU metrics while failing to query network KPIs. Limited Telemetry Coverage repeats the same pattern at the modality level: agents often rely on metrics while underusing logs and traces.

The result is a diagnostic funnel with two defects:

  1. The agent may not look broadly enough.
  2. When it does find evidence, it may interpret that evidence incorrectly.

The business implication is blunt. A cloud RCA agent cannot be evaluated only by the elegance of its final explanation. The system needs coverage tracking and evidence verification. It should know which components, KPI families, and telemetry modalities have been examined. It should also distinguish raw observation from interpretation.

A practical design rule follows:

| Diagnostic layer | Unsafe behavior | Better system constraint |
|---|---|---|
| Exploration | “I checked enough.” | Require explicit coverage across relevant components, KPIs, and telemetry modalities |
| Interpretation | “This spike means X.” | Force claims to reference raw values, time windows, and comparison baselines |
| Causality | “The first anomaly is the cause.” | Require upstream/downstream causal tracing before final attribution |
| Validation | “One source supports it.” | Require cross-checks across metrics, logs, or traces when available |
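The Exploration row can be enforced with very little machinery. The following is a minimal sketch of a coverage tracker; the class name, the service names, and the idea of treating KPI families and modalities as one flat list of "sources" are assumptions made for brevity, not the paper's design.

```python
from itertools import product


class CoverageTracker:
    """Records which (component, source) pairs the agent has examined.

    A "source" here is a KPI family ("cpu", "network") or a telemetry
    modality ("logs", "traces"); that flattening is a simplification.
    """

    def __init__(self, components, sources):
        self.required = set(product(components, sources))
        self.seen = set()

    def record(self, component, source):
        cell = (component, source)
        if cell not in self.required:
            raise ValueError(f"unknown coverage cell: {cell}")
        self.seen.add(cell)

    def missing(self):
        """Pairs that were required but never examined, in sorted order."""
        return sorted(self.required - self.seen)

    def ratio(self):
        return len(self.seen) / len(self.required)


tracker = CoverageTracker(["svc-a", "svc-b"], ["cpu", "network", "logs"])
tracker.record("svc-a", "cpu")
tracker.record("svc-b", "cpu")
gaps = tracker.missing()
```

Here the agent has looked only at CPU, so `gaps` lists the network and log checks it skipped for both services. A final answer could be blocked, or at least flagged, until that list is empty or explicitly waived.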

Notice what this table does not say: “Write a better prompt telling the agent to be careful.” The paper tests that instinct. It does not go well.

Prompting broadens the search, but does not fix the interpretation

The most natural response to intra-agent failure is prompt engineering. Tell the agent to form hypotheses. Tell it to avoid known pitfalls. Tell it to check more KPIs. Tell it not to hallucinate. Perhaps also tell it to breathe deeply and think step by step, because apparently software reliability now has wellness rituals.

The paper tests two prompt-level interventions on Claude Sonnet 4 across 70 Bank-domain tasks:

| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Hypothesis-driven prompting | Main mitigation test for incomplete exploration | Structured prompts can broaden the agent’s investigation scope | Broader search alone improves root-cause accuracy |
| Pitfall-aware prompting | Main mitigation test for known failure patterns | Agents can acknowledge pitfall descriptions in their reasoning trace | Awareness prevents interpretive hallucination |
| Comparison against baseline | Controlled intervention check | Prompt changes affect some behaviors but not the dominant failure | Prompting is sufficient for reliable RCA |

The result is asymmetrical.

Hypothesis-driven prompting does broaden exploration. Previously ignored KPI categories, such as memory utilization, appear as explicit hypotheses. That is a real improvement. The agent looks in more places.

But Hallucination in Interpretation persists at comparable rates. The agent reaches relevant data and still imposes incorrect interpretations onto it. Pitfall-aware prompting shows a similar problem: the agent can mention the pitfall in its reasoning trace and then reproduce the pitfall in practice.

This is a familiar agent failure pattern. The model can recite the rule. The system does not enforce the rule.

For enterprise deployment, this is where many prototypes quietly become expensive. A team sees that a prompt improves apparent diligence. The trace becomes longer, the reasoning more structured, the checklist more professional. Yet the key failure remains: interpretation is not externally verified.

The paper’s evidence suggests a sharper design principle: prompt engineering can change attention allocation, but it cannot reliably turn generated interpretations into verified causal claims.

That does not make prompting useless. It makes prompting insufficient. A prompt can ask the agent to consider memory metrics. A verification module can check whether the memory time series actually supports the proposed causal story. Those are different control mechanisms.
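As a rough illustration of the second mechanism, the sketch below checks a "memory spiked during the incident" claim against the series itself. The function name, the three-sigma threshold, and the sample values are all assumptions; the paper points toward verification modules but does not specify one.

```python
from statistics import mean, pstdev


def supports_spike(series, window, k=3.0):
    """Return True if the mean inside `window` is more than k baseline
    standard deviations above the baseline mean.

    series: list of (timestamp, value); window: (start, end), inclusive.
    The baseline is everything before the window starts.
    """
    start, end = window
    inside = [v for t, v in series if start <= t <= end]
    baseline = [v for t, v in series if t < start]
    if not inside or len(baseline) < 2:
        return False  # too little data to support any claim
    threshold = mean(baseline) + k * pstdev(baseline)
    return mean(inside) > threshold


flat = list(enumerate([50, 51, 49, 50, 51, 49, 50, 51]))
spiky = list(enumerate([50, 51, 49, 50, 51, 49, 80, 85]))

claim_on_flat = supports_spike(flat, window=(6, 7))
claim_on_spiky = supports_spike(spiky, window=(6, 7))
```

The same sentence, "memory spiked in the incident window", is rejected for the flat series and accepted for the spiky one. The point is not this particular statistic, which is crude, but that the check reads numbers rather than prose.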

The second interface breaks between agents: natural language is a leaky handoff

The OpenRCA baseline uses a natural-language communication interface between Controller and Executor. The Controller sends instructions. The Executor writes and runs Python. The Executor returns a natural-language summary.

This sounds reasonable until we remember what is being delegated: operational diagnosis over large telemetry datasets. A summary is not the evidence. It is a compression of the evidence. It may omit the code path, the failed query, the exception, the exact aggregation, or the slice of time that was actually analyzed.

The paper identifies three inter-agent pitfalls:

| Pitfall | Mechanism | Operational consequence |
|---|---|---|
| Instruction-Code Mismatch | Executor code does not reflect Controller intent | The system analyzes the wrong thing while believing it followed the plan |
| Meaningless Repetition | Controller repeats failed directives | Step budget is wasted without diagnostic progress |
| Misattributed Evidence | Controller accepts Executor output without seeing how it was produced | Later reasoning is built on flawed evidence |

Instruction-Code Mismatch is the most important of these. For GPT-5 mini, Solar Pro 2, and Claude Sonnet 4, it affects roughly 20–26% of execution steps. GPT-5 mini and Claude Sonnet 4 show the highest rates, at 25.5% and 24.8% respectively.

This is not just a “communication style” issue. It is a loss of auditability.

If the Controller cannot see the generated code and raw output, it cannot know whether the Executor actually implemented the intended analysis. If the Executor does not receive enough diagnostic context, it may write code for a subtly different question. The agents remain fluent, but the evidence chain has already drifted.

This is one of the paper’s most business-relevant findings. Many agent systems are designed as polite bureaucracies: one agent asks, another agent summarizes, a third agent decides. Everyone writes in complete sentences. Nobody sees the receipts.

For operational work, that is fragile. RCA needs traceability. A diagnostic claim should be connected to:

  • the instruction that triggered the analysis;
  • the code or query used to retrieve evidence;
  • the raw output or exception returned;
  • the interpretation made from that output;
  • the next decision caused by that interpretation.
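The five links above map naturally onto a record type. This is a minimal sketch with hypothetical names (`EvidenceStep`, `DiagnosticTrail`); the only rule it encodes is that an interpretation without attached code and raw results should be easy to spot.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvidenceStep:
    instruction: str      # what the Controller asked for
    code: str             # what the Executor actually ran
    raw_output: str       # what came back, unsummarized
    interpretation: str   # what the Controller concluded
    next_decision: str    # what it did because of that conclusion
    exception: Optional[str] = None

    def is_auditable(self) -> bool:
        """Auditable only if code and some raw result (or error) are attached."""
        has_result = bool(self.raw_output.strip() or self.exception)
        return bool(self.code.strip()) and has_result


@dataclass
class DiagnosticTrail:
    steps: list = field(default_factory=list)

    def add(self, step: EvidenceStep) -> None:
        self.steps.append(step)

    def unauditable(self) -> list:
        """Indices of steps whose interpretation has no receipts behind it."""
        return [i for i, s in enumerate(self.steps) if not s.is_auditable()]


trail = DiagnosticTrail()
trail.add(EvidenceStep("check CPU on svc-a", "df['cpu'].max()", "97.2",
                       "CPU is saturated", "inspect svc-a callers"))
trail.add(EvidenceStep("check network on svc-a", "", "",
                       "network looks normal", "rule out network"))
flagged = trail.unauditable()
```

The second step carries a confident interpretation and no code or output, so it is the one that gets flagged. That is the summary-only handoff in miniature.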

Without that chain, the agent system can become a theatre of delegated confidence.

Exposing code and raw output improves both reliability and efficiency

The paper’s enriched communication experiment is the cleanest evidence that architecture matters.

The intervention is simple. The Executor returns generated Python code and complete execution output, including exceptions and stack traces, alongside the natural-language summary. The Executor also receives more diagnostic context from the Controller, including the Controller’s full analysis, a snippet of previous execution output, and the overall objective.

This is not a new model. It is not a larger context window. It is not a magical “agentic reasoning” upgrade. It is a better interface.
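What an artifact-rich return might look like, as a minimal sketch: the Executor hands back the code it ran, the captured output, and the full traceback on failure. The function name and the dictionary fields are assumptions for illustration, and the bare `exec` call has no sandboxing, so this is a shape to copy, not a runtime to deploy.

```python
import contextlib
import io
import traceback


def run_and_report(code: str, namespace: dict) -> dict:
    """Execute generated code and return the artifacts, not just a summary."""
    stdout = io.StringIO()
    error = None
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)  # illustration only: no sandboxing here
    except Exception:
        error = traceback.format_exc()  # full stack trace, not a paraphrase
    return {
        "code": code,
        "stdout": stdout.getvalue(),
        "error": error,
        "ok": error is None,
    }


good = run_and_report("print(sum([1, 2, 3]))", {})
bad = run_and_report("print(latency_ms.mean())", {})
```

In the failing case the Controller receives the `NameError` and the exact line that raised it, rather than a sentence saying the latency analysis "could not be completed". It can then fix the instruction instead of repeating it.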

The results are meaningful:

| Metric | Baseline | Enriched communication | Interpretation |
|---|---|---|---|
| Communication-related pitfall reduction | | Up to 14–15 percentage points for GPT-5 mini and Claude Sonnet 4 | Exposing artifacts reduces handoff loss |
| Average steps per run | 11.9 | 9.2 | Better shared context reduces wasted loops |
| Token consumption per step | 68K | 85K | Each step becomes more information-rich |
| Total tokens | Baseline | Down 1.6% | Fewer steps offset heavier steps |
| Execution time | Baseline | Down 22.3% | Less repetition improves efficiency |
| GPT-5 mini perfect detection, Bank subset | 0.0% | 4.9% | Accuracy improves from a very low base |
| Gemini 2.5 Pro perfect detection, Bank subset | 2.4% | 7.3% | Improvement appears across models |
| Solar Pro 2 perfect detection, Bank subset | 4.9% | 7.3% | Gains are not limited to one provider |

This is a useful result because it resists a lazy trade-off story. One might expect richer communication to cost more. It does cost more per step. But it reduces the number of steps enough to slightly lower total token usage and materially reduce execution time.
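A back-of-envelope check on the reported averages shows why the offset is plausible. The sketch multiplies average steps by average tokens per step, which only approximates average total tokens per run, so it should be read as a direction check and not as a reproduction of the paper's 1.6% figure.

```python
# Reported averages: steps per run and tokens per step.
baseline_steps, enriched_steps = 11.9, 9.2
baseline_tok_per_step, enriched_tok_per_step = 68_000, 85_000

# Product of averages: a rough stand-in for total tokens per run.
baseline_total = baseline_steps * baseline_tok_per_step
enriched_total = enriched_steps * enriched_tok_per_step

step_change = enriched_steps / baseline_steps - 1                    # fewer steps
per_step_change = enriched_tok_per_step / baseline_tok_per_step - 1  # heavier steps
total_change = enriched_total / baseline_total - 1                   # net effect
```

Steps fall by about 23% while tokens per step rise by 25%, and the product comes out roughly 3% lower. The approximation does not match the reported number exactly, but it agrees on the sign: heavier steps, cheaper runs.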

The mechanism is straightforward. When the Controller sees code, errors, and raw outputs, it can detect mismatches earlier. When the Executor receives richer context, it is less likely to produce code for the wrong analytical intent. When both sides share more state, the system repeats fewer failed actions.

For business teams building internal agents, the lesson is immediate: do not optimize for short messages between agents too early. Optimize for artifact-rich handoffs. In high-stakes workflows, compact summaries are not always efficient. Sometimes they are just cheap ways to lose the plot.

The third interface breaks in the runtime: state exists, but the agent cannot see it

The final category is agent-environment interaction.

The baseline system uses a persistent Python kernel to reduce redundant data loading. This is sensible because telemetry datasets are large. Keeping state in memory can reduce latency.

But the agent does not understand the accumulated kernel state. It can reload data already in memory or fail to release obsolete variables. The result is out-of-memory failure. In the Bank domain, the paper reports that 2 of 41 scenarios terminated when agents reloaded datasets or failed to manage memory.

This is not a reasoning failure in the usual sense. The agent might have a reasonable diagnostic plan. It may still crash because the execution environment has become hostile.

The paper also reports Max Step Exhaustion at 4.1% overall, with GPT-5 mini at 10.4% and Claude Sonnet 4 at 8.3%. Step exhaustion is a more ambiguous signal. It could reflect inefficient looping, but it could also reflect more thorough exploration. The paper correctly treats it as requiring joint analysis of exploration breadth and step consumption.

The mitigation for out-of-memory failure is practical: a memory watcher monitors kernel consumption, terminates execution when a threshold is exceeded, sends a structured warning to the Controller, and restarts the kernel so the Controller can generate a more memory-efficient implementation. In validation across all models and domains, the watcher eliminates the OOM failures observed in the baseline.
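A step-granular simplification of that idea can be sketched with the standard library. This version measures peak Python allocations for one step using `tracemalloc` and turns a breach into a structured warning. It does not interrupt a running step or restart a kernel, which the paper's watcher does, and the function name and warning fields are assumptions.

```python
import tracemalloc


def run_step_with_watch(step, limit_bytes):
    """Run one step and check its peak Python allocation afterwards.

    Returns (result, warning). If the peak exceeds `limit_bytes`, the result
    is discarded and a structured warning is returned for the Controller.
    """
    tracemalloc.start()
    try:
        result = step()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    if peak > limit_bytes:
        warning = {
            "type": "memory_limit_exceeded",
            "peak_bytes": peak,
            "limit_bytes": limit_bytes,
            "advice": "load less data per step or release unused variables",
        }
        return None, warning
    return result, None


LIMIT = 10_000_000  # hypothetical 10 MB budget per step
small, small_warning = run_step_with_watch(lambda: sum(range(1000)), LIMIT)
big, big_warning = run_step_with_watch(lambda: len(bytearray(20_000_000)), LIMIT)
```

The design choice worth copying is the return shape. A crash tells the Controller nothing; a dictionary with the peak, the limit, and a hint gives it something to plan around on the next attempt.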

Again, not glamorous. Again, useful.

This is the kind of result that often gets underappreciated in agent discussions. Reliability is not only about reasoning quality. It is also about runtime contracts. An agent needs visibility into state, resource limits, execution history, and failure modes. Otherwise, it is operating a machine whose dashboard has been covered with a motivational poster.

The business lesson is not “agents are bad”; it is “diagnosis needs control surfaces”

It would be easy to turn this paper into a pessimistic story: AI agents cannot do RCA; hallucination is everywhere; humans are safe; please return to manually reading logs at 3 a.m.

That is not the right conclusion.

The paper shows that agent-based RCA fails systematically, but it also shows which interventions help. Prompting improves exploration but not interpretation. Enriched communication reduces communication failures and improves efficiency. Runtime monitoring eliminates observed OOM crashes. These are not philosophical observations. They are design levers.

A business-oriented reading should separate three layers:

| What the paper directly shows | Cognaptus interpretation | Practical boundary |
|---|---|---|
| Perfect detection remains low across five models on OpenRCA | Model selection alone is not enough for reliable autonomous RCA | Results are specific to the OpenRCA-style benchmark and baseline architecture |
| Hallucination in Interpretation and Incomplete Exploration persist across model tiers | Dominant failures are framework-shaped, not only model-shaped | The taxonomy needs validation across other RCA systems |
| Prompt interventions broaden exploration but do not fix interpretation | Prompts can guide search but cannot enforce truth | Prompt design still matters, just not as the final control layer |
| Enriched communication reduces inter-agent pitfalls and improves efficiency | Artifact-rich handoffs are a high-ROI architectural fix | Mitigation experiments are mainly on the Bank-domain subset |
| Memory watcher eliminates observed OOM failures | Runtime state must be monitored as part of agent reliability | OOM is only one class of environment failure |

The phrase “control surfaces” is useful here. A reliable diagnostic agent needs places where the system can observe, constrain, and correct behavior.

A prompt is one control surface. It influences intention. But RCA also needs control surfaces for evidence coverage, interpretation verification, inter-agent handoff, runtime state, and final claim validation.

In practice, that could mean:

  • a telemetry coverage tracker that records which components, KPIs, logs, metrics, and traces have been examined;
  • a claim-evidence table that forces every causal assertion to point to raw output;
  • a code-and-output exposure protocol between reasoning and execution agents;
  • a verifier module that checks whether an interpretation follows from the returned values;
  • a causal-chain validator that distinguishes upstream causes from downstream symptoms;
  • a memory and resource monitor that turns runtime limits into structured feedback instead of silent failure;
  • a final escalation policy when evidence is partial, contradictory, or underexplored.
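Of the items above, the causal-chain validator is the easiest to sketch. The version below assumes a known service dependency map and a set of services already flagged as anomalous; both inputs and all service names are hypothetical. It follows dependencies from the first visible symptom and returns the anomalous services that have no anomalous dependency of their own.

```python
def trace_to_candidates(symptom, depends_on, anomalous):
    """Walk dependency edges from `symptom` and return anomalous services
    that have no anomalous dependency beneath them.

    depends_on: dict mapping a service to the services it calls.
    anomalous: set of services with abnormal telemetry in the window.
    """
    candidates, stack, visited = set(), [symptom], set()
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # guards against cycles in the dependency map
        visited.add(node)
        bad_deps = [d for d in depends_on.get(node, []) if d in anomalous]
        if bad_deps:
            stack.extend(bad_deps)   # keep tracing; this node may be a symptom
        elif node in anomalous:
            candidates.add(node)     # nothing anomalous beneath it
    return candidates


depends_on = {
    "gateway": ["checkout"],
    "checkout": ["payments", "inventory"],
    "payments": ["payments-db"],
}
anomalous = {"gateway", "checkout", "payments", "payments-db"}
candidates = trace_to_candidates("gateway", depends_on, anomalous)
```

The gateway is where the alert fired, but the walk ends at `payments-db`. The output is a candidate set, not a verdict: an anomalous dependency is consistent with causation without proving it, which is why this sits alongside the verifier and the escalation policy instead of replacing them.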

This is less magical than “autonomous AIOps.” It is also closer to production engineering.

What this means for AI agent ROI

For companies, the immediate temptation is to treat agent improvement as a model-ranking exercise. Run the same workflow on several models. Pick the one with the best score. Negotiate price. Add a dashboard. Call it transformation.

The paper suggests that this is incomplete.

When Code Generation Error varies from 1.8% to 65.5%, model choice clearly matters for some tasks. If your agent must write executable analysis code, code reliability is a valid selection criterion. But when Hallucination in Interpretation remains above 66% across all tested models, a better model alone does not solve the core diagnostic problem.

That changes the ROI logic.

| Investment option | Likely benefit | Risk if used alone |
|---|---|---|
| Stronger model | Better code generation, sometimes better reasoning | Persistent framework-level pitfalls remain |
| Longer prompt | Broader exploration, clearer trace | Interpretive hallucination may remain unchanged |
| Enriched agent communication | Fewer handoff failures, fewer wasted steps | Does not fully solve intra-agent interpretation |
| Verification modules | Better constraint on causal claims | Requires careful design and telemetry integration |
| Runtime monitoring | Fewer catastrophic execution failures | Does not improve reasoning by itself |
| Coverage and validation dashboards | Better governance and human oversight | Adds process cost if not integrated cleanly |

The best business case is probably not “replace SREs with agents.” That framing is both premature and unimaginative.

A better near-term case is “reduce diagnostic search cost while preserving human escalation.” Let agents gather evidence, run structured checks, propose hypotheses, and expose their reasoning artifacts. Let humans inspect a shorter, better-audited diagnostic trail. Over time, as verification improves, more incident classes can move from human-led to agent-led.

That is a more credible automation path than pretending that a best-observed perfect-detection rate of 12.5% amounts to production autonomy.

Where the paper’s evidence should not be overread

The paper is useful, but its boundaries matter.

First, the mitigation experiments are mainly conducted on the Bank-domain subset. The enriched communication results are promising, but they should not be treated as universal proof across every cloud architecture or RCA benchmark.

Second, the failure taxonomy is built through semi-automated classification using an analysis agent with human verification. That is a reasonable method at this scale, but it is not the same as a fully automated, continuously validated monitoring system. The taxonomy is a foundation for such a system, not the finished instrument.

Third, the paper studies OpenRCA-style multi-agent RCA. Other architectures may distribute memory, tools, code execution, or telemetry access differently. The exact pitfall rates may change. The broader mechanisms—interpretation error, handoff loss, and hidden runtime state—are likely more general, but that remains an inference, not something fully proven by this paper.

Fourth, the paper’s strongest mitigation does not solve the largest pitfall. Enriched communication reduces communication failures. It does not eliminate Hallucination in Interpretation. The authors point toward external verification modules, structured state sharing, and adaptive task decomposition as future directions. That is exactly where serious product work would need to continue.

In short: this is not a solved-RCA paper. It is a why-your-agent-keeps-failing paper. That is less comforting and more valuable.

The real root cause is architectural opacity

The article’s title asks whether cloud RCA agents find the root cause or merely produce a root illusion. The paper’s answer is that illusion enters the system through opacity.

The Controller’s interpretation is opaque because the system does not sufficiently verify generated narratives against raw telemetry. The inter-agent handoff is opaque because natural-language summaries hide code, errors, and execution details. The runtime is opaque because persistent kernel state affects execution while remaining outside the agent’s awareness.

Once seen this way, the paper’s findings become less surprising. The agent is not failing because it lacks enough words. It is failing because the diagnostic chain lacks enough observable structure.

That is the practical takeaway for AIOps teams, cloud platforms, and enterprise AI builders. Reliable agents are not created by adding more eloquence to a fragile workflow. They are created by making evidence, state, and delegation inspectable.

The next generation of RCA agents should not simply be better explainers. They should be better instrumented diagnostic systems.

Because in cloud operations, a confident wrong answer is not intelligence. It is just downtime with punctuation.

Cognaptus: Automate the Present, Incubate the Future.


  1. Taeyoon Kim, Woohyeok Park, Hoyeong Yun, and Kyungyong Lee, “Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?”, arXiv:2602.09937, 2026.