A failed agent run rarely fails politely.
It does not raise its hand at step 4 and say, “Here is the causal error; please patch the planner.” It drifts. A web agent grabs the wrong source. A coding agent trusts a bad assumption. A verifier rubber-stamps a plausible-looking answer. Twenty steps later the final output is wrong, the dashboard says “failed,” and the team is left doing digital archaeology with a very expensive shovel.
That is the problem Capital One’s RAFFLES paper tackles: not whether an LLM-agent system succeeded, but where its failure first became decisive.[1] The distinction sounds small until one has to debug a multi-agent workflow in production. A pass/fail score tells you the patient is unwell. RAFFLES tries to find the first organ that actually failed. Slightly more useful, one might say.
The paper’s core contribution is a shift from outcome evaluation to decisive-fault attribution. Instead of asking only whether the final answer is correct, it formalises a sharper question: which step was the earliest non-trivial, trajectory-altering mistake? Then it builds an offline Judge–Evaluator system that reasons over that question iteratively. The results are not magic. Exact step attribution remains hard, especially on long agent traces. But the paper offers a useful design pattern for enterprise AI teams: evaluation systems need structure, memory, critique, and stopping rules too.
The important unit is not the first mistake, but the first mistake that matters
The easiest misconception is to treat fault attribution as “find the first error.” That is not quite what the paper is doing.
In long agent trajectories, some errors are merely local. A model may phrase something badly, retrieve a weak source, or take a redundant action, and the system may later recover. Other errors are fatal but late: the pipeline is already doomed by the time they appear. The interesting step is the one where a plausible correction would have changed the final outcome.
The paper separates these categories:
| Concept | Meaning | Why it matters operationally |
|---|---|---|
| Step-level fault | A local step is wrong. | Useful for quality review, but may overcount harmless mistakes. |
| Trivial fault | A mistake occurs, but correcting it would not salvage the run. | Bad place to spend debugging time. |
| Critical fault | Correcting this step could change failure into success. | Candidate for root-cause remediation. |
| Decisive fault | The first critical fault in the trajectory. | The best target for preventing the cascade. |
That hierarchy is the paper’s most important move. It makes fault attribution less like proofreading and more like causal triage. The goal is not to complain about every dent in the process. The goal is to find the first dent that bent the axle.
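The hierarchy in the table can be written down as a few lines of Python. This is a minimal sketch, not the paper's formalism: the `Step` schema and the `fix_rescues_run` label are hypothetical, and in a real system that label is exactly the hard, counterfactual part that nobody hands you for free.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    index: int             # position in the trajectory (hypothetical schema)
    agent: str             # which agent acted at this step
    is_fault: bool         # the step contains a mistake
    fix_rescues_run: bool  # assumed label: correcting this step would turn failure into success


def classify(step: Step) -> str:
    """Map one step onto the taxonomy: ok, trivial fault, or critical fault."""
    if not step.is_fault:
        return "ok"
    return "critical" if step.fix_rescues_run else "trivial"


def decisive_fault(trajectory: list[Step]) -> Optional[Step]:
    """Return the earliest critical fault, or None if the run has none."""
    for step in sorted(trajectory, key=lambda s: s.index):
        if classify(step) == "critical":
            return step
    return None


# Made-up trajectory for illustration only.
trajectory = [
    Step(0, "Orchestrator", is_fault=False, fix_rescues_run=False),
    Step(1, "WebSurfer", is_fault=True, fix_rescues_run=False),    # trivial fault
    Step(2, "WebSurfer", is_fault=True, fix_rescues_run=True),     # critical, and the first one
    Step(3, "Orchestrator", is_fault=True, fix_rescues_run=True),  # critical, but not first
]
found = decisive_fault(trajectory)
print(found.index if found else None)
```

Here the decisive fault is step 2: step 1 is wrong but harmless, and step 3 is critical but arrives too late to be first.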
This matters because agent failures are rarely isolated. A retrieval error can poison a summariser. A planning error can force a tool into the wrong task. A verifier can convert uncertainty into false confidence. Once the failure has propagated, the most visible mistake may not be the most useful one to fix.
RAFFLES turns one vague judgement into three inspectable claims
RAFFLES stands for Reasoning-based Attribution of Faults for LLM Systems. Its architecture is simple enough to explain, but more disciplined than the usual “ask a bigger model to judge the log” pattern.
The system has a central Judge. The Judge reads the trajectory and proposes a candidate decisive fault: the agent, the step, and the reason. Crucially, it must justify the candidate using three criteria:
- Fault condition: did the step actually contain a mistake?
- Primacy: was this the first relevant mistake tied to the final failure?
- Decisiveness: was the mistake not corrected later, and was it materially responsible for the failed outcome?
Then specialised Evaluators critique the Judge’s reasoning. One checks whether the proposed step is genuinely faulty. Another checks whether it is really the first relevant mistake. A third checks whether the mistake remained uncorrected and decisive. The system also uses a rule-based consistency check to ensure the proposed step matches the log.
The Evaluators return rationales and confidence scores. Their critiques are written into a memory component. The Judge then tries again, now with the earlier critiques in view. The loop stops when confidence is high enough or when the maximum number of iterations is reached.
A compact way to see the mechanism:
Full trajectory
↓
Judge proposes candidate decisive fault
↓
Evaluator 1 checks: is it a real fault?
Evaluator 2 checks: is it the first relevant fault?
Evaluator 3 checks: did it remain decisive?
Rule check verifies log consistency
↓
Memory stores critiques and scores
↓
Judge revises or confirms candidate
↓
Final predicted decisive fault
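The same loop can be sketched in Python. Treat this as an illustration under stated assumptions rather than the authors' implementation: the `Candidate` and `Critique` types, the toy judge and evaluators standing in for LLM calls, and the stopping rule (every criterion must clear a threshold) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    agent: str
    step: int
    rationale: str


@dataclass
class Critique:
    criterion: str     # "fault", "primacy", "decisiveness", or "consistency"
    confidence: float  # evaluator's self-reported score in [0, 1]
    rationale: str


Judge = Callable[[list[str], list[Critique]], Candidate]
Evaluator = Callable[[list[str], Candidate], Critique]


def attribute_fault(log: list[str], judge: Judge, evaluators: list[Evaluator],
                    threshold: float = 0.8, max_iters: int = 3):
    """Iterate Judge -> Evaluators -> memory until confident or out of budget."""
    memory: list[Critique] = []
    candidate: Optional[Candidate] = None
    for _ in range(max_iters):
        candidate = judge(log, memory)
        # Rule-based consistency check: the proposed step must exist in the log.
        if not 0 <= candidate.step < len(log):
            memory.append(Critique("consistency", 0.0, "step not in log"))
            continue
        critiques = [evaluate(log, candidate) for evaluate in evaluators]
        memory.extend(critiques)
        # Assumed stopping rule: every criterion clears the threshold.
        if min(c.confidence for c in critiques) >= threshold:
            break
    return candidate, memory


# Toy stand-ins for LLM calls, purely for illustration.
def toy_judge(log, memory):
    primacy_doubted = any(c.criterion == "primacy" and c.confidence < 0.5 for c in memory)
    step = 1 if primacy_doubted else 2
    return Candidate("WebSurfer", step, f"step {step} looks decisive")


def fault_eval(log, cand):
    return Critique("fault", 0.9, "the step contains a mistake")


def primacy_eval(log, cand):
    ok = cand.step == 1
    return Critique("primacy", 0.9 if ok else 0.2,
                    "no earlier relevant mistake" if ok else "an earlier step already went wrong")


def decisive_eval(log, cand):
    return Critique("decisiveness", 0.9, "the mistake was never corrected")


log = ["plan task", "open wrong source", "summarise source", "final answer"]
final, memory = attribute_fault(log, toy_judge, [fault_eval, primacy_eval, decisive_eval])
print(final.step, len(memory))
```

In this toy run the Judge first proposes step 2, the primacy Evaluator objects, and the second pass moves to step 1 and stops. The memory holds six critiques, three per iteration.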
The point is not that RAFFLES uses more LLM calls. That would be the least interesting possible interpretation. The point is that RAFFLES makes the evaluator’s reasoning modular. A single-pass judge can hide three separate failures inside one fluent paragraph: it may identify a real error, but the wrong one; it may find an early error, but one the system later fixed; or it may find a visible late error that was only a symptom. RAFFLES forces those questions apart.
That is the mechanism-first lesson: as agent systems become multi-step and multi-component, evaluators cannot remain monolithic. The evaluator has to resemble the system it is judging: structured, iterative, and capable of reviewing its own intermediate claims. Yes, we have reached the point where even the judge needs management.
The benchmarks test different kinds of diagnosis
The paper evaluates RAFFLES mainly on two benchmark families, and they should not be mentally collapsed.
The first is Who&When, a benchmark for attributing failures in multi-agent systems. The authors use two of its subsets, summarised in the table below alongside the three ReasonEval sets:
| Dataset subset | Sample size | Average steps | Average words | What it stresses |
|---|---|---|---|---|
| Who&When Algorithmically-Generated | 126 | 8.72 | 1,507.33 | Breadth across many agents and shorter traces |
| Who&When Hand-Crafted | 58 | 50.60 | 7,459.91 | Long-horizon diagnosis with much larger search space |
| ReasonEval MR-MATH-Invalid | 159 | 6.78 | 179.91 | Step-level reasoning errors |
| ReasonEval MR-GSM8K Original | 1,418 | 7.05 | 104.82 | Mathematical reasoning error localisation |
| ReasonEval MR-GSM8K Reversed | 1,359 | 11.26 | 164.98 | Longer reasoning traces under altered problem framing |
The second benchmark family is ReasonEval, which concerns step-level errors in mathematical reasoning chains. This is related, but not identical. In ReasonEval, the paper omits the decisiveness criterion because the dataset annotation is about reasoning-step correctness rather than whether a step causally doomed a failed agent trajectory.
That distinction matters. The Who&When experiments are the main evidence for agentic fault attribution. ReasonEval is better read as a transfer test: can the same structured Judge–Evaluator pattern help locate errors outside multi-agent logs? The answer appears to be yes, though it does not prove that RAFFLES is a universal process reward model. The authors are careful on this point; the article should be too. We are all adults here, allegedly.
The main evidence: RAFFLES improves exact step attribution, but exact attribution is still hard
The headline result is straightforward: RAFFLES outperforms one-shot judges, router-style methods, binary search, and a Tool-Caller baseline on strict step-level accuracy.
On Who&When, the strict metric requires the method to predict the exact failing step. That is a harsh metric, especially for the Hand-Crafted subset with roughly fifty steps and long logs. The numbers show both progress and difficulty.
| Method | Llama 3.3 70B Algorithmic | Llama 3.3 70B Hand-Crafted | Claude Sonnet 4 Algorithmic | Claude Sonnet 4 Hand-Crafted |
|---|---|---|---|---|
| Chat-LLM | 19.05% | 6.90% | 29.37% | 3.45% |
| Router: Step by Step | 6.35% | 3.45% | 30.16% | 12.07% |
| Router: Binary Search | 4.76% | 10.34% | 33.33% | 20.69% |
| Tool-Caller | 33.33% | 13.56% | 30.95% | 18.97% |
| RAFFLES | 43.65% | 20.69% | 51.59% | 22.41% |
The Claude result on the Algorithmically-Generated subset is the cleanest headline: RAFFLES reaches 51.59% strict step-level accuracy, compared with 33.33% for Binary Search and 30.95% for Tool-Caller in the same table. On the Hand-Crafted subset, the gain is more modest in absolute terms: 22.41% for RAFFLES versus 20.69% for Binary Search and 18.97% for Tool-Caller.
This is exactly where the sober interpretation matters. RAFFLES is not “solving” agent debugging. A strict score around 22% on the harder long-context setting means most exact step predictions are still wrong. But compared with one-shot judging at 3.45% under Claude, the improvement is practically meaningful. The method moves the process from “nearly blind” toward “useful triage,” which is not a slogan one puts on a billboard but is often what engineering teams actually need.
The paper also reports tolerant step-level accuracy, where a prediction counts as a hit if it lands within a fixed window of the annotated step. With Llama 3.3 70B on the Algorithmically-Generated subset, RAFFLES rises from 43.65% exact accuracy to 73.81% within a ±2-step window and 82.54% within ±3. On the Hand-Crafted subset, the tolerance helps far less: 20.69% exact, 27.59% within ±2, and 29.31% within ±3.
That gap is important. Shorter traces with early faults are much more forgiving. Long, hand-crafted traces are still stubborn. Any enterprise adopting this pattern should measure both exact and tolerant localisation. If the tool cannot always identify the single bad step, a narrow review window may still reduce human debugging cost. If the window remains too wide, the system is merely producing confident paperwork. We already have enough of that.
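Measuring both is cheap. The sketch below computes strict and tolerant step-level accuracy from paired predictions and annotations; the function name and the sample data are invented for illustration and are not taken from the paper.

```python
def step_accuracy(predicted: list[int], actual: list[int], tolerance: int = 0) -> float:
    """Share of runs where the predicted step is within `tolerance` of the annotated step.

    tolerance=0 gives the strict metric; tolerance=2 or 3 gives the tolerant one.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return hits / len(actual)


# Made-up predictions and labels for five failed runs.
predicted = [3, 10, 7, 0, 25]
actual = [3, 12, 2, 0, 40]

print(step_accuracy(predicted, actual))               # strict
print(step_accuracy(predicted, actual, tolerance=2))  # within ±2 steps
print(step_accuracy(predicted, actual, tolerance=3))  # within ±3 steps
```

On this toy data the strict score is 0.4 and both tolerant scores are 0.6: widening the window rescues the near miss but does nothing for the run that was off by fifteen steps.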
The ablations and extensions say structure helps, but iteration has diminishing returns
The paper’s evidence is not just a leaderboard. Several supporting analyses clarify why RAFFLES helps and where the gains are limited.
| Test or analysis | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Comparison with Chat-LLM, routers, Binary Search, and Tool-Caller | Main evidence and comparison with prior styles | Structured Judge–Evaluator reasoning improves step attribution over simpler evaluator architectures | That RAFFLES will generalise unchanged to all enterprise workflows |
| Tolerant step-level accuracy | Practical utility test | Approximate localisation can be more useful than exact matching alone | That a tolerance window is always sufficient for remediation |
| ReasonEval experiments | Transfer / exploratory extension | The structured reasoning pattern helps on mathematical reasoning traces too | That RAFFLES replaces specialised process reward models |
| Iteration analysis | Ablation / sensitivity test | Candidate revision improves alignment with ground truth distributions and often improves accuracy | That more iterations always help |
| Latency and cost analysis | Implementation detail | Iterative evaluation is feasible for offline debugging workloads | That it is suitable for real-time gating in latency-sensitive products |
| Dataset discussion | Boundary condition | Results depend heavily on benchmark structure, label quality, and trace length | That the reported gains are benchmark-independent |
The iteration analysis is especially useful because it prevents a lazy conclusion: “more reflection equals better results.” The paper does not show that. It shows that structured iteration often helps, but improvements are not perfectly monotonic. Candidate steps change substantially early and then settle: between the first and second iterations, candidate steps changed in 25.40% of Algorithmically-Generated examples and 37.93% of Hand-Crafted examples. Later changes were much smaller.
That pattern suggests RAFFLES is doing real revision, not simply decorating its first answer with additional prose. But it also shows diminishing returns. The paper argues for early stopping and maximum iteration limits because iterative self-critique can stall or even degrade performance. In other words: reflection is useful, but only when someone has the authority to end the meeting.
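A team running its own version of this loop can track the same signal. The following is a small sketch, assuming each run stores the Judge's candidate step at every iteration; the `histories` layout is a hypothetical logging format, not something the paper specifies.

```python
def change_rates(histories: list[list[int]]) -> list[float]:
    """histories[i][k] is the candidate step for run i at iteration k.

    Returns, for each pair of consecutive iterations, the share of runs
    whose candidate step changed.
    """
    if not histories:
        return []
    n_iters = min(len(h) for h in histories)
    rates = []
    for k in range(1, n_iters):
        changed = sum(h[k] != h[k - 1] for h in histories)
        rates.append(changed / len(histories))
    return rates


# Made-up candidate histories for four runs over three iterations.
histories = [
    [2, 5, 5],
    [1, 1, 1],
    [4, 7, 8],
    [0, 0, 0],
]
print(change_rates(histories))
```

Here half the runs change candidate between the first and second iterations and a quarter change between the second and third. A rate that flattens quickly is one practical argument for a low iteration cap.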
The appendix also helps explain the long-context weakness. In Hand-Crafted traces, ground-truth faults are more distributed across the trajectory, while initial predictions tend to favour early steps. Later Evaluator critiques help push the Judge toward later candidates when appropriate. This is a plausible mechanism: the model begins with an early-fault bias, then structured critique partially corrects it. “Partially” is doing work in that sentence.
Tool-calling is not the same as disciplined evaluation
One of the more interesting comparisons is RAFFLES versus the Tool-Caller baseline.
The Tool-Caller baseline gives a planner access to the full log and lets it choose steps to inspect with an LLM judge. This sounds attractively agentic. It has global context, targeted inspection, and flexible action. In many product decks, that would already be called an “autonomous evaluation layer,” followed by a slide with arrows and a suspiciously cheerful gradient.
Yet Tool-Caller still underperforms RAFFLES across the main Who&When comparisons. The paper attributes this to the planner’s weaker procedural reliability. It can choose poor candidate steps, and its reasoning is not decomposed into the three decisive-fault criteria.
The lesson is not “avoid tools.” It is that tool use without an explicit evaluation protocol can become expensive wandering. RAFFLES constrains the evaluator around the structure of the decision: fault, primacy, decisiveness. The freedom to inspect is less valuable than the discipline to inspect the right claims.
For enterprise agent systems, this is a useful correction. Many teams will be tempted to debug agents with more agents: a planner, a critic, a log reader, a tool inspector, perhaps a summariser wearing a small hat. The RAFFLES result suggests that the key design question is not how many sub-agents exist, but whether each one owns a distinct, testable part of the attribution problem.
What businesses can actually use from this paper
The direct paper result is benchmark performance on offline fault attribution. The business inference is broader but should remain bounded.
The practical use case is not real-time autonomous incident response. It is post-hoc diagnosis for failed or suspicious runs. A company running agentic workflows can store indexed traces, pass failed runs into a RAFFLES-like evaluator, and use the output to narrow human review.
That can support several operational workflows:
| Operational problem | RAFFLES-style use | Business value | Boundary |
|---|---|---|---|
| Failed agent run with long trace | Identify likely decisive step and rationale | Shorter debugging loop | Exact localisation may still fail on long traces |
| Repeated failures by same component | Aggregate decisive faults by agent, tool, or step type | Prioritise remediation work | Requires consistent trace schemas and labels |
| Regression after prompt or tool change | Compare decisive-fault distributions before and after deployment | Faster rollback or patch targeting | Needs baseline data and controlled rollout |
| Governance review for high-impact workflows | Preserve Judge–Evaluator rationales as audit artefacts | Better explainability than final-score dashboards | Human review remains necessary |
| Synthetic data generation | Use high-confidence diagnoses to create training/evaluation examples | Better downstream evaluators | Risk of reinforcing evaluator errors |
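The regression row in the table can be made concrete with a simple distribution comparison. This is a sketch under assumptions: the component labels are invented, and total variation distance is one reasonable choice of measure, not something the paper prescribes.

```python
from collections import Counter


def distribution(labels: list[str]) -> dict[str, float]:
    """Turn a list of decisive-fault labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0 means identical, 1 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Made-up diagnoses: which component owned the decisive fault in each failed run.
before = ["retriever"] * 6 + ["planner"] * 3 + ["verifier"] * 1
after = ["retriever"] * 2 + ["planner"] * 7 + ["verifier"] * 1

shift = total_variation(distribution(before), distribution(after))
print(round(shift, 3))
```

A shift of 0.4 on ten runs per side is too small a sample to act on; the point is only that a release which moves decisive faults from the retriever to the planner shows up as a number that can be tracked.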
The strongest business case is cost reduction in diagnosis. The paper estimates RAFFLES cost at about $0.1188 per Algorithmically-Generated log and $0.306 per Hand-Crafted log using Claude Sonnet 4 with two maximum iterations, compared with rough junior-engineer debugging estimates of $2.50 and $12.50 per log respectively. Treat those as illustrative, not universal. Labour rates, trace length, model pricing, security requirements, and workflow complexity will all move the number.
Still, the direction is credible: offline automated triage can be cheap enough to run on many failures, especially when the alternative is manual trace review. The more important question is not whether RAFFLES is cheaper per log. It is whether its diagnoses are accurate enough to change engineering behaviour. For short and medium traces, the evidence is promising. For very long traces, it is promising with a raised eyebrow.
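The arithmetic behind that claim is short. The sketch below uses the per-log figures quoted above; the break-even model is a deliberately crude assumption added for illustration (a correct diagnosis removes the manual review entirely, an incorrect one saves nothing), not something the paper states.

```python
# Per-log cost figures quoted above (USD); the paper's estimates, not universal constants.
automated = {"Algorithmically-Generated": 0.1188, "Hand-Crafted": 0.306}
manual = {"Algorithmically-Generated": 2.50, "Hand-Crafted": 12.50}


def cost_ratio(subset: str) -> float:
    """How many times more expensive manual review is than one automated pass."""
    return manual[subset] / automated[subset]


def break_even_hit_rate(subset: str) -> float:
    """Hypothetical model: expected cost with triage = automated + (1 - h) * manual.

    Triage pays for itself when that is below the manual cost alone,
    which happens once the hit rate h exceeds automated / manual.
    """
    return automated[subset] / manual[subset]


for subset in automated:
    print(f"{subset}: manual is ~{cost_ratio(subset):.0f}x the automated cost; "
          f"break-even hit rate ~{break_even_hit_rate(subset):.1%}")
```

Under these assumptions manual review costs roughly 21 times the automated pass on the shorter subset and roughly 41 times on the longer one, and the break-even hit rate is below 5% in both cases. Real savings depend on how much a partially useful diagnosis shortens review, which this toy model ignores.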
The metric choice is part of the contribution
The paper’s insistence on step-level accuracy deserves attention because metrics shape product decisions.
Agent-level accuracy can be misleading in imbalanced datasets. In the Who&When Hand-Crafted subset, the WebSurfer agent accounts for 33 of 58 decisive faults, and WebSurfer plus Orchestrator dominate the distribution of annotated faults. A trivial system that guesses WebSurfer can look respectable at the agent level. That does not mean it understands where the workflow failed.
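The size of that free lunch is easy to compute. The sketch below uses the 33-of-58 figure from the text; lumping the remaining 25 faults under a single `"other"` label is a simplification for the example, since the exact split across agents is not given here.

```python
from collections import Counter


def majority_baseline_accuracy(true_agents: list[str]) -> float:
    """Agent-level accuracy of a 'judge' that always names the most common agent."""
    counts = Counter(true_agents)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(true_agents)


# 33 of 58 decisive faults belong to WebSurfer; the rest are lumped together (assumption).
labels = ["WebSurfer"] * 33 + ["other"] * 25

baseline = majority_baseline_accuracy(labels)
print(f"{baseline:.1%}")
```

Always answering WebSurfer scores about 56.9% at the agent level without reading a single log line, which is why any agent-level number should be reported next to this baseline.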
Step-level accuracy is stricter. It asks the evaluator to find the actual point of failure, not merely the usual suspect. Tolerant step-level accuracy then adds a practical layer: if the tool points within two or three steps of the decisive fault, that may be enough to guide a human reviewer.
For business use, both should be tracked:
- Strict step accuracy tells whether the evaluator can pinpoint.
- Tolerant step accuracy tells whether it can narrow the search.
- Agent-level accuracy tells whether it can identify the responsible component, but should be treated carefully when component frequencies are skewed.
- Rationale quality should be reviewed separately, because a correct step with a bad explanation is dangerous in governance settings.
A boring metric can be a safety feature. This is rude, but true.
Where the paper’s boundaries matter
RAFFLES should not be read as a finished enterprise reliability system. It is a research architecture with several important boundaries.
First, the evidence depends on available benchmarks. The authors themselves note that high-quality agentic fault-attribution datasets remain scarce. The Who&When dataset is valuable because it gives full traces and annotated agent-step fault pairs, but the appendix reports label inconsistencies in six cases across the two subsets. That does not invalidate the results, but it reminds us that evaluator research is only as clean as the labels beneath it.
Second, the scope is English-language datasets focused on agentic systems and mathematical reasoning. A firm deploying multilingual customer-service agents, finance agents, or regulated decision-support workflows would need domain-specific validation. “It worked on Who&When” is not an audit strategy. It is the beginning of one.
Third, RAFFLES is offline. The paper’s cost and latency numbers are reasonable for post-hoc diagnosis, but this is not the same as inserting RAFFLES into every live user interaction. With Claude Sonnet 4, the paper reports parallelised latency of 15.55 seconds for the shorter subset and 29.15 seconds for the longer one. That is fine for debugging queues. It is less fine for a customer waiting at checkout.
Fourth, confidence scores are not calibrated truth. Evaluators can be persistently overconfident. The paper reports incorrect predictions persisting across all iterations in about 20.6% of Algorithmically-Generated examples and 17.2% of Hand-Crafted examples. That is not a footnote-level concern. It means the system can converge on the wrong answer with confidence, which is precisely the genre of failure AI systems enjoy contributing to civilisation.
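One way to keep an eye on this is a plain calibration table: bucket past diagnoses by reported confidence and compare each bucket with how often it was right. This is a minimal sketch, assuming a labelled history of `(confidence, was_correct)` pairs exists; the bucket edges and the sample records are invented.

```python
def calibration_table(records, edges=(0.0, 0.5, 0.8, 1.0)):
    """records: iterable of (confidence, was_correct) pairs.

    Returns (low, high, count, accuracy) for every non-empty bucket.
    The last bucket includes its upper edge so confidence 1.0 is counted.
    """
    records = list(records)
    rows = []
    for low, high in zip(edges, edges[1:]):
        is_last = high == edges[-1]
        bucket = [ok for conf, ok in records
                  if low <= conf < high or (is_last and conf == high)]
        if bucket:
            rows.append((low, high, len(bucket), sum(bucket) / len(bucket)))
    return rows


# Made-up history: reported confidence and whether a human confirmed the diagnosis.
history = [(0.95, True), (0.90, False), (0.85, False), (0.60, True), (0.40, False)]

for low, high, count, accuracy in calibration_table(history):
    print(f"confidence {low:.1f} to {high:.1f}: n={count}, accuracy={accuracy:.2f}")
```

In this toy history the highest-confidence bucket is right only a third of the time. If a real table looks like that, the confidence threshold is not doing the job its name implies.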
Finally, decisiveness is partly counterfactual. The formal definition asks whether replacing a faulty step could have led to success. In real systems, proving that counterfactual is difficult without simulation, replay, or human judgement. RAFFLES approximates this through structured reasoning. That approximation can be useful, but it should not be confused with causal proof.
A practical implementation pattern for agent teams
A RAFFLES-inspired system does not require copying the paper verbatim. The design pattern is more important than the acronym.
A useful enterprise implementation would start with trace hygiene:
- Persist every agent step with stable indices.
- Store agent identity, tool inputs, tool outputs, retrieved documents, intermediate reasoning summaries where available, and final outcome.
- Mark failed, suspicious, or low-confidence runs for offline diagnosis.
- Run a structured Judge that must separate fault, primacy, and decisiveness rationales.
- Run specialised Evaluators that critique each rationale independently.
- Store the full critique history as an incident artefact.
- Aggregate diagnoses by component, tool, prompt version, workflow type, and release window.
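The first two items on that list amount to a trace schema. A minimal sketch of one follows; the field names are hypothetical and a real schema would carry more (timestamps, prompt versions, outcome labels), but stable, gap-free indices are the part the rest of the pipeline depends on.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceStep:
    run_id: str
    index: int  # stable position within the run, starting at 0
    agent: str
    tool_input: str
    tool_output: str
    retrieved_docs: list[str] = field(default_factory=list)


def validate_indices(steps: list[TraceStep]) -> bool:
    """True when the steps are stored in order with indices 0..n-1 and no gaps."""
    return [s.index for s in steps] == list(range(len(steps)))


def to_jsonl(steps: list[TraceStep]) -> str:
    """Serialise a run as one JSON object per line, ready for offline diagnosis."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in steps)


# Made-up two-step run.
run = [
    TraceStep("run-001", 0, "Orchestrator", "plan: find 2023 revenue", "delegate to WebSurfer"),
    TraceStep("run-001", 1, "WebSurfer", "search: acme revenue", "opened blog post",
              retrieved_docs=["https://example.com/post"]),
]

print(validate_indices(run))
print(to_jsonl(run))
```

If `validate_indices` fails, a diagnosis that says "step 12" cannot be trusted to point at the same step tomorrow, so it is worth checking at write time rather than during an incident.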
The end product should not be “RAFFLES says step 12.” It should be a diagnosis package:
Likely decisive fault: Web retrieval step 12
Confidence profile:
- Fault condition: high
- Primacy: medium
- Decisiveness: high
Alternative candidate: planner step 7
Human-review window: steps 7–13
Recommended remediation class: retrieval validation / query reformulation
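As a data structure, that package might look like the sketch below. The class and its fields are illustrative, and the rule for the review window (from the earliest candidate to one step past the latest) is an assumption chosen to reproduce the example above, not a rule from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiagnosisPackage:
    agent: str
    step: int
    confidence: dict[str, str]             # criterion -> "low" / "medium" / "high"
    alternative: Optional[tuple[str, int]]  # (agent, step) of the runner-up, if any
    remediation: str

    def review_window(self, padding: int = 1) -> tuple[int, int]:
        """Assumed rule: earliest candidate through latest candidate plus padding."""
        steps = [self.step]
        if self.alternative is not None:
            steps.append(self.alternative[1])
        return max(0, min(steps)), max(steps) + padding

    def render(self) -> str:
        first, last = self.review_window()
        lines = [f"Likely decisive fault: {self.agent} step {self.step}", "Confidence profile:"]
        lines += [f"- {name}: {level}" for name, level in self.confidence.items()]
        if self.alternative is not None:
            lines.append(f"Alternative candidate: {self.alternative[0]} step {self.alternative[1]}")
        lines.append(f"Human-review window: steps {first} to {last}")
        lines.append(f"Recommended remediation class: {self.remediation}")
        return "\n".join(lines)


package = DiagnosisPackage(
    agent="Web retrieval",
    step=12,
    confidence={"Fault condition": "high", "Primacy": "medium", "Decisiveness": "high"},
    alternative=("planner", 7),
    remediation="retrieval validation / query reformulation",
)
print(package.render())
```

Keeping the per-criterion confidence and the alternative candidate as first-class fields means a reviewer sees the disagreement, not just the verdict.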
That is the difference between a benchmark method and an operational tool. The benchmark rewards exact prediction. The business process needs prioritised evidence, alternative hypotheses, and review boundaries.
The real thesis: evaluators are becoming systems, not prompts
The most useful idea in the RAFFLES paper is not simply “iterate.” It is that evaluation has to become architected.
The first generation of LLM evaluation treated the model as an answer machine. The second treated another model as a judge. RAFFLES points toward a third pattern: evaluators as structured systems with roles, memory, critique, thresholds, and explicit definitions of what counts as failure.
That shift is overdue. Agentic systems are not single completions. They are trajectories. Their failures are not single wrong tokens. They are cascades of planning, retrieval, execution, and verification errors. Evaluating them with a one-shot verdict is like inspecting a factory by tasting the final biscuit. Useful, perhaps, but not exactly process control.
RAFFLES gives teams a more serious template. It does not remove the need for human review. It does not guarantee root-cause truth. It does not make long-context diagnosis easy. What it does is turn failure analysis into a structured, inspectable, repeatable process.
For companies building real agent workflows, that is the right level of ambition. Not omniscient automation. Cheaper, better-directed diagnosis. Less archaeology. Fewer all-hands debugging rituals. And, with luck, fewer agents confidently marching from one bad early assumption to a beautifully formatted wrong answer.
Cognaptus: Automate the Present, Incubate the Future.
[1] Chenyang Zhu, Spencer Hong, Jingyu Wu, Kushal Chawla, Charlotte Tang, Youbing Yin, Nathan Wolfe, Erin Babinsky, and Daben Liu, “RAFFLES: Reasoning-based Attribution of Faults for LLM Systems,” arXiv:2509.06822, 2025, https://arxiv.org/abs/2509.06822.