TL;DR

Most LLM agent evaluations judge only the final answer. RAFFLES shifts the lens to where the first causal error actually happened, then iterates with a Judge–Evaluator loop to verify three criteria: that a fault occurred, that it came first, and that it was never corrected. On the Who&When benchmark, RAFFLES materially outperforms one-shot judges and router-style baselines. For builders, this is a template for root-cause analytics on long-horizon agents, not just scorekeeping.


Why we need decisive-fault attribution (not just pass/fail)

Modern agent stacks (routers, tool-callers, planners, web surfers, coders) fail in cascades. A harmless-looking plan choice at t=3 can doom execution at t=27. Traditional “LLM-as-a-judge” evaluation:

  • Focuses on the end product, missing the origin of failure.
  • Struggles with long contexts, often biased to early or late steps.
  • Offers limited guidance for remediation beyond “try a better prompt.”

Decisive-fault attribution asks a different question: What is the earliest step whose correction would have flipped failure into success? That’s actionable for owners of complex, multi-component systems.


The RAFFLES idea in one diagram (words)

RAFFLES runs an iterative Judge–Evaluator pipeline over a full trajectory:

  1. Judge proposes a candidate (agent i, step t) as the decisive fault, with structured rationales for three criteria:

    • Fault condition: a real mistake occurred at (i, t).
    • Primacy: it’s the first causal mistake.
    • Non-correction: later steps didn’t correct it (nor could they reasonably have corrected it).

  2. Evaluators (E1–E3 plus a rule-check) critique those rationales independently, each focused on one criterion, and return natural-language critiques with confidence scores.

  3. Memory H accumulates the critiques; the Judge revises its candidate and rationales. Early stopping triggers when total confidence exceeds a threshold (or after K iterations; the authors typically use K = 2).

Net effect: a reasoned search for the first causal domino, with built-in cross-examination.
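
A minimal sketch of the loop, assuming generic `call_judge` and `call_evaluators` callables (names, data shapes, and the confidence threshold are illustrative, not the paper’s):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    agent: str
    step: int
    fault_reason: str
    first_mistake_reason: str
    non_correction_reason: str

@dataclass
class Critique:
    criterion: str      # "fault", "primacy", or "non_correction"
    confidence: int     # 0–100
    justification: str

def raffles_loop(trajectory, call_judge, call_evaluators,
                 max_iters: int = 2, confidence_threshold: int = 240):
    """Iterative Judge–Evaluator search for the decisive fault (illustrative sketch)."""
    history: list[Critique] = []                 # shared memory H of critiques
    candidate = call_judge(trajectory, history)  # initial (agent, step) plus three rationales
    for _ in range(max_iters):
        critiques = call_evaluators(trajectory, candidate)  # E1–E3 (+ rule check), one per criterion
        history.extend(critiques)
        if sum(c.confidence for c in critiques) >= confidence_threshold:
            break                                # early stop: evaluators are collectively convinced
        candidate = call_judge(trajectory, history)          # revise using accumulated critiques
    return candidate, history                    # verdict plus the H memory to archive
```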


What changes versus existing baselines

  • Versus one-shot judges: RAFFLES decomposes the claim (fault / first / non-corrected) and subjects it to adversarial checks, rather than issuing a monolithic verdict.
  • Versus routers / binary search: it preserves global context while zooming in, avoiding the myopia of partial windows.
  • Versus flexible tool-callers: it trades “agentic freedom” for structured reliability—less clever wandering, more disciplined reasoning.

Results at a glance (Who&When benchmark)

Strict step-level accuracy (identify the exact failing step):

| Method (representative)        | Algorithmic subset | Hand-Crafted subset |
|--------------------------------|--------------------|---------------------|
| One-shot LLM judge             | ~19%               | ~7%                 |
| Router (Step-by-Step / Binary) | ~0–16%             | ~0–14%              |
| Tool-Caller (planner + judge)  | ~17–33%            | ~7–17%              |
| RAFFLES (K=2)                  | ~44%               | ~21%                |

Tolerance helps (±k steps). Under a ±2 window, RAFFLES sustains strong practical utility—often 70%+ on the algorithmic subset and ~26–30% on the hand-crafted subset—good enough to guide a human reviewer straight to the critical neighborhood.

Takeaway for teams: You don’t need perfect pinpointing to slash debug time. A small window around the decisive step already pays for itself.


Practical implications for enterprise agent stacks

1) Treat evaluation as an agent too.

  • Implement a Judge that must write down three separate rationales (fault, first, non-corrected) for its choice.
  • Add targeted Evaluators that check each rationale and return a confidence integer (0–100) with a short justification (a prompt sketch follows this list).
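
One way to scope each Evaluator to a single criterion while keeping the 0–100 confidence contract machine-parseable (the prompt wording below is an assumption, not the paper’s):

```python
# Hypothetical prompt templates: one criterion per Evaluator, one parseable confidence field.
EVALUATOR_PROMPTS = {
    "fault": (
        "Given the trajectory excerpt and the Judge's fault_reason, decide whether a real "
        "mistake occurred at the cited (agent, step). "
        'Reply with JSON: {"confidence": <0-100>, "justification": "..."}.'
    ),
    "primacy": (
        "Check earlier steps only. Is the cited step the FIRST causal mistake, or does an "
        "earlier step already contain a decisive error? "
        'Reply with JSON: {"confidence": <0-100>, "justification": "..."}.'
    ),
    "non_correction": (
        "Check later steps only. Was the cited mistake corrected, or could it reasonably have "
        "been corrected, downstream? "
        'Reply with JSON: {"confidence": <0-100>, "justification": "..."}.'
    ),
}
```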

2) Adopt step-level metrics.

  • Track strict step accuracy and tolerant step accuracy (±k); together they tell you how much human review remains (a small helper is sketched after this list).
  • De-emphasize agent-only accuracy when labels are imbalanced (e.g., WebSurfer ≫ other agents): a judge that always blames the dominant agent can look accurate without localizing anything.
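
A small helper for strict and ±k tolerant step accuracy (function and argument names are assumptions):

```python
def step_accuracy(predicted_steps, gold_steps, tolerance: int = 0) -> float:
    """Fraction of runs whose predicted decisive step lies within ±tolerance of the label."""
    assert len(predicted_steps) == len(gold_steps) and gold_steps
    hits = sum(abs(p - g) <= tolerance for p, g in zip(predicted_steps, gold_steps))
    return hits / len(gold_steps)

# strict   = step_accuracy(preds, gold)      # exact-step match
# tolerant = step_accuracy(preds, gold, 2)   # within a ±2-step window
```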

3) Expect long-context drift—and contain it.

  • Use early stopping (fixed K) and confidence thresholds; iterative self-critique doesn’t improve monotonically, so more rounds aren’t always better.
  • Keep full trajectories available—but let Evaluators quote localized snippets to anchor claims.

4) Turn insights into fixes.

  • Classify decisive faults by type (planning vs retrieval vs tool-use vs code vs verification) and map each to a playbook (e.g., “planner hallucination → enforce schema + plan-checker with queries-to-criteria”), as in the mapping sketched after this list.
  • Instrument your orchestrator to auto-collect the H memory and verdicts as artifacts for governance.
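
A sketch of a fault-type-to-playbook lookup the orchestrator could consult when a verdict lands (categories and actions are illustrative, drawn from the examples in this post):

```python
# Illustrative mapping from decisive-fault type to remediation playbook.
FAULT_PLAYBOOKS = {
    "planning":     ["enforce plan schema", "add plan-checker mapping queries to success criteria"],
    "retrieval":    ["strengthen query reformulation", "inject semantic filters",
                     "add retrieval-verifier Evaluator"],
    "tool_use":     ["validate tool arguments against schema", "retry with repaired arguments"],
    "code":         ["run unit tests before execution", "add static checks"],
    "verification": ["require cited evidence", "cross-check claims against source documents"],
}

def playbook_for(fault_type: str) -> list[str]:
    # Unknown categories fall back to human review rather than silent no-ops.
    return FAULT_PLAYBOOKS.get(fault_type, ["route to human review"])
```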

A mini example: RAG pipeline gone sideways

  • Symptom (final): wrong summary.
  • RAFFLES flow: Judge claims the decisive fault is retrieval@t=4 (irrelevant top-k due to bad query), not generation@t=12.
  • Why: E1 finds a verifiable mismatch between the retrieved docs and the query intent; E2 shows earlier steps were fine and no prior causal error exists; E3 argues later self-reflection didn’t correct the off-target corpus. Fixing t=4 flips the outcome.
  • Remediation: strengthen query reformulation, inject semantic filters, and add a retrieval-verifier Evaluator to catch future misfires (a sample verdict record is sketched below).
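
What the archived verdict for this run might look like (a hypothetical record for illustration, not RAFFLES’s actual output format):

```python
verdict = {
    "decisive_fault": {"agent": "retriever", "step": 4},
    "fault_reason": "Top-k documents do not match the query intent (bad reformulation).",
    "first_mistake_reason": "Steps 1-3 (planning and query parsing) are consistent with the task.",
    "non_correction_reason": "Later self-reflection reviewed the summary, not the corpus; "
                             "the off-target documents were never replaced.",
    "evaluator_confidences": {"fault": 92, "primacy": 88, "non_correction": 81},
    "iterations": 2,
}
```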

Caveats & open questions

  • Data scarcity: fault-attribution datasets are nascent; labels can be noisy. Don’t overfit to a single benchmark.
  • Context limits: very long logs (hundreds of thousands of tokens) still stress today’s models; chunking strategies may be needed.
  • Early-step bias: judges can over-select early steps; require Evaluators to explicitly test plausible later candidates.
  • Cost & latency: iterative evaluation is pricier than one-shot scoring. Use it selectively—e.g., for failures, regressions, high-stakes runs.

A builder’s checklist (copy/paste)

  • Persist full trajectories with step indices, agent IDs, and tool I/O (see the schema sketch after this list).
  • Implement Judge with structured fields: fault_reason, first_mistake_reason, non_correction_reason.
  • Implement Evaluators: (E1) fault check, (E2) primacy check, (E3) non-correction check, (E4) log-consistency rule.
  • Use confidence aggregation + early stopping (K ≤ 2–3).
  • Report strict and ±k tolerant step accuracy; archive H memory for audits.
  • Close the loop: map decisive-fault types → targeted orchestrator patches.
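
Putting the checklist’s data shapes in one place, a minimal logging schema (field names are assumptions, not a standard):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Step:
    index: int                       # step index within the trajectory
    agent_id: str                    # which agent/component acted
    action: str                      # plan, tool call, generation, ...
    tool_input: Any = None
    tool_output: Any = None

@dataclass
class FaultVerdict:
    agent_id: str
    step_index: int
    fault_reason: str
    first_mistake_reason: str
    non_correction_reason: str
    confidences: dict[str, int] = field(default_factory=dict)   # per-criterion 0-100 scores
    history: list[str] = field(default_factory=list)            # archived H memory for audits
```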

Cognaptus: Automate the Present, Incubate the Future