Opening — Why this matters now

LLM agents are no longer just answering questions. They are executing SQL, calling APIs, modifying system state, and quietly making decisions that stick. Yet most evaluations still assume a fantasy user: precise, unambiguous, and cooperative. In real deployments, users are vague, wrong, impatient, or simply human.

This gap is no longer academic. As agents enter finance, operations, and infrastructure, the cost of misunderstanding now rivals the cost of misreasoning. DRIFT‑BENCH arrives precisely at this fault line.

Background — The Oracle Assumption nobody admits

Most agent benchmarks quietly rely on what the paper calls the Oracle Assumption: user instructions are correct, complete, and intention‑aligned. This assumption held when models were passive text generators. It collapses once agents act.

Prior work explored ambiguity, clarification questions, or tool robustness — but usually in isolation. Text‑only clarification without execution risk. Tool benchmarks without pragmatic repair. User simulations without downstream consequences.

The result: a fragmented evaluation ecosystem that cannot explain why agents fail when instructions are flawed — only that they do.

Analysis — What DRIFT‑BENCH actually does

DRIFT‑BENCH reframes agent failure as a cooperative breakdown between human and machine. Its core contribution is not another task set, but a diagnostic framework that links input faults, clarification behavior, and execution outcomes.

1. A unified fault taxonomy

Instead of ad‑hoc error labels, the paper introduces four structurally distinct input flaws:

| Fault Type | What breaks | Example |
|---|---|---|
| Intention | The user's goal is wrong or mixed | Asking for SQL and fashion trends |
| Premise | The request rests on a false assumption | Querying non‑existent records |
| Parameter | A required value is missing or invalid | No date, ID, or threshold |
| Expression | The wording is linguistically ambiguous | Vague references or pronouns |

This taxonomy matters because each fault demands a different clarification strategy.
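A minimal sketch of how such a taxonomy could be represented when labeling benchmark instances is shown below; the class and field names (FaultType, FaultyInstruction, ground_truth_goal) are illustrative, not the paper's actual schema.

```python
# Illustrative encoding of the four fault types; not DRIFT-BENCH's real schema.
from dataclasses import dataclass
from enum import Enum


class FaultType(Enum):
    INTENTION = "intention"    # the stated goal itself is wrong or mixed
    PREMISE = "premise"        # the request rests on a false assumption
    PARAMETER = "parameter"    # a required value is missing or invalid
    EXPRESSION = "expression"  # the wording is ambiguous


@dataclass
class FaultyInstruction:
    text: str
    fault: FaultType
    ground_truth_goal: str     # what the user actually needed


examples = [
    FaultyInstruction("Write a SQL query and tell me this season's fashion trends",
                      FaultType.INTENTION, "Write a SQL query for the sales table"),
    FaultyInstruction("Delete order #9999",                # record does not exist
                      FaultType.PREMISE, "Verify the order exists before acting"),
    FaultyInstruction("Pull the revenue report",           # no date range given
                      FaultType.PARAMETER, "Pull the revenue report for a specific period"),
    FaultyInstruction("Clean it up like last time",        # vague reference
                      FaultType.EXPRESSION, "Re-run deduplication on the contacts table"),
]
```

Each fault type maps to a different repair move: a missing parameter invites a question, a false premise invites verification, a mixed intention invites pushback.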

2. From command‑execution to clarification‑capable agents

Agents are augmented with explicit communication actions — not just tools:

  • Ask_Parameter
  • Disambiguate
  • Propose_Solution
  • Confirm_Risk
  • Report_Blocker

Crucially, clarification is no longer decorative. It is routed through a user simulator and affects task success.
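A rough sketch of what that routing could look like, assuming clarification actions are simply dispatched to a user simulator rather than to the environment. The action labels mirror the benchmark's; the Action class, route function, and toy simulator are illustrative.

```python
# Sketch: communication actions sit alongside tool calls and are routed
# to a user simulator instead of the environment. Names beyond the five
# action labels are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable

COMMUNICATION_ACTIONS = {
    "Ask_Parameter",     # request a missing value
    "Disambiguate",      # offer interpretations and ask the user to pick
    "Propose_Solution",  # suggest a concrete plan and seek approval
    "Confirm_Risk",      # flag an irreversible or costly step
    "Report_Blocker",    # state that the task cannot proceed as given
}


@dataclass
class Action:
    name: str
    payload: str


def route(action: Action,
          user_simulator: Callable[[str], str],
          execute_tool: Callable[[Action], str]) -> str:
    """Communication actions go to the user simulator; everything else
    is treated as a tool call against the environment."""
    if action.name in COMMUNICATION_ACTIONS:
        return user_simulator(action.payload)
    return execute_tool(action)


# Toy usage: a simulator that answers one clarification question.
reply = route(
    Action("Ask_Parameter", "Which date range should the report cover?"),
    user_simulator=lambda q: "Last quarter.",
    execute_tool=lambda a: f"executed {a.name}",
)
print(reply)  # -> "Last quarter."
```

Because the reply feeds back into the episode, asking a bad question (or asking too many) has a measurable cost rather than being free filler.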

3. Personas: users are not i.i.d.

DRIFT‑BENCH models five human decision styles (Rational, Intuitive, Dependent, Avoidant, Spontaneous), grounded in established psychology. This exposes a hard truth: agent robustness is user‑dependent.

Avoidant users, who give minimal, non‑committal replies, consistently break agents. Agents recover far more readily with Spontaneous and Rational users.
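One way to picture persona-conditioned simulation is sketched below, assuming the persona mainly shapes how (and whether) clarification questions get answered. The trait values and reply templates are invented for illustration and are not taken from the paper.

```python
# Hypothetical persona-conditioned user simulator; trait values are placeholders.
import random
from dataclasses import dataclass

random.seed(0)  # reproducible toy run


@dataclass
class Persona:
    name: str
    answer_rate: float   # probability of actually answering a question
    detail: str          # how much information a reply carries


PERSONAS = {
    "Rational":    Persona("Rational",    answer_rate=0.95, detail="full"),
    "Intuitive":   Persona("Intuitive",   answer_rate=0.80, detail="partial"),
    "Dependent":   Persona("Dependent",   answer_rate=0.85, detail="deferring"),
    "Avoidant":    Persona("Avoidant",    answer_rate=0.40, detail="minimal"),
    "Spontaneous": Persona("Spontaneous", answer_rate=0.75, detail="partial"),
}


def simulate_reply(persona: Persona, question: str, ground_truth: str) -> str:
    """Return a persona-shaped reply to a clarification question."""
    if random.random() > persona.answer_rate:
        return "Whatever you think is best."               # non-committal dodge
    if persona.detail == "full":
        return ground_truth                                # complete, direct answer
    if persona.detail == "minimal":
        return ground_truth.split(",")[0]                  # bare fragment
    if persona.detail == "deferring":
        return f"Maybe {ground_truth.lower()}? You decide."
    return f"Probably {ground_truth.lower()}, I think."    # partial, hedged answer


print(simulate_reply(PERSONAS["Avoidant"], "Which table?", "The orders table, 2023 only"))
```

The point of modeling this at all: an agent tuned on fully cooperative answers never learns what to do when the reply is "whatever you think is best."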

4. White‑box vs. black‑box environments

The benchmark deliberately contrasts:

  • State‑oriented (white‑box) systems: OS, databases — where agents can inspect state.
  • Service‑oriented (black‑box) systems: APIs — where agents cannot peek behind interfaces.

This distinction turns out to be decisive.
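A minimal sketch of why it is decisive, assuming the split reduces to whether the agent can inspect environment state before acting. The class and method names (StateOrientedEnv, ServiceOrientedEnv, inspect) are illustrative, not the benchmark's interfaces.

```python
# Sketch of the white-box / black-box split: in one, premises can be
# checked against visible state; in the other, they surface only as
# failed calls after the fact. Names are assumptions.
from abc import ABC, abstractmethod


class Environment(ABC):
    @abstractmethod
    def execute(self, command: str) -> str: ...


class StateOrientedEnv(Environment):
    """White-box: the agent can look at state (files, tables) to verify
    a premise before it acts."""
    def __init__(self, tables: dict[str, list[dict]]):
        self.tables = tables

    def inspect(self, table: str) -> list[dict]:
        return self.tables.get(table, [])

    def execute(self, command: str) -> str:
        return f"ran '{command}' against {list(self.tables)}"


class ServiceOrientedEnv(Environment):
    """Black-box: only the API surface is visible, so nothing can be
    verified except by making the call itself."""
    def execute(self, command: str) -> str:
        return f"called API endpoint for '{command}' (no state visible)"


db = StateOrientedEnv({"orders": [{"id": 1}]})
print(db.inspect("orders"))                               # premise check is possible here...
print(ServiceOrientedEnv().execute("delete order 9999"))  # ...but not here
```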

Findings — The Clarification Paradox

The headline result is uncomfortable.

Interaction helps agents in transparent systems — and hurts them in opaque ones.

Across frontier models, introducing input faults causes a ~40% performance collapse. More surprisingly, multi‑turn clarification degrades performance in black‑box environments.

Why? The authors identify execution‑bias and context overload:

  • Agents proceed with risky actions ~70% of the time instead of deferring.
  • Additional dialogue history dilutes parameter grounding in API‑heavy settings.

Persona impact snapshot

| Persona | Effect on success |
|---|---|
| Rational | High recovery |
| Spontaneous | Surprisingly robust |
| Dependent | Mixed |
| Intuitive | Variable |
| Avoidant | Consistently catastrophic |

Some models show a 26% performance gap between Spontaneous and Avoidant users — an operational nightmare.

Implications — What builders should actually do

DRIFT‑BENCH quietly demolishes several comforting beliefs:

  1. More interaction ≠ more safety
  2. Clarification is not a generic skill
  3. Execution deferral is under‑learned

For real systems, this implies:

  • Clarification policies must be risk‑aware, not verbosity‑maximizing.
  • Black‑box environments demand minimal‑turn, high‑impact questions.
  • Agent evaluation must include adversarial users, not polite ones.

In regulated domains, the recommendation is blunt: agents should default to not acting when intent is unclear.
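A hedged sketch of what such a gate could look like in front of execution: act only when intent is clear and the step is reversible, ask one targeted question when that is still cheap, and otherwise defer. The thresholds and fields below are placeholders, not values from the paper.

```python
# Illustrative risk-aware gate; thresholds are untuned placeholders.
from dataclasses import dataclass


@dataclass
class PendingAction:
    description: str
    intent_confidence: float   # how sure the agent is about the user's goal
    irreversible: bool         # e.g. deletes, payments, schema changes


def gate(action: PendingAction, turns_used: int, max_turns: int = 2) -> str:
    """Return one of 'execute', 'clarify', or 'defer'."""
    if action.intent_confidence >= 0.9 and not action.irreversible:
        return "execute"
    if turns_used < max_turns and action.intent_confidence >= 0.5:
        return "clarify"       # one targeted, high-impact question
    return "defer"             # default to not acting when intent is unclear


print(gate(PendingAction("drop staging table", 0.6, irreversible=True), turns_used=2))
# -> 'defer'
```

The design choice worth noticing: the turn budget is part of the policy, which is exactly what the black‑box results argue for.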

Conclusion — Agents don’t fail loudly. They fail politely.

DRIFT‑BENCH doesn’t make agents smarter. It makes failures legible.

By abandoning the Oracle Assumption, it reveals a deeper problem: modern agents are optimized to comply, not to collaborate. Until clarification is treated as a first‑class safety mechanism — shaped by environment, risk, and human behavior — agentic systems will remain brittle where it matters most.

Cognaptus: Automate the Present, Incubate the Future.