DRIFT-BENCH: When Agents Stop Asking and Start Breaking

A user says, “Update the record with a sensible value.”

That sentence is small. The damage may not be.

For a normal chatbot, the worst outcome might be a vague answer wearing a confident expression. Annoying, yes, but usually recoverable. For an agent connected to a database, file system, workflow platform, or API service, the same ambiguity becomes operational. The model may update the wrong row, call the wrong endpoint, overwrite a file, or politely explain its mistake after making it. Charming, in the same way a self-driving forklift is charming.

That is the problem DRIFT-BENCH studies: not whether LLM agents can follow clean instructions, but whether they can survive ordinary human instructions that are incomplete, misleading, ambiguous, or built on false assumptions.¹ The paper’s central move is useful because it attacks a quiet assumption behind much agent evaluation: the user is treated as an oracle. The user supposedly knows exactly what they want, states it clearly, includes every required parameter, and never drags in irrelevant context.

Anyone who has watched real business users interact with software will need a moment to recover from the comedy.

DRIFT-BENCH replaces that fantasy with controlled input faults, multi-turn clarification, persona-driven user simulation, and grounded execution environments. Its most important lesson is not “agents should ask more questions.” That is the easy slogan. The harder result is that clarification is environment-dependent. It helps in transparent systems where the agent can inspect the world. It can hurt in opaque API systems where extra dialogue disrupts schema discipline, increases context burden, or makes the agent abandon recoverable tasks.

The paper is therefore best read as a comparison study: clean versus flawed inputs, state-oriented versus service-oriented tools, clarification versus no clarification, and cooperative versus avoidant users. The business question is not whether an agent can chat. The question is whether it knows when to ask, when to inspect, when to proceed, and when to stop touching things.

The benchmark starts by removing the oracle user

DRIFT-BENCH builds around four categories of cooperative breakdown. These are not random perturbations sprinkled onto prompts for drama. They are meant to represent ways real user instructions violate the assumptions needed for safe execution.

Fault class	What breaks	Business example	What the agent should usually do
Intention	The real goal is obscured by irrelevant, indirect, or mixed intent.	“Check this invoice, and by the way, what do you think about our new pricing page?”	Separate executable goal from conversational noise.
Premise	The request assumes something false or infeasible.	“Cancel the Friday meeting,” when no Friday meeting exists.	Verify preconditions before acting.
Parameter	Required execution details are missing or polluted.	“Send the report to the client,” without specifying which report or client.	Ask for the missing slot or inspect safe context.
Expression	The language is ambiguous, vague, or referentially unclear.	“Update that one to a reasonable value.”	Disambiguate candidates before taking irreversible action.

The distinction matters because these faults do not all ask for the same repair. A missing parameter wants a slot-filling question. A false premise wants precondition checking. An ambiguous expression wants candidate disambiguation. A mixed-intent request wants task-boundary control. Treating all of them as “unclear user input” is how one gets a very polite agent with a very expensive habit of guessing.

The paper operationalizes these faults through a data pipeline. It begins from verified tasks, extracts semantic frames, generates controlled perturbations, and injects those faults while preserving enough information for the task to remain solvable after clarification. This is an important design choice. If the perturbed tasks were simply impossible, failure would prove little. DRIFT-BENCH instead tries to isolate cooperative breakdown: the task can be solved, but only if the agent repairs the interaction.

The retained benchmark contains 200 state-oriented tasks from AgentBench subsets, split into 159 database and 41 operating-system tasks, and 150 service-oriented tasks from StableToolBench G1 subsets, split across instruction, tool, and category tasks. The authors also apply an “at least two correct” oracle filter using GPT-4o, Gemini-2.0-Flash, and Llama-3.3-70B, retaining tasks solved by at least two reference models under complete information. This filtering step is not glamorous, but it is one of the paper’s better instincts: do not diagnose communication failure on tasks that were already broken.
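The “at least two correct” filter can be pictured as a simple vote over reference-model outcomes. The sketch below is only an illustration of that idea: the reference model names come from the paper, but the task IDs, the boolean results, and the function name `oracle_filter` are hypothetical.

```python
from typing import Dict, List


def oracle_filter(results: Dict[str, Dict[str, bool]], min_correct: int = 2) -> List[str]:
    """Keep task IDs solved by at least `min_correct` reference models."""
    return [
        task_id
        for task_id, per_model in results.items()
        if sum(per_model.values()) >= min_correct
    ]


# Hypothetical outcomes under complete (unperturbed) information.
results = {
    "db-001": {"GPT-4o": True, "Gemini-2.0-Flash": True, "Llama-3.3-70B": False},
    "db-002": {"GPT-4o": False, "Gemini-2.0-Flash": True, "Llama-3.3-70B": False},
    "os-001": {"GPT-4o": True, "Gemini-2.0-Flash": True, "Llama-3.3-70B": True},
}

kept = oracle_filter(results)
print(kept)  # ['db-001', 'os-001']
```

Only tasks that survive this vote are perturbed, so a later failure can be attributed to the injected fault and not to a task that was unsolvable from the start.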

The key comparison is not model A versus model B

Many benchmark papers invite the reader to scan a leaderboard and declare a winner. DRIFT-BENCH is more useful if read differently.

The central comparison is not “which model is best?” It is “what happens when the same agent moves from oracle inputs to flawed inputs, and then from no clarification to clarification?” That reveals where the failure enters the system.

The paper evaluates three conditions:

Oracle baseline: original unperturbed instructions.
Perturbed without clarification: flawed inputs, no clarifying action allowed.
Perturbed with clarification: flawed inputs, structured clarification allowed.

The authors evaluate seven models or agent configurations: GPT-5.2, GLM-4.7, Gemini-2.5-Flash, GPT-OSS-120B, Qwen3, DeepSeek-v3.2, and Llama-4. The model list is less important than the pattern: performance drops sharply under input faults across model families.

The robustness metric is performance degradation:

$$ PD = 1 - \frac{Score_{\text{perturbed}}}{Score_{\text{clean}}} $$

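Here the clean score is the oracle baseline and the perturbed score is the same agent on the faulted version of the same tasks. Lower is better: a PD of 0 means the agent kept all of its oracle performance under flawed input, and a PD of 1 means it lost all of it.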
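As a quick worked instance of the formula, the sketch below computes PD for one made-up pair of scores. The function name and the two scores are hypothetical and are not taken from the paper’s tables.

```python
def performance_degradation(score_clean: float, score_perturbed: float) -> float:
    """PD = 1 - score_perturbed / score_clean, as defined above."""
    if score_clean <= 0:
        raise ValueError("clean score must be positive")
    return 1.0 - score_perturbed / score_clean


# Hypothetical scores: 80% success on clean inputs, 44% on perturbed inputs.
degradation = performance_degradation(score_clean=0.80, score_perturbed=0.44)
print(f"{degradation:.2%}")  # 45.00%
```

Because PD is a ratio, an agent that drops from 80% to 44% and one that drops from 40% to 22% show the same degradation, which is why the metric is best read next to the raw scores.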

Without clarification, the average degradation in state-oriented tasks is severe: 44.29% for intention faults, 40.75% for premise faults, 49.91% for parameter faults, and 45.52% for expression faults. Parameter and expression faults are especially damaging in state-oriented settings because missing or ambiguous details can directly corrupt execution in databases or operating systems.

Service-oriented tasks look less uniformly catastrophic, but not safe. Average degradation is 18.13% for intention faults, 33.84% for premise faults, 20.03% for parameter faults, and 12.62% for expression faults. That difference is partly structural. APIs constrain execution through endpoints and schemas; they may buffer some ambiguity. But buffering is not understanding. It is just a smaller blast radius, until it is not.

A practical reading:

Comparison	What the paper directly shows	Business interpretation	Boundary
Oracle vs flawed input	Agents lose large portions of task performance under systematic user-input faults.	Clean-demo success does not predict real workflow reliability.	The benchmark uses controlled perturbations, not live enterprise logs.
State vs service environments	State-oriented tasks suffer especially large degradation under parameter and expression faults.	Database, file, and system agents need stronger pre-execution checks.	Specific risk depends on tool permissions and reversibility.
No clarification vs clarification	Clarification helps in state-oriented tasks but can damage service-oriented tasks.	“Always ask a follow-up” is not a policy. It is a reflex.	The exact mechanism may vary by agent framework and API schema design.
Persona effects	Avoidant users are hardest; rational and spontaneous users are easier.	User behavior is part of agent reliability testing.	Simulated personas are useful proxies, not a substitute for production telemetry.

The benchmark is therefore not merely testing reasoning. It is testing whether the agent can preserve task intent while managing uncertainty, tool constraints, and user behavior.

Clarification helps when the agent can inspect the world

The paper’s most business-relevant result is the contrast between state-oriented and service-oriented systems.

In state-oriented tasks, clarification generally helps. Table 3 reports positive clarification gains across the state-oriented setting. The gains are especially large for parameter faults: GPT-5.2 gains 23.58 percentage points, GLM-4.7 gains 18.17 points, Gemini-2.5-Flash gains 20.17 points, GPT-OSS-120B gains 19.99 points, Qwen3 gains 22.59 points, DeepSeek-v3.2 gains 22.34 points, and Llama-4 gains 11.50 points.

That makes sense. In a database or operating-system environment, the agent can often inspect the current state. It can list files, query table schemas, check whether an object exists, compare rows, or verify preconditions. Clarification does not float in empty air. It can be grounded against an inspectable environment.

This is the “good” version of agent interaction:

The user provides a flawed instruction.
The agent notices uncertainty.
The agent asks a targeted question or checks safe context.
The user clarifies.
The agent maps the clarified intent back to real state.
Execution proceeds with lower risk.

For business automation, this supports a design principle: clarification should be attached to state inspection. The agent should not merely ask, “Can you clarify?” It should ask after narrowing the uncertainty.

Bad clarification:

“Can you clarify what you mean?”

Better clarification:

“I found two records matching ‘Bright’: the team row and the league table name. Should I update the team row, or did you mean the table-level value?”

The second question is more useful because the agent has already done safe work. It has reduced the cognitive burden on the user and constrained the possible answers. This is how clarification becomes operational, not decorative.
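A minimal sketch of “inspect first, then ask” might look like the following. It assumes a SQLite database with a hypothetical `teams` table; the table, the row names, and both function names are invented for illustration and are not from the benchmark.

```python
import sqlite3
from typing import List, Optional, Tuple


def find_candidates(conn: sqlite3.Connection, term: str) -> List[Tuple[int, str]]:
    """Read-only lookup: which rows could the user's reference mean?"""
    cur = conn.execute(
        "SELECT id, name FROM teams WHERE name LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return cur.fetchall()


def clarification_for(term: str, candidates: List[Tuple[int, str]]) -> Optional[str]:
    """Return a targeted question, or None when the referent is already unique."""
    if len(candidates) == 1:
        return None
    if not candidates:
        return f"I found no record matching '{term}'. Which record did you mean?"
    options = "; ".join(f"#{row_id} {name}" for row_id, name in candidates)
    return (
        f"I found {len(candidates)} records matching '{term}': {options}. "
        "Which one should I update?"
    )


# Hypothetical in-memory state standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO teams (name) VALUES (?)",
    [("Brighton",), ("Bright Stars",), ("Leeds",)],
)

candidates = find_candidates(conn, "Bright")
question = clarification_for("Bright", candidates)
print(question)
```

The agent only runs a `SELECT` before speaking, so nothing is modified while the referent is still uncertain. The empty-result branch is where a false premise would surface: the question changes from “which one?” to “this does not exist.”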

Clarification breaks when it becomes another source of drift

The paper’s most interesting correction to common agent-safety thinking is the “Clarification Paradox.” In service-oriented tasks, clarification can reduce performance.

This deserves careful reading. The result does not mean clarification is bad. It means clarification is not a universal safety button. In opaque API environments, the agent cannot inspect the server-side state directly. It must obey API schemas, select tools, pass exact parameters, interpret noisy responses, and maintain dialogue context. Adding a clarification loop may increase the burden rather than reduce it.

The paper’s Appendix F.2 offers two diagnostic case comparisons. These are best read as error analysis, not as the main statistical evidence. Their purpose is to explain plausible mechanisms behind the main result.

The first mechanism is clarification-induced syntactic collapse. In one StableToolBench case, the no-clarification baseline formats an initial API call correctly and retrieves article details, even though it later fails the overall task. Under clarification-enabled execution, the same kind of setup repeatedly produces invalid JSON, missing quotes, mismatched braces, and parse errors. The agent’s “conversation mode” appears to interfere with rigid tool-call formatting.

That should sound familiar to anyone building agents with verbose system prompts, growing dialogue history, and unforgiving tool schemas. At some point, the model is not “thinking more deeply.” It is juggling too many incompatible obligations and dropping the braces.
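One common defense against dropped braces is to validate every model-emitted tool call before it reaches the API. The sketch below uses only Python’s standard `json` module; the argument names and the tiny `required` schema are hypothetical, and a production system would more likely use a full JSON Schema validator.

```python
import json
from typing import Dict, List, Tuple


def validate_tool_call(raw: str, required: Dict[str, type]) -> Tuple[bool, List[str]]:
    """Parse a model-emitted argument string and check required fields."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, [f"invalid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return False, ["arguments must be a JSON object"]
    errors = []
    for name, expected in required.items():
        if name not in args:
            errors.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected):
            errors.append(f"wrong type for {name}")
    return not errors, errors


# Hypothetical schema for an article-lookup endpoint.
schema = {"article_id": str, "limit": int}

good = validate_tool_call('{"article_id": "a42", "limit": 5}', schema)
bad = validate_tool_call('{"article_id": a42, "limit": 5', schema)  # unquoted value, no closing brace
print(good)
print(bad)
```

A rejected call can be sent back to the model for repair instead of being forwarded, which keeps formatting failures separate from genuine task failures in the logs.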

The second mechanism is premature abandonment. In another case, the no-clarification agent receives an empty API result, bypasses the null output, proceeds to a second sub-task, and succeeds. With clarification enabled, the agent makes no actual clarification attempt, interprets the API noise as a terminal obstacle, and gives up or restarts. The clarification policy may lower the agent’s threshold for giving up: instead of recovering autonomously from technical noise, it waits for a human or aborts.

This distinction is important for product design. There are at least three different uncertainties:

Uncertainty type	Example	Good response	Bad response
User-intent uncertainty	“Send the report” but several reports exist.	Ask a targeted disambiguation question.	Guess the report.
Environment-state uncertainty	The file or database row may not exist.	Inspect safe state before action.	Execute based on assumption.
Tool-execution noise	API returns empty result or temporary error.	Retry, fallback, or continue with independent sub-task.	Treat tool noise as user ambiguity and abandon.

The paper’s service-oriented result is a warning against mixing these categories. A tool error is not always a reason to ask the user. An ambiguous user request is not always a reason to retry the API. And a missing parameter is definitely not a reason to invent a value while looking productive.
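The tool-noise row can be sketched as a loop that retries an empty result a bounded number of times and then moves on to the remaining sub-tasks. Everything here is hypothetical: the sub-task names, the retry budget, and the assumption that the sub-tasks are independent of each other.

```python
from typing import Callable, Dict, Optional


def run_subtasks(
    subtasks: Dict[str, Callable[[], Optional[str]]],
    max_retries: int = 1,
) -> Dict[str, Optional[str]]:
    """Run independent sub-tasks; retry empty results, but never abandon the rest."""
    results: Dict[str, Optional[str]] = {}
    for name, call in subtasks.items():
        result = None
        for _ in range(max_retries + 1):
            try:
                result = call()
            except ConnectionError:
                result = None  # treat a transient error like an empty result
            if result:
                break
        results[name] = result  # None marks an unrecovered sub-task
    return results


calls = {"n": 0}


def flaky_lookup() -> Optional[str]:
    """Hypothetical endpoint that keeps returning nothing."""
    calls["n"] += 1
    return None


def weather() -> Optional[str]:
    """Hypothetical endpoint that works."""
    return "sunny"


out = run_subtasks({"lookup": flaky_lookup, "weather": weather})
print(out)  # {'lookup': None, 'weather': 'sunny'}
```

The failed lookup is recorded, not hidden, and the user is not asked anything, because nothing about the user’s intent was unclear.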

The safety result is about premature action, not bad manners

DRIFT-BENCH uses Safe Action Rate, or SAR, to measure whether the agent avoided invoking a high-risk tool before effective clarification or refusal. This is a useful metric because it shifts the safety question from what the agent eventually says to what it does before it knows enough.

In state-oriented tasks, the paper reports that agents reach nearly 60% SAR for intention faults, but only about 29% for premise and parameter faults. The interpretation is blunt: in more than 70% of cases involving false presuppositions or missing critical values, agents proceed with execution instead of pausing for disambiguation.
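To make the metric concrete, here is one way SAR could be computed from event traces. This is a simplified reading of the definition above, not the paper’s scoring code: the tool names in `HIGH_RISK`, the event labels, and the traces are all hypothetical.

```python
from typing import List

HIGH_RISK = {"update_row", "delete_file"}  # hypothetical high-risk tool names


def is_safe(trace: List[str]) -> bool:
    """True if no high-risk tool is invoked before the first clarification or refusal."""
    for event in trace:
        if event in ("clarify", "refuse"):
            return True
        if event in HIGH_RISK:
            return False
    return True  # assumption: never touching a high-risk tool counts as safe


def safe_action_rate(traces: List[List[str]]) -> float:
    """Share of episodes in which the agent did not act prematurely."""
    return sum(is_safe(t) for t in traces) / len(traces)


traces = [
    ["select_rows", "clarify", "update_row"],  # safe: asked before writing
    ["update_row", "clarify"],                 # unsafe: wrote, then asked
    ["select_rows", "refuse"],                 # safe: refused
    ["delete_file"],                           # unsafe: acted immediately
]

sar = safe_action_rate(traces)
print(sar)  # 0.5
```

Note that the second trace ends with a perfectly reasonable clarifying question and still counts as unsafe. The metric is about ordering, not about what the agent eventually says.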

This is the operationally dangerous part. A model can be very articulate about uncertainty after the fact. That does not matter if it already updated the table.

The safety lesson is not “make the agent more cautious” in the abstract. Generic caution produces annoying agents that ask for confirmation before breathing. The better lesson is risk-gated execution.

A simple enterprise pattern would be:

Situation	Risk level	Agent behavior
Read-only query with minor expression ambiguity	Low	Infer if safe, disclose assumption, or ask concise question.
Read-only query with missing required parameter	Low to medium	Ask for the missing slot or offer candidates.
State-changing database/file/API action with missing parameter	High	Block execution until clarified.
State-changing action with false premise or non-existent target	High	Report blocker; optionally propose safe alternative.
API error with independent remaining sub-tasks	Medium	Continue recoverable steps; do not automatically abandon.
Irreversible operation with uncertain referent	Critical	Confirm risk with explicit target and consequence.
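The table above can be read as a small decision function. The sketch below encodes one possible ordering of those rules; the field names, the returned action labels, and the precedence between rules are illustrative assumptions, not something the paper prescribes. The API-error row is left out because it concerns tool noise, not the planned action itself.

```python
from dataclasses import dataclass


@dataclass
class PlannedAction:
    state_changing: bool
    irreversible: bool = False
    params_complete: bool = True
    premise_holds: bool = True
    referent_resolved: bool = True


def gate(action: PlannedAction) -> str:
    """Map a planned action to a behavior, checking the riskiest cases first."""
    if action.irreversible and not action.referent_resolved:
        return "confirm_target_and_consequence"
    if action.state_changing and not action.premise_holds:
        return "report_blocker"
    if action.state_changing and not (action.params_complete and action.referent_resolved):
        return "block_until_clarified"
    if not action.params_complete:
        return "ask_for_missing_slot"
    if not action.referent_resolved:
        return "proceed_with_disclosed_assumption"
    return "proceed"


print(gate(PlannedAction(state_changing=True, params_complete=False)))
print(gate(PlannedAction(state_changing=False, referent_resolved=False)))
```

The useful property is that the gate runs outside the model’s free-form reasoning: the model proposes an action, and a deterministic check decides whether it is allowed to execute yet.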

This is where business agent design should move. Not “chatty agents.” Not “obedient agents.” Risk-sensitive agents.

Persona testing is not UX decoration

The paper also tests how clarification works against five simulated user personas: rational, intuitive, dependent, spontaneous, and avoidant. The average accuracy by persona shows a clear behavioral dependency. Avoidant users are the hardest to serve, with an average agent score of 56.64%. Agents do better with spontaneous and rational users, averaging above 67% for both.

That result is not surprising. Avoidant users provide minimal information and resist commitment. Rational users supply precise information. Spontaneous users may be hurried, but they still give usable signals. The agent can work with energy. It struggles with evasiveness.

The more useful point is that agent reliability is partly a user-distribution problem. A workflow agent tested only with cooperative internal testers will look better than one deployed to rushed managers, uncertain junior staff, distracted customers, and people who do not want to make a decision because the decision might later be blamed on them. Enterprise reality, as usual, has excellent benchmark sabotage skills.

The paper adds a human evaluation validating persona labels in the simulator: two annotators labeled 198 samples, with 81.31% exact agreement and Cohen’s $\kappa = 0.7649$. This supports using persona labels as evaluation covariates, although the paper notes that the intuitive persona is harder to distinguish. That is a reasonable boundary. Persona simulation is not human reality, but it is still better than pretending every user behaves like a benchmark prompt.
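For readers who want to see why exact agreement and κ differ, Cohen’s kappa corrects observed agreement for the agreement two annotators would reach by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below implements that standard formula on a tiny hypothetical label set. The six labels are made up; only the last two lines use the paper’s reported figures, and the implied chance agreement is back-derived here, not reported by the authors.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical persona labels from two annotators.
ann_a = ["rational", "rational", "avoidant", "avoidant", "intuitive", "dependent"]
ann_b = ["rational", "rational", "avoidant", "intuitive", "intuitive", "dependent"]
kappa = cohens_kappa(ann_a, ann_b)
print(round(kappa, 4))  # 0.7778

# Back-derived from the reported 81.31% agreement and kappa of 0.7649.
implied_chance = (0.8131 - 0.7649) / (1 - 0.7649)
print(round(implied_chance, 3))  # 0.205
```

An implied chance agreement near 0.2 is what one would expect with five roughly balanced persona labels, so the two reported numbers are at least consistent with each other.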

For business deployment, persona testing should translate into role testing:

Business user type	Likely behavior	Agent requirement
Expert operator	Precise but impatient.	Minimize unnecessary clarification; expose assumptions.
Junior employee	Defers decisions to the agent.	Do not absorb decisions the user should own; require confirmation for risky actions.
Busy manager	Gives compressed, incomplete instructions.	Ask targeted slot questions; propose choices.
External customer	Uses vague language and domain-specific shorthand.	Disambiguate without exposing internal complexity.
Avoidant stakeholder	Refuses to commit or gives soft answers.	Escalate or stop when commitment is required.

A serious agent evaluation should include these profiles before deployment, not after the first incident review.

The RISE framework is useful because success alone hides the failure mode

DRIFT-BENCH proposes RISE: Robustness, Intelligence, Safety, and Efficiency.

This is a good evaluation framing because task success alone is too crude. Two agents can fail the same task for very different reasons. One may misunderstand the user. Another may ask the right question but execute the repaired intent incorrectly. A third may know the right answer but call a high-risk tool too early. A fourth may succeed after ten miserable clarification turns, which is not exactly automation unless the business goal was to simulate a committee meeting.

RISE separates these concerns:

RISE dimension	Metric idea	What it tells us
Robustness	Performance degradation under perturbation	How much flawed input damages task performance.
Intelligence	Clarification gain	Whether clarification improves outcomes.
Safety	Safe Action Rate	Whether the agent avoids risky execution before clarification or refusal.
Efficiency	Average interaction rounds	How much dialogue cost is required for successful recovery.

The efficiency result is especially useful. In state-oriented tasks, average successful interaction rounds rise from 4.84 under oracle conditions to 5.78 for intention faults, 5.65 for parameter faults, and 5.85 for premise faults under clarification. Clarification costs turns. Sometimes that cost is worth paying because it recovers the task. Sometimes, especially in service-oriented tasks, the system fails quickly instead of using the interaction budget effectively.
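The cost side and the benefit side can be tracked together. The sketch below assumes that clarification gain is a simple difference in percentage points between the clarification and no-clarification scores, and that efficiency is averaged over successful episodes only, as the paragraph above describes. The episode data and scores are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def avg_successful_rounds(episodes: List[Tuple[bool, int]]) -> float:
    """Mean interaction rounds over successful episodes only."""
    return mean(rounds for success, rounds in episodes if success)


def clarification_gain(score_with: float, score_without: float) -> float:
    """Gain in percentage points (assumed to be a plain difference)."""
    return score_with - score_without


# Hypothetical episodes: (task succeeded?, interaction rounds used).
episodes = [(True, 5), (True, 7), (False, 3), (True, 6)]

rounds = avg_successful_rounds(episodes)
gain = clarification_gain(score_with=62.0, score_without=40.0)
print(rounds, gain)  # 6 22.0
```

Excluding failed episodes matters: a system that fails fast would otherwise look efficient, which is exactly the service-oriented pattern described above.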

Business teams should not optimize for “fewest questions” or “most careful agent.” They should optimize for minimum effective clarification: the smallest number of interaction turns needed to reduce execution risk enough for the specific environment.

A good clarification policy is therefore not a sentence template. It is a routing function:

Detect the fault class.
Estimate action risk.
Determine whether safe state inspection is possible.
Decide whether to ask, inspect, proceed, refuse, retry, or escalate.
Preserve tool-call format discipline after clarification.
Log the uncertainty source for later evaluation.
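As a rough sketch, the routing steps might be wired together as follows. The rule order, the `asks_left` budget, and every name here are illustrative assumptions. Step five, tool-call format discipline, belongs in a separate validator and is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Situation:
    fault: str            # "intention", "premise", "parameter", "expression", or "none"
    high_risk: bool       # would a wrong execution be costly?
    can_inspect: bool     # is safe, read-only state inspection available?
    tool_error: bool = False
    asks_left: int = 2    # hypothetical clarification budget


uncertainty_log: List[Tuple[str, str, str]] = []  # (fault, source, action)


def route(s: Situation) -> str:
    """Pick one of: ask, inspect, proceed, refuse, retry, escalate."""
    if s.tool_error:
        source, action = "tool_noise", "retry"
    elif s.fault == "none":
        source, action = "none", "proceed"
    elif s.can_inspect:
        source, action = "environment_state", "inspect"
    elif s.asks_left == 0:
        source, action = "user_intent", ("escalate" if s.high_risk else "proceed")
    elif s.fault == "premise" and s.high_risk:
        source, action = "user_intent", "refuse"
    else:
        source, action = "user_intent", "ask"
    uncertainty_log.append((s.fault, source, action))
    return action


print(route(Situation(fault="parameter", high_risk=True, can_inspect=True)))   # inspect
print(route(Situation(fault="parameter", high_risk=True, can_inspect=False)))  # ask
print(uncertainty_log[-1])
```

After an `inspect` action, the caller would update the situation with what it found and route again, so a single request may pass through the function several times before anything executes.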

That last step is underrated. If the agent fails, the team should know whether the root cause was missing parameter detection, bad schema formatting, API noise, user avoidance, unsafe execution bias, or post-clarification reasoning collapse. Otherwise, the “fix” will be another prompt instruction saying “be more careful,” the enterprise software equivalent of taping a motivational quote to a leaking pipe.

What Cognaptus infers for business automation

The paper directly shows that current LLM agents are fragile under systematic input faults, that clarification has different effects across environment types, that agents often act before clarifying high-risk operations, and that user persona affects recovery.

From that, Cognaptus would infer several practical design rules for business automation agents.

First, separate clarification policy from execution policy. The agent should not decide to ask a question using the same unconstrained reasoning path that decides whether to execute a database update. Clarification should be part of a workflow controller with explicit gates.

Second, classify uncertainty before responding. Missing parameter, false premise, ambiguous expression, and irrelevant intention require different handling. A single generic “please clarify” behavior is lazy and often ineffective.

Third, treat state-oriented and service-oriented tools differently. For databases, files, calendars, and operating systems, safe inspection should often precede clarification. For APIs, the agent must maintain schema discipline and distinguish user ambiguity from technical execution noise.

Fourth, measure unsafe premature action. Success rate is insufficient. A task that succeeds after risky early execution may still be unacceptable in finance, legal operations, HR, procurement, system administration, or client communication.

Fifth, test against user behavior, not just task types. A rational test user is not a production user. The avoidant persona result is a reminder that some failures come from interaction style, not task complexity.

A compact implementation framework would look like this:

Layer	Function	Example control
Input fault detector	Identify intention, premise, parameter, expression faults.	Mark “Friday meeting” as premise-sensitive if no meeting is found.
Risk classifier	Estimate consequence of wrong execution.	Read-only lookup vs irreversible database update.
Environment router	Choose state-inspection, API retry, clarification, or refusal path.	Inspect table rows before asking; retry API before escalating.
Clarification manager	Generate targeted, minimal questions.	“Which of these two records should be updated?”
Tool-call guard	Preserve schema and validate arguments after dialogue.	JSON schema validation before API call.
Recovery logger	Record failure class and resolution path.	Track whether failure came from user ambiguity or API noise.

This is not a call for heavier agents. It is a call for less naive agents. The difference is expensive.

The boundaries of the result

DRIFT-BENCH is a diagnostic benchmark, not a deployment guarantee.

The tasks are derived from AgentBench and StableToolBench subsets, not from messy enterprise production logs. The perturbations are controlled, which is methodologically useful, but real users will combine multiple faults in less elegant ways. The user personas are simulated, although the paper does include human validation of the labels. The service-oriented failure analysis is persuasive but partly based on representative cases; it explains mechanisms rather than proving every cause quantitatively.

The model list and reported scores should also be read as a snapshot of the paper’s experimental setup, not as a permanent ranking. The more durable result is structural: agents fail when user cooperation breaks, clarification can both repair and damage execution, and safety depends on action timing.

For enterprise readers, the transfer is strongest when the planned agent has tool access and side effects: database operations, file manipulation, workflow automation, ticket handling, calendar changes, CRM updates, code execution, financial operations, procurement, and API orchestration. The transfer is weaker for purely advisory chatbots where the cost of ambiguity is mostly conversational.

In other words, the paper matters most where the agent can actually break something. A rare moment of alignment between benchmark design and business reality.

The real lesson: asking is not enough

DRIFT-BENCH gives us a better vocabulary for agent failure. The problem is not only hallucination. It is not only poor reasoning. It is not only bad tool use. It is cooperative breakdown: the agent, user, and environment fail to maintain a shared, executable understanding of the task.

That framing is valuable because it changes the engineering target. The agent should not be designed as a command executor with a chatbot attached. It should be designed as an uncertainty manager with controlled execution rights.

The practical lesson is also sharper than the usual safety slogan. Agents should not always ask more questions. They should ask when the uncertainty is user-facing, inspect when the uncertainty is environmental, retry or recover when the uncertainty is tool noise, and stop when the action is risky and the referent is unresolved.

An agent that never asks is reckless. An agent that always asks is useless. An agent that asks at the wrong time is somehow both.

DRIFT-BENCH is useful because it makes that middle ground measurable.

Cognaptus: Automate the Present, Incubate the Future.

¹ Han Bao, Zheyuan Zhang, Pengcheng Jing, Zhengqing Yuan, Kaiwen Shi, and Yanfang Ye, “DRIFT-BENCH: Diagnosing Cooperative Breakdowns in LLM Agents under Input Faults via Multi-Turn Interaction,” arXiv:2602.02455, 2026.
