DRIFT-BENCH: When Agents Stop Asking and Start Breaking
Opening: Why this matters now
LLM agents are no longer just answering questions. They are executing SQL, calling APIs, modifying system state, and quietly making decisions that stick. Yet most evaluations still assume a fantasy user: precise, unambiguous, and cooperative. In real deployments, users are vague, wrong, impatient, or simply human. This gap is no longer academic. As agents enter finance, operations, and infrastructure, the cost of misunderstanding now rivals the cost of misreasoning. DRIFT-BENCH arrives precisely at this fault line. ...