
DRIFT-BENCH: When Agents Stop Asking and Start Breaking

Opening — Why this matters now
LLM agents are no longer just answering questions. They are executing SQL, calling APIs, modifying system state, and quietly making decisions that stick. Yet most evaluations still assume a fantasy user: precise, unambiguous, and cooperative. In real deployments, users are vague, wrong, impatient, or simply human. This gap is no longer academic. As agents enter finance, operations, and infrastructure, the cost of misunderstanding now rivals the cost of misreasoning. DRIFT‑BENCH arrives precisely at this fault line. ...

February 3, 2026 · 4 min · Zelina

Hook, Line, and Confidence: When Humans Outthink the Phish Bot

Opening — Why this matters now
Phishing is no longer about bad grammar and suspicious links. It is about plausibility, tone, and timing. As attackers refine their craft, the detection problem quietly shifts from raw accuracy to judgment under uncertainty. That is precisely where today’s AI systems, despite their statistical confidence, begin to diverge from human reasoning. ...

January 11, 2026 · 4 min · Zelina

Suzume-chan, or: When RAG Learns to Sit in Your Hand

Opening — Why this matters now
For all the raw intelligence of modern LLMs, they still feel strangely absent. Answers arrive instantly, flawlessly even—but no one is there. The interaction is efficient, sterile, and ultimately disposable. As enterprises rush to deploy chatbots and copilots, a quiet problem persists: people understand information better when it feels socially grounded, not merely delivered. ...

December 13, 2025 · 3 min · Zelina

The User Is Present: Why Smart Agents Still Don't Get You

If today’s AI agents are so good with tools, why are they still so bad with people? That’s the uncomfortable question posed by UserBench, a new gym-style benchmark from Salesforce AI Research that evaluates LLM-based agents not just on what they do, but on how well they collaborate with a user who doesn’t say exactly what they want.

At first glance, UserBench looks like yet another travel-planning simulator. But dig deeper, and you’ll see it flips the standard script of agent evaluation. Instead of testing models on fully specified tasks, it mimics real conversations: the user’s goals are vague, revealed incrementally, and often expressed indirectly. Think “I’m traveling for business, so I hope to have enough time to prepare” instead of “I want a direct flight.” The agent’s job is to ask, interpret, and decide—with no hand-holding. ...

July 30, 2025 · 3 min · Zelina