Benchmarks

Meerkat or Mirage? When AI Safety Fails in Plain Sight (Across Traces)

A leaderboard can look clean until someone reads the logs. That is the uncomfortable opening lesson from Detecting Safety Violations Across Many Agent Traces, the paper that introduces Meerkat, a system for auditing repositories of AI agent traces rather than judging each interaction in isolation.1 The paper’s most concrete examples are not philosophical alignment puzzles. They are more prosaic, and therefore more damaging: benchmark scaffolds that leak answers, agents that pass evaluations by exploiting the harness, and misuse workflows that become visible only when separate benign-looking requests are connected. ...

Seeing Is Not Solving: Why AI Still Gets Stuck in 3D Worlds

Wall. That is not the grand philosophical frontier AI companies usually place in their product decks. The frontier is supposed to be reasoning, planning, tool use, autonomy, maybe a tasteful diagram with arrows and a glowing robot hand. But in a visually rich 3D world, a surprisingly large part of “autonomy” still reduces to something less glamorous: can the agent notice that it is stuck against a wall, step back, change angle, and continue? ...

The Map Is Not the Territory—But Your LLM Thinks It Is

Coffee is simple. Parking is annoying. Charging an electric vehicle while also finding a useful nearby stop is where the apparently simple request turns into a small urban planning problem wearing a chatbot costume. A user does not ask for a theorem. They ask something like: “I need to charge my car and grab coffee nearby. Where should I go?” ...

Lost in Translation (Literally): Why ASR Still Breaks in the Age of Voice Agents

Voice is supposed to be the easy interface. No menus. No forms. No training session. A user speaks, the agent understands, and some neat piece of software magic happens in the background. That is the sales pitch. It is also mostly true in a demo room, which is a place where microphones behave, users speak politely, and nobody’s child interrupts from the back seat. ...

The Art of Interrupting AI: When Knowing Isn’t Talking

The meeting-room test AI still fails: Meeting rooms are unforgiving places for intelligence. A person can know the topic, understand the slides, recognize every face around the table, and still be a terrible participant. Speak too early, and they interrupt. Speak too late, and the moment has passed. Say something factually relevant but socially tone-deaf, and the room quietly deducts points. No spreadsheet records this. Everyone notices anyway. ...

Crystal Clear? Why AI Needs to Show Its Work

Answers are cheap. In a business setting, this is slightly annoying. A model reads a chart, extracts a number, answers a compliance question, classifies a product defect, or explains a visual inspection result. The answer lands in the dashboard. It looks clean. It may even be correct. Then someone asks the only question that matters: how did it get there? ...

Balance Sheets Meet Brain Cells: Why Financial Reasoning Still Trips Up AI

A balance sheet does not care how confident a model sounds. That is the useful cruelty of accounting. A number reconciles or it does not; a subtotal sits where it belongs or it does not; treasury stock is treated correctly or it is not; a rule applies or it does not. Fluent explanation is welcome, but it is not evidence. It is the garnish. The meal is verification. ...

Paperwork Intelligence: Why AI Still Struggles With Real Enterprise Documents

Paperwork is where enterprise AI demos go to lose their charm. In a product demo, an AI agent usually receives a clean PDF, a friendly question, and a document that has the decency to behave like a document. It summarizes, retrieves, answers, maybe even produces a small spreadsheet. Everyone nods. Someone says “workflow automation.” Someone else says “agentic.” The meeting ends before anyone asks whether the same system can handle 89,000 pages of historical reports, nested tables, revised statistics, scanned pages, ambiguous row headers, and a calculation that must be correct to the last digit. ...

When AI Agents Read the Manual: Why τ-Knowledge Exposes the Limits of LLM Reasoning

A customer asks a banking agent to handle a routine request. Freeze a card. Replace a lost wallet. Open a better savings account. Close an old credit card. Apply a referral bonus. Nothing here sounds like artificial general intelligence. It sounds like Tuesday morning in a customer support queue. Then the agent has to read the internal policy, discover which tool exists, verify the customer’s account state, notice that one action blocks another, decide whether the user’s claim needs verification, and make the right database update. ...

Dare to Benchmark: Why Data Science Agents Still Trip Over Their Own Pipelines

Spreadsheet work has a special kind of comedy. A person asks an AI agent to load a dataset, clean a few columns, train a model, generate predictions, and save a prediction.csv file. The agent writes plausible Python. The model architecture is reasonable. The explanation sounds confident. Then the whole thing fails because the agent forgot to pass the filename into the execution tool. ...