Benchmarking

The Ask Gap: Why AI Agents Fail Not Because They Can’t Think — But Because They Don’t Know When to Stop

A ticket lands in the queue. It looks ordinary: update a parser, answer a business question, patch a workflow, produce a SQL query. The agent opens the files, explores the schema, writes code, runs a few checks, and submits something plausible. The output is polished. The reasoning trace is confident. The dashboard marks the task as completed. ...

CivBench: When AI Stops Guessing and Starts Planning

Scoreboards are comforting. They reduce a messy contest into one neat line: winner, loser, maybe a score. Executives like them, product teams like them, investors like them, and benchmark dashboards absolutely adore them. Strategy, unfortunately, is rude enough not to fit inside that line. A company can make the right decisions and still lose because the market turns. A trading agent can survive a bad regime by managing exposure well, then look mediocre because the final return is not spectacular. A planning system can stumble into success after making terrible intermediate choices. Outcome-only evaluation is clean, but cleanliness is not the same as truth. It is often just a good-looking loss of information. ...

The Data Diet for Reasoning Models: Why Less (But Smarter) Wins

A model-training team has a familiar bad habit: when the model fails, it asks for more. More examples. More domains. More synthetic prompts. More compute. More benchmarks to average over until the unpleasant details become small enough to ignore. This habit is understandable. It is also expensive. And, according to SuperNova, it may be the wrong first instinct. ...

When Your AI Knows Too Little: The Hidden Bottleneck in Personal Agents

Lunch is a simple word. In an AI assistant demo, “order me lunch” looks like the kind of request that should be easy by now. Open the food app. Pick something. Pay. Done. The button-clicking part is no longer the miracle. The problem is everything the user did not say. Do they avoid peanuts? Do they usually order from Tuantuan or Chilemei? Is “light lunch” about calories, price, time, or avoiding the food coma before a meeting? Should the assistant ask first, or does asking defeat the whole point of assistance? And if the user says no, does the assistant actually stop, or does it “helpfully” continue doing the wrong thing with the confidence of a junior consultant holding a fresh slide deck? ...

Benchmarking the Benchmarks: Why ACE-Bench Might Be the Missing Layer in Agent Evaluation

Agents are easy to demo and hard to measure. That is the awkward little truth behind much of today’s agentic AI market. A browser agent completes a booking task. A coding agent opens a pull request. A customer-service agent handles a simulated refund conversation. Everyone nods politely. Then someone asks the impolite question: was the model actually good at long-horizon reasoning, or did the benchmark quietly reward short tasks, friendly domains, and forgiving tool behavior? ...

AgentHazard: Death by a Thousand ‘Harmless’ Steps

The dangerous part is the workflow. A developer asks an AI agent to inspect a repository. The agent reads a config file. Normal. It checks a failing script. Normal. It edits a helper file. Still normal. It runs a command to verify the fix. Boringly normal. Then the accumulated workflow has copied sensitive variables, modified a dependency hook, or executed a command that no one would have approved if it had appeared as a single explicit request. ...

The File System Strikes Back: Why AI Agents Still Can’t Understand Your Life

Files are where AI agent demos go to become adults. In a product video, the agent opens a few clean documents, remembers your preferences, drafts an answer, books the meeting, and looks quietly inevitable. In an actual computer, the same agent faces a folder called final_final_v3, a receipt saved as an image, a calendar invite with the wrong title, a video that contains the decisive evidence at second 8, and three people who all appear in the same user’s digital life. Suddenly the assistant that “knows you” looks less like a colleague and more like an intern who has discovered search for the first time. ...

Safety First, or Task First? The Hidden Trade-off in Agentic AI

Click. That is where the safety problem begins. Not in the eloquent paragraph an AI model writes. Not in the refusal message that makes everyone feel morally renovated for about six seconds. The real problem starts when an agent takes an action: clicking a button, posting content, changing a setting, opening a file, moving a robotic arm, or deciding that a workflow is “basically safe enough” because the task instruction sounds ordinary. ...

Benchmarking the Benchmarks: When AI Can’t Agree on the Rules

Benchmarks are supposed to settle arguments. In practice, they often create better-looking arguments. A logistics optimizer claims it balances distance, delivery time, fuel cost, and risk. A robot planner claims it can trade off speed against safety. A routing engine claims it returns not one answer, but a frontier of reasonable alternatives. Fine. Then comes the awkward question: tested on what? ...

Too Many Doctors in the Room? Benchmarking the Rise of Medical AI Agent Teams

Doctors know the problem. A difficult case enters the room. One specialist sees a radiology pattern. Another notices a metabolic clue. A third worries about a rare diagnosis. Everyone has a useful fragment. Then the meeting gets longer, the notes get messier, and somehow the final answer becomes less clear than the first opinion. ...