Benchmarks

Gamma Rays and Toolboxes: Why Superintelligence May Be a Systems Engineering Problem

Toolboxes are not glamorous. Nobody gives a keynote about the screwdriver. Nobody writes breathless think-pieces about the socket wrench. But when a complicated system fails, the difference between “genius” and “expensive confusion” is often whether the operator had the right tool, used it at the right moment, and trusted it to do the part humans should not pretend to do mentally. ...

Agents in Lab Coats: When LLMs Try to Become Data Scientists

Spreadsheet first. Not the model. Not the agent. Not the impressive diagram with seven tiny boxes labeled “planner,” “executor,” “critic,” “memory,” “tool user,” “reflection,” and, inevitably, “orchestrator.” In most companies, data science automation begins with something less glamorous: a messy spreadsheet, a half-documented database table, a recurring report, a manager asking why last month’s number changed, and one unlucky analyst trying to remember whether “customer_id” means account, user, buyer, household, or whatever the CRM vendor believed in 2019. ...

Ready Player None: Why AI Still Can’t Beat the Human Game Multiverse

Games are not supposed to be frightening. A commuter plays them between meetings. A child learns one in thirty seconds. A bored adult opens a mobile puzzle, fails once, notices the trick, and improves. No dissertation. No onboarding deck. No “agentic workflow architecture.” Just look, act, remember, adjust. That is precisely why the new AI GAMESTORE paper is awkward for the current AI narrative. It does not ask whether frontier models can solve another static exam, write another function, or produce another polished paragraph about strategic transformation. They can do all of that, often impressively. The paper asks something more ordinary and therefore more damaging: can a model learn unfamiliar human-designed games under roughly human-like gameplay constraints? ...

The Reliability Gap: Why Smarter AI Agents Still Fail When It Matters

A customer service agent gets the refund policy right on Monday, wrong on Tuesday, and confidently wrong on Wednesday. A coding agent passes the benchmark, then casually rewrites the wrong file in production. A workflow agent behaves perfectly in a demo, then becomes confused when the API returns the same fields in a different order. ...

When Agents Browse Back: Why Multimodal Search Still Fails the Real Web

Search looks easy until the answer is hiding in a caption, a cropped image region, a second web page, and one annoyingly necessary intermediate clue. That is the problem BrowseComp-V3 is trying to measure. Not whether a multimodal model can recognize an object in an image. Not whether a chatbot can summarize the first search result. Not whether a web agent can click around long enough to look busy. The benchmark asks a more operationally relevant question: can an AI system browse the open web, combine text and visual evidence across multiple steps, and still arrive at the right answer? ...

Breaking Things on Purpose: How CLI-Gym Teaches AI to Fix the Real World

Broken environments are where coding agents stop looking magical. A model can write a neat Python function, patch a repository, and explain the bug with courtroom confidence. Then it enters a terminal, meets a missing shared library, a corrupted dependency, a bad environment variable, or a filesystem permission issue, and suddenly the “autonomous engineer” starts behaving like an intern trapped inside conda. Not a bad intern, perhaps. Just one who keeps running the same command and hoping Linux will become more emotionally cooperative. ...

Game On, Agents: When Multimodality Meets the Godot Engine

A game engine is a wonderfully unfair place to test an AI agent. That is exactly why it is useful. In ordinary software tasks, a coding agent can often survive by reading files, editing functions, running tests, and pretending the world is mostly text. A game engine is less polite. It asks the agent to understand spritesheets, scene hierarchies, collision shapes, animation states, shaders, camera views, object nodes, and temporal behavior. The code matters, but the code is only one layer of the object. The game itself lives somewhere between text, geometry, assets, and motion. ...

AIRS-Bench: When AI Starts Doing the Science, Not Just Talking About It

A benchmark is supposed to be a ruler. In AI, it often becomes a trophy shelf. A model gets a higher score, a chart moves up and to the right, and everyone politely pretends the hard part has been settled. That ritual works when the task is narrow: classify an image, answer a question, pass a coding test, retrieve a document. But it becomes much less comforting when the system being evaluated is no longer just answering. It is planning experiments, writing code, debugging failures, training models, interpreting results, and deciding what to try next. ...

Benchmarks Lie, Rooms Don’t: Why Embodied AI Fails the Moment It Enters Your House

The room is not impressed by your leaderboard. A robot that performs well on a public benchmark has not necessarily learned how to operate in your house. It may recognize a chair in a dataset. It may answer a visual question about a tidy image. It may even produce a confident paragraph explaining where the coffee mug should be. Then it enters a real room, with mirrors, partial views, cluttered corners, awkward sightlines, and objects that are not positioned for benchmark convenience, and suddenly the “general intelligence” starts behaving like a tourist holding the map upside down. ...

First Proofs, No Training Wheels

Proof is where AI systems stop performing confidence and start owing the reader money. A model can restate a theorem elegantly. It can cite the right neighborhood of literature. It can produce LaTeX with the visual manners of a publishable paper. None of that is a proof. It is proof-shaped material. Sometimes useful. Sometimes impressive. Sometimes a very expensive fog machine. ...