Benchmarking

Enviro-Mental Gymnastics: Why Cross-Environment Agents Still Trip Over Their Own Feet

Demo day is easy. Give an AI agent one workflow, one tool stack, one database schema, one approval rule, and one forgiving evaluator, and it may look surprisingly competent. It files the ticket. It updates the CRM. It writes the SQL query. Everyone nods. Someone says “agentic transformation,” because apparently every procurement meeting now needs a spell. ...

Game of Cones: How Physics Codes Could Fix Agent Reasoning

Controls are where agent intelligence goes to embarrass itself. Give a vision-language model a game frame, a goal, and a list of legal buttons. It may describe the scene beautifully. It may explain that the projectile is approaching, the platform is unstable, and the shiny object is probably a reward. Then it presses the wrong key, late, for the wrong duration, and walks heroically into danger. Excellent commentary. Poor organism. ...

Touch Intelligence: How DigiData Trains Agents to Think with Their Fingers

Phones are where automation goes to embarrass itself. A desktop workflow can often be forced into a neat sequence: open tab, click menu, submit form, pretend the enterprise software was designed by someone who likes people. Mobile apps are less polite. They hide features behind drawers, gestures, modals, permissions, scrolling lists, bottom sheets, dark-pattern-ish confirmations, and the occasional button that looks decorative until it suddenly matters. A human user handles this with a mixture of visual attention, memory, muscle habit, and mild resentment. A mobile control agent has to do it with pixels, UI trees, and a policy that decides where the next finger should land. ...

Agents on the Clock: How TPS-Bench Exposes the Time Management Problem in AI

A competent assistant can make a list. A useful assistant knows what must happen first. That distinction sounds small until an AI agent is asked to do something ordinary and annoyingly realistic: check a calendar, search the web, compare options, use a map, assemble a recommendation, and perhaps create a document at the end. None of those steps is exotic. The difficulty is that some of them can run in parallel, some must wait for earlier results, and some become nonsense if executed too early. This is less “genius at work” than “junior operations manager with access to too many browser tabs.” Naturally, it is where things get interesting. ...
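The ordering problem described above (some steps can run in parallel, others must wait) can be sketched as a dependency graph grouped into "waves". This is only an illustration: the task names and their dependencies are hypothetical, and nothing here is taken from TPS-Bench itself. It uses Python's standard-library `graphlib`.

```python
from graphlib import TopologicalSorter

def plan_waves(deps):
    """Group tasks into waves. Every task in a wave has all of its
    prerequisites finished, so tasks within a wave could run in parallel."""
    sorter = TopologicalSorter(deps)  # maps task -> set of prerequisite tasks
    sorter.prepare()                  # raises graphlib.CycleError on circular deps
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return waves

# Hypothetical dependencies for the everyday task in the teaser (assumed).
deps = {
    "check_calendar": set(),
    "search_web": set(),
    "compare_options": {"search_web"},
    "use_map": {"compare_options", "check_calendar"},
    "recommend": {"compare_options", "use_map"},
    "create_document": {"recommend"},
}

waves = plan_waves(deps)
for i, wave in enumerate(waves, start=1):
    print(i, wave)
```

Under these assumed dependencies, the calendar check and the web search share the first wave, while document creation lands in the last one. An agent that runs everything strictly in sequence wastes time, and one that runs everything at once executes steps before their inputs exist.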

Seeing Green: When AI Learns to Detect Corporate Illusions

Advertisement first, evidence later. That is not a moral complaint. It is a business model. A company does not need to lie outright to reshape public perception. It can show a wind turbine, a smiling engineer, a school visit, a research lab, a family cooking dinner, a national flag, or a vague line about “the energy future.” The viewer receives a feeling before receiving a claim. Conveniently, feelings are harder to audit. ...

Org Charts for Robots: What AgentArch Really Tells Us About Enterprise AI

Enterprise AI teams love an architecture diagram. Boxes, arrows, specialist agents, memory stores, tool registries, a tasteful orchestrator sitting at the top like a middle manager with JSON access. It looks reassuring. It looks intentional. It also looks suspiciously like the kind of thing that can fail in six different places while still producing a beautifully formatted answer. ...

Tool Wars, Protocol Peace: What MCP‑AgentBench Really Measures

A procurement team does not buy an AI agent because it can recite the word “interoperability” with theatrical confidence. It buys the agent because the thing can use tools, collect data, combine results, and stop before it bankrupts the token budget. That is the useful way to read MCP-AgentBench, a new benchmark for evaluating language agents inside the Model Context Protocol ecosystem.[1] The paper is not just another leaderboard with a fresh coat of protocol paint. Its more interesting result is harsher: MCP gives agents a common integration layer, but it does not make them competent tool users. Compatibility is plumbing. Competence is orchestration. ...

Model Portfolio: When LLMs Sit the CFA

Exams are useful because they are rude. They do not care that a model sounds polished, cites the right buzzwords, or can produce a gorgeous paragraph about duration risk. They ask for A, B, or C. Then they mark the answer wrong. That is why a new CFA-based benchmark is more useful than another misty-eyed essay about AI “transforming finance.” The paper evaluates GPT-4o, GPT-o1, and o3-mini on 1,560 official CFA mock multiple-choice questions across Levels I, II, and III, both zero-shot and with a domain-reasoning RAG pipeline built from official CFA curriculum materials.[1] The result is not a single leaderboard. It is closer to a routing manual. ...

Judgment Day for RAG: How L‑MARS Cuts Legal Hallucinations by Design

TL;DR for operators: Legal AI does not fail only because models “hallucinate”. That word has become the industry’s favourite fog machine. The more operational diagnosis is sharper: models fail when they answer current legal questions from stale internal memory and then dress the error in confident reasoning. The L-MARS paper is useful because it separates two tasks that vendors often blend together for convenience: retrieving current legal facts and reasoning over stable legal principles.[1] On LegalSearchQA, a new 50-question benchmark built around recent U.S. legal facts verified in March 2026, L-MARS reaches 96.0% accuracy. Zero-shot GPT-4o-mini reaches 58.0%. Chain-of-thought falls to 30.0%, because step-by-step reasoning from outdated premises merely creates a more articulate mistake. ...

Edge of Reason: Orchestrating LLMs Without a Conductor

TL;DR for operators: Symphony is not just another “let several agents chat until something sensible happens” framework. The paper’s real contribution is more specific: it proposes a decentralised orchestration pattern where agents advertise capabilities, subtasks are routed to the best-matching available worker, and final answers are selected through weighted voting across multiple reasoning paths.[1] ...
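The two mechanisms named above, capability-based routing and weighted voting, can be sketched in a few lines. This is a minimal illustration and not Symphony's implementation: the teaser does not say how capabilities are matched or how votes are weighted, so the Jaccard overlap score, the worker names, and the weights below are all assumptions.

```python
from collections import defaultdict

def route(subtask_skills, workers):
    """Return the worker whose advertised capabilities best match the subtask.
    The match score is Jaccard overlap, an assumed metric."""
    def score(caps):
        union = subtask_skills | caps
        return len(subtask_skills & caps) / len(union) if union else 0.0
    return max(workers, key=lambda name: score(workers[name]))

def weighted_vote(candidates):
    """candidates is a list of (answer, weight) pairs, one per reasoning path.
    Return the answer with the largest total weight."""
    totals = defaultdict(float)
    for answer, weight in candidates:
        totals[answer] += weight
    return max(totals, key=totals.get)

# Hypothetical workers advertising their capabilities.
workers = {
    "agent_a": {"math", "python"},
    "agent_b": {"legal", "search"},
    "agent_c": {"math"},
}
chosen = route({"math"}, workers)

# Three hypothetical reasoning paths; two of them agree on "42".
winner = weighted_vote([("42", 0.6), ("41", 0.9), ("42", 0.5)])
print(chosen, winner)
```

In this toy run the specialist `agent_c` wins the routing because its capability set matches the subtask exactly, and "42" wins the vote because two moderately weighted paths together outweigh one confident dissenter.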