Benchmarking

Agents on the Clock: How TPS-Bench Exposes the Time Management Problem in AI

Opening: Why this matters now

AI agents can code, search, analyze data, and even plan holidays. But when the clock starts ticking, they often stumble. The latest benchmark from Shanghai Jiao Tong University, TPS-Bench (Tool Planning and Scheduling Benchmark), measures whether large language model (LLM) agents can not only choose the right tools, but also use them efficiently in multi-step, real-world scenarios. The results? Let’s just say most of our AI “assistants” are better at thinking than managing their calendars. ...

The Agent Olympics: How Toolathlon Tests the Limits of AI Workflows

Opening: Why this matters now

The AI world is obsessed with benchmarks. From math reasoning to coding, each new test claims to measure progress. Yet none truly captures what businesses need from an agent: a system that doesn’t just talk, but actually gets things done. Enter Toolathlon, the new “decathlon” for AI agents, designed to expose the difference between clever text generation and real operational competence. In a world where large language models (LLMs) are being marketed as digital employees, Toolathlon arrives as the first test that treats them like one. Can your AI check emails, update a Notion board, grade homework, and send follow-up messages, all without breaking the workflow? Spoiler: almost none can. ...

Paper Tigers or Compliance Cops? What AIReg‑Bench Really Says About LLMs and the EU AI Act

The gist

AIReg‑Bench proposes the first benchmark for a deceptively practical task: can an LLM read technical documentation and judge how likely an AI system is to comply with specific EU AI Act articles? The dataset avoids buzzword theater: 120 synthetic but expert‑vetted excerpts portraying high‑risk systems, each labeled by three legal experts on a 1–5 compliance scale (plus plausibility). Frontier models are then asked to score the same excerpts. The headline: the best models reach human‑like agreement on ordinal compliance judgments, under some conditions. That’s both promising and dangerous. ...
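To make "agreement on ordinal compliance judgments" concrete, here is a minimal sketch of one common way to score agreement between two raters on a 1–5 scale: quadratic weighted kappa. This is an assumption for illustration only. The excerpt does not say which agreement statistic AIReg‑Bench uses, and the function name and the sample scores below are hypothetical.

```python
from collections import Counter


def quadratic_weighted_kappa(a, b, min_rating=1, max_rating=5):
    """Agreement between two equal-length lists of ordinal ratings.

    1.0 means perfect agreement, 0.0 means chance-level agreement, and
    negative values mean systematic disagreement. Larger gaps between
    the two ratings are penalized quadratically.
    """
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")

    n = len(a)
    span = max_rating - min_rating      # largest possible disagreement
    observed = Counter(zip(a, b))       # joint counts of (rating_a, rating_b)
    hist_a, hist_b = Counter(a), Counter(b)

    observed_penalty = 0.0
    expected_penalty = 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = (i - j) ** 2 / span ** 2
            observed_penalty += weight * observed[(i, j)] / n
            # Expected joint frequency if the two raters were independent.
            expected_penalty += weight * (hist_a[i] / n) * (hist_b[j] / n)

    if expected_penalty == 0:
        # Both raters gave the same single rating to every item.
        return 1.0
    return 1.0 - observed_penalty / expected_penalty


if __name__ == "__main__":
    # Hypothetical scores, NOT data from the paper.
    expert_scores = [4, 2, 5, 3, 1, 4]
    model_scores = [4, 3, 5, 3, 2, 4]
    print(quadratic_weighted_kappa(expert_scores, model_scores))
```

A weighted statistic fits a 1–5 scale because a model that answers 4 when the expert says 5 should count as closer than one that answers 1.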

Lost in the Long Game: What UltraHorizon Reveals About Agent Failure at Scale

TL;DR: UltraHorizon is a new benchmark that finally tests what real enterprise projects require: months‑long reasoning crammed into a single run, with 35k–200k tokens, 60–400+ tool calls, partially observable rules, and hard commitments at the end. Agents underperform badly versus humans. The pattern isn’t “not enough IQ”; it’s entropy collapse over time (the paper calls it in‑context locking) and foundational capability gaps (planning, memory, calibrated exploration). Simple scaling fails; a lightweight strategy, Context Refresh with Notes Recall (CRNR), partially restores performance. Below we translate these findings into a deployer’s playbook. ...

Org Charts for Robots: What AgentArch Really Tells Us About Enterprise AI

If you’ve ever tried turning a clever chatbot into a reliable employee, you already know the pain: great demos, shaky delivery. AgentArch, a new enterprise-focused benchmark from ServiceNow, is the first study I’ve seen that tests combinations of agent design choices—single vs multi‑agent, ReAct vs function-calling, summary vs complete memory, and optional “thinking tools”—across two realistic workflows: a simple PTO process and a gnarly customer‑request router. The result is a cold shower for one‑size‑fits‑all playbooks—and a practical map for building systems that actually ship. ...

Tool Wars, Protocol Peace: What MCP‑AgentBench Really Measures

TL;DR: MCP‑AgentBench is the first broad benchmark that evaluates language agents inside the Model Context Protocol (MCP) rather than with ad‑hoc function calls. It sets up 33 MCP servers with 188 tools and runs 600 goal‑oriented queries across six task patterns. Results flip a few assumptions: open‑source leaders (notably Qwen3‑235B‑A22B) can top the table under the ReAct style, while Claude 4 Sonnet shines with native tool‑calling. Token budgets matter: o3‑mini posts the best performance‑per‑token among big names. The meta‑lesson for builders: your agent’s interaction style must match the model, and benchmarks must reward outcome, not ritual. ...

Model Portfolio: When LLMs Sit the CFA

If your firm is debating whether to trust an LLM on investment memos, this study is a gift. It runs 1,560 questions from official CFA mock exams across Levels I–III on three model archetypes: a multimodal generalist (GPT‑4o), a deep-reasoning specialist (GPT‑o1), and a lightweight cost‑saver (o3‑mini), each both zero‑shot and with a domain‑reasoning RAG pipeline. Below is what matters for adoption, not just leaderboard bragging rights.

What the paper really shows

Reasoning beats modality for finance. The reasoning‑optimized model (GPT‑o1) dominates across levels; the generalist (GPT‑4o) is inconsistent, especially on math‑heavy Level II.

RAG helps where context is long and specialized. Gains are largest at Level III (portfolio cases) and in Fixed Income/Portfolio Management, modest at Level I. Retrieval cannot fix arithmetic.

Most errors are knowledge gaps, not reading problems. Readability barely moves accuracy; the bottleneck is surfacing the right curriculum facts and applying them.

Cost–accuracy has a sweet spot. o3‑mini + targeted RAG is strong enough for high‑volume workflows; o1 should be reserved for regulated, high‑stakes analysis.

Executive snapshot

| CFA Level | GPT‑4o (ZS → RAG) | GPT‑o1 (ZS → RAG) | o3‑mini (ZS → RAG) | Takeaway |
| I | 78.6% → 79.4% | 94.8% → 94.8% | 87.6% → 88.3% | Foundations already in‑model; RAG adds little |
| II | 59.6% → 60.5% | 89.3% → 91.4% | 79.8% → 84.3% | Level II exposes math + integration gaps; RAG helps smaller models most |
| III | 64.1% → 68.6% | 79.1% → 87.7% | 70.9% → 76.4% | Case‑heavy; RAG is decisive, especially for o1 |

ZS = zero‑shot. Accuracies are from the paper’s aggregated results. ...
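The RAG gains behind the takeaways in the executive snapshot are simple differences, and a short sketch makes them easy to check. The accuracies below are transcribed from the snapshot; the dictionary layout and the function names are illustrative assumptions, not anything from the paper.

```python
# Accuracies (%) transcribed from the executive snapshot: (zero-shot, with RAG).
SNAPSHOT = {
    "I":   {"GPT-4o": (78.6, 79.4), "GPT-o1": (94.8, 94.8), "o3-mini": (87.6, 88.3)},
    "II":  {"GPT-4o": (59.6, 60.5), "GPT-o1": (89.3, 91.4), "o3-mini": (79.8, 84.3)},
    "III": {"GPT-4o": (64.1, 68.6), "GPT-o1": (79.1, 87.7), "o3-mini": (70.9, 76.4)},
}


def rag_gains(snapshot):
    """Return {level: {model: RAG gain in percentage points}}, rounded to 0.1."""
    return {
        level: {model: round(rag - zs, 1) for model, (zs, rag) in models.items()}
        for level, models in snapshot.items()
    }


def biggest_gainer(gains_for_level):
    """Name of the model with the largest RAG gain at one level."""
    return max(gains_for_level, key=gains_for_level.get)


if __name__ == "__main__":
    for level, per_model in rag_gains(SNAPSHOT).items():
        print(f"Level {level}: {per_model} -> largest gain: {biggest_gainer(per_model)}")
```

On these figures, the largest gain at Level II goes to o3‑mini (+4.5 points) and at Level III to GPT‑o1 (+8.6 points), which matches the takeaway column. At Level I no model gains more than 0.8 points.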

Benchmarks with Benefits: What DeepScholar-Bench Really Measures

TL;DR: DeepScholar-Bench introduces a live (continuously refreshable) benchmark and a holistic automated evaluation for generative research synthesis. Its reference pipeline, DeepScholar‑base, is simple yet competitive. The headline: today’s best systems organize text well but miss key facts, under-retrieve important sources, and fail verifiability at scale. That’s not a death knell; it’s a roadmap.

Why this matters for business readers

Enterprise “research copilots” promise to digest the live web, summarize options, and provide auditable citations. In practice, three gaps keep showing up: ...

USB‑C for Agents, Stress‑Tested: What MCP‑Universe Really Reveals

The pitch: a unified plug, and a tougher test

The Model Context Protocol (MCP) is often described as the “USB‑C of AI tools”: one standardized way for agents to talk to external services (maps, finance data, browsers, repos, etc.). MCP‑Universe, a new benchmark from Salesforce AI Research, finally stress‑tests that idea with real MCP servers rather than toy mocks. It derives success from execution outcomes, not multiple‑choice guesswork, which is exactly what enterprises need to trust automation. ...

Patch Tuesday for the Law: Hunting Legal Zero‑Days in AI Governance

TL;DR: Legal zero‑days are previously unnoticed faults in how laws interlock. When triggered, they can invalidate decisions, stall regulators, or nullify safeguards immediately, with no lawsuit required. A new evaluation finds current AI models only occasionally detect such flaws, but the capability is measurable and likely to grow. Leaders should treat statutory integrity like cybersecurity: threat model, red‑team, patch.

What’s a “legal zero‑day”?

Think of a software zero‑day, but in law. It’s not a vague “loophole,” nor normal jurisprudential drift. It’s a precise, latent defect in how definitions, scope clauses, or cross‑references interact such that real‑world effects fire at once when someone notices: for example, eligibility rules void an officeholder, or a definitional tweak quietly de‑scopes entire compliance obligations. ...