Agent Evaluation

Wheel Smarts > Wheel Reinvention: What GitTaskBench Really Measures

TL;DR for operators: GitTaskBench is useful because it evaluates code agents where enterprise automation usually breaks: not in a clean coding puzzle, but inside an existing repository with dependencies, pretrained weights, fragile instructions, file formats, runtime constraints, and a user asking for a finished output.[1] The paper’s headline is not “agents can code”. We have enough confetti for that parade. The sharper finding is that agents are still inconsistent across the whole delivery chain. The best reported combination, OpenHands with Claude 3.7, reaches 72.22% execution completion but only 48.15% task pass rate. In other words, many runs produce something executable, but far fewer produce something good enough. ...
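To make that gap concrete, here is a minimal sketch of the arithmetic on the two figures quoted above. It assumes that every passing run is also counted as an executed run, which the excerpt does not state; the function name is illustrative and does not come from the paper.

```python
def pass_share_of_executed(execution_completion_pct: float, task_pass_pct: float) -> float:
    """Share of executed runs that also pass.

    Assumption: passing runs are a subset of executed runs.
    """
    if execution_completion_pct <= 0:
        raise ValueError("execution completion must be positive")
    return task_pass_pct / execution_completion_pct


# Figures quoted in the teaser for OpenHands with Claude 3.7.
ecr = 72.22  # execution completion, percent
tpr = 48.15  # task pass rate, percent

gap = ecr - tpr
share = pass_share_of_executed(ecr, tpr)

print(f"gap between executing and passing: {gap:.2f} points")
print(f"share of executed runs that pass: {share:.1%}")
```

Under that assumption, roughly two out of three executable runs are also good enough, and the remaining third is where the delivery chain breaks.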

USB‑C for Agents, Stress‑Tested: What MCP‑Universe Really Reveals

TL;DR for operators: MCP-Universe is useful because it punctures a very convenient belief: once an LLM is connected to tools through MCP, the agent is basically “integrated” and therefore close to production-ready. The paper says: adorable, but no.[1] The benchmark tests agents against real MCP servers rather than toy APIs. It covers 231 tasks across Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. It uses 11 MCP servers, 133 tools, and 84 execution-based evaluators, including dynamic evaluators that retrieve live ground truth for time-sensitive tasks. ...

Mind Games for Machines: How Decrypto Reveals the Hidden Gaps in AI Reasoning

TL;DR for operators: Meetings are easy to automate until someone has to understand what everyone else thinks everyone else knows. That is the useful discomfort created by Decrypto, a new benchmark for multi-agent reasoning and theory of mind in language models.[1] The benchmark is built around a simple word game. Alice and Bob share four secret keywords. Alice receives a three-digit code and gives three public hints. Bob must recover the code. Eve sees the same hints but does not know the secret keywords and tries to intercept. Alice’s job is therefore not “give good clues.” It is “give clues calibrated to Bob’s knowledge while limiting Eve’s inference.” Welcome to enterprise communication, but with fewer calendar invites. ...
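The round described above can be sketched as a small data structure plus a scoring function. This is an illustrative model, not the benchmark's code: the names `Round` and `score_round`, the outcome labels, the example keywords, and the assumption that the code is three distinct digits in 1..4 (one per keyword slot) are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Code = Tuple[int, int, int]


@dataclass(frozen=True)
class Round:
    keywords: Tuple[str, str, str, str]  # shared by Alice and Bob, hidden from Eve
    code: Code                           # seen only by Alice
    hints: Tuple[str, str, str]          # public: Bob and Eve both see these


def validate_code(code: Code) -> None:
    # Assumption: each digit points at one of the four keywords, no repeats.
    if len(set(code)) != 3 or any(d not in (1, 2, 3, 4) for d in code):
        raise ValueError(f"invalid code: {code}")


def score_round(rnd: Round, bob_guess: Code, eve_guess: Code) -> Dict[str, bool]:
    """Alice wants both flags False: Bob decodes the hints, Eve does not."""
    validate_code(rnd.code)
    return {
        "miscommunication": bob_guess != rnd.code,
        "interception": eve_guess == rnd.code,
    }


# Hypothetical round: hints point at keywords 2, 3 and 4 in that order.
rnd = Round(
    keywords=("star", "jazz", "thunder", "plane"),
    code=(2, 3, 4),
    hints=("saxophone", "storm", "wings"),
)
result = score_round(rnd, bob_guess=(2, 3, 4), eve_guess=(2, 4, 3))
print(result)
```

The two flags pull in opposite directions: hints obvious enough for Bob are also easier for Eve, which is why the task tests modelling of what each listener knows.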

Bias Busters: Teaching Language Agents to Think Like Scientists

TL;DR for operators: Language-model agents do not merely make wrong causal guesses. In this paper, they gather evidence in a biased way, then interpret that evidence through the same bias. That is the uncomfortable part. The study turns the classic Blicket Test from developmental psychology into a text-based active exploration game for LM agents. The agent must test objects, observe whether a machine turns on, then infer which objects are “Blickets” and whether the hidden rule is disjunctive (any Blicket activates the machine) or conjunctive (all relevant Blickets must be present together).[1] ...
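The two hidden rules can be written as a single predicate, which also shows which experiment tells them apart. This is a minimal sketch: the function name, the rule strings, and the object labels are illustrative and are not taken from the paper's environment.

```python
from typing import FrozenSet, Iterable


def machine_on(placed: Iterable[str], blickets: FrozenSet[str], rule: str) -> bool:
    """Return True if the machine activates for the objects placed on it."""
    placed_set = set(placed)
    if rule == "disjunctive":
        # Any single Blicket is enough.
        return bool(placed_set & blickets)
    if rule == "conjunctive":
        # Every Blicket must be present at the same time.
        return bool(blickets) and blickets <= placed_set
    raise ValueError(f"unknown rule: {rule}")


blickets = frozenset({"A", "B"})  # hypothetical ground truth

# Placing both Blickets looks identical under either rule...
both_disj = machine_on(["A", "B"], blickets, "disjunctive")
both_conj = machine_on(["A", "B"], blickets, "conjunctive")

# ...while placing one Blicket alone separates the two hypotheses.
one_disj = machine_on(["A"], blickets, "disjunctive")
one_conj = machine_on(["A"], blickets, "conjunctive")

print(both_disj, both_conj, one_disj, one_conj)
```

An agent that only ever tests objects in combination never runs the discriminating experiment, which is one concrete way biased evidence gathering can lead to biased inference.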

Scaling Trust, Not Just Models: Why AI Safety Must Be Quantitative

TL;DR for operators: The paper’s practical message is simple enough to be uncomfortable: “use a smarter model to supervise the risky model” is not a safety strategy. It is an experiment waiting to be measured. Engels, Baek, Kantamneni, and Tegmark propose a way to measure scalable oversight as a two-player contest between a Guard and a Houdini.[1] The Guard is the overseer: auditor, judge, monitor, containment system, or reviewer. The Houdini is the model trying to defeat oversight: deceive, persuade, insert a backdoor, or escape a simulated control environment. Each side receives a domain-specific Elo score, and the paper studies how that score changes as general model capability increases. ...
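For readers who have not met Elo outside chess, here is a minimal sketch of how a rating gap maps to a win probability. It uses the standard logistic Elo formula with the conventional 400-point scale; the excerpt does not say how the paper fits its domain-specific scores, so treat the formula choice and the example ratings as assumptions.

```python
def guard_win_probability(guard_elo: float, houdini_elo: float, scale: float = 400.0) -> float:
    """Standard Elo expected score for the Guard against the Houdini.

    Assumption: the conventional base-10 logistic with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((houdini_elo - guard_elo) / scale))


# Hypothetical ratings, chosen only to illustrate the shape of the curve.
even = guard_win_probability(1000.0, 1000.0)
ahead = guard_win_probability(1200.0, 1000.0)
behind = guard_win_probability(1000.0, 1200.0)

print(f"even match:      {even:.3f}")
print(f"guard +200 Elo:  {ahead:.3f}")
print(f"guard -200 Elo:  {behind:.3f}")
```

The operational point is that oversight success depends on the gap between the two ratings in a given domain, not on the Guard being capable in the abstract.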