Benchmarking

USB‑C for Agents, Stress‑Tested: What MCP‑Universe Really Reveals

TL;DR for operators MCP-Universe is useful because it punctures a very convenient belief: once an LLM is connected to tools through MCP, the agent is basically “integrated” and therefore close to production-ready. The paper says: adorable, but no.1 The benchmark tests agents against real MCP servers rather than toy APIs. It covers 231 tasks across Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. It uses 11 MCP servers, 133 tools, and 84 execution-based evaluators, including dynamic evaluators that retrieve live ground truth for time-sensitive tasks. ...

Patch Tuesday for the Law: Hunting Legal Zero‑Days in AI Governance

TL;DR for operators Legal risk usually enters the boardroom through contracts, investigations, licensing, or compliance failures. This paper asks a colder question: what if the legal system itself contains undiscovered vulnerabilities, and future AI systems become good at finding them before institutions can repair them?1 The paper calls these vulnerabilities Legal Zero-Days. The analogy is deliberate. In cybersecurity, a zero-day is not just “a bug.” It is a flaw that matters because it is unknown, exploitable, and hard to patch quickly. Here, the bug lives inside laws, regulations, administrative procedures, or the interaction among them. The exploit is not malware. It is a legal discovery that suddenly makes a safeguard fail, a regulator hesitate, or a government process jam. ...

Meta-Game Theory: What a Pokémon League Taught Us About LLM Strategy

TL;DR for operators A Pokémon tournament sounds unserious until you notice what it does better than many enterprise AI pilots: it forces models to make constrained, sequential, adversarial decisions, then records not only what they did but why they said they did it. The paper behind this article introduces LLM Pokémon League, a benchmark where eight models from the GPT, Claude, and Gemini families act as Pokémon trainers. Each model selects a six-member team, then makes turn-by-turn battle decisions in a zero-shot setting. The framework captures team-building rationales, move choices, switching decisions, and explanations throughout the tournament.1 ...

FAITH in Numbers: Stress-Testing LLMs Against Financial Hallucinations

TL;DR for operators FAITH is useful because it changes the hallucination question from “Does the model sound right?” to “Can the model reconstruct a known financial number from the exact tables and surrounding text that justify it?”1 That sounds modest. It is not. In finance, modest is usually where the damage hides. ...

Longer Yet Dumber: Why LLMs Fail at Catching Their Own Coding Mistakes

TL;DR for operators Code review usually starts after code exists. FPBench argues that this is already too late. The paper behind FPBench tests whether large language models can detect faulty premises in code-generation requests before obediently producing code from them.1 The answer is awkward. Many models can identify the flaw when explicitly told to check the question first, but most do not do so proactively. They behave less like careful engineers and more like very fast interns with a tragic respect for bad tickets. ...

Love in the Time of Context: Why LLMs Still Don't Get You

TL;DR for operators Personalization does not fail because the model forgot your birthday. That would be almost charming. It fails because the system remembers too much in the wrong shape. The Cupid benchmark tests whether LLMs can infer a user’s context-dependent preference from prior multi-turn interactions and apply it to a new request.1 The setup is deliberately business-relevant: users do not announce a clean preference profile; they reveal expectations through feedback, correction, and mild conversational friction. Very realistic. Nobody fills out a YAML file called my_deeply_contextual_preferences.yml, at least not outside certain Slack channels. ...

Mirage Agents: When LLMs Act on Illusions

TL;DR for operators LLM agents do not merely hallucinate by saying false things. They hallucinate when they act on a version of the world that does not match the task, the history, or the screen in front of them. That is the useful idea in MIRAGE-Bench: it treats agent hallucination as context-unfaithful action. The agent may click a button that is not there, assume a page transition succeeded when it did not, answer a colleague’s question with invented information, submit code despite failed tests, or report success when the environment says otherwise. Very industrious. Very confident. Very much not what you want near production systems. ...

Sound and Fury Signifying Stock Picks

TL;DR for operators Finfluencer videos are not just “text with a face attached.” They contain ticker symbols on charts, spoken recommendations, gestures, confidence, hedging, hype, and the occasional performance of certainty. VideoConviction turns that mess into a benchmark: 288 YouTube videos from finance influencers, 687 stock recommendation segments, 6,063 expert annotations, transcripts, metadata, and a 1–3 conviction score grounded in tone, facial expression, delivery, and consistency between title and content.1 ...

Words, Not Just Answers: Using Psycholinguistics to Test LLM Alignment

TL;DR for operators Most AI evaluation still asks whether a model can produce the right answer. This paper asks a quieter but more commercially awkward question: when a model uses a word, does it attach human-like emotional, concrete, familiar, gendered, or sensory associations to that word?1 The authors propose using established psycholinguistic word norms as an automated alignment test. Instead of hiring new human raters every time, they reuse datasets where humans have already rated thousands of English words on features such as arousal, valence, concreteness, imageability, familiarity, gender association, and sensory modalities. ...

Playing with Strangers: A New Benchmark for Ad-Hoc Human-AI Teamwork

TL;DR for operators Teamwork is the awkward part of agentic AI. It is easy to show a model completing a task when the environment is clean, the instructions are explicit, and the other “teammates” behave exactly as expected. Real deployments are less polite. Humans omit context, follow local conventions, adapt unevenly, and occasionally do something that looks wrong only because the system has misunderstood the room. ...