Benchmarks

Red Flag on the Track: Why LLMs Still Struggle with Real Algorithmic Reasoning

In the world of AI benchmarks, most roads lead to flashy competitions: solving coding puzzles, climbing Codeforces ratings, or passing Olympiad-level problems. But a new benchmark — FormulaOne — changes the race. It doesn’t ask, “Can you win a medal?” It asks, “Can you think like a researcher?” And the answer from today’s frontier LLMs? A resounding no. From Codeforces Champs to Research Rookies The authors of FormulaOne strip away the glitz of competitive programming and delve into something far more consequential: research-grade algorithmic problems grounded in Monadic Second-Order (MSO) logic over graphs. These aren’t out-of-distribution visual puzzles like ARC. They’re in-distribution, theoretically tractable problems designed with precision to demand multi-step symbolic reasoning, mathematical insight, and clean implementation. ...

The First Hurdle: Why Coding Agents Struggle with Setup

In the race to build autonomous software engineers, large language model (LLM) agents like Devin and Copilot Chat are lauded for fixing bugs, writing code, and even completing tasks from GitHub issues. But what happens when the code doesn’t even run? That’s the uncomfortable gap SetupBench aims to measure—and the results are sobering. SetupBench introduces a 93-task benchmark evaluating a foundational but under-tested skill: bootstrapping a development environment from scratch. Unlike prior benchmarks that hand agents a fully pre-configured Docker container, SetupBench drops them into a barebones Linux sandbox and challenges them to install dependencies, initialize databases, configure background services, and resolve real-world version conflicts. It sounds simple. It isn’t. ...

Passing Humanity's Last Exam: X-Master and the Emergence of Scientific AI Agents

Is it possible to train a language model to become a capable scientist? That provocative question lies at the heart of a new milestone in AI research. In SciMaster: Towards General-Purpose Scientific AI Agents, a team from Shanghai Jiao Tong University introduces X-Master, a tool-augmented open-source agent that has just achieved the highest score ever recorded on Humanity’s Last Exam (HLE)—surpassing even OpenAI and Google. But what makes this feat more than just a leaderboard update is how X-Master got there. Instead of training a larger model or fine-tuning on more data, the researchers innovated on agentic architecture and inference-time workflows. The result? An extensible framework that emulates the exploratory behavior of human scientists, not just their answers. ...

Ping, Probe, Prompt: Teaching AI to Troubleshoot Networks Like a Pro

When a network fails, it doesn’t whisper its problems—it screams in silence. Packet drops, congestion, and flapping links rarely announce themselves clearly. Engineers must piece together clues scattered across logs, dashboards, and telemetry. It’s a detective game where the evidence hides behind obscure port counters and real-time topological chaos. Now imagine handing this job to a Large Language Model. That’s the bold challenge taken up by researchers in “Towards a Playground to Democratize Experimentation and Benchmarking of AI Agents for Network Troubleshooting”. They don’t just propose letting LLMs debug networks—they build an entire sandbox where AI agents can learn, act, and be judged on their troubleshooting skills. It’s not theory. It’s a working proof-of-concept. ...

Mind the Gap: Fixing the Flaws in Agentic Benchmarking

If you’ve looked at any leaderboard lately—from SWE-Bench to WebArena—you’ve probably seen impressive numbers. But how many of those reflect real capabilities of AI agents? This paper by Zhu et al. makes a bold claim: agentic benchmarks are often broken, and the way we evaluate AI agents is riddled with systemic flaws. Their response is refreshingly practical: a 33-point diagnostic called the Agentic Benchmark Checklist (ABC), designed not just to critique, but to fix the evaluation process. It’s a must-read not only for benchmark creators, but for any team serious about deploying or comparing AI agents in real-world tasks. ...

Grounded and Confused: Why RAG Systems Still Fail in the Enterprise

Grounded and Confused: Why RAG Systems Still Fail in the Enterprise If you’ve been following the RAG (retrieval-augmented generation) hype train, you might believe we’ve cracked enterprise search. Salesforce’s new benchmark—HERB (Heterogeneous Enterprise RAG Benchmark)—throws cold water on that optimism. It exposes how even the most powerful agentic RAG systems, armed with top-tier LLMs, crumble when facing the chaotic, multi-format, and noisy reality of business data. Deep Search ≠ Deep Reasoning Most current RAG benchmarks focus on shallow linkages—documents tied together via entity overlap or topic clusters. HERB rejects this toy model. It defines Deep Search as not just multi-hop reasoning, but searching across unstructured and structured formats, like Slack threads, meeting transcripts, GitHub PRs, and internal URLs. It’s what real enterprise users do daily, and it’s messy. ...

Mind Games for Machines: How Decrypto Reveals the Hidden Gaps in AI Reasoning

As large language models (LLMs) evolve from mere tools into interactive agents, they are increasingly expected to operate in multi-agent environments—collaborating, competing, and communicating not just with humans but with each other. But can they understand the beliefs, intentions, and misunderstandings of others? Welcome to the world of Theory of Mind (ToM)—and the cleverest AI benchmark you haven’t heard of: Decrypto. Cracking the Code: What is Decrypto? Inspired by the award-winning board game of the same name, Decrypto is a three-player game of secret codes and subtle hints, reimagined as a benchmark to test LLMs’ ability to coordinate and deceive. Each game features: ...

Mind the Context: How ContextAgent Listens, Sees, and Acts Before You Ask

Introduction: From Reaction to Proaction Imagine an assistant that doesn’t wait for your command. It notices you’re standing by a bus stop late at night and proactively checks the next bus arrival. If it’s too far off, it suggests calling a ride instead. Welcome to the world of ContextAgent — a proactive, context-aware Large Language Model (LLM) agent designed to act before you’re forced to ask. While most LLM agents still require explicit prompts and work in tightly scoped environments like desktops, ContextAgent leverages open-world sensory inputs (from devices like smart glasses, earphones, and smartphones) to understand user context and offer unobtrusive help. ...

Raising the Bar: Why AI Competitions Are the New Benchmark Battleground

In the rapidly evolving landscape of Generative AI (GenAI), we’ve long relied on static benchmarks—standardized datasets and evaluations—to gauge model performance. But what if the very foundation we’re building our trust upon is fundamentally shaky? Static benchmarks often rely on IID (independent and identically distributed) assumptions, where training and test data come from the same statistical distribution. In such a setting, a model achieving high accuracy might simply be interpolating seen patterns rather than truly generalizing. For example, in language modeling, a model might “memorize” dataset-specific templates without capturing transferable reasoning patterns. ...