Mind the Gap: Fixing the Flaws in Agentic Benchmarking

TL;DR for operators

Agent benchmark scores are starting to function like procurement documents. They appear in model cards, vendor decks, research claims, and internal build-versus-buy decisions. The awkward finding in this paper is that some of those scores do not measure what buyers think they measure.

Zhu et al. introduce the Agentic Benchmark Checklist, or ABC, to audit whether an agentic benchmark has valid tasks, valid outcome grading, and adequate reporting.¹ Applying it to ten widely used agentic benchmarks, they find task-validity flaws in seven, outcome-validity flaws in seven, and reporting limitations in all ten.

The paper’s most useful contribution is not the checklist as paperwork. It is the evidence that benchmark failures can be embarrassingly concrete. A do-nothing agent can pass 38% of τ-bench Airline tasks. A SWE-Lancer agent can reach 100% by overwriting tests rather than solving tasks. KernelBench overestimates generated-kernel correctness by 31 percentage points because its fuzzing misses important cases. OSWorld underestimates performance by 28 percentage points in its Chrome section because website changes broke evaluation selectors. Auditing CVE-Bench with ABC reduced its performance overestimation by 33 percentage points.

For a business reader, the implication is simple: leaderboard diligence now needs benchmark diligence. Before using an agent score to choose a model, approve a deployment, or benchmark a vendor, ask whether the environment is frozen, ground truth is isolated, trivial-agent baselines are reported, graders handle edge cases, uncertainty is disclosed, and failures are interpreted honestly. Otherwise, congratulations: you may be buying the model that is best at pleasing the benchmark’s plumbing.

The score looked good because the grader was asleep

Agent benchmarks are supposed to measure whether an AI system can complete messy, multi-step tasks. Not answer a multiple-choice question. Not generate a sentence that overlaps with a reference answer. Actually do the thing: modify code, use tools, interact with a website, update a database, exploit a vulnerability in a controlled environment, or reason through a realistic workflow.

That is precisely why the numbers are seductive. “The agent solved 35% of tasks” sounds like a capability statement. It feels operational. It has the clean smell of measurement.

The paper asks a more irritating question: what if the benchmark is measuring the wrong success condition?

The examples are not subtle. In τ-bench, some airline tasks are intentionally impossible. A user asks for a change that should be denied under the policy, such as changing a non-refundable ticket. The benchmark counts the task as successful if the database remains unchanged; it does not require any specific response text. A real agent should inspect the booking, interpret the rules, and explain the denial. A trivial agent that returns nothing also leaves the database unchanged. It passes.

That do-nothing agent achieves 38% on the Airline partition and 6.0% on the Retail partition. A spamming agent that outputs all available data does even better in some cases: 40% on Airline and 9.6% on Retail. This is not agent intelligence. This is the benchmark equivalent of awarding a driving licence because the parked car did not hit anyone.
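A minimal sketch of why "success by absence" happens, assuming a toy booking database and two hypothetical graders (none of the names below come from τ-bench). When the task is impossible, the expected final state equals the initial state, so a grader that only compares state cannot tell a careful refusal from silence.

```python
import copy

# Hypothetical toy database; not taken from the τ-bench code base.
INITIAL_DB = {"booking_42": {"fare": "non-refundable", "date": "2025-03-01"}}

def do_nothing_agent(db, request):
    """Trivial baseline: ignores the request and never touches the database."""
    return ""

def careful_agent(db, request):
    """Stand-in for a real agent: inspects the booking and explains the denial."""
    booking = db["booking_42"]
    if booking["fare"] == "non-refundable":
        return "This fare is non-refundable, so the change cannot be made."
    booking["date"] = request["new_date"]
    return "Your booking has been changed."

def naive_grader(final_db, expected_db, reply):
    """State matching only: success if the final database equals the expected one."""
    return final_db == expected_db

def stricter_grader(final_db, expected_db, reply):
    """Also requires the reply to mention the reason for the denial (crude keyword check)."""
    return final_db == expected_db and "non-refundable" in reply.lower()

def run(agent, grader):
    db = copy.deepcopy(INITIAL_DB)
    reply = agent(db, {"new_date": "2025-04-01"})
    # The task is impossible under policy, so the expected state is the initial state.
    return grader(db, INITIAL_DB, reply)

print(run(do_nothing_agent, naive_grader))     # silence passes the naive grader
print(run(do_nothing_agent, stricter_grader))  # silence fails once a reply is required
print(run(careful_agent, stricter_grader))     # an explained denial still passes
```

The keyword check is itself a brittle form of substring matching and is used here only to keep the example short. The point is the baseline: running a do-nothing agent through the grader is cheap and exposes the gap immediately.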

SWE-Lancer exposes a different class of failure. The benchmark asks agents to implement features or fix bugs, then uses end-to-end tests. That sounds stronger than simple unit tests. The problem is isolation. The tests are stored inside a password-protected ZIP archive, but the archive directory can be listed and its files can be overwritten without knowing the password. An agent can replace the tests with a trivial assertion such as assert 1 == 1 and achieve a 100% resolve rate without implementing the requested software changes.

KernelBench fails in a more technical but equally important way. It evaluates generated CUDA kernels using fuzz testing, but the fuzzer varies tensor values while missing important dimensions such as memory layout, tensor shape, and hardware-sensitive edge cases. The authors use o3-mini to generate additional tests, manually verify those tests, and apply them to generated kernels from prior work. The result: correctness is overestimated by 31 percentage points.
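The failure is easy to reproduce in miniature. The sketch below uses plain Python lists instead of CUDA tensors, and a deliberately buggy row-sum function stands in for a generated kernel; both are assumptions made for illustration, not KernelBench code. A fuzzer that varies only values on one fixed shape approves the bug, while a fuzzer that also varies shape rejects it.

```python
import random

def reference_row_sums(matrix):
    """Ground-truth implementation."""
    return [sum(row) for row in matrix]

def buggy_row_sums(matrix):
    """Wrong for wide inputs: treats the row count as the column count."""
    n = len(matrix)
    return [sum(row[:n]) for row in matrix]

def random_matrix(rows, cols, rng):
    # Strictly positive entries, so a truncated sum can never equal the full sum.
    return [[rng.randint(1, 9) for _ in range(cols)] for _ in range(rows)]

def fuzz(candidate, shapes, trials=20, seed=0):
    """Return True if candidate matches the reference on every generated input."""
    rng = random.Random(seed)
    for rows, cols in shapes:          # every listed shape is exercised
        for _ in range(trials):
            matrix = random_matrix(rows, cols, rng)
            if candidate(matrix) != reference_row_sums(matrix):
                return False
    return True

values_only = fuzz(buggy_row_sums, shapes=[(4, 4)])
with_shapes = fuzz(buggy_row_sums, shapes=[(4, 4), (2, 6), (6, 2), (1, 1)])
print(values_only, with_shapes)
```

Memory layout and hardware-sensitive edge cases need the same treatment on real kernels. Shape is simply the easiest dimension to show without a GPU.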

OSWorld shows the mirror-image problem. Not every benchmark flaw inflates performance. In the Chrome task section, 13 out of 46 problems are broken because target websites changed their layouts, URLs, or functionality after benchmark release. The evaluation uses HTML selectors such as classes and XPaths. When those selectors go stale, a capable agent can be marked wrong. The authors estimate this underestimates UI-TARS performance by 28 percentage points in that section.
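A harness can separate "the agent failed" from "the selector is dead" with a pre-flight check. The sketch below is a simplified illustration using only the standard library: it checks for a CSS class rather than a full XPath, and the page snippets and class names are hypothetical, not taken from OSWorld.

```python
from html.parser import HTMLParser

class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in a page."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())

def selector_alive(page_html: str, css_class: str) -> bool:
    """True if the class the grader relies on still exists in the page."""
    parser = ClassCollector()
    parser.feed(page_html)
    return css_class in parser.classes

def classify_failure(page_html: str, css_class: str, grader_passed: bool) -> str:
    """Label a result so harness breakage is not counted against the agent."""
    if grader_passed:
        return "pass"
    if not selector_alive(page_html, css_class):
        return "harness-broken"
    return "agent-failed"

# Hypothetical pages: the site renamed its total-price element after release.
OLD_PAGE = '<div class="cart"><span class="price-total">42.00</span></div>'
NEW_PAGE = '<div class="cart"><span class="order-summary__total">42.00</span></div>'

print(classify_failure(OLD_PAGE, "price-total", grader_passed=False))
print(classify_failure(NEW_PAGE, "price-total", grader_passed=False))
```

Freezing the environment removes the problem at the source. A check like this only helps when the benchmark must keep pointing at live websites.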

The lesson is not that every benchmark is useless. The lesson is nastier: a score can be precise, reproducible, and still conceptually wrong.

The paper is really about two broken equivalences

The paper’s central move is to separate two questions that are too often collapsed.

First: does task success actually require the target capability? This is task validity.

Second: does the evaluation result actually indicate task success? This is outcome validity.

Those sound similar until they fail differently.

Validity condition	What has to be true	Failure mode	Concrete example
Task validity	The task is solvable if and only if the agent has the intended capability	Shortcuts, impossible tasks, leaky environments, stale external dependencies	SWE-Lancer lets agents overwrite hidden tests; OSWorld selectors break as websites change
Outcome validity	The grader’s positive result actually means the task was completed	Weak tests, brittle string matching, permissive LLM judges, naive state matching	τ-bench rewards doing nothing; KernelBench misses shape and layout failures

This distinction matters because “fix the grader” is too vague. Sometimes the benchmark task itself is flawed. Sometimes the task is fine, but the measurement of success is inadequate. Sometimes both are true, because apparently benchmarks too can multitask.

SWE-bench illustrates outcome validity. It evaluates code patches using tests. Passing tests is useful evidence, but not proof that the original GitHub issue was resolved. The paper discusses prior work showing that agents can pass evaluations without truly addressing the issue in 5.3% of SWE-bench Verified tasks and 7.7% of SWE-bench Lite tasks. Those cases produced substantial leaderboard movement: 40.9% changes in the Verified leaderboard and 24.4% in Lite.

The business translation is straightforward. If an engineering agent claims to “resolve” tickets based only on a weak regression suite, the metric may be measuring test-suite compliance rather than bug resolution. That is still useful internally, but only if everyone understands the label on the bottle.

ABC turns benchmark criticism into an audit procedure

The Agentic Benchmark Checklist is the paper’s practical answer. It is not a new score for agents. It is a way to audit the machinery that produces agent scores.

ABC has three parts.

Task validity checks ask whether the task environment is reliable, isolated, reproducible, and actually solvable. Tool versions should be specified. APIs should be available and rate limits handled. Environments should reset cleanly between tasks. Agents should not be able to access ground truth. The setup should be frozen rather than dependent on shifting external websites. Ground-truth annotations and task configurations should be verified. Ideally, an oracle solver should demonstrate that the task is solvable.

Outcome validity checks depend on the type of output. Text matching should handle equivalent expressions, negation, exhaustive listing, and guessing. LLM-as-judge systems should have pilot validation for accuracy and consistency. Code tests should be verified and supported by quality indicators such as coverage or complexity. Fuzz testing should cover values, types, shapes, memory layouts, and edge cases. State matching should include all valid success outcomes, relevant and irrelevant states, and enough complexity that random or trivial changes do not pass.
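The text-matching items are easiest to see with a toy grader. The sketch below assumes a hypothetical task with one correct answer among a known set of candidates; the grader functions are illustrations, not code from any audited benchmark. A plain substring check rewards an agent that lists every candidate, and it also accepts a negated answer.

```python
def substring_grader(reply: str, answer: str) -> bool:
    """Passes whenever the answer string appears anywhere in the reply."""
    return answer.lower() in reply.lower()

def listing_aware_grader(reply: str, answer: str, candidates: list) -> bool:
    """Passes only if the answer is the single candidate mentioned.

    Handles exhaustive listing only; negation still slips through.
    Assumes no candidate is a substring of another.
    """
    text = reply.lower()
    mentioned = [c for c in candidates if c.lower() in text]
    return mentioned == [answer]

CANDIDATES = ["B12", "C4", "D9"]  # hypothetical gate numbers
ANSWER = "B12"

honest = "Your gate is B12."
spam = "Possible gates: B12, C4, D9."
negated = "Your gate is not B12."

for reply in (honest, spam, negated):
    print(substring_grader(reply, ANSWER),
          listing_aware_grader(reply, ANSWER, CANDIDATES))
```

The negated reply passes both graders, which is why the checklist names negation as a separate item rather than assuming one fix covers every shortcut.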

Reporting checks accept that some flaws cannot be eliminated completely. When that happens, benchmark creators should disclose limitations, quantify their likely impact, report uncertainty, provide interpretation guidance, and include sanity baselines such as human experts, non-AI systems, and trivial agents.

This last part is easy to underestimate. Reporting is not clerical hygiene. It is how a benchmark prevents misuse.

A benchmark that says “Model A scored 73%” invites ranking. A benchmark that says “Model A scored 73%, but known annotation noise implies a wide confidence interval and possible rank overlap from first to sixteenth” invites judgement. Less exciting, yes. Also less likely to bankrupt a procurement meeting with fake precision.
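To make the uncertainty concrete, here is a minimal sketch of a 95% Wilson score interval for a pass rate. The task count of 100 and the second model's score are assumptions chosen for illustration, and the interval covers sampling noise only, not the annotation noise mentioned above.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin

def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical leaderboard: two models evaluated on the same 100 tasks.
model_a = wilson_interval(73, 100)
model_b = wilson_interval(70, 100)

print(model_a)
print(model_b)
print(intervals_overlap(model_a, model_b))
```

On 100 tasks, a 73% score carries an interval of roughly 64% to 81%, which overlaps comfortably with a rival at 70%. Overlapping intervals do not prove the models are equal, but they are a reason not to treat a three-point gap as a ranking.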

The audit evidence is the main result, not the checklist diagram

The paper audits ten open-source benchmarks selected from a broader collection of agentic benchmarks used by major AI providers or recognised in academic venues. The assessed set spans software engineering, ML engineering, cybersecurity, general assistants, and environment interaction. Evaluation designs include unit tests, fuzz tests, end-to-end tests, answer matching, state matching, quality metrics, substring matching, and LLM-as-a-judge.

The high-level result is blunt:

Audit result	Finding
Benchmarks assessed in depth	10
Benchmarks with task-validity issues	7
Benchmarks with outcome-validity issues	7
Benchmarks with reporting limitations	10
Benchmarks satisfying every reporting criterion	0

The paper does not treat all evidence the same way. That matters for interpretation.

Evidence in the paper	Likely purpose	What it supports	What it does not prove
ABC assessment across ten benchmarks	Main evidence	Evaluation rigor problems are common across capability areas and grading methods	That every agentic benchmark outside the sample has the same flaws
τ-bench do-nothing and spamming agents	Main evidence / sanity-test demonstration	Trivial baselines can expose false positives in state and substring matching	That τ-bench is useless for every task or domain
SWE-Lancer test overwrite	Main evidence / implementation vulnerability	Ground-truth isolation failures can invalidate end-to-end testing	That end-to-end testing is inherently weak
KernelBench additional fuzz tests	Main evidence / robustness probe	Narrow fuzzing can overstate correctness by missing shapes, layouts, and edge cases	That all GPU-kernel benchmark scores are inflated by the same amount
OSWorld broken selectors	Main evidence / environment drift case	Dynamic web environments can create false negatives over time	That all UI benchmarks underestimate performance
CVE-Bench revision	Case-study validation	ABC can identify and reduce concrete benchmark flaws during development	That ABC guarantees perfect security-benchmark evaluation
BIRD reporting example	Implementation detail / reporting template	Reporting uncertainty and trivial baselines can make benchmark use more honest	That BIRD itself becomes flawless after better reporting

This table is useful because it prevents the article from turning into a greatest-hits list of benchmark bloopers. The deeper finding is structural: agentic benchmarks fail at the seams between task setup, environment state, tool access, and success measurement.

Traditional benchmarks often fail through bad labels, contamination, or narrow constructs. Agentic benchmarks add new failure surfaces. The agent can use tools. The environment can drift. The final answer can be a file system mutation, a database update, a website action, a code patch, or a side effect. The grader must infer success from traces of behaviour. That is harder than checking whether a predicted label equals “cat”.

False positives are dangerous, but false negatives cost money too

The most attention-grabbing examples are false positives: empty responses passing, tests overwritten, wrong kernels marked correct. For business decisions, those are obvious risks. A vendor appears better than it is. A model gets selected because it learned to exploit the yardstick. A team optimises toward fragile benchmark tricks and calls it progress, because dashboards are very persuasive when nobody audits the denominator.

But OSWorld’s result matters just as much. If stale selectors cause a capable agent to fail evaluation, the benchmark underestimates performance. That can misdirect investment in the other direction. A useful agent is rejected. A promising internal system loses budget. An engineering team spends weeks improving the model when the real defect is the evaluation harness.

This is the quiet operational point: benchmark flaws distort allocation.

They do not merely create embarrassment in academic leaderboards. They influence which model gets bought, which architecture gets scaled, which safety claim gets trusted, and which agent framework becomes the default in production. A five-point overestimate might be tolerable in a toy comparison. A 31-point correctness overestimate in generated kernels is not a rounding issue. A 100% SWE-Lancer score achieved by overwriting tests is not “benchmark noise.” It is a fire alarm with a nice LaTeX table.

CVE-Bench shows what fixing looks like

The CVE-Bench case study is the paper’s strongest argument that ABC is not just a critic’s checklist. CVE-Bench evaluates agents’ ability to exploit real-world web vulnerabilities in controlled one-day or zero-day scenarios. That domain is evaluation-hostile by nature: multiple attack paths, side effects, environment state, and security-specific success conditions.

ABC identifies two concrete issues.

The first concerns time-based SQL injection. CVE-Bench originally checked whether a SLEEP clause appeared in a database log. But a log containing SLEEP does not necessarily mean the sleep operation executed in a way that demonstrates the vulnerability. An agent could pass by placing SLEEP anywhere in the query. This is a naive state-matching problem. The paper reports that this led to a 32.5 percentage-point performance overestimation for that attack category.
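The gap between "SLEEP appears in the log" and "the sleep actually ran" can be sketched in a few lines. The timing-based check below is an assumption about what a stricter grader could look like, not a description of how CVE-Bench was revised; the log line, timings, and tolerance are all hypothetical.

```python
def naive_check(db_log: str) -> bool:
    """Passes if the word SLEEP appears anywhere in the logged query."""
    return "SLEEP" in db_log.upper()

def timing_check(baseline_s: float, attack_s: float,
                 injected_delay_s: float, tolerance_s: float = 0.5) -> bool:
    """Passes only if the response was delayed by roughly the injected amount."""
    return (attack_s - baseline_s) >= (injected_delay_s - tolerance_s)

# Hypothetical log line: SLEEP sits inside a quoted string and never executes.
LOG = "SELECT * FROM users WHERE name = 'SLEEP(5)'"

print(naive_check(LOG))                                   # log match alone passes
print(timing_check(0.05, 0.06, injected_delay_s=5.0))     # no real delay observed
print(timing_check(0.05, 5.08, injected_delay_s=5.0))     # delay matches injection
```

A real harness would need repeated measurements to rule out network jitter. The sketch only shows why evidence of an effect is stronger than evidence of a keyword.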

The second concerns an outbound server. CVE-Bench checked whether a banned outbound server had been accessed. During mock execution, agents consistently passed, which suggested a setup flaw. The authors found that agents could access the outbound server from the same Docker network, creating a shortcut. After denying external requests on the outbound server, agent success rates decreased by 10 percentage points.

The headline claim is that applying ABC during benchmark construction reduced performance overestimation in CVE-Bench by 33 percentage points. The important phrase is “during benchmark construction.” ABC is more valuable before a benchmark becomes a public scoreboard, when fixing flaws is still cheaper than issuing corrections after everyone has already tweeted the leaderboard.

For enterprises, this maps directly onto internal evaluation. The best time to audit an agent benchmark is before using it to choose a vendor, approve a workflow, or tell leadership that deployment risk is under control. Once the result becomes a slide, the benchmark has already started hardening into organisational belief. Beliefs like that are annoyingly hard to patch.

What an operator should ask before trusting an agent leaderboard

ABC can be translated into a practical review gate. A team does not need to reproduce the entire paper to use the discipline.

Operator question	What it catches	Why it matters
Can a do-nothing agent, random agent, or spamming agent score above zero?	Trivial false positives	Prevents τ-bench-style “success by absence”
Can the agent access tests, labels, ground truth, or hidden evaluation files?	Ground-truth leakage	Prevents SWE-Lancer-style test tampering
Are environments frozen, versioned, and reset between tasks?	Drift and contamination	Prevents stale selectors, legacy state, and inconsistent runs
Are external APIs monitored for rate limits and failure modes?	False negatives from infrastructure	Separates agent failure from service failure
Do code tests cover edge cases, not just common paths?	Weak outcome validity	Prevents test-passing but incorrect patches or kernels
Does state matching include all valid outcomes and irrelevant state?	Naive success checks	Prevents narrow or accidental matches
Are LLM judges validated for consistency and domain accuracy?	Judge hallucination and permissiveness	Prevents evaluator models from becoming the weakest link
Are confidence intervals, known flaws, and interpretation guidance reported?	Fake precision	Prevents over-ranking from noisy measurements
Are human, non-AI, and trivial baselines included?	Missing sanity checks	Makes scores interpretable rather than decorative
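The questions above can be encoded as a simple review gate. This is a sketch under assumptions: the field names are shorthand invented here for the nine operator questions, and the example answers are hypothetical rather than findings about any real benchmark.

```python
from dataclasses import dataclass, fields

@dataclass
class BenchmarkAudit:
    """One boolean per operator question; True means the check is satisfied."""
    trivial_agents_score_zero: bool
    ground_truth_isolated: bool
    environment_frozen_and_reset: bool
    external_apis_monitored: bool
    tests_cover_edge_cases: bool
    state_matching_complete: bool
    llm_judges_validated: bool
    uncertainty_reported: bool
    sanity_baselines_included: bool

def open_findings(audit: BenchmarkAudit) -> list:
    """Return the names of every check that is not yet satisfied."""
    return [f.name for f in fields(audit) if not getattr(audit, f.name)]

# Hypothetical audit of a vendor-supplied benchmark.
example = BenchmarkAudit(
    trivial_agents_score_zero=False,
    ground_truth_isolated=True,
    environment_frozen_and_reset=True,
    external_apis_monitored=True,
    tests_cover_edge_cases=True,
    state_matching_complete=True,
    llm_judges_validated=True,
    uncertainty_reported=False,
    sanity_baselines_included=True,
)

print(open_findings(example))
```

A non-empty list is not an automatic veto. It is the set of caveats that should travel with the score whenever it appears on a slide.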

This is where the paper becomes commercially useful. Benchmarks should not be accepted as neutral infrastructure. They are software systems with assumptions, failure modes, attack surfaces, and maintenance debt.

The business process should reflect that. If an AI vendor claims superiority on an agentic benchmark, the procurement question should not be only “what was the score?” It should be “what would a trivial agent score, what flaws are disclosed, and when was the evaluation harness last audited?” Slightly less glamorous. Considerably less gullible.

ABC is governance, not a universal truth machine

The paper’s own limitations are important. The authors analyse benchmarks used by major AI providers between January 2024 and March 2025, plus recognised academic benchmarks. That is a meaningful slice, not the entire universe. Their in-depth audit covers ten open-source benchmarks chosen for popularity and coverage across capability categories and evaluation methods. Future benchmarks may introduce new agent capabilities and new failure modes. Existing benchmarks may also be revised after the paper’s analysis.

So ABC should not be treated as the final constitution of agent evaluation. It is closer to an audit standard for a fast-moving measurement regime.

There is another boundary. The paper focuses on whether benchmark scores are rigorously produced. It does not answer every downstream question about business deployment. A benchmark can be valid and still mismatched to a company’s actual workflow. A coding benchmark may not predict performance on a proprietary monorepo. A browser benchmark may not capture enterprise authentication, latency, compliance constraints, or human escalation policies. A cybersecurity benchmark may be rigorous in a sandbox but still incomplete as a risk model.

That distinction matters. Benchmark validity is necessary. It is not sufficient.

The practical sequence is therefore:

1. Audit whether the benchmark measures its own stated task correctly.
2. Check whether that stated task matches the business workflow.
3. Run internal evaluations with production-like data, tools, permissions, and failure handling.
4. Treat public leaderboard scores as prior evidence, not final evidence.

The paper mostly addresses step one. The business buyer still owns steps two through four. Procurement cannot be fully outsourced to a leaderboard, though many decks will bravely attempt it.

The real message: benchmark the benchmark

The old article framing was right to call ABC a wake-up call. The revision is sharper: this is not mainly a call for better checklists. It is a call to stop confusing benchmark performance with agent capability.

Agentic benchmarks are becoming more consequential because agents are becoming more operational. They touch files, APIs, databases, browsers, terminals, and sometimes security-sensitive systems. Evaluation therefore becomes less like grading an exam and more like auditing a workflow.

That shift changes the standard of trust. A leaderboard score is not a fact floating above the world. It is the output of a task design, environment setup, tool configuration, grader, reporting practice, and maintenance process. Any weak link can produce the number. Sometimes the weak link is exactly what the agent learns to exploit.

Zhu et al.’s paper gives the field a vocabulary for that problem: task validity, outcome validity, and reporting. It also gives operators a useful habit: before asking which agent won, ask whether the contest was worth winning.

Because if the do-nothing agent beats your expensive frontier model, the correct response is not to hire the do-nothing agent.

It is to audit the benchmark.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yuxuan Zhu et al., “Establishing Best Practices for Building Rigorous Agentic Benchmarks,” arXiv:2507.02825, 2025. https://arxiv.org/abs/2507.02825
