Benchmarking the Benchmarks: When AI Can’t Agree on the Rules

Benchmarks are supposed to settle arguments. In practice, they often create better-looking arguments.

A logistics optimizer claims it balances distance, delivery time, fuel cost, and risk. A robot planner claims it can trade off speed against safety. A routing engine claims it returns not one answer, but a frontier of reasonable alternatives. Fine. Then comes the awkward question: tested on what?

That is the question behind Bridging the Evaluation Gap: Standardized Benchmarks for Multi-Objective Search, a new paper introducing a benchmark suite for exact and approximate multi-objective search (MOS).¹ The paper does not offer a faster search algorithm. It does not announce a grand theory of intelligence. It does something less glamorous and more useful: it tries to fix the evaluation substrate.

That matters because multi-objective search has a particular benchmarking trap. A dataset can look difficult because it is large, while being weak as a test of trade-off reasoning. The familiar example in the paper is the DIMACS road-network family, where distance and travel time are strongly correlated. In the New York and Florida instances, the paper reports edge-level Pearson correlations of $\rho = 0.96$ and $\rho = 0.97$. If two objectives move together that closely, the “multi-objective” problem is partly wearing a costume.

A large graph with redundant objectives can still test scale. It does not necessarily test conflict. That difference is where the paper earns its keep.

The real problem is not missing datasets; it is incompatible evidence

Multi-objective search asks for paths that balance several non-negative additive costs. A path is not judged by one scalar score. Instead, the goal is to compute a Pareto-optimal set: solutions where no alternative improves one objective without worsening another. Approximate MOS relaxes this requirement using $\varepsilon$-dominance, allowing a compact set to cover the exact frontier within a specified tolerance.

The formalism is clean. The evaluation practice has been less clean.

The paper’s diagnosis is that MOS research has drawn benchmark instances from different communities: road networks, synthetic graph generators, grid pathfinding, and robotics. Each source often uses different objectives, different start-goal query procedures, and different reporting conventions. This makes cross-paper comparison difficult. More importantly, it hides a structural variable that can dominate algorithm behavior: how objectives interact.

The authors therefore standardize four benchmark families under one classical MOS formulation: non-negative additive edge-local costs evaluated by Pareto dominance. For each family, they provide fixed graph instances, fixed start-goal queries, and reference exact and approximate Pareto solution sets. The key contribution is not merely “more data.” It is controlled diversity.

A useful benchmark suite should not be a trophy shelf of famous datasets. It should be a stress chamber.

Road networks test scale, but also reveal the comfort of correlated objectives

Road networks are the obvious starting point. They are large, practical, and familiar. They also show why obvious benchmarks can become too comfortable.

The suite includes the classical DIMACS New York and Florida road networks with two objectives: geographic distance and travel time. These networks are huge: New York has 264,346 vertices and 733,846 edges; Florida has 1,070,376 vertices and 2,712,798 edges. Scale is not the problem.

The problem is redundancy. With distance and time correlated at $0.96$ and $0.97$, many trade-offs collapse. A route that is short is often also fast. That does not eliminate all Pareto structure, but it narrows the kind of conflict the algorithm sees. An algorithm performing well here may simply be good at exploiting an easy objective relationship.

The paper then extends the New York network by adding objectives such as elevation difference, average vertex degree, and hop count. This produces three-, four-, and five-objective variants while preserving the same topology and the same 100 fixed New York queries. The design is elegant because it changes the objective space while keeping the graph structure stable. Distance and time remain strongly correlated, elevation is only moderately correlated with them, and vertex degree plus hop count are almost uncorrelated with the other metrics.

That makes the extended road networks a useful intermediate category: still real road topology, but no longer trapped in the distance-time comfort zone.

The GER10 road benchmark goes further in dimensionality. It uses a connected subgraph of a German road network with ten objectives: distance, travel time, ascent, road-type costs, gas price, energy, unit cost, and quietness. Some of these objectives are highly correlated, especially routing-related ones such as distance, energy, gas price, and small-road preference. Others introduce additional structure. This is not a pure independence test. It is closer to what many business systems actually face: several objectives, some redundant, some partially distinct, and all inconveniently bundled into one decision.

For business evaluation, the lesson is simple enough to be annoying: do not test a multi-objective optimizer only on objectives that naturally agree. That is not multi-objective stress. That is group harmony.

Synthetic graphs test whether difficulty is real or just inherited from geography

Synthetic benchmarks are less romantic than real maps. They are also indispensable.

The paper includes two synthetic categories: random-cost grids and NetMaker graphs. Their purpose is not to mimic a city or warehouse. Their purpose is control.

The random-cost grids use two, three, or four objectives. Each objective cost is independently sampled from integers in $[1, 10]$. The resulting edge-level objective correlations satisfy $|\rho| < 0.01$. In plain English: the objectives are nearly independent by construction. The grids range over $7 \times 7 = 49$ width-height combinations, using sizes from 300 to 600 in each dimension. Each instance has an auxiliary source connected to the leftmost column and an auxiliary target connected from the rightmost column, creating a standardized left-to-right query.

This category tests what happens when the algorithm cannot rely on friendly correlations. Independent objectives can inflate Pareto sets and expose dominance-checking costs. A method that performs beautifully on correlated road objectives may suddenly discover that the frontier has teeth.

NetMaker provides a different synthetic stressor. It builds directed graphs with a Hamiltonian cycle for global connectivity plus local shortcut edges. Cycle edges receive one low, one medium, and one high cost component across the three objectives, with the assignment permuted. This induces moderate negative correlation on cycle edges. Locality edges receive independent low costs, creating alternative routing options. The released NetMaker instances cover 10,000, 20,000, and 30,000 vertices, with different locality settings and 50 sampled queries per graph.

The important detail is query design. NetMaker samples sources from the first 10% of vertex identifiers and targets from the last 10%, while restricting local edges by a locality window. That forces long-range traversal through intermediate graph regions, reducing trivial frontiers and producing richer trade-offs.

Synthetic graphs are not “less real” in the only sense that matters for benchmarking. They are more honest about which property they are testing.

Game grids add spatial risk without pretending to be road networks

The third family converts game-based grid maps into bi-objective search instances. The maps come from GUARDS-style weighted pathfinding environments and include layouts such as FireWalker, 64room, and maze512. The paper turns them into directed eight-connected graphs with two objectives: path length and guard exposure.

Path length is straightforward: orthogonal moves cost 10 and diagonal moves cost 14, approximating Euclidean distance. Guard exposure is more interesting. It measures the number of guards that can observe a cell, creating a spatially structured risk objective. For diagonal moves, exposure is computed using the maximum guard value among the destination and intermediate side cells, and diagonal moves are allowed only when both side cells are passable.

The resulting graphs are large, roughly 200,000 to 260,000 vertices with more than one million directed edges per instance. Yet the critical property is not just size. The paper reports small correlations between path length and guard exposure, with $|\rho| < 0.07$ across these instances.

This category matters because it introduces structured spatial trade-offs that are neither road-like nor purely synthetic. A shorter path may cross exposed territory. A safer path may detour. The conflict is local, spatial, and unevenly distributed.

The query design also keeps an operationally relevant nuisance: some randomly sampled start-goal pairs are disconnected, and the benchmark retains them as empty solution sets. That is good. Real systems spend plenty of time discovering that the requested path, schedule, or plan is impossible. A benchmark that deletes all impossibility is not clean; it is sanitized.

Robot roadmaps turn “safety” from a slogan into several objectives

The fourth family uses motion-planning roadmaps for a Franka Emika Panda robot arm in a tabletop scene with obstacles. This is where the benchmark moves from two-dimensional pathfinding into high-dimensional robot configuration space.

Each edge corresponds to a collision-free interpolation between robot joint configurations. The first objective is joint-space path length. Additional objectives convert obstacle clearance into additive penalties using a smooth function that is zero beyond a safety margin and increases as the robot approaches obstacles.

The benchmark includes two formulations. Panda-RRG_2 aggregates clearance across all seven arm links, producing a two-objective trade-off between path length and safety. Panda-RRG_8 keeps clearance separate for each link, producing eight objectives: path length plus seven per-link clearance penalties.

That distinction is not cosmetic. Aggregated clearance collapses risk into one number. Per-link clearance preserves the geometry of the manipulator. The paper notes that two links closest to the base remain far from obstacles and therefore have zero clearance penalties across all edges, but the objectives are retained to preserve structural consistency.

This is precisely the kind of modeling decision that business users usually never see, until it breaks something expensive.

Panda-RRG_2 contains 20,000 vertices and 364,770 edges, with a median exact Pareto size of 51. Panda-RRG_8 is smaller in graph scale, at 1,000 vertices and 10,260 edges, but far larger in Pareto complexity: the median exact Pareto size is 4,676, with a maximum of 8,726. The number of vertices is not the difficulty story. Objective dimensionality and frontier structure are.

The benchmark families expose different failure modes

The paper’s category design is stronger than a single leaderboard because each family pressures a different weakness. A compact way to read the suite is below.

Benchmark family	What it mainly tests	Failure mode it exposes	Business analogue
Classical road networks	Large-scale routing with familiar objectives	Mistaking correlated objectives for real trade-offs	Delivery optimization where time and distance mostly agree
Extended road networks and GER10	More objectives on road topology	Algorithms that degrade as objective dimensionality grows	Fleet planning with cost, time, fuel, terrain, and policy constraints
Random-cost grids	Near-independent objectives under controlled structure	Dominance and frontier explosion under weak correlation	Planning systems where goals conflict rather than cooperate
NetMaker	Rich synthetic long-range routing with designed trade-offs	Methods that only work on trivial or short-frontier queries	Stress-testing optimizer behavior before deployment
Game-based grids	Spatial risk versus path length	Ignoring structured local hazards and disconnected cases	Security-aware navigation, warehouse routing, inspection planning
Robot roadmaps	High-dimensional planning and safety formulation	Collapsing safety into an oversimplified scalar	Robotic manipulation where “safe” depends on which link is near what

This table also clarifies why the paper should not be read as a competition result. It is not saying one benchmark family is the “best.” It is saying that evaluation should be structurally plural. A serious optimizer should be tested against more than one kind of discomfort.

The evidence characterizes the suite; it does not crown an algorithm

The paper’s empirical material is best read as benchmark characterization rather than algorithm benchmarking.

Table 1 is main evidence for benchmark coverage. It reports graph scale and exact Pareto-set cardinality across the suite. The spread is large. Classical DIMACS-NY_2 has a median Pareto size of 94; DIMACS-FLA_2 has 235. The extended New York variants jump into the thousands, with median exact Pareto sizes above 2,200. Panda-RRG_8 reaches a median of 4,676 despite having only 1,000 vertices. Again: difficulty is not just graph size.

Figure 3 is main evidence for the approximate-reference side of the suite. It shows how Pareto-front cardinality changes as $\varepsilon$ increases across representative benchmarks. The paper reports that, averaged over benchmark families, the median Pareto set size at $\varepsilon = 0.1$ is reduced by 97.9% relative to the exact case, with a 95.6% mean reduction and a range from 84.2% to 99.9% across benchmarks.

This is not a license to be sloppy. It is a reminder that exact frontiers may be computationally extravagant relative to operational need. In many business settings, a compact set of good-enough trade-offs is more valuable than a museum-quality frontier of every non-dominated option. The important condition is that the approximation is explicit, fixed, and comparable.

Figure 5 and Table 2 are diagnostic evidence about frontier geometry. The paper visualizes representative exact two-objective Pareto fronts and reports spread ratios along each objective axis. The fronts differ not only in size but also in shape and distribution. Panda-RRG_2, for example, shows larger average spread than the classical DIMACS two-objective instances. That matters because two problems with similar cardinality can still differ in how easy they are to approximate, sample, or present to a human decision-maker.

In short, the paper’s experiments are not trying to show that Algorithm A beats Algorithm B. They show that the benchmark suite contains structurally different kinds of evidence. That is more foundational and less headline-friendly. Naturally.

What Cognaptus infers for business evaluation

The paper directly shows that MOS evaluation can be standardized across diverse graph families using fixed instances, fixed queries, and reference exact and approximate Pareto sets. It also directly shows that commonly used road benchmarks can contain highly correlated objectives, and that benchmark families differ sharply in Pareto cardinality, objective correlation, and frontier geometry.

The business inference is broader but still disciplined: firms should treat benchmark design as part of system quality assurance, not as a post-hoc demonstration slide.

For logistics, routing, robotics, inspection, scheduling, and autonomous workflow systems, multi-objective claims are cheap. “Balances cost, speed, risk, and reliability” is the sort of phrase that can be written before lunch. The hard question is whether the system has been tested on cases where those objectives genuinely conflict.

A practical evaluation checklist would ask:

Evaluation question	Why it matters
Are the objectives correlated, partially correlated, or independent?	Correlated objectives can make trade-off reasoning look easier than it is.
Are start-goal or task queries fixed and reusable?	Without fixed queries, cross-version comparison becomes performance theater.
Are disconnected or infeasible cases retained?	Real operations include failure detection, not only successful paths.
Are exact and approximate outputs both evaluated?	Exact optimality and usable approximation answer different business questions.
Is frontier size separated from frontier geometry?	A small frontier can still be awkward; a large one can still be structured.
Does safety remain decomposed when decomposition matters?	Aggregated safety metrics can hide which component creates risk.

This is not only relevant to classical path planning. The same evaluation logic applies to AI agents that trade off latency, cost, accuracy, user satisfaction, compliance risk, and tool-use reliability. The paper does not test LLM agents. It should not be cited as if it does. But it offers a useful evaluation pattern: before arguing about which system is smarter, define whether the test actually contains the conflicts the system claims to manage.

The boundary is important: this is classical MOS infrastructure

The authors are careful about scope, and the article should be too.

The benchmark suite focuses on classical MOS: non-negative, additive, edge-local costs evaluated through Pareto dominance. It does not standardize settings with negative edge weights, history-dependent objectives, hierarchical objectives, or aggregation-based preference models. Those extensions are acknowledged as future work.

That boundary matters for business translation. Many operational systems do not have purely additive costs. A route may become riskier because of cumulative exposure. A robot action may change future feasibility. A financial or supply-chain decision may depend on path history, capacity constraints, or regulatory thresholds. The benchmark is therefore not a universal evaluation solution.

It is better understood as a missing foundation for one important class of optimization problems. Foundations are not buildings. But without them, the building tends to develop opinions about gravity.

Better benchmarks make weaker claims easier to reject

The best contribution of this paper is not that it expands MOS benchmarking. It is that it changes what counts as a credible claim.

After this suite, it becomes harder to say an MOS algorithm is broadly strong because it performs well on a single road dataset. It becomes harder to hide behind scale when objective conflict is weak. It becomes harder to compare methods when queries, objectives, and approximation settings are quietly different.

That is what good evaluation infrastructure does. It does not guarantee progress. It makes fake progress more expensive.

For businesses, the message is equally plain. If a vendor, internal team, or research prototype claims to optimize across competing objectives, ask what kind of conflict it has survived. Road scale? Objective independence? Spatial risk? High-dimensional safety? Disconnected cases? Approximation sensitivity?

A system that performs well only when objectives agree is not balancing trade-offs. It is enjoying consensus.

Benchmarks should make that visible before production does.

Cognaptus: Automate the Present, Incubate the Future.

Hadar Peer, Carlos Hernandez, Sven Koenig, Ariel Felner, and Oren Salzman, “Bridging the Evaluation Gap: Standardized Benchmarks for Multi-Objective Search,” arXiv:2603.24084v1, 2026. https://arxiv.org/abs/2603.24084 ↩︎

The real problem is not missing datasets; it is incompatible evidence#

Road networks test scale, but also reveal the comfort of correlated objectives#

Synthetic graphs test whether difficulty is real or just inherited from geography#

Game grids add spatial risk without pretending to be road networks#

Robot roadmaps turn “safety” from a slogan into several objectives#

The benchmark families expose different failure modes#

The evidence characterizes the suite; it does not crown an algorithm#

What Cognaptus infers for business evaluation#

The boundary is important: this is classical MOS infrastructure#

Better benchmarks make weaker claims easier to reject#