Benchmarks on Quicksand: Why Static Scores Fail Living Models

A benchmark score looks wonderfully solid until the model changes, the dataset changes, the deployment stack changes, the GPU behaves differently, the logging pipeline drops half the useful metadata, and someone asks whether the result still means anything for their actual application.

At that point, the leaderboard number is not wrong. It is worse: it is under-described.

Exposing that gap is the central value of “AI Benchmark Democratization and Carpentry,” a broad position-and-framework paper from the MLCommons Science and HPC benchmarking community.¹ The paper does not introduce a new model, claim a new state-of-the-art result, or present a clean ablation table that can be summarized in three heroic bullets. Instead, it argues that AI benchmarking has become an operational discipline in its own right. Not a side activity. Not a PDF appendix. Not a leaderboard screenshot to decorate a procurement memo.

The authors call the missing discipline AI benchmark carpentry: the practical skill set needed to design, run, share, interpret, and evolve AI benchmarks across changing data, models, hardware, software environments, metrics, and constraints.

The phrase sounds humble. That is probably why it is useful. “Benchmark carpentry” does not promise a grand theory of AI evaluation. It says: before arguing about which model wins, please learn how to hold the measuring tool straight. A radical proposal, apparently.

The paper’s most important correction is aimed at a common misconception: democratizing AI benchmarks is not the same as publishing more leaderboards or letting smaller teams run the same elite benchmark on weaker hardware. Democratization requires infrastructure, metadata, containers, workflows, energy logs, profiling, domain mapping, and training. In other words, access to the test is not enough. People need the craft of testing.

This article reads the paper as a map rather than a linear survey. The useful categories are: what a benchmark must specify, why static benchmarks fail living systems, why infrastructure belongs inside the result, why energy and GPU variability cannot remain footnotes, and why benchmark democratization is ultimately a curriculum problem.

Static benchmarks fail when the target keeps moving

Traditional benchmarks work best when the object being tested is stable enough to define in advance. A fixed task. A fixed dataset. A fixed environment. A fixed metric. A fixed procedure. Run the test, compare the results, update the table.

AI increasingly refuses to behave so politely.

The paper identifies several moving parts: model architectures evolve, datasets change, deployment settings vary, and some AI systems — especially large language models — may memorize static benchmark material. The resulting problem is not just benchmark contamination in the narrow sense. It is a broader mismatch between benchmark conditions and real deployment conditions.

A static benchmark can still be useful. The paper does not throw it out. It makes a more disciplined point: static tests are only valid when the benchmark components are sufficiently specified and when the user understands what is fixed, what is variable, and what the resulting score can reasonably imply.

That matters because businesses often consume benchmarks as if they were product labels:

What the business wants	What the benchmark often gives	What benchmark carpentry asks instead
“Which model is best?”	A ranked score under a specific setup	Best for which task, dataset, metric, constraint, and deployment environment?
“Which GPU should we buy?”	Peak performance or throughput	Under what workload, software stack, power profile, and variability pattern?
“Which vendor is cheaper?”	Token price or latency estimate	At what accuracy target, throughput level, energy cost, monitoring overhead, and failure rate?
“Can we trust this leaderboard?”	A public comparison	Is the benchmark reproducible, versioned, documented, containerized, and domain-relevant?

The paper’s answer is not “ignore benchmarks.” That would be lazy, which is very fashionable but still lazy. The answer is to treat benchmarking as a structured evaluation workflow.

A benchmark is not a score; it is a specification

The formal core of the paper is its benchmark specification:

$$ B = (I, D, T \text{ or } W, M, C, R, V) $$

Here, $B$ is the benchmark. Its components are infrastructure $I$, dataset $D$, task $T$ or workflow $W$, metrics $M$, constraints $C$, results $R$, and version or timestamp $V$. A task can be written as:

$$ T = (A, P) $$

where $A$ is the application and $P$ is its parameters. For more complex systems, the paper allows a workflow $W$ made of multiple tasks and dependencies.

This may look like harmless notation. It is not. It is a checklist against benchmark theater.

A benchmark score without infrastructure details is not reproducible. A benchmark without dataset lineage is not interpretable. A benchmark without constraints is not fair. A benchmark without versioning is a future archaeology project. A benchmark without metrics that match the real business question is just numerology with better typography.

The paper’s specification matters because AI systems are now evaluated across heterogeneous conditions. A model may perform differently depending on the GPU generation, memory bandwidth, batch size, compiler, container runtime, orchestration layer, dataset freshness, retrieval pipeline, quantization setting, or power cap. The formal tuple forces these variables into the benchmark definition rather than leaving them as tribal knowledge.

For business users, the translation is straightforward:

Benchmark component	Operational question	Business consequence
Infrastructure $I$	What hardware, software, cloud, libraries, power environment, and runtime stack were used?	Prevents false comparisons between vendor demos and internal deployment conditions.
Dataset $D$	What data was used, how was it split, documented, refreshed, and versioned?	Reduces benchmark contamination, domain mismatch, and hidden bias.
Task or workflow $T/W$	Is this a single task or an end-to-end process with dependencies?	Avoids optimizing a toy task while the real workflow fails.
Metrics $M$	Are we measuring accuracy, latency, throughput, cost, energy, robustness, or a trade-off?	Aligns model selection with actual ROI drivers.
Constraints $C$	What limits apply to model size, training time, inference budget, data access, hardware, or compliance?	Makes comparisons fair and procurement-relevant.
Results $R$	Are outputs reported clearly with error analysis and dashboards?	Supports diagnosis rather than score worship.
Version $V$	When was this run, on which version of data/model/code?	Keeps benchmark history usable as systems evolve.
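
The components in the table can also be written down as a record that a run cannot quietly skip. The sketch below is a minimal illustration in Python, not the paper's reference implementation: the class names, field names, and the `missing_components` helper are all assumptions introduced here.

```python
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class Task:
    """T = (A, P): an application plus its parameters."""
    application: str
    parameters: dict = field(default_factory=dict)


@dataclass(frozen=True)
class BenchmarkSpec:
    """B = (I, D, T or W, M, C, R, V), using hypothetical field names."""
    infrastructure: dict   # I: hardware, software stack, runtime
    dataset: dict          # D: name, split, version or lineage
    task: Task             # T (a workflow W would replace this field)
    metrics: tuple         # M: names of the reported metrics
    constraints: dict      # C: limits on time, budget, hardware, compliance
    results: dict          # R: metric name -> measured value
    version: str           # V: version tag or timestamp


def missing_components(spec: BenchmarkSpec) -> list:
    """Return the names of components that are empty, i.e. under-described."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]


# Illustrative values only; none of these names come from the paper.
spec = BenchmarkSpec(
    infrastructure={"gpu": "example-gpu", "container": "example-image:1.0"},
    dataset={"name": "example-set", "version": "2025-10-01"},
    task=Task("image-classification", {"batch_size": 32}),
    metrics=("accuracy", "latency_ms"),
    constraints={},            # left empty on purpose
    results={"accuracy": 0.91, "latency_ms": 12.5},
    version="",                # left empty on purpose
)
print(missing_components(spec))  # ['constraints', 'version']
```

The check is deliberately crude: it treats an empty field as missing, so a benchmark that truly has no constraints would need to say so explicitly. That is arguably the point of the tuple.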

The paper’s quiet insight is that many benchmark debates are actually specification failures. People argue about the score because the benchmark did not adequately define the object being measured.

Living datasets turn evaluation into maintenance

The paper distinguishes static datasets from dynamic and “living” datasets. A living dataset is updated over time with new data, edge cases, or corrections. It may represent real-time feeds, as in earth science or healthcare, or it may simulate time-based data ingestion from a static source.

This is where the article’s title earns its keep. Static scores sit on quicksand when the benchmarked reality keeps changing.

A static benchmark assumes the test data is a stable proxy for future use. That assumption can be reasonable for some hardware runtime tests. It is much weaker for systems exposed to changing user behavior, current events, new scientific observations, medical updates, fraud patterns, cyberattack methods, or regulatory text.

For companies deploying AI, the relevant question is not simply: “What did the model score?” It is:

How does the evaluation process stay meaningful after the product, data, users, and threat environment change?

That question pushes benchmarking toward maintenance. The benchmark itself needs versioning, data-refresh rules, update triggers, and comparison logic across time. A living benchmark should not become an excuse to constantly move the goalposts. It needs constraints, documented changes, and reproducible snapshots. Otherwise, “dynamic evaluation” becomes a polite way to say “nobody can reproduce anything anymore.”

The paper is careful here. It does not argue that every benchmark should use living data. It says static testing is preferable whenever behavior can be tested statically, because dynamic setups can trigger a combinatorial explosion of test configurations. That boundary is important. A benchmark should evolve when evolution improves relevance, not because dynamism sounds modern and therefore fundable.
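
A reproducible snapshot of a living dataset can be as simple as a content fingerprint stored next to every score. The following is a minimal sketch, assuming records are JSON-serializable dictionaries; the function names and sample records are hypothetical.

```python
import hashlib
import json


def snapshot_id(records: list) -> str:
    """Content hash of a dataset snapshot, independent of record order."""
    # Serialize each record canonically so key order does not affect the hash.
    lines = sorted(json.dumps(r, sort_keys=True) for r in records)
    digest = hashlib.sha256("\n".join(lines).encode("utf-8"))
    return digest.hexdigest()[:16]


def same_snapshot(run_a: dict, run_b: dict) -> bool:
    """Two scores are directly comparable only if they share a data snapshot."""
    return run_a["snapshot"] == run_b["snapshot"]


v1 = [{"id": 1, "text": "alpha"}, {"id": 2, "text": "beta"}]
v2 = v1 + [{"id": 3, "text": "new edge case"}]   # the living dataset grows

run_old = {"score": 0.90, "snapshot": snapshot_id(v1)}
run_new = {"score": 0.87, "snapshot": snapshot_id(v2)}
print(same_snapshot(run_old, run_new))  # False
```

With the fingerprint recorded, a drop from 0.90 to 0.87 is flagged as a comparison across different data rather than read as a regression in the model.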

Workflows matter because AI products are rarely single tasks

The paper’s workflow extension is especially relevant for business AI. Many enterprise AI failures do not happen because a model cannot perform a narrow task. They happen because the task sits inside a messy pipeline.

A customer-support assistant is not just answer generation. It may include retrieval, policy checking, tool use, escalation, logging, summarization, and quality review. A trading assistant is not just signal generation. It includes data ingestion, feature calculation, risk rules, execution constraints, state tracking, and post-trade evaluation. A medical AI system is not just classification. It includes data quality, clinical context, safety thresholds, auditability, and workflow integration.

Benchmarking the isolated model may be useful, but it is not enough.

The paper’s workflow formulation $W=(T,E)$ treats a workflow as a graph, with the tasks $T$ as nodes and the dependencies $E$ between them as edges. That structure fits modern agentic AI especially well. Agent benchmarks cannot stop at “did the model answer correctly?” They must ask whether the system selected the right tools, passed the right information, respected constraints, recovered from intermediate errors, and produced an auditable result.
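
The graph view can be sketched with Python's standard-library `graphlib`. This is an illustration under stated assumptions, not the paper's implementation: the task names, the `run_workflow` helper, and the rule that a task is skipped when a dependency did not pass are all choices made here.

```python
from graphlib import TopologicalSorter

# Hypothetical support-assistant workflow: each task maps to the tasks it depends on.
dependencies = {
    "retrieve": set(),
    "generate_answer": {"retrieve"},
    "policy_check": {"generate_answer"},
    "log_result": {"policy_check"},
}


def run_workflow(deps: dict, task_ok: dict) -> dict:
    """Walk tasks in dependency order; skip a task if any dependency did not pass."""
    status = {}
    for task in TopologicalSorter(deps).static_order():
        if any(status[d] != "passed" for d in deps.get(task, ())):
            status[task] = "skipped"
        else:
            status[task] = "passed" if task_ok[task] else "failed"
    return status


outcome = run_workflow(
    dependencies,
    {"retrieve": True, "generate_answer": True, "policy_check": False, "log_result": True},
)
print(outcome)
```

In this run the answer is generated correctly, yet the workflow as a whole does not complete: `policy_check` fails and `log_result` is skipped. A model-only benchmark would have reported a pass.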

For businesses, this changes benchmark design from a product-comparison exercise into a process-evaluation exercise. The better question is not “Which model tops the table?” but “Which configuration completes our workflow reliably under our constraints?”

Yes, this is less glamorous than a leaderboard. So is accounting. Companies still need it.

Infrastructure is part of the result, not background scenery

The paper repeatedly returns to infrastructure because infrastructure is where benchmark scores quietly become non-transferable.

The authors discuss workflows, containerization, system-dependent software, logging, monitoring, and profiling. These are not decorative engineering details. They determine whether a benchmark can be reproduced, compared, debugged, and trusted.

Containerization is a good example. A benchmark that runs in one lab but collapses in another because of library differences is not democratized. It is a local ritual. The paper notes that Docker-style containers can simplify deployment, while HPC environments may require Apptainer because of root-access restrictions. That detail matters because many AI benchmarks span local machines, cloud platforms, and supercomputing environments.

Logging is another example. The paper points out that some benchmark logs are not human-readable and require post-processing. That sounds minor until a team has to explain a model-selection decision to procurement, compliance, or a client. Logs that only benchmark insiders can interpret are not transparent. They are a velvet rope.
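
A log that non-insiders can read does not require much machinery. The sketch below captures a small slice of the software environment and writes one JSON object per line. It is an assumption-laden illustration: the helper names are invented here, the package list is only an example, and a real benchmark would also need hardware, driver, and container details that this code does not collect.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_environment(packages: list) -> dict:
    """Collect a small, human-readable description of the software stack."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


def log_line(event: str, **details) -> str:
    """One JSON object per line: readable by people and easy to parse later."""
    return json.dumps({"event": event, **details}, sort_keys=True)


env = capture_environment(["numpy", "torch"])  # package names are only examples
print(log_line("run_started", environment=env))
```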

Profiling is the diagnostic layer. It explains why one implementation is faster than another, whether overhead comes from data preprocessing or GPU kernels, and which parts of the heterogeneous system are actually being used. The paper lists profiling tools across framework, system, kernel, compiler, communication, cloud, and memory levels. This is best read as a coverage map, not as a recommendation to profile everything all the time.

The authors explicitly warn against over-profiling because profiling has its own cost. That is one of the more practical points in the paper: good benchmark carpentry includes knowing how much instrumentation is enough.
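
One reading of "enough instrumentation" is to start with coarse stage timers and reach for heavier profilers only where the coarse numbers point. The sketch below shows that first step; the `stage` context manager and the stand-in workloads are hypothetical. On GPUs, asynchronous kernel launches mean wall-clock timers like these need explicit synchronization before they can be trusted, which this sketch does not attempt.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_seconds = defaultdict(float)


@contextmanager
def stage(name: str):
    """Accumulate wall-clock time per named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[name] += time.perf_counter() - start


def share_of_total(timings: dict) -> dict:
    """Fraction of measured time spent in each stage."""
    total = sum(timings.values())
    if not total:
        return {}
    return {name: seconds / total for name, seconds in timings.items()}


for _ in range(3):
    with stage("preprocess"):
        batch = [x * 0.5 for x in range(50_000)]   # stand-in for data preparation
    with stage("compute"):
        result = sum(v * v for v in batch)         # stand-in for the model step

print(share_of_total(stage_seconds))
```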

GPU variability makes “same hardware” less comforting than it sounds

One of the paper’s most business-relevant sections concerns GPU variability. The authors cite prior work showing that application performance can vary even across hardware with the same architecture and vendor SKU. For GPU-rich clusters, the cited evidence includes average application variability of 8%, maximum variability of 22%, and outliers up to 1.5 times slower than the median GPU. In multi-GPU jobs, the risk compounds because synchronized workloads may wait for the slowest device.

This is not the paper’s own experiment; it is reviewed evidence used to motivate benchmark carpentry. Its purpose is not to prove a new GPU-variability theorem. It supports a practical claim: reproducible AI benchmarking needs to account for hardware behavior, scheduling, and profiling, not merely list the GPU model.

For businesses, this affects cloud cost, procurement, and SLA interpretation. Suppose two model configurations appear to differ by a small latency margin. If the benchmark was run once, on an unspecified allocation, without variability controls, the difference may be noise wearing a lab coat.

A mature evaluation process should record allocation details, repeat runs where feasible, identify slow-device effects, and use workload-relevant profiling. For large jobs, scheduling policies and GPU grouping can become part of benchmark design. This is not exciting. It is merely the difference between measurement and vibes.
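
Repeated runs only help if someone summarizes them. A minimal sketch follows, using invented step times for eight GPUs of the same model; the function names are assumptions, and the numbers are illustrative rather than taken from the cited studies.

```python
import statistics


def variability_report(times_s: list) -> dict:
    """Summarize repeated measurements of the same workload (lower is faster)."""
    median = statistics.median(times_s)
    mean = statistics.fmean(times_s)
    return {
        "median_s": median,
        "coefficient_of_variation": statistics.stdev(times_s) / mean,
        "slowest_vs_median": max(times_s) / median,
    }


def synchronized_step_time(per_gpu_times_s: list) -> float:
    """In a synchronized multi-GPU step, every device waits for the slowest one."""
    return max(per_gpu_times_s)


# Hypothetical step times (seconds) for eight GPUs of the same model.
per_gpu = [1.00, 1.02, 0.99, 1.01, 1.00, 1.03, 0.98, 1.50]
report = variability_report(per_gpu)
print(report["slowest_vs_median"])
print(synchronized_step_time(per_gpu))  # 1.5
```

Seven of the eight devices finish in about a second, but the synchronized job runs at the pace of the eighth. A single run on a lucky allocation would never show that.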

Energy benchmarking turns performance into cost-to-solution

The energy section is one of the paper’s clearest bridges from research benchmarking to business decision-making.

The authors argue that traditional metrics such as FLOPS or latency can hide energy-to-solution: the total energy required to complete a task. They illustrate the issue with estimated energy consumption for GPT-style model training and inference, while explicitly marking GPT-5 and GPT-6 values as estimates or projections rather than public measurements. They also use Oak Ridge leadership-class systems to show a dual trend: total power consumption rose across generations, while performance per energy unit improved substantially.

These tables and figures are illustrations, not controlled experiments. They make energy visible as a benchmark dimension. They do not establish a universal law of AI energy use, and they should not be read as an independent audit of proprietary model training energy. The serious point is simpler: if energy is not measured, it cannot enter the optimization problem.

The paper proposes energy metrics at several layers:

Layer	Example metrics	What it helps decide
Device or micro-architectural layer	Energy per FLOP, energy per inference, temperature logs	Kernel efficiency, hardware limits, safe operating ranges
Job or system layer	kWh, energy-delay product	Job scheduling, power caps, cost-to-solution
Facility or data-center layer	PUE, DCiE	Infrastructure efficiency, sustainability reporting, facility planning

This layered framing is useful because businesses often mix these questions. A model team may care about energy per inference. Finance may care about monthly cloud cost. Sustainability reporting may care about facility-level energy and carbon accounting. Procurement may care about throughput per watt. These are related but not identical.

A benchmark that collapses them into a single “efficient” label is asking for confusion. A better benchmark states the layer, metric, logging method, and decision purpose.
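
The job-level and facility-level metrics in the table follow standard definitions, and writing them out makes it harder to blur the layers. The sketch below assumes evenly spaced power samples and uses invented numbers throughout; the function names are hypothetical.

```python
def energy_joules(power_watts: list, interval_s: float) -> float:
    """Trapezoidal integration of evenly spaced power samples."""
    pairs = zip(power_watts, power_watts[1:])
    return sum((a + b) / 2 * interval_s for a, b in pairs)


def joules_to_kwh(joules: float) -> float:
    """Convert joules to kilowatt-hours (1 kWh = 3.6 million joules)."""
    return joules / 3.6e6


def energy_delay_product(joules: float, runtime_s: float) -> float:
    """Job-level trade-off metric: energy multiplied by time to solution."""
    return joules * runtime_s


def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: facility energy divided by IT energy."""
    return total_facility_kwh / it_equipment_kwh


def dcie_percent(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Data center infrastructure efficiency: the reciprocal of PUE, in percent."""
    return 100 * it_equipment_kwh / total_facility_kwh


# Hypothetical job: power sampled once per second for four seconds.
samples_w = [300.0, 320.0, 340.0, 320.0, 300.0]
job_j = energy_joules(samples_w, interval_s=1.0)
print(job_j)                                   # 1280.0
print(energy_delay_product(job_j, 4.0))        # 5120.0
print(pue(150.0, 100.0), dcie_percent(150.0, 100.0))
```

The first two functions answer a model team's question, the last two answer a facility manager's. Reporting one as if it were the other is how a single "efficient" label ends up meaning nothing.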

Benchmark catalogs are useful only when they are searchable, rated, and contextualized

The paper reviews MLCommons, MLPerf, MLPerf Tiny, MLPerf Storage, MLPerf Science, AILuminate, Croissant ML, MLCube, and related efforts. It also discusses an ontology for scientific machine learning benchmarks, including domain, task, metric, model type, and ratings for documentation, specification, software, metrics, dataset, and reference solution.

The ontology section is easy to underestimate because it looks like catalog work. It is actually one of the strongest democratization mechanisms in the paper.

If there are too many benchmarks to manually inspect, then discoverability becomes part of evaluation infrastructure. The paper reports that as of October 1, 2025, a search for “AI benchmark” returned 106 arXiv entries and 2,490 Google Scholar entries. The exact count is less important than the implication: benchmark abundance can become benchmark opacity.

A catalog without structure does not democratize anything. It just gives users a bigger pile.

The ontology/rating approach helps answer practical questions:

User need	Catalog feature that helps
Find benchmarks for a domain	Domain and application tags
Compare benchmark maturity	Ratings for documentation, software, dataset, metrics, and reference solution
Identify reusable components	Metadata about tasks, models, hardware, and KPIs
Avoid irrelevant leaderboards	Search by scientific task and deployment constraints
Support community improvement	Open submission workflow and rubric-based review

For business teams, the analogy is internal benchmark governance. A company evaluating AI systems should not maintain a messy folder of one-off notebooks named final_benchmark_v7_REAL.ipynb. It should maintain a catalog of benchmark blueprints: what each benchmark measures, when to use it, what data it depends on, how to run it, what results are comparable, and what the known limitations are.
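
A catalog of blueprints becomes useful the moment it can be filtered. The following is a minimal sketch of that idea, loosely modeled on the domain tags and rubric ratings described above; the entry names, the rating scale, and the `find_benchmarks` helper are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    name: str
    domain: str
    task: str
    ratings: dict  # rubric dimension -> score; the 0-5 scale is an assumption


def find_benchmarks(catalog: list, domain: str, min_rating: int, dimensions: tuple) -> list:
    """Entries in a domain whose rating meets the floor on every requested dimension."""
    return [
        entry.name
        for entry in catalog
        if entry.domain == domain
        and all(entry.ratings.get(dim, 0) >= min_rating for dim in dimensions)
    ]


# Entirely hypothetical entries, for illustration only.
catalog = [
    CatalogEntry("support-rag-v2", "customer-support", "retrieval-qa",
                 {"documentation": 4, "dataset": 5, "software": 4}),
    CatalogEntry("support-demo", "customer-support", "retrieval-qa",
                 {"documentation": 1, "dataset": 3}),
    CatalogEntry("fraud-stream", "payments", "anomaly-detection",
                 {"documentation": 5, "dataset": 4, "software": 5}),
]

print(find_benchmarks(catalog, "customer-support", 3, ("documentation", "dataset")))
# ['support-rag-v2']
```

An unrated dimension counts as zero here, so an undocumented benchmark drops out of the results instead of passing by default.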

Revolutionary? No. Necessary? Unfortunately, yes.

Democratization is a skills problem before it is a platform problem

The paper’s curriculum section is where “benchmark carpentry” becomes concrete. The proposed curriculum includes software carpentry foundations, AI benchmarking fundamentals, reproducibility and experiment management, ethical considerations and bias mitigation, hands-on projects, open-source contribution, energy efficiency, simulation, performance tuning, and leaderboard management.

This is not just academic training. It maps cleanly onto organizational capability.

A company can buy access to a model API. It can rent GPUs. It can subscribe to evaluation platforms. None of that guarantees it knows how to benchmark its own AI use cases. The missing capability sits between data science, infrastructure engineering, product management, risk, and finance.

Benchmark carpentry therefore becomes an internal operating discipline:

Capability	Practical business form
Benchmark specification	Evaluation templates for each use case
Reproducible execution	Containers, CI/CD benchmark runs, pinned dependencies
Data provenance	Versioned datasets, refresh rules, contamination checks
Metric design	Accuracy-latency-cost-energy trade-off dashboards
Profiling	Bottleneck diagnosis before scaling spend
Energy logging	Cost-to-solution and sustainability reporting
Benchmark catalog	Internal library of approved tests and interpretation notes
Curriculum	Training for analysts, engineers, product owners, and procurement teams

The paper’s democratization argument is broader than business, but the business implication is direct: benchmark literacy reduces bad AI spending. It helps teams avoid buying a model because it won a benchmark that does not resemble their workload, deploying on hardware whose performance profile they do not understand, or optimizing latency while quietly raising total cost.

That is not glamorous innovation. That is avoiding expensive nonsense. Often, that is the better ROI.

Simulation extends access, but it does not replace reality

The paper also discusses simulation tools such as Accel-Sim, gem5, SST, and ExaDigiT. Their purpose is to democratize design-space exploration for teams that lack access to target hardware or want to test large-scale configurations before physical deployment.

This is important, but the boundary matters. Simulation supports estimation, planning, and what-if analysis. It does not magically make real-world benchmarking unnecessary. Models of hardware and infrastructure are themselves approximations, and their fidelity varies by layer and use case.

The paper’s simulation review is best read as an access strategy. If smaller teams cannot repeatedly run large-scale experiments, simulation can help them reason about likely behavior, compare design options, and narrow the search before spending money. It lowers the cost of inquiry. It does not abolish validation.

For enterprises, the same logic applies. Simulation can inform architecture planning, capacity decisions, power and cooling analysis, and workload scheduling. But any high-stakes deployment still needs real measurement under realistic constraints. The simulator is a rehearsal room, not the concert.

What the paper directly supports — and what Cognaptus infers

Because this paper is broad and framework-oriented, it is important to separate the paper’s direct contribution from business interpretation.

Claim	Status in the paper	Business meaning	Boundary
AI benchmarks need a formal specification covering infrastructure, data, tasks/workflows, metrics, constraints, results, and versioning	Direct framework contribution	Use benchmark templates instead of ad hoc score collection	The framework is conceptual, not validated as a standard through controlled adoption studies
Static benchmarks can become misaligned with evolving AI systems, data, and deployment contexts	Direct argument supported by examples and prior benchmark experience	Treat benchmark relevance as something to maintain over time	Not every benchmark should be dynamic; static tests remain useful when appropriate
Benchmark democratization requires education, tooling, metadata, sharing, and reproducible workflows	Direct thesis	Build internal benchmark capability, not just access to public leaderboards	Curriculum design is proposed, not empirically evaluated
GPU variability and system-level effects affect reproducibility	Reviewed evidence from prior work	Repeat runs, profile workloads, and document infrastructure before making procurement decisions	The cited variability results depend on specific clusters, workloads, and prior studies
Energy benchmarking should become part of AI evaluation	Direct argument with illustrative tables and metric taxonomy	Include cost-to-solution, energy-to-solution, and sustainability metrics in model selection	Some energy estimates in the paper are illustrative or projected, not direct measurements
Simulation can democratize large-scale benchmark planning	Direct review and synthesis	Use simulation for architecture exploration before expensive deployment	Simulation requires validation against real systems

The distinction matters because business readers are tempted to ask, “What did the paper prove?” The answer is: it does not prove a new benchmark wins. It defines what a serious benchmarking practice should contain.

That is less convenient than a number. It is also more useful.

The business value is not a better leaderboard; it is cheaper diagnosis

For companies, the actionable lesson is to stop treating AI benchmarks as external truth machines. Public benchmarks are inputs. They are not substitutes for application-specific evaluation.

A practical internal benchmark carpentry program would begin with five moves.

First, define benchmark blueprints for major AI workflows. Each blueprint should specify infrastructure, data, task or workflow, metrics, constraints, result format, and versioning. The blueprint should be short enough to use and strict enough to prevent interpretive chaos.

Second, separate model benchmarks from workflow benchmarks. A model may answer questions well but fail inside a retrieval process. A tool-using agent may solve a demo task but fail under permission constraints, latency budgets, or messy data. Evaluating only the model is sometimes like testing a car engine while ignoring the brakes. Bold, but not ideal.

Third, record cost and energy alongside quality. For production AI, accuracy without cost context is incomplete. Latency without throughput context is incomplete. Energy without workload context is incomplete. The benchmark should report trade-offs, not just winners.

Fourth, use containerized and versioned benchmark runs. The goal is not bureaucratic neatness. The goal is to know whether a later score changed because the model improved, the dataset changed, the library changed, the hardware changed, or someone accidentally benchmarked a different task. Again.

Fifth, train people to interpret results. Procurement, product, engineering, and risk teams do not need the same depth of detail, but they need shared concepts. Otherwise, benchmarks become political artifacts: everyone cites the score that supports the decision they already wanted.
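
The fourth move pays off when two runs disagree. If every run stores its context next to its score, finding out what changed is a dictionary comparison. The sketch below assumes flat run records with hypothetical keys and values.

```python
def changed_components(run_a: dict, run_b: dict) -> list:
    """Names of recorded components that differ between two benchmark runs."""
    keys = sorted(set(run_a) | set(run_b))
    return [k for k in keys if run_a.get(k) != run_b.get(k)]


# Hypothetical run records; "score" is the result, everything else is context.
march = {"model": "m-1.0", "dataset": "snap-a1", "library": "2.3.0",
         "hardware": "gpu-x", "task": "summarize", "score": 0.81}
june = {"model": "m-1.0", "dataset": "snap-b7", "library": "2.4.1",
        "hardware": "gpu-x", "task": "summarize", "score": 0.86}

context_changes = [k for k in changed_components(march, june) if k != "score"]
print(context_changes)  # ['dataset', 'library']
```

Here the score rose while the model stayed the same, so the improvement cannot be credited to the model. The dataset snapshot and the library version are the suspects.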

The return on benchmark carpentry is not only better model selection. It is faster diagnosis when AI systems disappoint. If the evaluation stack is well designed, teams can identify whether the issue is data drift, infrastructure bottleneck, metric mismatch, workflow failure, or cost explosion. If the evaluation stack is not designed, the team gets a meeting, a spreadsheet, and several confident opinions. Nature’s worst dashboard.

Limitations: this is a framework paper, not a finished benchmark suite

The paper’s strength is breadth. Its limitation is also breadth.

It surveys, formalizes, proposes, and organizes. It does not present a controlled empirical validation of the benchmark carpentry curriculum. It does not show that adopting the proposed specification improves organizational outcomes. It does not deliver a plug-and-play enterprise benchmark platform. It also draws together many examples from HPC, MLCommons, energy benchmarking, GPU variability, simulation, and scientific AI; readers should not treat every table as equal evidence for one central causal claim.

This is not a flaw if the paper is read correctly. It is a blueprint paper.

The correct question is not “Can I use this tomorrow as a complete benchmark standard?” The better question is “Which missing pieces in our current AI evaluation practice does this framework expose?”

For many organizations, the answer will be uncomfortable:

Benchmarks are not versioned.
Datasets are not documented well enough.
Infrastructure details are treated as incidental.
Energy and cost are measured after deployment, if at all.
Public leaderboards are cited without workload mapping.
Evaluation notebooks are not reproducible.
Nobody owns benchmark literacy across teams.

That is the quicksand. The model score is just standing on it.

Conclusion: benchmark the benchmark before trusting the score

The paper’s most useful contribution is not the phrase “AI benchmark carpentry,” although the phrase is memorable. Its real contribution is shifting attention from benchmark results to benchmark capability.

A modern AI benchmark should be a reproducible, versioned, infrastructure-aware, data-aware, metric-aware, constraint-aware workflow. It should support static tests when static tests are sufficient, living datasets when the world changes, workflow evaluation when the product is a pipeline, profiling when performance needs diagnosis, and energy metrics when cost-to-solution matters.

Democratization, in this framing, does not mean everyone gets to stare at the same leaderboard. It means more people can build, run, adapt, interpret, and improve benchmarks that actually match their problems.

That is a less glamorous story than “Model X beats Model Y.” It is also the story businesses need. Because in production AI, the winning benchmark score is only useful after you know what was benchmarked, under what constraints, on which infrastructure, against which data, with which metric, at what cost, and for whose workflow.

Minor details. The kind that decide whether the deployment works.

Cognaptus: Automate the Present, Incubate the Future.

¹ Gregor von Laszewski et al., “AI Benchmark Democratization and Carpentry,” arXiv:2512.11588, 2025, https://arxiv.org/pdf/2512.11588.
