Opening — Why this matters now
The AI industry has quietly entered a dangerous phase: we are measuring everything, and understanding very little.
If you ask five vendors whether their model is “safe,” you will likely get five confident “yes” answers—each backed by benchmarks, metrics, and charts. The problem is not the lack of evaluation. It is that the evaluations no longer agree on what they are measuring.
A recent paper puts it bluntly: the AI safety ecosystem is not suffering from a shortage of benchmarks, but from a collapse in measurement coherence. There are now nearly 200 safety benchmarks. There is still no shared definition of what “safe” actually means.
For businesses deploying AI into regulated, high-risk environments, this is not an academic inconvenience. It is a governance liability.
Background — From scarcity to fragmentation
Early AI safety evaluation had a simple problem: not enough benchmarks. Researchers built datasets to test toxicity, bias, jailbreak resistance, and rule-following behavior. Over time, this evolved into a sprawling ecosystem.
According to the paper’s catalogue (see summary on page 2), the landscape now includes:
| Metric | Value |
|---|---|
| Total benchmarks | 195 |
| Peak growth year | 2023 |
| English-only benchmarks | 165 / 195 |
| Evaluation-only resources | 170 / 195 |
| Stale GitHub repos | 137 / 195 |
| Stale datasets | 96 / 195 |
At first glance, this looks like progress. In reality, it signals something else: fragmentation without governance.
Benchmarks are being created faster than they are standardized, maintained, or meaningfully compared.
Analysis — What the paper actually uncovers
1. The illusion of comparable metrics
The paper’s most uncomfortable finding is what it calls metric collision.
Two benchmarks may both report “accuracy” or “F1 score,” yet differ in:
- What is being measured (moderation vs factuality vs rule-following)
- Who judges the output (human vs model vs heuristic)
- How results are aggregated
- What counts as success
In other words, the same label often refers to different mathematical objects.
| Metric Label | Hidden Variation |
|---|---|
| Accuracy | Moderation labels vs factual answers vs policy compliance |
| F1 Score | Different class definitions and units of analysis |
| Safety Score | Non-leakage vs medical safety vs personalized safety |
| Composite Scores | Different weighting, aggregation, and gating rules |
This creates a dangerous shortcut: executives see familiar metric names and assume comparability.
They shouldn’t.
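To make the collision concrete, here is a minimal Python sketch. The data and function names are hypothetical, not from the paper: two evaluators report the same metric name over the same outputs and still disagree, because they are scoring different properties.

```python
# A minimal sketch of "metric collision": two evaluators both report
# "accuracy" over the same outputs, but score different things.
# All labels and data below are illustrative.

responses = [
    # (output, factually_correct, policy_compliant)
    ("Answer A", True,  False),
    ("Answer B", False, True),
    ("I can't help with that.", False, True),
    ("Answer D", True,  True),
]

def accuracy_as_factuality(rows):
    """'Accuracy' = share of outputs judged factually correct."""
    return sum(correct for _, correct, _ in rows) / len(rows)

def accuracy_as_compliance(rows):
    """'Accuracy' = share of outputs that satisfy a content policy."""
    return sum(compliant for _, _, compliant in rows) / len(rows)

print(accuracy_as_factuality(responses))  # 0.5
print(accuracy_as_compliance(responses))  # 0.75
```

Same label, same outputs, two different numbers: a vendor comparison built on either figure alone would be technically true and practically meaningless.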
2. Benchmark growth without benchmark authority
The paper introduces a complexity taxonomy: Popular, High, Medium, Low.
Distribution (page 7 chart):
| Tier | Count | Interpretation |
|---|---|---|
| Popular | 7 | Widely trusted benchmarks |
| High | 68 | Advanced but not standardized |
| Medium | 94 | Usable but fragmented |
| Low | 26 | Narrow or simple |
The takeaway is subtle but important: the field has many benchmarks, but very few reference standards. Only 7 of 195 qualify as widely trusted.
This is the opposite of mature industries, where a small number of benchmarks dominate (think credit ratings or accounting standards).
AI safety, instead, resembles early financial derivatives—creative, abundant, and slightly unhinged.
3. Governance is the real bottleneck
The dataset reveals a less glamorous but more critical issue: maintenance.
- ~70% of GitHub repositories (137 of 195) are stale
- ~50% of datasets (96 of 195) are not actively maintained
- Heavy reliance on arXiv preprints
This means many benchmarks are effectively abandoned measurement systems still being cited as evidence.
The paper’s implication is clear: a benchmark is not just a dataset. It is an ongoing governance commitment.
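One way to operationalize that commitment is a simple freshness gate before a benchmark is allowed into a decision. The sketch below is a hedged illustration: the records, the one-year window, and the pinned date are my assumptions, not the paper's methodology.

```python
# Flag any benchmark whose repository has not been touched within a
# freshness window. Records and thresholds are illustrative.
from datetime import date, timedelta

FRESHNESS_WINDOW = timedelta(days=365)  # assumption: 1 year = "maintained"
TODAY = date(2025, 1, 1)                # pinned so the example is reproducible

benchmarks = [
    {"name": "bench-a", "last_commit": date(2024, 11, 2)},
    {"name": "bench-b", "last_commit": date(2022, 3, 15)},
    {"name": "bench-c", "last_commit": date(2023, 1, 9)},
]

stale = [b["name"] for b in benchmarks
         if TODAY - b["last_commit"] > FRESHNESS_WINDOW]
print(f"stale: {stale} ({len(stale)}/{len(benchmarks)})")
# stale: ['bench-b', 'bench-c'] (2/3)
```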
4. Safety is multidimensional—but measured as scalar
Several case studies in the paper illustrate a deeper issue: safety is often reduced to a single number.
Take the idea of joint compliance (page 3 discussion):
| Dimension | Traditional Approach | Improved Approach |
|---|---|---|
| Safety | Measured independently | Coupled with helpfulness |
| Helpfulness | Measured independently | Must coexist with safety |
| Result | Easy to game | Harder to fake |
A model that always refuses can look “safe.” A model that always answers can look “helpful.” Neither is actually useful.
The insight: marginal metrics create false confidence; joint metrics reveal real capability.
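A minimal sketch of the difference, assuming binary per-prompt labels. The data and the joint rule are illustrative; the paper's exact formulation may differ.

```python
# Marginal vs joint scoring over hypothetical per-prompt labels:
# (is_safe, is_helpful) for one model's response to each prompt.
results = [
    (True, False),  # refused a benign request: safe but unhelpful
    (True, True),   # safe and genuinely useful
    (True, False),
    (True, True),
]

safety      = sum(s for s, _ in results) / len(results)        # 1.00
helpfulness = sum(h for _, h in results) / len(results)        # 0.50
joint       = sum(s and h for s, h in results) / len(results)  # 0.50

# An always-refusing model scores a perfect marginal safety number,
# but joint compliance exposes it immediately.
always_refuses = [(True, False)] * len(results)
print(sum(s and h for s, h in always_refuses) / len(always_refuses))  # 0.0
```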
Findings — What actually matters (and what doesn’t)
Key structural failures
| Problem | Business Impact |
|---|---|
| Metric inconsistency | Misleading vendor comparisons |
| Benchmark fragmentation | No clear evaluation standard |
| Stale infrastructure | Hidden reliability risks |
| English bias | Limited global deployment validity |
What the charts (pages 6–7) really show
- Benchmark growth peaked around GPT-4’s release (2023)
- Growth is slowing—suggesting a shift from creation to consolidation
- English dominance (~85%) indicates poor multilingual readiness
This is not just an academic pattern. It is a signal that the industry is transitioning from experimentation to governance—whether it realizes it or not.
Implications — How businesses should respond
1. Stop trusting metric names
“Accuracy,” “safety score,” and “F1” are no longer standardized signals. Treat them as labels, not guarantees.
Ask instead:
- What exactly is being measured?
- Who is judging?
- What is the aggregation logic?
If the answer is vague, the metric is not decision-grade.
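One way to enforce this in procurement is to refuse any metric that arrives without answers to those three questions. A minimal sketch follows; `MetricCard` and its fields are my own naming, not a standard from the paper.

```python
# Force the three questions into structure before a metric is accepted.
from dataclasses import dataclass

@dataclass
class MetricCard:
    name: str         # the label the vendor reports ("accuracy", ...)
    target: str       # what exactly is being measured
    judge: str        # who scores outputs: "human", "model", "heuristic"
    aggregation: str  # how per-item results roll up to one number

def decision_grade(card: MetricCard) -> bool:
    """Decision-grade only if every question has a concrete answer."""
    answers = (card.target, card.judge, card.aggregation)
    return all(a and a.lower() not in {"unknown", "n/a"} for a in answers)

vague = MetricCard("safety score", target="unknown",
                   judge="model", aggregation="mean")
print(decision_grade(vague))  # False: reject until the vendor can answer
```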
2. Evaluate benchmark portfolios, not single scores
A single benchmark is no longer sufficient.
You need a portfolio approach:
| Dimension | Coverage Question |
|---|---|
| Risk types | Are different failure modes tested? |
| Metrics | Are definitions consistent? |
| Language | Is it multilingual? |
| Maintenance | Is it actively updated? |
This mirrors portfolio theory: diversification reduces hidden risk.
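In code, the portfolio check reduces to set arithmetic over the four dimensions in the table above. A sketch under the assumption that each benchmark self-describes its risk types, languages, and maintenance status; the entries and the required risk set are illustrative.

```python
# Audit a benchmark portfolio for coverage gaps. Data is hypothetical.
portfolio = [
    {"name": "bench-a", "risks": {"jailbreak", "toxicity"},
     "languages": {"en"},       "maintained": True},
    {"name": "bench-b", "risks": {"bias"},
     "languages": {"en", "zh"}, "maintained": False},
]

required_risks = {"jailbreak", "toxicity", "bias", "factuality"}

covered_risks = set().union(*(b["risks"] for b in portfolio))
gaps = required_risks - covered_risks
multilingual = any(len(b["languages"]) > 1 for b in portfolio)
maintained_share = sum(b["maintained"] for b in portfolio) / len(portfolio)

print(f"risk gaps: {gaps}")                            # {'factuality'}
print(f"multilingual coverage: {multilingual}")        # True
print(f"actively maintained: {maintained_share:.0%}")  # 50%
```

The point is not the code but the discipline: coverage gaps become visible instead of implied.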
3. Treat evaluation as infrastructure
The paper’s most important recommendation is easy to overlook: benchmark stewardship is part of measurement quality.
If a benchmark is not maintained, documented, and governed, it should not be used for critical decisions.
This reframes evaluation from a one-off task into an operational system.
4. Expect regulation to move here next
Once regulators realize that “AI safety” is being measured inconsistently, standardization will follow.
Not because it is elegant—but because it is necessary.
Think:
- Audit standards for AI outputs
- Certified benchmark registries
- Mandatory metric disclosures
In other words, today’s fragmentation is tomorrow’s compliance framework.
Conclusion — More benchmarks, less certainty
The uncomfortable truth is that AI safety evaluation is not failing due to lack of effort. It is failing because success metrics have outpaced shared meaning.
The industry has built nearly 200 ways to measure safety. It has not agreed on what safety is.
Until that gap is closed, benchmark results will remain persuasive—but not necessarily reliable.
And for businesses, that distinction is where real risk lives.
Cognaptus: Automate the Present, Incubate the Future.