Opening — Why this matters now
After a year of inflated expectations, AI has run into a familiar problem: it can explain strategy better than it can execute it.
Benchmarks—once the currency of AI progress—are increasingly unreliable. Static tests are saturated, interactive benchmarks are fragmented, and most evaluations still collapse performance into a single, almost ceremonial metric: did it win or lose?
That might work for chess. It does not work for a 300-turn geopolitical simulation.
The paper CivBench introduces something more uncomfortable, and more useful: a way to measure whether AI is actually progressing toward a goal, not just occasionally stumbling into one.
Background — The limits of outcome-based intelligence
Most current AI benchmarks suffer from three structural flaws:
| Problem | Description | Business Analogy |
|---|---|---|
| Sparse signals | Only final outcomes are measured | Judging a CEO only by IPO success |
| Short horizon | Tasks are isolated and episodic | Ignoring long-term strategy |
| No competition | Agents operate in isolation | Testing a firm without rivals |
Even advanced agent benchmarks tend to emphasize step-level correctness rather than trajectory quality. The result is predictable: models that can articulate optimal strategies but fail to execute them.
This “knowing–doing gap” is not a bug. It is what happens when evaluation ignores time.
CivBench addresses this by moving evaluation into a fully dynamic environment—Civilization V—which introduces:
- Long-horizon decision-making (hundreds of turns)
- Multi-agent competition (8 players per game)
- Interdependent systems (economy, diplomacy, warfare)
In other words, something closer to actual strategy.
Analysis — Measuring progress instead of outcomes
The core idea is deceptively simple:
Instead of asking who won, estimate who is likely to win at every point in time.
This is implemented through a progress-based evaluation framework:
1. Turn-level victory probability
Machine learning models are trained on game-state features to predict each player’s probability of winning at each turn.
2. Aggregated competitive standing
These probabilities are aggregated into a continuous measure of performance across the game.
3. Cross-game capability estimation
Final capability is derived by fitting Bradley–Terry models to pairwise results across games (Elo-style ratings).
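To make step 3 concrete, here is a minimal sketch of Elo-style rating updates, the online counterpart of a Bradley–Terry fit. The agent names, pairwise results, and K-factor below are illustrative assumptions, not the paper's actual fitting procedure.

```python
def elo_update(ratings, winner, loser, k=32.0):
    """Apply one Elo update from a single pairwise result.

    Elo is the online analogue of a Bradley-Terry model:
    P(winner beats loser) = 1 / (1 + 10 ** ((R_loser - R_winner) / 400)).
    """
    expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected)
    ratings[loser] -= k * (1 - expected)

# Hypothetical pairwise results: each multi-player game is decomposed
# into pairwise comparisons (an 8-player game yields 28 of them).
results = [("llm_a", "vpai"), ("vpai", "null"), ("llm_a", "null"),
           ("vpai", "llm_a"), ("llm_a", "null"), ("vpai", "null")]

ratings = {name: 1500.0 for name in ("llm_a", "vpai", "null")}
for winner, loser in results:
    elo_update(ratings, winner, loser)

# The consistently losing null agent drifts well below its 1500 start.
```

In practice a batch Bradley–Terry maximum-likelihood fit over all pairwise results is more stable than sequential updates, but the ordering it produces is the same idea: strength inferred from who beats whom, not from raw win counts.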
Key Design Shift — From snapshots to trajectories
| Traditional Benchmark | CivBench |
|---|---|
| Final outcome only | Continuous probability tracking |
| Static tasks | Long-horizon gameplay |
| Individual evaluation | Multi-agent competition |
| Binary success | Gradual strategic progress |
This shift allows something rare in AI evaluation: diagnostics.
Not just what happened—but how it unfolded.
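The diagnostic value of a trajectory can be sketched in a few lines. Everything here is an illustrative assumption: the per-turn probabilities, the 1/8 uniform baseline for an 8-player game, and the `trajectory_diagnostics` helper are not from the paper.

```python
def trajectory_diagnostics(win_probs, baseline=1 / 8):
    """Summarize a per-turn win-probability trajectory.

    Returns the mean standing (area under the probability curve),
    the turn at which the player peaked, and the first turn after
    the peak where they fell below the uniform baseline.
    """
    mean_standing = sum(win_probs) / len(win_probs)
    peak_turn = max(range(len(win_probs)), key=win_probs.__getitem__)
    collapse_turn = next(
        (t for t in range(peak_turn, len(win_probs)) if win_probs[t] < baseline),
        None,  # None means the player never collapsed below baseline
    )
    return mean_standing, peak_turn, collapse_turn

# A player who builds a mid-game lead, then loses it:
probs = [0.125, 0.20, 0.30, 0.35, 0.25, 0.15, 0.10, 0.08]
mean_standing, peak_turn, collapse_turn = trajectory_diagnostics(probs)
```

A binary outcome would record this player only as a loss; the trajectory records that they led the game and pinpoints when the lead evaporated.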
Findings — What the benchmark actually reveals
1. Performance is closer than you think
Despite rapid progress in LLMs, none consistently outperform the built-in rule-based AI (VPAI).
| Agent Type | Approx. Elo | Observation |
|---|---|---|
| VPAI (baseline) | ~1500 | Still competitive |
| Top LLMs | ~1490–1503 | Comparable, not superior |
| Null agent | ~1339 | Strategic collapse |
Translation: LLMs have reached competence, not dominance.
2. Architecture matters more than model size
The same model behaves differently depending on its agent setup:
| Model | Simple Setup | Briefed Setup | Effect |
|---|---|---|---|
| Kimi-K2.5 | 1436 | 1503 | +67 |
| Qwen-3.5 | 1421 | 1496 | +75 |
| Sonnet-4.5 | 1497 | 1398 | -99 |
The “briefing” system—where a weaker model summarizes game state—helps some models and harms others.
This is a subtle but critical point:
In agent systems, interfaces matter as much as intelligence.
3. Strategy ≠ capability
One of the more revealing insights is the misalignment between what models choose to pursue and what they are actually good at.
Examples from the study:
- Some models heavily pursue science victory but perform better in diplomatic victory
- Others commit strongly to a strategy even when losing
This suggests that LLM agents exhibit:
- Overcommitment
- Poor adaptation timing
- Reactive rather than proactive strategy shifts
In short, they behave less like generals and more like analysts who panic late.
4. Adaptation is reactive, not strategic
Most models pivot strategies when their win probability is already low.
| Metric | Observation |
|---|---|
| Strategy pivots | 2–6 per game |
| Pivot timing | Typically at low win probability |
| Rule-based baseline (VPAI) | ~19 pivots per game |
This indicates that models are responding to failure, not anticipating it.
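One way to operationalize "responding to failure" is to check the estimated win probability at the moment of each pivot. This is a minimal sketch assuming per-turn strategy labels and win-probability estimates are available; the 0.10 threshold and the example data are hypothetical.

```python
def pivot_reactivity(strategies, win_probs, low=0.10):
    """Classify strategy pivots as reactive (made while already losing).

    `strategies` is a per-turn label of the pursued victory type,
    `win_probs` the matching per-turn win-probability estimates.
    A pivot at turn t counts as reactive if P(win) is already below `low`.
    """
    pivots = [t for t in range(1, len(strategies))
              if strategies[t] != strategies[t - 1]]
    reactive = [t for t in pivots if win_probs[t] < low]
    return pivots, reactive

# An agent that abandons its science push only after collapse:
strategies = ["science"] * 5 + ["diplomacy"] * 3
win_probs = [0.125, 0.15, 0.12, 0.09, 0.06, 0.05, 0.07, 0.08]
pivots, reactive = pivot_reactivity(strategies, win_probs)
```

A proactive strategist would show pivots while win probability is still near or above baseline; under this framing, the paper's finding is that LLM pivots cluster almost entirely in the reactive bucket.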
Implications — What this means outside a video game
CivBench is not really about Civilization V.
It is about how we measure intelligence in systems that operate over time.
1. Outcome-based evaluation is structurally flawed
In business terms:
- A failed startup might have executed excellent strategy
- A successful one might have been lucky
Evaluating only outcomes hides both.
2. Agent design is now a first-class variable
The study shows that:
- Same model + different setup → drastically different performance
For enterprises, this implies:
| Layer | Importance |
|---|---|
| Foundation model | Necessary |
| Agent architecture | Critical |
| Interface design | Often decisive |
The market narrative is still fixated on models. The real leverage is shifting elsewhere.
3. Progress-based metrics are transferable
The methodology can be applied to:
- Autonomous trading systems
- Supply chain optimization
- Strategic planning tools
- Multi-agent simulations
Anywhere decisions unfold over time, binary evaluation is insufficient.
4. AI is not yet a strategist
The paper itself notes that these systems are not ready for real-world strategic deployment.
Observed behaviors include:
- Overcommitment to single strategies
- Late-stage panic pivots
- Misalignment between preference and capability
These are not edge cases—they are structural patterns.
Conclusion — Intelligence is a trajectory, not a result
CivBench forces a reframing.
Instead of asking:
“Did the AI win?”
It asks:
“Was the AI on track to win, and when did it stop being on track?”
That difference sounds minor. It is not.
It turns evaluation from a scoreboard into a narrative.
And in strategy—whether in games, markets, or organizations—the narrative is where intelligence actually lives.
Cognaptus: Automate the Present, Incubate the Future.