Opening — Why this matters now

After a year of inflated expectations, AI has run into a familiar problem: it can explain strategy better than it can execute it.

Benchmarks—once the currency of AI progress—are increasingly unreliable. Static tests are saturated, interactive benchmarks are fragmented, and most evaluations still collapse performance into a single, almost ceremonial metric: did it win or lose?

That might work for chess. It does not work for a 300-turn geopolitical simulation.

The paper CivBench introduces something more uncomfortable—and more useful: a way to measure whether AI is actually progressing toward a goal, not just occasionally stumbling into one.


Background — The limits of outcome-based intelligence

Most current AI benchmarks suffer from three structural flaws:

| Problem | Description | Business Analogy |
|---|---|---|
| Sparse signals | Only final outcomes are measured | Judging a CEO only by IPO success |
| Short horizon | Tasks are isolated and episodic | Ignoring long-term strategy |
| No competition | Agents operate in isolation | Testing a firm without rivals |

Even advanced agent benchmarks tend to emphasize step-level correctness rather than trajectory quality. The result is predictable: models that can articulate optimal strategies but fail to execute them.

This “knowing–doing gap” is not a bug. It is what happens when evaluation ignores time.

CivBench addresses this by moving evaluation into a fully dynamic environment—Civilization V—which introduces:

  • Long-horizon decision-making (hundreds of turns)
  • Multi-agent competition (8 players per game)
  • Interdependent systems (economy, diplomacy, warfare)

In other words, something closer to actual strategy.


Analysis — Measuring progress instead of outcomes

The core idea is deceptively simple:

Instead of asking who won, estimate who is likely to win at every point in time.

This is implemented through a progress-based evaluation framework:

1. Turn-level victory probability

Machine learning models are trained on game-state features to predict each player’s probability of winning at each turn.
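A minimal sketch of what such a predictor could look like, assuming a simple logistic model over synthetic stand-in features. The feature set, model family, and per-game normalization below are illustrative assumptions, not the paper's exact setup:

```python
# Sketch: a per-turn win-probability model. Features, labels, and the
# model family here are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training data: one row per (game, turn, player) with
# game-state features (science, gold, military, cities, ...), labeled
# 1 if that player eventually won the game.
X = rng.normal(size=(5000, 4))  # stand-in features
y = (X @ np.array([0.8, 0.5, 0.3, 0.1]) + rng.normal(size=5000) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Evaluation time: P(win | state at turn t) for all 8 players in one game,
# normalized so the probabilities within the game sum to 1.
turn_state = rng.normal(size=(8, 4))
win_prob = model.predict_proba(turn_state)[:, 1]
win_prob /= win_prob.sum()
print(win_prob.round(3))
```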

2. Aggregated competitive standing

These probabilities are aggregated into a continuous measure of performance across the game.
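One natural aggregator, offered here as an assumption rather than the paper's exact choice, is the time-averaged win probability: the normalized area under a player's win-probability trajectory.

```python
import numpy as np

def competitive_standing(win_probs: np.ndarray) -> float:
    """Collapse a player's turn-by-turn win probabilities into one score.

    Assumption: the time-averaged win probability (normalized area under
    the trajectory). The paper may weight turns differently.
    """
    return float(np.mean(win_probs))

# A player who climbs steadily outscores one whose early lead erodes,
# even though neither ultimately wins the 300-turn game.
steady = np.linspace(0.10, 0.45, 300)
collapse = np.concatenate([np.linspace(0.10, 0.45, 100),
                           np.linspace(0.45, 0.05, 200)])
print(competitive_standing(steady), competitive_standing(collapse))
```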

3. Cross-game capability estimation

Final capability is estimated by fitting Bradley–Terry models to pairwise comparisons between agents, yielding Elo-style ratings.
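For concreteness, here is a compact Bradley–Terry fit using the standard minorization–maximization update, mapped onto an Elo-like scale. The win counts are toy data, and the 1500-centered scaling is an assumption for illustration:

```python
import numpy as np

def bradley_terry_elo(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = how often agent i beat agent j. Uses the classic
    minorization-maximization update, then converts strengths to an
    Elo-like scale (an illustrative choice, centered at 1500).
    """
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p[i] = total_wins / denom
        p /= p.sum()
    # 400 * log10 spacing, centered on the geometric mean of strengths.
    return 1500 + 400 * np.log10(p / np.exp(np.log(p).mean()))

# Toy win counts for three agents; the numbers are illustrative only.
wins = np.array([[0, 12, 20],
                 [8, 0, 15],
                 [5, 10, 0]], dtype=float)
print(bradley_terry_elo(wins).round(0))
```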


Key Design Shift — From snapshots to trajectories

| Traditional Benchmark | CivBench |
|---|---|
| Final outcome only | Continuous probability tracking |
| Static tasks | Long-horizon gameplay |
| Individual evaluation | Multi-agent competition |
| Binary success | Gradual strategic progress |

This shift allows something rare in AI evaluation: diagnostics.

Not just what happened—but how it unfolded.


Findings — What the benchmark actually reveals

1. Performance is closer than you think

Despite rapid progress in LLMs, none of the tested models consistently outperforms the built-in rule-based AI (VPAI).

| Agent Type | Approx. Elo | Observation |
|---|---|---|
| VPAI (baseline) | ~1500 | Still competitive |
| Top LLMs | ~1490–1503 | Comparable, not superior |
| Null agent | ~1339 | Strategic collapse |

Translation: LLMs have reached competence, not dominance.


2. Architecture matters more than model size

The same model behaves differently depending on its agent setup:

| Model | Simple Setup | Briefed Setup | Effect |
|---|---|---|---|
| Kimi-K2.5 | 1436 | 1503 | +67 |
| Qwen-3.5 | 1421 | 1496 | +75 |
| Sonnet-4.5 | 1497 | 1398 | -99 |

The “briefing” system—where a weaker model summarizes the game state for the main model—helps some models and harms others.

This is a subtle but critical point:

In agent systems, interfaces matter as much as intelligence.
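To make that concrete, here is a minimal sketch of the two agent setups being compared, with placeholder model calls. The class, prompts, and function names below are illustrative assumptions, not the paper's actual interface:

```python
# Sketch of "simple" vs "briefed" agent pipelines. All names and prompts
# are hypothetical; the paper's actual briefing mechanism may differ.
from dataclasses import dataclass

@dataclass
class LLM:
    name: str

    def complete(self, prompt: str) -> str:
        # Placeholder for a real model call (e.g., an API request).
        return f"[{self.name}: response to {len(prompt)}-char prompt]"

def simple_turn(raw_state: str, strategist: LLM) -> str:
    # The strategist sees the full, raw game state.
    return strategist.complete(f"Full game state:\n{raw_state}\nChoose the next action.")

def briefed_turn(raw_state: str, briefer: LLM, strategist: LLM) -> str:
    # A cheaper model compresses the state; the strategist acts on the brief.
    brief = briefer.complete(f"Summarize this game state for a strategist:\n{raw_state}")
    return strategist.complete(f"Briefing:\n{brief}\nChoose the next action.")

briefer, strategist = LLM("small-summarizer"), LLM("frontier-strategist")
state = "Turn 120: 4 cities, at war with Rome, leading in science..."
print(simple_turn(state, strategist))
print(briefed_turn(state, briefer, strategist))
```

The benchmark's finding, restated: swapping which of these two wrappers surrounds the same strategist shifted its rating by roughly 70–100 Elo points, in either direction.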


3. Strategy ≠ capability

One of the more revealing insights is the misalignment between what models choose to pursue and what they are actually good at.

Examples from the study:

  • Some models heavily pursue a science victory even though they perform better at diplomatic victories
  • Others stay committed to a single strategy even when they are clearly losing

This suggests that LLM agents exhibit:

  • Overcommitment
  • Poor adaptation timing
  • Reactive rather than proactive strategy shifts

In short, they behave less like generals and more like analysts who panic late.


4. Adaptation is reactive, not strategic

Most models pivot strategies when their win probability is already low.

| Metric | Observation |
|---|---|
| Strategy pivots | 2–6 per game |
| Pivot timing | Typically at already-low win probability |
| VPAI (rule-based baseline) | ~19 pivots per game |

This indicates that models are responding to failure, not anticipating it.
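For flavor, a hypothetical version of that pivot-timing diagnostic: record the win probability at every turn where the agent's declared strategy changes. The definition of a pivot here is an assumption, not necessarily the paper's:

```python
import numpy as np

def pivot_win_probs(strategies: list[str], win_prob: np.ndarray) -> list[float]:
    """Win probability at each turn where the declared strategy changes.

    A hypothetical diagnostic in the spirit of the paper's pivot-timing
    analysis; the authors' exact definition may differ.
    """
    return [float(win_prob[t]) for t in range(1, len(strategies))
            if strategies[t] != strategies[t - 1]]

# Toy trajectory: the agent abandons 'science' only after its win
# probability has already collapsed, i.e. a reactive pivot.
strategies = ["science"] * 200 + ["diplomacy"] * 100
win_prob = np.concatenate([np.linspace(0.40, 0.08, 200), np.full(100, 0.10)])
print(pivot_win_probs(strategies, win_prob))  # -> [0.1]
```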


Implications — What this means outside a video game

CivBench is not really about Civilization V.

It is about how we measure intelligence in systems that operate over time.

1. Outcome-based evaluation is structurally flawed

In business terms:

  • A failed startup might have executed excellent strategy
  • A successful one might have been lucky

Evaluating only outcomes hides both.


2. Agent design is now a first-class variable

The study shows that:

  • Same model + different setup → drastically different performance

For enterprises, this implies:

| Layer | Importance |
|---|---|
| Foundation model | Necessary |
| Agent architecture | Critical |
| Interface design | Often decisive |

The market narrative is still fixated on models. The real leverage is shifting elsewhere.


3. Progress-based metrics are transferable

The methodology can be applied to:

  • Autonomous trading systems
  • Supply chain optimization
  • Strategic planning tools
  • Multi-agent simulations

Anywhere decisions unfold over time, binary evaluation is insufficient.


4. AI is not yet a strategist

The paper itself notes that these systems are not ready for real-world strategic deployment.

Observed behaviors include:

  • Overcommitment to single strategies
  • Late-stage panic pivots
  • Misalignment between preference and capability

These are not edge cases—they are structural patterns.


Conclusion — Intelligence is a trajectory, not a result

CivBench forces a reframing.

Instead of asking:

“Did the AI win?”

It asks:

“Was the AI on track to win—and when did it stop being?”

That difference sounds minor. It is not.

It turns evaluation from a scoreboard into a narrative.

And in strategy—whether in games, markets, or organizations—the narrative is where intelligence actually lives.


Cognaptus: Automate the Present, Incubate the Future.