Opening — Why this matters now
After a year of inflated expectations, AI has run into a familiar problem: it can explain strategy better than it can execute it.
Benchmarks—once the currency of AI progress—are increasingly unreliable. Static tests are saturated, interactive benchmarks are fragmented, and most evaluations still collapse performance into a single, almost ceremonial metric: did it win or lose?
That might work for chess. It does not work for a 300-turn geopolitical simulation.
The paper CivBench introduces something more uncomfortable, and more useful: a way to measure whether AI is actually progressing toward a goal, not just occasionally stumbling into one.
Background — The limits of outcome-based intelligence
Most current AI benchmarks suffer from three structural flaws:
| Problem | Description | Business Analogy |
|---|---|---|
| Sparse signals | Only final outcomes are measured | Judging a CEO only by IPO success |
| Short horizon | Tasks are isolated and episodic | Ignoring long-term strategy |
| No competition | Agents operate in isolation | Testing a firm without rivals |
Even advanced agent benchmarks tend to emphasize step-level correctness rather than trajectory quality. The result is predictable: models that can articulate optimal strategies but fail to execute them.
This “knowing–doing gap” is not a bug. It is what happens when evaluation ignores time.
CivBench addresses this by moving evaluation into a fully dynamic environment—Civilization V—which introduces:
- Long-horizon decision-making (hundreds of turns)
- Multi-agent competition (8 players per game)
- Interdependent systems (economy, diplomacy, warfare)
In other words, something closer to actual strategy.
Analysis — Measuring progress instead of outcomes
The core idea is deceptively simple:
Instead of asking who won, estimate who is likely to win at every point in time.
This is implemented through a progress-based evaluation framework:
1. Turn-level victory probability
Machine learning models are trained on game-state features to predict each player’s probability of winning at each turn.
2. Aggregated competitive standing
These probabilities are aggregated into a continuous measure of performance across the game.
3. Cross-game capability estimation
Final capability is derived by fitting Bradley–Terry models to pairwise results across games (Elo-style ratings).
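To make step 3 concrete, here is a minimal sketch of Elo-style rating updates, the online counterpart of a Bradley–Terry fit. The agent names, pairwise results, and K-factor below are illustrative assumptions, not the paper's actual fitting procedure.

```python
def elo_update(ratings, winner, loser, k=32.0):
    """Apply one Elo update from a single pairwise result.

    Elo is the online analogue of a Bradley-Terry model:
    P(winner beats loser) = 1 / (1 + 10 ** ((R_loser - R_winner) / 400)).
    """
    expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected)
    ratings[loser] -= k * (1 - expected)

# Hypothetical pairwise results: each multi-player game is decomposed
# into pairwise comparisons (an 8-player game yields 28 of them).
results = [("llm_a", "vpai"), ("vpai", "null"), ("llm_a", "null"),
           ("vpai", "llm_a"), ("llm_a", "null"), ("vpai", "null")]

ratings = {name: 1500.0 for name in ("llm_a", "vpai", "null")}
for winner, loser in results:
    elo_update(ratings, winner, loser)

# The consistently losing null agent drifts well below its 1500 start.
```

In practice a batch Bradley–Terry maximum-likelihood fit over all pairwise results is more stable than sequential updates, but the ordering it produces is the same idea: strength inferred from who beats whom, not from raw win counts.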
Key Design Shift — From snapshots to trajectories
| Traditional Benchmark | CivBench |
|---|---|
| Final outcome only | Continuous probability tracking |
| Static tasks | Long-horizon gameplay |
| Individual evaluation | Multi-agent competition |
| Binary success | Gradual strategic progress |
This shift allows something rare in AI evaluation: diagnostics.
Not just what happened—but how it unfolded.
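The diagnostic value of a trajectory can be sketched in a few lines. Everything here is an illustrative assumption: the per-turn probabilities, the 1/8 uniform baseline for an 8-player game, and the `trajectory_diagnostics` helper are not from the paper.

```python
def trajectory_diagnostics(win_probs, baseline=1 / 8):
    """Summarize a per-turn win-probability trajectory.

    Returns the mean standing (area under the probability curve),
    the turn at which the player peaked, and the first turn after
    the peak where they fell below the uniform baseline.
    """
    mean_standing = sum(win_probs) / len(win_probs)
    peak_turn = max(range(len(win_probs)), key=win_probs.__getitem__)
    collapse_turn = next(
        (t for t in range(peak_turn, len(win_probs)) if win_probs[t] < baseline),
        None,  # None means the player never collapsed below baseline
    )
    return mean_standing, peak_turn, collapse_turn

# A player who builds a mid-game lead, then loses it:
probs = [0.125, 0.20, 0.30, 0.35, 0.25, 0.15, 0.10, 0.08]
mean_standing, peak_turn, collapse_turn = trajectory_diagnostics(probs)
```

A binary outcome would record this player only as a loss; the trajectory records that they led the game and pinpoints when the lead evaporated.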
Findings — What the benchmark actually reveals
1. Performance is closer than you think
Despite rapid progress in LLMs, none consistently outperform the built-in rule-based AI (VPAI).
| Agent Type | Approx. Elo | Observation |
|---|---|---|
| VPAI (baseline) | ~1500 | Still competitive |
| Top LLMs | ~1490–1503 | Comparable, not superior |
| Null agent | ~1339 | Strategic collapse |
Translation: LLMs have reached competence, not dominance.
2. Architecture matters more than model size
The same model behaves differently depending on its agent setup:
| Model | Simple Setup | Briefed Setup | Effect |
|---|---|---|---|
| Kimi-K2.5 | 1436 | 1503 | +67 |
| Qwen-3.5 | 1421 | 1496 | +75 |
| Sonnet-4.5 | 1497 | 1398 | -99 |
The “briefing” system—where a weaker model summarizes game state—helps some models and harms others.
This is a subtle but critical point:
In agent systems, interfaces matter as much as intelligence.
3. Strategy ≠ capability
One of the more revealing insights is the misalignment between what models choose to pursue and what they are actually good at.
Examples from the study:
- Some models heavily pursue science victory but perform better in diplomatic victory
- Others commit strongly to a strategy even when losing
This suggests that LLM agents exhibit:
- Overcommitment
- Poor adaptation timing
- Reactive rather than proactive strategy shifts
In short, they behave less like generals and more like analysts who panic late.
4. Adaptation is reactive, not strategic
Most models pivot strategies when their win probability is already low.
| Metric | Observation |
|---|---|
| Strategy pivots | 2–6 per game |
| Pivot timing | Typically at low win probability |
| Rule-based baseline (VPAI) | ~19 pivots per game |
This indicates that models are responding to failure, not anticipating it.
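One way to operationalize "responding to failure" is to check the estimated win probability at the moment of each pivot. This is a minimal sketch assuming per-turn strategy labels and win-probability estimates are available; the 0.10 threshold and the example data are hypothetical.

```python
def pivot_reactivity(strategies, win_probs, low=0.10):
    """Classify strategy pivots as reactive (made while already losing).

    `strategies` is a per-turn label of the pursued victory type,
    `win_probs` the matching per-turn win-probability estimates.
    A pivot at turn t counts as reactive if P(win) is already below `low`.
    """
    pivots = [t for t in range(1, len(strategies))
              if strategies[t] != strategies[t - 1]]
    reactive = [t for t in pivots if win_probs[t] < low]
    return pivots, reactive

# An agent that abandons its science push only after collapse:
strategies = ["science"] * 5 + ["diplomacy"] * 3
win_probs = [0.125, 0.15, 0.12, 0.09, 0.06, 0.05, 0.07, 0.08]
pivots, reactive = pivot_reactivity(strategies, win_probs)
```

A proactive strategist would show pivots while win probability is still near or above baseline; under this framing, the paper's finding is that LLM pivots cluster almost entirely in the reactive bucket.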
Implications — What this means outside a video game
CivBench is not really about Civilization V.
It is about how we measure intelligence in systems that operate over time.
1. Outcome-based evaluation is structurally flawed
In business terms:
- A failed startup might have executed excellent strategy
- A successful one might have been lucky
Evaluating only outcomes hides both.
2. Agent design is now a first-class variable
The study shows that:
- Same model + different setup → drastically different performance
For enterprises, this implies:
| Layer | Importance |
|---|---|
| Foundation model | Necessary |
| Agent architecture | Critical |
| Interface design | Often decisive |
The market narrative is still fixated on models. The real leverage is shifting elsewhere.
3. Progress-based metrics are transferable
The methodology can be applied to:
- Autonomous trading systems
- Supply chain optimization
- Strategic planning tools
- Multi-agent simulations
Anywhere decisions unfold over time, binary evaluation is insufficient.
4. AI is not yet a strategist
The paper itself notes that these systems are not ready for real-world strategic deployment.
Observed behaviors include:
- Overcommitment to single strategies
- Late-stage panic pivots
- Misalignment between preference and capability
These are not edge cases—they are structural patterns.
Conclusion — Intelligence is a trajectory, not a result
CivBench forces a reframing.
Instead of asking:
“Did the AI win?”
It asks:
“Was the AI on track to win, and when did it stop being on track?”
That difference sounds minor. It is not.
It turns evaluation from a scoreboard into a narrative.
And in strategy—whether in games, markets, or organizations—the narrative is where intelligence actually lives.
Cognaptus: Automate the Present, Incubate the Future.