LLMs are great at spitting out answers—but are they any good at thinking through problems? A new benchmark, AdvGameBench, introduces a process-based evaluation approach that places LLMs into three rule-based strategic games to measure not just whether they win, but how well they reason along the way. Developed by Yuan et al., the framework focuses on how LLMs plan, revise, and make resource-limited decisions in dynamic settings.

Three Games, Three Cognitive Demands

1. Tower Defense tests spatial planning and rule-following. Models place defenders on a battlefield to block enemies—positioning, cooldowns, and cost management are key.

2. Battle Card Game emphasizes team composition under uncertainty. Each model buys units within a budget, balancing attack and defense while anticipating initiative rules and synergy effects; a minimal sketch of such a budget-constrained purchase follows this list.

3. Turn-Based Combat challenges long-term strategic adaptation. With elemental affinities (Fire, Water, Light, etc.) and skill cycles, models must allocate limited skill points to plan multi-turn tactics.
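
To make the resource constraints concrete, here is a minimal Python sketch of how a budget-limited unit purchase might be represented and checked. The class names, fields, and numbers are illustrative assumptions, not AdvGameBench's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical representation of a budget-constrained purchase, in the spirit
# of the Battle Card Game above. Names and numbers are illustrative only.

@dataclass
class Unit:
    name: str
    cost: int
    attack: int
    defense: int

@dataclass
class Squad:
    budget: int
    units: list[Unit] = field(default_factory=list)

    @property
    def total_cost(self) -> int:
        return sum(u.cost for u in self.units)

    def over_budget(self) -> bool:
        """True if the squad exceeds its budget, i.e. a rule violation."""
        return self.total_cost > self.budget


# Example: a model "buys" three units against a budget of 10 and overspends.
squad = Squad(budget=10, units=[
    Unit("archer", 3, 4, 1),
    Unit("knight", 5, 3, 5),
    Unit("healer", 3, 1, 2),
])
print(squad.total_cost, squad.over_budget())  # 11 True
```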

Models Compared: Some Think, Others Panic

Twelve models were evaluated: ChatGPT variants (4.1, 4o, o3, o3-mini), Claude-3.5, DeepSeek R1/V3, Gemini 2/2.5 Flash, LLaMA-3-70B, and Qwen Max/Plus.

  • 🏆 ChatGPT-o3-mini excelled with:

    • 74.7% win rate, the highest among all models.
    • 78.6% Correction Success Rate: roughly four out of five of its revisions improved outcomes.
    • 0% Over-Budget Rate, showing total adherence to rules.
    • A positive improvement slope (+0.041), getting better across matches.
  • 🚨 Qwen-Plus showed the opposite:

    • 81.6% Over-Correction Risk Rate, constantly revising even when unnecessary.
    • Only 24.3% of those revisions helped.
    • 50% Over-Budget Rate, violating cost constraints half the time.
    • 25.6% win rate, among the lowest.

These behaviors point to a pattern: smart revision is rare. The best models revised less often, but to better effect.

Metrics that Expose the Process

AdvGameBench introduces process-aware metrics:

  • Win Rate (WR): Did the model succeed?
  • Over-Correction Risk Rate (ORR): How often did it revise after feedback?
  • Correction Success Rate (CSR): Did those changes help?
  • Improvement Slope (β): Did it learn across rounds?
  • Over-Budget Rate (OBR): Did it stay within cost limits?

These combine to show whether the model thinks, adapts, and obeys constraints—not just whether it guesses right.
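
As a rough illustration of how such process metrics could be computed from per-round logs, here is a minimal Python sketch. The log fields (won, revised, revision_helped, over_budget) and the exact formulas are assumptions for illustration; the paper's own definitions may differ in detail.

```python
from dataclasses import dataclass

# Hypothetical per-round log entry; field names are illustrative assumptions,
# not AdvGameBench's actual data format.
@dataclass
class Round:
    won: bool              # did the model win this round?
    revised: bool          # did it revise its strategy after feedback?
    revision_helped: bool  # if it revised, did the revision improve the outcome?
    over_budget: bool      # did its submission exceed the cost limit?

def process_metrics(rounds: list[Round]) -> dict[str, float]:
    n = len(rounds)
    wins = [1.0 if r.won else 0.0 for r in rounds]
    revisions = [r for r in rounds if r.revised]

    # Improvement Slope (beta): least-squares slope of win/loss over round index,
    # a crude proxy for "did it get better across matches?"
    x_mean = (n - 1) / 2
    y_mean = sum(wins) / n
    denom = sum((i - x_mean) ** 2 for i in range(n)) or 1.0
    beta = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(wins)) / denom

    return {
        "WR": sum(wins) / n,                            # Win Rate
        "ORR": len(revisions) / n,                      # Over-Correction Risk Rate
        "CSR": (sum(r.revision_helped for r in revisions) / len(revisions))
               if revisions else 0.0,                   # Correction Success Rate
        "OBR": sum(r.over_budget for r in rounds) / n,  # Over-Budget Rate
        "beta": beta,                                   # Improvement Slope
    }
```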

Budget Discipline Predicts Strategy

Models that respected budgets—especially ChatGPT-o3 and o3-mini—achieved the highest win rates. A Pearson correlation of –0.95 between OBR and WR suggests a clear link: if you overspend, you underperform.

In contrast, Qwen-Plus and Qwen-Max routinely broke limits and lost games, showing that constraint adherence is not optional—it’s essential.
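
For reference, the –0.95 figure is a standard Pearson correlation computed over per-model (OBR, WR) pairs. A self-contained version is sketched below; the per-model numbers from the paper's results are not reproduced here.

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length samples,
    e.g. per-model Over-Budget Rates vs. Win Rates."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Applied to the twelve models' (OBR, WR) pairs, the paper reports r of about -0.95:
# the more often a model blows its budget, the less often it wins.
```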

Revision ≠ Intelligence

Models that revised frequently tended to perform worse. ORR negatively correlated with win rate, slope, and CSR. In essence: more edits often mean worse reasoning.

The top performers, like o3-mini, revised only when it mattered—and succeeded when they did.

Memory Biases Still Haunt Us

Some models hallucinated “peashooters” in the tower defense game, a term never introduced in the task. Why? Pretraining exposure to Plants vs. Zombies. This illustrates a major flaw: retrieval overrides reasoning. AdvGameBench deliberately scrubs such biases by using custom environments that are unlikely to appear in the models’ training data.

Why This Matters

AdvGameBench isn’t just another leaderboard. It’s a shift in how we evaluate LLMs: from output-based accuracy to process-based quality. Real-world systems need more than answers—they need reliability, strategic discipline, and the ability to adapt under pressure.

This framework shows how LLMs can (or can’t) plan ahead, recover from failure, and respect limits. And that’s what will separate tools from teammates.


Cognaptus: Automate the Present, Incubate the Future.