Thinking Inside the Gameboard: Evaluating LLM Reasoning Step-by-Step

TL;DR for operators

Most AI evaluations still ask too narrow a question: did the model get the answer right? That is useful, but it is not enough when the model is expected to act as an agent, revise plans, obey constraints, and recover from failure without turning the workflow into a procedural bonfire.

The paper behind this article, Tracing LLM Reasoning Processes with Strategic Games, introduces AdvGameBench, a benchmark that puts LLMs into closed, rule-based strategic games so their reasoning process becomes observable rather than politely hidden behind a final answer.¹ The key move is not “games are fun”. The key move is that games create replayable decision traces: initial plan, simulator outcome, feedback, revision, budget use, rule violations, and final result.

The operational lesson is sharp: more revision is not automatically better reasoning. ChatGPT-o3-mini posts the strongest average win rate in the reported benchmark table at 74.7%, paired with the highest average correction success rate at 78.6%. Qwen-Plus, meanwhile, revises aggressively, with an average over-correction risk rate of 81.6%, but wins only 25.6% of matches and overspends in roughly half of its turns. In other words, “try again” is not a strategy. It is sometimes just failure with extra API calls.

For business use, the paper points toward a better agent evaluation stack. Procurement teams, AI governance teams, and product owners should not only test whether an agent can complete a task. They should inspect whether it plans before acting, whether its corrections actually improve the state, whether it obeys budgets and rules, and whether it becomes more stable or more chaotic after feedback.

The boundary: AdvGameBench is controlled, synthetic, adversarial, and turn-based. It does not prove that a model will behave well in customer service, finance operations, procurement approval, software engineering, or logistics planning. But it gives a practical diagnostic template for those environments: evaluate the trail, not just the trophy.

The useful trick is turning reasoning into a replay

An LLM’s final answer is often a bad witness. It may be right for the wrong reason, wrong after a promising start, or correct only because the task was forgiving. That is inconvenient for benchmarks, but it is fatal for operations.

A deployed agent does not merely answer. It sequences actions. It uses budget. It revises after feedback. It decides whether a failure requires a small correction or a full re-plan. It may have to satisfy policy constraints, customer constraints, technical constraints, and commercial constraints all at once. The interesting question is no longer “can it produce a good-looking response?” The interesting question is whether its behaviour remains coherent while reality pushes back.

AdvGameBench makes that behaviour visible by using strategic games as a controlled laboratory. Each model receives explicit rules, produces a strategy, has that strategy executed by a simulator, receives outcome feedback, and may revise. This loop repeats over rounds. The environment is closed enough to score automatically, but rich enough to expose planning, adaptation, and constraint handling.

That is the mechanism-first value of the paper. The games are not the intellectual centre. The trace is.
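The loop described above (rules, strategy, simulation, feedback, revision) can be sketched in a few lines. This is a minimal illustration, not the paper's harness: `RoundRecord`, `run_episode`, the toy model, the toy simulator, and the unit names are all hypothetical, and treating an over-budget plan as an automatic loss is an assumption modelled on the forfeiture rule described for win rate.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Plan = Dict[str, int]  # hypothetical representation: unit name -> quantity


@dataclass
class RoundRecord:
    round_index: int
    plan: Plan
    cost: int
    over_budget: bool
    won: bool
    revised: bool  # True when the plan differs from the previous round's plan


def run_episode(
    propose: Callable[[Optional[RoundRecord]], Plan],
    simulate: Callable[[Plan], bool],
    unit_cost: Dict[str, int],
    budget: int,
    rounds: int,
) -> List[RoundRecord]:
    """Run the propose -> simulate -> feedback loop and keep every step."""
    trace: List[RoundRecord] = []
    last: Optional[RoundRecord] = None
    for i in range(rounds):
        plan = propose(last)  # the previous record doubles as feedback
        cost = sum(unit_cost[unit] * qty for unit, qty in plan.items())
        over = cost > budget
        # Assumption: an over-budget plan forfeits the round.
        won = (not over) and simulate(plan)
        revised = last is not None and plan != last.plan
        last = RoundRecord(i, plan, cost, over, won, revised)
        trace.append(last)
    return trace


# Toy stand-ins so the sketch runs end to end (all values are made up).
UNIT_COST = {"archer": 3, "knight": 5}


def toy_model(last: Optional[RoundRecord]) -> Plan:
    if last is None:
        return {"knight": 3}   # opening plan costs 15, above the budget
    if last.over_budget:
        return {"archer": 2}   # cheaper revision after negative feedback
    return last.plan           # keep a plan that did not break the budget


def toy_simulator(plan: Plan) -> bool:
    return plan.get("archer", 0) >= 2


trace = run_episode(toy_model, toy_simulator, UNIT_COST, budget=10, rounds=3)
for record in trace:
    print(record)
```

The design point is that nothing is scored inside the loop. The loop only records; every metric is computed afterwards from the trace, which is what makes the run replayable.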

A normal benchmark collapses the work into one number. AdvGameBench stretches the work back out into a sequence. Once the sequence is visible, a model’s apparent intelligence can be decomposed into behaviours: did it begin with a viable plan, did it overreact to failure, did its revision help, did it respect the budget, did it improve across rounds, and did it behave differently when moving first or second?

For enterprise AI, that decomposition is the real asset. It turns agent evaluation from a beauty contest into an audit log.

AdvGameBench tests three kinds of operational competence

The benchmark uses three game environments, each selected to stress a different part of process-level reasoning.

Tower Defense asks models to place defenders against incoming attackers. This stresses spatial planning, sequencing, unit selection, and rule compliance. A model must understand not only what units are strong, but where and when they matter.

Battle Card Game asks models to assemble units under composition and initiative rules. This stresses resource allocation and strategic trade-offs under uncertainty. Buying the strongest available component is not automatically good if the resulting composition collapses under the battle rules. A familiar problem, as anyone who has watched enterprise software procurement can confirm.

Turn-Based Attribute Game asks each side to choose characters and skills with elemental relationships and cost constraints. This stresses multi-step consistency: a plan must survive across repeated interactions, not merely look plausible at the first move.

Together, the environments cover three operationally relevant behaviours:

Game environment	Main cognitive pressure	Business analogue
Tower Defense	Spatial and sequential planning under threats	Scheduling, resource placement, routing, incident response
Battle Card Game	Composition and allocation under rule uncertainty	Team assembly, portfolio construction, vendor/tool selection
Turn-Based Attribute Game	Multi-step adaptation with constraints	Workflow automation, negotiation sequences, policy-bound decisioning

The point is not that businesses should benchmark agents by asking them to play games before approving invoices. Tempting, but no. The point is that games create a clean proxy for behaviours that appear in real work: planning before execution, revising after feedback, and staying inside hard limits.

The metric stack separates success from competence

The paper’s strongest contribution is its metric design. Win rate remains present, but it is deliberately demoted from “the whole story” to “one observable outcome among several”. That matters because two models can win for very different reasons, and two models can lose in ways that imply very different operational risks.

The core metrics are:

Metric	What it measures	Why operators should care
Win Rate (WR)	The proportion of matches a model wins, with rule violations causing immediate forfeiture	Final task success still matters; nobody is buying philosophical elegance by the seat
Over-Correction Risk Rate (ORR)	How often a model revises after negative feedback	High reactivity can signal instability rather than learning
Correction Success Rate (CSR)	How often a revision improves the result, such as removing a violation or turning a loss into a win	Measures whether the model can recover, not merely flail
Improvement Slope ($\beta$)	Whether performance improves across repeated interactions	Captures adaptation over time rather than one-shot cleverness
Over-Budget Rate (OBR)	How often proposals exceed explicit resource constraints	Tests whether the model can internalise hard limits, not just optimise in fantasy mode

The distinction between ORR and CSR is especially important. ORR asks: does the model change its mind after failure? CSR asks: was changing its mind useful?

That split is where the paper becomes valuable for AI operations. Many agent demos reward visible effort. The agent tries, retries, calls tools, rewrites plans, explains itself, and generally performs the theatre of diligence. But if those revisions do not improve the state, the behaviour is not diligence. It is churn.
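To make the ORR/CSR split concrete, here is a rough sketch of how the four rate metrics could be computed from a logged trace. The exact formulas in the paper are not reproduced in this article, so the definitions below are assumptions: ORR as revisions divided by negative-feedback events, CSR as improving revisions divided by revisions, and "improving" as either turning a loss into a win or removing a budget violation. The `Round` type and the sample trace are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Round(NamedTuple):
    won: bool
    over_budget: bool
    revised: bool  # plan changed relative to the previous round


def process_metrics(trace: List[Round]) -> Dict[str, float]:
    """Assumed definitions of WR, ORR, CSR and OBR over one trace."""
    failures = revisions = repairs = 0
    for prev, cur in zip(trace, trace[1:]):
        prev_failed = (not prev.won) or prev.over_budget
        if not prev_failed:
            continue
        failures += 1
        if cur.revised:
            revisions += 1
            turned_loss_into_win = cur.won and not prev.won
            removed_violation = prev.over_budget and not cur.over_budget
            if turned_loss_into_win or removed_violation:
                repairs += 1

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    n = len(trace)
    return {
        "WR": ratio(sum(r.won for r in trace), n),
        "ORR": ratio(revisions, failures),   # how often it changes its mind
        "CSR": ratio(repairs, revisions),    # how often the change helped
        "OBR": ratio(sum(r.over_budget for r in trace), n),
    }


# Made-up trace: one over-budget loss, two further losses, then two wins.
sample = [
    Round(won=False, over_budget=True, revised=False),
    Round(won=False, over_budget=False, revised=True),
    Round(won=False, over_budget=False, revised=True),
    Round(won=True, over_budget=False, revised=True),
    Round(won=True, over_budget=False, revised=False),
]
metrics = process_metrics(sample)
print(metrics)
```

In this toy trace the model revises after every failure, so ORR is 1.0, but only two of the three revisions repair anything. That gap between the two numbers is the churn the section describes.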

The headline result is not that one model won; it is how it won

The benchmark evaluates 12 models: ChatGPT-4.1, ChatGPT-4o, ChatGPT-o3, ChatGPT-o3-mini, Claude-3.5-Sonnet, DeepSeek-R1, DeepSeek-V3, Gemini-2-Flash, Gemini-2.5-Flash, LLaMA-3-70B, Qwen-Max, and Qwen-Plus. The models are tested across the three game environments and against diverse opponents, including ChatGPT-4o, Claude-3.5-Sonnet, and DeepSeek-V3.

The headline numbers are straightforward. In the reported average table, ChatGPT-o3-mini leads with a 74.7% win rate and 78.6% correction success rate. ChatGPT-o3 follows closely with a 74.2% win rate and 73.7% correction success rate. These models do not merely win; they combine strong outcomes with effective corrections and disciplined resource use.

Qwen-Plus provides the counterexample. It has an 81.6% average over-correction risk rate, meaning it frequently revises after negative feedback. Yet its average correction success rate is only 24.3%, and its win rate is 25.6%. This is the paper’s central behavioural warning: responsiveness is not the same as reasoning.

A useful way to read the result is this:

Model pattern	What the paper observes	Operational interpretation
High win rate + high CSR + low budget violations	ChatGPT-o3-mini and ChatGPT-o3 perform strongly across dimensions	Balanced agent behaviour: plan, recover, obey constraints
High ORR + low CSR	Qwen-Plus and Qwen-Max revise often but recover poorly	Reactive agent behaviour: lots of motion, little repair
Strong initial performance but decline over rounds	DeepSeek-R1 has the highest initial win rate at 75.0% but declines	Good first plans may still be brittle under repeated feedback
Moderate profiles without dominance	Claude and Gemini variants show mixed strengths	Capable but not uniformly disciplined across the whole loop

This is why a leaderboard-only summary would miss the actual paper. The important result is not simply “o3-mini wins”. The important result is that the winning behaviour looks balanced: good planning, selective revision, successful recovery, and budget discipline.

That combination is what an enterprise agent needs. A customer-support agent that escalates every slightly confusing case is not robust. A finance agent that corrects itself by violating approval thresholds is not adaptive. A coding agent that rewrites working modules after every test failure is not improving. It is making a small bonfire and calling it iteration.

Budget discipline is a reasoning signal, not administrative tidiness

The strongest quantitative relationship in the paper is about budget adherence. Figure 4 reports over-budget ratio, and the authors find a strong negative correlation between over-budget ratio and win rate: Pearson $r = -0.95$, with $p < 0.001$. ChatGPT-o3 and ChatGPT-o3-mini record perfect budget adherence in the reported analysis. Qwen-Plus and Qwen-Max sit at the other end, with high over-budget ratios and poor win rates.
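For readers who want to run the same kind of check on their own agents, the sketch below computes a Pearson correlation between per-model over-budget ratio and win rate. The numbers are invented purely for illustration and are not the paper's data; `pearson` is a hand-written helper so the example needs only the standard library.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient for two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-model values, NOT taken from the paper.
over_budget_ratio = [0.0, 0.1, 0.2, 0.3, 0.4]
win_rate = [0.8, 0.7, 0.5, 0.4, 0.2]

r = pearson(over_budget_ratio, win_rate)
print(f"r = {r:.2f}")
```

A strongly negative `r` on your own evaluation data would suggest the same pattern: the agents that break budgets are also the ones that lose. It would not establish which way the causation runs.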

This matters because budget violations are easy to dismiss as formatting errors or minor compliance slips. They are not. In this benchmark, staying inside budget is part of the reasoning task. A model that “solves” the game by overspending has not solved the game. It has ignored the game.

The business parallel is direct. Many AI agents appear competent only when constraints are soft. They can draft the plan if procurement limits are optional, schedule the project if labour capacity is imaginary, propose the marketing campaign if legal review does not exist, and optimise the portfolio if risk exposure is treated as a decorative spreadsheet column.

AdvGameBench treats the constraint as part of the problem. That is the correct stance for enterprise AI. The constraint is not an afterthought. The constraint is the job.
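One way to apply that stance in an enterprise evaluation is to express each limit as an executable check and let any violation zero the score, echoing the forfeiture rule the benchmark attaches to win rate. The sketch below is an illustration under stated assumptions: the `Proposal` fields, the thresholds, and the check names are hypothetical, and `quality` stands in for whatever grader already scores the answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Proposal:
    amount: float
    approver_level: int
    touches_restricted_data: bool


# Each check is (name, predicate); the predicate returns True when it passes.
Check = Tuple[str, Callable[[Proposal], bool]]

CHECKS: List[Check] = [
    ("within_spend_limit", lambda p: p.amount <= 10_000),
    ("approver_senior_enough",
     lambda p: p.approver_level >= 2 or p.amount <= 1_000),
    ("no_restricted_data", lambda p: not p.touches_restricted_data),
]


def violations(proposal: Proposal) -> List[str]:
    """Names of every hard constraint the proposal breaks."""
    return [name for name, passes in CHECKS if not passes(proposal)]


def score(proposal: Proposal, quality: float) -> float:
    """Assumed scoring rule: any violation forfeits the task."""
    return 0.0 if violations(proposal) else quality


clever_but_illegal = Proposal(12_000, approver_level=1,
                              touches_restricted_data=False)
print(violations(clever_but_illegal))
print(score(clever_but_illegal, quality=0.99))
```

Keeping the checks as data rather than prose means the same list can gate the agent at run time and score it at evaluation time.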

The revision result is directional, not a statistical hammer

The paper also studies whether revising more helps. The answer is: not reliably, and often no.

Figure 5 analyses correlations between over-correction risk rate and other process metrics. The relationships point in the expected direction: higher over-correction risk tends to associate with lower win rate, lower correction success, weaker improvement slope, and higher over-budget ratio. The figure reports moderate correlations, but they do not reach conventional statistical significance because the sample is only 12 models.

That distinction matters. The safe interpretation is not “frequent revision has been proven harmful in all LLM systems”. The safe interpretation is narrower and more useful: in this controlled benchmark, the models that revise most often are not the models that recover best. Revision frequency is therefore a poor proxy for reasoning quality.
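A back-of-envelope calculation shows why moderate correlations cannot clear the usual bar with 12 models. It assumes the figure reports Pearson correlations tested two-sided at the 5% level; the article does not state the test used, so treat this as an illustration rather than a reconstruction of the authors' analysis.

$$
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad n = 12, \quad \mathrm{df} = n - 2 = 10, \quad t_{0.975,\,10} \approx 2.228
$$

$$
|r|_{\text{crit}} = \frac{t_{0.975,\,10}}{\sqrt{t_{0.975,\,10}^{2} + \mathrm{df}}} \approx \frac{2.228}{\sqrt{4.964 + 10}} \approx 0.576
$$

Under those assumptions, any correlation with $|r|$ below roughly 0.58 fails to reach $p < 0.05$, which is where a "moderate" correlation typically sits. The budget result reported for Figure 4, $r = -0.95$, clears the same threshold comfortably.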

The paper’s discussion summarises the broader pattern as calibrated self-editing outperforming “spray-and-pray” revision. That is exactly the right operational phrase, even if slightly impolite to the spray can.

For agent evaluation, this suggests a practical rule: log revisions, but do not reward them directly. Reward repairs. Measure whether a correction reduces violations, improves state quality, preserves useful structure, or produces a better final result. Otherwise, the evaluation may accidentally optimise for restless behaviour.
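A small sketch of what "reward repairs, not revisions" could look like in a scorer. It labels each revision by two signals: whether the outcome improved, and how much of the previous plan survived. Jaccard overlap is used here as a simple stand-in for structural similarity; it is not claimed to be the paper's strategic-similarity metric, and the threshold, labels, and plan components are hypothetical.

```python
from typing import Set


def plan_similarity(before: Set[str], after: Set[str]) -> float:
    """Jaccard overlap between the components of two plans."""
    if not before and not after:
        return 1.0
    return len(before & after) / len(before | after)


def classify_revision(before: Set[str], after: Set[str],
                      improved: bool, keep_threshold: float = 0.5) -> str:
    """Label a revision; only the two 'repair' labels should earn credit."""
    preserved = plan_similarity(before, after) >= keep_threshold
    if improved and preserved:
        return "targeted repair"
    if improved:
        return "rewrite that worked"
    if preserved:
        return "ineffective tweak"
    return "churn"


old_plan = {"wall", "archer", "healer"}
new_plan = {"wall", "archer", "mage"}

print(plan_similarity(old_plan, new_plan))
print(classify_revision(old_plan, new_plan, improved=True))
print(classify_revision(old_plan, {"mage", "rogue"}, improved=False))
```

Counting labels per agent gives a revision profile. An agent whose log is dominated by "churn" is revising often without repairing much, regardless of how busy its trace looks.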

The evidence stack is mostly main evidence, with a few diagnostic controls

The paper’s figures and appendices serve different purposes. Treating them all as equal “results” would flatten the study.

Paper component	Likely purpose	What it supports	What it does not prove
Table 1: WR, ORR, CSR across games	Main evidence	Models differ sharply not only in winning, but in revision quality	Does not isolate causal mechanisms behind model behaviour
Figures 2 and 3: initial win rate, slope, trajectories	Main evidence for planning and adaptation	Strong openings and long-run improvement are different capabilities	Does not show real learning in the training sense; it is repeated interaction behaviour
Figure 4: over-budget ratio	Main evidence for constraint discipline	Budget adherence tracks strongly with success in this setup	Does not prove the same correlation in every business workflow
Figure 5: ORR correlations	Diagnostic correlation analysis	Frequent revision is directionally associated with weaker process outcomes	Small sample means it is suggestive, not decisive
Figure 6: first-mover advantage	Robustness/control for role asymmetry	Turn order is mostly controlled, though some models are sensitive to it	Does not eliminate all game-structure biases
Figure 7: radar chart	Synthesis visualisation	Balanced process quality is easier to see across dimensions	Not independent evidence; it repackages measured metrics
Appendix A: game rules and units	Implementation detail	The environments are rule-bound and inspectable	Does not itself validate ecological realism
Appendix B: extra metrics such as rule violation, constructive rate, strategic similarity, and first-mover advantage	Diagnostic framework extension	Offers knobs for finer-grained trace analysis	Not all metrics are equally central to the reported headline findings

This separation is important for business readers. The main evidence supports a diagnostic claim: process metrics reveal failure modes hidden by outcome metrics. The robustness and appendix material support the benchmark’s interpretability and extensibility. They do not magically convert a strategic game environment into a full replica of enterprise work.

The “peashooter” failure is not cute; it is a benchmark-design warning

One of the paper’s more revealing details appears in the discussion. In the tower defense experiments, defensive units are described as soldiers. Yet several models generated the term “peashooter”, a term not introduced in the task instructions. The likely source is pretraining association with Plants vs. Zombies, where tower defense and peashooters are culturally glued together.

This is easy to laugh at. It is also more serious than it looks.

The failure shows how a model can import a memorised pattern into a supposedly novel rule system. That contaminates evaluation. The benchmark may accidentally become a test of latent cultural association rather than reasoning over the rules provided. The authors respond by redesigning game environments to neutralise lexical cues and reduce strategy leakage from popular games.

For enterprise evaluation, the equivalent problem is everywhere. If a procurement agent has seen thousands of boilerplate RFP templates, it may complete the current one by template gravity rather than by reading the actual constraints. If a legal drafting assistant recognises a familiar clause pattern, it may assume jurisdictional details not present in the document. If a customer-support agent sees a common complaint structure, it may route the ticket by stereotype rather than by the actual customer state.

The peashooter is the warning label: models do not only reason inside the task. They also drag in the internet’s residue. Because of course they do. That residue must be tested, not politely ignored.
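Testing for that residue can start very simply: compare every entity the model names against the vocabulary the task actually defined. The sketch below is a minimal illustration, not the authors' procedure; `unknown_units` and the unit names are hypothetical.

```python
from typing import Iterable, List, Set


def unknown_units(proposed: Iterable[str], allowed: Set[str]) -> List[str]:
    """Return unit names the model used that the rules never introduced.

    Comparison ignores case and surrounding whitespace; each unknown
    name is reported once, in order of first appearance.
    """
    allowed_normalised = {name.strip().lower() for name in allowed}
    seen: Set[str] = set()
    leaked: List[str] = []
    for name in proposed:
        key = name.strip().lower()
        if key not in allowed_normalised and key not in seen:
            seen.add(key)
            leaked.append(key)
    return leaked


# Hypothetical rule vocabulary and model output.
ALLOWED = {"soldier", "archer"}
model_units = ["Soldier", "peashooter", "Peashooter ", "archer"]

print(unknown_units(model_units, ALLOWED))  # → ['peashooter']
```

Logging these hits separately from ordinary rule violations helps distinguish "misread the rules" from "answered a different, memorised game".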

How operators should steal the benchmark logic

The practical value of AdvGameBench is not that every company should recreate three strategic games. The practical value is the evaluation architecture.

A business agent evaluation should include at least four layers.

First, test the initial plan. Before giving feedback, inspect whether the agent can produce a coherent first strategy under the task rules. This is the analogue of initial win rate. It matters because some systems are strong after scaffolding but weak at independent framing.

Second, test recovery. Give the agent explicit negative feedback, but measure whether the revision improves the state. Do not measure whether it apologises elegantly. Do not measure whether it tries again with admirable enthusiasm. Measure repair.

Third, test constraint adherence. Budgets, policies, approval rules, data-access limits, and deadlines should be executable checks, not prose instructions buried somewhere in the prompt. A model that violates them should be scored accordingly, even if its answer looks clever. Especially if its answer looks clever.

Fourth, test trajectory. Run repeated interactions against matched scenarios. Some models open strong and degrade. Some improve after feedback. Some oscillate. A single task result will not show that.
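The trajectory layer can be summarised with a single number per agent. The sketch below fits an ordinary least-squares line to per-round win rates and returns its slope, which is one plausible reading of the improvement slope $\beta$; the paper's exact estimator is not given in this article, so this is an assumption, and the sample series are made up.

```python
from typing import Sequence


def improvement_slope(win_rates: Sequence[float]) -> float:
    """OLS slope of win rate against round index (0, 1, 2, ...)."""
    n = len(win_rates)
    if n < 2:
        raise ValueError("need at least two rounds to estimate a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(win_rates) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(win_rates))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx


# Hypothetical per-round win rates for two agents.
improving = [0.2, 0.4, 0.6, 0.8]
degrading = [0.7, 0.6, 0.5]

print(round(improvement_slope(improving), 3))
print(round(improvement_slope(degrading), 3))
```

A positive slope means the agent gets better with feedback and a negative one means it opens strong and decays. An oscillating agent can have a slope near zero, so the slope is best read alongside the raw trajectory rather than instead of it.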

A minimal enterprise version might look like this:

Enterprise agent task	Process metric to add	Example operational check
Customer support triage	Correction success	After feedback, does the agent route the case to the right queue without adding irrelevant escalation?
Finance approval workflow	Over-budget/rule violation rate	Does the agent respect spend thresholds and approval hierarchy every time?
Sales proposal generation	Strategic similarity after revision	Does revision preserve the useful commercial strategy or rewrite everything after minor feedback?
Software debugging	Improvement slope	Do repeated test failures produce increasingly targeted fixes or broader code churn?
Procurement analysis	Initial plan quality	Does the agent identify mandatory constraints before comparing vendors?

This is where Cognaptus would translate the paper into business practice: not as another model leaderboard, but as an evaluation design pattern for agent governance.

What the paper directly shows, what we infer, and what remains uncertain

The paper directly shows that, within AdvGameBench, process-aware metrics reveal large behavioural differences among frontier and open models. It shows that ChatGPT-o3-mini and ChatGPT-o3 perform strongly across win rate, correction success, improvement, and budget adherence. It shows that Qwen-Plus and Qwen-Max are far more reactive and far less effective in this setup. It also shows that budget discipline is strongly associated with success in the benchmark.

Cognaptus infers that enterprise AI evaluations should adopt trace-based scoring. That means recording not only final outputs but planning steps, revisions, constraint checks, and recovery quality. It also means resisting a common management temptation: rewarding agents for activity. Activity is cheap. Corrective action is expensive. The difference belongs in the scorecard.

What remains uncertain is external validity. AdvGameBench covers three turn-based adversarial game genres. It does not cover real-time environments, cooperative work, human emotional dynamics, ambiguous organisational goals, domain-specific regulation, or tool-use failures in messy software stacks. The paper also notes that its environments log unit-level actions but do not attribute win contributions to individual decisions. That limits causal explanation.

There are also minor inconsistencies in the reporting. The abstract describes 4,320 adversarial rounds, while the discussion later describes 4,752. This does not change the qualitative reading, but it does mean the safest interpretation is “thousands of controlled adversarial rounds” rather than an argument resting on one exact denominator.

So the right business stance is neither dismissal nor worship. AdvGameBench is not a deployment certificate. It is a useful microscope.

The gameboard is a rehearsal room for agent governance

The strongest idea in the paper is that reasoning quality becomes more legible when a model is forced to act inside a closed loop. A model must plan, receive consequences, revise, and obey limits. That sequence resembles real agentic work more closely than static question answering does.

The uncomfortable lesson is that many models can look adaptive while merely being reactive. They revise often, but not well. They respond to failure by changing something, because changing something feels like progress. Anyone who has sat through a corporate transformation programme will recognise the pattern.

AdvGameBench offers a better evaluation habit. Ask what the model did first. Ask what feedback changed. Ask whether the change helped. Ask whether constraints survived contact with optimisation. Ask whether the system improved across rounds or simply became more expensive.

That is the difference between evaluating an answer and evaluating an operator.

For businesses building or buying AI agents, the message is simple: do not let the final answer monopolise the audit. The reasoning trail is where the risk lives.

Cognaptus: Automate the Present, Incubate the Future.

¹ Xiaopeng Yuan, Xingjian Zhang, Ke Xu, Yifan Xu, Lijun Yu, Jindong Wang, Yushun Dong, and Haohan Wang, “Tracing LLM Reasoning Processes with Strategic Games: A Framework for Planning, Revision, and Resource-Constrained Decision Making,” arXiv:2506.12012, 2025. https://arxiv.org/abs/2506.12012
