Opening — Why this matters now
Every few months, the AI industry proclaims another breakthrough in “reasoning.” Models solve Olympiad geometry problems, ace graduate-level coding contests, and produce clean explanations that sound almost insultingly confident. The narrative writes itself: AGI is practically here; please adjust your expectations accordingly.
Then you hand the same models a chessboard—and they implode.
The paper LLM CHESS (Kolasani et al., 2025) delivers a humbling counterpoint to our collective triumphalism. Chess, that old drosophila of AI, turns out to be an unreasonably effective probe of modern LLM failures: hallucinated actions, inability to follow multi-step instructions, strategic amnesia, and a general tendency to behave like a distracted toddler pawing at a grandmaster’s pieces.
For those of us building AI agents for automation, finance, or operations, the real message is simple: reasoning benchmarks lie; interactive environments do not.
Background — Context and prior art
Chess has always served as AI’s reality check. Turing, Shannon, Wiener—each treated the game as a microcosm of strategic intelligence. Later, brute-force search and deep reinforcement learning produced engines such as Deep Blue and AlphaZero that now operate on a different planet from human cognition.
But LLMs are not chess engines. They are stochastic pattern machines pretending to be reasoners.
Previous LLM–chess explorations tended to fall into one of three categories:
- Static PGN completion (easy to overfit).
- Constrained rule-learning (good for legality, bad for strategy).
- LLM-vs-LLM mini tournaments (amusing but not diagnostic).
What has been missing is an agentic, multi-turn setting—one that tests the two skills enterprises actually care about:
- Instruction-following reliability
- Generalizable, non-memorized reasoning
The authors of LLM CHESS finally deliver that.
Analysis — What the paper does
At its core, LLM CHESS is a deceptively simple benchmark:
- The LLM plays chess as Black.
- Each move requires multi-turn tool use: request board state → request legal moves → issue a UCI move (the loop is sketched after this list).
- Models get 10 conversational turns per ply and 3 attempts to issue a legal action.
- First, they play 30 games against a random opponent.
- Strong performers graduate to a second stage against Komodo Dragon 1, a calibratable chess engine.
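To make the failure modes concrete, here is a minimal sketch of that tool-interaction loop. It is not the paper's harness: the action strings, the `ask_llm` callable, and the retry handling are illustrative assumptions layered on top of the `python-chess` library.

```python
import chess  # python-chess; used here only to track state and legality

MAX_TURNS_PER_MOVE = 10   # conversational turns the model gets per ply
MAX_ILLEGAL_ATTEMPTS = 3  # failed actions before the game is aborted

def play_one_ply(board: chess.Board, ask_llm) -> bool:
    """Drive one Black move through a tool-style interface.

    `ask_llm(prompt) -> str` is a hypothetical callable returning the model's
    next action, e.g. 'get_current_board', 'get_legal_moves', 'make_move e7e5'.
    Returns True if a legal move was played, False on abnormal termination.
    """
    failures = 0
    prompt = ("You are playing as Black. Available actions: get_current_board, "
              "get_legal_moves, make_move <uci>. Choose one.")
    for _ in range(MAX_TURNS_PER_MOVE):
        action = ask_llm(prompt).strip()
        if action == "get_current_board":
            prompt = f"Current board:\n{board}\nChoose your next action."
        elif action == "get_legal_moves":
            legal = " ".join(m.uci() for m in board.legal_moves)
            prompt = f"Legal moves: {legal}\nChoose your next action."
        elif action.startswith("make_move "):
            try:
                move = chess.Move.from_uci(action.split(maxsplit=1)[1])
            except ValueError:
                move = None                         # malformed UCI string
            if move is not None and move in board.legal_moves:
                board.push(move)
                return True                         # legal move played
            failures += 1
            prompt = "That move is not legal. Choose another action."
        else:
            failures += 1                           # hallucinated / unrecognized action
            prompt = "Unrecognized action. Choose again."
        if failures >= MAX_ILLEGAL_ATTEMPTS:
            return False                            # abnormal termination
    return False                                    # ran out of conversational turns
```

Everything the benchmark measures (wrong actions, illegal moves, loops, timeouts) happens inside a loop roughly like this one.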
The brilliance lies in its focus on failure surfaces, not success:
- Can the model follow the tool interface?
- Can it avoid hallucinating moves not in the legal move list?
- Can it avoid looping endlessly when confused?
- Can it avoid timing out under reasoning compute?
Spoiler: most cannot.
According to the benchmark’s large-scale evaluation of over 50 models, the results break cleanly into two species:
| Category | Avg. Win Rate vs. Random | Instruction-Following Failure Rate |
|---|---|---|
| Reasoning-enhanced LLMs | ~45% | 24% |
| Standard LLMs | ~0.7% | 71.9% |
(Derived from Table 1, p.5–6)
The takeaway is stark: even against a random agent—essentially a drunk tourist moving pieces at random—most LLMs cannot reliably finish a game.
And among the elite? The strongest model tested, o3 (low), achieves a peak Elo of ~758. That’s barely above the average chess.com user and galaxies away from expert-level play. (Figure 3, p.7)
Let’s be clear: we’re not laughing at the Elo. We’re laughing at the gap between marketing narratives and empirical reality.
Findings — Results with visualization
A few patterns emerge, each more revealing than the last:
1. Tool use is the real bottleneck
Across all models, 64.8% of abnormal terminations arise from issuing wrong actions—not wrong moves. (Table 11, p.29)
LLMs aren’t strategically incompetent; they’re interface-incompetent.
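The distinction is easy to operationalize. A rough classifier, reusing the toy action grammar from the sketch above (which is my assumption, not the paper's exact definition):

```python
import re
import chess

# Toy action grammar, mirroring the earlier sketch.
ACTION_RE = re.compile(r"^(get_current_board|get_legal_moves|make_move\s+\S+)$")

def classify_reply(reply: str, board: chess.Board) -> str:
    """Distinguish interface failures ('wrong_action') from chess failures ('wrong_move')."""
    reply = reply.strip()
    if not ACTION_RE.match(reply):
        return "wrong_action"              # malformed or hallucinated tool call
    if reply.startswith("make_move"):
        try:
            move = chess.Move.from_uci(reply.split()[1])
        except ValueError:
            return "wrong_move"            # unparseable UCI (arguably an interface error too)
        if move not in board.legal_moves:
            return "wrong_move"            # well-formed but illegal in this position
    return "ok"
```

By the paper's numbers, it is the first branch, not the legality check, that kills most games.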
2. Removing tools makes models better
In ablation studies, supplying the board and legal moves directly in the prompt—removing the need for actions—improved performance by 20–30 percentage points. (Table 3, p.8–9)
This is darkly humorous: LLMs reason better when you eliminate the need to interact with the environment at all.
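A sketch of what that ablation looks like in practice: instead of a dialogue of tool calls, the model gets everything in one shot. The prompt wording is my assumption, not the paper's template.

```python
import chess

def single_shot_prompt(board: chess.Board) -> str:
    """Ablation-style prompt: board state and legal moves supplied up front, no tool calls."""
    legal = " ".join(m.uci() for m in board.legal_moves)
    return (
        "You are playing chess as Black.\n"
        f"Current position (FEN): {board.fen()}\n"
        f"Legal moves (UCI): {legal}\n"
        "Reply with exactly one move from the list, in UCI notation."
    )
```

The model only has to pick from a menu; the 20–30 point jump suggests how much of the original difficulty was fetching the menu, not choosing from it.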
3. Test-time compute helps… but only up to a point
Scaling reasoning depth improved win rates by 15–20%, but also increased timeout instability. (Figure 4a, Appendix E)
More thinking doesn’t fix fragility.
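One practical response is to cap the thinking rather than trust it: give each move a wall-clock budget and fall back to something legal instead of forfeiting. A minimal sketch, assuming a hypothetical `think(board)` model call; this is a mitigation idea, not something the paper evaluates.

```python
import random
import concurrent.futures
import chess

def move_with_budget(board: chess.Board, think, budget_s: float = 60.0) -> chess.Move:
    """Run `think(board) -> chess.Move` under a wall-clock budget; on timeout,
    fall back to a random legal move rather than losing the game to a stall.
    Assumes the game is not already over (at least one legal move exists)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(think, board)
    try:
        return future.result(timeout=budget_s)
    except concurrent.futures.TimeoutError:
        return random.choice(list(board.legal_moves))
    finally:
        # Don't block on the straggling call; let it finish (and be discarded) in the background.
        pool.shutdown(wait=False, cancel_futures=True)
```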
4. MoA (Mixture-of-Agents) adds marginal benefit
Parallelizing multiple copies of a model yields only small gains—nothing like the dramatic improvements seen in code generation.
A simple cost–benefit framing:
| Configuration | Performance vs Random | Additional Cost |
|---|---|---|
| o4-mini (medium) | Baseline | — |
| 3× MoA | Slightly higher | 3× calls |
| 5× MoA | Worse | 5× calls |
(Figure 4b, p.8)
As with chess engines, coordination is harder than computation.
5. Including move history reduces blunders dramatically
Adding previous moves drops blunder rates by up to 9.6%. (Table 8, p.28)
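This one is cheap to replicate: the history already sits in the game object, and prepending it to something like the single-shot prompt above is a one-liner with `python-chess` (again an illustrative sketch, not the paper's template).

```python
import chess

def move_history(board: chess.Board) -> str:
    """Render the game so far as a numbered SAN line, e.g. '1. e4 e5 2. Nf3 Nc6'."""
    return chess.Board().variation_san(board.move_stack) or "(game start)"
```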
Memory matters—but LLMs don’t handle it naturally.
Implications — What this means for business and AI deployment
Chess here is merely a stress test for domains that actually matter: automation, finance, cybersecurity, procurement, logistics.
The warning signs translate directly:
1. Agentic workflows amplify LLM brittleness
Your AI workflow is only as reliable as the model’s ability to:
- choose the correct tool,
- issue structured commands consistently,
- avoid hallucinating parameters.
If an LLM can’t reliably output `make_move e7e5`, imagine trusting it with:
- `execute_transfer(amount=5_000_000, currency=USD)`
- `terminate_process(pid=2974)`
- `deploy_model(version=prod-v12)`
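The chess lesson translates into one boring but effective habit: never let free-text model output reach a real system without schema validation. A minimal sketch follows; the tool names, fields, and limits are hypothetical, not any particular framework's API.

```python
# Whitelist of allowed tools, with a type check and a hard limit per parameter.
TOOL_SCHEMAS = {
    "execute_transfer": {
        "amount":   (int, lambda v: 0 < v <= 10_000),   # hard cap, regardless of what the model asks for
        "currency": (str, lambda v: v in {"USD", "EUR"}),
    },
    "terminate_process": {
        "pid": (int, lambda v: v > 1),                  # never PID 0 or 1
    },
}

def validate_call(tool: str, args: dict) -> bool:
    """Reject anything the schema does not explicitly allow: unknown tools,
    missing or extra parameters, wrong types, out-of-range values."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None or set(args) != set(schema):
        return False
    return all(isinstance(args[k], typ) and ok(args[k])
               for k, (typ, ok) in schema.items())

# The transfer from the list above would be rejected: $5,000,000 blows past the cap.
assert validate_call("execute_transfer", {"amount": 5_000_000, "currency": "USD"}) is False
```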
2. Strategic reasoning under uncertainty is still beyond modern LLMs
Chess exposes deficits that matter deeply to enterprise automation:
- long-horizon planning
- state tracking
- working memory
- recovering from errors
- adapting to adversarial inputs
If a model can’t remember the state of a bishop, it cannot remember:
- the compliance status of a vendor,
- the financial exposure of a portfolio,
- the sequence of dependencies in a supply chain.
3. Benchmark saturation is a myth
LLM CHESS is immune to memorization:
- Massive state space
- Dynamic opponent
- Tool-mediated interaction
This makes it an ideal stress test for AI agents, not just a benchmark.
4. Cost structures reveal uncomfortable truths
From Table 5 (p.26), some models consume thousands of tokens per move, costing $5–20 per game.
High reasoning ≠ cheap reasoning.
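The back-of-envelope math is worth doing before wiring a reasoning model into any per-transaction loop. The numbers below are placeholder assumptions (not Table 5's figures), but they show how quickly thousands of tokens per step compound.

```python
# Rough per-game cost under assumed (not measured) numbers.
output_tokens_per_move = 8_000          # reasoning traces can run to thousands of tokens per ply
moves_per_game = 60                     # plies played by the model in a longish game
usd_per_million_output_tokens = 40.0    # varies widely by provider and model tier

cost_per_game = (output_tokens_per_move * moves_per_game
                 * usd_per_million_output_tokens / 1_000_000)
print(f"~${cost_per_game:.2f} per game")   # ~$19.20 with these assumptions
```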
In production automation, cost blowups like these are unacceptable.
Conclusion — Wrapping up
Where does this leave us? Precisely where sober practitioners already suspected:
LLMs excel at symbolic projection—explanations, summaries, clever analogies. They are competent at bounded tasks with abundant training signals, like math and code.
But turn them into autonomous agents navigating dynamic, rule-based environments—and their polished veneer cracks instantly.
The paper’s final message may be accidental but essential:
If your AI system must follow rules, track state, and avoid hallucinated actions, you need more than a big LLM. You need actual architecture.
Chess just happens to make that failure painfully, beautifully obvious.
Cognaptus: Automate the Present, Incubate the Future.