Opening — Why this matters now

Every few months, the AI industry proclaims another breakthrough in “reasoning.” Models solve Olympiad geometry, ace graduate-level coding problems, and produce clean explanations that sound almost insultingly confident. The narrative writes itself: AGI is practically here; please adjust your expectations accordingly.

Then you hand the same models a chessboard—and they implode.

The paper LLM CHESS (Kolasani et al., 2025) delivers a humbling counterpoint to our collective triumphalism. Chess, that old drosophila of AI, turns out to be an unreasonably effective probe of modern LLM failures: hallucinated actions, inability to follow multi-step instructions, strategic amnesia, and a general tendency to behave like a distracted toddler pawing at a grandmaster’s pieces.

For those of us building AI agents for automation, finance, or operations, the real message is simple: reasoning benchmarks lie; interactive environments do not.

Background — Context and prior art

Chess has always served as AI’s reality check. Turing, Shannon, Wiener—each treated the game as a microcosm of strategic intelligence. Later, brute-force search and deep reinforcement learning produced engines such as Deep Blue and AlphaZero that now operate on a different planet from human cognition.

But LLMs are not chess engines. They are stochastic pattern machines pretending to be reasoners.

Previous LLM–chess explorations tended to fall into one of three categories:

  • Static PGN completion (easy to overfit).
  • Constrained rule-learning (good for legality, bad for strategy).
  • LLM-vs-LLM mini tournaments (amusing but not diagnostic).

What has been missing is an agentic, multi-turn setting—one that tests the two skills enterprises actually care about:

  1. Instruction-following reliability
  2. Generalizable, non-memorized reasoning

The authors of LLM CHESS finally deliver that.

Analysis — What the paper does

At its core, LLM CHESS is a deceptively simple benchmark:

  • The LLM plays chess as Black.
  • Each move requires multi-turn tool use: request the board state → request legal moves → issue a UCI move (a minimal sketch of this loop follows the list).
  • Models get 10 conversational turns per ply and 3 attempts to issue a legal action.
  • First, they play 30 games against a random opponent.
  • Strong performers graduate to a second stage against Komodo Dragon 1, an engine whose playing strength can be calibrated to a target level.
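
Here is a minimal sketch of that per-move dialog loop as I read the setup. The action strings, the ask_llm callable, and the use of python-chess are illustrative stand-ins, not the benchmark's exact interface:

```python
# A hedged sketch of the per-ply dialog protocol: the model may inspect the
# board, list legal moves, and finally issue a UCI move, within fixed budgets.
import chess  # python-chess

MAX_DIALOG_TURNS = 10   # conversational turns allowed per ply
MAX_WRONG_ACTIONS = 3   # illegal/unparseable actions allowed before forfeit

def play_one_ply(board: chess.Board, ask_llm) -> bool:
    """Drive one Black move through the protocol; True if a legal move lands."""
    wrong_actions = 0
    for _ in range(MAX_DIALOG_TURNS):
        action = ask_llm(board)  # e.g. "get_board", "get_legal_moves", "make_move e7e5"
        if action in ("get_board", "get_legal_moves"):
            continue  # the harness would return the FEN / UCI move list to the model here
        if action.startswith("make_move "):
            try:
                move = chess.Move.from_uci(action.split(maxsplit=1)[1])
            except ValueError:
                move = None
            if move is not None and move in board.legal_moves:
                board.push(move)
                return True
        wrong_actions += 1
        if wrong_actions >= MAX_WRONG_ACTIONS:
            return False  # abnormal termination: too many wrong actions
    return False  # abnormal termination: dialog budget exhausted
```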

The brilliance lies in its focus on failure surfaces, not success:

  • Can the model follow the tool interface?
  • Can it avoid hallucinating moves not in the legal move list?
  • Can it avoid looping endlessly when confused?
  • Can it avoid timing out under heavy reasoning compute?

Spoiler: most cannot.

According to the benchmark’s large-scale evaluation of over 50 models, the results break cleanly into two species:

| Category | Avg Win Rate vs Random | Instruction-Following Failures |
| --- | --- | --- |
| Reasoning-enhanced LLMs | ~45% | 24% |
| Standard LLMs | ~0.7% | 71.9% |

(Derived from Table 1, p.5–6)

The takeaway is stark: even against a random agent—essentially a drunk tourist moving pieces at random—most LLMs cannot reliably finish a game.

And among the elite? The strongest model tested, o3 (low), achieves a peak Elo of ~758. That’s barely above the average chess.com user and galaxies away from expert-level play. (Figure 3, p.7)

Let’s be clear: we’re not laughing at the Elo. We’re laughing at the gap between marketing narratives and empirical reality.

Findings — Results with visualization

A few patterns emerge, each more revealing than the last:

1. Tool use is the real bottleneck

Across all models, 64.8% of abnormal terminations arise from issuing wrong actions—not wrong moves. (Table 11, p.29)

LLMs aren’t strategically incompetent; they’re interface-incompetent.

2. Removing tools makes models better

In ablation studies, supplying the board and legal moves directly in the prompt—removing the need for actions—improved performance by 20–30 percentage points. (Table 3, p.8–9)

This is darkly humorous: LLMs reason better when you eliminate the part requiring reasoning about the environment.
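
Concretely, the "no tools" ablation amounts to something like the sketch below, assuming python-chess for board handling; the prompt wording is mine, not the paper's template:

```python
# Put the board and the full legal-move list straight into the prompt, so the
# model only has to pick a move rather than drive the tool interface.
import chess

def build_direct_prompt(board: chess.Board) -> str:
    legal = ", ".join(move.uci() for move in board.legal_moves)
    return (
        "You are playing Black.\n"
        f"Current position (FEN): {board.fen()}\n"
        f"Legal moves (UCI): {legal}\n"
        "Reply with exactly one move from the list above."
    )
```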

3. Test-time compute helps… but only up to a point

Scaling reasoning depth improved win rates by 15–20%, but also increased timeout instability. (Figure 4a, Appendix E)

More thinking doesn’t fix fragility.

4. MoA (Mixture-of-Agents) adds marginal benefit

Parallelizing multiple copies of a model yields only small gains—nothing like the dramatic improvements seen in code generation.

A simple cost–benefit framing:

| Configuration | Performance vs Random | Additional Cost |
| --- | --- | --- |
| o4-mini (medium) | Baseline | None (baseline) |
| 3× MoA | Slightly higher | 3× calls |
| 5× MoA | Worse | 5× calls |

(Figure 4b, p.8)

As with chess engines, coordination is harder than computation.
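
For intuition, a bare-bones MoA-style aggregation looks roughly like the sketch below; the majority-vote rule is my assumption (the paper's exact orchestration may differ), but it shows why cost scales linearly with the number of agents:

```python
# Sample k candidate moves from independent model calls and take a majority
# vote. k calls means k times the cost, for only marginal gains here.
from collections import Counter

def moa_pick_move(propose_move, k: int = 3) -> str:
    """propose_move() is any callable returning one candidate UCI move string."""
    candidates = [propose_move() for _ in range(k)]  # k model calls => k x cost
    return Counter(candidates).most_common(1)[0][0]  # most frequently proposed move
```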

5. Including move history reduces blunders dramatically

Adding previous moves drops blunder rates by up to 9.6%. (Table 8, p.28)

Memory matters—but LLMs don’t handle it naturally.
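
Mechanically, the fix is cheap: prepend the game so far to each per-move prompt. The formatting below is illustrative, not the paper's template:

```python
# Prepend move history (UCI strings) to the per-move prompt so the model does
# not have to reconstruct the game from a single board snapshot.
def with_history(prompt: str, moves_uci: list[str]) -> str:
    history = " ".join(moves_uci) if moves_uci else "(game start)"
    return f"Moves so far: {history}\n{prompt}"
```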

Implications — What this means for business and AI deployment

Chess here is merely a stress test for domains that actually matter: automation, finance, cybersecurity, procurement, logistics.

The warning signs translate directly:

1. Agentic workflows amplify LLM brittleness

Your AI workflow is only as reliable as the model’s ability to:

  • choose the correct tool,
  • issue structured commands consistently,
  • avoid hallucinating parameters.

If an LLM can’t reliably output make_move e7e5, imagine trusting it with:

  • execute_transfer(amount=5_000_000, currency=USD)
  • terminate_process(pid=2974)
  • deploy_model(version=prod-v12)
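
A hedged sketch of the guardrail this argues for: validate every model-proposed tool call against an explicit whitelist and parameter schema before anything executes. The tool names and schemas below are hypothetical examples, not a real API:

```python
# Reject hallucinated tools, missing or extra parameters, and wrong types
# before a model-issued command ever reaches a real system.
from typing import Any

ALLOWED_TOOLS: dict[str, dict[str, type]] = {
    "make_move": {"uci": str},
    "execute_transfer": {"amount": int, "currency": str},
    "terminate_process": {"pid": int},
}

def validate_call(tool: str, args: dict[str, Any]) -> bool:
    """Return True only for calls that match the whitelist exactly."""
    schema = ALLOWED_TOOLS.get(tool)
    if schema is None:
        return False  # the model invented a tool
    if set(args) != set(schema):
        return False  # missing or extra parameters
    return all(isinstance(value, schema[name]) for name, value in args.items())
```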

2. Strategic reasoning under uncertainty is still beyond modern LLMs

Chess exposes deficits that matter deeply to enterprise automation:

  • long-horizon planning
  • state tracking
  • working memory
  • recovering from errors
  • adapting to adversarial inputs

If a model can’t remember the state of a bishop, it cannot remember:

  • the compliance status of a vendor,
  • the financial exposure of a portfolio,
  • the sequence of dependencies in a supply chain.

3. Benchmark saturation is a myth

LLM CHESS resists memorization by design:

  • Massive state space
  • Dynamic opponent
  • Tool-mediated interaction

This makes it an ideal stress test for AI agents, not just a benchmark.

4. Cost structures reveal uncomfortable truths

From Table 5 (p.26), some models consume thousands of tokens per move, costing $5–20 per game.
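
A back-of-envelope check makes the range plausible. Every number below (tokens per move, plies per game, price per million tokens) is an illustrative assumption, not a figure from the paper:

```python
# Rough per-game cost under assumed usage and pricing.
tokens_per_move = 5_000        # "thousands of tokens per move"
plies_per_game = 50            # moves played by the LLM side
usd_per_million_tokens = 60    # roughly premium reasoning-model output pricing

cost_per_game = tokens_per_move * plies_per_game * usd_per_million_tokens / 1_000_000
print(f"~${cost_per_game:.2f} per game")  # ~$15.00, inside the reported $5-20 band
```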

High reasoning ≠ cheap reasoning.

In production automation, cost blowups like these are unacceptable.

Conclusion — Wrapping up

Where does this leave us? Precisely where sober practitioners already suspected:

LLMs excel at symbolic projection—explanations, summaries, clever analogies. They are competent at bounded tasks with abundant training signals, like math and code.

But turn them into autonomous agents navigating dynamic, rule-based environments—and their polished veneer cracks instantly.

The paper’s final message may be accidental but essential:

If your AI system must follow rules, track state, and avoid hallucinated actions, you need more than a big LLM. You need actual architecture.

Chess just happens to make that failure painfully, beautifully obvious.

Cognaptus: Automate the Present, Incubate the Future.