Blunders, Patterns, and Predictability: What n‑Gram Models Teach Us About Human Chess

Chess engines are very good at telling you what a player should do.

That is not the same as predicting what the player will do.

Anyone who has watched a beginner hang a queen, an intermediate player force a dubious attack, or a strong player choose a quiet positional squeeze already knows the difference. Optimality is one question. Human behavior is another. Most AI systems enjoy pretending those two questions are basically cousins. They are not. One is about the board. The other is about the person touching the pieces.

That distinction is the useful part of the paper “Predicting Human Chess Moves: An AI Assisted Analysis of Chess Games Using Skill-group Specific n-gram Language Models.”¹ The paper does not try to build a better chess engine. It tries to predict human chess moves by treating games as sequences of tokens and players as members of skill groups. Instead of asking, “What is the best move in this position?”, it asks, “Given how this kind of player tends to play, what move is likely next?”

That sounds smaller. It is actually the more transferable idea.

The business version is simple: before predicting the next action, first identify the behavioral regime. A novice customer, expert operator, careless employee, rushed analyst, and power user may all face the same interface, but they do not generate the same next-action distribution. If your model ignores that, it may produce a very confident average answer. A confident average answer is often just a nicely formatted mistake.

The mechanism is routing before prediction

The paper’s core design is not complicated, which is precisely why it is interesting.

The authors divide Lichess games into seven rating-based skill groups, from L1 novice games to L7 expert games. They then train one 5-gram KenLM language model for each group. A chess game is represented as a sequence of algebraic move tokens, such as:

e4 e6 Bc4 d5 exd5 exd5 Bb3 Nf6 ...

A 5-gram model predicts the next token using only a short local history. In practical terms, it looks back at up to four previous move tokens. No deep board evaluation. No search tree. No legal-move reasoning. Just statistical continuation.
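
To make "statistical continuation" concrete, here is a minimal sketch that assumes plain frequency counts and a fall-back to the longest history seen in training. This is a simplification: KenLM uses modified Kneser-Ney smoothing, which this toy does not implement. The class name `BackoffNgram` and the sample games are illustrative and do not come from the paper.

```python
from collections import Counter, defaultdict


class BackoffNgram:
    """Toy count-based n-gram model over move tokens (not KenLM)."""

    def __init__(self, order=5):
        self.order = order
        # counts[context_tuple][next_token] -> frequency
        self.counts = defaultdict(Counter)

    def train(self, games):
        for game in games:
            tokens = game.split()
            for i, token in enumerate(tokens):
                # Record the token under every history length 0..order-1.
                for n in range(self.order):
                    if i - n < 0:
                        break
                    self.counts[tuple(tokens[i - n:i])][token] += 1

    def predict(self, history, k=3):
        """Return up to k continuations from the longest history seen in training."""
        tokens = history.split()
        for n in range(min(self.order - 1, len(tokens)), -1, -1):
            context = tuple(tokens[len(tokens) - n:])
            if context in self.counts:
                return [move for move, _ in self.counts[context].most_common(k)]
        return []


model = BackoffNgram(order=5)
model.train([
    "e4 e5 Nf3 Nc6 Bb5",
    "e4 e5 Nf3 Nc6 Bc4",
    "e4 e5 Nf3 Nc6 Bb5",
    "d4 d5 c4 e6",
])
print(model.predict("e4 e5 Nf3 Nc6"))  # full four-token history was seen
print(model.predict("c4 e5 Nf3 Nc6"))  # unseen history, falls back to three tokens
```

Nothing in this sketch checks whether a suggested move is legal, which is the same blind spot the real models have.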

That sounds almost embarrassingly simple next to Stockfish or AlphaZero. But again, the paper is not solving chess. It is modeling human patterns.

The framework has three steps:

Step	What the system does	Why it matters
Skill-specific modeling	Train separate n-gram models for seven rating bands	Different skill levels produce different move patterns
Model selection	Use cumulative surprisal to infer which skill-level model best fits the observed game prefix	The system routes the game to a behavioral model before predicting
Move prediction	Predict the next move using the selected skill-specific model	Prediction becomes conditional on player type, not just global confidence

The model selector is the small but important trick. Each skill-specific model assigns surprisal values to the observed move sequence. Lower surprisal means the sequence is less “surprising” under that model. The selector chooses the skill model with the lowest cumulative surprisal over the observed prefix.

A compact way to express the idea is:

$$ S_l(g_{1:k}) = \sum_{t=1}^{k} -\log P_l(m_t \mid m_{t-4:t-1}) $$

Here, $S_l$ is the cumulative surprisal assigned by skill-level model $l$ to the first $k$ half-moves of a game. The system chooses the level whose model finds the observed sequence least surprising.
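
The formula translates into a few lines of code. The sketch below assumes each skill model is exposed as a function returning a probability for a move given its context; the toy tables are hypothetical, context-free, and not normalized, so they only illustrate the selection rule.

```python
import math


def cumulative_surprisal(prob, moves, order=5):
    """Sum of -log P(m_t | up to order-1 previous tokens) over the prefix."""
    total = 0.0
    for t, move in enumerate(moves):
        context = tuple(moves[max(0, t - (order - 1)):t])
        total += -math.log(prob(move, context))
    return total


def select_level(models, moves):
    """Pick the level whose model finds the observed prefix least surprising."""
    scores = {level: cumulative_surprisal(p, moves) for level, p in models.items()}
    return min(scores, key=scores.get), scores


def make_toy_model(table, floor=0.01):
    # Hypothetical stand-in for a trained model: ignores context, floors unseen moves.
    return lambda move, context: table.get(move, floor)


models = {
    "L1": make_toy_model({"e4": 0.5, "e5": 0.4, "Qh5": 0.3, "Nc6": 0.2}),
    "L7": make_toy_model({"e4": 0.4, "c5": 0.4, "Nf3": 0.5, "d6": 0.3}),
}

level, scores = select_level(models, ["e4", "e5", "Qh5", "Nc6"])
print(level, scores)
```

The early queen sortie is cheap under the toy L1 table and expensive under the toy L7 table, so the prefix is routed to L1.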

This is not magic. It is routing.

And routing is often where business AI systems become useful. Not because routing is glamorous. It is not. It is plumbing with statistical manners. But a model that first asks “Which behavioral world am I in?” can outperform a model that tries to make one global prediction across incompatible user types.

The data pipeline turns chess into a behavioral text stream

The authors use the public Lichess database, which contains standard-rated games. They group games by the average of White and Black ratings:

Level	Rating range
L1	≤ 1000
L2	1000–1400
L3	1400–1600
L4	1600–1800
L5	1800–2000
L6	2000–2250
L7	≥ 2250
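
The binning rule can be sketched as a lookup on the average rating. The table leaves shared edges ambiguous (1400 appears in both L2 and L3, for example), so the tie-breaking below is an assumption: a rating exactly on an interior edge goes to the lower band, while 1000 and 2250 follow the "≤" and "≥" signs in the table. The function name `skill_level` is illustrative.

```python
from bisect import bisect_left

# Upper edges of L1..L5 from the table; L6 and L7 are handled explicitly below.
UPPER_EDGES = [1000, 1400, 1600, 1800, 2000]


def skill_level(white_elo, black_elo):
    """Map a game's average rating to a level label L1..L7."""
    average = (white_elo + black_elo) / 2
    if average >= 2250:
        return "L7"
    # bisect_left sends a rating equal to an edge to the lower band (assumption).
    return f"L{bisect_left(UPPER_EDGES, average) + 1}"


print(skill_level(900, 1100))   # average 1000
print(skill_level(1500, 1500))  # average 1500
print(skill_level(2300, 2200))  # average 2250
```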

For training, they randomly sample 10% of each level’s games from July 2024. For testing, they use 1,000 games from August 2024 for each level. The resulting corpus includes seven training sets and seven corresponding test sets, with training sets ranging from roughly 0.5 million to 3 million games.

The preprocessing choice is worth noticing. The authors originally considered parsing PGN files with python-chess, but instead treated the PGNs as plain text to accelerate processing. They remove metadata, side variations, move numbers, annotations, and engine evaluations. What remains is a clean move sequence.
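
One plausible way to do this kind of text-only cleanup is a handful of regular expressions. The sketch below is not the authors' script; the function name and the exact rules are assumptions, chosen to cover the items listed above (tag lines, comments with engine evaluations, variations, move numbers, annotation marks, and the result marker).

```python
import re


def pgn_to_tokens(pgn_text):
    """Reduce one PGN game to a list of SAN move tokens using text rules only."""
    text = re.sub(r"\{[^}]*\}", " ", pgn_text)                  # comments, incl. [%eval ...]
    text = re.sub(r"^\[.*\]\s*$", " ", text, flags=re.M)        # tag-pair header lines
    # Remove parenthesised side variations, innermost first, to handle nesting.
    while re.search(r"\([^()]*\)", text):
        text = re.sub(r"\([^()]*\)", " ", text)
    text = re.sub(r"\$\d+", " ", text)                          # numeric annotation glyphs
    text = re.sub(r"\d+\.(\.\.)?", " ", text)                   # move numbers "1." and "1..."
    text = re.sub(r"(1-0|0-1|1/2-1/2|\*)\s*$", " ", text)       # trailing game result
    text = re.sub(r"[!?]+", "", text)                           # "!", "?", "?!" suffixes
    return text.split()


sample = """[Event "Rated Blitz game"]
[WhiteElo "1500"]

1. e4 { [%eval 0.17] } 1... e6 2. Bc4?! (2. d4 d5 (2... c5)) 2... d5 $1 3. exd5 exd5 1-0
"""
print(" ".join(pgn_to_tokens(sample)))
```

Comments are stripped before move numbers so that decimal evaluations such as `0.17` are not mistaken for move numbers.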

This choice gives the framework speed and simplicity. It also creates one of the paper’s central boundaries: the model does not understand the board. It sees tokens, not legal positions.

That tradeoff matters. For a chess engine, ignoring legality would be absurd. For behavioral pattern detection, it may be acceptable as a first pass. In business terms, this resembles using clickstream logs, support-ticket sequences, workflow events, or transaction histories without fully modeling every underlying constraint. You may get a useful behavioral signal quickly. You may also recommend an impossible next step. Congratulations, you have rediscovered why production systems need guardrails.

The heatmap checks whether skill fingerprints exist

Before asking whether the selector improves move prediction, the paper first checks whether the seven skill-specific models actually behave differently.

This is the likely purpose of the perplexity heatmap in the results section. It is not an ablation. It is more like a diagnostic validation: do models trained on one skill group assign lower perplexity to games from that same or nearby skill group?

The broad pattern supports that idea. Lower-level models tend to fit lower-level games better, and higher-level models tend to fit higher-level games better. The diagonal is not perfectly clean, which is important. Human behavior does not politely arrange itself for a conference figure. Nearby skill bands overlap. Some intermediate groups look statistically similar. Still, the heatmap suggests that rating-specific move patterns are strong enough for a selector to exploit.

That is the first meaningful result: chess skill leaves a statistical trace in move sequences, even when the model only sees short token histories.
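
For readers who want the link between the heatmap metric and the selector score: under the usual definition, and assuming natural logarithms in $S_l$, perplexity is the exponentiated average surprisal per token. The paper's exact normalization (for example, whether an end-of-game token is counted) is not restated here, so treat this as the textbook form.

$$ \mathrm{PPL}_l(g_{1:k}) = \exp\left(\frac{1}{k} S_l(g_{1:k})\right) $$

If the logs are base 10, as in KenLM's reported scores, the same relation holds with $10^{(\cdot)}$ in place of $\exp(\cdot)$.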

The paper also reports average surprisal over the first 100 half-moves for L1 games. Surprisal rises as games move toward the middle game, suggesting that predictions become less certain when the number of plausible continuations expands. This figure is best read as an explanatory diagnostic, not the main proof. It helps explain why move prediction becomes harder around the middle game.
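
A curve like that can be produced by averaging per-move surprisal at each half-move index across games. The sketch below assumes games shorter than the window simply stop contributing at later indices; how the paper handles short games is not stated here, and the function name is illustrative.

```python
def mean_surprisal_by_ply(per_game_surprisals, max_ply=100):
    """Average surprisal at each half-move index across games of unequal length."""
    totals = [0.0] * max_ply
    counts = [0] * max_ply
    for game in per_game_surprisals:
        for ply, value in enumerate(game[:max_ply]):
            totals[ply] += value
            counts[ply] += 1
    # Indices that no game reaches are dropped from the end of the curve.
    return [total / count for total, count in zip(totals, counts) if count > 0]


# Hypothetical surprisal values for two short games.
curve = mean_surprisal_by_ply([[1.0, 2.0, 4.0], [3.0, 2.0]])
print(curve)
```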

The business analogue is familiar. Early-stage behavior is often structured. Onboarding flows, opening moves, first purchases, initial support requests, and standard operating procedures produce patterns. Later behavior branches. Users improvise. Experts customize. Beginners get lost. The middle game arrives, wearing a fake mustache.

The selector is useful, but it is not a mind reader

The selector’s skill classification accuracy is modest, and that modesty is part of the lesson.

Using the first 16 half-moves, the selector reaches 31.7% overall accuracy across seven classes. Random guessing across balanced classes would be about 14.3%, so 31.7% is meaningfully better than chance. But it is not close to “the system knows your skill level.” With the first 100 half-moves, overall accuracy falls to 26.8%.

The per-level results are more revealing:

Game level	16 half-moves	100 half-moves	Interpretation
L1	37.2%	22.3%	Novice behavior becomes harder to classify later
L2	35.4%	39.0%	More history helps
L3	23.3%	23.9%	Little change
L4	24.5%	27.9%	More history helps modestly
L5	26.6%	31.8%	More history helps
L6	32.0%	27.6%	More history hurts
L7	43.3%	15.3%	Expert behavior becomes much harder to classify later

This table should prevent two bad readings.

The first bad reading is that more data automatically improves classification. It does not. More moves help L2, L4, and L5, barely move L3, and hurt L1, L6, and especially L7.

The second bad reading is that experts are simply more predictable. In the early game, L7 games are classified most accurately at 43.3%. But when the window expands to 100 half-moves, L7 accuracy collapses to 15.3%. The authors suggest that strong players may deviate from common patterns as strategy becomes more sophisticated, while novices become erratic. That explanation is plausible, though the paper does not prove it causally.
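
The arithmetic behind both readings is easy to check from the table above. The sketch below uses only the per-level figures quoted in this article. One inference to flag: the unweighted mean of the per-level numbers lands within rounding of the reported overall accuracies, which is what equal test sets of 1,000 games per level would produce.

```python
# Selector accuracy (%) per level: (first 16 half-moves, first 100 half-moves).
accuracy = {
    "L1": (37.2, 22.3),
    "L2": (35.4, 39.0),
    "L3": (23.3, 23.9),
    "L4": (24.5, 27.9),
    "L5": (26.6, 31.8),
    "L6": (32.0, 27.6),
    "L7": (43.3, 15.3),
}

# Change in percentage points when the window grows from 16 to 100 half-moves.
deltas = {level: round(late - early, 1) for level, (early, late) in accuracy.items()}
helped = [level for level, delta in deltas.items() if delta > 0]
hurt = [level for level, delta in deltas.items() if delta < 0]

macro_16 = sum(early for early, _ in accuracy.values()) / len(accuracy)
macro_100 = sum(late for _, late in accuracy.values()) / len(accuracy)
chance = 100 / len(accuracy)

print(deltas)
print(helped, hurt)
print(round(macro_16, 2), round(macro_100, 2), round(chance, 1))
```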

The broader lesson is sharper: behavioral labels can be useful even when classification is imperfect. The selector does not need to be a perfect psychologist. It only needs to route often enough to improve downstream prediction.

That is a very practical point. Many business systems obsess over whether a segment classifier is “accurate.” The better question is whether routing improves the final decision. A mediocre classifier can still be operationally valuable if its mistakes are not too costly and its correct routes produce better predictions.

Top-1 prediction shows the limit of asking for one human answer

The paper compares the selector-assisted framework against a benchmark that runs the input through all models and chooses the move with the highest global confidence score. This benchmark is important because it captures the tempting alternative: skip routing and just take the most confident prediction from anywhere.
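
The difference between the two strategies fits in a few lines. This sketch assumes "global confidence" means the highest next-move probability that any model assigns; the paper's exact confidence score is not restated here, and all numbers are hypothetical.

```python
def benchmark_predict(predictions):
    """Global-confidence benchmark: the single most confident move across all models.

    predictions maps level -> list of (move, probability) for the same game prefix.
    """
    best_move, _ = max(
        (pair for ranked in predictions.values() for pair in ranked),
        key=lambda pair: pair[1],
    )
    return best_move


def selector_predict(predictions, surprisal):
    """Selector-assisted: trust only the model with the lowest cumulative surprisal."""
    level = min(surprisal, key=surprisal.get)
    best_move, _ = max(predictions[level], key=lambda pair: pair[1])
    return best_move


# Hypothetical outputs for one prefix that looks like a novice game.
predictions = {
    "L1": [("Qh5", 0.35), ("Nf3", 0.30)],
    "L7": [("Nf3", 0.60), ("Bc4", 0.20)],
}
surprisal = {"L1": 9.8, "L7": 14.1}

print(benchmark_predict(predictions))            # loudest model wins
print(selector_predict(predictions, surprisal))  # best-fitting model wins
```

The benchmark follows whichever model is most sure of itself, even when that model fits the game so far poorly.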

For Top-1 move prediction, the selector-assisted system improves accuracy by up to 6.6% over the benchmark. That is a real but modest gain. The figure also shows that around 50 half-moves, the advantage nearly disappears; the plotted values are essentially tied, with the benchmark even slightly ahead in the chart.

That tiny detail matters because it keeps the result honest. The mechanism helps, but it does not eliminate uncertainty. Predicting one exact human move in a complex middle game is hard, especially when the model has no board-state awareness and only a four-token memory.

Top-1 prediction is also a slightly unfair test of human behavior modeling. Humans often have several plausible moves. A player might choose one of several reasonable developing moves in the opening, one of several recaptures, or one of several strategic plans. If the model’s second guess is what the player actually does, a Top-1 metric still marks the prediction wrong.

That is why the Top-3 result is more informative.

Top-3 prediction is where the behavioral framing earns its keep

The Top-3 experiment asks whether the actual move appears among the three most likely predicted moves. This better matches the behavioral reality: human action prediction is often about narrowing the plausible set, not naming the single future event with divine confidence.
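
The metric itself is simple. A minimal sketch, with hypothetical predictions and outcomes:

```python
def top_k_accuracy(ranked_predictions, actual_moves, k=3):
    """Share of positions where the played move is among the k highest-ranked guesses."""
    hits = sum(
        actual in ranked[:k]
        for ranked, actual in zip(ranked_predictions, actual_moves)
    )
    return hits / len(actual_moves)


# Each inner list is a model's ranking for one position, best guess first.
ranked = [
    ["Nf3", "Nc3", "d4"],
    ["Bb5", "Bc4", "d4"],
    ["O-O", "d3", "c3"],
]
played = ["Nc3", "Bb5", "h3"]

print(top_k_accuracy(ranked, played, k=1))
print(top_k_accuracy(ranked, played, k=3))
```

The first position counts as a miss under Top-1 and a hit under Top-3, which is exactly the case the metric is meant to credit.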

Here the selector-assisted framework performs much better. The paper reports up to a 39.1% improvement over the benchmark. From the plotted values, the largest gain appears early: at 16 half-moves, selector-assisted Top-3 accuracy is about 0.475 versus benchmark accuracy around 0.342. Gains remain visible at later checkpoints, including 20, 40, 50, and 80 half-moves.
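
A quick check on how to read that percentage, using the approximate chart values quoted above:

$$ \frac{0.475 - 0.342}{0.342} \approx 0.389 $$

That is roughly a 39% relative gain, or about 13 percentage points in absolute terms. The match with the reported 39.1% suggests the headline figure is a relative improvement; this is an inference from the plotted values, which are themselves read off a chart.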

The interpretation is not “the system can predict chess.” The stronger interpretation is narrower and more useful: skill-conditioned routing improves the ranking of plausible moves, especially when the evaluation allows for multiple reasonable continuations.

That distinction matters for business applications. Many operational AI systems do not need one perfect next action. They need a short, well-ranked candidate set:

Domain	Bad objective	Better objective
Customer support	Predict the exact next complaint	Rank likely issue categories
Fraud review	Declare one reason for suspicious behavior	Surface plausible risk scenarios
Sales automation	Predict the exact next purchase	Rank next-best offers or objections
Workflow assistance	Guess the precise next click	Suggest likely next steps
Training systems	Label the learner perfectly	Recommend a small set of likely mistakes

Top-3 thinking is less theatrical than Top-1 prediction. It is also more useful. The world rarely asks humans to be deterministic tokens. Most AI dashboards simply pretend otherwise because a single answer looks cleaner on a slide.

What each experiment actually supports

The paper’s experiments should be read as a sequence of supporting tests, not as one giant accuracy claim.

Evidence item	Likely purpose	What it supports	What it does not prove
Perplexity heatmap across skill models and game levels	Diagnostic validation	Skill-specific models capture different move distributions	Perfect separation between skill levels
Surprisal over 100 half-moves	Exploratory diagnostic	Prediction uncertainty rises into the middle game	A causal theory of middle-game complexity
16 vs. 100 half-move selector accuracy	Main routing evidence plus sensitivity check	Early-game information can identify skill group better than chance	Reliable individual-level skill classification
Top-1 prediction comparison	Main downstream evaluation	Selector routing can modestly improve exact next-move prediction	Strong exact prediction in all game phases
Top-3 prediction comparison	Main downstream evaluation	Routing is more valuable when predicting plausible move sets	Generalization beyond chess or beyond the tested months
L1 and L7 discussion	Post-hoc explanation	Extremes may behave less predictably outside structured openings	Definitive explanation for novice and expert unpredictability

This table is useful because the paper’s most important result is not one number. It is the combination of three observations:

Skill-level move distributions differ enough to be modeled.
A simple surprisal selector can route games better than chance.
Routing improves downstream move prediction, especially for Top-3 prediction.

That mechanism is the exportable part.

The business value is segmentation before prediction, not chess with smaller models

The lazy business takeaway would be: “n-gram models are back.” They are not. Please do not build a 2026 strategy deck around the glorious return of 5-grams. The point is not that n-gram models beat modern AI. The paper does not test that claim.

The better takeaway is architectural: cheap specialized models can become useful when paired with a lightweight selector.

This matters because many organizations face action-prediction problems where full foundation-model reasoning is expensive, slow, unnecessary, or hard to control. Examples include:

predicting the next step in a workflow;
suggesting likely support-ticket categories;
identifying typical mistakes in training simulations;
predicting customer journey branches;
detecting behavior that deviates from a user’s normal pattern;
ranking plausible operational next actions for human review.

In these cases, the useful unit may not be “one large model for everyone.” It may be:

build compact models for meaningful behavioral groups;
infer the current behavioral group from recent actions;
route prediction through the matching model;
return a ranked set of plausible next actions;
apply business rules, legality checks, or human review before execution.
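
The five steps above can be sketched end to end. Everything here is hypothetical: the group names, the action names, and the use of context-free probability tables as the "compact models". The `is_allowed` callback stands in for whatever business rules or legality checks apply.

```python
import math
from typing import Callable, Dict, List, Tuple


def route_and_predict(
    recent_actions: List[str],
    group_models: Dict[str, Dict[str, float]],
    is_allowed: Callable[[str], bool],
    k: int = 3,
    floor: float = 1e-3,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Infer the behavioral group, then return its top-k allowed next actions."""

    def surprisal(model: Dict[str, float]) -> float:
        # Cumulative surprisal of the recent actions; unseen actions get a floor.
        return sum(-math.log(model.get(action, floor)) for action in recent_actions)

    # Steps 2-3: route to the group whose model is least surprised.
    group = min(group_models, key=lambda name: surprisal(group_models[name]))
    # Step 4: rank that group's actions by probability.
    ranked = sorted(group_models[group].items(), key=lambda item: item[1], reverse=True)
    # Step 5: drop anything the constraint check rejects, then keep the top k.
    return group, [(action, p) for action, p in ranked if is_allowed(action)][:k]


# Step 1: one compact model per behavioral group (toy numbers).
group_models = {
    "novice": {"open_help": 0.5, "search": 0.3, "export": 0.1, "bulk_edit": 0.1},
    "power_user": {"bulk_edit": 0.4, "export": 0.35, "search": 0.2, "open_help": 0.05},
}

group, suggestions = route_and_predict(
    ["open_help", "search", "open_help"],
    group_models,
    is_allowed=lambda action: action != "export",  # e.g. export disabled on this plan
)
print(group, suggestions)
```

Keeping the constraint check as a separate callback means the predictor can stay cheap and ignorant while the rules live somewhere auditable.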

That architecture is not glamorous. It is usually more deployable than glamour.

It also fits a common business constraint: labels are often coarse. The paper uses rating bins, not psychological profiles. Many firms already have similarly imperfect labels: customer tier, user tenure, job role, risk band, subscription plan, prior behavior cluster, region, device type, or workflow maturity. These labels are not destiny. They are routing hints.

The paper shows that even coarse routing can improve prediction when behavior differs across groups.

What Cognaptus can infer, and what the paper actually shows

A clean business interpretation needs three layers.

Layer	Statement
What the paper directly shows	On Lichess data, seven rating-specific 5-gram models plus a surprisal-based selector improve move prediction against a global-confidence benchmark, with modest Top-1 gains and larger Top-3 gains.
What Cognaptus infers	For business action streams, segment-conditioned lightweight prediction may outperform one global model when user groups exhibit distinct behavioral patterns.
What remains uncertain	The paper does not prove this architecture will generalize to business workflows, longer contexts, domains with hidden constraints, or environments where user segments change rapidly.

That last row is not decorative caution. It changes how the idea should be used.

In a production system, the model should probably not act alone. The chess model can propose illegal moves because it does not know the board. A business analogue can propose unavailable products, noncompliant actions, impossible workflow transitions, or recommendations that violate policy. The prediction layer needs a constraint layer.

A practical architecture would therefore separate three jobs:

Component	Job
Behavioral predictor	Estimate plausible next actions from recent behavior
Constraint engine	Remove illegal, unavailable, unsafe, or noncompliant actions
Decision policy	Rank remaining options by business value, user benefit, and risk

The paper focuses on the first component. Businesses need all three. Otherwise, the system becomes very good at predicting things it should not recommend.

The boundary conditions are not small footnotes

The paper’s limitations are not fatal, but they are central to interpretation.

First, the models only use short context. A 5-gram model can consider at most four previous move tokens. That is enough to capture local patterns, especially in openings or repeated motifs. It is weak for long-range strategy. In business settings, this is like predicting a customer’s next action from the last few clicks while ignoring the full account history, contract status, prior complaints, and organizational context.

Second, the models do not know chess rules. They process algebraic notation as text. This helps speed, but it means the predictor can produce illegal moves. For business systems, the equivalent is a model that predicts a next action without knowing inventory, permissions, compliance rules, or process dependencies. Useful as a signal; dangerous as an executor.

Third, the selector accuracy is modest. It improves downstream prediction, but it should not be mistaken for reliable skill diagnosis. If this were used in coaching, the system should not say, “You are definitely L4.” It should say something closer to, “Your recent move pattern resembles this skill band enough that this model may give useful predictions.” Less dramatic, more correct. A tragic tradeoff for marketing departments everywhere.

Fourth, the evaluation is bounded by the paper’s dataset design: July 2024 samples for training and August 2024 samples for testing, seven coarse rating bins, 1,000 test games per level, and comparisons against a designed global-confidence benchmark. The paper reports point comparisons, but it offers neither a broad benchmark suite against modern sequence models nor statistical confidence intervals.

So the result is best understood as a proof of mechanism, not a final product benchmark.

The real lesson: humans are not failed optimizers

The most useful conceptual move in the paper is the shift from optimality to behavior.

Traditional chess analysis asks how far a human move deviates from the engine’s preferred move. That is valuable for training. But it treats human play as a shadow of optimal play: sometimes close, often flawed, occasionally tragic.

This paper asks a different question: can those deviations themselves be modeled?

That is a productive question far beyond chess. In many business systems, users are not failed versions of ideal users. Employees are not failed workflow diagrams. Customers are not failed conversion funnels. Traders are not failed expected-utility machines, though some try very hard to prove otherwise.

They are patterned actors under constraints, habits, incentives, skill levels, fatigue, and context.

A model that predicts the “best” action may miss the actual action. A model that predicts the actual action may look less intelligent but become more operationally useful.

That is why the paper’s n-gram simplicity is not a weakness in this article’s business reading. It is part of the point. The authors show that even lightweight models can extract useful behavioral structure when the system is organized around the right question.

Not “What should happen?”

But “What kind of actor is this, and what is plausible next?”

Conclusion: the next move is not always the best move

The paper’s contribution is not that n-gram language models will replace chess engines. They will not. Stockfish can relax.

The contribution is that chess move prediction can be reframed as skill-conditioned human behavior modeling. By training separate models for seven rating groups and selecting among them using surprisal, the framework improves move prediction over a global-confidence benchmark. The Top-1 gains are modest. The Top-3 gains are more substantial. The middle game remains messy. Novices and experts are troublesome in different ways. Humanity, as usual, refuses to be a clean dataset.

For Cognaptus readers, the business lesson is clear: when predicting human action, segmentation is not just a reporting layer. It can be part of the prediction mechanism itself.

Do not ask one model to average incompatible behaviors and then act surprised when it predicts nobody in particular. Route first. Predict second. Constrain before acting. And when the future is genuinely uncertain, rank plausible next moves instead of pretending there is only one.

That is not just a chess lesson.

It is how many practical AI systems should be built.

Cognaptus: Automate the Present, Incubate the Future.

¹ Daren Zhong, Dingcheng Huang, and Clayton Greenberg, “Predicting Human Chess Moves: An AI Assisted Analysis of Chess Games Using Skill-group Specific n-gram Language Models,” arXiv:2512.01880, 2025. https://arxiv.org/abs/2512.01880
