Chess engines are very good at telling you what a player should do.
That is not the same as predicting what the player will do.
Anyone who has watched a beginner hang a queen, an intermediate player force a dubious attack, or a strong player choose a quiet positional squeeze already knows the difference. Optimality is one question. Human behavior is another. Most AI systems enjoy pretending those two questions are basically cousins. They are not. One is about the board. The other is about the person touching the pieces.
That distinction is the useful part of the paper “Predicting Human Chess Moves: An AI Assisted Analysis of Chess Games Using Skill-group Specific n-gram Language Models.”[1] The paper does not try to build a better chess engine. It tries to predict human chess moves by treating games as sequences of tokens and players as members of skill groups. Instead of asking, “What is the best move in this position?”, it asks, “Given how this kind of player tends to play, what move is likely next?”
That sounds smaller. It is actually the more transferable idea.
The business version is simple: before predicting the next action, first identify the behavioral regime. A novice customer, expert operator, careless employee, rushed analyst, and power user may all face the same interface, but they do not generate the same next-action distribution. If your model ignores that, it may produce a very confident average answer. A confident average answer is often just a nicely formatted mistake.
The mechanism is routing before prediction
The paper’s core design is not complicated, which is precisely why it is interesting.
The authors divide Lichess games into seven rating-based skill groups, from L1 novice games to L7 expert games. They then train one 5-gram KenLM language model for each group. A chess game is represented as a sequence of algebraic move tokens, such as:
e4 e6 Bc4 d5 exd5 exd5 Bb3 Nf6 ...
A 5-gram model predicts the next token using only a short local history. In practical terms, it looks back at up to four previous move tokens. No deep board evaluation. No search tree. No legal-move reasoning. Just statistical continuation.
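A minimal sketch of that kind of statistical continuation, assuming raw counts with back-off to shorter histories. This is not KenLM, which the paper uses and which estimates smoothed probabilities; `train_counts` and `predict_next` are hypothetical names, and the three training games are toy data.

```python
from collections import Counter, defaultdict

ORDER = 5  # a 5-gram conditions on at most ORDER - 1 = 4 previous tokens


def train_counts(games):
    """Count next-token frequencies for every history of length 0 to 4."""
    counts = defaultdict(Counter)
    for game in games:
        tokens = game.split()
        for i, token in enumerate(tokens):
            for n in range(ORDER):  # history lengths 0, 1, 2, 3, 4
                if i - n < 0:
                    break
                counts[tuple(tokens[i - n:i])][token] += 1
    return counts


def predict_next(counts, prefix, k=3):
    """Return up to k most frequent continuations, trying the longest history first."""
    tokens = prefix.split()
    for n in range(ORDER - 1, -1, -1):
        if n > len(tokens):
            continue
        history = tuple(tokens[len(tokens) - n:])
        if history in counts:
            return [move for move, _ in counts[history].most_common(k)]
    return []


# Toy data only: three short openings, not Lichess games.
counts = train_counts(["e4 e6 d4 d5", "e4 e6 d4 d5", "e4 e5 Nf3 Nc6"])
print(predict_next(counts, "e4 e6 d4"))  # ['d5']
```

Nothing in this sketch checks whether a predicted token is a legal move, which is the same blind spot the paper's models have.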
That sounds almost embarrassingly simple next to Stockfish or AlphaZero. But again, the paper is not solving chess. It is modeling human patterns.
The framework has three steps:
| Step | What the system does | Why it matters |
|---|---|---|
| Skill-specific modeling | Train separate n-gram models for seven rating bands | Different skill levels produce different move patterns |
| Model selection | Use cumulative surprisal to infer which skill-level model best fits the observed game prefix | The system routes the game to a behavioral model before predicting |
| Move prediction | Predict the next move using the selected skill-specific model | Prediction becomes conditional on player type, not just global confidence |
The model selector is the small but important trick. Each skill-specific model assigns surprisal values to the observed move sequence. Lower surprisal means the sequence is less “surprising” under that model. The selector chooses the skill model with the lowest cumulative surprisal over the observed prefix.
A compact way to express the idea is:

$$\hat{l} = \arg\min_{l \in \{1, \dots, 7\}} S_l$$
Here, $S_l$ is the cumulative surprisal assigned by skill-level model $l$ to the first $k$ half-moves of a game. The system chooses the level whose model finds the observed sequence least surprising.
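A minimal sketch of the selection rule, under loud assumptions: the per-level models here are toy add-one-smoothed unigram models rather than the paper's KenLM 5-grams, `vocab_size` is an arbitrary constant, and all function names are hypothetical.

```python
import math
from collections import Counter


def make_unigram_model(games, vocab_size=5000):
    """Toy stand-in for a skill-level model (assumption, not the paper's model)."""
    counts = Counter(tok for game in games for tok in game.split())
    total = sum(counts.values())
    return lambda token: (counts[token] + 1) / (total + vocab_size)


def cumulative_surprisal(model, moves, k):
    """Sum of -log2 P(move) over the first k half-moves, in bits."""
    return sum(-math.log2(model(move)) for move in moves[:k])


def select_level(models, moves, k=16):
    """Pick the level whose model finds the observed prefix least surprising."""
    return min(models, key=lambda lvl: cumulative_surprisal(models[lvl], moves, k))


# Toy training text for two levels; real models would be trained per rating band.
models = {
    "L1": make_unigram_model(["h4 a5 h5 a4"]),
    "L7": make_unigram_model(["d4 Nf6 c4 e6"]),
}
print(select_level(models, "d4 Nf6 c4 g6".split()))  # L7
```

The selector only compares totals across models, so any model that can score a token sequence could be dropped into `models` without changing `select_level`.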
This is not magic. It is routing.
And routing is often where business AI systems become useful. Not because routing is glamorous. It is not. It is plumbing with statistical manners. But a model that first asks “Which behavioral world am I in?” can outperform a model that tries to make one global prediction across incompatible user types.
The data pipeline turns chess into a behavioral text stream
The authors use the public Lichess database of rated standard games. They assign each game to a level by the average of the White and Black ratings:
| Level | Rating range |
|---|---|
| L1 | ≤ 1000 |
| L2 | 1000–1400 |
| L3 | 1400–1600 |
| L4 | 1600–1800 |
| L5 | 1800–2000 |
| L6 | 2000–2250 |
| L7 | ≥ 2250 |
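The binning rule can be sketched in a few lines. The table's interior edges (1400, 1600, 1800, 2000) appear in two adjacent rows, so this sketch assumes a game sitting exactly on such an edge goes to the upper band; that tie-break and the function name `skill_level` are assumptions, not details taken from the paper.

```python
from bisect import bisect_right

# Upper edges of L2 through L6; L1 is handled separately, L7 is everything from 2250 up.
UPPER_EDGES = [1400, 1600, 1800, 2000, 2250]


def skill_level(white_elo, black_elo):
    """Map a game to L1..L7 by the average of both players' ratings."""
    avg = (white_elo + black_elo) / 2
    if avg <= 1000:
        return "L1"
    # Assumption: an average exactly on an interior edge falls into the upper band.
    return f"L{bisect_right(UPPER_EDGES, avg) + 2}"


print(skill_level(1500, 1500))  # L3
print(skill_level(2300, 2200))  # L7
```

Averaging means a 1200 player facing a 2000 player lands in L3 or L4 territory, which is one reason the labels are best treated as coarse.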
For training, they randomly sample 10% of each level’s games from July 2024. For testing, they use 1,000 games from August 2024 for each level. The resulting corpus includes seven training sets and seven corresponding test sets, with training sets ranging from roughly 0.5 million to 3 million games.
The preprocessing choice is worth noticing. The authors originally considered parsing PGN files with python-chess, but instead treated the PGNs as plain text to accelerate processing. They remove metadata, side variations, move numbers, annotations, and engine evaluations. What remains is a clean move sequence.
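One way to do this kind of plain-text cleanup is with a handful of regular expressions. This is a sketch, not the authors' script: the function name is hypothetical, dropping the game-result token is an assumption, and check or mate markers (`+`, `#`) are left in place because the source does not say how they were handled.

```python
import re

RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def pgn_to_tokens(pgn_text):
    """Reduce one PGN game to a bare list of move tokens."""
    # Metadata: header lines such as [Event "..."] start with "[".
    lines = [ln for ln in pgn_text.splitlines() if not ln.lstrip().startswith("[")]
    text = " ".join(lines)
    text = re.sub(r"\{[^}]*\}", " ", text)  # comments, including engine evaluations
    previous = None
    while previous != text:  # side variations may nest, so strip innermost first
        previous = text
        text = re.sub(r"\([^()]*\)", " ", text)
    text = re.sub(r"\$\d+", " ", text)      # numeric annotation glyphs like $1
    text = re.sub(r"\b\d+\.+", " ", text)   # move numbers: "1." and "1..."
    text = re.sub(r"[!?]+", "", text)       # annotation suffixes like "?!"
    return [tok for tok in text.split() if tok not in RESULTS]


sample = """[Event "Rated Blitz game"]
[WhiteElo "1500"]

1. e4 { [%eval 0.17] } 1... e6 2. Bc4?! d5 (2... c5 3. Nf3 (3. d4)) 3. exd5 exd5 $1 4. Bb3 Nf6 1-0
"""
print(" ".join(pgn_to_tokens(sample)))  # e4 e6 Bc4 d5 exd5 exd5 Bb3 Nf6
```

A text-only pass like this never validates the game, so a malformed PGN produces malformed tokens without complaint.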
This choice gives the framework speed and simplicity. It also creates one of the paper’s central boundaries: the model does not understand the board. It sees tokens, not legal positions.
That tradeoff matters. For a chess engine, ignoring legality would be absurd. For behavioral pattern detection, it may be acceptable as a first pass. In business terms, this resembles using clickstream logs, support-ticket sequences, workflow events, or transaction histories without fully modeling every underlying constraint. You may get a useful behavioral signal quickly. You may also recommend an impossible next step. Congratulations, you have rediscovered why production systems need guardrails.
The heatmap checks whether skill fingerprints exist
Before asking whether the selector improves move prediction, the paper first checks whether the seven skill-specific models actually behave differently.
This is the likely purpose of the perplexity heatmap in the results section. It is not an ablation. It is more like a diagnostic validation: do models trained on one skill group assign lower perplexity to games from that same or nearby skill group?
The broad pattern supports that idea. Lower-level models tend to fit lower-level games better, and higher-level models tend to fit higher-level games better. The diagonal is not perfectly clean, which is important. Human behavior does not politely arrange itself for a conference figure. Nearby skill bands overlap. Some intermediate groups look statistically similar. Still, the heatmap suggests that rating-specific move patterns are strong enough for a selector to exploit.
That is the first meaningful result: chess skill leaves a statistical trace in move sequences, even when the model only sees short token histories.
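For readers who want the link between the two quantities, perplexity is average per-token surprisal mapped back onto a "number of choices" scale. A hedged statement, assuming surprisal is measured in bits, $N$ move tokens are scored, $m_i$ is the $i$-th token, and $P_l$ is the probability under the level-$l$ 5-gram model (the paper's exact normalization is not restated here):

$$\mathrm{PPL}_l = 2^{\frac{1}{N}\sum_{i=1}^{N} -\log_2 P_l(m_i \mid m_{i-4}, \dots, m_{i-1})}$$

A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 continuations. Each heatmap cell is this quantity for one model scored on one level's games.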
The paper also reports average surprisal over the first 100 half-moves for L1 games. Surprisal rises as games move toward the middle game, suggesting that predictions become less certain when the number of plausible continuations expands. This figure is best read as an explanatory diagnostic, not the main proof. It helps explain why move prediction becomes harder around the middle game.
The business analogue is familiar. Early-stage behavior is often structured. Onboarding flows, opening moves, first purchases, initial support requests, and standard operating procedures produce patterns. Later behavior branches. Users improvise. Experts customize. Beginners get lost. The middle game arrives, wearing a fake mustache.
The selector is useful, but it is not a mind reader
The selector’s skill classification accuracy is modest, and that modesty is part of the lesson.
Using the first 16 half-moves, the selector reaches 31.7% overall accuracy across seven classes. Random guessing across balanced classes would be about 14.3%, so 31.7% is meaningfully better than chance. But it is not close to “the system knows your skill level.” With the first 100 half-moves, overall accuracy falls to 26.8%.
The per-level results are more revealing:
| Game level | 16 half-moves | 100 half-moves | Interpretation |
|---|---|---|---|
| L1 | 37.2% | 22.3% | Novice behavior becomes harder to classify later |
| L2 | 35.4% | 39.0% | More history helps |
| L3 | 23.3% | 23.9% | Little change |
| L4 | 24.5% | 27.9% | More history helps modestly |
| L5 | 26.6% | 31.8% | More history helps |
| L6 | 32.0% | 27.6% | More history hurts |
| L7 | 43.3% | 15.3% | Expert behavior becomes much harder to classify later |
This table should prevent two bad readings.
The first bad reading is that more data automatically improves classification. It does not. More moves help L2, L4, and L5, leave L3 essentially flat, and hurt L1, L6, and especially L7.
The second bad reading is that experts are simply more predictable. In the early game, L7 games are classified most accurately at 43.3%. But when the window expands to 100 half-moves, L7 accuracy collapses to 15.3%. The authors suggest that strong players may deviate from common patterns as strategy becomes more sophisticated, while novices become erratic. That explanation is plausible, though the paper does not prove it causally.
The broader lesson is sharper: behavioral labels can be useful even when classification is imperfect. The selector does not need to be a perfect psychologist. It only needs to route often enough to improve downstream prediction.
That is a very practical point. Many business systems obsess over whether a segment classifier is “accurate.” The better question is whether routing improves the final decision. A mediocre classifier can still be operationally valuable if its mistakes are not too costly and its correct routes produce better predictions.
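The overall figures quoted earlier in this section can be cross-checked against the per-level table. The sketch below assumes the overall accuracy is an unweighted mean across levels, which is reasonable given equal test sets of 1,000 games per level but is not stated explicitly here.

```python
# Per-level selector accuracy (%) copied from the table above.
acc_16 = {"L1": 37.2, "L2": 35.4, "L3": 23.3, "L4": 24.5, "L5": 26.6, "L6": 32.0, "L7": 43.3}
acc_100 = {"L1": 22.3, "L2": 39.0, "L3": 23.9, "L4": 27.9, "L5": 31.8, "L6": 27.6, "L7": 15.3}


def macro_average(per_level):
    """Unweighted mean across levels (assumes equal-sized test sets)."""
    return sum(per_level.values()) / len(per_level)


chance = 100 / len(acc_16)  # uniform guessing over seven balanced classes

print(round(chance, 2))                  # 14.29
print(round(macro_average(acc_16), 2))   # 31.76
print(round(macro_average(acc_100), 2))  # 26.83
```

The 100-half-move mean matches the reported 26.8%. The 16-half-move mean comes out at 31.76 against a reported 31.7%, a gap small enough to be rounding or truncation in the source figures.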
Top-1 prediction shows the limit of asking for one human answer
The paper compares the selector-assisted framework against a benchmark that runs the input through all models and chooses the move with the highest global confidence score. This benchmark is important because it captures the tempting alternative: skip routing and just take the most confident prediction from anywhere.
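The difference between the two strategies is easy to see in code. This sketch assumes each model exposes a next-move probability distribution and treats that probability as the "confidence score"; how the paper computes confidence is not detailed here, and the function names and numbers are hypothetical.

```python
def benchmark_predict(distributions):
    """Global-confidence baseline: the single most confident move from any model."""
    candidates = [
        (prob, move)
        for dist in distributions.values()
        for move, prob in dist.items()
    ]
    return max(candidates)[1]


def routed_predict(distributions, selected_level):
    """Selector-assisted: the top move from the model chosen for this game."""
    dist = distributions[selected_level]
    return max(dist, key=dist.get)


# Toy numbers: a novice model that is very sure, an advanced model that is less sure.
distributions = {
    "L1": {"Qh5": 0.60, "Nf3": 0.20},
    "L6": {"Nf3": 0.45, "d4": 0.40},
}
print(benchmark_predict(distributions))     # Qh5
print(routed_predict(distributions, "L6"))  # Nf3
```

In the toy case the baseline follows whichever model is loudest, even when the game in front of it looks nothing like that model's training data.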
For Top-1 move prediction, the selector-assisted system improves accuracy by up to 6.6% over the benchmark. That is a real but modest gain. The figure also shows that around 50 half-moves, the advantage nearly disappears; the plotted values are essentially tied, with the benchmark even slightly ahead in the chart.
That tiny detail matters because it keeps the result honest. The mechanism helps, but it does not eliminate uncertainty. Predicting one exact human move in a complex middle game is hard, especially when the model has no board-state awareness and only a four-token memory.
Top-1 prediction is also a slightly unfair test of human behavior modeling. Humans often have several plausible moves. A player might choose one of several reasonable developing moves in the opening, one of several recaptures, or one of several strategic plans. If the model’s second guess is what the player actually does, a Top-1 metric still marks the prediction wrong.
That is why the Top-3 result is more informative.
Top-3 prediction is where the behavioral framing earns its keep
The Top-3 experiment asks whether the actual move appears among the three most likely predicted moves. This better matches the behavioral reality: human action prediction is often about narrowing the plausible set, not naming the single future event with divine confidence.
Here the selector-assisted framework performs much better. The paper reports up to a 39.1% improvement over the benchmark. From the plotted values, the largest gain appears early: at 16 half-moves, selector-assisted Top-3 accuracy is about 0.475 versus benchmark accuracy around 0.342. Gains remain visible at later checkpoints, including 20, 40, 50, and 80 half-moves.
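Two small helpers make the metric and the headline number concrete. The sketch assumes the reported improvement is relative to the benchmark rather than a difference in percentage points; the plotted values quoted above are consistent with that reading. Function names and the toy predictions are hypothetical.

```python
def top_k_accuracy(ranked_predictions, actual_moves, k):
    """Share of positions where the played move is among the first k predictions."""
    hits = sum(
        actual in ranked[:k]
        for ranked, actual in zip(ranked_predictions, actual_moves)
    )
    return hits / len(actual_moves)


def relative_gain(system, benchmark):
    """Improvement expressed as a fraction of the benchmark score."""
    return (system - benchmark) / benchmark


# Toy predictions for three positions: each list is ranked most to least likely.
ranked = [["e4", "d4", "c4"], ["Nf3", "Nc3", "d4"], ["O-O", "h3", "Re1"]]
played = ["d4", "Nf3", "a3"]
top1 = top_k_accuracy(ranked, played, k=1)  # one hit in three
top3 = top_k_accuracy(ranked, played, k=3)  # two hits in three

# Approximate values read from the paper's plot at 16 half-moves.
print(round(relative_gain(0.475, 0.342), 3))  # 0.389
```

Values read off a chart give about 38.9%, close to the reported maximum of 39.1%. In absolute terms the same gap is roughly 13 percentage points, which is worth keeping in mind when quoting the larger figure.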
The interpretation is not “the system can predict chess.” The stronger interpretation is narrower and more useful: skill-conditioned routing improves the ranking of plausible moves, especially when the evaluation allows for multiple reasonable continuations.
That distinction matters for business applications. Many operational AI systems do not need one perfect next action. They need a short, well-ranked candidate set:
| Domain | Bad objective | Better objective |
|---|---|---|
| Customer support | Predict the exact next complaint | Rank likely issue categories |
| Fraud review | Declare one reason for suspicious behavior | Surface plausible risk scenarios |
| Sales automation | Predict the exact next purchase | Rank next-best offers or objections |
| Workflow assistance | Guess the precise next click | Suggest likely next steps |
| Training systems | Label the learner perfectly | Recommend a small set of likely mistakes |
Top-3 thinking is less theatrical than Top-1 prediction. It is also more useful. The world rarely asks humans to be deterministic tokens. Most AI dashboards simply pretend otherwise because a single answer looks cleaner on a slide.
What each experiment actually supports
The paper’s experiments should be read as a sequence of supporting tests, not as one giant accuracy claim.
| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Perplexity heatmap across skill models and game levels | Diagnostic validation | Skill-specific models capture different move distributions | Perfect separation between skill levels |
| Surprisal over 100 half-moves | Exploratory diagnostic | Prediction uncertainty rises into the middle game | A causal theory of middle-game complexity |
| 16 vs. 100 half-move selector accuracy | Main routing evidence plus sensitivity check | Early-game information can identify skill group better than chance | Reliable individual-level skill classification |
| Top-1 prediction comparison | Main downstream evaluation | Selector routing can modestly improve exact next-move prediction | Strong exact prediction in all game phases |
| Top-3 prediction comparison | Main downstream evaluation | Routing is more valuable when predicting plausible move sets | Generalization beyond chess or beyond the tested months |
| L1 and L7 discussion | Post-hoc explanation | Extremes may behave less predictably outside structured openings | Definitive explanation for novice and expert unpredictability |
This table is useful because the paper’s most important result is not one number. It is the combination of three observations:
- Skill-level move distributions differ enough to be modeled.
- A simple surprisal selector can route games better than chance.
- Routing improves downstream move prediction, especially for Top-3 prediction.
That mechanism is the exportable part.
The business value is segmentation before prediction, not chess with smaller models
The lazy business takeaway would be: “n-gram models are back.” They are not. Please do not build a 2026 strategy deck around the glorious return of 5-grams. The point is not that n-gram models beat modern AI. The paper does not test that claim.
The better takeaway is architectural: cheap specialized models can become useful when paired with a lightweight selector.
This matters because many organizations face action-prediction problems where full foundation-model reasoning is expensive, slow, unnecessary, or hard to control. Examples include:
- predicting the next step in a workflow;
- suggesting likely support-ticket categories;
- identifying typical mistakes in training simulations;
- predicting customer journey branches;
- detecting behavior that deviates from a user’s normal pattern;
- ranking plausible operational next actions for human review.
In these cases, the useful unit may not be “one large model for everyone.” It may be:
- build compact models for meaningful behavioral groups;
- infer the current behavioral group from recent actions;
- route prediction through the matching model;
- return a ranked set of plausible next actions;
- apply business rules, legality checks, or human review before execution.
That architecture is not glamorous. It is usually more deployable than glamour.
It also fits a common business constraint: labels are often coarse. The paper uses rating bins, not psychological profiles. Many firms already have similarly imperfect labels: customer tier, user tenure, job role, risk band, subscription plan, prior behavior cluster, region, device type, or workflow maturity. These labels are not destiny. They are routing hints.
The paper shows that even coarse routing can improve prediction when behavior differs across groups.
What Cognaptus can infer, and what the paper actually shows
A clean business interpretation needs three layers.
| Layer | Statement |
|---|---|
| What the paper directly shows | On Lichess data, seven rating-specific 5-gram models plus a surprisal-based selector improve move prediction against a global-confidence benchmark, with modest Top-1 gains and larger Top-3 gains. |
| What Cognaptus infers | For business action streams, segment-conditioned lightweight prediction may outperform one global model when user groups exhibit distinct behavioral patterns. |
| What remains uncertain | The paper does not prove this architecture will generalize to business workflows, longer contexts, domains with hidden constraints, or environments where user segments change rapidly. |
That last row is not decorative caution. It changes how the idea should be used.
In a production system, the model should probably not act alone. The chess model can propose illegal moves because it does not know the board. A business analogue can propose unavailable products, noncompliant actions, impossible workflow transitions, or recommendations that violate policy. The prediction layer needs a constraint layer.
A practical architecture would therefore separate three jobs:
| Component | Job |
|---|---|
| Behavioral predictor | Estimate plausible next actions from recent behavior |
| Constraint engine | Remove illegal, unavailable, unsafe, or noncompliant actions |
| Decision policy | Rank remaining options by business value, user benefit, and risk |
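The three jobs compose naturally as filter-then-rank. The sketch below is one hypothetical arrangement: scoring by probability times value is an assumption chosen for brevity, and `recommend`, `is_allowed`, and `value` are illustrative names rather than parts of any real system.

```python
def recommend(predicted, is_allowed, value, k=3):
    """Combine the three components into one ranked shortlist.

    predicted:  action -> probability, from the behavioral predictor
    is_allowed: action -> bool, the constraint engine
    value:      action -> float, the decision policy's weighting
    """
    feasible = {a: p for a, p in predicted.items() if is_allowed(a)}
    ranked = sorted(feasible, key=lambda a: feasible[a] * value(a), reverse=True)
    return ranked[:k]


# Toy support-desk example: the most likely action is not permitted.
predicted = {"refund": 0.5, "upgrade": 0.3, "escalate": 0.2}
weights = {"refund": 1.0, "upgrade": 1.0, "escalate": 2.0}
shortlist = recommend(
    predicted,
    is_allowed=lambda action: action != "refund",
    value=lambda action: weights[action],
)
print(shortlist)  # ['escalate', 'upgrade']
```

Keeping the constraint check outside the predictor means the rules can change without retraining anything, which is usually where the rules change fastest.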
The paper focuses on the first component. Businesses need all three. Otherwise, the system becomes very good at predicting things it should not recommend.
The boundary conditions are not small footnotes
The paper’s limitations are not fatal, but they are central to interpretation.
First, the models only use short context. A 5-gram model can consider at most four previous move tokens. That is enough to capture local patterns, especially in openings or repeated motifs. It is weak for long-range strategy. In business settings, this is like predicting a customer’s next action from the last few clicks while ignoring the full account history, contract status, prior complaints, and organizational context.
Second, the models do not know chess rules. They process algebraic notation as text. This helps speed, but it means the predictor can produce illegal moves. For business systems, the equivalent is a model that predicts a next action without knowing inventory, permissions, compliance rules, or process dependencies. Useful as a signal; dangerous as an executor.
Third, the selector accuracy is modest. It improves downstream prediction, but it should not be mistaken for reliable skill diagnosis. If this were used in coaching, the system should not say, “You are definitely L4.” It should say something closer to, “Your recent move pattern resembles this skill band enough that this model may give useful predictions.” Less dramatic, more correct. A tragic tradeoff for marketing departments everywhere.
Fourth, the evaluation is bounded by the paper’s dataset design: July 2024 samples for training and August 2024 samples for testing, seven coarse rating bins, 1,000 test games per level, and comparisons against a designed global-confidence benchmark. The paper reports point comparisons, but not a broad benchmark suite against modern sequence models or statistical confidence intervals.
So the result is best understood as a proof of mechanism, not a final product benchmark.
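To give a rough sense of the missing uncertainty estimates, a normal-approximation interval can be put around any single per-level accuracy. This is a back-of-envelope sketch, not an analysis from the paper: it assumes each of the 1,000 test games contributes one independent classification, and `normal_approx_interval` is a hypothetical helper.

```python
import math


def normal_approx_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for a proportion (normal approximation)."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


# L7 selector accuracy at 16 half-moves, with 1,000 test games per level.
low, high = normal_approx_interval(0.433, 1000)
print(f"{low:.3f} to {high:.3f}")
```

Under those assumptions the interval is about three percentage points either side, so differences of one or two points between levels or checkpoints should be read cautiously.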
The real lesson: humans are not failed optimizers
The most useful conceptual move in the paper is the shift from optimality to behavior.
Traditional chess analysis asks how far a human move deviates from the engine’s preferred move. That is valuable for training. But it treats human play as a shadow of optimal play: sometimes close, often flawed, occasionally tragic.
This paper asks a different question: can those deviations themselves be modeled?
That is a productive question far beyond chess. In many business systems, users are not failed versions of ideal users. Employees are not failed workflow diagrams. Customers are not failed conversion funnels. Traders are not failed expected-utility machines, though some try very hard to prove otherwise.
They are patterned actors under constraints, habits, incentives, skill levels, fatigue, and context.
A model that predicts the “best” action may miss the actual action. A model that predicts the actual action may look less intelligent but become more operationally useful.
That is why the paper’s n-gram simplicity is not a weakness in this article’s business reading. It is part of the point. The authors show that even lightweight models can extract useful behavioral structure when the system is organized around the right question.
Not “What should happen?”
But “What kind of actor is this, and what is plausible next?”
Conclusion: the next move is not always the best move
The paper’s contribution is not that n-gram language models will replace chess engines. They will not. Stockfish can relax.
The contribution is that chess move prediction can be reframed as skill-conditioned human behavior modeling. By training separate models for seven rating groups and selecting among them using surprisal, the framework improves move prediction over a global-confidence benchmark. The Top-1 gains are modest. The Top-3 gains are more substantial. The middle game remains messy. Novices and experts are troublesome in different ways. Humanity, as usual, refuses to be a clean dataset.
For Cognaptus readers, the business lesson is clear: when predicting human action, segmentation is not just a reporting layer. It can be part of the prediction mechanism itself.
Do not ask one model to average incompatible behaviors and then act surprised when it predicts nobody in particular. Route first. Predict second. Constrain before acting. And when the future is genuinely uncertain, rank plausible next moves instead of pretending there is only one.
That is not just a chess lesson.
It is how many practical AI systems should be built.
Cognaptus: Automate the Present, Incubate the Future.
-
[1] Daren Zhong, Dingcheng Huang, and Clayton Greenberg, “Predicting Human Chess Moves: An AI Assisted Analysis of Chess Games Using Skill-group Specific n-gram Language Models,” arXiv:2512.01880, 2025. https://arxiv.org/abs/2512.01880