Opening — Why this matters now
Artificial intelligence has mastered many games. Chess. Go. Even the occasionally confusing world of StarCraft.
But there is a quieter, unresolved problem hiding inside game‑AI research: imperfect information. Most real‑world decisions—from trading markets to negotiations—look far more like poker than chess. Players operate with partial knowledge, uncertain beliefs, and constantly shifting probabilities.
Yet strangely, many AI systems designed for these environments are evaluated on only a handful of games. Poker. Maybe a trick‑taking game. Occasionally something exotic like Dou Di Zhu.
The result? Researchers often claim algorithmic improvements based on performance in just one or two domains.
In other words: impressive benchmarks, questionable generalization.
A recent research effort introduces Valet, a standardized testbed of 21 traditional card games designed specifically to address this gap. The idea is simple but powerful—if we want AI systems that reason under uncertainty, we should test them across a broad spectrum of uncertainty structures.
And card games, as it turns out, are perfect laboratories.
Background — Imperfect information and the benchmarking problem
Game AI historically advanced through clear benchmark milestones:
| Era | Game | Key Property | AI Milestone |
|---|---|---|---|
| 1990s | Chess | Perfect information | Deep Blue |
| 2010s | Go | Large search space | AlphaGo |
| 2010s–2020s | Poker | Imperfect information | Libratus / Pluribus |
The last category—imperfect information games—is especially important for real‑world AI systems.
In these environments:
- Some information is hidden
- Outcomes contain randomness
- Players must infer beliefs about others
- Strategy includes deception and probabilistic reasoning
Card games naturally encode all of these features.
Hidden hands create private information. Random draws introduce stochasticity. Observed actions reveal clues about opponents.
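To make "observed actions reveal clues" concrete, here is a minimal Python sketch of belief updating in a trick-taking setting: once an opponent fails to follow suit, every candidate hand containing that suit can be eliminated, which sharpens the estimate of what they hold. The tiny deck, two-card hand, and card codes are toy assumptions for illustration, not anything taken from Valet itself.

```python
from itertools import combinations
from fractions import Fraction

# Toy deck: 4 hearts + 4 spades; the opponent holds 2 unknown cards.
deck = ["2H", "3H", "4H", "5H", "2S", "3S", "4S", "5S"]
hands = list(combinations(deck, 2))  # all candidate opponent hands

def prob_holds(card, candidate_hands):
    """Probability the opponent holds `card`, uniform over candidate hands."""
    return Fraction(sum(card in h for h in candidate_hands), len(candidate_hands))

before = prob_holds("5S", hands)  # prior belief: 7/28 = 1/4

# Observation: hearts were led and the opponent played a spade.
# Under suit-following rules, their hand cannot contain any heart.
consistent = [h for h in hands if not any(c.endswith("H") for c in h)]
after = prob_holds("5S", consistent)  # posterior belief: 3/6 = 1/2
```

A single observed action doubles the belief that the opponent holds the 5 of spades. Real agents run this same elimination logic over full 52-card deals, where exact enumeration gives way to sampling.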
Despite this richness, most algorithm comparisons rely on a tiny subset of games. The research ecosystem frequently reuses the same examples:
- Poker variants
- Trick‑taking games (Hearts, Bridge)
- Climbing games (Dou Di Zhu)
This leads to a methodological issue: algorithm performance may reflect properties of the chosen game rather than general capability.
Valet attempts to solve exactly that.
Analysis — What the Valet benchmark actually introduces
Valet curates 21 traditional card games spanning multiple genres, player counts, deck structures, and information patterns.
The selection intentionally emphasizes diversity rather than popularity.
Genre coverage
| Category | Example Games | Strategic Traits |
|---|---|---|
| Trick‑taking | Hearts, Whist, Euchre | Sequential inference and suit constraints |
| Shedding / hand management | Crazy Eights, President | Hand reduction strategy |
| Betting / comparison | Blackjack, Leduc Hold’em | Probabilistic reasoning |
| Capture / scoring | Scopa, Goofspiel | Simultaneous or tactical scoring |
| Multi‑phase play | Cribbage | Hybrid strategy stages |
This diversity matters because each genre stresses different algorithmic capabilities.
For example:
| Mechanic | AI challenge |
|---|---|
| Hidden hands | Belief modeling |
| Random card draws | Stochastic planning |
| Sequential play | Long‑horizon reasoning |
| Limited legal actions | Constrained decision trees |
| Deduction from play | Opponent modeling |
Instead of evaluating algorithms on a single difficulty profile, Valet spreads evaluation across multiple dimensions.
Standardized rule encoding
A second major contribution is rule normalization.
Traditional card games have countless regional variants. Researchers often implement slightly different versions, making comparisons unreliable.
Valet solves this using a domain description language called RECYCLE, which encodes fixed rulesets for each game. This ensures that experiments across frameworks reference the same underlying game logic.
In practical terms, that means:
- Reproducible experiments
- Cross‑framework comparability
- Reduced ambiguity in benchmarking
A small change in rules can radically alter strategy spaces. Valet removes that variable.
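To illustrate what a fixed, machine-readable ruleset buys you, here is a hypothetical encoding sketched as plain Python data. This is not RECYCLE syntax (the paper defines its own language, which this post does not reproduce); it only shows the principle: when two experiments load identical rule descriptions, the "which regional variant did you implement?" ambiguity disappears.

```python
# Hypothetical rule description for Hearts -- illustrative only;
# the actual RECYCLE language is defined by the Valet authors.
HEARTS_RULESET = {
    "name": "hearts",
    "players": 4,
    "deck": {"suits": ["C", "D", "H", "S"], "ranks": 13},
    "deal": {"cards_per_player": 13},
    "constraints": ["must_follow_suit_if_able"],
    "scoring": {"each_heart": 1, "queen_of_spades": 13},
    "objective": "minimize_points",
}

def same_ruleset(a, b):
    """Two experiments are comparable only if they reference
    identical rule descriptions, down to every parameter."""
    return a == b
```

Any divergence, even a single scoring constant, makes `same_ruleset` fail, which is exactly the comparability guarantee a standardized encoding provides.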
Findings — Measuring diversity across the benchmark
To demonstrate the benchmark’s usefulness, the researchers analyzed four structural properties of the games.
1. Information structure
Card games hide and reveal information through several mechanisms.
| Information Mechanism | Example |
|---|---|
| Public visibility | Cards played face‑up |
| Hidden cards | Face‑down piles |
| Private information | Player hands |
| Shared transfers | Cards exchanged between players |
| Deduction | Inference from actions |
Even within the same category of games, these mechanisms vary significantly.
For instance:
- Cribbage introduces shared private information.
- Goofspiel uses hidden simultaneous bidding.
- Trick‑taking games reveal suit constraints through play.
This produces a wide spectrum of uncertainty structures.
2. Branching factor
The branching factor—the number of choices available at each decision point—varies widely across the testbed.
| Game Type | Typical Branching Behavior |
|---|---|
| Trick‑taking games | Decreasing options as suits constrain play |
| Hand‑management games | Moderate branching |
| Climbing games (President) | High branching |
| Guessing games (Go Fish) | Very high early branching |
Some games force narrow decision trees, while others explode combinatorially.
For AI researchers, this means different search strategies may excel in different games.
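The suit-constraint effect in trick-taking games is easy to see in code. A minimal sketch, assuming simple two-character card codes (rank then suit):

```python
def legal_moves_trick_taking(hand, led_suit):
    """Suit-following rule: if you hold the led suit you must play it;
    otherwise any card in hand is legal."""
    following = [card for card in hand if card[-1] == led_suit]
    return following if following else list(hand)

hand = ["2H", "9H", "4S", "KC"]
print(len(legal_moves_trick_taking(hand, "H")))  # 2: must follow hearts
print(len(legal_moves_trick_taking(hand, "D")))  # 4: no diamonds, all cards legal
```

The branching factor collapses from the full hand size to the count of one suit whenever the player can follow, which is why trick-taking search trees narrow as play proceeds, while shedding and climbing games stay wide.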
3. Game length
Game duration—measured in decision points—ranges broadly.
| Length Category | Examples |
|---|---|
| Short (10–20 moves) | Blackjack |
| Medium (20–80 moves) | Hearts, Euchre |
| Long (>100 moves) | Rummy, Skitgubbe |
Longer games require deeper planning and memory of past actions.
Short games emphasize tactical decisions.
A benchmark covering both exposes algorithm weaknesses more clearly.
4. Score distributions
The study also compared the score distributions produced by Monte Carlo Tree Search (MCTS) agents against those produced by random players.
Across the benchmark:
- MCTS outperformed random play in most games
- Performance gaps varied dramatically
- Some games showed little improvement
This variation suggests that algorithmic effectiveness depends heavily on game structure.
Which is precisely the problem Valet aims to expose.
Implications — Why this matters beyond card games
At first glance, a collection of card games may seem like a niche academic tool.
In reality, it points to a broader methodological issue across AI research.
Many AI benchmarks today suffer from benchmark monoculture:
- Image models judged by ImageNet
- Language models judged by a handful of datasets
- Game AI judged by a few canonical games
Such benchmarks can distort progress by rewarding systems optimized for narrow tasks.
Valet suggests an alternative philosophy:
Benchmark diversity reveals algorithm robustness.
For agent systems—especially those operating under uncertainty—this principle is critical.
Applications that resemble imperfect‑information games include:
| Real‑world domain | Similarity to card games |
|---|---|
| Financial trading | Hidden information and probabilistic inference |
| Cybersecurity | Partial visibility of adversary actions |
| Negotiation systems | Strategic information disclosure |
| Multi‑agent coordination | Belief modeling and uncertainty |
Better evaluation environments ultimately lead to more reliable AI agents.
Valet therefore represents something larger than a card‑game library.
It is a proposal for more scientifically rigorous benchmarking in imperfect‑information AI.
Conclusion — Better games, better science
Artificial intelligence has made enormous progress in game environments.
But progress in benchmark design has lagged behind progress in algorithms.
Valet addresses a deceptively simple issue: evaluating AI across many diverse games instead of a few iconic ones.
That shift forces researchers to confront an uncomfortable but healthy question:
Does an algorithm actually generalize—or does it merely exploit the quirks of a particular game?
Card games have always been about reading hidden information and managing uncertainty.
Now they may also help researchers do the same.
Cognaptus: Automate the Present, Incubate the Future.