Opening — Why this matters now

Artificial intelligence has mastered many games. Chess. Go. Even the sprawling real‑time battles of StarCraft.

But there is a quieter, unresolved problem hiding inside game‑AI research: imperfect information. Most real‑world decisions—from trading markets to negotiations—look far more like poker than chess. Players operate with partial knowledge, uncertain beliefs, and constantly shifting probabilities.

Yet strangely, many AI systems designed for these environments are evaluated on only a handful of games. Poker. Maybe a trick‑taking game. Occasionally something exotic like Dou Di Zhu.

The result? Researchers often claim algorithmic improvements based on performance in just one or two domains.

In other words: impressive benchmarks, questionable generalization.

A recent research effort introduces Valet, a standardized testbed of 21 traditional card games designed specifically to address this gap. The idea is simple but powerful—if we want AI systems that reason under uncertainty, we should test them across a broad spectrum of uncertainty structures.

And card games, as it turns out, are perfect laboratories.


Background — Imperfect information and the benchmarking problem

Game AI historically advanced through clear benchmark milestones:

| Era | Game | Key Property | AI Milestone |
| --- | --- | --- | --- |
| 1990s | Chess | Perfect information | Deep Blue |
| 2010s | Go | Large search space | AlphaGo |
| 2010s–2020s | Poker | Imperfect information | Libratus / Pluribus |

The last category—imperfect information games—is especially important for real‑world AI systems.

In these environments:

  • Some information is hidden
  • Outcomes contain randomness
  • Players must infer beliefs about others
  • Strategy includes deception and probabilistic reasoning

Card games naturally encode all of these features.

Hidden hands create private information. Random draws introduce stochasticity. Observed actions reveal clues about opponents.

Despite this richness, most algorithm comparisons rely on a tiny subset of games. The research ecosystem frequently reuses the same examples:

  • Poker variants
  • Trick‑taking games (Hearts, Bridge)
  • Climbing games (Dou Di Zhu)

This leads to a methodological issue: algorithm performance may reflect properties of the chosen game rather than general capability.

Valet attempts to solve exactly that.


Analysis — What the Valet benchmark actually introduces

Valet curates 21 traditional card games spanning multiple genres, player counts, deck structures, and information patterns.

The selection intentionally emphasizes diversity rather than popularity.

Genre coverage

| Category | Example Games | Strategic Traits |
| --- | --- | --- |
| Trick‑taking | Hearts, Whist, Euchre | Sequential inference and suit constraints |
| Shedding / hand management | Crazy Eights, President | Hand reduction strategy |
| Betting / comparison | Blackjack, Leduc Hold'em | Probabilistic reasoning |
| Capture / scoring | Scopa, Goofspiel | Simultaneous or tactical scoring |
| Multi‑phase play | Cribbage | Hybrid strategy stages |

This diversity matters because each genre stresses different algorithmic capabilities.

For example:

| Mechanic | AI challenge |
| --- | --- |
| Hidden hands | Belief modeling |
| Random card draws | Stochastic planning |
| Sequential play | Long‑horizon reasoning |
| Limited legal actions | Constrained decision trees |
| Deduction from play | Opponent modeling |

Instead of evaluating algorithms on a single difficulty profile, Valet spreads evaluation across multiple dimensions.

Standardized rule encoding

A second major contribution is rule normalization.

Traditional card games have countless regional variants. Researchers often implement slightly different versions, making comparisons unreliable.

Valet solves this using a domain description language called RECYCLE, which encodes fixed rulesets for each game. This ensures that experiments across frameworks reference the same underlying game logic.

In practical terms, that means:

  • Reproducible experiments
  • Cross‑framework comparability
  • Reduced ambiguity in benchmarking

A small change in rules can radically alter strategy spaces. Valet removes that variable.
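To make the idea concrete, here is a hypothetical sketch of what a normalized, data‑driven ruleset might look like. This is not actual RECYCLE syntax (the paper's language is not reproduced here), and the field names and `spec_fingerprint` helper are illustrative assumptions; the point is that pinning rules down as data lets any framework verify it is playing the same game.

```python
# Hypothetical rule encoding (NOT real RECYCLE syntax): a fixed ruleset as
# plain data, plus a canonical fingerprint so experiments across frameworks
# can confirm they reference identical game logic.
import hashlib
import json

HEARTS_SPEC = {                       # illustrative fields, not a real schema
    "name": "hearts",
    "players": 4,
    "deck": {"suits": ["C", "D", "H", "S"], "ranks": 13},
    "deal": {"cards_per_player": 13},
    "rules": {
        "must_follow_suit": True,
        "hearts_must_be_broken": True,
        "moon_shooting": True,        # a common regional variant, pinned explicitly
    },
}

def spec_fingerprint(spec: dict) -> str:
    """Canonical hash of a ruleset: any rule tweak changes the fingerprint,
    so silently divergent implementations become detectable."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

Flipping a single variant flag (say, disabling moon shooting) yields a different fingerprint, which is exactly the kind of divergence that makes informal cross‑paper comparisons unreliable.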


Findings — Measuring diversity across the benchmark

To demonstrate the benchmark’s usefulness, the researchers analyzed four structural properties of the games.

1. Information structure

Card games hide and reveal information through several mechanisms.

| Information Mechanism | Example |
| --- | --- |
| Public visibility | Cards played face‑up |
| Hidden cards | Face‑down piles |
| Private information | Player hands |
| Shared transfers | Cards exchanged between players |
| Deduction | Inference from actions |

Even within the same category of games, these mechanisms vary significantly.

For instance:

  • Cribbage introduces shared private information.
  • Goofspiel uses hidden simultaneous bidding.
  • Trick‑taking games reveal suit constraints through play.

This produces a wide spectrum of uncertainty structures.
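The deduction mechanism can be made precise with a little probability. Below is a minimal sketch in a toy Hearts‑like setting (the numbers are illustrative, not from the benchmark): a belief about a hidden card updates sharply the moment an opponent's play reveals a void in a suit.

```python
# Belief updating from public actions in a toy trick-taking setting.
from fractions import Fraction

def prob_opponent_holds(unseen_cards: int, hand_size: int) -> Fraction:
    """P(a specific unseen card sits in one opponent's hand), assuming
    unseen cards are uniformly distributed over all hidden locations."""
    return Fraction(hand_size, unseen_cards)

def prob_after_void(unseen_cards: int, void_hand: int, hand_size: int) -> Fraction:
    """Once an opponent reveals a void in the suit, their hand is excluded
    and the probability mass shifts to the remaining hidden locations."""
    return Fraction(hand_size, unseen_cards - void_hand)

# After the deal, 39 cards are hidden from us (3 opponents x 13 cards):
prior = prob_opponent_holds(39, 13)        # each opponent: 13/39 = 1/3
# One opponent discards off-suit, revealing a void in spades:
posterior = prob_after_void(39, 13, 13)    # each remaining opponent: 13/26 = 1/2
```

A single observed action moves the belief from 1/3 to 1/2, which is why opponent modeling is inseparable from information structure in these games.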

2. Branching factor

The branching factor—the number of choices available at each decision point—varies widely across the testbed.

| Game Type | Typical Branching Behavior |
| --- | --- |
| Trick‑taking games | Decreasing options as suits constrain play |
| Hand‑management games | Moderate branching |
| Climbing games (President) | High branching |
| Guessing games (Go Fish) | Very high early branching |

Some games force narrow decision trees, while others explode combinatorially.

For AI researchers, this means different search strategies may excel in different games.
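How mechanics shape the branching factor is easy to see in code. The sketch below uses a simplified follow‑suit rule (illustrative, not the benchmark's implementation): the same hand offers two legal moves when constrained and five when not.

```python
# Branching factor under a simplified trick-taking legality rule.
def legal_moves(hand, led_suit):
    """Follow-suit rule: if you hold cards of the led suit you must play
    one of them; otherwise any card is legal."""
    same_suit = [card for card in hand if card[0] == led_suit]
    return same_suit if same_suit else list(hand)

hand = [("H", 2), ("H", 9), ("S", 5), ("D", 11), ("C", 7)]

constrained = legal_moves(hand, "H")    # must follow hearts: 2 options
unconstrained = legal_moves(hand, "S")  # wait -- we hold a spade too
void_in_led = legal_moves(hand, "X")    # void in the led suit: all 5 cards legal
```

Multiply that per‑turn difference over a whole game and the search trees diverge by orders of magnitude, which is why a single game cannot stress‑test a search algorithm.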

3. Game length

Game duration—measured in decision points—ranges broadly.

| Length Category | Examples |
| --- | --- |
| Short (10–20 moves) | Blackjack |
| Medium (20–80 moves) | Hearts, Euchre |
| Long (>100 moves) | Rummy, Skitgubbe |

Longer games require deeper planning and memory of past actions.

Short games emphasize tactical decisions.

A benchmark covering both exposes algorithm weaknesses more clearly.

4. Score distributions

The study also compared Monte Carlo Tree Search (MCTS) against random players.

Across the benchmark:

  • MCTS outperformed random play in most games
  • Performance gaps varied dramatically
  • Some games showed little improvement

This variation suggests that algorithmic effectiveness depends heavily on game structure.

Which is precisely the problem Valet aims to expose.
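The shape of such a comparison is easy to reproduce in miniature. The sketch below pits a flat Monte Carlo player (a simplified stand‑in for full MCTS) against a uniform‑random player on a toy hit/stand game; the game, numbers, and policies are illustrative assumptions, not the study's setup.

```python
# Flat Monte Carlo vs. random play on a toy hit/stand game:
# draw 1-10, score your total if you stop at <= 21, score 0 if you bust.
import random

BUST = 21

def rollout(total, rng):
    """Finish the game with random hit/stand choices; return the score."""
    while total <= BUST and rng.random() < 0.5:
        total += rng.randint(1, 10)
    return total if total <= BUST else 0

def mc_action(total, rng, n=50):
    """Estimate each action's value by n random rollouts; pick the better."""
    stand_value = total
    hit_value = sum(rollout(total + rng.randint(1, 10), rng) for _ in range(n)) / n
    return "hit" if hit_value > stand_value else "stand"

def play(policy, rng):
    total = 0
    while True:
        if policy(total, rng) == "stand":
            return total
        total += rng.randint(1, 10)
        if total > BUST:
            return 0

rng = random.Random(0)
random_policy = lambda total, r: "hit" if r.random() < 0.5 else "stand"

rand_avg = sum(play(random_policy, rng) for _ in range(300)) / 300
mc_avg = sum(play(mc_action, rng) for _ in range(300)) / 300
```

Even this crude rollout player beats random by a wide margin here, but on a game with different structure (say, one where random play is near‑optimal) the gap would shrink, which is the study's point: the margin measures the game as much as the algorithm.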


Implications — Why this matters beyond card games

At first glance, a collection of card games may seem like a niche academic tool.

In reality, it points to a broader methodological issue across AI research.

Many AI benchmarks today suffer from benchmark monoculture:

  • Image models judged by ImageNet
  • Language models judged by a handful of datasets
  • Game AI judged by a few canonical games

Such benchmarks can distort progress by rewarding systems optimized for narrow tasks.

Valet suggests an alternative philosophy:

Benchmark diversity reveals algorithm robustness.

For agent systems—especially those operating under uncertainty—this principle is critical.

Applications that resemble imperfect‑information games include:

| Real‑world domain | Similarity to card games |
| --- | --- |
| Financial trading | Hidden information and probabilistic inference |
| Cybersecurity | Partial visibility of adversary actions |
| Negotiation systems | Strategic information disclosure |
| Multi‑agent coordination | Belief modeling and uncertainty |

Better evaluation environments ultimately lead to more reliable AI agents.

Valet therefore represents something larger than a card‑game library.

It is a proposal for more scientifically rigorous benchmarking in imperfect‑information AI.


Conclusion — Better games, better science

Artificial intelligence has made enormous progress in game environments.

But progress in benchmark design has lagged behind progress in algorithms.

Valet addresses a deceptively simple issue: evaluating AI across many diverse games instead of a few iconic ones.

That shift forces researchers to confront an uncomfortable but healthy question:

Does an algorithm actually generalize—or does it merely exploit the quirks of a particular game?

Card games have always been about reading hidden information and managing uncertainty.

Now they may also help researchers do the same.

Cognaptus: Automate the Present, Incubate the Future.