Opening — Why This Matters Now
Every few months, a new model release arrives wrapped in confident headlines: human-level reasoning, expert-level coding, AGI within reach. Benchmarks light up. Leaderboards shift. Twitter celebrates.
And yet, when these same models are asked to play a casual mobile game for two minutes — the kind designed for bored commuters — they collapse into hesitation, confusion, or paralysis.
That tension sits at the heart of AI GAMESTORE, a new evaluation framework that proposes something both deceptively simple and quietly radical: if you want to measure general intelligence, stop testing static benchmarks. Test whether AI can learn to play the multiverse of human games.
Not chess. Not Go. Not a curated reasoning dataset.
All the games humans invent, enjoy, and spread.
The results? Frontier models score less than 10% of human median performance across 100 representative games — while taking 12–18× longer to think.
For anyone building AI products, autonomous agents, or automation systems, this is not just academic curiosity. It’s a calibration moment.
Background — From Static Benchmarks to Living Testbeds
Traditional AI evaluation has a structural problem: it measures fragments of intelligence.
- Language understanding (GLUE, BIG-bench)
- Coding tasks (SWE-bench)
- Math problem solving
- Fixed board games
These benchmarks are useful — until models saturate them.
The AI GAMESTORE paper reframes the problem. Instead of asking whether a model can solve predefined tasks, it asks:
Can a machine learn and play any human-designed game as efficiently as a human with the same time budget?
The authors define this space as the “Multiverse of Human Games.”
Why games?
Because games are distilled abstractions of real-world skills:
| Game Genre | Real-World Cognitive Parallel |
|---|---|
| Strategy games | Long-horizon planning & resource management |
| Puzzle games | Constraint reasoning & working memory |
| Action games | Spatial-temporal coordination |
| Social deduction | Theory of mind & deception reasoning |
| Sandbox / open world | World-model construction |
Games are cultural compression. They encode what humans think is worth practicing.
If a system fails broadly across this distribution, calling it “general” becomes aspirational branding.
The AI GAMESTORE — A Practical Proxy for an Infinite Space
Of course, evaluating all conceivable human games is impossible.
So AI GAMESTORE constructs a scalable proxy:
- Source popular digital games (Apple App Store, Steam)
- Filter for suitability (playable in minutes, measurable score, no domain-specific trivia)
- Use LLMs to generate standardized p5.js versions
- Refine with human-in-the-loop review
- Annotate cognitive demands
- Evaluate models and humans under identical 2-minute budgets
The system is intentionally dynamic — a living benchmark designed to resist saturation.
This matters commercially. Static benchmarks create predictable optimization targets. Living benchmarks create moving constraints — closer to real-world deployment environments.
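To make the catalog idea concrete, here is a minimal sketch of what one entry in such a living benchmark might look like. The schema, field names, and example games are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of a catalog entry; the schema and field names are
# illustrative assumptions, not the paper's actual data format.
from dataclasses import dataclass

@dataclass
class GameEntry:
    game_id: str                     # stable identifier for the generated p5.js version
    source_title: str                # original title from the app-store / Steam charts
    genre: str                       # e.g. "puzzle", "action", "strategy"
    cognitive_demands: list[str]     # annotated tags such as ["memory", "planning"]
    time_budget_s: int = 120         # identical budget for humans and models
    human_median_score: float = 0.0  # anchor for normalization (human median = 100)

# Two made-up entries, purely for illustration.
catalog = [
    GameEntry("g001", "Block Drop", "puzzle", ["memory", "planning"],
              human_median_score=4200.0),
    GameEntry("g002", "Lane Dash", "action", ["spatial-temporal coordination"],
              human_median_score=310.0),
]
```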
What Was Tested
100 Generated Games
Sourced from top charts across multiple countries and genres.
106 Human Participants
Each played 10 games, 120 seconds per game.
7 Frontier Vision-Language Models
- GPT-5.2
- GPT-5-mini
- Gemini-2.5-Pro
- Gemini-2.5-Flash
- Claude-Opus-4.5
- Qwen-3-VL-32B
- Llama-4-Maverick
Models were given the same 120-second gameplay budget — implemented via a pause-and-query harness.
Performance was normalized to the human median score (set to 100).
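As a rough illustration of how a pause-and-query harness can enforce that budget, here is a hedged sketch. The game and model interfaces (screenshot, pause, resume, apply, score, decide) are hypothetical stand-ins, not the paper's implementation.

```python
# Hedged sketch of a pause-and-query loop. The game and model interfaces are
# hypothetical stand-ins, not the paper's implementation. Only in-game time
# counts against the budget; wall-clock thinking time is tracked separately.
import time

def run_episode(game, model, budget_s: float = 120.0, step_s: float = 1.0):
    gameplay_s, thinking_s = 0.0, 0.0
    while gameplay_s < budget_s and not game.is_over():
        frame = game.screenshot()              # current rendered frame
        game.pause()                           # freeze the game clock while the model thinks
        t0 = time.monotonic()
        action = model.decide(frame)           # vision-language model picks the next input
        thinking_s += time.monotonic() - t0
        game.resume()
        game.apply(action, duration_s=step_s)  # advance gameplay by one step
        gameplay_s += step_s
    return game.score(), gameplay_s, thinking_s
```

The key design point is that only in-game seconds count against the 120-second budget, while wall-clock thinking time is logged separately, which is what makes the latency comparison later in this piece possible.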
Findings — The Performance Cliff
1. Aggregate Performance
| Model | Geometric Mean (Human = 100) |
|---|---|
| GPT-5.2 | 8.5 |
| Claude-Opus-4.5 | ~7–8 |
| Gemini-2.5-Pro | ~7–9 |
| Others | 3–6 |
Even the strongest models achieved <10% of human median performance.
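For readers who want the aggregation spelled out, a small sketch of the normalization implied by the table: per-game scores are expressed relative to the human median (set to 100), then combined with a geometric mean so that one runaway score on an easy game cannot hide collapse elsewhere. The floor on near-zero scores is my assumption to keep the logarithm defined; the paper's exact handling may differ.

```python
# Sketch of the aggregation implied by the table: per-game scores expressed
# relative to the human median (human = 100), combined with a geometric mean.
# The 0.1 floor is an assumption to keep log() defined on total collapses;
# the paper's exact handling of zero scores may differ.
import math

def normalized_geomean(model_scores: dict[str, float],
                       human_medians: dict[str, float],
                       floor: float = 0.1) -> float:
    ratios = []
    for game_id, score in model_scores.items():
        normalized = 100.0 * score / human_medians[game_id]
        ratios.append(max(normalized, floor))
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```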
2. Bimodal Failure Pattern
Across 100 games, models exhibited two modes:
- Moderate underperformance (10–30% of human score)
- Near-total collapse (<1% of human score)
This is not graceful degradation. It is cognitive brittleness.
3. The Real Bottlenecks
When performance was broken down by cognitive demand, three weaknesses dominated:
| Capability | Observed Model Weakness |
|---|---|
| Memory | Difficulty maintaining cross-frame state, even with a scratchpad |
| Planning | Poor multi-step simulation |
| World Model Learning | Struggles to infer hidden mechanics |
The more capabilities a game required, the sharper the performance drop.
In other words: integration fails before components do.
Time — The Hidden Variable
Humans: 120 seconds per game.
Models: often 20+ minutes of wall-clock time per game, because reasoning latency accumulates while the harness holds the game paused.
Even ignoring raw score, efficiency matters.
If an autonomous agent requires 15× human thinking time to reach 8% performance, it is not competitive in dynamic environments — especially in robotics, operations, or real-time decision systems.
Speed is not cosmetic. It is architectural.
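A quick back-of-the-envelope calculation (not a metric defined in the paper) shows how harsh the combined picture is once latency is folded in:

```python
# Back-of-the-envelope arithmetic, not a metric defined in the paper: fold latency
# into the comparison by dividing normalized score by relative thinking time.
human = {"score": 100.0, "time_s": 120.0}
model = {"score": 8.0, "time_s": 15 * 120.0}    # roughly 8% of human score at 15x the time

score_ratio = model["score"] / human["score"]   # 0.08
time_ratio = model["time_s"] / human["time_s"]  # 15.0
efficiency = score_ratio / time_ratio           # about 0.005

print(f"Score per unit of thinking time: {efficiency:.1%} of human")  # ~0.5%
```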
Why This Is a Business Story, Not Just a Research Story
For founders and operators building AI-driven systems, three implications stand out.
1. Benchmark Success ≠ General Robustness
Excelling at coding benchmarks or reasoning tests does not imply adaptability in open-ended task spaces.
Game-based evaluation reveals integration gaps invisible in static datasets.
2. Memory and World Modeling Are the Commercial Frontier
Enterprise automation increasingly requires:
- Cross-session state retention
- Learning implicit process rules
- Planning under uncertainty
These map directly to the weakest cognitive areas exposed by AI GAMESTORE.
This is where infrastructure innovation — not prompt tweaking — will matter.
3. Living Benchmarks Reduce Overfitting Risk
A continuously evolving evaluation suite is closer to deployment reality.
Products built only against static evaluation targets risk silent fragility.
Limitations — And Why They Don’t Undermine the Signal
AI GAMESTORE currently focuses on short, casual games.
It does not yet include:
- Long-horizon narrative environments
- Sophisticated multi-agent theory-of-mind games
- Complex economic simulations
If anything, that makes the result more striking.
If models struggle with lightweight two-minute games, scaling difficulty upward will not narrow the gap.
The Bigger Picture — Toward Capability Profiling
The paper suggests moving beyond leaderboard obsession toward capability-oriented diagnostics.
Instead of asking:
“What is the overall score?”
Ask:
“Which latent cognitive systems fail under integration stress?”
This aligns closely with where enterprise AI evaluation must go:
- Latent capability measurement
- Sequential decision profiling
- Multi-capability interaction stress tests
General intelligence is not the sum of isolated competencies. It is the stability of their coordination.
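In practice, such a capability profile could be as simple as grouping per-game normalized scores by the cognitive-demand tags annotated in the catalog. A hedged sketch, with illustrative data shapes rather than the paper's actual analysis code:

```python
# Hedged sketch of capability profiling: group per-game normalized scores by the
# cognitive-demand tags annotated in the catalog, instead of reporting one number.
# Tag names and data shapes are illustrative assumptions.
from collections import defaultdict
from statistics import median

def capability_profile(per_game_scores: dict[str, float],
                       demands: dict[str, list[str]]) -> dict[str, float]:
    buckets: dict[str, list[float]] = defaultdict(list)
    for game_id, score in per_game_scores.items():
        for tag in demands.get(game_id, []):
            buckets[tag].append(score)
    # Median normalized score per capability, e.g. {"memory": 4.1, "planning": 6.3}
    return {tag: median(scores) for tag, scores in buckets.items()}
```

The output is a per-capability breakdown rather than a single leaderboard number, which is exactly the kind of diagnostic the paper argues for.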
Conclusion — The Multiverse Is Not Impressed
AI models are extraordinary pattern machines.
But the Multiverse of Human Games is unimpressed.
Across 100 small, culturally ordinary tasks, frontier systems achieved less than 10% of human performance, struggled with memory and planning, and required dramatically more time to think.
That does not mean progress is stagnant.
It means the frontier has shifted.
The next breakthroughs will not be won on static text benchmarks. They will be won in environments where:
- Rules must be inferred
- State must be remembered
- Plans must be simulated
- Multiple skills must integrate seamlessly
In other words — in worlds that look suspiciously like the real one.
Cognaptus: Automate the Present, Incubate the Future.