Games are not toys to an AI lab. They are controlled worlds with messy consequences.
A game gives an agent what enterprise software and robotics both struggle to provide at scale: visual ambiguity, delayed goals, menus, navigation, tool use, failure states, and a reset button that does not involve a broken warehouse robot or a furious operations manager. That is why Google DeepMind’s SIMA 2 paper is more interesting than “AI can play games again.” We have had that headline several times. It is getting a little tired, and it should probably hydrate.
The paper introduces SIMA 2, a Gemini-based vision-language-action agent that can perceive 3D virtual worlds, reason about goals, converse with users, and act through keyboard and mouse controls across a portfolio of research environments and commercial games.[1] The important part is not that it acts inside games. The important part is the training recipe: start with a foundation model, teach it low-level embodied control, add bridge data so reasoning and dialogue connect to action, improve with reinforcement learning, evaluate in seen and unseen worlds, then use Gemini again as task setter and reward model for self-improvement.
That mechanism matters because it attacks a common but wrong assumption: if a frontier vision-language model understands images and language, perhaps it can become an embodied agent through prompting. SIMA 2 says: not really. The paper reports that non-finetuned Gemini Flash-Lite and Gemini Pro reach only 3.2% and 7.0% success, respectively, on the programmatic embodied evaluation suite, despite prompt engineering to make them output the right action format. Seeing a screen is not the same as knowing how to move through it. The model has to learn consequences.
SIMA 2 is not just Gemini with a controller duct-taped on
The core design decision is deceptively simple: SIMA 2 uses a Gemini Flash-Lite model as the agent’s base, then finetunes it to output structured text that can be deterministically parsed into keyboard and mouse actions. The agent receives RGB video frames, recent interaction history, language instructions, its own previous reasoning, and its dialogue outputs. It does not receive privileged game-state variables. It sees pixels and presses keys, like a human player with less nostalgia and more tensor math.
This matters because the agent is not trained as a narrow game bot with custom APIs. It uses a generic human-computer interface: standard keyboard keys, mouse clicks, and discretized mouse movement. That interface is inefficient compared with a simulator API, but strategically powerful. It forces the agent to learn through the same surface that many future agents will use: screen, language, memory, action.
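To make "structured text that can be deterministically parsed" concrete, here is a minimal sketch of what such a parser could look like. The grammar (`KEY`, `MOUSE_MOVE`, `CLICK`), the `Action` type, and the bin range are all assumptions invented for illustration; the paper's actual output format is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical grammar, one command per line (not the paper's real format):
#   KEY W SHIFT        -> keys held this step
#   MOUSE_MOVE 3 -1    -> discretized mouse delta (dx, dy)
#   CLICK left         -> mouse button press
MOUSE_BINS = range(-5, 6)  # assumed discretization: integer bins -5..5


@dataclass(frozen=True)
class Action:
    keys: Tuple[str, ...] = ()
    mouse_move: Tuple[int, int] = (0, 0)
    click: Optional[str] = None


def parse_action(text: str) -> Action:
    """Deterministically map one step of model output text to an Action."""
    keys, move, click = [], (0, 0), None
    for line in text.strip().splitlines():
        parts = line.split()
        if not parts:
            continue
        op, args = parts[0].upper(), parts[1:]
        if op == "KEY" and args:
            keys.extend(a.upper() for a in args)
        elif op == "MOUSE_MOVE" and len(args) == 2:
            dx, dy = int(args[0]), int(args[1])  # int() raises ValueError on junk
            if dx not in MOUSE_BINS or dy not in MOUSE_BINS:
                raise ValueError(f"mouse delta out of range: {line!r}")
            move = (dx, dy)
        elif op == "CLICK" and len(args) == 1 and args[0].lower() in {"left", "right"}:
            click = args[0].lower()
        else:
            # Reject anything off-grammar instead of guessing an action.
            raise ValueError(f"unparseable action line: {line!r}")
    return Action(tuple(keys), move, click)


step = parse_action("KEY w shift\nMOUSE_MOVE 3 -1\nCLICK left")
print(step)
```

The design choice worth noticing is the strict failure mode: text that does not fit the grammar raises an error rather than being mapped to a best-guess action, which is what makes the mapping from model output to input events deterministic.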
The paper’s mechanism has four major layers.
| Mechanism layer | What the paper does | Why it matters |
|---|---|---|
| Foundation model core | Starts from Gemini Flash-Lite | Gives the agent pretrained visual, language, dialogue, and reasoning capability |
| Human gameplay data | Uses trajectories with frames, instructions, and keyboard/mouse actions | Teaches low-level embodied control, not just semantic recognition |
| Bridge data | Uses Gemini Pro to annotate high-quality examples with internal reasoning and dialogue | Connects high-level intent, explanation, and action in a single stream |
| RL on verifiable tasks | Applies online reinforcement learning using tasks with verification functions | Improves controllability and task success in training environments |
The bridge data is especially important. Human gameplay teaches “what keys were pressed.” It does not naturally teach the agent to say, “I am looking for the campfire,” or to reason that a “house colored like a ripe tomato” means the red house. SIMA 2 needs synthetic reasoning and dialogue examples because the target behavior is not merely motor imitation. It is embodied cooperation.
That is the first business-relevant lesson: if an enterprise wants agents that can explain, ask clarifying questions, and recover from ambiguity while operating software or machines, the training data cannot contain only successful clicks. It needs the connective tissue between intention, observation, reasoning, action, and completion.
The evidence is a stack, not a single victory lap
The paper’s experimental structure is useful because it does not rely on one glamorous demo. It builds a ladder of evidence: new capabilities, training-environment performance, held-out generalization, comparison with base Gemini, hierarchical use of Gemini Pro, and self-improvement.
Here is the evidence map.
| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Figures 4–5: dialogue, reasoning, complex instructions, multimodal prompting | Capability demonstration | SIMA 2 inherits and operationalizes Gemini-style language and vision abilities inside embodied tasks | It does not quantify broad reliability |
| Figure 6: average embodied task performance | Main quantitative evidence | SIMA 2 roughly doubles SIMA 1’s success rate and approaches timed human performance on training environments | It does not prove physical-world transfer |
| Figures 8–9: per-environment performance | Main evidence plus diagnostic breakdown | Gains are broad across environments, especially complex commercial games | It does not isolate which training component caused each gain |
| Figure 7: skill categories | Diagnostic analysis | Interaction and object management are closer to human performance; resource gathering and combat remain harder | It is not a full ablation |
| Figure 10: ASKA and MineDojo held-out performance | Main generalization evidence | SIMA 2 handles new visuals, menus, and mechanics better than SIMA 1 | Success rates remain modest |
| The Gunk and Genie 3 examples | Exploratory extension | SIMA 2 can perform novel navigation/tool-use tasks and operate in photorealistic generated worlds | These are qualitative, not large-scale benchmarks |
| Table 1: reasoning benchmark retention | Tradeoff / regression test | Embodied finetuning does not erase general reasoning capability | Some benchmark drops are still meaningful |
| Figure 14 and Appendix B | Exploratory composition test | Gemini Pro can act as slower planner/memory layer above SIMA 2 | It is not yet a production architecture |
| Figures 15–17 | Self-improvement evidence | Gemini-generated tasks and rewards can drive improvement in ASKA and Genie 3 | Reward reliability and open-ended safety remain open problems |
On training environments, Figure 6 is the cleanest quantitative anchor. In human-evaluated tasks, SIMA 1 is around 33% success while SIMA 2 reaches around 65%, close to the timed human baseline of 66%; removing the timeout lifts human performance to about 86%. In automatic evaluations, SIMA 1 is around 30%, SIMA 2 reaches around 76%, timed humans are around 78%, and human performance without timeout is again around 86%.
The interpretation is not “SIMA 2 is human-level at games.” The interpretation is narrower and more useful: under the paper’s task definitions and time constraints, SIMA 2 closes much of the gap on many embodied tasks in environments included in training. That is already non-trivial, because those environments include visually rich games with menus, navigation, object interaction, and physics-like dynamics.
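The "closes much of the gap" reading can be checked with simple arithmetic. The sketch below uses only the approximate Figure 6 percentages quoted above; the `summarize` helper and the "gap closed" metric are illustrative choices, not quantities the paper reports.

```python
# Approximate Figure 6 success rates (%) as quoted in the text above.
RESULTS = {
    "human-evaluated": {"sima1": 33, "sima2": 65, "human_timed": 66},
    "automatic": {"sima1": 30, "sima2": 76, "human_timed": 78},
}


def summarize(r):
    """Return (SIMA 2 / SIMA 1 ratio, share of SIMA 1-to-human gap closed)."""
    ratio = r["sima2"] / r["sima1"]
    gap_closed = (r["sima2"] - r["sima1"]) / (r["human_timed"] - r["sima1"])
    return round(ratio, 2), round(gap_closed, 2)


for name, r in RESULTS.items():
    ratio, gap_closed = summarize(r)
    print(f"{name}: {ratio}x SIMA 1, {gap_closed:.0%} of the gap to timed humans closed")
```

On these rounded inputs the ratio is about 1.97x for human-evaluated tasks and about 2.53x for automatic ones, with roughly 96 to 97% of the gap to the timed human baseline closed. The untimed human baseline of about 86% is further away in both cases, which is why "human-level" would overstate the result.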
The paper also breaks results down by environment. In human-evaluated settings, SIMA 2 improves over SIMA 1 across Goat Simulator 3, Hydroneer, No Man’s Sky, Satisfactory, Valheim, and Wobbly Life, with reported gains ranging from +18% to +46%. In automatic evaluations, the gains range from small but positive in WorldLab to very large in several commercial-game environments, with some reported improvements above +50%.
This is where the mechanism-first view helps. If we only summarize the result, we get “SIMA 2 is better.” Fine. Please alert the media. If we follow the mechanism, the result becomes more informative: the improvements are largest where the task demands more than short instruction following. Menus, visual diversity, tool use, and game dynamics are exactly the areas where a foundation model’s semantic knowledge plus trained control should matter.
The held-out tests ask whether SIMA 2 learned embodiment, not just game trivia
The paper’s strongest generalization evidence comes from ASKA and MineDojo, both held out from training. ASKA is a Viking survival game with village-building, resource gathering, menus, crafting, and combat. MineDojo is a Minecraft benchmark suite with combat, harvesting, and tech-tree tasks.
SIMA 2 outperforms SIMA 1 by +12% in ASKA and +13% in MineDojo. The absolute rates are not presented as a solved benchmark. That is the point. Held-out worlds are harder because the agent sees new visuals, new menus, and new mechanics. In MineDojo, SIMA 1 completes only two task types, while SIMA 2 completes tasks in 26 of 50 task categories. That is not domination; it is transfer.
The human baseline is also carefully framed. Naive human players with general video game experience but no prior exposure to ASKA or MineDojo score roughly 32% in ASKA and 19% in MineDojo on representative subsets. The paper warns against direct over-comparison because humans and agents fail differently: humans often lose to time constraints, while agents often fail through suboptimal exploration.
That distinction is more than academic politeness. In enterprise settings, failure type matters as much as success rate. A warehouse robot that fails because it is slow is a different risk from an agent that confidently explores the wrong aisle. A software agent that cannot find a button is different from one that finds the wrong button and clicks it beautifully. SIMA 2’s held-out results suggest stronger generalization, but they also point to exploration and goal verification as operational bottlenecks.
The qualitative held-out examples extend the story. In The Gunk, a story-driven action-adventure game unseen during training, SIMA 2 progresses through the first 15–20 minutes under manual instruction, including scanning objects, climbing ledges, jumping gaps, and using on-screen cues such as “ABSORB” and “HOLD” to operate a new suction tool. In Genie 3, it acts inside photorealistic generated environments, mainly on navigation-based tasks.
These are not benchmark-closing claims. They are probes. Their value is to show that the keyboard-and-mouse, pixel-only interface can travel beyond the exact environments used in training.
Prompting alone fails because action is a different distribution
The most useful negative result in the paper is the baseline Gemini comparison. SIMA 2 is based on Gemini Flash-Lite, and Gemini Pro is the stronger model. A casual reader might expect Gemini Pro to do reasonably well if prompted to output actions. It does not.
The paper reports 3.2% success for baseline Gemini Flash-Lite and 7.0% for baseline Gemini Pro on programmatic evaluations across the training domains. The authors note this low performance remains despite prompt engineering to produce the correct action and text formatting.
This is the misconception-correction moment:
| Reader belief | Paper’s correction | Business consequence |
|---|---|---|
| “A strong VLM can operate an environment if we prompt it well.” | Baseline Gemini models perform poorly without embodied finetuning. | Prompt engineering is not a substitute for action data and environment-specific feedback. |
| “Reasoning is the hard part; control is just output format.” | Keyboard/mouse control, visual grounding, timing, and consequence learning are distinct skills. | UI agents and robotic agents need training loops, not just better instruction templates. |
| “Specialized action training will destroy general intelligence.” | SIMA 2 retains much of Gemini’s benchmark capability after action training. | The design goal is not choosing between reasoning and action, but preserving both. |
Table 1 addresses the last point. Compared with the baseline Gemini model, SIMA 2 after supervised finetuning (SFT) shows relative changes of -4.0% on LiveCodeBench, -25.5% on AIME, and -16.3% on GPQA Diamond. After SFT plus RL, the changes are -8.4%, -15.4%, and -19.5%, respectively.
This is not “no loss.” A double-digit drop on AIME is still a real regression. But the result is important because action finetuning does not collapse the model into a mute controller. SIMA 2 still reasons, converses, and handles complex instructions. For business adoption, this is the design frontier: not just teaching agents to click, but teaching them to click without forgetting why.
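One way to treat Table 1 as a regression test rather than a footnote is to compare each relative change against a tolerance. The sketch below uses the figures quoted above; the 10% tolerance (`MAX_DROP`) and the `regressions` helper are hypothetical and chosen only for illustration.

```python
# Relative changes (%) vs. the baseline Gemini model, as quoted from Table 1.
TABLE_1 = {
    "SFT": {"LiveCodeBench": -4.0, "AIME": -25.5, "GPQA Diamond": -16.3},
    "SFT+RL": {"LiveCodeBench": -8.4, "AIME": -15.4, "GPQA Diamond": -19.5},
}
MAX_DROP = 10.0  # hypothetical tolerance, chosen only for illustration


def regressions(changes, max_drop=MAX_DROP):
    """Return benchmarks whose relative drop exceeds the tolerance."""
    return sorted(name for name, delta in changes.items() if -delta > max_drop)


for stage, changes in TABLE_1.items():
    retained = {name: round(100 + delta, 1) for name, delta in changes.items()}
    print(stage, "retained %:", retained, "flagged:", regressions(changes))
```

With this tolerance, AIME and GPQA Diamond are flagged at both stages while LiveCodeBench passes, which matches the reading above: capability is largely retained, but not for free.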
Gemini Pro becomes the planner, not the fingers
SIMA 2 uses Gemini Flash-Lite because embodied action has latency constraints. A slow, brilliant model is not always the right motor controller. Anyone who has watched a committee make a sandwich already knows this.
The paper explores a hierarchical setup where Gemini Pro operates at a slower cadence, reviewing recent video history, issuing natural-language instructions to SIMA 2, and producing a summary that functions as recurrent memory. SIMA 2 remains the embodied executor.
This matters because it suggests an agent architecture with separated time scales:
| Layer | Role | Enterprise analogy |
|---|---|---|
| Fast embodied executor | Converts local perception and instruction into actions | UI operator, robot controller, simulation actor |
| Slower reasoning planner | Interprets diagrams, maintains context, decomposes goals | Supervisor, workflow planner, QA reviewer |
| Summary memory | Carries long-horizon context between planner calls | Case file, runbook state, task ledger |
The paper demonstrates this with a complex multimodal instruction-following example: a diagram instructs the combined Gemini Pro + SIMA 2 system to build a campfire. The system decomposes the task, gathers stones and wood, opens the build menu, selects the campfire, and communicates progress.
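The separated time scales can be sketched as a loop in which the executor runs every step and the planner runs only every few steps, passing a text summary forward as memory. Everything here is an assumption for illustration: `run_hierarchy`, the toy planner and executor, and the `plan_every` cadence are hypothetical stand-ins, not the paper's implementation.

```python
def run_hierarchy(planner, executor, frames, plan_every):
    """Run a fast executor every step and a slow planner every `plan_every` steps."""
    summary, instruction = "", ""
    history, actions = [], []
    for t, frame in enumerate(frames):
        history.append(frame)
        if t % plan_every == 0:
            # Slow path: review recent frames, refresh instruction and memory.
            instruction, summary = planner(history[-plan_every:], summary)
        # Fast path: act on the current frame under the latest instruction.
        actions.append(executor(frame, instruction))
    return actions, summary


# Toy stand-ins (hypothetical): a real system would call models here.
def toy_planner(recent_frames, summary):
    step = summary.count("|") + 1
    return f"subgoal-{step}", summary + f"saw {len(recent_frames)} frame(s)|"


def toy_executor(frame, instruction):
    return f"{instruction}@{frame}"


actions, memory = run_hierarchy(
    toy_planner, toy_executor, [f"f{i}" for i in range(6)], plan_every=3
)
print(actions)
print(memory)
```

The executor never waits on the planner except at planning steps, and the only state that crosses planner calls is the summary string. That is the property the campfire example relies on: long-horizon context lives in the slow layer, reflexes in the fast one.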
This is exploratory, but the architecture is business-relevant. Enterprise agents rarely need one monolithic model doing everything at the same frequency. A procurement agent, for example, may need a slow policy-checking layer, a medium-speed planning layer, and a fast UI navigation layer. SIMA 2’s hierarchy points in that direction.
Self-improvement is the most important part, and also the easiest to oversell
The self-improvement section is the paper’s most ambitious contribution. It is also the section most likely to be abused by PowerPoint.
The setup has three model roles:
- A Gemini-based task setter proposes tasks likely to be achievable from the current environment state.
- SIMA 2 attempts those tasks through embodied action.
- A Gemini-based reward model scores the resulting trajectory video from 0 to 100, with 50 or above treated as success.
The reward rubric is calibrated against human preference pairs over a small dataset of trajectories. The agent then trains on scored self-generated experience.
This is not magic self-improvement. It is a loop: generate task, attempt task, score video, train policy, repeat. The novelty is that the loop can operate in open-world 3D environments without relying on privileged game-state rewards for every task.
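The loop above can be written down in a few lines. This is a minimal sketch under stated assumptions: the four callables are hypothetical stand-ins for the task setter, the agent, the reward model, and the training step, and `ToyAgent` replaces a policy with a single number so the loop is runnable. Only the 0 to 100 scale and the success threshold of 50 come from the text.

```python
SUCCESS_THRESHOLD = 50  # from the text: a score of 50 or above counts as success


def self_improvement_round(propose_task, attempt, score, train, state, n_tasks):
    """One iteration: generate tasks, attempt them, score them, train."""
    experience = []
    for _ in range(n_tasks):
        task = propose_task(state)
        trajectory = attempt(task)
        reward = score(task, trajectory)  # reward model output in [0, 100]
        experience.append((task, trajectory, reward))
    train(experience)
    return sum(r >= SUCCESS_THRESHOLD for _, _, r in experience) / n_tasks


class ToyAgent:
    """Hypothetical stand-in: 'skill' is a single number, not a policy."""

    def __init__(self):
        self.skill = 30

    def attempt(self, task):
        return {"task": task, "quality": self.skill}

    def train(self, experience):
        # Placeholder update: any scored experience raises skill by 10.
        if experience:
            self.skill = min(100, self.skill + 10)


agent = ToyAgent()
rates = [
    self_improvement_round(
        propose_task=lambda state: f"gather wood near {state}",
        attempt=agent.attempt,
        score=lambda task, traj: traj["quality"],
        train=agent.train,
        state="camp",
        n_tasks=4,
    )
    for _ in range(3)
]
print(rates)
```

Note that no game-state variable appears anywhere in the loop: the only feedback signal is whatever `score` returns, which is both the source of the method's scalability and the place where errors would enter.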
In ASKA, the paper first isolates the improvement component by using a fixed task set. Tasks include resource gathering, environment interaction, navigation, and menu use. Initially, SIMA 2 succeeds on fewer than a quarter of the tasks. After successive self-improvement iterations, the self-improved agents exceed the success threshold across all tasks; average performance eventually exceeds the human reference score as judged by the Gemini-based reward model.
That last phrase matters: “as judged by the Gemini-based reward model.” The score is not a neutral law of physics. It is a model-based evaluation pipeline, calibrated but still imperfect. For research, this is a promising scalable feedback mechanism. For business, it is also a governance risk. If the reward model becomes the judge, then reward quality becomes system quality.
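If reward quality is system quality, then it needs its own metric. One simple option, sketched here as an assumption rather than the paper's calibration procedure, is pairwise agreement: the fraction of human preference pairs where the reward model scores the human-preferred trajectory higher. The `preference_agreement` helper and the toy scores are hypothetical.

```python
def preference_agreement(pairs, score):
    """Fraction of human-labelled pairs where the reward model ranks the
    human-preferred trajectory strictly higher than the rejected one."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    agree = sum(score(preferred) > score(rejected) for preferred, rejected in pairs)
    return agree / len(pairs)


# Toy data (hypothetical): trajectories are ids, scores are a lookup table.
toy_scores = {"a": 80, "b": 35, "c": 55, "d": 60, "e": 10, "f": 10}
human_pairs = [("a", "b"), ("c", "d"), ("e", "f")]  # (preferred, rejected)

agreement = preference_agreement(human_pairs, toy_scores.get)
print(f"agreement with human preferences: {agreement:.2f}")
```

Ties count as disagreement in this sketch, which is a deliberately strict choice: a judge that cannot separate a preferred trajectory from a rejected one is not providing a training signal for that pair.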
The full ASKA self-improvement setup adds the task setter and focuses on game progression skills such as resource gathering, crafting, menu use, and building. The self-improved agent progresses further than both SIMA 1 and the initial SIMA 2, eventually building a shelter within a one-hour window.
In Genie 3, the paper takes the idea further. It uses urban generated environments as training tasks and natural generated environments as held-out test tasks, mostly centered on navigation. Self-improvement raises scores across nearly all train tasks, often by 25 points or more, and the gains transfer to the majority of held-out natural-environment tasks.
That is the paper’s most future-facing result: a general agent trained in virtual worlds, improving through a foundation-model task-and-reward loop, inside generated worlds. It is early, but the direction is clear. A world model can generate environments. A task model can generate goals. A reward model can score attempts. An embodied agent can train on the resulting experience.
The factory for consequences starts to look automated.
The business value is not “game AI”; it is cheaper embodied experimentation
The practical pathway from SIMA 2 to business is not that companies should deploy game-playing agents tomorrow. The serious pathway is that virtual environments can become scalable training and evaluation infrastructure for agents that must perceive, decide, act, recover, and improve.
For robotics, this means simulation spaces that test navigation, object interaction, tool use, and recovery before hardware is involved. For software agents, it means GUI environments where agents learn to operate through screens rather than internal APIs. For industrial digital twins, it means task generation and reward scoring inside simulated plants, warehouses, or maintenance workflows. For game studios, it means more capable NPCs, QA automation, and playtesting agents that can interpret goals rather than follow brittle scripts.
A useful enterprise translation looks like this:
| SIMA 2 concept | Business translation | ROI relevance | Boundary |
|---|---|---|---|
| Diverse virtual worlds | Simulation portfolio covering many workflow variants | Reduces cost of collecting real-world failures | Simulation mismatch remains |
| Pixel-only interface | Agent operates through the same screen humans use | Works where APIs are missing or fragmented | Slower and more error-prone than structured APIs |
| Bridge data | Reasoning/action examples connect intent to behavior | Improves explainability and correction | Synthetic reasoning can become performative if poorly validated |
| Programmatic and human evaluations | Mixed evaluation suite for measurable task success | Enables regression testing and deployment gates | Metrics cover only what they can detect |
| Gemini task setter | Automated curriculum generation | Reduces dependence on human task designers | Task distribution can drift or become too easy |
| Gemini reward model | Scalable scoring of open-ended behavior | Makes self-improvement economically plausible | Reward hacking and evaluator bias become central risks |
| Gemini Pro + SIMA 2 hierarchy | Planner-executor architecture | Separates strategic reasoning from fast control | More moving parts, more latency, more monitoring |
For Cognaptus-style automation work, the takeaway is especially relevant to business process agents. Many enterprise tasks are not clean API calls. They are screen-based, exception-heavy, and full of local cues: pop-ups, menu states, document windows, user confirmations, weird legacy systems, and the occasional button designed by someone having a difficult week. SIMA 2’s interface resembles that mess more than many polished agent benchmarks do.
The business inference is not that SIMA 2 is ready to operate SAP, a forklift, or a hospital robot. The inference is that the winning infrastructure for practical agents may look less like one giant prompt and more like a training environment: task libraries, visual states, action logs, reward models, human review, regression tests, and self-improvement loops.
The boundary: virtual competence is not physical deployment
SIMA 2 is a research preview, and the paper is clear about several limitations.
First, the physical world is not a game engine. Genie 3’s photorealistic environments are closer to real-world visuals than stylized games, but they are still generated worlds. Physics, embodiment, safety, actuator noise, sensor failures, and real-world liability are not solved by walking to a red mushroom.
Second, the paper’s qualitative extensions are not the same as large-scale quantitative proof. The Gunk and Genie 3 results are valuable probes, but they should not be read as robust deployment benchmarks.
Third, self-improvement depends heavily on reward quality. A Gemini-based reward model can score open-ended video trajectories, but if the reward model misjudges progress, the agent may learn the wrong behavior. In business terms, automated improvement without evaluator governance is not innovation. It is drift with a nicer font.
Fourth, SIMA 2 still struggles with long-horizon tasks, a short memory window under low-latency constraints, precise low-level action, and robust visual understanding in complex 3D scenes. Combat and resource gathering remain harder categories, partly because they require timing, fine motor control, search, and handling of environmental contingencies.
Finally, the paper does not prove that the same approach transfers directly to enterprise software, robotics, or industrial automation. It gives a plausible training and evaluation pattern. The implementation details would change by domain, and the cost of failure would change even more.
The actual signal
SIMA 2 should not be read as “AI can play games.” That headline is both true and too small.
The actual signal is that foundation models can be converted into embodied agents without discarding their language and reasoning abilities, but only if they are trained on action, reasoning-action bridge data, and environment feedback. The paper’s main contribution is a mechanism for turning virtual worlds into scalable laboratories for agency.
That mechanism has a sober business message. Agent capability will not come from better prompts alone. It will come from environments where agents can try, fail, be scored, be corrected, and improve. In other words, businesses that want useful agents need to build not only workflows, but training grounds.
SIMA 2 is early. It is imperfect. It is still mostly inside worlds where the worst possible outcome is a failed task and perhaps a confused Viking. But the architecture points toward something important: agents become more useful when they are trained against consequences, not just text.
And consequences, unlike demos, are where automation starts to become real.
Cognaptus: Automate the Present, Incubate the Future.
[1] SIMA Team, “SIMA 2: A Generalist Embodied Agent for Virtual Worlds,” arXiv:2512.04797, 2025, https://arxiv.org/abs/2512.04797.