Worlds Within Reach: How SIMA 2 Turns Virtual Environments into Training Grounds for Generalist Agents

Games are not toys to an AI lab. They are controlled worlds with messy consequences.

A game gives an agent what enterprise software and robotics both struggle to provide at scale: visual ambiguity, delayed goals, menus, navigation, tool use, failure states, and a reset button that does not involve a broken warehouse robot or a furious operations manager. That is why Google DeepMind’s SIMA 2 paper is more interesting than “AI can play games again.” We have had that headline several times. It is getting a little tired, and it should probably hydrate.

The paper introduces SIMA 2, a Gemini-based vision-language-action agent that can perceive 3D virtual worlds, reason about goals, converse with users, and act through keyboard and mouse controls across a portfolio of research environments and commercial games.¹ The important part is not that it acts inside games. The important part is the training recipe: start with a foundation model, teach it low-level embodied control, add bridge data so reasoning and dialogue connect to action, improve with reinforcement learning, evaluate in seen and unseen worlds, then use Gemini again as task setter and reward model for self-improvement.

That mechanism matters because it attacks a common but wrong assumption: if a frontier vision-language model understands images and language, perhaps it can become an embodied agent through prompting. SIMA 2 says: not really. The paper reports that non-finetuned Gemini Flash-Lite and Gemini Pro reach only 3.2% and 7.0% success, respectively, on the programmatic embodied evaluation suite, despite prompt engineering to make them output the right action format. Seeing a screen is not the same as knowing how to move through it. The model has to learn consequences.

SIMA 2 is not just Gemini with a controller duct-taped on

The core design decision is deceptively simple: SIMA 2 uses a Gemini Flash-Lite model as the agent’s base, then finetunes it to output structured text that can be deterministically parsed into keyboard and mouse actions. The agent receives RGB video frames, recent interaction history, language instructions, its own previous reasoning, and its dialogue outputs. It does not receive privileged game-state variables. It sees pixels and presses keys, like a human player with less nostalgia and more tensor math.

This matters because the agent is not trained as a narrow game bot with custom APIs. It uses a generic human-computer interface: standard keyboard keys, mouse clicks, and discretized mouse movement. That interface is inefficient compared with a simulator API, but strategically powerful. It forces the agent to learn through the same surface that many future agents will use: screen, language, memory, action.
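To make "structured text that can be deterministically parsed" concrete, here is a minimal sketch of such a parser. The grammar (`keys=...; mouse=dx:..,dy:..; click=...`), the `Action` type, and the mouse step range are all invented for illustration; the article does not describe SIMA 2's real output format.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Assumptions for this sketch only: integer mouse steps in [-5, 5], two buttons.
MOUSE_STEPS = range(-5, 6)
BUTTONS = {"left", "right"}


@dataclass(frozen=True)
class Action:
    keys: FrozenSet[str] = frozenset()
    mouse_dx: int = 0
    mouse_dy: int = 0
    click: Optional[str] = None


def parse_action(text: str) -> Action:
    """Map one line of model text to exactly one Action, or raise ValueError."""
    keys: FrozenSet[str] = frozenset()
    dx = dy = 0
    click = None
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed field: {part!r}")
        name, value = name.strip(), value.strip()
        if name == "keys":
            keys = frozenset(k.strip().upper() for k in value.split("+") if k.strip())
        elif name == "mouse":
            steps = {}
            for item in value.split(","):
                axis, _, amount = item.partition(":")
                steps[axis.strip()] = int(amount)
            dx, dy = steps.get("dx", 0), steps.get("dy", 0)
            if dx not in MOUSE_STEPS or dy not in MOUSE_STEPS:
                raise ValueError(f"mouse step out of range: {part!r}")
        elif name == "click":
            if value not in BUTTONS:
                raise ValueError(f"unknown button: {value!r}")
            click = value
        else:
            raise ValueError(f"unknown field: {name!r}")
    return Action(keys, dx, dy, click)


print(parse_action("keys=w+shift; mouse=dx:3,dy:-1; click=left"))
```

The point of the design is that the same text always yields the same action and anything malformed is rejected outright, so the model cannot "almost" press a key.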

The paper’s mechanism has four major layers.

Mechanism layer	What the paper does	Why it matters
Foundation model core	Starts from Gemini Flash-Lite	Gives the agent pretrained visual, language, dialogue, and reasoning capability
Human gameplay data	Uses trajectories with frames, instructions, and keyboard/mouse actions	Teaches low-level embodied control, not just semantic recognition
Bridge data	Uses Gemini Pro to annotate high-quality examples with internal reasoning and dialogue	Connects high-level intent, explanation, and action in a single stream
RL on verifiable tasks	Applies online reinforcement learning using tasks with verification functions	Improves controllability and task success in training environments

The bridge data is especially important. Human gameplay teaches “what keys were pressed.” It does not naturally teach the agent to say, “I am looking for the campfire,” or to reason that a “house colored like a ripe tomato” means the red house. SIMA 2 needs synthetic reasoning and dialogue examples because the target behavior is not merely motor imitation. It is embodied cooperation.

That is the first business-relevant lesson: if an enterprise wants agents that can explain, ask clarifying questions, and recover from ambiguity while operating software or machines, the training data cannot contain only successful clicks. It needs the connective tissue between intention, observation, reasoning, action, and completion.

The evidence is a stack, not a single victory lap

The paper’s experimental structure is useful because it does not rely on one glamorous demo. It builds a ladder of evidence: new capabilities, training-environment performance, held-out generalization, comparison with base Gemini, hierarchical use of Gemini Pro, and self-improvement.

Here is the evidence map.

Evidence item	Likely purpose	What it supports	What it does not prove
Figures 4–5: dialogue, reasoning, complex instructions, multimodal prompting	Capability demonstration	SIMA 2 inherits and operationalizes Gemini-style language and vision abilities inside embodied tasks	It does not quantify broad reliability
Figure 6: average embodied task performance	Main quantitative evidence	SIMA 2 roughly doubles SIMA 1’s success rate and approaches timed human performance on training environments	It does not prove physical-world transfer
Figures 8–9: per-environment performance	Main evidence plus diagnostic breakdown	Gains are broad across environments, especially complex commercial games	It does not isolate which training component caused each gain
Figure 7: skill categories	Diagnostic analysis	Interaction and object management are closer to human performance; resource gathering and combat remain harder	It is not a full ablation
Figure 10: ASKA and MineDojo held-out performance	Main generalization evidence	SIMA 2 handles new visuals, menus, and mechanics better than SIMA 1	Success rates remain modest
The Gunk and Genie 3 examples	Exploratory extension	SIMA 2 can perform novel navigation/tool-use tasks and operate in photorealistic generated worlds	These are qualitative, not large-scale benchmarks
Table 1: reasoning benchmark retention	Tradeoff / regression test	Embodied finetuning does not erase general reasoning capability	Some benchmark drops are still meaningful
Figure 14 and Appendix B	Exploratory composition test	Gemini Pro can act as slower planner/memory layer above SIMA 2	It is not yet a production architecture
Figures 15–17	Self-improvement evidence	Gemini-generated tasks and rewards can drive improvement in ASKA and Genie 3	Reward reliability and open-ended safety remain open problems

On training environments, Figure 6 is the cleanest quantitative anchor. In human-evaluated tasks, SIMA 1 is around 33% success while SIMA 2 reaches around 65%, close to the timed human baseline of 66%; removing the timeout lifts human performance to about 86%. In automatic evaluations, SIMA 1 is around 30%, SIMA 2 reaches around 76%, timed humans are around 78%, and human performance without timeout is again around 86%.
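The arithmetic behind those figures is worth doing once. The sketch below uses only the approximate percentages quoted above; the dictionary keys and the `summarize` helper are names chosen for this example.

```python
# Approximate success rates (percent) as quoted in the paragraph above.
rates = {
    "human-evaluated": {"sima1": 33, "sima2": 65, "human_timed": 66, "human_untimed": 86},
    "automatic": {"sima1": 30, "sima2": 76, "human_timed": 78, "human_untimed": 86},
}


def summarize(r):
    """Improvement factor over SIMA 1 and remaining gaps to humans, in points."""
    return {
        "gain_over_sima1_x": round(r["sima2"] / r["sima1"], 2),
        "gap_to_timed_human_pts": r["human_timed"] - r["sima2"],
        "gap_to_untimed_human_pts": r["human_untimed"] - r["sima2"],
    }


for name, r in rates.items():
    print(name, summarize(r))
```

On these numbers SIMA 2 sits 1 to 2 points below timed humans but 10 to 21 points below humans without a timeout, which is why the time constraint matters when reading the comparison.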

The interpretation is not “SIMA 2 is human-level at games.” The interpretation is narrower and more useful: under the paper’s task definitions and time constraints, SIMA 2 closes much of the gap on many embodied tasks in environments included in training. That is already non-trivial, because those environments include visually rich games with menus, navigation, object interaction, and physics-like dynamics.

The paper also breaks results down by environment. In human-evaluated settings, SIMA 2 improves over SIMA 1 across Goat Simulator 3, Hydroneer, No Man’s Sky, Satisfactory, Valheim, and Wobbly Life, with reported gains ranging from +18% to +46%. In automatic evaluations, the gains range from small but positive in WorldLab to very large in several commercial-game environments, with some reported improvements above +50%.

This is where the mechanism-first view helps. If we only summarize the result, we get “SIMA 2 is better.” Fine. Please alert the media. If we follow the mechanism, the result becomes more informative: the improvements are largest where the task demands more than short instruction following. Menus, visual diversity, tool use, and game dynamics are exactly the areas where a foundation model’s semantic knowledge plus trained control should matter.

The held-out tests ask whether SIMA 2 learned embodiment, not just game trivia

The paper’s strongest generalization evidence comes from ASKA and MineDojo, both held out from training. ASKA is a Viking survival game with village-building, resource gathering, menus, crafting, and combat. MineDojo is a Minecraft benchmark suite with combat, harvesting, and tech-tree tasks.

SIMA 2 outperforms SIMA 1 by +12% in ASKA and +13% in MineDojo. The absolute rates are not presented as a solved benchmark. That is the point. Held-out worlds are harder because the agent sees new visuals, new menus, and new mechanics. In MineDojo, SIMA 1 completes only two task types, while SIMA 2 completes tasks in 26 of 50 task categories. That is not domination; it is transfer.

The human baseline is also carefully framed. Naive human players with general video game experience but no prior exposure to ASKA or MineDojo score roughly 32% in ASKA and 19% in MineDojo on representative subsets. The paper warns against direct over-comparison because humans and agents fail differently: humans often lose to time constraints, while agents often fail through suboptimal exploration.

That distinction is more than academic politeness. In enterprise settings, failure type matters as much as success rate. A warehouse robot that fails because it is slow is a different risk from an agent that confidently explores the wrong aisle. A software agent that cannot find a button is different from one that finds the wrong button and clicks it beautifully. SIMA 2’s held-out results suggest stronger generalization, but they also point to exploration and goal verification as operational bottlenecks.

The qualitative held-out examples extend the story. In The Gunk, a story-driven action-adventure game unseen during training, SIMA 2 progresses through the first 15–20 minutes under manual instruction, including scanning objects, climbing ledges, jumping gaps, and using on-screen cues such as “ABSORB” and “HOLD” to operate a new suction tool. In Genie 3, it acts inside photorealistic generated environments, mainly on navigation-based tasks.

These are not benchmark-closing claims. They are probes. Their value is to show that the keyboard-and-mouse, pixel-only interface can travel beyond the exact environments used in training.

Prompting alone fails because action is a different distribution

The most useful negative result in the paper is the baseline Gemini comparison. SIMA 2 is based on Gemini Flash-Lite, and Gemini Pro is the stronger model. A casual reader might expect Gemini Pro to do reasonably well if prompted to output actions. It does not.

The paper reports 3.2% success for baseline Gemini Flash-Lite and 7.0% for baseline Gemini Pro on programmatic evaluations across the training domains. The authors note this low performance remains despite prompt engineering to produce the correct action and text formatting.

This is the misconception-correction moment:

Reader belief	Paper’s correction	Business consequence
“A strong VLM can operate an environment if we prompt it well.”	Baseline Gemini models perform poorly without embodied finetuning.	Prompt engineering is not a substitute for action data and environment-specific feedback.
“Reasoning is the hard part; control is just output format.”	Keyboard/mouse control, visual grounding, timing, and consequence learning are distinct skills.	UI agents and robotic agents need training loops, not just better instruction templates.
“Specialized action training will destroy general intelligence.”	SIMA 2 retains much of Gemini’s benchmark capability after action training.	The design goal is not choosing between reasoning and action, but preserving both.

Table 1 addresses the last point. Relative to the baseline Gemini model, SIMA 2 after supervised finetuning (SFT) scores 4.0% lower on LiveCodeBench, 25.5% lower on AIME, and 16.3% lower on GPQA Diamond. After SFT plus RL, the relative drops are 8.4%, 15.4%, and 19.5%, respectively.

This is not “no loss.” AIME dropping by double digits is still a real regression. But the result is important because action finetuning does not collapse the model into a mute controller. SIMA 2 still reasons, converses, and handles complex instructions. For business adoption, this is the design frontier: not just teaching agents to click, but teaching them to click without forgetting why.
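The Table 1 figures can also be read stage by stage. The sketch below takes only the relative changes quoted above and computes what adding RL does on top of supervised finetuning; `stage_shift` is a helper name made up for this example.

```python
# Relative change versus baseline Gemini, in percent, as quoted from Table 1.
SFT = {"LiveCodeBench": -4.0, "AIME": -25.5, "GPQA Diamond": -16.3}
SFT_RL = {"LiveCodeBench": -8.4, "AIME": -15.4, "GPQA Diamond": -19.5}


def stage_shift(before, after):
    """Difference between two stages, in points of baseline-relative score."""
    return {name: round(after[name] - before[name], 1) for name in before}


shift = stage_shift(SFT, SFT_RL)
for name, delta in shift.items():
    direction = "recovers" if delta > 0 else "loses"
    print(f"{name}: {direction} {abs(delta):.1f} points after adding RL")
```

On these figures, RL gives back about 10 points on AIME while costing a few more on LiveCodeBench and GPQA Diamond, so the tradeoff is not uniform across benchmarks.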

Gemini Pro becomes the planner, not the fingers

SIMA 2 uses Gemini Flash-Lite because embodied action has latency constraints. A slow, brilliant model is not always the right motor controller. Anyone who has watched a committee make a sandwich already knows this.

The paper explores a hierarchical setup where Gemini Pro operates at a slower cadence, reviewing recent video history, issuing natural-language instructions to SIMA 2, and producing a summary that functions as recurrent memory. SIMA 2 remains the embodied executor.

This matters because it suggests an agent architecture with separated time scales:

Layer	Role	Enterprise analogy
Fast embodied executor	Converts local perception and instruction into actions	UI operator, robot controller, simulation actor
Slower reasoning planner	Interprets diagrams, maintains context, decomposes goals	Supervisor, workflow planner, QA reviewer
Summary memory	Carries long-horizon context between planner calls	Case file, runbook state, task ledger
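A minimal sketch of that separation of time scales follows. The function names, the `plan_every` cadence, and the stub planner and executor are assumptions for illustration; they stand in for Gemini Pro and SIMA 2 and are not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PlannerOutput:
    instruction: str  # natural-language instruction for the executor
    summary: str      # recurrent memory carried to the next planner call


def run_hierarchy(
    planner: Callable[[List[str], str], PlannerOutput],
    executor: Callable[[str, str], str],
    observe: Callable[[int], str],
    steps: int,
    plan_every: int,
) -> Tuple[List[str], str]:
    """Call the executor every step and the planner only every `plan_every` steps."""
    summary, instruction = "", ""
    recent: List[str] = []
    actions: List[str] = []
    for t in range(steps):
        frame = observe(t)
        recent.append(frame)
        if t % plan_every == 0:
            out = planner(recent, summary)
            instruction, summary = out.instruction, out.summary
            recent = []  # older history now lives only in the summary
        actions.append(executor(frame, instruction))
    return actions, summary


# Stubs standing in for the slow planner and the fast executor.
def stub_planner(recent: List[str], summary: str) -> PlannerOutput:
    calls_so_far = len(summary.split("|")) if summary else 0
    note = f"saw {len(recent)} frames"
    return PlannerOutput(f"subgoal-{calls_so_far}", f"{summary}|{note}" if summary else note)


def stub_executor(frame: str, instruction: str) -> str:
    return f"{instruction}@{frame}"


actions, summary = run_hierarchy(
    stub_planner, stub_executor, lambda t: f"frame-{t}", steps=10, plan_every=5
)
print(actions[0], actions[-1], summary)
```

With ten steps and a planner cadence of five, the executor acts ten times while the planner is consulted only twice, and the summary string is the only state that survives between planner calls.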

The paper demonstrates this with a complex multimodal instruction-following example: a diagram instructs the combined Gemini Pro + SIMA 2 system to build a campfire. The system decomposes the task, gathers stones and wood, opens the build menu, selects the campfire, and communicates progress.

This is exploratory, but the architecture is business-relevant. Enterprise agents rarely need one monolithic model doing everything at the same frequency. A procurement agent, for example, may need a slow policy-checking layer, a medium-speed planning layer, and a fast UI navigation layer. SIMA 2’s hierarchy points in that direction.

Self-improvement is the most important part, and also the easiest to oversell

The self-improvement section is the paper’s most ambitious contribution. It is also the section most likely to be abused by PowerPoint.

The setup has three model roles:

A Gemini-based task setter proposes tasks likely to be achievable from the current environment state.
SIMA 2 attempts those tasks through embodied action.
A Gemini-based reward model scores the resulting trajectory video from 0 to 100, with 50 or above treated as success.

The reward rubric is calibrated against human preference pairs over a small dataset of trajectories. The agent then trains on scored self-generated experience.

This is not magic self-improvement. It is a loop: generate task, attempt task, score video, train policy, repeat. The novelty is that the loop can operate in open-world 3D environments without relying on privileged game-state rewards for every task.
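As a sketch, the loop fits in a few lines. Only the 0 to 100 score and the threshold of 50 come from the text above; the `Episode` type, the function signatures, and the toy stand-ins for the task setter, agent, reward model, and trainer are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

SUCCESS_THRESHOLD = 50  # from the text: 50 or above is treated as success


@dataclass
class Episode:
    task: str
    trajectory: Any
    reward: int  # 0 to 100, assigned by the reward model


def success_rate(episodes: List[Episode]) -> float:
    return sum(ep.reward >= SUCCESS_THRESHOLD for ep in episodes) / len(episodes)


def self_improve(
    policy: Any,
    propose_task: Callable[[], str],
    attempt: Callable[[Any, str], Any],
    score: Callable[[str, Any], int],
    train: Callable[[Any, List[Episode]], Any],
    rounds: int,
    episodes_per_round: int,
) -> Tuple[Any, List[float]]:
    """Generate task, attempt task, score trajectory, train policy, repeat."""
    history: List[float] = []
    for _round in range(rounds):
        batch: List[Episode] = []
        for _episode in range(episodes_per_round):
            task = propose_task()
            trajectory = attempt(policy, task)
            batch.append(Episode(task, trajectory, score(task, trajectory)))
        history.append(success_rate(batch))
        policy = train(policy, batch)
    return policy, history


# Toy stand-ins: the "policy" is a single skill number and training adds 15 to it.
# A real trainer would use the rewards in the batch; this stub ignores them.
final_policy, history = self_improve(
    policy=30,
    propose_task=lambda: "gather wood",
    attempt=lambda policy, task: policy,
    score=lambda task, trajectory: min(100, trajectory),
    train=lambda policy, batch: policy + 15,
    rounds=3,
    episodes_per_round=4,
)
print(history)  # [0.0, 0.0, 1.0]
```

Notice that nothing in the loop reads game state. Every signal the policy trains on passes through `score`, which is exactly why the quality of the reward model carries so much weight.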

In ASKA, the paper first isolates the improvement component by using a fixed task set. Tasks include resource gathering, environment interaction, navigation, and menu use. Initially, SIMA 2 succeeds on fewer than a quarter of the tasks. After successive self-improvement iterations, the self-improved agents exceed the success threshold across all tasks; average performance eventually exceeds the human reference score as judged by the Gemini-based reward model.

That last phrase matters: “as judged by the Gemini-based reward model.” The score is not a neutral law of physics. It is a model-based evaluation pipeline, calibrated but still imperfect. For research, this is a promising scalable feedback mechanism. For business, it is also a governance risk. If the reward model becomes the judge, then reward quality becomes system quality.
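One simple way to keep an eye on the judge is to measure how often it agrees with human preference pairs. The sketch below is a generic pairwise-agreement check, not the paper's calibration procedure, and the trajectory ids and scores are hypothetical.

```python
from typing import Dict, List, Tuple


def pairwise_agreement(pairs: List[Tuple[str, str]], scores: Dict[str, float]) -> float:
    """Fraction of (preferred, rejected) pairs where the reward model
    scores the human-preferred trajectory strictly higher. Ties count as misses."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    hits = sum(1 for preferred, rejected in pairs if scores[preferred] > scores[rejected])
    return hits / len(pairs)


# Hypothetical data: humans preferred the first trajectory in each pair.
human_pairs = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
model_scores = {"a": 80, "b": 40, "c": 55, "d": 60, "e": 70, "f": 20, "g": 50, "h": 50}

print(pairwise_agreement(human_pairs, model_scores))  # 0.5
```

An agreement figure like this only says how often the model ranks pairs the way humans did on the sampled trajectories. It says nothing about behaviors the sample never covered, which is where reward hacking tends to live.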

The full ASKA self-improvement setup adds the task setter and focuses on game progression skills such as resource gathering, crafting, menu use, and building. The self-improved agent progresses further than both SIMA 1 and the initial SIMA 2, eventually building a shelter within a one-hour window.

In Genie 3, the paper takes the idea further. It uses urban generated environments as training tasks and natural generated environments as held-out test tasks, mostly centered on navigation. Self-improvement raises scores across nearly all train tasks, often by 25 points or more, and the gains transfer to the majority of held-out natural-environment tasks.

That is the paper’s most future-facing result: a general agent trained in virtual worlds, improving through a foundation-model task-and-reward loop, inside generated worlds. It is early, but the direction is clear. A world model can generate environments. A task model can generate goals. A reward model can score attempts. An embodied agent can train on the resulting experience.

The factory for consequences starts to look automated.

The business value is not “game AI”; it is cheaper embodied experimentation

The practical pathway from SIMA 2 to business is not that companies should deploy game-playing agents tomorrow. The serious pathway is that virtual environments can become scalable training and evaluation infrastructure for agents that must perceive, decide, act, recover, and improve.

For robotics, this means simulation spaces that test navigation, object interaction, tool use, and recovery before hardware is involved. For software agents, it means GUI environments where agents learn to operate through screens rather than internal APIs. For industrial digital twins, it means task generation and reward scoring inside simulated plants, warehouses, or maintenance workflows. For game studios, it means more capable NPCs, QA automation, and playtesting agents that can interpret goals rather than follow brittle scripts.

A useful enterprise translation looks like this:

SIMA 2 concept	Business translation	ROI relevance	Boundary
Diverse virtual worlds	Simulation portfolio covering many workflow variants	Reduces cost of collecting real-world failures	Simulation mismatch remains
Pixel-only interface	Agent operates through the same screen humans use	Works where APIs are missing or fragmented	Slower and more error-prone than structured APIs
Bridge data	Reasoning/action examples connect intent to behavior	Improves explainability and correction	Synthetic reasoning can become performative if poorly validated
Programmatic and human evaluations	Mixed evaluation suite for measurable task success	Enables regression testing and deployment gates	Metrics cover only what they can detect
Gemini task setter	Automated curriculum generation	Reduces dependence on human task designers	Task distribution can drift or become too easy
Gemini reward model	Scalable scoring of open-ended behavior	Makes self-improvement economically plausible	Reward hacking and evaluator bias become central risks
Gemini Pro + SIMA 2 hierarchy	Planner-executor architecture	Separates strategic reasoning from fast control	More moving parts, more latency, more monitoring
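The "regression testing and deployment gates" row can be made concrete with a small check. This is a sketch of one possible gate, assuming per-workflow success rates are already measured; the workflow names, numbers, and the `max_drop` tolerance are hypothetical.

```python
from typing import Dict, Tuple


def deployment_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: float = 0.02,
) -> Tuple[bool, Dict[str, Tuple[float, float]]]:
    """Pass only if no workflow's success rate falls by more than `max_drop`.
    A workflow missing from the candidate run counts as a success rate of 0."""
    regressions = {
        name: (old, candidate.get(name, 0.0))
        for name, old in baseline.items()
        if old - candidate.get(name, 0.0) > max_drop
    }
    return (not regressions), regressions


# Hypothetical success rates for two screen-based workflows.
baseline = {"invoice_entry": 0.80, "refund_flow": 0.60}
candidate = {"invoice_entry": 0.85, "refund_flow": 0.55}

passed, regressions = deployment_gate(baseline, candidate)
print(passed, regressions)
```

Here the candidate improves one workflow and still fails the gate, because the average hides a regression in the other. That is the argument for gating per workflow rather than on a single headline number.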

For Cognaptus-style automation work, the takeaway is especially relevant to business process agents. Many enterprise tasks are not clean API calls. They are screen-based, exception-heavy, and full of local cues: pop-ups, menu states, document windows, user confirmations, weird legacy systems, and the occasional button designed by someone having a difficult week. SIMA 2’s interface resembles that mess more than many polished agent benchmarks do.

The business inference is not that SIMA 2 is ready to operate SAP, a forklift, or a hospital robot. The inference is that the winning infrastructure for practical agents may look less like one giant prompt and more like a training environment: task libraries, visual states, action logs, reward models, human review, regression tests, and self-improvement loops.

The boundary: virtual competence is not physical deployment

SIMA 2 is a research preview, and the paper is clear about several limitations.

First, the physical world is not a game engine. Genie 3 photorealistic environments are closer to real-world visuals than stylized games, but they are still generated worlds. Physics, embodiment, safety, actuator noise, sensor failures, and real-world liability are not solved by walking to a red mushroom.

Second, the paper’s qualitative extensions are not the same as large-scale quantitative proof. The Gunk and Genie 3 results are valuable probes, but they should not be read as robust deployment benchmarks.

Third, self-improvement depends heavily on reward quality. A Gemini-based reward model can score open-ended video trajectories, but if the reward model misjudges progress, the agent may learn the wrong behavior. In business terms, automated improvement without evaluator governance is not innovation. It is drift with a nicer font.

Fourth, SIMA 2 still struggles with long-horizon tasks, short memory under low-latency constraints, precise low-level action, and robust visual understanding in complex 3D scenes. Combat and resource gathering remain harder categories, partly because they require timing, fine motor control, search, and environmental contingencies.

Finally, the paper does not prove that the same approach transfers directly to enterprise software, robotics, or industrial automation. It gives a plausible training and evaluation pattern. The implementation details would change by domain, and the cost of failure would change even more.

The actual signal

SIMA 2 should not be read as “AI can play games.” That headline is both true and too small.

The actual signal is that foundation models can be converted into embodied agents without discarding their language and reasoning abilities, but only if they are trained on action, reasoning-action bridge data, and environment feedback. The paper’s main contribution is a mechanism for turning virtual worlds into scalable laboratories for agency.

That mechanism has a sober business message. Agent capability will not come from better prompts alone. It will come from environments where agents can try, fail, be scored, be corrected, and improve. In other words, businesses that want useful agents need to build not only workflows, but training grounds.

SIMA 2 is early. It is imperfect. It is still mostly inside worlds where the worst possible outcome is a failed task and perhaps a confused Viking. But the architecture points toward something important: agents become more useful when they are trained against consequences, not just text.

And consequences, unlike demos, are where automation starts to become real.

Cognaptus: Automate the Present, Incubate the Future.

¹ SIMA Team, “SIMA 2: A Generalist Embodied Agent for Virtual Worlds,” arXiv:2512.04797, 2025, https://arxiv.org/abs/2512.04797.
