Google Unleashes SIMA 2: AI Agent Mastering Virtual Worlds with Gemini

Google DeepMind has introduced SIMA 2, a next-generation generalist AI agent powered by Gemini, designed to understand and interact with its environment rather than simply follow instructions. The new iteration doubles its predecessor's performance and can improve itself, marking a crucial step towards general-purpose robots and Artificial General Intelligence.

Uche Emeka • AI • 7 months ago • 4 minute read

Google DeepMind has unveiled SIMA 2, the next iteration of its generalist AI agent, which integrates the language and reasoning abilities of Google’s Gemini large language model. This integration allows SIMA 2 to move beyond mere instruction-following to a deeper understanding of, and interaction with, its environment, marking a substantial leap towards more general-purpose AI systems and Artificial General Intelligence (AGI).

SIMA 1, introduced in March 2024, was trained on extensive video game data, enabling it to learn and play various 3D games, even those it hadn't encountered before. While it could follow basic instructions across a broad spectrum of virtual environments, its success rate for complex tasks stood at a modest 31%, compared to a human benchmark of 71%. DeepMind senior research scientist Joe Marino highlighted that SIMA 2 represents a "step change and improvement in capabilities over SIMA 1," boasting enhanced generality, the ability to complete complex tasks in previously unseen environments, and crucial self-improvement capabilities based on its own experiences.

At the core of SIMA 2’s advancements is the Gemini 2.5 Flash-Lite model. DeepMind defines AGI as a system capable of a wide range of intellectual tasks, with the capacity to learn new skills and generalize knowledge across different domains, and its researchers emphasize the critical role of "embodied agents" in reaching that kind of generalized intelligence. Marino clarified that an embodied agent, much like a robot or human, interacts with a physical or virtual world through a body, observing inputs and taking actions. This contrasts with non-embodied agents that might manage a calendar or execute code without direct environmental interaction.
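The observe-then-act cycle Marino describes can be illustrated with a minimal sketch. Everything here is hypothetical and chosen only for illustration: `LineWorld`, `policy`, and `run_episode` are made-up names, and the one-dimensional world stands in for a far richer game environment. Nothing in it reflects SIMA 2's actual interface.

```python
from dataclasses import dataclass


@dataclass
class LineWorld:
    """Hypothetical toy environment: the agent's 'body' is a position on a line."""
    position: int = 0
    goal: int = 3

    def observe(self) -> dict:
        # What the agent perceives at this moment.
        return {"position": self.position, "goal": self.goal}

    def step(self, action: str) -> None:
        # Actions change the state of the world through the body.
        if action == "right":
            self.position += 1
        elif action == "left":
            self.position -= 1


def policy(observation: dict) -> str:
    """Pick an action from the current observation (a hand-written rule, not a model)."""
    if observation["position"] < observation["goal"]:
        return "right"
    if observation["position"] > observation["goal"]:
        return "left"
    return "stay"


def run_episode(world: LineWorld, max_steps: int = 10) -> list:
    """Repeat observe -> decide -> act until the goal is reached or steps run out."""
    actions = []
    for _ in range(max_steps):
        action = policy(world.observe())
        if action == "stay":
            break
        world.step(action)
        actions.append(action)
    return actions


if __name__ == "__main__":
    print(run_episode(LineWorld()))
```

A non-embodied agent, by contrast, would have no `observe` and `step` cycle against a world at all; it would simply map a request to an output.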

Jane Wang, a senior staff research scientist at DeepMind with a background in neuroscience, elaborated that SIMA 2’s scope extends far beyond mere gameplay. It is designed to genuinely comprehend its surroundings, understand user requests, and respond with common sense – a challenging feat for AI. By harnessing Gemini’s advanced language and reasoning abilities alongside its trained embodied skills, SIMA 2 has effectively doubled the performance of its predecessor.

Demonstrations showcased SIMA 2’s sophisticated understanding and interaction. In "No Man’s Sky," the agent accurately described a rocky planet surface and logically determined its next actions by recognizing and engaging with a distress beacon. SIMA 2 also leverages Gemini for internal reasoning; when instructed to find the house the color of a ripe tomato, it internally reasoned that ripe tomatoes are red, then located and approached the red house. Its Gemini-powered nature also allows it to interpret and follow emoji-based commands, such as using the axe and tree emojis to initiate tree-chopping.

Furthermore, Marino demonstrated SIMA 2’s ability to navigate newly generated photorealistic worlds from DeepMind’s world model, Genie, where it proficiently identified and interacted with objects like benches, trees, and butterflies.

A significant feature of SIMA 2 is its capacity for self-improvement, largely enabled by Gemini, without extensive human data. Unlike SIMA 1, which relied solely on human gameplay for training, SIMA 2 uses human gameplay data only as a strong initial baseline. When the agent is placed in a new environment, another Gemini model generates new tasks, and a separate reward model scores the agent's attempts. Through these self-generated experiences, SIMA 2 learns from its mistakes, gradually improving its performance and teaching itself new behaviors via trial and error, guided by AI-based feedback.
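The self-improvement loop described above (a task generator, a reward model, and an agent that learns from scored attempts) can be sketched in a few lines. This is a toy under loud assumptions, not DeepMind's method: `propose_tasks` and `reward_model` are hard-coded stand-ins for the Gemini task generator and the learned reward model, the task and action names are invented, and `ToyAgent` keeps a simple lookup table where the real system would update a trained model.

```python
ACTIONS = ["walk", "chop", "sit", "look"]

# Hypothetical ground truth, known only to the stand-in reward model.
TARGETS = {
    "chop the tree": "chop",
    "sit on the bench": "sit",
    "watch the butterfly": "look",
}


def propose_tasks() -> list:
    """Stand-in for the task-generating model: returns tasks for the environment."""
    return list(TARGETS)


def reward_model(task: str, action: str) -> float:
    """Stand-in for the learned reward model: scores one attempt between 0 and 1."""
    return 1.0 if TARGETS[task] == action else 0.0


class ToyAgent:
    """Remembers the score of every (task, action) pair it has tried."""

    def __init__(self) -> None:
        self.value = {}

    def act(self, task: str) -> str:
        tried = {a: self.value[(task, a)] for a in ACTIONS if (task, a) in self.value}
        for action, score in tried.items():
            if score >= 1.0:
                return action  # exploit an action that already worked
        for action in ACTIONS:
            if action not in tried:
                return action  # otherwise try something new
        return max(tried, key=tried.get)  # everything tried: take the best seen

    def learn(self, task: str, action: str, score: float) -> None:
        self.value[(task, action)] = score


def self_improve(agent: ToyAgent, rounds: int) -> list:
    """Run the generate -> attempt -> score -> learn loop; return mean score per round."""
    history = []
    for _ in range(rounds):
        scores = []
        for task in propose_tasks():
            action = agent.act(task)
            score = reward_model(task, action)
            agent.learn(task, action, score)
            scores.append(score)
        history.append(sum(scores) / len(scores))
    return history


if __name__ == "__main__":
    print(self_improve(ToyAgent(), rounds=4))
```

No human demonstrations enter the loop. The agent's average score climbs from 0 in the first round to 1 in the fourth purely from the stand-in reward model's feedback, which is the trial-and-error pattern the article describes in miniature.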

DeepMind views SIMA 2 as a crucial step towards developing more general-purpose robots. Frederic Besse, senior staff research engineer, articulated that real-world robotic tasks require two main components: a high-level understanding of the environment and necessary actions, coupled with reasoning capabilities. For instance, instructing a humanoid robot to check for bean cans in a cupboard necessitates understanding concepts like 'beans' and 'cupboard' and navigating to the location. Besse noted that SIMA 2 currently emphasizes this high-level behavior over lower-level actions like controlling physical joints and wheels.

DeepMind has not provided a specific timeline for implementing SIMA 2 in physical robotics systems. Besse noted that DeepMind’s recently unveiled robotics foundation models, which also reason about the physical world and create multi-step plans, were trained separately and differently from SIMA. There is likewise no immediate timeline for a full public release beyond the current preview. Wang indicated that the immediate goal is to showcase DeepMind’s work and explore potential collaborations and applications.