On Thursday, Google DeepMind announced two new artificial intelligence (AI) models designed to improve how robots operate in real-world settings. Named Gemini Robotics and Gemini Robotics-ER (embodied reasoning), the models build on vision-language technology to exhibit spatial intelligence and carry out a variety of tasks. The Mountain View-based company also disclosed a collaboration with Apptronik to develop humanoid robots powered by Gemini 2.0. The company says testing is ongoing to further evaluate the models and identify areas for improvement.
Google DeepMind Unveils Gemini Robotics AI Models
In a blog post, the company provided insight into the potential of its new AI models. Carolina Parada, Senior Director and Head of Robotics at Google DeepMind, emphasized that for AI to effectively assist people in physical environments, it must exhibit “embodied” reasoning—the capacity to interact with and comprehend its surroundings while executing tasks.
The first model, Gemini Robotics, is an advanced vision-language-action (VLA) framework derived from the Gemini 2.0 model. This system features a newly developed output modality for “physical actions,” enabling it to directly command robots in various tasks.
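Google has not published an API or output format for Gemini Robotics, so the following is only a minimal, hypothetical sketch of what a vision-language-action interface could look like, intended purely to illustrate the idea of an "action" output modality alongside text. All class, field, and method names here are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RobotAction:
    """A single low-level robot command (illustrative only)."""
    end_effector_delta: List[float]  # [dx, dy, dz] movement in metres
    gripper_open: bool               # True = open the gripper, False = close it


class VisionLanguageActionModel:
    """Hypothetical stand-in for a VLA model.

    The key idea: instead of producing only text, the model maps a camera
    image plus a natural-language instruction directly to a robot action.
    """

    def predict_action(self, image_rgb, instruction: str) -> RobotAction:
        # A real model would run a vision-language backbone here and decode
        # an action from its output; this stub simply returns a no-op.
        return RobotAction(end_effector_delta=[0.0, 0.0, 0.0], gripper_open=True)


# Example: one step of language-driven control
model = VisionLanguageActionModel()
camera_frame = None  # placeholder for an RGB image array
action = model.predict_action(camera_frame, "pick up the banana")
print(action)
```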
DeepMind identified three essential capabilities for robotics AI to be effective in physical settings: generality, interactivity, and dexterity. Generality refers to a model's ability to adapt to new situations. According to the company, Gemini Robotics excels at handling unfamiliar objects, varied instructions, and new environments, with internal tests indicating that it more than doubles performance on a comprehensive generalization benchmark.
The AI model's interactivity is built on the capabilities of Gemini 2.0. It can understand and respond to commands phrased in natural, conversational language across multiple languages. Google also noted that the model continuously monitors its surroundings, detects changes in the environment, and adjusts its actions accordingly.
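The behaviour described above, continuously watching the scene and re-planning when something changes, is essentially a closed feedback loop. The sketch below is a generic illustration of that pattern rather than DeepMind's implementation; the `scene_changed` and `plan_next_action` helpers are hypothetical placeholders.

```python
import time


def scene_changed(previous_frame, current_frame) -> bool:
    """Hypothetical change detector; a real system might compare object
    detections or tracked poses between consecutive frames."""
    return previous_frame != current_frame


def plan_next_action(frame, instruction):
    """Hypothetical planner mapping the latest observation and the spoken
    or typed instruction to the next robot command."""
    return {"command": "replan", "instruction": instruction}


def control_loop(get_frame, instruction, steps=3):
    """Generic observe -> detect change -> adjust loop."""
    previous = get_frame()
    for _ in range(steps):
        current = get_frame()
        if scene_changed(previous, current):
            # The environment changed (e.g. an object was moved),
            # so the next action is re-planned from the new frame.
            action = plan_next_action(current, instruction)
        else:
            action = {"command": "continue", "instruction": instruction}
        print(action["command"])
        previous = current
        time.sleep(0.1)  # run at roughly 10 Hz


# Example: drive the loop with a dummy camera that never changes
control_loop(get_frame=lambda: None, instruction="put the grapes in the bowl")
```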
DeepMind also revealed that Gemini Robotics can carry out highly complex, multi-step tasks that require precise manipulation of physical objects. Researchers noted that the AI can accurately perform tasks such as folding a piece of paper or packing a snack into a bag.
The second model, Gemini Robotics-ER, is similarly a vision-language model with a focus on spatial reasoning. Leveraging the coding and 3D detection capabilities of Gemini 2.0, this AI model demonstrates the ability to determine the appropriate actions for manipulating objects in real-world settings. For instance, Parada explained that when presented with a coffee mug, the model successfully generated a command for a two-finger grip to lift it by the handle along a designated safe path.
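DeepMind has not disclosed how such grasp commands are represented, so the structure below is a purely illustrative guess at the kind of output the mug example describes: a two-finger grasp on the handle plus a safe approach path. Every field name and value is invented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GraspProposal:
    """Illustrative container for a grasp plan like the one described above."""
    target_object: str
    grasp_type: str                              # e.g. "two_finger_pinch"
    grasp_point: Tuple[float, float, float]      # 3D point on the handle, in metres
    approach_waypoints: List[Tuple[float, float, float]] = field(default_factory=list)


mug_grasp = GraspProposal(
    target_object="coffee_mug",
    grasp_type="two_finger_pinch",
    grasp_point=(0.42, -0.10, 0.08),             # hypothetical handle position
    approach_waypoints=[
        (0.42, -0.10, 0.25),                     # hover above the mug
        (0.42, -0.10, 0.12),                     # descend toward the handle
    ],
)
print(mug_grasp)
```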
This AI model handles the wide range of processes required to control a robot in the physical world, including perception, state estimation, spatial understanding, planning, and code generation. Notably, neither model is currently available to the public. DeepMind appears poised to first integrate the AI models into a humanoid robot to evaluate their capabilities before considering a broader release of the technology.
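Those stages form a natural pipeline. The sketch below chains hypothetical stubs for each stage only to show how they hand data to one another; it is not how Gemini Robotics-ER actually implements them, and every function here is invented for illustration.

```python
def perceive(camera_frame):
    """Perception: detect objects in the frame (stubbed with a fixed result)."""
    return [{"name": "coffee_mug", "position": (0.4, -0.1, 0.1)}]


def estimate_state(detections):
    """State estimation: keep track of where each detected object is."""
    return {d["name"]: d["position"] for d in detections}


def understand_space(state):
    """Spatial understanding: derive simple relations such as reachability."""
    return {name: {"reachable": pos[0] < 0.8} for name, pos in state.items()}


def plan(spatial_info, goal):
    """Planning: choose a sequence of high-level steps toward the goal."""
    if spatial_info.get(goal, {}).get("reachable"):
        return ["move_above", "grasp_handle", "lift"]
    return []


def generate_code(steps):
    """Code generation: turn the planned steps into executable robot calls."""
    return "\n".join(f"robot.{step}()" for step in steps)


# End-to-end run of the illustrative pipeline
frame = None  # placeholder for a camera image
program = generate_code(plan(understand_space(estimate_state(perceive(frame))), "coffee_mug"))
print(program)
```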