On Tuesday, Hugging Face announced the launch of SmolVLA, an open-source vision-language-action (VLA) artificial intelligence (AI) model designed specifically for robotics workflows and training. The model is built to run efficiently on consumer-grade hardware, and can operate on a computer with a single GPU or on a MacBook. The New York-based company asserts that SmolVLA outperforms significantly larger models.
Hugging Face’s SmolVLA AI Model Can Run Locally on a MacBook
Hugging Face notes that advancements in robotics have not kept pace with developments in AI technology, largely due to a shortage of high-quality, diverse data and the lack of large language models (LLMs) built for robotics applications.
Although vision-language-action models have started to address some of these challenges, leading models from major tech companies such as Google and Nvidia remain proprietary and are trained on closed datasets. As a result, the broader robotics research community, which depends on open-source resources, faces significant obstacles in replicating or building on these AI models.
VLAs can process images, videos, or live camera feeds, allowing them to interpret real-world conditions and then execute given tasks via robotic hardware.
The introduction of SmolVLA aims to address both of these issues: it is an open-source, robotics-focused model trained on publicly accessible datasets from the LeRobot community. The 450-million-parameter AI model can run on desktop systems with a compatible GPU or on recent MacBook models.
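For readers who want to experiment, the released checkpoint can be pulled from the Hugging Face Hub. The following is a minimal sketch using the huggingface_hub library; the repository ID lerobot/smolvla_base is an assumption based on the published checkpoint name and should be verified against the model card.

```python
# Minimal sketch: download the SmolVLA checkpoint from the Hugging Face Hub.
# The repo ID "lerobot/smolvla_base" is assumed; confirm it on the model card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="lerobot/smolvla_base")
print(f"SmolVLA weights saved to: {local_dir}")
```

From there, the weights can be loaded through Hugging Face's LeRobot tooling for inference or fine-tuning on a single consumer GPU.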
The architecture of SmolVLA is grounded in Hugging Face's vision-language models and pairs a SigLIP vision encoder with a language decoder named SmolLM2. The vision encoder captures and processes visual data, while language prompts are tokenized and fed to the decoder.
When it comes to executing tasks involving physical movement, the robot's sensorimotor signals are condensed into a single token, which the decoder processes together with the visual and language tokens as one cohesive stream of information. This approach lets the model understand real-world data and tasks in context rather than as disjointed elements.
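As a rough illustration of this design (not SmolVLA's actual code), the PyTorch sketch below projects image features, prompt embeddings, and a sensorimotor state vector into a shared hidden size and concatenates them into a single token stream; all dimensions and module names are illustrative assumptions.

```python
# Illustrative sketch: merging visual, language, and sensorimotor inputs
# into one token stream for a decoder. Sizes are made up for the example.
import torch
import torch.nn as nn

hidden = 960  # assumed decoder hidden size, for illustration only

vision_proj = nn.Linear(1152, hidden)  # project SigLIP-style image features
state_proj = nn.Linear(14, hidden)     # project robot state to ONE token

image_feats = torch.randn(1, 64, 1152)    # 64 visual tokens from the vision encoder
text_embeds = torch.randn(1, 12, hidden)  # 12 tokenized prompt embeddings
robot_state = torch.randn(1, 14)          # sensorimotor signals (e.g. joint angles)

prefix = torch.cat(
    [
        vision_proj(image_feats),             # visual tokens
        text_embeds,                          # language tokens
        state_proj(robot_state).unsqueeze(1), # single sensorimotor token
    ],
    dim=1,
)  # one cohesive stream of shape (batch, 64 + 12 + 1, hidden)
print(prefix.shape)
```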
Once SmolVLA has gathered the necessary information, it passes this data to a component known as the action expert. The action expert, a transformer-based module with 100 million parameters, predicts a sequence of movements for the robot, referred to as an action chunk, such as steps or arm motions.
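To make the idea concrete, here is a toy PyTorch sketch of an action expert: a small transformer head that attends over the decoder's output features and emits a fixed-length chunk of continuous actions. The chunk length, action dimensionality, and layer sizes are assumptions chosen for illustration, and the real action expert's internals and training objective differ.

```python
# Toy sketch (not the real action expert): a transformer head that turns
# decoder features into a fixed-length "action chunk".
import torch
import torch.nn as nn

class ToyActionExpert(nn.Module):
    def __init__(self, hidden=960, chunk_len=50, action_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # one learned query per future timestep in the chunk
        self.queries = nn.Parameter(torch.randn(chunk_len, hidden))
        self.to_action = nn.Linear(hidden, action_dim)  # e.g. joint targets per step

    def forward(self, decoder_feats):
        # decoder_feats: (batch, seq, hidden) features from the VLM decoder
        batch = decoder_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        x = torch.cat([decoder_feats, queries], dim=1)
        x = self.backbone(x)
        # keep only the query positions and map them to actions
        chunk = self.to_action(x[:, -self.queries.size(0):])
        return chunk  # (batch, chunk_len, action_dim)

expert = ToyActionExpert()
actions = expert(torch.randn(1, 77, 960))
print(actions.shape)  # torch.Size([1, 50, 7])
```

Predicting a whole chunk at once, rather than one action per forward pass, is what lets this kind of model drive a robot at interactive rates.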
Although its application targets a niche audience, people working in robotics can download the model's open weights, datasets, and training recipes to reproduce or improve on SmolVLA. Robotics enthusiasts with access to robotic arms or similar hardware can also use these tools to run real-time robotics workflows.