
Hugging Face Unveils SmolVLA: AI for Local Robotics!


On Tuesday, Hugging Face unveiled SmolVLA, an open-source vision-language-action (VLA) artificial intelligence model tailored for robotics applications and training tasks. The model is designed to run efficiently on computers equipped with a single consumer GPU, or even on a MacBook, according to the company. Hugging Face asserts that SmolVLA can outperform larger models in its category, and the model is already available for download.

Hugging Face’s SmolVLA AI Model: Compatible with MacBooks

Hugging Face points out that progress in robotics has stagnated despite rapid advances in the wider AI sector. The company attributes the slowdown to a shortage of high-quality, diverse data, as well as a lack of large language models (LLMs) designed specifically for robotic tasks.

Vision-language-action (VLA) models have been proposed as a solution to this challenge. However, leading models from major firms such as Google and Nvidia tend to be proprietary and trained on exclusive datasets. As highlighted in the company's announcement, this has created significant hurdles for the broader robotics research community, which relies heavily on open-source data for innovation and development.

These VLA models can analyze images, videos, or live camera feeds, interpret real-world conditions, and execute specified tasks using robotic hardware.
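As a concrete, if simplified, picture of that loop, the Python sketch below wires together a camera read, a model prediction, and a hardware command. Note that read_camera, predict_action, and apply_action are hypothetical stand-ins for illustration, not functions from SmolVLA or any real robotics library.

```python
import numpy as np

def read_camera() -> np.ndarray:
    """Stand-in for a real camera driver; returns a blank RGB frame."""
    return np.zeros((480, 640, 3), dtype=np.uint8)

def predict_action(image: np.ndarray, instruction: str) -> np.ndarray:
    """Hypothetical VLA inference: (image, instruction) -> action vector."""
    return np.zeros(6)  # e.g. six joint-space deltas

def apply_action(action: np.ndarray) -> None:
    """Stand-in for sending a command to robotic hardware."""
    print("applying action:", action)

instruction = "pick up the red cube"
for _ in range(3):  # a few control steps
    frame = read_camera()                        # analyze the camera feed
    action = predict_action(frame, instruction)  # interpret scene + task
    apply_action(action)                         # execute on hardware
```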

SmolVLA seeks to remedy these issues in the robotics research space by offering an open-source model trained on publicly available datasets from the LeRobot community. At 450 million parameters, SmolVLA can run on a desktop computer with a compatible GPU or on newer MacBook models.

The architecture builds on Hugging Face's vision-language models, pairing a SigLIP vision encoder with a language decoder known as SmolLM2. The vision encoder captures and processes visual information, while natural-language prompts are tokenized and passed to the language decoder.

For tasks involving motion or physical actions via robotic hardware, sensorimotor signals are integrated into a single token. The decoder aggregates this information into one continuous stream, allowing the model to understand real-world data and tasks contextually, rather than viewing them as isolated components.
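To make that fusion step concrete, the rough PyTorch sketch below projects dummy visual features, tokenized prompt IDs, and a single sensorimotor state token into one shared stream for a decoder. Every dimension and module name here is invented for illustration; the snippet mirrors the structure described in the announcement, not SmolVLA's actual implementation.

```python
import torch
import torch.nn as nn

d_model = 512  # hypothetical shared embedding width

vision_encoder = nn.Linear(768, d_model)        # stand-in for the SigLIP encoder
text_embedding = nn.Embedding(32000, d_model)   # stand-in for SmolLM2 embeddings
state_projector = nn.Linear(7, d_model)         # sensorimotor signals -> ONE token

image_features = torch.randn(1, 64, 768)        # 64 visual patches (dummy)
prompt_ids = torch.randint(0, 32000, (1, 12))   # 12 tokenized prompt tokens (dummy)
robot_state = torch.randn(1, 7)                 # e.g. joint angles + gripper

vision_tokens = vision_encoder(image_features)           # (1, 64, 512)
language_tokens = text_embedding(prompt_ids)             # (1, 12, 512)
state_token = state_projector(robot_state).unsqueeze(1)  # (1, 1, 512)

# One continuous stream for the decoder: vision, language, and state are
# attended to jointly rather than treated as isolated components.
stream = torch.cat([vision_tokens, language_tokens, state_token], dim=1)
print(stream.shape)  # torch.Size([1, 77, 512])
```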

SmolVLA also includes a component called the action expert, which determines the appropriate action to take based on what the model has learned. The action expert employs a transformer-based architecture with 100 million parameters and predicts short sequences of robot movements, such as steps and arm motions, known as action chunks.
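The toy PyTorch module below suggests what such an action-chunk head could look like: learned queries, one per future timestep, attend to the fused context and are decoded into a short sequence of actions. All sizes are invented for illustration; the real action expert is a roughly 100-million-parameter transformer.

```python
import torch
import torch.nn as nn

d_model, chunk_len, action_dim = 512, 8, 6  # hypothetical sizes

class ActionExpert(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Learned queries, one per step of the predicted chunk.
        self.chunk_queries = nn.Parameter(torch.randn(1, chunk_len, d_model))
        self.to_action = nn.Linear(d_model, action_dim)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # Append the chunk queries to the decoder's context stream,
        # then read actions off the query positions.
        batch = context.shape[0]
        queries = self.chunk_queries.expand(batch, -1, -1)
        hidden = self.transformer(torch.cat([context, queries], dim=1))
        return self.to_action(hidden[:, -chunk_len:, :])

expert = ActionExpert()
context = torch.randn(1, 77, d_model)  # fused stream from the sketch above
chunk = expert(context)
print(chunk.shape)  # torch.Size([1, 8, 6]): 8 future steps, 6 DoF each
```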

While SmolVLA primarily targets robotics professionals, its open weights, datasets, and training recipes can be accessed for reproduction or further development. Robotics enthusiasts with access to equipment such as robotic arms can also use these resources to experiment with real-time robotic workflows.
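For those who want to try the released checkpoint, the snippet below follows the usage published alongside the lerobot integration; the import path and checkpoint name reflect the release at the time of writing and may change in later lerobot versions.

```python
# Fetch the open SmolVLA weights from the Hugging Face Hub via lerobot.
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
```

Running the policy then follows the same perceive-interpret-act loop sketched earlier in this article.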
