Last week, Hugging Face unveiled two new versions of its SmolVLM vision language models. These artificial intelligence (AI) models come in 256-million and 500-million-parameter sizes, with the smaller variant touted as the world’s smallest vision language model. The new offerings aim to maintain the efficiency of the existing two-billion-parameter model while being significantly smaller. According to the company, the new models can run locally on constrained devices and consumer laptops, and may even allow for browser-based inference.
Hugging Face Unveils Compact SmolVLM AI Models
In a recent blog post, Hugging Face announced the launch of the SmolVLM-256M and SmolVLM-500M models, which join the already available two-billion-parameter model. The release comprises two base models and a pair of instruction-fine-tuned models, one of each at both parameter sizes.
According to the company, these models can be used with the Transformers library, MLX (Apple’s machine learning framework), and the Open Neural Network Exchange (ONNX) runtime, allowing developers to build on top of the base models. Notably, the models are open-source and available under the Apache 2.0 license, which permits both personal and commercial use.
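To illustrate what that integration looks like in practice, the sketch below loads the 256M instruction-tuned checkpoint with the Transformers library and captions a single image. The Hub id, file name, and generation settings are assumptions for illustration, not details taken from the announcement.

```python
# Minimal sketch: captioning one image with SmolVLM-256M via Transformers.
# The checkpoint id "HuggingFaceTB/SmolVLM-256M-Instruct" and the prompt are assumptions.
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# Build a chat-style prompt containing one image placeholder and a text instruction.
image = Image.open("photo.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Generate and decode the caption.
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```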
Hugging Face’s objective with these new AI models is to deliver multimodal capabilities focused on computer vision for portable devices. The 256-million-parameter model, for example, requires less than 1GB of GPU memory and can process 16 images per second at a batch size of 64 while using 15GB of RAM.
Andrés Marafioti, a machine learning research engineer at Hugging Face, mentioned in an interview with VentureBeat, “For a mid-sized company processing 1 million images monthly, this translates to substantial annual savings in compute costs.”
To achieve the reduced size of these AI models, the researchers switched the vision encoder from the 400M-parameter SigLIP used previously to a 93M-parameter SigLIP base variant, and also optimized the tokenization process. The new vision models encode images at a rate of 4096 pixels per token, compared with 1820 pixels per token in the two-billion-parameter model.
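To put those encoding rates in perspective, the back-of-the-envelope arithmetic below uses only the ratios quoted above; the 512×512 example image size is an arbitrary assumption.

```python
# Rough illustration of the quoted pixels-per-token rates; the 512x512 image is arbitrary.
pixels = 512 * 512               # 262,144 pixels in the example image
tokens_new = pixels / 4096       # new models: ~64 visual tokens
tokens_old = pixels / 1820       # 2B model:  ~144 visual tokens
print(round(tokens_new), round(tokens_old))
```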
While the smaller models are slightly less capable than the 2B variant, Hugging Face says it has worked to keep this performance gap small. The company reports that the 256M model can handle tasks such as image and short-video captioning, document question-answering, and basic visual reasoning.
Developers can use Transformers and MLX for inference and fine-tuning of the new models, reusing the existing SmolVLM code. The models are also available on Hugging Face’s platform.
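For Apple-silicon machines, inference can also run through the community mlx-vlm package. The sketch below is an assumption about that package’s high-level API: the package name is real, but the exact helper names, argument order, and checkpoint id shown here may differ between versions and are not taken from Hugging Face’s release.

```python
# Hypothetical sketch using the mlx-vlm package on Apple silicon (pip install mlx-vlm).
# The load/generate helpers and their argument order are assumptions and may vary by version.
from mlx_vlm import load, generate

model, processor = load("HuggingFaceTB/SmolVLM-256M-Instruct")  # assumed checkpoint name
answer = generate(model, processor, prompt="What does this document say?", image="page.png")
print(answer)
```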