Last week, Hugging Face unveiled two new variants of its SmolVLM vision language models. Available in 256 million and 500 million parameter sizes, the 256 million parameter model is described by the company as the world's smallest vision language model. These new versions aim to retain much of the performance of the existing two-billion-parameter model while significantly cutting down on size. Hugging Face emphasized that the models can run locally on devices with limited resources, including consumer laptops, and potentially even directly in the browser.
Hugging Face Introduces Smaller SmolVLM AI Models
According to a blog post from the company, the new SmolVLM-256M and SmolVLM-500M models complement the existing two-billion-parameter model. The release includes two base models and two instruction fine-tuned models, one of each at both parameter sizes.
The company reported that these models can be loaded directly with libraries such as Transformers, Apple's MLX framework, and Open Neural Network Exchange (ONNX), enabling developers to build on the base models in their own projects. Importantly, the models are open source and licensed under Apache 2.0, allowing both personal and commercial use.
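As a rough illustration of the Transformers route, the sketch below loads the 256M instruct checkpoint and captions a single image. It assumes the HuggingFaceTB/SmolVLM-256M-Instruct repository name, a local file photo.jpg, and a machine with enough memory; treat it as a minimal starting point rather than a definitive recipe.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed repository name for the 256M instruction-tuned checkpoint.
MODEL_ID = "HuggingFaceTB/SmolVLM-256M-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Any local image; "photo.jpg" is a placeholder.
image = Image.open("photo.jpg")

# Build a chat-style prompt containing one image and one text turn.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image briefly."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Generate a short caption and print the decoded output.
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```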
With the introduction of these AI models, Hugging Face seeks to bring multimodal capabilities focused on computer vision to portable devices. The 256 million parameter model, for example, needs less than 1GB of GPU memory and 15GB of RAM to process 16 images per second at a batch size of 64.
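A quick back-of-the-envelope check (not from the announcement) shows why the sub-1GB figure is plausible: at bfloat16 precision, the weights alone of a 256 million parameter model occupy roughly half a gigabyte, leaving headroom for activations and the KV cache.

```python
# Illustrative arithmetic only: weight memory for a 256M-parameter model in bfloat16.
params = 256e6
bytes_per_param = 2  # bfloat16 uses 2 bytes per parameter
weights_gb = params * bytes_per_param / 1e9
print(f"Approximate weight memory: {weights_gb:.2f} GB")  # ~0.51 GB
```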
Andrés Marafioti, a machine learning research engineer at Hugging Face, shared with VentureBeat that for a mid-sized company processing one million images monthly, this could lead to considerable annual savings in compute expenses.
To achieve a more compact model size, the researchers swapped the earlier SigLIP 400M vision encoder for a 93M-parameter SigLIP base patch encoder and optimized the tokenization process. The new models encode images at 4096 pixels per token, a notable increase over the 1820 pixels per token of the two-billion-parameter model.
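To make the compression ratio concrete, the illustrative arithmetic below compares how many image tokens a single input would produce at each ratio; the 512x512 resolution is an assumption chosen only for the example.

```python
# Hypothetical example: image tokens produced for a 512x512 input at each ratio.
pixels = 512 * 512

tokens_new = pixels / 4096  # new models: 4096 pixels per token
tokens_old = pixels / 1820  # 2B model: 1820 pixels per token

print(f"New models: {tokens_new:.0f} tokens per 512x512 image")   # 64 tokens
print(f"2B model:   {tokens_old:.0f} tokens per 512x512 image")   # ~144 tokens
```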
Although the smaller models may slightly lag behind the 2B model in performance, Hugging Face indicated that the trade-offs have been kept minimal. The company noted that the 256M variant is suited to tasks such as captioning images or short videos, answering questions about documents, and basic visual reasoning.
Developers can use Transformers and MLX for inference and fine-tuning with the existing SmolVLM code, without modification. The models are also available on the Hugging Face platform.