Last week, Hugging Face, a prominent platform for artificial intelligence (AI) and machine learning (ML), unveiled a new AI model designed for visual tasks. Named SmolVLM, short for small vision language model, the compact model prioritizes efficiency. The company asserts that its small size and strong performance make it well suited to businesses and AI enthusiasts seeking advanced capabilities without heavy infrastructure investment. Hugging Face has also released SmolVLM as open source under the Apache 2.0 license, making it suitable for both personal and commercial applications.
Hugging Face Introduces SmolVLM
In a detailed blog post, Hugging Face described the new open-source vision model, touting it as “state-of-the-art” due to its efficient memory usage and rapid inference capabilities. The company emphasized the growing trend among AI developers to downscale models for improved efficiency and cost-effectiveness, highlighting the practical advantages of a smaller vision model.
Small vision model ecosystem
Photo Credit: Hugging Face
The SmolVLM series consists of three model variants, each with two billion parameters. The first, SmolVLM-Base, serves as the standard model. The second, SmolVLM-Synthetic, is a fine-tuned version trained on synthetic data (data generated by AI or computer simulations). Finally, SmolVLM-Instruct is tailored for building user-facing applications.
In terms of technical specifications, the vision model runs in just 5.02GB of GPU RAM, dramatically less than competitors such as Qwen2-VL 2B, which requires 13.7GB, and InternVL2 2B, which needs 10.52GB. This efficiency underpins Hugging Face's claim that the model can run natively on laptops.
SmolVLM is designed to process sequences of text and images in any order and can analyze these inputs to generate responses to user queries. The model encodes each 384 x 384 pixel image into 81 visual tokens; Hugging Face claims that a text prompt plus a single image fits within 1,200 tokens, a significant reduction compared to the roughly 16,000 tokens Qwen2-VL requires.
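The arithmetic behind that token budget can be illustrated with a short back-of-the-envelope sketch. Note the patch grid size and the 9x pixel-shuffle compression below are assumptions drawn from Hugging Face's public description of the architecture, not figures stated in this article:

```python
# Back-of-the-envelope sketch of SmolVLM's image-token budget.
# Assumptions (not from the article): the SigLIP-style vision encoder
# splits a 384 x 384 input into a 27 x 27 grid of patch embeddings,
# and a pixel-shuffle step merges every 3 x 3 block of embeddings
# into one token, compressing the sequence 9x.

GRID = 27                    # patches per side for a 384 x 384 input (assumed)
SHUFFLE = 3                  # pixel-shuffle factor (assumed)

patch_tokens = GRID * GRID                       # 729 embeddings before compression
image_tokens = patch_tokens // (SHUFFLE * SHUFFLE)
print(image_tokens)                              # 81 visual tokens per image

# Within the claimed ~1,200-token budget for prompt + image,
# a single image leaves ample room for the text prompt:
budget = 1200
text_room = budget - image_tokens
print(text_room)                                 # 1119 tokens left for text
```

Under these assumptions the numbers line up with the article's figures: 729 patch embeddings compress to the stated 81 visual tokens, leaving most of the 1,200-token budget for text.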
Given these features, Hugging Face positions SmolVLM as an accessible option for smaller businesses and AI enthusiasts, since it can be deployed on local systems without extensive hardware upgrades. Enterprises can run the model for both text- and image-based tasks without incurring substantial costs.