Alibaba’s Qwen research team has unveiled a new open-source artificial intelligence (AI) model, named QVQ-72B, which focuses on vision-based reasoning. The model is designed to analyse visual information from images and comprehend its underlying context. Alongside the release, Alibaba published benchmark scores showing that QVQ-72B edged past OpenAI’s o1 model on one evaluation. The release adds to Alibaba’s growing portfolio of open-source AI models, which includes QwQ-32B and Marco-o1, both of which prioritise reasoning capabilities.
Launch of Alibaba’s Vision-Based QVQ-72B AI Model
In a listing on Hugging Face, the Qwen team describes QVQ-72B as an experimental research model and highlights its advanced visual reasoning. The model brings together two capabilities that are usually found in separate classes of models, combining them to strengthen its analytical capacity.
Numerous vision-based AI models already exist; they typically incorporate an image encoder that interprets visual data and its context. In contrast, reasoning-oriented models such as o1 and QwQ-32B rely on test-time compute scaling, spending extra processing time on each query. This lets a model break a problem down step by step, evaluate its intermediate outputs, and make corrections with the help of a verifying system.
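The generate-verify-correct pattern described above can be sketched as a simple loop. This is an illustrative toy, not Qwen's or OpenAI's actual implementation: the `propose` and `verify` functions below are hypothetical stand-ins for a model's answer generator and its verifier.

```python
def propose(x, attempt):
    # Stand-in "model": deliberately wrong on the first attempt
    # (adds instead of squares), corrected on later attempts.
    return x * x if attempt > 0 else x + x

def verify(x, answer):
    # Stand-in "verifier": checks a candidate answer against a known rule.
    return answer == x * x

def reason(x, max_attempts=3):
    # Test-time compute scaling in miniature: spend extra attempts
    # proposing, checking, and revising before committing to an answer.
    for attempt in range(max_attempts):
        answer = propose(x, attempt)
        if verify(x, answer):
            return answer  # verified output
    return None  # every attempt failed verification

print(reason(3))  # first attempt (3 + 3 = 6) fails; the retry yields 9
```

In a real reasoning model, both the proposer and the verifier are learned components, and "more attempts" translates into longer chains of thought rather than a fixed retry count.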
QVQ-72B combines both approaches, enabling it to interpret visual information while answering intricate questions through step-by-step reasoning. The Qwen team reports that this pairing yields marked improvements on benchmarks.
In internal evaluations, the Qwen team reported that QVQ-72B achieved a score of 71.4 percent on the MathVista (mini) benchmark, surpassing the o1 model’s score of 71.0 percent. The model also recorded a score of 70.3 percent on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark.
Despite these advancements, the model is not without limitations. The Qwen team acknowledged that QVQ-72B occasionally exhibits code-switching, transitioning between languages in unexpected ways. The model can also become trapped in recursive reasoning loops, which may degrade the accuracy of its outputs.