On Wednesday, Alibaba announced the launch of Wan 2.1, a new family of open-source artificial intelligence (AI) video generation models designed for use in both academic and commercial settings. The Chinese e-commerce giant has released the models, first unveiled in January, in several versions with different parameter counts. The company claims Wan 2.1 can produce highly realistic video content, and the models are currently available on the AI and machine learning (ML) platform Hugging Face.
Alibaba Unveils Wan 2.1 Video Generation Models
The newly released video AI models can be accessed via the Wan team’s Hugging Face page, which also documents the broader Wan 2.1 model suite. Four primary models are available: T2V-1.3B, T2V-14B, I2V-14B-720P, and I2V-14B-480P, where “T2V” denotes text-to-video and “I2V” denotes image-to-video functionality.
The smallest variant, Wan 2.1 T2V-1.3B, is designed to run on consumer-grade GPUs with at least 8.19GB of VRAM. According to the company’s announcement, this model can generate a five-second video at 480p resolution on an Nvidia RTX 4090 in approximately four minutes.
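The figures above imply a rough throughput that is easy to work out. A back-of-the-envelope sketch (using only the numbers reported by Alibaba, not independent benchmarks):

```python
# Back-of-the-envelope throughput from the reported figures:
# a 5-second 480p clip takes roughly 4 minutes on an RTX 4090.
clip_seconds = 5    # length of the generated video
wall_minutes = 4    # reported generation time

# Seconds of video produced per minute of compute
throughput = clip_seconds / wall_minutes
print(f"{throughput:.2f} s of video per minute of compute")  # 1.25

# Real-time factor: how much slower than playback speed generation runs
realtime_factor = (wall_minutes * 60) / clip_seconds
print(f"~{realtime_factor:.0f}x slower than real time")  # ~48x
```

In other words, at the reported speed the smallest model generates video at roughly one forty-eighth of playback speed on a high-end consumer GPU.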
Although the Wan 2.1 suite is optimized primarily for video generation, the underlying models are also capable of image generation, video-to-audio generation, and video editing. Note, however, that the open-source models released so far do not support these more advanced features. For video generation, the models accept text prompts in both Chinese and English, as well as image inputs.
In terms of architecture, the research team revealed that the Wan 2.1 models use a diffusion transformer framework, enhanced with novel variational autoencoders (VAEs) and various training strategies.
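At the core of any diffusion model is a forward process that gradually corrupts clean data with noise, which the network then learns to invert. The sketch below illustrates that idea in a minimal way; it is a generic denoising-diffusion illustration, not Wan 2.1’s actual implementation or noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, alpha_bar, rng):
    """Noise a clean sample x0 to some timestep, where alpha_bar in (0, 1]
    is the cumulative signal-retention coefficient at that timestep."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

# A toy "video" latent: (frames, height, width) of dummy data.
x0 = rng.standard_normal((8, 4, 4))

# Early timestep: mostly signal. Late timestep: mostly noise.
x_early, _ = add_noise(x0, alpha_bar=0.99, rng=rng)
x_late, _ = add_noise(x0, alpha_bar=0.01, rng=rng)
```

The transformer in a diffusion transformer is trained to predict the noise `eps` given the noisy latent and the timestep; generation then runs the process in reverse, denoising step by step from pure noise.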
A standout feature of these models is a novel 3D causal VAE architecture, dubbed Wan-VAE, which improves spatiotemporal compression while reducing memory consumption. The autoencoder can encode and decode videos of unbounded length at 1080p resolution without losing crucial temporal information, which enables more consistent video generation.
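The “causal” part means that the encoding of a given frame depends only on that frame and earlier ones, never on future frames, which is what makes streaming over unbounded-length video possible. A minimal one-dimensional sketch of temporal causality (illustrative only; Wan-VAE’s real architecture is a full 3D convolutional autoencoder):

```python
import numpy as np

def causal_temporal_conv(frames, kernel):
    """Causal convolution along time.
    frames: (T,) per-frame features; kernel: (K,) filter taps.
    Output[t] depends only on frames[t-K+1 .. t] (zero-padded past)."""
    k = len(kernel)
    # Pad only on the past side, never the future side.
    padded = np.concatenate([np.zeros(k - 1), frames])
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(frames))])

frames = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, 0.5])  # average of current and previous frame

out = causal_temporal_conv(frames, kernel)
print(out)  # [0.5, 1.5, 2.5, 3.5]
```

Because the padding sits entirely in the past, changing a later frame never alters the output for earlier frames, so arbitrarily long videos can be processed chunk by chunk without the encoder needing to see the whole clip at once.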
Based on internal benchmarks, Alibaba claims that the Wan 2.1 models outperform OpenAI’s Sora model in consistency, scene generation quality, single-object accuracy, and spatial positioning.
The models are distributed under the Apache 2.0 license. While this permits unrestricted use for academic and research endeavors, there are several limitations associated with commercial applications.