Last week, ByteDance introduced a new artificial intelligence (AI) model named Bagel, a vision-language model (VLM) that can understand, generate, and edit images. The Beijing-based firm has open-sourced the model on well-known AI repositories such as GitHub and Hugging Face. According to the company, Bagel offers advanced capabilities for visual manipulation, multiview synthesis, and world navigation, and it surpasses existing open-source VLMs in image editing.
Bagel Surpasses Gemini-2-exp in Image Editing
A listing on GitHub provides additional insights into the Bagel AI model, including its architecture and datasets, though specific details about the post-training processes were not disclosed. The model is currently released under a permissive Apache 2.0 license, facilitating both academic and commercial applications.
Bagel is engineered as a multimodal AI model that accepts both text and image inputs. It comprises 14 billion parameters, of which seven billion are active at any given time. ByteDance says the model was trained on a large-scale interleaved multimodal dataset, in which data types such as text and images are mixed within the same training sequences. This approach enables Bagel to understand data in context rather than in isolation.
This integrative methodology allows the foundation model to draw connections between different types of information. For example, when provided with images and corresponding captions, Bagel can gain a more comprehensive understanding of what the text signifies in relation to the visual content. This holistic comprehension is expected to enhance the quality of its output, according to ByteDance.
The company further claims that Bagel’s image editing capabilities surpass those of existing open-source VLMs. It can handle intricate tasks such as altering the emotional tone of an image, as well as adding, removing, or replacing elements within it. The model also supports style transfer and free-form edits, and ByteDance says these strengths translate into markedly better outputs on world modeling tasks.
World modeling entails an AI system’s conceptual grasp of the visual mechanics of the real world, encompassing the relationships among objects and the impact of various physical elements like light, wind, and gravity.
Internal assessments by ByteDance indicate that Bagel outperforms Qwen2.5-VL-7B, a model of similar size, on image understanding tasks. It is reported to score higher than Janus-Pro-7B and Flux-1-dev on image generation benchmarks, and to exceed Gemini-2-exp in image editing on GEdit-Bench.
For those who want to experiment with the model without a local installation, ByteDance has set up a cloud-based demo on Hugging Face where users can test Bagel’s image analysis, generation, and editing capabilities.