ByteDance Launches Bagel: A Game-Changer in AI Imaging

Last week, ByteDance unveiled a new multimodal artificial intelligence (AI) model called Bagel. This visual language model (VLM) is designed to understand, generate, and edit images, and the company has released it as open source on GitHub and Hugging Face. According to ByteDance, Bagel’s capabilities include free-form visual manipulation, multiview synthesis, and world navigation, which the company says make it better at image editing than other open-source VLMs.

ByteDance’s Bagel Outperforms Gemini-2-exp in Image Editing

A detailed listing on GitHub provides additional insights into the Bagel AI model, including its weights and datasets. However, the company has not disclosed specifics about its post-training processes and architecture. The model is released under the permissive Apache 2.0 license, which allows both academic and commercial use.

Bagel accepts both text and images as input. It has 14 billion parameters in total, of which seven billion are active at any given time. ByteDance says the model was trained on large-scale interleaved multimodal data, meaning text and images were mixed together during training. This interleaved approach is intended to give the model a better contextual understanding of how the two types of data relate.

For instance, when provided with images alongside their corresponding captions, Bagel enhances its comprehension of visual representations and the semantics of the related text. This integrated learning approach is expected to improve the accuracy and efficiency of the outputs, according to the company.
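To make the idea of interleaved data concrete, here is a minimal sketch of what a single training sample mixing text and images could look like. It is purely illustrative: ByteDance has not published its data format in this article, and every name below (`TextSegment`, `ImageSegment`, `modality_pattern`, the file names) is a hypothetical chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    path: str  # hypothetical file reference, not a real dataset entry


Segment = Union[TextSegment, ImageSegment]


def modality_pattern(sample: List[Segment]) -> str:
    """Return a string such as 'TITI' showing how text (T) and images (I) alternate."""
    return "".join("T" if isinstance(s, TextSegment) else "I" for s in sample)


# One interleaved sample: captions and images appear in a single ordered sequence,
# rather than as isolated (image, caption) pairs.
sample = [
    TextSegment("A street before the rain:"),
    ImageSegment("street_dry.jpg"),
    TextSegment("The same street after the rain:"),
    ImageSegment("street_wet.jpg"),
]

print(modality_pattern(sample))  # prints "TITI"
```

The point of the sketch is only the ordering: because text and images share one sequence, a model trained on such samples sees each image in the context of the text around it.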

Furthermore, ByteDance contends that Bagel’s image editing capabilities surpass those of prior open-source VLMs. It can handle intricate tasks such as adding emotions to images, removing or altering elements, style transfer, and free-form edits. The company says these skills also allow Bagel to produce significantly better outputs in world-modelling.

World-modelling refers to an AI’s internal representation of how the real world visually operates, encompassing the relationships between various objects, the physical context, and the impacts of elements like light, wind, rain, and gravity.

According to ByteDance’s internal evaluations, Bagel outperformed Qwen2.5-VL-7B, a model of comparable size, in image comprehension tasks. It also reportedly surpassed Janus-Pro-7B and Flux-1-dev in image generation benchmarks and outperformed Gemini-2-exp in image editing on GEdit-Bench.
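For a rough sense of scale, the parameter counts reported above can be turned into a back-of-the-envelope estimate. The sketch below uses only the 14 billion and seven billion figures from the article. The 2 bytes per parameter (16-bit weights) is an assumption, not something ByteDance states here, and the estimate covers the weights alone, not activations or other runtime memory.

```python
# Figures reported in the article.
TOTAL_PARAMS = 14_000_000_000
ACTIVE_PARAMS = 7_000_000_000

# Assumption: 16-bit weights, i.e. 2 bytes per parameter (not stated in the article).
BYTES_PER_PARAM = 2


def weight_memory_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"active fraction: {active_fraction:.0%}")  # prints "active fraction: 50%"
print(f"weights at 16-bit: {weight_memory_gb(TOTAL_PARAMS):.0f} GB")  # prints "weights at 16-bit: 28 GB"
```

Under that assumption, all 14 billion parameters have to be stored even though only half are active at a time.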

For those who want to try the model without installing it, ByteDance has set up a cloud-based interface on Hugging Face, where users can test its image analysis, generation, and editing capabilities.
