Overview
BAGEL is an open-source unified multimodal foundation model released by ByteDance-Seed. It supports joint training and evaluation on image/video and text tasks, and the repository provides training, evaluation, and deployment scripts, official examples, and pretrained weights. The project is suitable both as a research baseline and for engineering prototypes.
Key features
- Unified multimodal pretraining and fine-tuning pipelines covering both understanding and generation.
- Provides training/evaluation scripts, pretrained weights, and model exports, with Hugging Face and Gradio integrations.
- Reports competitive results on multiple multimodal benchmarks, with reproduction guides.
Use cases
- Multimodal benchmarks, model comparisons, and academic reproductions.
- Text-guided image generation and image editing applications.
- Engineering prototypes and demos (official demo and Hugging Face Space available).
Technical details
- Implemented in PyTorch, with architecture choices such as Mixture-of-Transformer-Experts (MoT) to increase capacity without a proportional cost in compute.
- Supports large-scale training, quantization, and inference optimizations through the provided training and evaluation toolchains.
- Rich set of model and data processing scripts for easy extension and downstream integration.
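The MoT idea of giving each modality its own expert parameters can be illustrated with a toy routing function. This is a deliberately simplified sketch, not BAGEL's actual implementation: the expert names, hidden size, and plain matrix-multiply "experts" are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative hidden size

# One weight matrix per modality "expert" (hypothetical names).
experts = {
    "text": rng.normal(size=(d, d)),
    "image": rng.normal(size=(d, d)),
}

def mot_ffn(tokens: np.ndarray, modalities: list[str]) -> np.ndarray:
    """Route each token to the expert matching its modality tag.

    Unlike learned-gating MoE, the routing here is deterministic:
    a token's modality decides which parameters process it, while
    all tokens still live in one shared sequence.
    """
    out = np.empty_like(tokens)
    for i, m in enumerate(modalities):
        out[i] = experts[m] @ tokens[i]
    return out

tokens = rng.normal(size=(4, d))
modalities = ["text", "image", "image", "text"]
mixed = mot_ffn(tokens, modalities)
```

The design point this illustrates: capacity grows with the number of modality experts, but each token is still processed by only one expert's weights, so per-token compute stays flat.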