Detailed Introduction
VibeVoiceFusion is a full-stack web application for multi-speaker voice synthesis built on the VibeVoice architecture (autoregressive + diffusion). It uses a Qwen backbone with acoustic and semantic encoders to process reference audio, generates speech tokens autoregressively, and refines waveforms with a DPM-Solver diffusion head. The project offers a web UI and CLI, bilingual interface, and project management features suitable for local deployment and research experiments.
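The synthesis flow described above (encode reference audio, generate tokens autoregressively, refine with diffusion) can be sketched as a minimal pipeline. This is an illustrative stand-in only: the function and class names (`encode_reference`, `generate_speech_tokens`, `diffuse_waveform`, `SpeakerEmbedding`) are hypothetical and do not reflect the project's actual API, and the bodies are dummy computations that just show how data moves between the stages.

```python
from dataclasses import dataclass

@dataclass
class SpeakerEmbedding:
    acoustic: list[float]   # would come from the acoustic encoder
    semantic: list[float]   # would come from the semantic encoder

def encode_reference(reference_audio: list[float]) -> SpeakerEmbedding:
    """Stand-in for the acoustic + semantic encoders over a reference clip."""
    mean = sum(reference_audio) / len(reference_audio)
    return SpeakerEmbedding(acoustic=[mean], semantic=[mean * 0.5])

def generate_speech_tokens(text: str, speaker: SpeakerEmbedding) -> list[int]:
    """Stand-in for autoregressive speech-token generation by the backbone."""
    return [ord(c) % 256 for c in text]  # one dummy token per character

def diffuse_waveform(tokens: list[int], steps: int = 4) -> list[float]:
    """Stand-in for diffusion refinement: iterate a 'denoising' update."""
    wave = [float(t) for t in tokens]
    for _ in range(steps):
        wave = [x * 0.5 for x in wave]  # dummy refinement step
    return wave

speaker = encode_reference([0.1, 0.2, 0.3])
tokens = generate_speech_tokens("Hi", speaker)
audio = diffuse_waveform(tokens)
```

The point is only the shape of the data flow: a per-speaker embedding conditions token generation, and the diffusion head turns discrete tokens into a waveform.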
Main Features
- Complete web application: project and speaker management, dialog editor, generation history and live preview.
- Multi-speaker synthesis: supports dialogs with two to four or more speakers and voice cloning from reference samples.
- VRAM optimizations: dynamic layer offloading and Float8 quantization cut memory usage roughly in half.
- Deployment-ready: Docker multi-stage builds, automatic model download and build scripts for local installation.
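The layer-offloading feature above can be illustrated with a toy simulation: keep only a sliding window of transformer layers resident on the GPU and park the rest in CPU RAM, evicting the oldest layer as the forward pass advances. The profile names match the ones listed later (Balanced/Aggressive/Extreme), but the window sizes and the `Layer` bookkeeping here are hypothetical, not the project's actual settings or code.

```python
# Hypothetical window sizes: number of layers kept resident on the GPU.
PROFILES = {"Balanced": 16, "Aggressive": 8, "Extreme": 2}

class Layer:
    def __init__(self, idx: int):
        self.idx = idx
        self.device = "cpu"  # all layers start offloaded to CPU RAM

def run_forward(layers: list[Layer], window: int) -> None:
    """Walk the layers in order, loading each to the GPU just before use
    and evicting the layer that falls out of the sliding window."""
    for i, layer in enumerate(layers):
        layer.device = "cuda"         # load the layer we are about to run
        evict = i - window
        if evict >= 0:
            layers[evict].device = "cpu"  # evict the oldest resident layer

layers = [Layer(i) for i in range(24)]
run_forward(layers, PROFILES["Extreme"])
resident = sum(1 for layer in layers if layer.device == "cuda")
```

With the "Extreme" profile, at most two layers ever occupy GPU memory at once, which is the trade the real feature makes: less VRAM in exchange for PCIe transfer time on every pass.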
Use Cases
Suitable for podcast production, dubbing, dialog content creation, and research prototypes. Creators can generate multi-speaker audio locally or on private servers; teams can manage sessions and export WAV files. Researchers can compare performance and audio quality across precision and offloading strategies via the CLI.
Technical Features
- Model architecture: Qwen backbone + VAE acoustic tokenizer + diffusion generation head.
- Memory strategies: dynamic layer offloading (Balanced/Aggressive/Extreme) and Float8 (E4M3FN) quantization to cut VRAM roughly in half.
- Compatibility: backend in Python/Flask with PyTorch; frontend in Next.js and TailwindCSS; supports CUDA, MPS, and CPU devices.
- Responsible use: the project targets research and development; obtain explicit consent from speakers before cloning their voices to avoid misuse.
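The "roughly in half" claim for Float8 quantization follows directly from byte widths: bfloat16 weights take 2 bytes per parameter, while Float8 (E4M3FN) takes 1. A back-of-the-envelope estimate, where the 7B parameter count is purely illustrative and not the actual size of the model's Qwen backbone:

```python
# Illustrative parameter count; the real backbone size may differ.
params = 7_000_000_000

bf16_gib = params * 2 / 2**30   # bfloat16: 2 bytes per parameter
fp8_gib = params * 1 / 2**30    # Float8 E4M3FN: 1 byte per parameter

print(f"bf16: {bf16_gib:.1f} GiB, fp8: {fp8_gib:.1f} GiB")
```

Weight storage halves exactly; total VRAM savings are "roughly" half because activations, the KV cache, and CUDA overhead are not quantized the same way.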