Overview
FireRedTTS-2 is a long-form streaming TTS system built for multi-speaker dialogue generation, focusing on stability, reliable speaker switching, and context-aware prosody. The project releases pretrained checkpoints and demo pages, and supports multilingual and zero-shot voice cloning scenarios suitable for podcasts, chatbots, and large-scale speech data synthesis.
Key Features
- Long conversational speech generation: supports multi-minute dialogues and reliable multi-speaker switching.
- Multilingual and zero-shot cloning: supports English, Chinese, Japanese, Korean, French, German, and Russian, as well as cross-lingual voice cloning.
- Ultra-low-latency streaming: uses a 12.5 Hz streaming speech tokenizer and a dual-transformer architecture to reduce first-packet latency.
- Open and reproducible: code and pretrained models are available on GitHub and Hugging Face, with example scripts and a Gradio demo.
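To make the 12.5 Hz tokenizer rate concrete, the arithmetic below shows how many token frames a given stretch of audio corresponds to. This is only a sketch based on the stated frame rate; the helper names are hypothetical and are not part of the project's code, and it says nothing about how many codebooks or tokens each frame carries.

```python
import math

# Frame rate of the streaming speech tokenizer, as stated in the feature list.
FRAME_RATE_HZ = 12.5


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Audio duration covered by one token frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for_duration(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of token frames needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    print(frame_duration_ms())       # one frame spans 80 ms of audio
    print(frames_for_duration(180))  # a 3-minute dialogue spans 2250 frames
```

A low frame rate keeps the sequence short for multi-minute dialogues, which is one plausible reason it helps with long-form generation.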
Use Cases
- Podcast and long-form dialogue generation and editing.
- Multi-role conversational synthesis for customer service, role-play, or virtual hosts.
- Generating large-scale synthetic speech datasets to improve ASR and dialogue systems.
Technical Characteristics
- PyTorch implementation with training and inference code, example scripts, and a Gradio demo.
- Dual-transformer architecture with text–speech interleaved sequences to improve context awareness and temporal consistency.
- Supports downloading pretrained weights via Git LFS and running example scripts to generate audio.
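The idea of a text–speech interleaved sequence can be illustrated with a toy example. The sketch below is not the project's actual data format or API: the `Turn` class, the `interleave_dialogue` function, the speaker labels, and the integer token IDs are all hypothetical, chosen only to show how each speaker turn's text tokens can be followed by that turn's speech tokens so that later turns are generated with the full dialogue history in context.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    """One dialogue turn (hypothetical structure, for illustration only)."""
    speaker: str             # assumed speaker label, e.g. "S1"
    text_tokens: List[int]   # placeholder text token IDs
    speech_tokens: List[int] # placeholder speech token IDs


def interleave_dialogue(turns: List[Turn]) -> List[Tuple[str, str, int]]:
    """Flatten turns into one sequence: text of turn 1, speech of turn 1,
    text of turn 2, speech of turn 2, and so on."""
    sequence: List[Tuple[str, str, int]] = []
    for turn in turns:
        for tok in turn.text_tokens:
            sequence.append(("text", turn.speaker, tok))
        for tok in turn.speech_tokens:
            sequence.append(("speech", turn.speaker, tok))
    return sequence


dialogue = [
    Turn("S1", [11, 12], [901, 902, 903]),
    Turn("S2", [21], [911, 912]),
]
sequence = interleave_dialogue(dialogue)

if __name__ == "__main__":
    for kind, speaker, token in sequence:
        print(kind, speaker, token)
```

Because each turn's speech directly follows its text, a model trained on such sequences can condition the next speaker's prosody on everything said so far, which is the intuition behind the context awareness described above.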