FireRedTTS-2

A long-form streaming TTS system for multi-speaker dialogue generation, providing stable, natural speech with reliable speaker switching, multilingual support, and low-latency streaming output.

FireRedTeam · Since 2025-09-02

Overview

FireRedTTS-2 is a long-form streaming TTS system built for multi-speaker dialogue generation, with a focus on stability, reliable speaker switching, and context-aware prosody. The project releases pretrained checkpoints and demo pages. It supports multilingual synthesis and zero-shot voice cloning, which makes it suitable for podcasts, chatbots, and large-scale speech data synthesis.

Key Features

Long conversational speech generation: supports multi-minute dialogues and reliable multi-speaker switching.
Multilingual and zero-shot cloning: supports English, Chinese, Japanese, Korean, French, German, and Russian, as well as cross-lingual voice cloning.
Ultra-low-latency streaming: uses a 12.5 Hz streaming speech tokenizer and a dual-transformer architecture to reduce first-packet latency.
Open and reproducible: code and pretrained models are available on GitHub and Hugging Face, with example scripts and a Gradio demo.
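
To make the 12.5 Hz figure above concrete, the following sketch works out what that frame rate implies for frame duration and sequence length. Only the 12.5 Hz rate comes from the description; the function names are hypothetical, and the calculation counts frames only, since the number of tokens per frame is not stated here. First-packet latency also depends on model compute and decoding, so the frame duration is not the latency itself.

```python
import math

FRAME_RATE_HZ = 12.5  # tokenizer frame rate stated in the feature list


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Audio time covered by one tokenizer frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for_duration(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    # One frame covers 80 ms of audio at 12.5 Hz.
    print(frame_duration_ms())
    # A three-minute dialogue spans 2250 frames.
    print(frames_for_duration(180))
```

A low frame rate keeps the sequence short for multi-minute dialogues, which is one reason it suits long-form generation.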

Use Cases

Podcast and long-form dialogue generation and editing.
Multi-role conversational synthesis for customer service, role-play, or virtual hosts.
Generating large-scale synthetic speech datasets to improve ASR and dialogue systems.

Technical Characteristics

PyTorch implementation with training and inference code, example scripts, and a Gradio demo.
Dual-transformer architecture with text–speech interleaved sequences to improve context awareness and temporal consistency.
Pretrained weights can be downloaded via Git LFS, after which the example scripts can be run to generate audio.
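
The idea of a text–speech interleaved sequence mentioned above can be illustrated with a small sketch. This is not the project's code: the `Turn` class, the `interleave` function, the speaker labels, and the token values are all hypothetical, and the actual token layout used by FireRedTTS-2 may differ. The sketch only shows the general pattern in which each turn's text tokens are followed by that turn's speech tokens, so that later turns can condition on everything said before.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Turn:
    speaker: str                      # hypothetical speaker label, e.g. "S1"
    text_tokens: List[int]            # token IDs for the turn's text
    speech_tokens: List[int] = field(default_factory=list)  # empty if not yet generated


def interleave(turns: List[Turn]) -> List[Tuple[str, str, int]]:
    """Flatten dialogue turns into one sequence: text, then speech, per turn."""
    sequence: List[Tuple[str, str, int]] = []
    for turn in turns:
        sequence.extend(("text", turn.speaker, t) for t in turn.text_tokens)
        sequence.extend(("speech", turn.speaker, s) for s in turn.speech_tokens)
    return sequence


if __name__ == "__main__":
    history = [
        Turn("S1", text_tokens=[1, 2], speech_tokens=[10, 11, 12]),
        Turn("S2", text_tokens=[3]),  # speech for this turn is still to be generated
    ]
    for item in interleave(history):
        print(item)
```

In a layout like this, the model generating speech for the last turn sees the earlier speakers' text and audio tokens in order, which is the kind of context that speaker switching and prosody can draw on.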

Related Resources

nanoGPT

Dyad

GLM-TTS