FireRedTTS-2

A long-form streaming TTS system for multi-speaker dialogue generation, providing stable, natural speech with reliable speaker switching, multilingual support, and low-latency streaming output.

Overview

FireRedTTS-2 is a long-form streaming TTS system built for multi-speaker dialogue generation, with a focus on stability, reliable speaker switching, and context-aware prosody. The project releases pretrained checkpoints and demo pages, and it supports multilingual synthesis and zero-shot voice cloning, which suits podcasts, chatbots, and large-scale speech data synthesis.

Key Features

  • Long conversational speech generation: supports multi-minute dialogues and reliable multi-speaker switching.
  • Multilingual and zero-shot cloning: supports English, Chinese, Japanese, Korean, French, German, Russian, and cross-lingual voice cloning.
  • Ultra-low-latency streaming: uses a 12.5 Hz streaming speech tokenizer and a dual-transformer architecture to reduce first-packet latency.
  • Open and reproducible: code and pretrained models are available on GitHub and Hugging Face, with example scripts and a Gradio demo.
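
To make the 12.5 Hz figure concrete, the arithmetic below shows how much audio one tokenizer frame covers and how many frames a dialogue of a given length needs. This is a minimal sketch that uses only the frame rate stated above; the function names are hypothetical, and the number of tokens per frame (for example, the number of codebooks) is not given here, so the code counts frames rather than tokens.

```python
import math

FRAME_RATE_HZ = 12.5  # tokenizer frame rate stated in the feature list


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Audio duration covered by one tokenizer frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for_duration(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * frame_rate_hz)


one_frame_ms = frame_duration_ms()             # 80.0 ms of audio per frame
three_minute_frames = frames_for_duration(180)  # 2250 frames for 3 minutes
```

A low frame rate keeps the sequence short: a three-minute dialogue is 2250 frames, which is part of what makes multi-minute generation tractable for an autoregressive model.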

Use Cases

  • Podcast and long-form dialogue generation and editing.
  • Multi-role conversational synthesis for customer service, role-play, or virtual hosts.
  • Generating large-scale synthetic speech datasets to improve ASR and dialogue systems.

Technical Characteristics

  • PyTorch implementation with training and inference code, example scripts, and a Gradio demo.
  • Dual-transformer architecture with text–speech interleaved sequences to improve context awareness and temporal consistency.
  • Pretrained weights can be downloaded via Git LFS, and example scripts are provided to generate audio.
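
The text–speech interleaved layout mentioned above can be pictured as a single sequence in which each turn's text is followed by that turn's speech frames, so later turns are generated with the full dialogue history in view. The sketch below is illustrative only and is not taken from the FireRedTTS-2 code: the `Turn` class, the `interleave` function, the `[S1]`-style speaker labels, and the integer frame IDs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Turn:
    speaker: str              # hypothetical speaker label, e.g. "S1"
    text: str                 # what the speaker says in this turn
    speech_frames: List[int]  # placeholder IDs standing in for tokenizer output


Segment = Tuple[str, Union[str, List[int]]]


def interleave(turns: List[Turn]) -> List[Segment]:
    """Build one sequence that alternates text and speech, turn by turn."""
    sequence: List[Segment] = []
    for turn in turns:
        # The text segment carries the speaker label so the model can switch voices.
        sequence.append(("text", f"[{turn.speaker}] {turn.text}"))
        # The speech segment for the same turn follows immediately.
        sequence.append(("speech", turn.speech_frames))
    return sequence


history = [
    Turn("S1", "Hello there.", [11, 12, 13]),
    Turn("S2", "Hi!", [21, 22]),
]
sequence = interleave(history)
kinds = [kind for kind, _ in sequence]  # alternates "text", "speech", ...
```

Because each speech segment sits next to its own text, a streaming generator can emit audio for the current turn as soon as its text is appended, without waiting for the rest of the dialogue.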

Resource Info

  • Author: FireRedTeam
  • Added Date: 2025-09-22
  • Tags: TTS, OSS, Product