Detailed Introduction
Dia2 is an open-source text-to-speech (TTS) model and inference implementation from Nari Labs focused on streaming conversational audio. The model can begin generating audio after receiving the initial input tokens and supports conditioning on audio prefixes to maintain speaker consistency and contextual continuity in multi-turn interactions. The repository provides 1B and 2B model checkpoints, example scripts, and quickstart instructions for research and deployment.
Main Features
- Streaming generation: starts synthesis without waiting for the full text, reducing response latency.
- Conditional generation: supports audio-prefix conditioning for speaker consistency and smoother conversation flow.
- Multiple scales: model checkpoints at different sizes (1B, 2B) to balance quality and resource use.
- Open license: released under Apache-2.0, which permits research as well as commercial use.
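The streaming behavior in the first bullet can be illustrated with a minimal, self-contained sketch. This is not the actual Dia2 API; the class and method names below are hypothetical stand-ins. The point is the shape of the loop: an audio chunk is emitted as soon as its source tokens arrive, rather than after the full input has been read.

```python
from typing import Iterator, List


class ToyStreamingTTS:
    """Hypothetical stand-in for a streaming TTS model: it maps each text
    token to a fixed-length 'audio chunk' so the latency pattern is visible."""

    def synthesize_token(self, token: str) -> List[float]:
        # Placeholder for real synthesis: one short chunk of samples per token.
        return [0.0] * 4

    def stream(self, tokens: Iterator[str]) -> Iterator[List[float]]:
        # Incremental emission: each chunk is yielded as soon as its source
        # token is available, without waiting for the full sentence.
        for token in tokens:
            yield self.synthesize_token(token)


tts = ToyStreamingTTS()
chunks = list(tts.stream(iter(["hello", "streaming", "world"])))
# Three tokens in, three audio chunks out, produced one at a time.
```

In a real deployment the consumer would play each chunk as it arrives, which is what reduces perceived response latency compared with batch synthesis.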
Use Cases
- Real-time voice for conversational assistants and virtual characters, improving naturalness and responsiveness.
- Reply generation in voice-based dialog systems with multi-turn context handling.
- Research and teaching for TTS conditional generation, model comparison, and voice control experiments.
Technical Features
- Inference implementation based on Python and the uv runtime, compatible with Hugging Face checkpoints and CUDA acceleration (CUDA 12.8+ recommended).
- Generation length is limited by the context-step budget (around 2 minutes of audio); outputs include audio tokens, a waveform, and timestamps.
- Command-line examples and a Gradio demo are provided for quick verification and integration.
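The roughly 2-minute cap tied to context steps follows from simple arithmetic: maximum duration is the decoder step budget divided by the audio-token frame rate. The numbers below are assumptions chosen only to make the relationship concrete, not documented Dia2 parameters; check the repository for each checkpoint's actual values.

```python
# Assumed values for illustration only; the real context length and
# codec frame rate come from the model checkpoint's configuration.
CONTEXT_STEPS = 1500       # hypothetical decoder step budget
FRAMES_PER_SECOND = 12.5   # hypothetical audio-token frame rate

# Duration cap implied by the step budget.
max_seconds = CONTEXT_STEPS / FRAMES_PER_SECOND
print(max_seconds)  # 120.0 seconds, i.e. about 2 minutes
```

The same calculation run in reverse tells you how many context steps a given audio prefix consumes, which matters when conditioning on prior turns.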