IndexTTS

IndexTTS is a zero-shot text-to-speech system with precise control over speech duration and emotion, suited to audio-visual synchronization.

Author: IndexTTS

Added Date: 2025-09-09

Open Source Since: 2025-02-06

GitHub Demo

Introduction

IndexTTS is an industrial-grade, controllable, and efficient zero-shot text-to-speech system. It supports precise control of speech duration and disentangles emotion from timbre, which suits it to demanding audio-visual synchronization work.

Key Features

Two generation modes: precise duration control and free autoregressive generation
Independent control of emotion and timbre, supporting diverse styles
Three-stage training paradigm for improved stability and clarity
Soft instruction mechanism enables emotion control via text description
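
The difference between the two generation modes can be pictured with a toy decoding loop. This is only a sketch under stated assumptions, not IndexTTS's implementation: `toy_next_token`, the `STOP` id, and the idea of fixing the number of generated tokens are illustrative stand-ins, and a random generator replaces the real model.

```python
import random
from typing import List, Optional

STOP = 0  # hypothetical stop-token id used only in this sketch


def toy_next_token(history: List[int], rng: random.Random, allow_stop: bool) -> int:
    """Stand-in for a real model: random ids 1..99, with a 5% chance of STOP."""
    if allow_stop and rng.random() < 0.05:
        return STOP
    return rng.randint(1, 99)


def generate(rng: random.Random,
             target_len: Optional[int] = None,
             max_len: int = 500) -> List[int]:
    """Return a list of token ids.

    target_len=None -> free autoregressive mode: stop when STOP is sampled
                       (or when max_len is reached).
    target_len=N    -> duration-controlled mode: emit exactly N tokens,
                       so the output length is fixed in advance.
    """
    tokens: List[int] = []
    if target_len is not None:
        while len(tokens) < target_len:
            tokens.append(toy_next_token(tokens, rng, allow_stop=False))
        return tokens
    while len(tokens) < max_len:
        tok = toy_next_token(tokens, rng, allow_stop=True)
        if tok == STOP:
            break
        tokens.append(tok)
    return tokens


fixed = generate(random.Random(0), target_len=120)  # length known up front
free = generate(random.Random(0))                   # length decided by sampling
```

In the fixed mode the caller decides the output length, which is what makes synchronization with video possible. In the free mode the length emerges from the sampling process.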

Use Cases

Video dubbing and audio-visual synchronization
Intelligent voice assistants and personalized broadcasting
Multi-emotion speech generation and style transfer
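
For video dubbing, a duration-controlled mode implies that each subtitle slot has to be turned into a length target before synthesis. The sketch below shows that arithmetic only. The helper names and the rate of 25 tokens per second are placeholder assumptions, not values or APIs taken from IndexTTS.

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def token_budget(segment_seconds: float, tokens_per_second: float) -> int:
    """Convert a time slot into a token count, rounding down so the
    synthesized speech does not overrun the slot."""
    if segment_seconds <= 0 or tokens_per_second <= 0:
        raise ValueError("duration and token rate must be positive")
    return max(1, int(segment_seconds * tokens_per_second))


def budgets_for_cues(cues: List[Cue], tokens_per_second: float) -> List[Tuple[str, int]]:
    """Pair each subtitle text with the token budget of its time slot."""
    return [(text, token_budget(end - start, tokens_per_second))
            for start, end, text in cues]


# Assumed token rate; a real system would take this from its own codec settings.
ASSUMED_TOKENS_PER_SECOND = 25.0

cues = [(0.0, 2.0, "Hello there."), (2.5, 4.0, "Nice to meet you.")]
plan = budgets_for_cues(cues, ASSUMED_TOKENS_PER_SECOND)
```

Each entry in `plan` could then be passed to a synthesizer running in its fixed-length mode, one cue at a time.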

Technical Highlights

IndexTTS combines a duration control method with GPT latent representations and a soft instruction mechanism to improve flexibility and expressiveness. The model is open source, supports multiple datasets, and is reported to outperform mainstream zero-shot TTS systems.

