IndexTTS

Discover IndexTTS, a powerful zero-shot text-to-speech system with precise control over speech duration and emotion, perfect for audio-visual synchronization.

IndexTTS · Since 2025-02-06

Introduction

IndexTTS is an industrial-grade, controllable, and efficient zero-shot text-to-speech system. It supports precise control of speech duration and disentangled control of emotion and timbre, which suits it to demanding audio-visual synchronization tasks such as video dubbing.

Key Features

Two generation modes: precise duration control and free autoregressive generation
Independent control of emotion and timbre, supporting diverse styles
Three-stage training paradigm for improved stability and clarity
Soft instruction mechanism enables emotion control via text description
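
As a rough illustration of how the two generation modes might be chosen in a dubbing pipeline, the sketch below maps an optional target duration to a generation plan. It does not use the IndexTTS API: `GenerationPlan`, `plan_generation`, and the token rate `ASSUMED_TOKENS_PER_SECOND` are hypothetical names and values introduced only for this example, and the idea that duration is fixed through a token budget is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Hypothetical value: speech tokens produced per second of audio.
# Not taken from IndexTTS; chosen only for illustration.
ASSUMED_TOKENS_PER_SECOND = 25.0


@dataclass(frozen=True)
class GenerationPlan:
    mode: str                   # "duration_controlled" or "free"
    max_tokens: Optional[int]   # fixed token budget, or None when unconstrained


def plan_generation(
    target_seconds: Optional[float],
    tokens_per_second: float = ASSUMED_TOKENS_PER_SECOND,
) -> GenerationPlan:
    """Choose a generation mode from an optional target duration."""
    if target_seconds is None:
        # No timing constraint: let autoregressive generation stop on its own.
        return GenerationPlan(mode="free", max_tokens=None)
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    # Timing constraint: convert the duration into a token budget (assumed model).
    budget = math.ceil(target_seconds * tokens_per_second)
    return GenerationPlan(mode="duration_controlled", max_tokens=budget)
```

A video segment with a known length would call `plan_generation(segment_seconds)`, while narration with no timing constraint would call `plan_generation(None)`.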

Use Cases

Video dubbing and audio-visual synchronization
Intelligent voice assistants and personalized broadcasting
Multi-emotion speech generation and style transfer

Technical Highlights

IndexTTS combines a novel duration control method with GPT latent representations and a soft instruction mechanism to improve flexibility and expressiveness. The model is open source, supports multiple datasets, and is reported to outperform mainstream zero-shot TTS systems.

