Introduction
CosyVoice is a multilingual text-to-speech (TTS) library that supports zero-shot voice cloning, low-latency streaming synthesis, and cross-language generation. It is suitable for both online and offline deployment.
Key Features
- Supports speech synthesis for Chinese, English, Japanese, Korean, and various dialects
- Performs zero-shot voice cloning and cross-language synthesis
- Provides training, inference, and Docker deployment examples
Use Cases
- Voice assistants, podcast dubbing, virtual characters, and content creation
- Online services requiring low-latency, high-quality TTS
- Research and model fine-tuning scenarios
Technical Highlights
- Offers streaming inference and optimization paths such as Triton and TensorRT
- Provides a range of models and demo pages, released under the Apache-2.0 license
- Supports vLLM integration and GPU-accelerated deployment
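To illustrate why streaming inference lowers perceived latency, here is a minimal sketch of the consumer side. It does not use the CosyVoice API: `fake_synthesize_stream`, `consume_stream`, `SAMPLES_PER_CHAR`, and the chunk size are hypothetical names and values chosen only for this illustration. The idea is that a streaming synthesizer yields audio chunk by chunk, so playback can begin after the first chunk instead of after the whole utterance.

```python
from typing import Dict, Iterator, List

# Assumed value for illustration only: pretend each character yields 800 samples.
SAMPLES_PER_CHAR = 800


def fake_synthesize_stream(text: str, chunk_chars: int = 8) -> Iterator[List[float]]:
    """Hypothetical stand-in for a streaming TTS call.

    Yields one chunk of (silent) audio samples per slice of input text.
    """
    for start in range(0, len(text), chunk_chars):
        piece = text[start:start + chunk_chars]
        yield [0.0] * (len(piece) * SAMPLES_PER_CHAR)


def consume_stream(text: str) -> Dict[str, int]:
    """Consume chunks as they arrive and record how much audio the first one carries."""
    chunks = 0
    first_chunk_samples = 0
    total_samples = 0
    for chunk in fake_synthesize_stream(text):
        if chunks == 0:
            # A real player could start playback here, before synthesis finishes.
            first_chunk_samples = len(chunk)
        total_samples += len(chunk)
        chunks += 1
    return {
        "chunks": chunks,
        "first_chunk_samples": first_chunk_samples,
        "total_samples": total_samples,
    }


stats = consume_stream("Hello, streaming TTS!")
print(stats)
```

In a real deployment the generator would be replaced by the library's own streaming call, and each chunk would be handed to an audio sink as soon as it arrives.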