Detailed Introduction
Ming-UniAudio is an open-source speech framework from Ant Group that centers on a unified continuous audio tokenizer. By merging semantic and acoustic features into a continuous representation (MingTok-Audio), the project trains an end-to-end speech LLM capable of both understanding and generation. Building on this foundation, Ming-UniAudio enables instruction-guided free-form speech editing without requiring manual timestamp annotations.
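To make the "merging semantic and acoustic features into a continuous representation" idea concrete, the PyTorch sketch below concatenates two frame-aligned feature streams and projects them into one shared continuous latent stream. The module name, layer sizes, and dimensions are illustrative assumptions, not MingTok-Audio's actual architecture.

```python
import torch
import torch.nn as nn

class ContinuousFusionTokenizer(nn.Module):
    """Toy sketch: fuse semantic and acoustic frame features into a single
    continuous latent stream (illustrative only, not MingTok-Audio itself)."""

    def __init__(self, sem_dim=768, ac_dim=128, latent_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(sem_dim + ac_dim, 512),
            nn.GELU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, semantic, acoustic):
        # semantic: (batch, frames, sem_dim), e.g. from a speech encoder
        # acoustic: (batch, frames, ac_dim), e.g. mel-spectrogram frames
        fused = torch.cat([semantic, acoustic], dim=-1)
        return self.proj(fused)  # (batch, frames, latent_dim)

tokens = ContinuousFusionTokenizer()(torch.randn(1, 100, 768), torch.randn(1, 100, 128))
print(tokens.shape)  # torch.Size([1, 100, 256])
```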
Main Features
- Unified continuous audio tokenizer that encodes semantic and acoustic signals for both understanding and generation.
- A unified speech LLM (Large Language Model) with a diffusion head to improve synthesis quality.
- Instruction-driven free-form speech editing supporting insertion, deletion, and substitution guided solely by natural language.
- Open releases of models, benchmarks, and SFT (supervised fine-tuning) recipes, with downloads available on Hugging Face and ModelScope (see the download sketch after this list).
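Since the checkpoints are hosted on Hugging Face, a standard `huggingface_hub` download should work. The repo id below is a placeholder; substitute the actual model id listed on the project's release page.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: replace with the actual Ming-UniAudio model id
# published on the project's Hugging Face page.
local_dir = snapshot_download(repo_id="inclusionAI/Ming-UniAudio")
print(f"Checkpoint files downloaded to: {local_dir}")
```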
Use Cases
Ming-UniAudio is suitable for ASR/transcription, text-to-speech (TTS), dialogue understanding, audio post-processing, and interactive audio editing pipelines. Teams can use it to build editable voice assistants and automated audio-cleanup services, or integrate it into multimodal systems to improve the editability and expressiveness of speech content.
Technical Features
The project uses a VAE-based continuous tokenizer and a causal Transformer backbone, yielding hierarchical audio features that interface directly with LLMs. It publishes evaluations across understanding, generation, and editing tasks, and releases the first free-form audio editing benchmark. SFT recipes and example code are provided for reproducing training and inference on heterogeneous hardware.
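To ground the "VAE-based continuous tokenizer" phrase, here is a minimal variational bottleneck in PyTorch: an encoder head predicts a per-frame mean and log-variance, and the reparameterization trick samples the continuous latent that a downstream LLM or diffusion head would consume. All module choices and dimensions are illustrative assumptions, not the published MingTok-Audio design.

```python
import torch
import torch.nn as nn

class ToyVAEBottleneck(nn.Module):
    """Minimal variational bottleneck over frame features (illustrative only;
    the real MingTok-Audio design may differ substantially)."""

    def __init__(self, feat_dim=256, latent_dim=64):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) from an upstream audio encoder
        mu, logvar = self.to_mu(feats), self.to_logvar(feats)
        # Reparameterization trick: sample a continuous latent per frame
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL term regularizes the latent space toward a standard Gaussian
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

z, kl = ToyVAEBottleneck()(torch.randn(2, 100, 256))
print(z.shape, kl.item())  # torch.Size([2, 100, 64]) and a scalar KL value
```

Unlike a discrete codebook (e.g. VQ-style tokens), this continuous latent stream preserves fine acoustic detail, which is one plausible reason a continuous tokenizer suits both understanding and generation.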