Detailed Introduction
Ming-UniAudio is an open-source speech framework from Ant Group that centers on a unified continuous audio tokenizer. By merging semantic and acoustic features into a continuous representation (MingTok-Audio), the project trains an end-to-end speech LLM capable of both understanding and generation. Building on this foundation, Ming-UniAudio enables instruction-guided free-form speech editing without requiring manual timestamp annotations.
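To make the "merging semantic and acoustic features into a continuous representation" idea concrete, the PyTorch sketch below concatenates two frame-aligned feature streams and projects them into one shared continuous latent stream. The module name, layer sizes, and dimensions are illustrative assumptions, not MingTok-Audio's actual architecture.

```python
import torch
import torch.nn as nn

class ContinuousFusionTokenizer(nn.Module):
    """Toy sketch: fuse semantic and acoustic frame features into a single
    continuous latent stream (illustrative only, not MingTok-Audio itself)."""

    def __init__(self, sem_dim=768, ac_dim=128, latent_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(sem_dim + ac_dim, 512),
            nn.GELU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, semantic, acoustic):
        # semantic: (batch, frames, sem_dim), e.g. from a speech encoder
        # acoustic: (batch, frames, ac_dim), e.g. mel-spectrogram frames
        fused = torch.cat([semantic, acoustic], dim=-1)
        return self.proj(fused)  # (batch, frames, latent_dim)

tokens = ContinuousFusionTokenizer()(torch.randn(1, 100, 768), torch.randn(1, 100, 128))
print(tokens.shape)  # torch.Size([1, 100, 256])
```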
Main Features
- Unified continuous audio tokenizer that encodes semantic and acoustic signals for both understanding and generation.
- A unified speech LLM (Large Language Model) with a diffusion head to improve synthesis quality.
- Instruction-driven free-form speech editing supporting insertion, deletion, and substitution guided solely by natural language.
- Open releases of models, benchmarks, and SFT (supervised fine-tuning) recipes, with downloads available on Hugging Face and ModelScope (see the download sketch after this list).
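Since the checkpoints are hosted on Hugging Face, a standard `huggingface_hub` download should work. The repo id below is a placeholder; substitute the actual model id listed on the project's release page.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: replace with the actual Ming-UniAudio model id
# published on the project's Hugging Face page.
local_dir = snapshot_download(repo_id="inclusionAI/Ming-UniAudio")
print(f"Checkpoint files downloaded to: {local_dir}")
```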
Use Cases
Ming-UniAudio is suitable for ASR/transcription, text-to-speech (TTS), dialogue understanding, audio post-processing, and interactive audio editing pipelines. Teams can use it to build editable voice assistants and automated audio-cleanup services, or integrate it into multimodal systems to improve the editability and expressiveness of speech content.
Technical Features
The project uses a VAE-based continuous tokenizer and a causal Transformer backbone, yielding hierarchical audio features that interface directly with LLMs. It publishes evaluations across understanding, generation, and editing tasks, and releases the first free-form audio editing benchmark. SFT recipes and example code are provided for reproducing training and inference on heterogeneous hardware.
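To ground the "VAE-based continuous tokenizer" phrase, here is a minimal variational bottleneck in PyTorch: an encoder head predicts a per-frame mean and log-variance, and the reparameterization trick samples the continuous latent that a downstream LLM or diffusion head would consume. All module choices and dimensions are illustrative assumptions, not the published MingTok-Audio design.

```python
import torch
import torch.nn as nn

class ToyVAEBottleneck(nn.Module):
    """Minimal variational bottleneck over frame features (illustrative only;
    the real MingTok-Audio design may differ substantially)."""

    def __init__(self, feat_dim=256, latent_dim=64):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) from an upstream audio encoder
        mu, logvar = self.to_mu(feats), self.to_logvar(feats)
        # Reparameterization trick: sample a continuous latent per frame
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL term regularizes the latent space toward a standard Gaussian
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

z, kl = ToyVAEBottleneck()(torch.randn(2, 100, 256))
print(z.shape, kl.item())  # torch.Size([2, 100, 64]) and a scalar KL value
```

Unlike a discrete codebook (e.g. VQ-style tokens), this continuous latent stream preserves fine acoustic detail, which is one plausible reason a continuous tokenizer suits both understanding and generation.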