MiMo-Audio is a family of audio language models from the Xiaomi MiMo team that exhibit few-shot learning and generalize broadly across audio tasks. The project provides two models (MiMo-Audio-7B-Base and MiMo-Audio-7B-Instruct), a tokenizer, a technical report, an evaluation toolkit, and Hugging Face demos for research on audio understanding and generation.
Features
- Few-shot audio language modeling capabilities
- Released base and instruction-tuned models with inference examples
- Evaluation toolkit (MiMo-Audio-Eval) for benchmarking and reproducibility
- Online demos and model downloads on Hugging Face
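As a rough sketch of how the model downloads might be scripted, the snippet below maps the two model names to Hugging Face repository IDs and fetches a snapshot with `huggingface_hub.snapshot_download`. The organization name `XiaomiMiMo` and the helper names `repo_id_for` and `download_model` are assumptions made for illustration; check the project's Hugging Face page for the actual repository IDs.

```python
from typing import Optional

# Assumption: the Hugging Face organization name is not stated in this
# document; replace it with the one shown on the project's model page.
HF_ORG = "XiaomiMiMo"

# Model names as listed in the project description.
VARIANTS = {
    "base": "MiMo-Audio-7B-Base",
    "instruct": "MiMo-Audio-7B-Instruct",
}


def repo_id_for(variant: str) -> str:
    """Build a Hugging Face repo ID for a model variant (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(VARIANTS)}")
    return f"{HF_ORG}/{VARIANTS[variant]}"


def download_model(variant: str, local_dir: Optional[str] = None) -> str:
    """Download all files of the chosen variant and return the local path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_for(variant), local_dir=local_dir)
```

Calling `download_model("instruct", local_dir="./models/instruct")` would fetch the instruction-tuned weights, assuming the repository ID resolves and `huggingface_hub` is installed.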
Use Cases
- Audio understanding and spoken-language tasks (ASR, speaker tasks, semantic understanding)
- Speech generation, style transfer, and audio editing
- Research reproducibility, model evaluation, and benchmark development
Technical Details
- Models combine a high-rate tokenizer based on residual vector quantization (RVQ) with an LLM for efficient audio modeling
- Provides a Gradio demo, inference scripts, and model weights via Hugging Face
- License: Apache-2.0 (see repository LICENSE)
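To make the RVQ idea in the tokenizer bullet concrete, here is a toy sketch of residual vector quantization in pure Python. It is not MiMo-Audio's tokenizer: the function names and the tiny hand-written codebooks are illustrative assumptions, and a real tokenizer learns its codebooks and operates on encoder features rather than raw 2-D vectors. Each stage picks the codeword nearest to what the previous stages left unexplained, so a frame becomes a short list of integer codes.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def sq_dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage sees only the error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct the vector by summing the selected codeword of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two hand-made stages: a coarse codebook and a finer one for the residual.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
toy_codes = rvq_encode([1.2, 0.8], toy_codebooks)
toy_recon = rvq_decode(toy_codes, toy_codebooks)
```

The resulting integer codes are the kind of discrete tokens a language model can consume; adding more stages reduces reconstruction error at the cost of more tokens per frame.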