MiMo-Audio

MiMo-Audio: audio language models demonstrating few-shot learning and strong generalization across audio understanding and generation tasks.

Author: Xiaomi

Added Date: 2025-09-20

Open Source Since: 2025-09-19


MiMo-Audio is a family of audio language models from the Xiaomi MiMo team that demonstrates few-shot learning and broad generalization across audio tasks. The project releases the MiMo-Audio-7B-Base and MiMo-Audio-7B-Instruct models, a tokenizer, a technical report, an evaluation toolkit, and Hugging Face demos for research on audio understanding and generation.

Features

Few-shot audio language modeling capabilities
Released base and instruction-tuned models with inference examples
Evaluation toolkit (MiMo-Audio-Eval) for benchmarking and reproducibility
Online demos and model downloads on Hugging Face

Use Cases

Audio understanding and spoken-language tasks (ASR, speaker tasks, semantic understanding)
Speech generation, style transfer, and audio editing
Research reproducibility, model evaluation, and benchmark development

Technical Details

Models pair a high-rate tokenizer based on residual vector quantization (RVQ) with an LLM for efficient audio modeling
Gradio demo, inference scripts, and model weights are provided via Hugging Face
License: Apache-2.0 (see repository LICENSE)
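
To make the RVQ idea mentioned above concrete, here is a minimal, generic sketch of residual vector quantization in plain Python. It is not the MiMo-Audio tokenizer: the function names, the codebook sizes, and the randomly initialised codebooks are all assumptions chosen for illustration, and a real tokenizer would learn its codebooks and operate on encoder features rather than raw toy vectors.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec in stages; each stage encodes the residual left by the previous one."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage only sees what is still unexplained.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy setup (hypothetical sizes): 3 stages, 8 entries each, 4-dimensional frames.
rng = random.Random(0)
DIM, STAGES, ENTRIES = 4, 3, 8
codebooks = [
    [[rng.gauss(0, 1.0 / (stage + 1)) for _ in range(DIM)] for _ in range(ENTRIES)]
    for stage in range(STAGES)
]
frame = [rng.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(frame, codebooks)   # one integer token per stage
recon = rvq_decode(codes, codebooks)   # approximate reconstruction of frame
```

Each audio frame becomes one small integer per stage, and it is sequences of such integers that a language model can consume and predict as discrete tokens.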
