MiMo-Audio

MiMo-Audio: audio language models demonstrating few-shot learning and strong generalization across audio understanding and generation tasks.

MiMo-Audio is a family of audio language models from the Xiaomi MiMo team that demonstrates few-shot learning and broad generalization across audio tasks. The project provides models (MiMo-Audio-7B-Base and MiMo-Audio-7B-Instruct), a tokenizer, a technical report, an evaluation toolkit, and Hugging Face demos for research on audio understanding and generation.

Features

  • Few-shot audio language modeling capabilities
  • Released base and instruction-tuned models with inference examples
  • Evaluation toolkit (MiMo-Audio-Eval) for benchmarking and reproducibility
  • Online demos and model downloads on Hugging Face

Use Cases

  • Audio understanding and spoken-language tasks (ASR, speaker tasks, semantic understanding)
  • Speech generation, style transfer, and audio editing
  • Research reproducibility, model evaluation, and benchmark development

Technical Details

  • Models combine a high-rate tokenizer based on residual vector quantization (RVQ) with an LLM for efficient audio modeling
  • Provides a Gradio demo, inference scripts, and model weights via Hugging Face
  • License: Apache-2.0 (see repository LICENSE)
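
As a rough illustration of the residual vector quantization (RVQ) idea mentioned above, the sketch below encodes one feature frame with a stack of codebooks: each stage picks the nearest entry to whatever the previous stages left unexplained, and decoding sums the chosen entries. This is a toy under stated assumptions, not the MiMo-Audio tokenizer: the function names (`rvq_encode`, `rvq_decode`), the two-dimensional frame, and the hand-written codebooks are all hypothetical, whereas a real tokenizer learns its codebooks and operates on encoder features.

```python
from typing import List

Vector = List[float]
Codebook = List[Vector]


def _nearest(codebook: Codebook, target: Vector) -> int:
    """Index of the codebook entry closest to target (squared Euclidean distance)."""
    def dist(entry: Vector) -> float:
        return sum((e - t) ** 2 for e, t in zip(entry, target))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(frame: Vector, codebooks: List[Codebook]) -> List[int]:
    """Quantize one frame into one code index per RVQ stage."""
    residual = list(frame)
    codes = []
    for codebook in codebooks:
        idx = _nearest(codebook, residual)
        codes.append(idx)
        # The next stage only sees what this stage failed to represent.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: List[int], codebooks: List[Codebook]) -> Vector:
    """Reconstruct a frame by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a finer one.
TOY_CODEBOOKS: List[Codebook] = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

frame = [1.2, 1.0]
codes = rvq_encode(frame, TOY_CODEBOOKS)
recon = rvq_decode(codes, TOY_CODEBOOKS)
print(codes, recon)  # [1, 1] [1.25, 1.0]
```

Each additional stage refines the reconstruction, so a frame is represented by several small integers rather than one; those per-stage indices are the kind of discrete tokens a language model can then be trained on.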

Resource Info

  • Author: Xiaomi
  • Added Date: 2025-09-20
  • Tags: OSS Project LLM TTS