Introduction
Kimi-Audio is an open-source audio foundation model that unifies audio understanding, generation, and conversation in a single framework. It supports automatic speech recognition (ASR), audio question answering, audio captioning, emotion and sound-event classification, and end-to-end speech conversation.
Key Features
- Hybrid audio representation: discrete semantic tokens combined with continuous acoustic features.
- Large-scale pretraining on diverse audio and text data for robust audio reasoning.
- Parallel generation heads for text and audio tokens, enabling simultaneous text and audio outputs.
- Efficient chunk-wise streaming detokenizer for low-latency audio generation.
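The chunk-wise streaming idea behind the detokenizer can be sketched in a few lines: audio tokens are converted to waveform in fixed-size chunks as they arrive, instead of after the full sequence, so playback can start early. Everything below is a toy stand-in, not Kimi-Audio's actual detokenizer: `detokenize_chunk`, the chunk size, and the token-to-sample mapping are invented for illustration.

```python
from typing import Iterator, List

def detokenize_chunk(tokens: List[int]) -> List[float]:
    # Toy stand-in for the flow-matching detokenizer + vocoder:
    # map each discrete token id to one waveform sample.
    return [t / 100.0 for t in tokens]

def streaming_detokenize(tokens: List[int], chunk_size: int = 4) -> Iterator[List[float]]:
    # Emit audio as soon as each fixed-size chunk of tokens is available,
    # rather than waiting for the whole token sequence (lower latency).
    for start in range(0, len(tokens), chunk_size):
        yield detokenize_chunk(tokens[start:start + chunk_size])

# Ten tokens split into chunks of four: audio for tokens 0-3 is ready
# while tokens 8-9 are still being generated.
chunks = list(streaming_detokenize(list(range(10)), chunk_size=4))
```

Concatenating the streamed chunks reproduces exactly what a non-streaming pass would produce; the trade-off is only latency, not output.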
Use Cases
- Automatic Speech Recognition (ASR) and transcription services.
- Audio-to-text chatbots and conversational agents with spoken responses.
- Audio captioning and understanding for multimedia indexing and search.
- Research and benchmarking of audio LLMs with the provided evaluation toolkit.
Technical Highlights
- Audio tokenizer with vector quantization producing discrete semantic tokens.
- Transformer-based Audio LLM initialized from text LLM backbones.
- Flow-matching detokenizer + vocoder (BigVGAN) for high-fidelity waveform synthesis.
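The vector-quantization step of the tokenizer can be illustrated with a toy nearest-neighbour quantizer: each continuous feature vector is replaced by the index of its closest codebook entry, yielding a discrete token id. The codebook, features, and function names here are invented for the sketch; the real tokenizer's codebook is learned during training.

```python
from typing import List, Sequence

def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    # Nearest-neighbour assignment: map each continuous feature vector
    # to the index of the closest codebook entry (a discrete token id).
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]

# Tiny 3-entry codebook in 2-D; real codebooks are learned and much larger.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.1, -0.05), (0.9, 0.1), (0.05, 1.1)]
tokens = quantize(features, codebook)  # → [0, 1, 2]
```

The resulting token ids are what the Transformer-based Audio LLM consumes and predicts, analogous to word-piece ids in a text LLM.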