Overview
MiniCPM-V is a series of efficient end-side multimodal LLMs (MLLMs) designed for single-image, multi-image, and high-FPS video understanding, extending to speech and real-time multimodal streaming on mobile and edge devices.
Key Features
- Support for multimodal inputs (image/video/text/speech) with unified encoding and long-video capabilities.
- Multiple model variants and quantized formats (GGUF, int4, AWQ) for cross-platform deployment and efficient inference.
- Comprehensive cookbook, documentation and demos covering inference, fine-tuning and deployment.
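To make the unified-input idea concrete, here is a minimal sketch of how a multimodal chat turn might be assembled for an MLLM of this kind: images and text are interleaved in a single user message, which the model's processor would later encode. All function and field names here are illustrative assumptions, not the actual MiniCPM-V API.

```python
# Hypothetical sketch: interleave image references and text into one
# chat-style user message. Names are illustrative, not MiniCPM-V's API.

def build_multimodal_message(segments):
    """Build a chat message from ("image", path) / ("text", string) tuples."""
    content = []
    for kind, value in segments:
        if kind == "image":
            content.append({"type": "image", "path": value})
        elif kind == "text":
            content.append({"type": "text", "text": value})
        else:
            raise ValueError(f"unsupported segment kind: {kind}")
    return {"role": "user", "content": content}

msg = build_multimodal_message([
    ("image", "receipt.png"),
    ("text", "What is the total amount on this receipt?"),
])
```

In practice, the model's own processor or chat template defines the exact message schema; this sketch only illustrates the interleaving pattern common to multimodal chat interfaces.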
Use Cases
- On-device image/video understanding, OCR and document parsing.
- Real-time multimodal live streaming, speech-enabled assistants and multimedia retrieval.
- Evaluation, fine-tuning, and edge-deployment experiments by research and product teams.
Technical Details
- Introduces a 3D-Resampler and other techniques for high-density video token compression and long-sequence understanding.
- Integrates with ecosystems like llama.cpp, Ollama and vLLM for efficient inference.
- Released under the Apache-2.0 license, with technical reports and evaluation artifacts available.
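The intuition behind high-density video token compression can be sketched as follows. This is not the actual 3D-Resampler (which operates on learned visual embeddings), only a toy illustration of the core idea: consecutive frames are grouped along the temporal axis, and each group is compressed to a single frame's worth of tokens, cutting the sequence length by roughly the group size.

```python
# Toy illustration (not the real 3D-Resampler): compress groups of
# consecutive frames into one pooled token list per group.

def compress_video_tokens(frames, group_size=3):
    """Average patch tokens across each temporal group of frames.

    `frames` is a list of frames; each frame is a list of float "tokens".
    Returns one pooled token list per group of `group_size` frames.
    """
    compressed = []
    for start in range(0, len(frames), group_size):
        group = frames[start:start + group_size]
        n_tokens = len(group[0])
        pooled = [
            sum(frame[i] for frame in group) / len(group)
            for i in range(n_tokens)
        ]
        compressed.append(pooled)
    return compressed

# Six 4-token frames, grouped in threes -> 2 pooled token lists instead of 6.
frames = [[float(f + t) for t in range(4)] for f in range(6)]
out = compress_video_tokens(frames, group_size=3)
```

The real mechanism uses learned attention-based resampling rather than simple averaging, but the payoff is the same: more frames fit into a fixed token budget, which is what enables high-FPS and long-video understanding on constrained devices.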