Overview
MiniCPM-V is a series of efficient end-side multimodal LLMs (MLLMs) designed for single-image, multi-image, and high-FPS video understanding, extending to speech and real-time multimodal streaming on mobile and edge devices.
Key Features
- Support for multimodal inputs (image/video/text/speech) with unified encoding and long-video capabilities.
- Multiple model variants and quantized formats (GGUF, int4, AWQ) for cross-platform deployment and efficient inference.
- Comprehensive cookbook, documentation and demos covering inference, fine-tuning and deployment.
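To make the unified-input idea concrete, here is a minimal sketch of how a multimodal chat turn might be assembled for an MLLM of this kind: images and text are interleaved in a single user message, which the model's processor would later encode. All function and field names here are illustrative assumptions, not the actual MiniCPM-V API.

```python
# Hypothetical sketch: interleave image references and text into one
# chat-style user message. Names are illustrative, not MiniCPM-V's API.

def build_multimodal_message(segments):
    """Build a chat message from ("image", path) / ("text", string) tuples."""
    content = []
    for kind, value in segments:
        if kind == "image":
            content.append({"type": "image", "path": value})
        elif kind == "text":
            content.append({"type": "text", "text": value})
        else:
            raise ValueError(f"unsupported segment kind: {kind}")
    return {"role": "user", "content": content}

msg = build_multimodal_message([
    ("image", "receipt.png"),
    ("text", "What is the total amount on this receipt?"),
])
```

In practice, the model's own processor or chat template defines the exact message schema; this sketch only illustrates the interleaving pattern common to multimodal chat interfaces.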
Use Cases
- On-device image/video understanding, OCR and document parsing.
- Real-time multimodal live streaming, speech-enabled assistants and multimedia retrieval.
- Evaluation, fine-tuning, and edge-deployment experiments by research and product teams.
Technical Details
- Introduces a 3D-Resampler and other techniques for high-density video token compression and long-sequence understanding.
- Integrates with ecosystems like llama.cpp, Ollama and vLLM for efficient inference.
- Released under the Apache-2.0 license, with technical reports and evaluation artifacts available.
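The intuition behind high-density video token compression can be sketched as follows. This is not the actual 3D-Resampler (which operates on learned visual embeddings), only a toy illustration of the core idea: consecutive frames are grouped along the temporal axis, and each group is compressed to a single frame's worth of tokens, cutting the sequence length by roughly the group size.

```python
# Toy illustration (not the real 3D-Resampler): compress groups of
# consecutive frames into one pooled token list per group.

def compress_video_tokens(frames, group_size=3):
    """Average patch tokens across each temporal group of frames.

    `frames` is a list of frames; each frame is a list of float "tokens".
    Returns one pooled token list per group of `group_size` frames.
    """
    compressed = []
    for start in range(0, len(frames), group_size):
        group = frames[start:start + group_size]
        n_tokens = len(group[0])
        pooled = [
            sum(frame[i] for frame in group) / len(group)
            for i in range(n_tokens)
        ]
        compressed.append(pooled)
    return compressed

# Six 4-token frames, grouped in threes -> 2 pooled token lists instead of 6.
frames = [[float(f + t) for t in range(4)] for f in range(6)]
out = compress_video_tokens(frames, group_size=3)
```

The real mechanism uses learned attention-based resampling rather than simple averaging, but the payoff is the same: more frames fit into a fixed token budget, which is what enables high-FPS and long-video understanding on constrained devices.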