MiniCPM-V

MiniCPM-V is a family of efficient end-side multimodal large models that support image, video, text and speech inputs for strong multimodal understanding and real-time streaming scenarios.

OpenBMB · Since 2024-07-31

Loading score...

GitHub

Overview

MiniCPM-V is a series of efficient end-side multimodal LLMs (MLLMs) designed to handle single-image, multi-image and high-FPS video understanding, and extend to speech and real-time multimodal streaming on mobile and edge devices.

Key Features

Support for multimodal inputs (image/video/text/speech) with unified encoding and long-video capabilities.
Multiple model variants and quantized formats (GGUF, int4, AWQ) for cross-platform deployment and efficient inference.
Comprehensive cookbook, documentation and demos covering inference, fine-tuning and deployment.

Use Cases

On-device image/video understanding, OCR and document parsing.
Real-time multimodal live streaming, speech-enabled assistants and multimedia retrieval.
Research and product teams for evaluation, fine-tuning and edge deployment experiments.

Technical Details

Introduces a 3D-Resampler and other techniques for high-density video token compression and long-sequence understanding.
Integrates with ecosystems like llama.cpp, Ollama and vLLM for efficient inference.
Released under Apache-2.0 license with technical reports and evaluation artifacts available.

Core Content

Core Content

Technology

Technology

More

More

AI Infrastructure

AI Infrastructure

Explore

Explore

Connect

Connect

Quick Links

Quick Links

LinkedIn

LinkedIn

Follow on X

Follow on X

MiniCPM-V

Overview

Key Features

Use Cases

Technical Details

Score Breakdown

Related Resources

ChatDev

UltraRAG

VoxCPM