MiniCPM-V

MiniCPM-V is a family of efficient end-side multimodal large models that accept image, video, text, and speech inputs, targeting strong multimodal understanding and real-time streaming scenarios.

Overview

MiniCPM-V is a series of efficient end-side multimodal large language models (MLLMs) designed for single-image, multi-image, and high-FPS video understanding, extending to speech and real-time multimodal streaming on mobile and edge devices.

Key Features

  • Support for multimodal inputs (image/video/text/speech) with unified encoding and long-video capabilities.
  • Multiple model variants and quantized formats (GGUF, int4, AWQ) for cross-platform deployment and efficient inference; see the loading sketch after this list.
  • Comprehensive cookbook, documentation and demos covering inference, fine-tuning and deployment.
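
A minimal loading sketch for one of these variants, assuming the openbmb/MiniCPM-V-2_6 checkpoint on Hugging Face and Transformers' trust_remote_code loading path; the exact model IDs and dtypes for the variant you deploy should be taken from the project's cookbook:

```python
# Sketch: loading a MiniCPM-V checkpoint with Hugging Face Transformers.
# Model IDs are assumptions based on the openbmb Hugging Face organization.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"  # an int4 build (e.g. a "-int4" suffix) is typically released alongside

model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,       # MiniCPM-V ships custom modeling code
    torch_dtype=torch.bfloat16,   # quantized builds configure their own dtype
)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```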

Use Cases

  • On-device image/video understanding, OCR, and document parsing (see the sketch after this list).
  • Real-time multimodal live streaming, speech-enabled assistants and multimedia retrieval.
  • Evaluation, fine-tuning, and edge-deployment experiments by research and product teams.
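
Continuing the loading sketch above, an OCR-style single-image query might look like the following; the msgs format and the custom chat() interface follow the published model-card examples but can differ between MiniCPM-V versions, and the image path is a placeholder:

```python
# Sketch: single-image question answering / OCR-style prompting.
from PIL import Image

image = Image.open("receipt.jpg").convert("RGB")  # hypothetical local file
msgs = [{"role": "user", "content": [image, "Transcribe all text in this image."]}]

# chat() is the custom interface exposed via trust_remote_code; its exact
# signature varies across MiniCPM-V releases.
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```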

Technical Details

  • Introduces a 3D-Resampler and other techniques for high-density video token compression and long-sequence understanding.
  • Integrates with inference ecosystems such as llama.cpp, Ollama, and vLLM for efficient inference; see the serving sketch after this list.
  • Released under Apache-2.0 license with technical reports and evaluation artifacts available.
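
As a sketch of the vLLM route: assuming a server launched with something like `vllm serve openbmb/MiniCPM-V-2_6 --trust-remote-code`, any OpenAI-compatible client can send image-plus-text requests. The endpoint URL, model ID, and image URL below are placeholders for your own deployment:

```python
# Sketch: querying MiniCPM-V behind vLLM's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2_6",  # must match the model the server was started with
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```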

Resource Info
🌱 Open Source 🎨 Multimodal 🖥️ ML Platform 🔮 Inference