
InternVL

An open-source multimodal vision-language toolbox and baseline for image and video understanding and generation tasks.

Overview

InternVL is an open-source multimodal vision-language toolbox released by OpenGVLab. It provides baselines and end-to-end pipelines for image and video understanding, retrieval, and generation tasks, covering data preprocessing, model training, and evaluation. The project offers reproducible implementations and practical baselines for both research and engineering use.

Key features

  • Support for multimodal (image, video, and text) model training and evaluation.
  • Rich data preprocessing, augmentation, and training scripts for reproducible experiments.
  • Ready-to-use model implementations and examples to quickly validate downstream tasks.

Use cases

  • Research benchmarks for visual question answering, image-text retrieval, and image/video classification and segmentation.
  • Reproducing published academic results or serving as comparative baselines in research.
  • Rapid prototyping of multimodal proofs-of-concept in engineering projects.

Technical details

  • Implemented in PyTorch, making it straightforward to extend and deploy.
  • Provides comprehensive training and evaluation pipelines, including support for distributed training.
  • Compatible with mainstream multimodal pretraining and fine-tuning strategies and model architectures.
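As a quick orientation, pretrained InternVL checkpoints published by OpenGVLab on the Hugging Face Hub can be loaded through the `transformers` auto classes. The sketch below is a minimal example, assuming `transformers` and `torch` are installed; the model ID is one published checkpoint, and `trust_remote_code=True` is needed because the architecture ships as custom modeling code rather than a built-in `transformers` class.

```python
def load_internvl(model_id: str = "OpenGVLab/InternVL2-8B"):
    """Load an InternVL checkpoint and its tokenizer from the Hugging Face Hub.

    Assumes the `transformers` library is installed; the default model_id is
    an example checkpoint, and downloading it requires network access and
    sufficient memory.
    """
    # Imported lazily so the function can be defined without transformers present.
    from transformers import AutoModel, AutoTokenizer

    # trust_remote_code=True pulls in InternVL's custom architecture code.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()
    return tokenizer, model
```

For training, fine-tuning, and distributed setups, the scripts in the GitHub repository are the authoritative entry points; the loader above only covers inference-style checkpoint loading.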


Resource Info

  • Author: OpenGVLab
  • Added: 2025-10-03
  • Open source since: 2023-11-22
  • Tags: Multimodal LLM, Image Generation, Open Source