Overview
LLaVA-NeXT is an open-source family of large multimodal models and an accompanying toolkit from the LLaVA team. It aims to unify training and inference across single-image, multi-image, video, and 3D tasks, and it provides training scripts, evaluation tools, and multiple model variants suitable for both research and engineering use.
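As an illustration of the unified inference side, the sketch below runs single-image question answering through the Hugging Face `transformers` integration (`LlavaNextProcessor` / `LlavaNextForConditionalGeneration`). The checkpoint id `llava-hf/llava-v1.6-mistral-7b-hf`, the placeholder image path, and the `[INST] ... [/INST]` prompt template are assumptions tied to the community-converted Hugging Face checkpoints, not the repository's native scripts.

```python
# Minimal single-image inference sketch via the Hugging Face `transformers`
# LLaVA-NeXT classes. Assumes the HF-converted checkpoint
# "llava-hf/llava-v1.6-mistral-7b-hf"; the repository's own scripts differ.
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # any local RGB image (placeholder path)

# The <image> token marks where visual features are spliced into the prompt;
# the [INST] wrapper matches the Mistral-based conversion's chat format.
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```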
Key features
- Interleaved multimodal training format supporting multi-image and video inference (an illustrative data-format sketch follows this list).
- Multiple model variants with reproduction scripts, plus training, evaluation, and benchmarking tooling (lmms-eval).
- Regularly released checkpoints and evaluation results, with demos and blog posts documenting updates.
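To make the interleaved format concrete, here is a hypothetical multi-image training sample in the LLaVA-style JSON layout, where `<image>` placeholders are interleaved with the conversation text. The field names (`id`, `image`, `conversations`, `from`/`value`) follow the common LLaVA data convention and should be checked against the repository's data documentation; video samples can be handled the same way by sampling frames into an ordered image list.

```python
# Hypothetical interleaved multi-image training sample in LLaVA-style JSON.
# Field names follow the common LLaVA convention ("id", "image",
# "conversations", "from"/"value"); verify against the repo's data docs.
import json

sample = {
    "id": "interleave-000001",
    # List of images; each is referenced in order by an <image> token below.
    "image": ["scenes/frame_01.jpg", "scenes/frame_02.jpg"],
    "conversations": [
        {
            "from": "human",
            # <image> placeholders are interleaved with the instruction text,
            # so the model sees images and text in their original order.
            "value": "<image>\nHere is the first view. <image>\nWhat changed between the two views?",
        },
        {
            "from": "gpt",
            "value": "The second view shows the same scene from a closer angle, with a person entering on the left.",
        },
    ],
}

print(json.dumps(sample, indent=2))
```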
Use cases
- Multimodal benchmarks, model comparisons, and academic reproductions.
- Video understanding, image question answering, image editing and multi-image scene understanding.
- Research baselines and engineering prototypes.
Technical details
- Implemented in PyTorch, with support for large-scale training, quantization, and inference optimizations (a hedged quantized-loading sketch follows this list).
- Employs scalable architectures and training strategies, including critic models and DPO/RLHF training methods (a generic DPO loss sketch also follows this list).
- Provides comprehensive docs, demos (including Hugging Face Spaces) and dataset links for reproducibility and evaluation.
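As one concrete example of the inference-side optimizations, a 4-bit quantized load via `bitsandbytes` could look like the sketch below. It uses the standard Hugging Face `BitsAndBytesConfig` mechanism with an assumed HF-converted checkpoint id, and illustrates the general technique rather than the repository's own loading code.

```python
# Hedged sketch: loading a LLaVA-NeXT checkpoint with 4-bit quantization via
# bitsandbytes, using the standard Hugging Face quantization config. This
# shows the general mechanism, not the repository's own loading code.
import torch
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed HF-converted checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16, # compute in fp16 on top of 4-bit weights
)

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                    # spread layers across available GPUs
)
```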
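The DPO training mentioned above optimizes the policy to prefer chosen over rejected responses relative to a frozen reference model. The function below is a generic sketch of the standard DPO loss on summed response log-probabilities; it is not taken from the repository's training code.

```python
# Generic sketch of the DPO (Direct Preference Optimization) loss on a batch
# of preference pairs. Inputs are summed log-probabilities of the chosen and
# rejected responses under the policy and a frozen reference model. This
# follows the standard DPO formulation, not the repository's exact code.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_policy(chosen | prompt), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log p_policy(rejected | prompt), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log p_ref(chosen | prompt), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log p_ref(rejected | prompt), shape (B,)
    beta: float = 0.1,                    # strength of the implicit KL penalty
) -> torch.Tensor:
    # Log-ratio of policy vs. reference for each response ("implicit reward").
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected implicit rewards.
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()
```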