Nano-vLLM

A lightweight vLLM implementation built from scratch, offering high-performance offline inference comparable to vLLM.

Author: GeeeekExplorer

Since: 2025-06-09

Detailed Introduction

Nano-vLLM is a lightweight reimplementation of vLLM written from scratch. It aims to match vLLM's offline inference speed while keeping the codebase readable and easy to customize. The project is written in concise Python, so researchers and engineers can iterate on it quickly and integrate it into their own work.

Main Features

High-performance offline inference comparable to vLLM on single-GPU setups.
A compact and readable codebase (~1,200 lines of Python) for easy customization and learning.
An optimization suite including prefix caching, tensor parallelism, Torch compilation, and CUDA Graph.
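
To make the prefix caching item concrete, here is a minimal sketch of one common way to do it: token ids are grouped into fixed-size blocks, and each block is hashed together with the hash of the block before it, so two prompts that share a prefix map to the same cached blocks. This is an illustration under stated assumptions, not Nano-vLLM's actual code; `BLOCK_SIZE`, `block_hash`, and `PrefixCache` are hypothetical names.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 4  # hypothetical block size, kept small for illustration


def block_hash(tokens: List[int], parent: Optional[str]) -> str:
    """Hash one full block of token ids, chained to the previous block's hash."""
    h = hashlib.sha256()
    h.update((parent or "").encode())
    h.update(",".join(map(str, tokens)).encode())
    return h.hexdigest()


class PrefixCache:
    """Toy cache from chained block hashes to block ids (hypothetical class)."""

    def __init__(self) -> None:
        self.hash_to_block: Dict[str, int] = {}
        self.next_block_id = 0

    def allocate(self, token_ids: List[int]) -> Tuple[List[int], int]:
        """Return (block ids for the full blocks, tokens served from cache)."""
        block_ids: List[int] = []
        cached_tokens = 0
        parent: Optional[str] = None
        # Only full blocks are cached; a trailing partial block is ignored here.
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE
        for start in range(0, full, BLOCK_SIZE):
            h = block_hash(token_ids[start:start + BLOCK_SIZE], parent)
            if h in self.hash_to_block:
                cached_tokens += BLOCK_SIZE  # block already computed earlier
            else:
                self.hash_to_block[h] = self.next_block_id
                self.next_block_id += 1
            block_ids.append(self.hash_to_block[h])
            parent = h
        return block_ids, cached_tokens


cache = PrefixCache()
ids_a, hit_a = cache.allocate([1, 2, 3, 4, 5, 6, 7, 8, 9])
ids_b, hit_b = cache.allocate([1, 2, 3, 4, 9, 9, 9, 9])
print(ids_a, hit_a)
print(ids_b, hit_b)
```

Because each hash is chained to its parent, a block is reused only when everything before it also matches, which is the property a KV cache needs. A real engine would also track reference counts and evict unused blocks, which this sketch leaves out.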

Use Cases

Local or edge inference for large models, along with throughput and latency benchmarking.
Prototyping inference stacks that require readable and customizable implementations.
Learning and verifying inference optimizations or comparing model runtime performance.
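
For the benchmarking and comparison use cases, a backend-agnostic timing harness is often enough to get started. The sketch below assumes a hypothetical `generate(prompts, max_tokens)` callable that returns one list of output token ids per prompt. It is not the interface of Nano-vLLM's bench.py, and `dummy_generate` is only a stand-in so the example runs without a GPU.

```python
import time
from typing import Callable, Dict, List, Sequence

# Hypothetical backend signature: (prompts, max_tokens) -> output token ids per prompt
GenerateFn = Callable[[Sequence[List[int]], int], List[List[int]]]


def measure_throughput(generate: GenerateFn,
                       prompts: Sequence[List[int]],
                       max_tokens: int) -> Dict[str, float]:
    """Time one batch and report output tokens per second."""
    start = time.perf_counter()
    outputs = generate(prompts, max_tokens)
    elapsed = time.perf_counter() - start
    total = sum(len(o) for o in outputs)
    rate = total / elapsed if elapsed > 0 else float("inf")
    return {"output_tokens": total, "seconds": elapsed, "tokens_per_second": rate}


def dummy_generate(prompts: Sequence[List[int]], max_tokens: int) -> List[List[int]]:
    """Stand-in backend that emits max_tokens zeros per prompt."""
    return [[0] * max_tokens for _ in prompts]


stats = measure_throughput(dummy_generate, [[1, 2], [3]], max_tokens=8)
print(stats["output_tokens"], "tokens generated")
```

Wrapping each engine behind the same callable keeps the comparison fair: the prompts, the token budget, and the timing code stay identical, and only the backend changes.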

Technical Characteristics

API mirrors vLLM's interface, so it can be swapped in as an inference backend with minimal changes.
Uses Torch compilation and CUDA Graph to reduce latency and supports tensor parallelism for multi-GPU scaling.
Includes examples and benchmarks (example.py, bench.py) to help users reproduce results and evaluate performance.
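
The idea behind tensor parallelism can be shown without any GPU. In the column-parallel form sketched below, a weight matrix is split by columns across ranks, each rank multiplies the same input by its own shard, and the partial outputs are concatenated (the all-gather step). This is a pure-Python illustration of the general technique, not Nano-vLLM's implementation; all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product of x (n x k) and w (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, world_size: int) -> List[Matrix]:
    """Split w into world_size equal column shards, one per simulated rank."""
    cols = len(w[0])
    assert cols % world_size == 0, "columns must divide evenly across ranks"
    shard = cols // world_size
    return [[row[r * shard:(r + 1) * shard] for row in w] for r in range(world_size)]


def column_parallel_linear(x: Matrix, w: Matrix, world_size: int) -> Matrix:
    """Compute x @ w by letting each simulated rank handle one column shard."""
    partials = [matmul(x, shard) for shard in split_columns(w, world_size)]
    # Simulated all-gather: concatenate each rank's output columns in rank order.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(column_parallel_linear(x, w, world_size=2))
print(matmul(x, w))
```

Both calls produce the same result, which is the point: sharding changes where the work happens, not the answer. On real hardware each shard lives on a different GPU and the concatenation is a collective communication call.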

