LMDeploy

LMDeploy is a toolkit for compressing, deploying and serving large language models, providing optimized inference engines, quantization and distribution features.

Author: InternLM

Since: 2023-06-15

Visit Website GitHub

Overview

LMDeploy provides end-to-end model compression, quantization and deployment capabilities, including high-performance engines (TurboMind), continuous batching and distribution services for latency-sensitive production workloads.

Key features

High-performance inference engines (TurboMind and optimized PyTorch backends).
Quantization and KV-cache optimization to reduce memory footprint and latency.
Deployment and distribution for offline batch and online multi-host serving.

Use cases

Convert research models into production inference services with minimal effort.
Serve high-concurrency, low-latency applications such as chat APIs.

Technical notes

Supports multiple backends and model formats; see project docs for compatible models and installation.
Includes benchmarking and visualization tooling for performance evaluation.

LMDeploy

Overview

Key features

Use cases

Technical notes

Resource Info

Related Resources

Kata Containers

Golem

Aspire