TensorRT-LLM

NVIDIA's open-source toolbox for optimized LLM inference, designed for efficient GPU serving and enterprise deployment.

Author: NVIDIA

Since: 2023-08-16

Introduction

TensorRT-LLM is NVIDIA's open-source toolbox for optimizing large language model (LLM) inference on NVIDIA GPUs, aimed at high-performance serving and enterprise deployment. It supports mainstream model families and several quantization formats, including FP8, FP4, INT4, and INT8.

Key Features

Custom attention kernels, batched inference, distributed parallelism, and multiple quantization formats (FP8, FP4, INT4, INT8)
High-level Python API for single-GPU, multi-GPU, and multi-node deployment
Integration with Triton Inference Server, PyTorch, and related ecosystem tools
Modular architecture that is easy to extend and customize
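
As a rough illustration of the high-level Python API mentioned above, the sketch below follows the pattern of the project's quick-start example. It assumes the tensorrt_llm package is installed and a supported NVIDIA GPU is available. The function name generate_with_trtllm, the default model identifier, and the sampling values are illustrative choices, not recommendations from this page.

```python
def generate_with_trtllm(prompts, model="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    """Generate one completion per prompt (hypothetical helper for illustration)."""
    # Imported lazily so that defining this helper does not require the package.
    from tensorrt_llm import LLM, SamplingParams

    # Sampling settings are arbitrary example values.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # The model argument is assumed to be a Hugging Face model name or a local path.
    llm = LLM(model=model)

    outputs = llm.generate(prompts, sampling_params)
    # Pair each prompt with the text of its first returned completion.
    return [(out.prompt, out.outputs[0].text) for out in outputs]
```

Calling generate_with_trtllm(["Hello, my name is"]) on a machine with a suitable GPU should return a list of (prompt, generated text) pairs. Multi-GPU and multi-node settings are configured through additional options that this sketch does not cover.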

Use Cases

Enterprise-scale LLM inference and deployment
Efficient GPU inference in cloud and on-premises environments
Rapid prototyping for LLM applications
Performance optimization for quantized models
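
To make the quantization use case concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization. It only illustrates the general idea of trading precision for smaller weights and is not TensorRT-LLM's implementation. The function names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.03, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# The rounding error per value is bounded by half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as a single byte plus one shared scale, which is where the memory savings come from. Production schemes typically use per-channel or per-group scales and calibration data to limit the accuracy loss.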

Technical Highlights

Implemented across C++, Python, and CUDA, with a strong focus on performance optimization
Built-in KV cache, inference scheduling, structured output, and other advanced features
Supports mainstream LLMs and quantized models, with a straightforward path for integrating new models
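
The KV cache listed above can be illustrated with a toy single-head attention loop. This is a conceptual sketch in plain Python, not the library's actual cache, which is implemented in optimized GPU code. The class KVCache, the function names, and the token vectors are all hypothetical.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all stored keys and values."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]


class KVCache:
    """Stores the keys and values of past tokens so they are not recomputed."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(cache, query, key, value):
    """Add the new token's key and value, then attend over the whole cache."""
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)


# Hypothetical (query, key, value) vectors for three generated tokens.
tokens = [
    ([1.0, 0.0], [0.5, 0.5], [1.0, 2.0]),
    ([0.0, 1.0], [1.0, 0.0], [3.0, 4.0]),
    ([1.0, 1.0], [0.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [decode_step(cache, q, k, v) for q, k, v in tokens]
```

At each step only the newest token's key and value are added, so the per-step cost grows with the sequence length rather than requiring every earlier token to be reprocessed. The price is memory that grows with the number of cached tokens, which is why cache management matters for serving.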

Related Resources

NVIDIA GPU Operator

Transformer Engine

CUTLASS