Text Generation Inference

Text Generation Inference (TGI) is Hugging Face's high-performance toolkit for serving text generation models in production, suitable for on-premises and cloud deployments.

Author: Hugging Face

Since: 2022-10-08


Overview

Text Generation Inference (TGI) is an open-source server and toolkit from Hugging Face for deploying and serving large language models (LLMs). It provides token streaming, continuous batching, and production observability.

Key features

High performance: supports tensor parallelism, continuous batching, and streaming outputs.
Broad model and hardware support: compatible with models such as Llama, Falcon, and StarCoder, and optimized for multiple accelerator types.
Production-ready: built-in metrics, tracing, and observability integrations.
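
To make the continuous batching feature above more concrete, here is a toy simulation of the general technique. It is not TGI's scheduler: the function name, the request tuples, and the one-token-per-step model are all assumptions made for illustration. The idea it shows is that a waiting request joins the running batch as soon as a slot frees up, instead of waiting for the whole batch to finish.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Toy scheduler (hypothetical, not TGI's implementation).

    `requests` is a list of (name, tokens_to_generate) pairs.
    Returns a dict mapping each request name to the decode step
    at which it finished.
    """
    waiting = deque(requests)
    running = {}       # name -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Before every decode step, fill free slots from the waiting queue.
        while waiting and len(running) < max_batch_size:
            name, tokens = waiting.popleft()
            running[name] = tokens
        step += 1
        # One decode step yields one token for each running request.
        for name in list(running):
            running[name] -= 1
            if running[name] <= 0:
                del running[name]
                finished_at[name] = step
    return finished_at


# Request "c" takes over the slot freed by "b" after step 1, so it
# finishes at step 3 rather than waiting for "a" to complete first.
print(continuous_batching([("a", 3), ("b", 1), ("c", 2)], max_batch_size=2))
```

With a static batch of size 2, request "c" could only start after both "a" and "b" were done, and would finish at step 5 in this model.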

Use cases

On-premises inference services for enterprises demanding data privacy.
Backend for RAG, chat assistants, or code generation services.
High-throughput online and batch inference workloads.

Technical details

Rust/Python hybrid implementation with launcher, server, and client tooling.
GPU optimizations (FlashAttention, tensor parallelism), quantization support and multiple hardware backends.
REST API with OpenAPI documentation for easy integration; gRPC connects the router to the model server.
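
As a rough sketch of what integrating with the REST API can look like, the snippet below builds a request for the `/generate` endpoint and parses one server-sent event line from `/generate_stream`, using only the Python standard library. The base URL `http://localhost:8080` and the helper names are assumptions for illustration. The payload and event shapes follow the commonly documented TGI format and should be checked against the OpenAPI documentation of the deployed version.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=20):
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_stream_line(line):
    """Return the token text from one SSE line of /generate_stream.

    Returns None for lines that carry no data (for example blank
    separator lines between events).
    """
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return None
    event = json.loads(text[len("data:"):])
    return event["token"]["text"]


def generate(base_url, prompt, max_new_tokens=20):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(base_url, prompt, max_new_tokens)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["generated_text"]


# Example call, assuming a server is listening on this hypothetical address:
# print(generate("http://localhost:8080", "What is deep learning?"))
```

Keeping request construction and event parsing separate from the network call makes both easy to test without a live server.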

Related Resources

Evaluation Guidebook

TRL

LeRobot