Overview
Text Generation Inference (TGI) is an open-source server and toolkit from Hugging Face for deploying and serving large language models (LLMs) at high performance, with token streaming, continuous batching, and production-grade observability.
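As a quick illustration, the sketch below sends a single prompt to a running TGI instance over its REST `/generate` endpoint. The server URL and prompt are placeholder assumptions; it presumes TGI is already serving a model on localhost:8080.

```python
# Minimal sketch: one prompt, one response, via TGI's /generate endpoint.
# The URL and prompt below are illustrative assumptions.
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Text Generation Inference?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
response.raise_for_status()
# TGI returns the completion under the "generated_text" key.
print(response.json()["generated_text"])
```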
Key features
- High performance: tensor parallelism for multi-GPU serving, continuous batching of incoming requests, and token streaming (see the streaming sketch after this list).
- Broad model and hardware support: serves popular open models such as Llama, Falcon, and StarCoder, with optimized backends for NVIDIA and AMD GPUs and other accelerators.
- Production-ready: built-in Prometheus metrics, OpenTelemetry distributed tracing, and other observability integrations.
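The streaming feature above is exposed through the `/generate_stream` endpoint, which emits server-sent events, one JSON payload per generated token. Below is a hedged sketch of consuming that stream with plain `requests`; the URL and prompt are illustrative assumptions.

```python
# Sketch of token streaming from TGI's /generate_stream endpoint.
# Each server-sent event line ("data:{...}") carries one token.
import json
import requests

with requests.post(
    "http://localhost:8080/generate_stream",  # assumed server address
    json={"inputs": "Write a haiku about GPUs.",
          "parameters": {"max_new_tokens": 48}},
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            # Print each token's text as it arrives.
            print(event["token"]["text"], end="", flush=True)
```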
Use cases
- On-premises inference services for enterprises that require data privacy.
- Backend for retrieval-augmented generation (RAG), chat assistants, or code generation services (see the chat sketch after this list).
- High-throughput online and batch inference workloads.
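For the chat-assistant use case, TGI exposes an OpenAI-compatible Messages API, so existing OpenAI client code can point at a TGI instance. The sketch below assumes a server on localhost:8080; the placeholder `api_key` and `model` values follow the conventions in TGI's documentation for this endpoint.

```python
# Sketch: using a TGI instance as a chat backend via its OpenAI-compatible
# /v1/chat/completions endpoint. base_url, api_key, and model are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

completion = client.chat.completions.create(
    model="tgi",  # TGI serves a single model; the name here is a placeholder
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Explain continuous batching in one sentence."},
    ],
    max_tokens=100,
)
print(completion.choices[0].message.content)
```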
Technical details
- Hybrid Rust/Python implementation: a Rust launcher and router handle request scheduling and batching, while Python model servers run the forward pass; client tooling is included.
- GPU optimizations (FlashAttention, PagedAttention, tensor parallelism), quantization support (e.g., bitsandbytes, GPTQ, AWQ), and multiple hardware backends.
- REST API documented with OpenAPI (gRPC is used internally between the router and model shards) for easy integration.
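To illustrate the integration surface, the sketch below queries two of the server's built-in endpoints: `/info` for the served model's configuration and `/metrics` for Prometheus-format metrics. The base URL is an assumption, and the interactive OpenAPI docs are typically served at `/docs` on the same host.

```python
# Sketch: probing a running TGI server's introspection endpoints.
# The base URL is an illustrative assumption.
import requests

base = "http://localhost:8080"

# /info reports the served model and server configuration.
info = requests.get(f"{base}/info", timeout=10).json()
print("model:", info.get("model_id"))

# /metrics returns Prometheus exposition-format text (one metric per line).
metrics = requests.get(f"{base}/metrics", timeout=10).text
print("\n".join(metrics.splitlines()[:5]))
```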