Introduction
Gateway API Inference Extension (Inference Gateway) combines the Kubernetes Gateway API with Envoy’s External Processing (ext-proc) filter to provide Kubernetes-native capabilities for routing, scheduling and optimizing inference requests for self-hosted generative AI workloads.
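Concretely, a standard HTTPRoute forwards matching traffic to an InferencePool backend rather than a plain Service; the gateway then asks the pool's Endpoint Picker, over ext-proc, which model server endpoint should receive each request. A minimal sketch, assuming the v1alpha2 `inference.networking.x-k8s.io` API group and hypothetical resource names:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route                # hypothetical name
spec:
  parentRefs:
  - name: inference-gateway      # a Gateway provisioned by, e.g., Envoy Gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io   # assumes the v1alpha2 API group
      kind: InferencePool                    # pool of model server endpoints
      name: vllm-llama3-8b-instruct
```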
Key features
- Kubernetes-native declarative APIs (InferenceObjective / InferencePool) for routing and traffic control; a manifest sketch follows this list.
- Pluggable scheduling via the Endpoint Picker (EPP), supporting cost/performance-aware routing and prefix-cache-aware load balancing.
- Operations and observability: Grafana dashboards, end-to-end tests, comprehensive docs and examples.
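The manifest sketch referenced above: an InferencePool selects the model server pods and attaches the EPP that performs endpoint selection. Field names follow the v1alpha2 shape; pool, label and Service names are hypothetical.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000              # port the model servers listen on
  selector:                           # labels of the model server pods
    app: vllm-llama3-8b-instruct
  extensionRef:                       # Service fronting the Endpoint Picker (ext-proc)
    name: vllm-llama3-8b-instruct-epp
```

The EPP referenced here is where the pluggable scheduling logic runs, including the cost/performance-aware and prefix-cache-aware policies.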
Use cases
- Multi-model inference platforms on Kubernetes that need cost/performance-aware request routing.
- Integrating model routing into an AI gateway with LoRA adapter support, A/B traffic splitting and safety isolation (see the traffic-split sketch after this list).
- Integration with vLLM, llm-d and other model servers for disaggregated, scalable serving.
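The traffic-split sketch referenced above, in the earlier v1alpha2 InferenceModel shape (newer releases restructure this API around InferenceObjective): one client-facing model name is split by weight across two hypothetical LoRA adapters served from the same pool.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review          # model name clients send in the request body
  criticality: Standard
  poolRef:
    name: vllm-llama3-8b-instruct
  targetModels:                   # weighted A/B split across LoRA adapters
  - name: food-review-lora-v1     # hypothetical adapter names
    weight: 90
  - name: food-review-lora-v2
    weight: 10
```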
Technical details
- Implemented primarily in Go, with Python tooling and examples, a documentation site and test suites included in the repo.
- Speaks Envoy's ext-proc protocol, integrates with Envoy Gateway, and includes adapters for multiple model server protocols.
- Provides CRDs, controllers and deployment scripts, and includes examples, benchmarks and end-to-end test workflows; one CRD sketch follows.
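For example, the InferenceObjective CRD expresses per-workload priority against a pool. A sketch assuming the v1alpha2 shape, where higher-priority traffic is favored when the pool is saturated (field semantics and names here are assumptions):

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: chat-high-priority        # hypothetical name
spec:
  priority: 10                    # assumption: higher values are preferred under load
  poolRef:
    name: vllm-llama3-8b-instruct
```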