Introduction
Gateway API Inference Extension (Inference Gateway) combines the Kubernetes Gateway API with Envoy’s External Processing (ext-proc) filter to provide Kubernetes-native capabilities for routing, scheduling and optimizing inference requests for self-hosted generative AI workloads.
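Concretely, a standard HTTPRoute forwards matching traffic to an InferencePool backend rather than a plain Service; the gateway then asks the pool's Endpoint Picker, over ext-proc, which model server endpoint should receive each request. A minimal sketch, assuming the v1alpha2 `inference.networking.x-k8s.io` API group and hypothetical resource names:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route                # hypothetical name
spec:
  parentRefs:
  - name: inference-gateway      # a Gateway provisioned by, e.g., Envoy Gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io   # assumes the v1alpha2 API group
      kind: InferencePool                    # pool of model server endpoints
      name: vllm-llama3-8b-instruct
```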
Key features
- Kubernetes-native declarative APIs (InferenceObjective / InferencePool) for routing and traffic control; a manifest sketch follows this list.
- Pluggable scheduling via the Endpoint Picker (EPP), supporting cost/performance-aware routing and prefix-cache-aware load balancing.
- Operations and observability: Grafana dashboards, end-to-end tests, comprehensive docs and examples.
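The manifest sketch referenced above: an InferencePool selects the model server pods and attaches the EPP that performs endpoint selection. Field names follow the v1alpha2 shape; pool, label and Service names are hypothetical.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000              # port the model servers listen on
  selector:                           # labels of the model server pods
    app: vllm-llama3-8b-instruct
  extensionRef:                       # Service fronting the Endpoint Picker (ext-proc)
    name: vllm-llama3-8b-instruct-epp
```

The EPP referenced here is where the pluggable scheduling logic runs, including the cost/performance-aware and prefix-cache-aware policies.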
Use cases
- Multi-model inference platforms on Kubernetes that need cost/performance-aware request routing.
- Integrating model routing into an AI gateway with LoRA adapter support, A/B traffic splitting and safety isolation (see the traffic-split sketch after this list).
- Integration with vLLM, llm-d and other model servers for disaggregated, scalable serving.
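The traffic-split sketch referenced above, in the earlier v1alpha2 InferenceModel shape (newer releases restructure this API around InferenceObjective): one client-facing model name is split by weight across two hypothetical LoRA adapters served from the same pool.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review          # model name clients send in the request body
  criticality: Standard
  poolRef:
    name: vllm-llama3-8b-instruct
  targetModels:                   # weighted A/B split across LoRA adapters
  - name: food-review-lora-v1     # hypothetical adapter names
    weight: 90
  - name: food-review-lora-v2
    weight: 10
```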
Technical details
- Implemented primarily in Go, with Python tooling and examples, a documentation site and test suites included in the repo.
- Speaks Envoy's ext-proc protocol, integrates with Envoy Gateway, and includes adapters for multiple model server protocols.
- Provides CRDs, controllers and deployment scripts, and includes examples, benchmarks and end-to-end test workflows; one CRD sketch follows.
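For example, the InferenceObjective CRD expresses per-workload priority against a pool. A sketch assuming the v1alpha2 shape, where higher-priority traffic is favored when the pool is saturated (field semantics and names here are assumptions):

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: chat-high-priority        # hypothetical name
spec:
  priority: 10                    # assumption: higher values are preferred under load
  poolRef:
    name: vllm-llama3-8b-instruct
```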