Beyond Gateway: Inference Traffic Control Practice with Gateway API Inference Extension

Exploring how Gateway API Inference Extension brings model-aware inference traffic control through InferencePool, InferenceObjective, and metrics-driven routing.

AI inference traffic governance is undergoing a paradigm shift: the Gateway API Inference Extension makes "model awareness" the new centerpiece of traffic control.

Brief Review of Gateway API’s Current State

Kubernetes Gateway API has entered the stable v1 series and continued iterating after the 1.0 GA release, enhancing advanced traffic governance capabilities such as WebSocket, timeout and retry, Service Mesh integration, GRPCRoute, request mirroring, CORS, Retry Budget, and more. Major cloud providers and gateway implementations (such as Alibaba Cloud ACK, GKE Gateway, Envoy Gateway, NGINX Gateway Fabric) have all adopted Gateway API as the new generation north-south traffic model.

Building on this foundation, the community has proposed an extension specification specifically for AI inference traffic: the Gateway API Inference Extension. This specification is not about "reinventing an API" but rather supplementing the Gateway API core model with model-aware load balancing and traffic control capabilities for Large Language Model (LLM) and other inference scenarios.

Early documentation often mentioned the InferenceModel + InferencePool CRD combination; the latest specification has evolved to InferenceObjective + InferencePool (with optional InferencePoolImport). This article consistently uses the latest terminology.

Typical Challenges in GenAI Inference Traffic

Traditional gateway, Ingress, and Service Mesh load balancing models are essentially agnostic to both request content and endpoint state: they distribute traffic across a group of static backends through algorithms like round-robin, least requests, and hashing.

In GPU-powered Large Language Model (LLM) inference scenarios, this model exposes obvious problems:

  • Invisible GPU utilization and queuing

    A single LLM instance simultaneously maintains KV Cache, LoRA Adapter, and Token queues. Resource consumption varies significantly across the same batch of requests. Load balancing based solely on QPS or connection count can easily lead to extreme situations where “idle GPUs have no work while busy GPUs crash.”

  • Lack of semantic binding between models and requests

    From the business perspective, one typically only sees a POST /v1/chat/completions endpoint, making it difficult to express intentions like “high-priority model / test version / canary weight” at the routing layer.

  • Difficult unified management of multiple models, versions, and LoRAs

    Each model service implements its own routing and A/B testing solutions, making it difficult for the platform to achieve unified governance and observability at the control plane level.

The goal of Gateway API Inference Extension is to introduce AI-specific semantics and metrics into load balancing and traffic control decisions while maintaining the existing Gateway model.

Core Concepts and Resource Model of Gateway API Inference Extension

This section introduces the overall architecture and key resource objects of the Inference Extension.

Overall Architecture

The Inference Extension uses the Envoy External Processing (ext-proc) mechanism to upgrade gateways that support Gateway API plus ext-proc (such as Envoy Gateway, kgateway, GKE Gateway) into Inference Gateways. Requests still follow the standard Gateway + HTTPRoute path, but before being forwarded to the backend, they pass through an "Endpoint Picker" component that selects the most suitable backend instance based on real-time metrics exposed by the model server.

The flowchart below shows the overall architecture:

Figure 1: Inference Extension Architecture Flow

InferencePool: Platform-Side “Model Service Pool”

InferencePool is the core resource introduced by the Inference Extension, used to describe a group of inference Pods and routing plugin configurations. It is similar to a Service with a selector, responsible for selecting a group of model service Pods and specifying exposed ports, while also allowing attachment of Endpoint Picker plugins (such as Prefix-Cache-Aware, LoRA-aware, etc.).

In the Gateway API model, InferencePool is treated as a type of “Backend” that can be referenced by HTTPRoute.backendRefs.

The code block below shows a simplified example of InferencePool:

apiVersion: inference.networking.x-k8s.io/v1
kind: InferencePool
metadata:
  name: vllm-llama3-chat-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama3-chat
  extensionRef:
    name: prefix-cache-aware-endpoint-picker

The meaning of the above configuration is as follows:

  • Select all Pods with app=vllm-llama3-chat, port 8000.
  • Use the plugin named prefix-cache-aware-endpoint-picker to make routing decisions based on metrics such as KV Cache hit rate, queue length, and GPU utilization.

InferenceObjective: Business-Side “Request Objective”

InferenceObjective is used to express the goal and priority of a request, decoupled from the model service pool. Each request is associated with one InferenceObjective, and the same InferencePool can serve many different InferenceObjectives.

Typical fields include:

  • Business criticality (Critical / High / BestEffort)
  • Required model family / version preference
  • Acceptable latency / cost upper limits, etc.

The Endpoint Picker can combine InferenceObjective with backend metrics to make decisions: prioritize Critical requests when resources are constrained, and shed BestEffort requests when necessary.

InferencePoolImport: Cross-Cluster / Gateway Reuse

InferencePoolImport supports importing InferencePools defined in remote clusters into the local cluster, facilitating consistent governance of multi-cluster, multi-region inference services.
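As a rough illustration, an imported pool can then be referenced like a local InferencePool. The sketch below is hypothetical: InferencePoolImport is new, its schema varies by version, and (similar to the Multi-Cluster Services ServiceImport pattern) the spec is typically populated by a multi-cluster controller rather than written by hand.

```yaml
# Hypothetical sketch; field names and apiVersion may differ by release.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePoolImport
metadata:
  name: chat-llama3-pool        # mirrors the InferencePool exported by a remote cluster
spec: {}                        # status/endpoints populated by the multi-cluster controller
```

A local HTTPRoute could then reference `chat-llama3-pool` in `backendRefs`, letting the local Gateway route to inference replicas running in another cluster.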

Maturity and Implementation Ecosystem of Inference Extension

The current project version is v1.1.x, overall still in the Alpha stage. The official recommendation is not to use it in production yet; it is better suited for platform teams experimenting with the technology stack.

Multiple implementations and integrations already exist:

  • Inference Gateway implementations for Envoy Gateway / kgateway
  • GKE Inference Gateway: Enhanced capabilities based on GKE Gateway, including KV Cache-aware routing, LoRA reuse, priority scheduling, etc.
  • NGINX Gateway Fabric, cloud provider ACK, and others are also following up on related extensions

Therefore, when designing practical solutions, it should be treated as a “forward-looking solution / future main path.” For production environments, prioritize managed implementations (such as GKE Inference Gateway) or commercial products from gateway vendors.

Practice: Using Inference Extension for Inference Traffic Control

The following example uses a self-hosted LLM cluster on Kubernetes to demonstrate how to serve external traffic through an OpenAI-compatible interface and achieve:

  • Scheduling by request priority (real-time conversation vs. batch processing)
  • Multi-version model canary and rollback
  • Optimized routing using GPU metrics and KV Cache hit rates
  • Unified platform-side observability and rate limiting entry point

Deploying Gateway API and Inference Extension

The deployment process is as follows:

  1. Install a Gateway API-compatible gateway implementation (such as Envoy Gateway, kgateway, GKE Gateway) in the cluster.
  2. Install the Gateway API Inference Extension CRD and control plane components.
  3. Enable the metrics endpoints and plugin protocols required by Inference Extension on the model server side (such as vLLM, Triton, TGI).
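The three steps above can be sketched as follows. Note that the chart version and manifest URL are illustrative assumptions based on the project's release layout; check the Envoy Gateway and gateway-api-inference-extension release pages for the current versions and asset names before applying.

```shell
# 1. Install a Gateway API implementation (example: Envoy Gateway via Helm).
#    The version shown is illustrative; pick a current release.
helm install eg oci://docker.io/envoyproxy/gateway-helm \
  --version v1.3.0 -n envoy-gateway-system --create-namespace

# 2. Install the Inference Extension CRDs and control plane components.
#    The asset path is an assumption; verify against the project's release page.
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml

# 3. Confirm the CRDs are registered before deploying model servers
#    (vLLM, Triton, TGI) with their metrics endpoints enabled.
kubectl get crd | grep inference
```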

Defining InferencePool: Abstracting LLM Pods as an “Inference Pool”

The code block below shows a typical InferencePool configuration:

apiVersion: inference.networking.x-k8s.io/v1
kind: InferencePool
metadata:
  name: chat-llama3-pool
spec:
  targetPortNumber: 8000
  selector:
    app: chat-llama3
  extensionRef:
    name: prefix-cache-aware
  # Optional: Plugin configuration ConfigMap / CR

Key points:

  • selector selects LLM Pods, targetPortNumber specifies the inference service port.
  • extensionRef binds the Endpoint Picker plugin, implementing KV Cache prefix-aware routing, selecting replicas with lighter load based on metrics like queue_length/gpu_utilization, and triggering load shedding when necessary.

Defining InferenceObjective: Connecting “Business Intent” to Routing Decisions

The code block below shows a sample InferenceObjective configuration (fields can be adjusted according to the actual version and implementation):

apiVersion: inference.networking.x-k8s.io/v1
kind: InferenceObjective
metadata:
  name: chat-critical
spec:
  criticality: Critical         # Real-time conversation
  preferredModel: llama3-70b
  fallbackModel: llama3-8b
---
apiVersion: inference.networking.x-k8s.io/v1
kind: InferenceObjective
metadata:
  name: chat-batch
spec:
  criticality: BestEffort       # Batch analysis, log summarization
  preferredModel: llama3-8b

The Endpoint Picker can combine InferenceObjective with InferencePool metrics to make the following decisions:

  • When GPU is constrained and queues are too long, prioritize chat-critical requests and discard chat-batch if necessary.
  • For Critical requests, prioritize the large model; if the target pool is unavailable, fall back to the small model pool.

Importing Business Traffic to InferencePool via HTTPRoute

The code block below shows the HTTPRoute configuration on the Gateway API side:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: public-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/chat/completions
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: chat-llama3-pool

Some implementations (such as GKE Inference Gateway) can route based on the model field in the request body, mapping OpenAI-style model names to different InferencePool / InferenceObjective combinations.
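One way this is commonly wired up: the upstream project ships a body-based routing extension that extracts the `model` field from the JSON body and injects it as a request header, which a standard HTTPRoute can then match on. The sketch below assumes that mechanism and the header name `X-Gateway-Model-Name`; treat both as implementation-specific details to verify against your gateway.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route-by-model
spec:
  parentRefs:
    - name: public-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/chat/completions
          headers:
            # Assumed to be injected by the body-based routing extension
            # from the "model" field of the OpenAI-style request body.
            - name: X-Gateway-Model-Name
              value: llama3-70b
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: chat-llama3-pool
```

This keeps the external API a single OpenAI-compatible endpoint while fanning different model names out to different pools.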

Fine-Grained Inference Traffic Control Using Metrics

Inference Extension provides platform teams with a unified metrics system, including kv_cache_hits, gpu_utilization, request_queue_length, per-request inference duration, token count, and more.

Based on these metrics, multi-level traffic control strategies can be built:

  • Priority + Capacity

    Set priority and capacity upper limits for different InferenceObjectives, automatically guaranteeing critical business when resources are constrained.

  • Rate Limiting by Cost / Token

    Aggregate token / latency metrics exposed by Inference Extension to Prometheus, then add cost-based rate limiting logic at the gateway / API Gateway level (such as total tokens per minute per user / application). The specification itself doesn’t mandate “Token-level rate limiting,” but provides observability and hooks for easy policy extension.

  • Prefix Cache Aware Routing

    For requests with shared context (such as RAG, template generation), enable the Prefix Cache Aware plugin to route requests with the same prefix to the same replica, maximizing KV Cache hit rates and significantly reducing TTFT.

  • Auto-Scaling Integration

    Use metrics output by Inference Extension as HPA input to achieve “model-aware” auto-scaling, rather than relying solely on CPU / memory.
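For the auto-scaling point above, a minimal sketch of a "model-aware" HPA might look like the following. It assumes a per-Pod queue-depth metric (the name `request_queue_length` and the Deployment name `chat-llama3` are illustrative) has been exported to Prometheus and exposed to the HPA via an adapter such as prometheus-adapter.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: chat-llama3-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: chat-llama3          # illustrative name for the LLM server Deployment
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: request_queue_length   # assumed custom metric via prometheus-adapter
        target:
          type: AverageValue
          averageValue: "4"            # scale out when avg queue depth per replica exceeds 4
```

Scaling on queue depth or KV Cache pressure reacts to the actual inference bottleneck, whereas CPU/memory-based scaling often stays flat while GPUs saturate.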

Relationship and Trade-offs with Traditional Gateway / Service Mesh

  • Control Plane: Continue using Gateway API as the unified north-south / east-west traffic modeling specification. Service Mesh can still perform fine-grained circuit breaking, retry, mTLS, etc. within the cluster.
  • Data Plane: Inference Extension pushes "model-aware routing" down to the ext-proc path implemented by the Gateway, so individual business teams no longer need to reinvent this wheel themselves.
  • Adoption Strategy: In the current Alpha stage, prioritize productionized implementations (such as GKE Inference Gateway, commercial gateway vendor Inference Gateways), start with “bypass pilots” outside the critical path, and gradually migrate existing AI gateway routing rules to the Inference Extension model.

Summary

Combining official documentation and community implementations, we can see:

  • Gateway API has become the Kubernetes standard north-south traffic model and continues to enhance in the 1.x series.
  • Gateway API Inference Extension introduces GPU metrics, KV Cache, LoRA, and other inference semantics into load balancing decisions through InferencePool, InferenceObjective, and Endpoint Picker.
  • The project is still in Alpha stage, but has achieved experimental or productionized adoption in implementations such as GKE, kgateway, and NGINX Gateway Fabric. It is one of the important future directions for inference traffic control.

Earlier descriptions of the Inference Extension as providing "built-in Token rate limiting CRDs" or objects such as AIInferencePolicy are no longer accurate; they should be replaced with the InferencePool / InferenceObjective model plus the metrics-driven approach described above.

Jimmy Song

Focusing on research and open source practices in AI-Native Infrastructure and cloud native application architecture.
