Recently, I’ve been deeply involved in the AI-native space, so my attention to Istio had waned. However, the release of Istio 1.28 reignited my interest.
As enterprises deploy LLM online inference services (vLLM, TGI, SGLang, llama.cpp, etc.) at scale, network requirements are shifting from traditional microservice governance toward high-throughput, strongly consistent, and highly observable AI inference data planes. With the official release of Istio 1.28, Service Mesh for the first time provides native support for large-model inference.
This is a significant milestone for AI Infra architects: Service Mesh is no longer just for microservices; it is now a foundational component of LLM inference platforms.
Service Mesh is becoming a key part of AI inference infrastructure. Istio 1.28’s native LLM support marks the network layer’s official entry into the AI era.
Article Guide
This article systematically analyzes the key impacts of Istio 1.28 on large language model (LLM) inference infrastructure, covering InferencePool, Ambient Multicluster, nftables, Dual-stack, and enhancements in observability and security.
Main topics include:
- InferencePool v1: Service Mesh’s first native embrace of AI inference
- Ambient Multicluster: L7 governance across GPU networks
- nftables support: Modern networking for high-concurrency inference
- Dual-stack: LLM clusters in the IPv6 era
- Enhanced observability and security
- One diagram: Istio’s role in LLM inference clusters
InferencePool v1: Service Mesh Officially Enters the AI Inference Era
The most notable update in Istio 1.28 is that InferencePool v1, part of the Gateway API Inference Extension, has reached stable status. For LLM inference infrastructure, this is a "qualitative leap" rather than a mere incremental change.
Inference traffic in enterprise deployments faces many challenges:
- Gray routing for multiple model versions (e.g., v1/v2)
- Load balancing across heterogeneous GPU clusters (A100, H20, MI300)
- Lifecycle management of multiple inference pool replicas
- Automatic removal of unstable inference nodes (OOM, H2 connection drops)
- Network governance for remote GPU clusters (independent VPCs)
Previously, these issues required fragmented solutions across business logic, inference platforms, Ingress, Gateway, and Operators, resulting in complex architectures and high operational costs.
With InferencePool, GPU inference nodes become first-class resources in the service mesh. Istio 1.28 introduces capabilities such as:
- Unified abstraction of model inference endpoints (Endpoint Pool)
- Intelligent load balancing (version, health, latency)
- Smart scheduling across multi-cluster / multi-GPU resource pools
- Automatic failover (e.g., removal of replicas after GPU dropout or OOM)
- Native integration with Gateway API (stable API)
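As a concrete, if simplified, illustration, here is roughly how an inference pool and its route might be declared. All names are made up, a vLLM deployment labeled `app: vllm-llama3` serving on port 8000 is assumed, and the exact field names should be verified against the Inference Extension version that ships with Istio 1.28:

```yaml
# InferencePool turns a set of vLLM pods into a first-class mesh resource.
# Names, labels, and ports are illustrative; field names follow my reading
# of the Inference Extension v1 schema and should be double-checked.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: llama3-pool
  namespace: llm-inference
spec:
  selector:
    matchLabels:
      app: vllm-llama3            # pods running the vLLM server
  targetPorts:
  - number: 8000                  # vLLM serving port
  endpointPickerRef:
    name: llama3-epp              # endpoint picker service doing model-aware scheduling
    port:
      number: 9002
---
# Route gateway traffic to the pool instead of a plain Service.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama3-route
  namespace: llm-inference
spec:
  parentRefs:
  - name: inference-gateway       # assumed Gateway fronting the inference platform
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: llama3-pool
```

The key design point is that the backend of the route is the pool itself, not a plain Service, so version rollout, health-based ejection, and scheduling policy live in the mesh rather than in application code.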
InferencePool’s significance for LLM inference is comparable to DestinationRule’s role for microservices—only at a larger scale and with more complex strategies.
Below is a flowchart illustrating the technical mechanism of InferencePool:
This makes Istio the unified entry point for LLM inference platforms, regardless of whether the backend is vLLM, TGI, SGLang, llama.cpp, or proprietary GPU inference engines. For AI Infra teams, this is a crucial evolution.
Ambient Multicluster: L7 Inference Governance Across GPU Networks
LLM inference clusters are often distributed across different network environments, such as:
- Dedicated GPU zones (high bandwidth, isolated subnet/VPC)
- CPU + RAG + VectorDB in another network
- Multi-datacenter inference pools
Istio 1.28’s Ambient Multicluster brings several key capabilities (a configuration sketch follows this list):
- Inference pools can be deployed in any network
- Applications can enjoy full L7 policies without needing a Sidecar
- GPU clusters can be deployed independently, without affecting the main network
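A rough sketch of the moving parts, assuming a namespace named `llm-inference` on the GPU cluster: the `istio.io/dataplane-mode` label is the standard way to enroll a namespace in ambient mode, while the `istio.io/global` service label follows the ambient multicluster documentation and may still change as the feature matures:

```yaml
# Enroll the GPU inference namespace in ambient mode (no sidecars needed).
apiVersion: v1
kind: Namespace
metadata:
  name: llm-inference
  labels:
    istio.io/dataplane-mode: ambient
---
# Mark the inference Service as reachable from other clusters/networks
# in the mesh (label per the ambient multicluster docs; verify for 1.28).
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3
  namespace: llm-inference
  labels:
    istio.io/global: "true"
spec:
  selector:
    app: vllm-llama3
  ports:
  - name: http
    port: 8000
```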
Additionally, L7 Outlier Detection works across networks:
- If a GPU Pod’s inference latency increases (due to memory fragmentation or deep request queues), it is automatically ejected from the load-balancing pool
- TGI/vLLM errors (OOM, H2Error) trigger automatic failover
- Remote inference replicas with high latency are automatically deprioritized
This self-healing is vital for LLM online inference systems.
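The ejection behavior itself is driven by Istio’s long-standing outlier detection settings; a minimal sketch for a vLLM backend, where the host and thresholds are illustrative and should be tuned to your own error and latency profile:

```yaml
# Eject inference replicas that keep returning server errors (e.g., OOM-induced
# failures) so traffic fails over to healthy replicas, local or remote.
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: vllm-llama3-outlier
  namespace: llm-inference
spec:
  host: vllm-llama3.llm-inference.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 3      # eject after 3 consecutive server errors
      interval: 10s                # how often endpoints are scanned
      baseEjectionTime: 60s        # keep an ejected replica out for at least 60s
      maxEjectionPercent: 50       # never eject more than half the pool
```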
Ambient Multicluster matters for AI Infra because LLM inference workloads are characterized by:
- High sensitivity to latency
- Unstable replica states (large models prone to OOM, connection drops)
- Expensive GPU resources requiring fine-grained scheduling
- Increasing prevalence of multi-machine inference (Mixture-of-Experts, Tensor Parallelism)
For workloads like these, Ambient Multicluster delivers a largely self-managing network: unhealthy replicas are ejected and traffic fails over across networks without manual intervention.
nftables Support: Modern Networking for High-Concurrency LLM Inference
Typical LLM inference workloads include:
- Long-lived connections (HTTP/2, gRPC)
- High-volume streaming traffic (prompt input and token output)
- Frequent short calls (e.g., embedding requests)
In high-concurrency scenarios, iptables can suffer from:
- Performance degradation with large rule sets
- Difficult rule maintenance
- Conntrack bottlenecks under large-model traffic
Istio 1.28 adds official support for a native nftables backend in ambient mode. This brings faster rule matching and better performance under high concurrency, and it is a better fit for the long-lived connections typical of large-model serving. For large-scale inference clusters, the performance gains are significant.
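Enabling it is an install-time choice rather than a per-workload one. The sketch below shows the general shape using an IstioOperator overlay; the exact value name for the nftables backend is my best recollection and must be verified against the Istio 1.28 installation docs and release notes:

```yaml
# IstioOperator sketch enabling the nftables backend for traffic redirection.
# NOTE: the value name below is an assumption; confirm the exact option
# in the Istio 1.28 documentation before using it.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: ambient
  values:
    global:
      nativeNftables: true   # use nftables instead of iptables for redirection
```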
Dual-stack Beta: LLM Inference Networks in the IPv6 Era
Many compute centers (such as domestic GPU clusters and AI data centers) have begun deploying IPv6 networks.
LLM inference demands far more IP addresses than traditional microservices:
- Massive address space for GPU nodes
- High density of training and inference nodes
- Huge number of long-lived connections (one token stream per user)
Istio 1.28 upgrades Dual-stack to Beta, offering:
- Simultaneous IPv4/IPv6 support
- Traffic-governance logic fully adapted to both address families
- Suitable for large data center LLM inference platforms
This is an infrastructure-level evolution.
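On the workload side, dual-stack largely builds on standard Kubernetes configuration, which the mesh can now govern consistently; a sketch of a dual-stack inference Service, assuming the cluster itself is provisioned for IPv4/IPv6 (names illustrative):

```yaml
# Dual-stack Service for an inference backend. The cluster must be
# provisioned with IPv4/IPv6 dual-stack for this to take effect.
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3
  namespace: llm-inference
spec:
  ipFamilyPolicy: PreferDualStack   # fall back to single-stack if unavailable
  ipFamilies:
  - IPv6
  - IPv4
  selector:
    app: vllm-llama3
  ports:
  - name: http
    port: 8000
```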
Enhanced Observability and Security: Value for AI Inference Platforms
Dual trace-context support (B3 and W3C Trace Context) is suitable for scenarios such as:
- Complete call chains from LLM → RAG → VectorDB → Cache → User.
Especially useful for building:
- End-to-end token-level call tracing
- Prompt-based latency profiling
- Model version comparison analysis
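Header propagation is ultimately up to the applications, but the tracing provider and sampling rate can be set mesh- or namespace-wide with Istio’s Telemetry API; a sketch assuming an OpenTelemetry extension provider named `otel` is already defined in MeshConfig:

```yaml
# Enable tracing for the inference namespace via the Telemetry API.
# Assumes MeshConfig defines an extension provider named "otel".
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: inference-tracing
  namespace: llm-inference
spec:
  tracing:
  - providers:
    - name: otel
    randomSamplingPercentage: 10.0   # sample 10% of inference requests
```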
BackendTLSPolicy v1 is used for:
- Calling external large models (Gemini, OpenAI, AWS Bedrock)
- Enforcing stricter TLS validation on those external calls
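A sketch of what this looks like for an external model endpoint, assuming the upstream is represented by a Service (or ServiceEntry) named `openai-api`; the API group/version follows the v1 promotion referenced above, so double-check the Gateway API CRDs installed in your cluster:

```yaml
# Require validated TLS toward an external LLM endpoint.
# Backend name and hostname are illustrative.
apiVersion: gateway.networking.k8s.io/v1
kind: BackendTLSPolicy
metadata:
  name: openai-tls
  namespace: llm-inference
spec:
  targetRefs:
  - group: ""
    kind: Service
    name: openai-api
  validation:
    hostname: api.openai.com
    wellKnownCACertificates: System   # validate against the system CA bundle
```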
Custom JWT claim support is ideal for enterprise scenarios such as:
- Permission control based on model version/capabilities
- Fine-grained access control over “who can access which model”
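The claim-based control builds on Istio’s standard RequestAuthentication and AuthorizationPolicy resources; a sketch with a placeholder issuer and a hypothetical `models` claim issued by your identity provider:

```yaml
# Validate JWTs from the corporate issuer (issuer/jwksUri are placeholders).
apiVersion: security.istio.io/v1
kind: RequestAuthentication
metadata:
  name: llm-jwt
  namespace: llm-inference
spec:
  jwtRules:
  - issuer: "https://idp.example.com"
    jwksUri: "https://idp.example.com/.well-known/jwks.json"
---
# Only callers whose JWT carries models=llama3-70b may reach this backend.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: llama3-access
  namespace: llm-inference
spec:
  selector:
    matchLabels:
      app: vllm-llama3
  action: ALLOW
  rules:
  - when:
    - key: request.auth.claims[models]
      values: ["llama3-70b"]
```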
One Diagram: Istio’s Role in LLM Inference Infrastructure
The following diagram illustrates Istio’s overall architecture within LLM inference infrastructure:
This demonstrates a new reality: In the AI era, Istio not only governs microservices but also serves as the unified data plane for LLM inference services.
Summary
The release of Istio 1.28 marks Service Mesh’s evolution from the microservices era’s network layer to the AI inference era’s compute network layer. InferencePool v1 greatly strengthens AI inference infrastructure, Ambient Multicluster simplifies GPU network management, and nftables and dual-stack capabilities enhance platform scalability.
If you’re building an enterprise-grade LLM inference platform, multi-cluster GPU scheduling system, highly available RAG platform, or edge-cloud collaborative model services, Istio 1.28 is a must-watch release.