Recently, I’ve been deeply involved in the AI-native space, so my attention to Istio had waned. However, the release of Istio 1.28 reignited my interest.
As enterprises deploy LLM online inference services (vLLM, TGI, SGLang, llama.cpp, etc.) at scale, network requirements are shifting from traditional microservice governance toward high-throughput, strongly consistent, and highly observable AI inference data planes. With the official release of Istio 1.28, Service Mesh for the first time provides native support for large-model inference.
This is a significant milestone for AI Infra architects: Service Mesh is no longer just for microservices; it is now a foundational component of LLM inference platforms.
Service Mesh is becoming a key part of AI inference infrastructure. Istio 1.28’s native LLM support marks the network layer’s official entry into the AI era.
Article Guide
This article systematically analyzes the key impacts of Istio 1.28 on large language model (LLM) inference infrastructure, covering InferencePool, Ambient Multicluster, nftables, Dual-stack, and enhancements in observability and security.
Main topics include:
- InferencePool v1: Service Mesh’s first native embrace of AI inference
- Ambient Multicluster: L7 governance across GPU networks
- nftables support: Modern networking for high-concurrency inference
- Dual-stack: LLM clusters in the IPv6 era
- Enhanced observability and security
- One diagram: Istio’s role in LLM inference clusters
InferencePool v1: Service Mesh Officially Enters the AI Inference Era
The most notable update in Istio 1.28 is that InferencePool v1, part of the Gateway API Inference Extension, has reached stable status. For LLM inference infrastructure, this is a "qualitative leap" rather than a mere incremental change.
Inference traffic in enterprise deployments faces many challenges:
- Gray routing for multiple model versions (e.g., v1/v2)
- Load balancing across heterogeneous GPU clusters (A100, H20, MI300)
- Lifecycle management of multiple inference pool replicas
- Automatic removal of unstable inference nodes (OOM, H2 connection drops)
- Network governance for remote GPU clusters (independent VPCs)
Previously, these issues required fragmented solutions across business logic, inference platforms, Ingress, Gateway, and Operators, resulting in complex architectures and high operational costs.
With InferencePool, GPU inference nodes become first-class resources in the service mesh. Istio 1.28 introduces capabilities such as:
- Unified abstraction of model inference endpoints (Endpoint Pool)
- Intelligent load balancing (version, health, latency)
- Smart scheduling across multi-cluster / multi-GPU resource pools
- Automatic failover (e.g., removal of replicas after GPU dropout or OOM)
- Native integration with Gateway API (stable API)
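As a concrete, if simplified, illustration, here is roughly how an inference pool and its route might be declared. All names are made up, a vLLM deployment labeled `app: vllm-llama3` serving on port 8000 is assumed, and the exact field names should be verified against the Inference Extension version that ships with Istio 1.28:

```yaml
# InferencePool turns a set of vLLM pods into a first-class mesh resource.
# Names, labels, and ports are illustrative; field names follow my reading
# of the Inference Extension v1 schema and should be double-checked.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: llama3-pool
  namespace: llm-inference
spec:
  selector:
    matchLabels:
      app: vllm-llama3            # pods running the vLLM server
  targetPorts:
  - number: 8000                  # vLLM serving port
  endpointPickerRef:
    name: llama3-epp              # endpoint picker service doing model-aware scheduling
    port:
      number: 9002
---
# Route gateway traffic to the pool instead of a plain Service.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama3-route
  namespace: llm-inference
spec:
  parentRefs:
  - name: inference-gateway       # assumed Gateway fronting the inference platform
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: llama3-pool
```

The key design point is that the backend of the route is the pool itself, not a plain Service, so version rollout, health-based ejection, and scheduling policy live in the mesh rather than in application code.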
InferencePool’s significance for LLM inference is comparable to DestinationRule’s role for microservices—only at a larger scale and with more complex strategies.
Below is a flowchart illustrating the technical mechanism of InferencePool:
This makes Istio the unified entry point for LLM inference platforms, regardless of whether the backend is vLLM, TGI, SGLang, llama.cpp, or proprietary GPU inference engines. For AI Infra teams, this is a crucial evolution.
Ambient Multicluster: L7 Inference Governance Across GPU Networks
LLM inference clusters are often distributed across different network environments, such as:
- Dedicated GPU zones (high bandwidth, isolated subnet/VPC)
- CPU + RAG + VectorDB in another network
- Multi-datacenter inference pools
Istio 1.28’s Ambient Multicluster brings several key capabilities (a configuration sketch follows this list):
- Inference pools can be deployed in any network
- Applications can enjoy full L7 policies without needing a Sidecar
- GPU clusters can be deployed independently, without affecting the main network
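A rough sketch of the moving parts, assuming a namespace named `llm-inference` on the GPU cluster: the `istio.io/dataplane-mode` label is the standard way to enroll a namespace in ambient mode, while the `istio.io/global` service label follows the ambient multicluster documentation and may still change as the feature matures:

```yaml
# Enroll the GPU inference namespace in ambient mode (no sidecars needed).
apiVersion: v1
kind: Namespace
metadata:
  name: llm-inference
  labels:
    istio.io/dataplane-mode: ambient
---
# Mark the inference Service as reachable from other clusters/networks
# in the mesh (label per the ambient multicluster docs; verify for 1.28).
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3
  namespace: llm-inference
  labels:
    istio.io/global: "true"
spec:
  selector:
    app: vllm-llama3
  ports:
  - name: http
    port: 8000
```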
Additionally, L7 Outlier Detection works across networks:
- If a GPU Pod’s inference latency increases (due to memory fragmentation or deep request queues), it is automatically ejected from the load-balancing pool
- TGI/vLLM errors (OOM, H2Error) trigger automatic failover
- Remote inference replicas with high latency are automatically deprioritized
This self-healing is vital for LLM online inference systems.
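The ejection behavior itself is driven by Istio’s long-standing outlier detection settings; a minimal sketch for a vLLM backend, where the host and thresholds are illustrative and should be tuned to your own error and latency profile:

```yaml
# Eject inference replicas that keep returning server errors (e.g., OOM-induced
# failures) so traffic fails over to healthy replicas, local or remote.
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: vllm-llama3-outlier
  namespace: llm-inference
spec:
  host: vllm-llama3.llm-inference.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 3      # eject after 3 consecutive server errors
      interval: 10s                # how often endpoints are scanned
      baseEjectionTime: 60s        # keep an ejected replica out for at least 60s
      maxEjectionPercent: 50       # never eject more than half the pool
```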
Ambient Multicluster matters for AI Infra because LLM inference workloads are characterized by:
- High sensitivity to latency
- Unstable replica states (large models prone to OOM, connection drops)
- Expensive GPU resources requiring fine-grained scheduling
- Increasing prevalence of multi-machine inference (Mixture-of-Experts, Tensor Parallelism)
For workloads like these, Ambient Multicluster delivers a largely self-managing network: unhealthy replicas are ejected and traffic fails over across networks without manual intervention.
nftables Support: Modern Networking for High-Concurrency LLM Inference
Typical LLM inference workloads include:
- Long-lived connections (HTTP/2, gRPC)
- High-volume streaming traffic (prompt input and token output)
- Frequent short calls (e.g., embedding requests)
In high-concurrency scenarios, iptables can suffer from:
- Performance degradation with large rule sets
- Difficult rule maintenance
- Conntrack bottlenecks under large-model traffic
Istio 1.28 adds official support for a native nftables backend in ambient mode. This brings faster rule matching and better performance under high concurrency, and it is a better fit for the long-lived connections typical of large-model serving. For large-scale inference clusters, the performance gains are significant.
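Enabling it is an install-time choice rather than a per-workload one. The sketch below shows the general shape using an IstioOperator overlay; the exact value name for the nftables backend is my best recollection and must be verified against the Istio 1.28 installation docs and release notes:

```yaml
# IstioOperator sketch enabling the nftables backend for traffic redirection.
# NOTE: the value name below is an assumption; confirm the exact option
# in the Istio 1.28 documentation before using it.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: ambient
  values:
    global:
      nativeNftables: true   # use nftables instead of iptables for redirection
```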
Dual-stack Beta: LLM Inference Networks in the IPv6 Era
Many compute centers (such as domestic GPU clusters and AI data centers) have begun deploying IPv6 networks.
LLM inference demands far more IP addresses than traditional microservices:
- Massive address space for GPU nodes
- High density of training and inference nodes
- Huge number of long-lived connections (one token stream per user)
Istio 1.28 upgrades Dual-stack to Beta, offering:
- Simultaneous IPv4/IPv6 support
- Traffic-governance logic fully adapted to both address families
- Suitable for large data center LLM inference platforms
This is an infrastructure-level evolution.
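On the workload side, dual-stack largely builds on standard Kubernetes configuration, which the mesh can now govern consistently; a sketch of a dual-stack inference Service, assuming the cluster itself is provisioned for IPv4/IPv6 (names illustrative):

```yaml
# Dual-stack Service for an inference backend. The cluster must be
# provisioned with IPv4/IPv6 dual-stack for this to take effect.
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3
  namespace: llm-inference
spec:
  ipFamilyPolicy: PreferDualStack   # fall back to single-stack if unavailable
  ipFamilies:
  - IPv6
  - IPv4
  selector:
    app: vllm-llama3
  ports:
  - name: http
    port: 8000
```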
Enhanced Observability and Security: Value for AI Inference Platforms
Dual trace-context support (B3 and W3C Trace Context) is suitable for scenarios such as:
- Complete call chains from LLM → RAG → VectorDB → Cache → User.
Especially useful for building:
- End-to-end token-level call tracing
- Prompt-based latency profiling
- Model version comparison analysis
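Header propagation is ultimately up to the applications, but the tracing provider and sampling rate can be set mesh- or namespace-wide with Istio’s Telemetry API; a sketch assuming an OpenTelemetry extension provider named `otel` is already defined in MeshConfig:

```yaml
# Enable tracing for the inference namespace via the Telemetry API.
# Assumes MeshConfig defines an extension provider named "otel".
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: inference-tracing
  namespace: llm-inference
spec:
  tracing:
  - providers:
    - name: otel
    randomSamplingPercentage: 10.0   # sample 10% of inference requests
```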
BackendTLSPolicy v1 is used for:
- Calling external large models (Gemini, OpenAI, AWS Bedrock)
- Enforcing stricter TLS validation on those external calls
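A sketch of what this looks like for an external model endpoint, assuming the upstream is represented by a Service (or ServiceEntry) named `openai-api`; the API group/version follows the v1 promotion referenced above, so double-check the Gateway API CRDs installed in your cluster:

```yaml
# Require validated TLS toward an external LLM endpoint.
# Backend name and hostname are illustrative.
apiVersion: gateway.networking.k8s.io/v1
kind: BackendTLSPolicy
metadata:
  name: openai-tls
  namespace: llm-inference
spec:
  targetRefs:
  - group: ""
    kind: Service
    name: openai-api
  validation:
    hostname: api.openai.com
    wellKnownCACertificates: System   # validate against the system CA bundle
```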
Custom JWT claim support is ideal for enterprise scenarios such as:
- Permission control based on model version/capabilities
- Fine-grained access control over “who can access which model”
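The claim-based control builds on Istio’s standard RequestAuthentication and AuthorizationPolicy resources; a sketch with a placeholder issuer and a hypothetical `models` claim issued by your identity provider:

```yaml
# Validate JWTs from the corporate issuer (issuer/jwksUri are placeholders).
apiVersion: security.istio.io/v1
kind: RequestAuthentication
metadata:
  name: llm-jwt
  namespace: llm-inference
spec:
  jwtRules:
  - issuer: "https://idp.example.com"
    jwksUri: "https://idp.example.com/.well-known/jwks.json"
---
# Only callers whose JWT carries models=llama3-70b may reach this backend.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: llama3-access
  namespace: llm-inference
spec:
  selector:
    matchLabels:
      app: vllm-llama3
  action: ALLOW
  rules:
  - when:
    - key: request.auth.claims[models]
      values: ["llama3-70b"]
```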
One Diagram: Istio’s Role in LLM Inference Infrastructure
The following diagram illustrates Istio’s overall architecture within LLM inference infrastructure:
This demonstrates a new reality: In the AI era, Istio not only governs microservices but also serves as the unified data plane for LLM inference services.
Summary
The release of Istio 1.28 marks Service Mesh’s evolution from the microservices era’s network layer to the AI inference era’s compute network layer. InferencePool v1 greatly strengthens AI inference infrastructure, Ambient Multicluster simplifies GPU network management, and nftables and dual-stack capabilities enhance platform scalability.
If you’re building an enterprise-grade LLM inference platform, multi-cluster GPU scheduling system, highly available RAG platform, or edge-cloud collaborative model services, Istio 1.28 is a must-watch release.