The standardization and modularization of cloud native inference systems are making large model deployment as simple and efficient as web services.
Large language model (LLM) inference is evolving from the era of single-machine accelerators to distributed cloud native systems. The most representative combination today is KServe, vLLM, llm-d, and WG Serving. Each plays a distinct role: standard interface, execution engine, scheduling layer, and collaboration specification. Together they form a scalable, observable, and governable inference foundation.
The following timeline outlines the key milestones in the evolution of the quartet:
Architecture Overview
The diagram below illustrates the layered collaboration of the quartet within the inference system:
KServe: The Core of Cloud Native Model Serving
KServe is a Kubernetes-native inference control plane. It abstracts model services as CRDs (Custom Resource Definitions), making AI inference as deployable, scalable, and upgradable as microservices.
The table below summarizes KServe’s core capabilities and new features:
| Dimension | Description |
|---|---|
| Core Goal | Provides Kubernetes-native inference standards and control plane |
| Core Features | CRD standardization, elastic scaling, traffic governance, unified gateway entry |
| New Features | LeaderWorkerSet support, AI Gateway integration, llm-d integration |
KServe’s key capabilities include:
- Unified Interface: The InferenceService CRD defines input/output protocols, compatible with REST/gRPC and the OpenAI API.
- Elastic Scheduling: Supports automatic GPU scaling and ModelMesh multi-model hosting.
- Traffic Governance: Canary releases, A/B testing, and InferenceGraph.
The latest version introduces the LeaderWorkerSet (LWS) mechanism and Envoy AI Gateway extension, making multi-Pod large model inference a native capability. KServe is transitioning from a traditional ML service platform to the standard control plane for generative AI.
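As a minimal sketch of what the InferenceService CRD looks like in practice, the snippet below creates one through the Kubernetes CustomObjectsApi. It assumes the kubernetes Python client and a cluster with KServe's v1beta1 CRDs installed; the namespace, model format, and storageUri are illustrative placeholders, not a prescribed configuration.

```python
# Minimal sketch: create a KServe InferenceService via the Kubernetes CustomObjectsApi.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a Pod

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-chat", "namespace": "llm"},   # illustrative names
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "huggingface"},        # format/runtime depends on your KServe install
                "storageUri": "gs://my-bucket/llama-3-8b",     # illustrative model location
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="llm",
    plural="inferenceservices",
    body=inference_service,
)
```

Once applied, KServe reconciles the object into Pods, a service endpoint, and autoscaling policy, just as it would for any other workload.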
vLLM: High-Performance Inference Execution Engine
vLLM focuses on extreme throughput and memory efficiency, setting the benchmark for open-source performance.
The sequence diagram below shows the main vLLM inference process:
The table below summarizes vLLM’s core technical mechanisms and effects:
| Feature | Technical Mechanism | Effect |
|---|---|---|
| PagedAttention | Memory paging | Longer context, less fragmentation |
| Continuous Batching | Dynamic batch scheduling | Higher GPU utilization |
| Prefix Cache | Prefix reuse | Lower latency and cost |
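To make the PagedAttention row above concrete, here is a toy sketch of block-based KV cache allocation. It is purely illustrative and not vLLM's implementation; the block size, the per-sequence block table, and the out-of-memory behavior are assumptions for demonstration.

```python
# Conceptual sketch of PagedAttention-style KV cache allocation (not vLLM's actual code).
# KV memory is carved into fixed-size blocks; each sequence holds a block table mapping
# logical token positions to physical blocks, so memory grows on demand without fragmentation.

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical block ids

    def reserve(self, seq_id: str, num_tokens: int) -> None:
        """Ensure the sequence has enough blocks to hold num_tokens tokens in total."""
        table = self.block_tables.setdefault(seq_id, [])
        needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        while len(table) < needed:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

allocator = BlockAllocator(num_blocks=1024)
allocator.reserve("req-1", num_tokens=100)        # a 100-token context needs 7 blocks
print(len(allocator.block_tables["req-1"]))       # -> 7
allocator.free("req-1")
```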
vLLM is compatible with the OpenAI API, supports INT8/FP8 quantization and various parallel modes, and adapts to NVIDIA, AMD, TPU, and Gaudi hardware. In single-machine or small-scale scenarios, vLLM can run independently; in cluster environments, it serves as the execution foundation for KServe/llm-d.
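For the single-machine case, here is a minimal sketch using vLLM's offline Python API; the model name is illustrative and any supported checkpoint can be substituted.

```python
# Minimal single-machine example with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small model chosen so the example fits on one GPU
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The key idea behind continuous batching is"], params)
print(outputs[0].outputs[0].text)
```

The same engine can instead be launched as an OpenAI-compatible server, which is how KServe and llm-d consume it in cluster deployments.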
llm-d: Distributed Inference Scheduling Layer
llm-d is a large model scheduling and orchestration system for Kubernetes, enabling multi-instance collaboration for vLLM. Its design goal: make clusters infer like a single machine.
The table below summarizes llm-d’s core mechanisms and technical highlights:
| Module | Function | Technical Highlight |
|---|---|---|
| Scheduler | Cache-aware routing | Prefix affinity scheduling |
| Prefill/Decode Separation | Heterogeneous hardware optimization | A100 Prefill + L40 Decode |
| Cache Manager | Global cache index | Hierarchical GPU/CPU/NVMe cache |
The following diagram illustrates llm-d’s distributed scheduling and caching mechanism:
llm-d runs under the KServe control plane in a Leader/Worker pattern. The scheduler can be embedded in Envoy or deployed independently, making real-time routing decisions based on cache and load information. Its emergence enables autonomous scheduling and elastic parallelism for multi-node LLM inference.
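As a rough illustration of the scheduling idea, the sketch below routes a request to the worker holding the longest cached prompt prefix, breaking ties by load. It is conceptual only, not llm-d's actual API; the prefix-chunk size and scoring weights are arbitrary assumptions.

```python
# Conceptual sketch of cache-aware, prefix-affinity routing (not llm-d's real scheduler).
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    load: float                                          # 0.0 (idle) .. 1.0 (saturated)
    cached_prefixes: set = field(default_factory=set)    # hashes of cached prompt prefixes

def prefix_hashes(prompt: str, chunk: int = 256) -> set:
    """Hash fixed-size prefix chunks so prompts sharing a prefix map to the same keys."""
    return {hash(prompt[:i]) for i in range(chunk, len(prompt) + 1, chunk)}

def route(prompt: str, workers: list) -> Worker:
    """Prefer workers that already hold cached prefixes of this prompt, then the least loaded."""
    requested = prefix_hashes(prompt)
    def score(w: Worker) -> float:
        hits = len(requested & w.cached_prefixes)
        return 2.0 * hits - w.load                        # weights are arbitrary for illustration
    return max(workers, key=score)

system_prompt = "You are a helpful assistant." + " " * 300
workers = [
    Worker("decode-0", load=0.7, cached_prefixes=prefix_hashes(system_prompt)),
    Worker("decode-1", load=0.2),
]
print(route(system_prompt + "Summarize this document.", workers).name)  # -> decode-0
```

Even a busier worker wins when it already holds the shared system-prompt prefix, which is exactly the trade-off prefix-affinity scheduling exploits.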
WG Serving: Collaboration Standards and Ecosystem Hub
WG Serving is an AI serving working group driven by the Kubernetes community to define unified inference semantics in Kubernetes.
The table below summarizes WG Serving’s core achievements and standardization contributions:
| Achievement/Standard | Description |
|---|---|
| Gateway API Inference Extension (GIE) | Envoy-based inference gateway protocol supporting model identification, streaming forwarding, priority, and cache-affinity routing |
| LeaderWorkerSet CRD | Explicitly describes Leader–Worker collaboration structure, foundational for llm-d and KServe multi-Pod inference |
| Interface Alignment | Advocates OpenAI-style API integration with K8s resource objects, promoting cross-framework interoperability |
GIE is the ‘unified gateway language’ for cloud native AI inference: just as Ingress defines the entry point for HTTP services, GIE defines the standard semantics and gateway behavior for inference requests within Kubernetes, enabling composable, observable, and extensible inference systems.
The diagram below shows WG Serving’s standardized collaboration within the inference system:
WG Serving is not a product but a standards layer that builds industry consensus, driving a unified language for cloud native AI inference.
Combined Architecture
The table below summarizes the division of labor and roles of the quartet in the system:
| Layer | Component | Role |
|---|---|---|
| Entry Layer | Envoy + GIE | Unified API gateway and traffic hooks |
| Control Layer | KServe + LWS | Lifecycle management, elastic scaling, traffic orchestration |
| Scheduling Layer | llm-d | Prefix-aware routing, cross-Pod collaboration, cache management |
| Execution Layer | vLLM | Efficient inference execution and cache reuse |
The diagram below illustrates the synergy among the quartet:
Clients send requests in OpenAI API format; the GIE gateway routes each request to the optimal Leader, which completes Prefill and hands the KV cache to Workers for Decode, and the result is streamed back to the client. The entire chain features standard interfaces, high throughput, and elastic scaling.
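A minimal client-side sketch of this flow, assuming the openai Python package and an OpenAI-compatible gateway endpoint; the base_url, API key, and model name are placeholders for your own deployment.

```python
# Send an OpenAI-format streaming request to the cluster's gateway entry point.
from openai import OpenAI

client = OpenAI(
    base_url="http://llm-gateway.example.internal/v1",  # GIE/Envoy gateway entry (placeholder)
    api_key="not-needed-behind-internal-gateway",        # placeholder
)

stream = client.chat.completions.create(
    model="llama-3-70b-instruct",                        # model name is illustrative
    messages=[{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
    stream=True,                                          # tokens stream back as they are decoded
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```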
Ecosystem Convergence Trends
The table below summarizes the convergence trends and feature comparisons in the cloud native LLM inference ecosystem:
| Trend/Feature | Description |
|---|---|
| API Unification | OpenAI-style interfaces have become the de facto standard; KServe and vLLM support them natively |
| Module Decoupling | Gateway, scheduling, and inference are layered for independent evolution and replacement |
| Hierarchical Caching | GPU–CPU–NVMe three-level KV cache is mainstream |
| Community Collaboration | WG Serving, PyTorch Foundation, CNCF jointly promote cross-project integration |
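As a rough illustration of the hierarchical caching trend listed above, the sketch below chains GPU, CPU, and NVMe tiers with LRU eviction downward and promotion on hit. Tier capacities and the promotion policy are illustrative assumptions, not any specific project's design.

```python
# Conceptual sketch of a GPU–CPU–NVMe hierarchical KV cache (illustrative only).
from collections import OrderedDict

class Tier:
    def __init__(self, name: str, capacity: int):
        self.name, self.capacity = name, capacity
        self.entries: OrderedDict[str, bytes] = OrderedDict()  # key -> serialized KV blocks

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)              # LRU touch
            return self.entries[key]
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            return self.entries.popitem(last=False)    # evict LRU entry to the next tier down
        return None

class HierarchicalKVCache:
    def __init__(self):
        self.tiers = [Tier("gpu", 8), Tier("cpu", 64), Tier("nvme", 1024)]  # capacities are illustrative

    def insert(self, key, value):
        evicted, level = (key, value), 0
        while evicted is not None and level < len(self.tiers):
            evicted = self.tiers[level].put(*evicted)  # demote evicted entries down the hierarchy
            level += 1

    def lookup(self, key):
        for i, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                if i > 0:                              # promote hot entries back toward the GPU tier
                    self.insert(key, value)
                return value
        return None

cache = HierarchicalKVCache()
cache.insert("prefix-hash-123", b"kv-bytes")
print(cache.lookup("prefix-hash-123") is not None)     # -> True
```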
The following matrix compares the core capabilities of each project:
| Project | Control Plane | Inference Performance | Distributed Capability | Interface Compatibility | Cache Mechanism | Elastic Scaling |
|---|---|---|---|---|---|---|
| KServe | ✅ CRD / LWS | ⚪ Medium | ⭐ Multi-model management | ✅ OpenAI API | ⚪ None | ✅ |
| vLLM | ⚪ None | 🌟 Very High | ⭐ Multi-GPU | ✅ OpenAI API | ✅ Paged KV | ⚪ None |
| llm-d | ⭐ K8s-native scheduling | 🌟 High | 🌟 Multi-instance collaboration | ✅ Inherits upper-layer interface | 🌟 Global cache | ✅ |
| WG Serving | 🌟 Standard abstraction | ⚪ None | 🌟 Cross-project collaboration | 🌟 Unified specification | ⚪ Not involved | ⚪ |
The future inference stack will center on standard APIs and pluggable modules, enabling deployment of large language models (LLMs) as easily as web services.
Deployment Paradigm Example
In a Kubernetes cluster, the deployment paradigm for the quartet is as follows:
- Prefill Layer: 4 × A100 Pods, responsible for long-context computation.
- Decode Layer: 16 × L4 Pods, performing streaming generation.
- llm-d Scheduler: Dynamically routes based on cache hit rate.
- KServe Control Plane: Manages LWS resources and scaling.
- Envoy GIE Gateway: Unified OpenAI interface entry.
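One way such a Leader/Worker topology might be declared is as a LeaderWorkerSet object (four groups, each with one Prefill leader and four Decode workers). Field names follow the leaderworkerset.x-k8s.io/v1 API as commonly documented, but the images, GPU resource names, and group sizes are illustrative; verify against the CRD installed in your cluster.

```python
# Sketch of a LeaderWorkerSet manifest for the prefill/decode topology described above.
import json

def pod_template(role: str, image: str) -> dict:
    """Minimal PodTemplateSpec with one GPU-requesting container (illustrative)."""
    return {
        "metadata": {"labels": {"role": role}},
        "spec": {
            "containers": [{
                "name": role,
                "image": image,                                  # placeholder image name
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }]
        },
    }

leader_worker_set = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "llama-3-70b", "namespace": "llm"},
    "spec": {
        "replicas": 4,                          # four inference groups
        "leaderWorkerTemplate": {
            "size": 5,                          # 1 prefill leader + 4 decode workers per group
            "leaderTemplate": pod_template("prefill", "example/vllm-prefill:latest"),
            "workerTemplate": pod_template("decode", "example/vllm-decode:latest"),
        },
    },
}

print(json.dumps(leader_worker_set, indent=2))  # apply with kubectl or the CustomObjectsApi
```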
The topology diagram below shows the deployment structure:
This combination achieves high concurrency, low cost, and observability for large model services.
Conclusion: The Future of Standardization
The table below summarizes the layers, roles, and core contributions of the quartet in the inference system:
| Layer | Role | Core Contribution |
|---|---|---|
| Entry | WG Serving (GIE) | Unified traffic entry and interface specification |
| Control | KServe | Kubernetes-native deployment and management |
| Scheduling | llm-d | Prefix cache-aware distributed inference scheduling |
| Execution | vLLM | High-performance, low-cost inference engine |
This “quartet” marks the beginning of a standardized and composable era for large model inference. Future trends will focus on:
- API standardization (OpenAI / OpenInference)
- Hierarchical and shared caching
- Decoupling of control and data planes
- Integrated orchestration on cloud native platforms
Summary
The cloud native LLM inference quartet—KServe, vLLM, llm-d, and WG Serving—is driving standardization, modularization, and ecosystem convergence in inference systems. Through layered collaboration and standard interfaces, developers can achieve high-performance, low-cost, and observable large language model inference services, accelerating the adoption and innovation of AI-native architectures.