The standardization and modularization of cloud native inference systems are making large model deployment as simple and efficient as web services.
Large language model (LLM) inference is evolving from the era of single-machine accelerators to distributed cloud native systems. The most representative combination today is KServe, vLLM, llm-d, and WG Serving. Each plays a distinct role: standard interface, execution engine, scheduling layer, and collaboration specification. Together they form a scalable, observable, and governable inference foundation.
The following timeline outlines the key milestones in the evolution of the quartet:
Architecture Overview
The diagram below illustrates the layered collaboration of the quartet within the inference system:
KServe: The Core of Cloud Native Model Serving
KServe is a Kubernetes-native inference control plane. It abstracts model services as CRDs (Custom Resource Definitions), making AI inference as deployable, scalable, and upgradable as microservices.
The table below summarizes KServe’s core capabilities and new features:
| Dimension | Description |
|---|---|
| Core Goal | Provides Kubernetes-native inference standards and control plane |
| Core Features | CRD standardization, elastic scaling, traffic governance, unified gateway entry |
| New Features | LeaderWorkerSet support, AI Gateway integration, llm-d integration |
KServe’s key capabilities include:
- Unified Interface: The InferenceService CRD defines input/output protocols, compatible with REST/gRPC and the OpenAI API.
- Elastic Scheduling: Supports automatic GPU scaling and ModelMesh multi-model hosting.
- Traffic Governance: Canary releases, A/B testing, and InferenceGraph.
The latest version introduces the LeaderWorkerSet (LWS) mechanism and Envoy AI Gateway extension, making multi-Pod large model inference a native capability. KServe is transitioning from a traditional ML service platform to the standard control plane for generative AI.
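As a minimal sketch of what the InferenceService CRD looks like in practice, the snippet below creates one through the Kubernetes CustomObjectsApi. It assumes the kubernetes Python client and a cluster with KServe's v1beta1 CRDs installed; the namespace, model format, and storageUri are illustrative placeholders, not a prescribed configuration.

```python
# Minimal sketch: create a KServe InferenceService via the Kubernetes CustomObjectsApi.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a Pod

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-chat", "namespace": "llm"},   # illustrative names
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "huggingface"},        # format/runtime depends on your KServe install
                "storageUri": "gs://my-bucket/llama-3-8b",     # illustrative model location
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="llm",
    plural="inferenceservices",
    body=inference_service,
)
```

Once applied, KServe reconciles the object into Pods, a service endpoint, and autoscaling policy, just as it would for any other workload.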
vLLM: High-Performance Inference Execution Engine
vLLM focuses on extreme throughput and memory efficiency, setting the benchmark for open-source performance.
The sequence diagram below shows the main vLLM inference process:
The table below summarizes vLLM’s core technical mechanisms and effects:
| Feature | Technical Mechanism | Effect |
|---|---|---|
| PagedAttention | Memory paging | Longer context, less fragmentation |
| Continuous Batching | Dynamic batch scheduling | Higher GPU utilization |
| Prefix Cache | Prefix reuse | Lower latency and cost |
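To make the PagedAttention row above concrete, here is a toy sketch of block-based KV cache allocation. It is purely illustrative and not vLLM's implementation; the block size, the per-sequence block table, and the out-of-memory behavior are assumptions for demonstration.

```python
# Conceptual sketch of PagedAttention-style KV cache allocation (not vLLM's actual code).
# KV memory is carved into fixed-size blocks; each sequence holds a block table mapping
# logical token positions to physical blocks, so memory grows on demand without fragmentation.

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical block ids

    def reserve(self, seq_id: str, num_tokens: int) -> None:
        """Ensure the sequence has enough blocks to hold num_tokens tokens in total."""
        table = self.block_tables.setdefault(seq_id, [])
        needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        while len(table) < needed:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

allocator = BlockAllocator(num_blocks=1024)
allocator.reserve("req-1", num_tokens=100)        # a 100-token context needs 7 blocks
print(len(allocator.block_tables["req-1"]))       # -> 7
allocator.free("req-1")
```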
vLLM is compatible with the OpenAI API, supports INT8/FP8 quantization and various parallel modes, and adapts to NVIDIA, AMD, TPU, and Gaudi hardware. In single-machine or small-scale scenarios, vLLM can run independently; in cluster environments, it serves as the execution foundation for KServe/llm-d.
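For the single-machine case, here is a minimal sketch using vLLM's offline Python API; the model name is illustrative and any supported checkpoint can be substituted.

```python
# Minimal single-machine example with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small model chosen so the example fits on one GPU
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The key idea behind continuous batching is"], params)
print(outputs[0].outputs[0].text)
```

The same engine can instead be launched as an OpenAI-compatible server, which is how KServe and llm-d consume it in cluster deployments.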
llm-d: Distributed Inference Scheduling Layer
llm-d is a large model scheduling and orchestration system for Kubernetes, enabling multi-instance collaboration for vLLM. Its design goal: make clusters infer like a single machine.
The table below summarizes llm-d’s core mechanisms and technical highlights:
| Module | Function | Technical Highlight |
|---|---|---|
| Scheduler | Cache-aware routing | Prefix affinity scheduling |
| Prefill/Decode Separation | Heterogeneous hardware optimization | A100 Prefill + L40 Decode |
| Cache Manager | Global cache index | Hierarchical GPU/CPU/NVMe cache |
The following diagram illustrates llm-d’s distributed scheduling and caching mechanism:
llm-d runs under the KServe control plane in a Leader/Worker pattern. The scheduler can be embedded in Envoy or deployed independently, making real-time routing decisions based on cache and load information. Its emergence enables autonomous scheduling and elastic parallelism for multi-node LLM inference.
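As a rough illustration of the scheduling idea, the sketch below routes a request to the worker holding the longest cached prompt prefix, breaking ties by load. It is conceptual only, not llm-d's actual API; the prefix-chunk size and scoring weights are arbitrary assumptions.

```python
# Conceptual sketch of cache-aware, prefix-affinity routing (not llm-d's real scheduler).
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    load: float                                          # 0.0 (idle) .. 1.0 (saturated)
    cached_prefixes: set = field(default_factory=set)    # hashes of cached prompt prefixes

def prefix_hashes(prompt: str, chunk: int = 256) -> set:
    """Hash fixed-size prefix chunks so prompts sharing a prefix map to the same keys."""
    return {hash(prompt[:i]) for i in range(chunk, len(prompt) + 1, chunk)}

def route(prompt: str, workers: list) -> Worker:
    """Prefer workers that already hold cached prefixes of this prompt, then the least loaded."""
    requested = prefix_hashes(prompt)
    def score(w: Worker) -> float:
        hits = len(requested & w.cached_prefixes)
        return 2.0 * hits - w.load                        # weights are arbitrary for illustration
    return max(workers, key=score)

system_prompt = "You are a helpful assistant." + " " * 300
workers = [
    Worker("decode-0", load=0.7, cached_prefixes=prefix_hashes(system_prompt)),
    Worker("decode-1", load=0.2),
]
print(route(system_prompt + "Summarize this document.", workers).name)  # -> decode-0
```

Even a busier worker wins when it already holds the shared system-prompt prefix, which is exactly the trade-off prefix-affinity scheduling exploits.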
WG Serving: Collaboration Standards and Ecosystem Hub
WG Serving is an AI serving working group driven by the Kubernetes community to define unified inference semantics in Kubernetes.
The table below summarizes WG Serving’s core achievements and standardization contributions:
| Achievement/Standard | Description |
|---|---|
| Gateway API Inference Extension (GIE) | Envoy-based inference gateway protocol supporting model identification, streaming forwarding, priority, and cache-affinity routing |
| LeaderWorkerSet CRD | Explicitly describes Leader–Worker collaboration structure, foundational for llm-d and KServe multi-Pod inference |
| Interface Alignment | Advocates OpenAI-style API integration with K8s resource objects, promoting cross-framework interoperability |
GIE is the ‘unified gateway language’ for cloud native AI inference: just as Ingress defines the entry point for HTTP services, GIE defines the standard semantics and gateway behavior for inference requests within Kubernetes, enabling composable, observable, and extensible inference systems.
The diagram below shows WG Serving’s standardized collaboration within the inference system:
WG Serving is not a product but a standards layer that builds industry consensus, driving a unified language for cloud native AI inference.
Combined Architecture
The table below summarizes the division of labor and roles of the quartet in the system:
| Layer | Component | Role |
|---|---|---|
| Entry Layer | Envoy + GIE | Unified API gateway and traffic hooks |
| Control Layer | KServe + LWS | Lifecycle management, elastic scaling, traffic orchestration |
| Scheduling Layer | llm-d | Prefix-aware routing, cross-Pod collaboration, cache management |
| Execution Layer | vLLM | Efficient inference execution and cache reuse |
The diagram below illustrates the synergy among the quartet:
Clients send requests in OpenAI API format; the GIE gateway routes each request to the optimal Leader, which completes Prefill and hands the KV cache to Workers for Decode, and the result is streamed back to the client. The entire chain features standard interfaces, high throughput, and elastic scaling.
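A minimal client-side sketch of this flow, assuming the openai Python package and an OpenAI-compatible gateway endpoint; the base_url, API key, and model name are placeholders for your own deployment.

```python
# Send an OpenAI-format streaming request to the cluster's gateway entry point.
from openai import OpenAI

client = OpenAI(
    base_url="http://llm-gateway.example.internal/v1",  # GIE/Envoy gateway entry (placeholder)
    api_key="not-needed-behind-internal-gateway",        # placeholder
)

stream = client.chat.completions.create(
    model="llama-3-70b-instruct",                        # model name is illustrative
    messages=[{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
    stream=True,                                          # tokens stream back as they are decoded
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```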
Ecosystem Convergence Trends
The table below summarizes the convergence trends and feature comparisons in the cloud native LLM inference ecosystem:
| Trend/Feature | Description |
|---|---|
| API Unification | OpenAI-style interfaces have become the de facto standard; KServe and vLLM support them natively |
| Module Decoupling | Gateway, scheduling, and inference are layered for independent evolution and replacement |
| Hierarchical Caching | GPU–CPU–NVMe three-level KV cache is mainstream |
| Community Collaboration | WG Serving, PyTorch Foundation, CNCF jointly promote cross-project integration |
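As a rough illustration of the hierarchical caching trend listed above, the sketch below chains GPU, CPU, and NVMe tiers with LRU eviction downward and promotion on hit. Tier capacities and the promotion policy are illustrative assumptions, not any specific project's design.

```python
# Conceptual sketch of a GPU–CPU–NVMe hierarchical KV cache (illustrative only).
from collections import OrderedDict

class Tier:
    def __init__(self, name: str, capacity: int):
        self.name, self.capacity = name, capacity
        self.entries: OrderedDict[str, bytes] = OrderedDict()  # key -> serialized KV blocks

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)              # LRU touch
            return self.entries[key]
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            return self.entries.popitem(last=False)    # evict LRU entry to the next tier down
        return None

class HierarchicalKVCache:
    def __init__(self):
        self.tiers = [Tier("gpu", 8), Tier("cpu", 64), Tier("nvme", 1024)]  # capacities are illustrative

    def insert(self, key, value):
        evicted, level = (key, value), 0
        while evicted is not None and level < len(self.tiers):
            evicted = self.tiers[level].put(*evicted)  # demote evicted entries down the hierarchy
            level += 1

    def lookup(self, key):
        for i, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                if i > 0:                              # promote hot entries back toward the GPU tier
                    self.insert(key, value)
                return value
        return None

cache = HierarchicalKVCache()
cache.insert("prefix-hash-123", b"kv-bytes")
print(cache.lookup("prefix-hash-123") is not None)     # -> True
```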
The following matrix compares the core capabilities of each project:
| Project | Control Plane | Inference Performance | Distributed Capability | Interface Compatibility | Cache Mechanism | Elastic Scaling |
|---|---|---|---|---|---|---|
| KServe | ✅ CRD / LWS | ⚪ Medium | ⭐ Multi-model management | ✅ OpenAI API | ⚪ None | ✅ |
| vLLM | ⚪ None | 🌟 Very High | ⭐ Multi-GPU | ✅ OpenAI API | ✅ Paged KV | ⚪ None |
| llm-d | ⭐ K8s-native scheduling | 🌟 High | 🌟 Multi-instance collaboration | ✅ Inherits upper-layer interface | 🌟 Global cache | ✅ |
| WG Serving | 🌟 Standard abstraction | ⚪ None | 🌟 Cross-project collaboration | 🌟 Unified specification | ⚪ Not involved | ⚪ |
The future inference stack will center on standard APIs and pluggable modules, enabling deployment of large language models (LLMs) as easily as web services.
Deployment Paradigm Example
In a Kubernetes cluster, the deployment paradigm for the quartet is as follows:
- Prefill Layer: 4 × A100 Pods, responsible for long-context computation.
- Decode Layer: 16 × L4 Pods, performing streaming generation.
- llm-d Scheduler: Dynamically routes based on cache hit rate.
- KServe Control Plane: Manages LWS resources and scaling.
- Envoy GIE Gateway: Unified OpenAI interface entry.
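One way such a Leader/Worker topology might be declared is as a LeaderWorkerSet object (four groups, each with one Prefill leader and four Decode workers). Field names follow the leaderworkerset.x-k8s.io/v1 API as commonly documented, but the images, GPU resource names, and group sizes are illustrative; verify against the CRD installed in your cluster.

```python
# Sketch of a LeaderWorkerSet manifest for the prefill/decode topology described above.
import json

def pod_template(role: str, image: str) -> dict:
    """Minimal PodTemplateSpec with one GPU-requesting container (illustrative)."""
    return {
        "metadata": {"labels": {"role": role}},
        "spec": {
            "containers": [{
                "name": role,
                "image": image,                                  # placeholder image name
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }]
        },
    }

leader_worker_set = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "llama-3-70b", "namespace": "llm"},
    "spec": {
        "replicas": 4,                          # four inference groups
        "leaderWorkerTemplate": {
            "size": 5,                          # 1 prefill leader + 4 decode workers per group
            "leaderTemplate": pod_template("prefill", "example/vllm-prefill:latest"),
            "workerTemplate": pod_template("decode", "example/vllm-decode:latest"),
        },
    },
}

print(json.dumps(leader_worker_set, indent=2))  # apply with kubectl or the CustomObjectsApi
```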
The topology diagram below shows the deployment structure:
This combination achieves high concurrency, low cost, and observability for large model services.
Conclusion: The Future of Standardization
The table below summarizes the layers, roles, and core contributions of the quartet in the inference system:
| Layer | Role | Core Contribution |
|---|---|---|
| Entry | WG Serving (GIE) | Unified traffic entry and interface specification |
| Control | KServe | Kubernetes-native deployment and management |
| Scheduling | llm-d | Prefix cache-aware distributed inference scheduling |
| Execution | vLLM | High-performance, low-cost inference engine |
This “quartet” marks the beginning of a standardized and composable era for large model inference. Future trends will focus on:
- API standardization (OpenAI / OpenInference)
- Hierarchical and shared caching
- Decoupling of control and data planes
- Integrated orchestration on cloud native platforms
Summary
The cloud native LLM inference quartet—KServe, vLLM, llm-d, and WG Serving—is driving standardization, modularization, and ecosystem convergence in inference systems. Through layered collaboration and standard interfaces, developers can achieve high-performance, low-cost, and observable large language model inference services, accelerating the adoption and innovation of AI-native architectures.