
Building Efficient LLM Inference with the Cloud Native Quartet: KServe, vLLM, llm-d, and WG Serving

Essential reading for cloud native and AI-native architects: how KServe, vLLM, llm-d, and WG Serving form the cloud native ‘quartet’ for large model inference, their roles, synergy, and ecosystem trends.

The standardization and modularization of cloud native inference systems are making large model deployment as simple and efficient as web services.

Large language model (LLM) inference is evolving from single-machine accelerators to distributed cloud native systems. The most representative combination today is KServe, vLLM, llm-d, and WG Serving. Each plays a distinct role: standard interface, execution engine, scheduling layer, and collaboration specification. Together they form a scalable, observable, and governable inference foundation.

The following timeline outlines the key milestones in the evolution of the quartet:

Figure 1: Cloud Native LLM Inference Quartet Evolution Timeline

Architecture Overview

The diagram below illustrates the layered collaboration of the quartet within the inference system:

Figure 2: Cloud Native LLM Inference Quartet Architecture Overview

KServe: The Core of Cloud Native Model Serving

KServe is a Kubernetes-native inference control plane. It abstracts model services as CRDs (Custom Resource Definitions), making AI inference as deployable, scalable, and upgradable as microservices.

The table below summarizes KServe’s core capabilities and new features:

Dimension | Description
Core Goal | Provides Kubernetes-native inference standards and control plane
Core Features | CRD standardization, elastic scaling, traffic governance, unified gateway entry
New Features | LeaderWorkerSet support, AI Gateway integration, llm-d integration
Table 1: KServe Core Capabilities and New Features

KServe’s key capabilities include:

  • Unified Interface: The InferenceService CRD defines input/output protocols and is compatible with REST/gRPC and the OpenAI API (a manifest sketch follows this list).
  • Elastic Scheduling: Supports automatic GPU scaling and ModelMesh multi-model hosting.
  • Traffic Governance: Canary releases, A/B testing, and InferenceGraph.
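
To make the InferenceService CRD concrete, here is a minimal sketch that builds a manifest as a Python dict and applies it with the Kubernetes dynamic client. The huggingface model format, the hf:// storage URI, the model id, and the GPU request are assumptions for illustration; verify them against your KServe version.

```python
# Minimal sketch: create a KServe InferenceService from Python.
# Assumes the KServe CRDs are installed and kubeconfig access is available;
# the model id, runtime format, and GPU request are illustrative only.
from kubernetes import client, config, dynamic

config.load_kube_config()
dyn = dynamic.DynamicClient(client.ApiClient())

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-chat", "namespace": "default"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "huggingface"},                  # assumed LLM runtime
                "storageUri": "hf://meta-llama/Llama-3.1-8B-Instruct",   # placeholder model
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        }
    },
}

isvc_api = dyn.resources.get(api_version="serving.kserve.io/v1beta1", kind="InferenceService")
isvc_api.create(body=inference_service, namespace="default")
```

Once the resource is ready, KServe exposes a prediction endpoint behind its gateway; the exact URL pattern depends on the configured ingress.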

The latest version introduces the LeaderWorkerSet (LWS) mechanism and Envoy AI Gateway extension, making multi-Pod large model inference a native capability. KServe is transitioning from a traditional ML service platform to the standard control plane for generative AI.

vLLM: High-Performance Inference Execution Engine

vLLM focuses on extreme throughput and memory efficiency, setting the benchmark for open-source performance.

The sequence diagram below shows the main vLLM inference process:

Figure 3: vLLM Inference Process

The table below summarizes vLLM’s core technical mechanisms and effects:

Feature | Technical Mechanism | Effect
PagedAttention | Memory paging | Longer context, less fragmentation
Continuous Batching | Dynamic batch scheduling | Higher GPU utilization
Prefix Cache | Prefix reuse | Lower latency and cost
Table 2: vLLM Core Technical Mechanisms and Effects

vLLM is compatible with the OpenAI API, supports INT8/FP8 quantization and various parallelism modes, and runs on NVIDIA and AMD GPUs as well as TPU and Gaudi accelerators. In single-machine or small-scale scenarios, vLLM can run independently; in cluster environments, it serves as the execution foundation for KServe and llm-d.
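
As a small sketch of these mechanisms, the snippet below runs vLLM offline with prefix caching enabled; the model id, sampling values, and prompts are placeholders, and FP8 quantization is commented out because it depends on hardware support.

```python
# Minimal vLLM sketch: prefix caching and continuous batching are handled
# internally by the engine; the model id and parameters are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any Hugging Face model id
    enable_prefix_caching=True,                # reuse KV blocks for shared prompt prefixes
    # quantization="fp8",                      # optional, requires supporting hardware
)

params = SamplingParams(temperature=0.7, max_tokens=128)
shared_prefix = "You are a concise technical assistant.\n\n"
prompts = [
    shared_prefix + "Explain PagedAttention.",
    shared_prefix + "Explain continuous batching.",
]

# Both prompts share a prefix, so the second benefits from the prefix cache.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

The same engine can be exposed as an OpenAI-compatible server with `vllm serve <model>`, which is how it is typically run under KServe or llm-d.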

llm-d: Distributed Inference Scheduling Layer

llm-d is a large model scheduling and orchestration system for Kubernetes that enables multi-instance collaboration for vLLM. Its design goal is to make a cluster serve inference as if it were a single machine.

The table below summarizes llm-d’s core mechanisms and technical highlights:

Module | Function | Technical Highlight
Scheduler | Cache-aware routing | Prefix affinity scheduling
Prefill/Decode Separation | Heterogeneous hardware optimization | A100 Prefill + L40 Decode
Cache Manager | Global cache index | Hierarchical GPU/CPU/NVMe cache
Table 3: llm-d Core Mechanisms and Technical Highlights

The following diagram illustrates llm-d’s distributed scheduling and caching mechanism:

Figure 4: llm-d Distributed Scheduling and Caching Mechanism

llm-d runs under the KServe control plane in a Leader/Worker pattern. The scheduler can be embedded in Envoy or deployed independently, making real-time routing decisions based on cache and load information. Its emergence enables autonomous scheduling and elastic parallelism for multi-node LLM inference.
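
The routing idea can be illustrated with a small conceptual sketch; this is not llm-d's actual code. Each replica is scored by expected prefix-cache hits minus a load penalty, and the best-scoring replica wins. The data structures and weights are invented for illustration.

```python
# Conceptual sketch of cache-aware (prefix-affinity) routing; NOT llm-d's implementation.
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    cached_blocks: set = field(default_factory=set)  # hashes of KV blocks held by this replica
    queue_depth: int = 0                             # pending requests (load signal)

def score(replica: Replica, prompt_blocks: list,
          hit_weight: float = 1.0, load_weight: float = 0.2) -> float:
    """Reward expected cache hits, penalize queue depth. Weights are illustrative."""
    hits = sum(1 for block in prompt_blocks if block in replica.cached_blocks)
    return hit_weight * hits - load_weight * replica.queue_depth

def route(replicas: list, prompt_blocks: list) -> Replica:
    return max(replicas, key=lambda r: score(r, prompt_blocks))

# Example: replica "a" already caches most of the prompt, so it wins despite some load.
replicas = [
    Replica("a", {"h1", "h2", "h3"}, queue_depth=2),
    Replica("b", {"h9"}, queue_depth=0),
]
print(route(replicas, ["h1", "h2", "h3", "h4"]).name)  # -> "a"
```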

WG Serving: Collaboration Standards and Ecosystem Hub

WG Serving is an AI serving working group driven by the Kubernetes community that defines unified inference semantics for Kubernetes.

The table below summarizes WG Serving’s core achievements and standardization contributions:

Achievement/Standard | Description
Gateway Inference Extension (GIE) | Envoy-based inference gateway protocol supporting model identification, streaming forwarding, priority, and cache affinity routing
LeaderWorkerSet CRD | Explicitly describes Leader–Worker collaboration structure, foundational for llm-d and KServe multi-Pod inference
Interface Alignment | Advocates OpenAI-style API integration with K8s resource objects, promoting cross-framework interoperability
Table 4: WG Serving Core Achievements and Standardization Contributions
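
For context on the LeaderWorkerSet CRD in the table above, here is a minimal manifest sketch expressed as a Python dict. The API group leaderworkerset.x-k8s.io/v1 is real, but the field details, image, and group sizing are assumptions to check against the installed CRD version.

```python
# Sketch of a LeaderWorkerSet resource describing grouped multi-Pod inference.
# Field layout follows the leaderworkerset.x-k8s.io/v1 API; values are illustrative.
def pod_template(image: str) -> dict:
    return {"spec": {"containers": [{"name": "inference", "image": image}]}}

leader_worker_set = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "llama-70b"},
    "spec": {
        "replicas": 2,                                                 # two leader/worker groups
        "leaderWorkerTemplate": {
            "size": 4,                                                 # 1 leader + 3 workers per group
            "leaderTemplate": pod_template("vllm/vllm-openai:latest"),  # placeholder image
            "workerTemplate": pod_template("vllm/vllm-openai:latest"),
        },
    },
}
```

The dict can be applied with the same dynamic-client pattern shown earlier for InferenceService.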

GIE is the ‘unified gateway language’ for cloud native AI inference. Just as Ingress defines the entry point for HTTP services, GIE defines the standard semantics and gateway behavior for inference requests within Kubernetes, enabling composable, observable, and extensible inference systems.

The diagram below shows WG Serving’s standardized collaboration within the inference system:

Figure 5: WG Serving Standardized Collaboration

WG Serving is not a product but a standards layer that builds industry consensus and drives a unified language for cloud native AI inference.

Combined Architecture

The table below summarizes the division of labor and roles of the quartet in the system:

Layer | Component | Role
Entry Layer | Envoy + GIE | Unified API gateway and traffic hooks
Control Layer | KServe + LWS | Lifecycle management, elastic scaling, traffic orchestration
Scheduling Layer | llm-d | Prefix-aware routing, cross-Pod collaboration, cache management
Execution Layer | vLLM | Efficient inference execution and cache reuse
Table 5: Cloud Native LLM Inference Quartet Division of Labor and Roles

The diagram below illustrates the synergy among the quartet:

Figure 6: Cloud Native LLM Inference Quartet Synergy

Clients send requests in the OpenAI API format; the GIE gateway routes each request to the optimal Leader, which completes Prefill and passes the KV cache to a Worker for Decode; the result is then streamed back to the client. The entire chain features standard interfaces, high throughput, and elastic scaling.
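
From the client's perspective, the whole chain is just an OpenAI-style endpoint. The sketch below uses the OpenAI Python SDK against a placeholder gateway URL and model name; both are deployment-specific assumptions.

```python
# Client-side sketch: stream a chat completion through the inference gateway.
# The base_url, api_key handling, and model name are deployment-specific placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://inference-gateway.example.com/v1",  # GIE/Envoy gateway entry (placeholder)
    api_key="unused",                                    # many in-cluster gateways ignore the key
)

stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```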

The table below summarizes the convergence trends and feature comparisons in the cloud native LLM inference ecosystem:

Trend/Feature | Description
API Unification | OpenAI-style interfaces have become the de facto standard; KServe and vLLM are natively compatible
Module Decoupling | Gateway, scheduling, and inference are layered for independent evolution and replacement
Hierarchical Caching | The GPU–CPU–NVMe three-level KV cache is becoming mainstream
Community Collaboration | WG Serving, the PyTorch Foundation, and the CNCF jointly promote cross-project integration
Table 6: Cloud Native LLM Inference Ecosystem Convergence Trends and Feature Comparison

The following matrix compares the core capabilities of each project:

Project | Control Plane | Inference Performance | Distributed Capability | Interface Compatibility | Cache Mechanism | Elastic Scaling
KServe | ✅ CRD / LWS | ⚪ Medium | ⭐ Multi-model management | ✅ OpenAI API | ⚪ None | ✅ Automatic GPU scaling
vLLM | ⚪ None | 🌟 Very High | ⭐ Multi-GPU | ✅ OpenAI API | ✅ Paged KV | ⚪ None
llm-d | ⭐ K8s-native scheduling | 🌟 High | 🌟 Multi-instance collaboration | ✅ Inherits upper-layer interface | 🌟 Global cache | ✅ Elastic parallelism
WG Serving | 🌟 Standard abstraction | ⚪ None | 🌟 Cross-project collaboration | 🌟 Unified specification | ⚪ Not involved | ⚪ Not involved
Table 7: Cloud Native LLM Inference Quartet Feature Comparison Matrix

The future inference stack will center on standard APIs and pluggable modules, enabling large language models to be deployed as easily as web services.

Deployment Paradigm Example

In a Kubernetes cluster, the deployment paradigm for the quartet is as follows:

  • Prefill Layer: 4 × A100 Pods, responsible for long-context computation (a pod-template sketch follows this list).
  • Decode Layer: 16 × L4 Pods, performing streaming generation.
  • llm-d Scheduler: Dynamically routes based on cache hit rate.
  • KServe Control Plane: Manages LWS resources and scaling.
  • Envoy GIE Gateway: Unified OpenAI interface entry.
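
A sketch of how the Prefill pool might pin itself to A100 nodes is shown below. The node label key and value come from NVIDIA GPU feature discovery and vary by cluster; the image and resource numbers are likewise placeholders.

```python
# Sketch: pod template fragment for the Prefill pool, pinned to A100 nodes.
# The label value below is produced by NVIDIA GPU feature discovery and differs per cluster.
prefill_pod_template = {
    "metadata": {"labels": {"role": "prefill"}},
    "spec": {
        "nodeSelector": {"nvidia.com/gpu.product": "NVIDIA-A100-SXM4-80GB"},  # assumed label value
        "containers": [
            {
                "name": "vllm",
                "image": "vllm/vllm-openai:latest",            # placeholder image
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        ],
    },
}
# The Decode pool would look the same with an L4 node selector and more replicas.
```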

The topology diagram below shows the deployment structure:

Figure 7: Cloud Native LLM Inference Quartet Deployment Topology

This combination achieves high concurrency, low cost, and observability for large model services.

Conclusion: The Future of Standardization

The table below summarizes the layers, roles, and core contributions of the quartet in the inference system:

Layer | Role | Core Contribution
Entry | WG Serving (GIE) | Unified traffic entry and interface specification
Control | KServe | Kubernetes-native deployment and management
Scheduling | llm-d | Prefix cache-aware distributed inference scheduling
Execution | vLLM | High-performance, low-cost inference engine
Table 8: Cloud Native LLM Inference Quartet Layers and Core Contributions

This “quartet” marks the beginning of a standardized and composable era for large model inference. Future trends will focus on:

  • API standardization (OpenAI / OpenInference)
  • Hierarchical and shared caching
  • Decoupling of control and data planes
  • Integrated orchestration on cloud native platforms

Summary

The cloud native LLM inference quartet—KServe, vLLM, llm-d, and WG Serving—is driving standardization, modularization, and ecosystem convergence in inference systems. Through layered collaboration and standard interfaces, developers can achieve high-performance, low-cost, and observable large language model inference services, accelerating the adoption and innovation of AI-native architectures.
