Declarative and multi-cluster capabilities of cloud native are becoming the foundation for standardized AI inference infrastructure.
AI inference is rapidly emerging as the next frontier for cloud native infrastructure. As large language models (LLMs) grow in capability and scale, traditional single-cluster inference architectures struggle to meet global availability, latency, and cost-optimization requirements. In late October 2025, the CNCF announced two newly hosted projects, KAITO (Kubernetes AI Toolchain Operator) and KubeFleet, marking the cloud native community's formal entry into standardized AI inference infrastructure.
This article provides a systematic analysis of both projects and explores their strategic significance for the AI Infra ecosystem.
The Complexity of AI Inference: From Single Cluster to Multi-Cluster
As inference workloads for large models evolve, enterprises are adopting multi-cluster architectures. Below are three major challenges introduced by multi-cluster setups:
- Deployment consistency: Managing model versions, dependencies, and configuration drift across clusters is difficult.
- Scarce compute resources: Intelligent scheduling of available GPUs is required to avoid resource waste or hotspots.
- Service reliability: Inference endpoints must deliver low latency, high availability, and cross-region SLAs.
KAITO and KubeFleet are designed to address these challenges.
The following diagram illustrates how KAITO and KubeFleet fit together; the architecture can be summarized as follows:
- The top layer is the KubeFleet Hub Cluster, which controls multi-cluster placement logic.
- The lower layer consists of three regional clusters (US / EU / APAC), each with Active Nodes and Spare GPUs.
- The Inference Gateway provides a unified global inference entry point.
- Arrow directions represent the control flow of placement and aggregation.
KAITO: Declarative Orchestration for AI Inference
KAITO (Kubernetes AI Toolchain Operator), initiated by the Microsoft team, is a declarative AI workload management framework. It abstracts model lifecycle management via CRDs (Custom Resource Definitions), making LLM inference as configurable and reusable as microservice deployment.
Project URL: github.com/kaito-project/kaito
The table below summarizes KAITO’s core features and design principles:
| Feature/Principle | Description |
|---|---|
| Workspace Model Management | Supports both pre-trained and BYO (Bring Your Own) models |
| Automatic Resource Allocation | Dynamically requests nodes and volumes based on model size and GPU availability |
| Multi-node Optimization | Supports distributed storage and compute scheduling |
| Built-in Observability | Directly outputs inference latency, throughput, and error metrics |
| Declarative Deployment | Models are treated as native Kubernetes resources, supporting YAML config and GitOps |
For example, an inference deployment can be declared in a single YAML manifest (a simplified, schematic example rather than KAITO's literal schema):

```yaml
apiVersion: aitoolchain.io/v1
kind: ModelDeployment
metadata:
  name: qwen2-7b
spec:
  model: qwen2-7b
  engine: vllm
  replicas: 3
  resources:
    gpu: 2
```
This enables AI platforms to achieve the same deployment consistency and GitOps capabilities as application services.
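In practice, the CRD that KAITO installs is a `Workspace`, which couples GPU node requirements with an inference preset. The sketch below follows the style of the upstream examples; the API version, instance type, and preset name are assumptions and vary by KAITO release:

```yaml
# Sketch of a KAITO Workspace; API version, instance type, and preset name
# are illustrative and differ between releases.
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  # GPU SKU to provision or match for this workload.
  instanceType: "Standard_NC24ads_A100_v4"
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  # A preset bundles the model weights with a serving runtime (e.g. vLLM).
  preset:
    name: falcon-7b-instruct
```

Either way, the model becomes a Git-trackable Kubernetes object that a GitOps controller can reconcile like any other application manifest.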
KubeFleet: Intelligent Multi-Cluster Scheduling and Placement
KubeFleet, led by the Azure Kubernetes Service (AKS) team, is a multi-cluster orchestrator focused on intelligent placement of inference workloads.
Project URL: github.com/kubefleet-dev/kubefleet
The table below highlights KubeFleet’s key features and use cases:
| Feature/Use Case | Description |
|---|---|
| Cluster Capability Discovery | Evaluates each cluster’s GPU type, quantity, cost, and location |
| Intelligent Placement | Deploys inference tasks to the most suitable cluster based on policies |
| Staged Updates | Supports canary releases across test, staging, and production clusters |
| Consistency Control | Ensures unified deployment templates across clusters |
| Global Inference Service | Supports geo-distributed inference |
| Heterogeneous GPU Pool Scheduling | Schedules workloads across clusters with different GPU types and capacities, forming a unified enterprise-wide pool |
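KubeFleet's central API on the hub cluster is the `ClusterResourcePlacement`, which selects hub-side resources and describes where and how to propagate them to member clusters. The sketch below is modeled on the upstream examples: it picks two member clusters by a cluster label and rolls changes out one cluster at a time. The API version, label key, and namespace name are illustrative assumptions:

```yaml
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: llm-inference-placement
spec:
  # Propagate everything in the namespace that holds the inference manifests.
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: llm-inference
  # Let the scheduler pick the two member clusters matching the affinity below.
  policy:
    placementType: PickN
    numberOfClusters: 2
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
            - labelSelector:
                matchLabels:
                  gpu-class: a100
  # Stage the rollout so only one selected cluster is updated at a time.
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
```

Once applied to the hub cluster, the scheduler evaluates member clusters against the policy, and the rolling-update strategy gives the staged, canary-style rollout described above.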
KAITO × KubeFleet: Layered Design of AI Inference Infrastructure
The following table summarizes the layered positioning of KAITO and KubeFleet in AI inference infrastructure:
| Layer | Responsibility | Representative Project |
|---|---|---|
| Global Placement Layer | Decides which cluster(s) a workload runs in | KubeFleet |
| Cluster Orchestration Layer | Defines how a model is deployed within a cluster | KAITO |
| Runtime Layer | Runs the inference engine | vLLM / TGI / SGLang / Triton |
| Infra Layer | Provides compute, networking, and scheduling | Kubernetes / GPU / CNI / Storage |
This layered approach reflects CNCF’s consistent philosophy: abstracting complex infrastructure through declarative and pluggable methods to lower the entry barrier for AI inference platforms.
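Because every layer is expressed as Kubernetes objects, the whole stack can live in a single GitOps repository: the hub cluster reconciles the placement, and member clusters reconcile the KAITO resources that the placement delivers. A minimal, hypothetical kustomization tying the two layers together might look like this (file names and layout are assumptions, not part of either project):

```yaml
# kustomization.yaml in the hub cluster's GitOps repository (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml            # the llm-inference namespace propagated to member clusters
  - workspace-falcon-7b.yaml  # KAITO Workspace (cluster orchestration layer)
  - placement.yaml            # KubeFleet ClusterResourcePlacement (global placement layer)
```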
Ecosystem Significance and Trend Analysis
AI Infra is undergoing a cloud native transformation, with CNCF integrating AI workloads into its governance system. Several trends follow:

- AI platforms will gradually adopt a standardized stack aligned with cloud native principles.
- Multi-cluster scheduling is becoming the new battleground: GPU heterogeneity and cross-region compliance are pushing enterprises toward multi-cluster inference architectures, and KubeFleet may become the "AI Federation" successor to Karmada / Clusternet.
- Declarative AI operations will replace manual script-based deployments, and KAITO's CRD model could become the standard semantic layer for future ML serving.
- The strategic collaboration between Microsoft and CNCF is strengthening: both projects originate from the Azure team, signaling that cloud vendors are participating in the AI ecosystem through open infrastructure standards.
Comparison with Existing Projects
The table below compares KAITO, KubeFleet, and mainstream AI inference infrastructure projects:
| Feature | KAITO | KubeFleet | Kubeflow | KServe | HAMI |
|---|---|---|---|---|---|
| Declarative Model Deployment | ✅ | – | ✅ | ✅ | – |
| Multi-cluster Scheduling | – | ✅ | – | Partial | ✅ |
| GPU Heterogeneity Awareness | ✅ | ✅ | Partial | ✅ | ✅ |
| Telemetry / Metrics | ✅ | ✅ | ✅ | ✅ | ✅ |
| Cloud Vendor Support | Microsoft / CNCF | Microsoft / CNCF | Google / CNCF | IBM / Red Hat | CNCF Sandbox |
Summary
The emergence of KAITO and KubeFleet marks a pivotal moment in the evolution of AI Infra. They represent the cloud native community’s formal engagement with AI inference and reveal future trends:
- The complexity of AI inference will be absorbed by Kubernetes’ declarative and multi-cluster systems.
- Both projects should be considered essential references for anyone researching AI-native infrastructure.
- For developers and platform teams, they are not just new tools but signals of AI infrastructure standardization.