Declarative and multi-cluster capabilities of cloud native are becoming the foundation for standardized AI inference infrastructure.
AI inference is rapidly emerging as the next frontier for cloud native infrastructure. As large language models (LLMs) grow in capability and scale, traditional single-cluster inference architectures struggle to meet global availability, latency, and cost-optimization requirements. In late October 2025, the CNCF announced two newly hosted projects, KAITO (Kubernetes AI Toolchain Operator) and KubeFleet, marking the cloud native community's formal entry into standardized AI inference infrastructure.
This article provides a systematic analysis of both projects and explores their strategic significance for the AI Infra ecosystem.
The Complexity of AI Inference: From Single Cluster to Multi-Cluster
As inference workloads for large models evolve, enterprises are adopting multi-cluster architectures. Below are three major challenges introduced by multi-cluster setups:
- Deployment consistency: Managing model versions, dependencies, and configuration drift across clusters is difficult.
- Scarce compute resources: Intelligent scheduling of available GPUs is required to avoid resource waste or hotspots.
- Service reliability: Inference endpoints must deliver low latency, high availability, and cross-region SLAs.
KAITO and KubeFleet are designed to address these challenges.
The following diagram illustrates how KAITO and KubeFleet fit together; the architecture can be summarized as follows:
- The top layer is the KubeFleet Hub Cluster, which controls multi-cluster placement logic.
- The lower layer consists of three regional clusters (US / EU / APAC), each with Active Nodes and Spare GPUs.
- The Inference Gateway provides a unified global inference entry point.
- Arrow directions represent the control flow of placement and aggregation.
KAITO: Declarative Orchestration for AI Inference
KAITO (Kubernetes AI Toolchain Operator), initiated by the Microsoft team, is a declarative AI workload management framework. It abstracts model lifecycle management via CRDs (Custom Resource Definitions), making LLM inference as configurable and reusable as microservice deployment.
Project URL: github.com/kaito-project/kaito
The table below summarizes KAITO’s core features and design principles:
| Feature/Principle | Description |
|---|---|
| Workspace Model Management | Supports both pre-trained and BYO (Bring Your Own) models |
| Automatic Resource Allocation | Dynamically requests nodes and volumes based on model size and GPU availability |
| Multi-node Optimization | Supports distributed storage and compute scheduling |
| Built-in Observability | Directly outputs inference latency, throughput, and error metrics |
| Declarative Deployment | Models are treated as native Kubernetes resources, supporting YAML config and GitOps |
For example, an inference deployment can be declared in a single YAML manifest (a simplified, schematic example rather than KAITO's literal schema):

```yaml
apiVersion: aitoolchain.io/v1
kind: ModelDeployment
metadata:
  name: qwen2-7b
spec:
  model: qwen2-7b
  engine: vllm
  replicas: 3
  resources:
    gpu: 2
```
This enables AI platforms to achieve the same deployment consistency and GitOps capabilities as application services.
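In practice, the CRD that KAITO installs is a `Workspace`, which couples GPU node requirements with an inference preset. The sketch below follows the style of the upstream examples; the API version, instance type, and preset name are assumptions and vary by KAITO release:

```yaml
# Sketch of a KAITO Workspace; API version, instance type, and preset name
# are illustrative and differ between releases.
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  # GPU SKU to provision or match for this workload.
  instanceType: "Standard_NC24ads_A100_v4"
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  # A preset bundles the model weights with a serving runtime (e.g. vLLM).
  preset:
    name: falcon-7b-instruct
```

Either way, the model becomes a Git-trackable Kubernetes object that a GitOps controller can reconcile like any other application manifest.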
KubeFleet: Intelligent Multi-Cluster Scheduling and Placement
KubeFleet, led by the Azure Kubernetes Service (AKS) team, is a multi-cluster orchestrator focused on intelligent placement of inference workloads.
Project URL: github.com/kubefleet-dev/kubefleet
The table below highlights KubeFleet’s key features and use cases:
| Feature/Use Case | Description |
|---|---|
| Cluster Capability Discovery | Evaluates each cluster’s GPU type, quantity, cost, and location |
| Intelligent Placement | Deploys inference tasks to the most suitable cluster based on policies |
| Staged Updates | Supports canary releases across test, staging, and production clusters |
| Consistency Control | Ensures unified deployment templates across clusters |
| Global Inference Service | Supports geo-distributed inference |
| Heterogeneous GPU Pool Scheduling | Schedules workloads across clusters with different GPU types and capacities, forming a unified enterprise-wide pool |
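KubeFleet's central API on the hub cluster is the `ClusterResourcePlacement`, which selects hub-side resources and describes where and how to propagate them to member clusters. The sketch below is modeled on the upstream examples: it picks two member clusters by a cluster label and rolls changes out one cluster at a time. The API version, label key, and namespace name are illustrative assumptions:

```yaml
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: llm-inference-placement
spec:
  # Propagate everything in the namespace that holds the inference manifests.
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: llm-inference
  # Let the scheduler pick the two member clusters matching the affinity below.
  policy:
    placementType: PickN
    numberOfClusters: 2
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
            - labelSelector:
                matchLabels:
                  gpu-class: a100
  # Stage the rollout so only one selected cluster is updated at a time.
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
```

Once applied to the hub cluster, the scheduler evaluates member clusters against the policy, and the rolling-update strategy gives the staged, canary-style rollout described above.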
KAITO × KubeFleet: Layered Design of AI Inference Infrastructure
The following table summarizes the layered positioning of KAITO and KubeFleet in AI inference infrastructure:
| Layer | Responsibility | Representative Project |
|---|---|---|
| Global Placement Layer | Decides which cluster(s) a workload runs in | KubeFleet |
| Cluster Orchestration Layer | Defines how a model is deployed within a cluster | KAITO |
| Runtime Layer | Runs the inference engine | vLLM / TGI / SGLang / Triton |
| Infra Layer | Provides compute, networking, and scheduling | Kubernetes / GPU / CNI / Storage |
This layered approach reflects CNCF’s consistent philosophy: abstracting complex infrastructure through declarative and pluggable methods to lower the entry barrier for AI inference platforms.
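Because every layer is expressed as Kubernetes objects, the whole stack can live in a single GitOps repository: the hub cluster reconciles the placement, and member clusters reconcile the KAITO resources that the placement delivers. A minimal, hypothetical kustomization tying the two layers together might look like this (file names and layout are assumptions, not part of either project):

```yaml
# kustomization.yaml in the hub cluster's GitOps repository (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml            # the llm-inference namespace propagated to member clusters
  - workspace-falcon-7b.yaml  # KAITO Workspace (cluster orchestration layer)
  - placement.yaml            # KubeFleet ClusterResourcePlacement (global placement layer)
```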
Ecosystem Significance and Trend Analysis
AI Infra is undergoing a cloud native transformation, with CNCF integrating AI workloads into its governance system. Several trends follow:

- AI platforms will gradually adopt a standardized stack aligned with cloud native principles.
- Multi-cluster scheduling is becoming the new battleground: GPU heterogeneity and cross-region compliance are pushing enterprises toward multi-cluster inference architectures, and KubeFleet may become the "AI Federation" successor to Karmada / Clusternet.
- Declarative AI operations will replace manual script-based deployments, and KAITO's CRD model could become the standard semantic layer for future ML serving.
- The strategic collaboration between Microsoft and CNCF is strengthening: both projects originate from the Azure team, signaling that cloud vendors are participating in the AI ecosystem through open infrastructure standards.
Comparison with Existing Projects
The table below compares KAITO, KubeFleet, and mainstream AI inference infrastructure projects:
| Feature | KAITO | KubeFleet | Kubeflow | KServe | HAMI |
|---|---|---|---|---|---|
| Declarative Model Deployment | ✅ | – | ✅ | ✅ | – |
| Multi-cluster Scheduling | – | ✅ | – | Partial | ✅ |
| GPU Heterogeneity Awareness | ✅ | ✅ | Partial | ✅ | ✅ |
| Telemetry / Metrics | ✅ | ✅ | ✅ | ✅ | ✅ |
| Cloud Vendor Support | Microsoft / CNCF | Microsoft / CNCF | Google / CNCF | IBM / Red Hat | CNCF Sandbox |
Summary
The emergence of KAITO and KubeFleet marks a pivotal moment in the evolution of AI Infra. They represent the cloud native community’s formal engagement with AI inference and reveal future trends:
- The complexity of AI inference will be absorbed by Kubernetes’ declarative and multi-cluster systems.
- Both projects should be considered essential references for anyone researching AI-native infrastructure.
- For developers and platform teams, they are not just new tools but signals of AI infrastructure standardization.