
KAITO and KubeFleet: CNCF Is Reshaping AI Inference Infrastructure

CNCF is standardizing AI inference infrastructure for scalable deployment in multi-cluster Kubernetes environments through KAITO and KubeFleet.

The declarative and multi-cluster capabilities of cloud native computing are becoming the foundation of standardized AI inference infrastructure.

AI inference is rapidly emerging as the next frontier for cloud native infrastructure. As large language models (LLMs) grow in capability and scale, traditional single-cluster inference architectures struggle to meet requirements for global reach, high availability, and cost optimization. In late October 2025, CNCF announced two new hosted projects, KAITO (Kubernetes AI Toolchain Operator) and KubeFleet, marking the cloud native community's official entry into standardized AI inference infrastructure.

This article provides a systematic analysis of both projects and explores their strategic significance for the AI Infra ecosystem.

The Complexity of AI Inference: From Single Cluster to Multi-Cluster

As inference workloads for large models evolve, enterprises are adopting multi-cluster architectures. Below are three major challenges introduced by multi-cluster setups:

  • Deployment consistency: Managing model versions, dependencies, and configuration drift across clusters is difficult.
  • Scarce compute resources: Intelligent scheduling of available GPUs is required to avoid resource waste or hotspots.
  • Service reliability: Inference endpoints must deliver low latency, high availability, and cross-region SLAs.

KAITO and KubeFleet are designed to address these challenges.

The following diagram illustrates the architecture of KAITO and KubeFleet.

Figure 1: KAITO and KubeFleet Architecture

The architecture can be summarized as follows:

  • The top layer is the KubeFleet Hub Cluster, which controls multi-cluster placement logic.
  • The lower layer consists of three regional clusters (US / EU / APAC), each with Active Nodes and Spare GPUs.
  • The Inference Gateway provides a unified global inference entry point.
  • Arrow directions represent the control flow of placement and aggregation.

KAITO: Declarative Orchestration for AI Inference

KAITO (Kubernetes AI Toolchain Operator), initiated by the Microsoft team, is a declarative AI workload management framework. It abstracts model lifecycle management via CRDs (Custom Resource Definitions), making LLM inference as configurable and reusable as microservice deployment.

Project URL: github.com/kaito-project/kaito

The table below summarizes KAITO’s core features and design principles:

Feature/Principle | Description
Workspace Model Management | Supports both pre-trained and BYO (Bring Your Own) models
Automatic Resource Allocation | Dynamically requests nodes and volumes based on model size and GPU availability
Multi-node Optimization | Supports distributed storage and compute scheduling
Built-in Observability | Directly outputs inference latency, throughput, and error metrics
Declarative Deployment | Models are treated as native Kubernetes resources, supporting YAML configuration and GitOps
Table 1: KAITO Core Features and Design Principles

For example, an inference pipeline can be declared in YAML:

# Illustrative manifest; field names and the API group may differ from the actual KAITO CRDs
apiVersion: aitoolchain.io/v1
kind: ModelDeployment
metadata:
  name: qwen2-7b
spec:
  model: qwen2-7b   # model to serve
  engine: vllm      # inference engine backend
  replicas: 3       # number of inference replicas
  resources:
    gpu: 2          # GPUs per replica

This enables AI platforms to achieve the same deployment consistency and GitOps capabilities as application services.
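As a minimal sketch of the GitOps angle, the manifest below shows how such a model declaration could be reconciled from Git using Argo CD. The repository URL, path, and namespaces are hypothetical placeholders, and other GitOps tools such as Flux would work similarly; nothing here is prescribed by KAITO itself.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: qwen2-7b-inference
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/ai-platform/manifests.git  # hypothetical repo
    targetRevision: main
    path: inference/qwen2-7b      # directory holding the ModelDeployment YAML
  destination:
    server: https://kubernetes.default.svc
    namespace: inference
  syncPolicy:
    automated:
      prune: true                 # remove resources deleted from Git
      selfHeal: true              # revert manual drift on the cluster

With this in place, updating the model version or replica count is a Git commit, and the cluster converges automatically, which is exactly the deployment consistency described above.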

KubeFleet: Intelligent Multi-Cluster Scheduling and Placement

KubeFleet, led by the Azure Kubernetes Service (AKS) team, is a multi-cluster orchestrator focused on intelligent placement of inference workloads.

Project URL: github.com/kubefleet-dev/kubefleet

The table below highlights KubeFleet’s key features and use cases:

Feature/Use Case | Description
Cluster Capability Discovery | Evaluates each cluster's GPU type, quantity, cost, and location
Intelligent Placement | Deploys inference tasks to the most suitable cluster based on policies
Staged Updates | Supports canary releases across test, staging, and production clusters
Consistency Control | Ensures unified deployment templates across clusters
Global Inference Service | Supports geo-distributed inference
Heterogeneous GPU Pool Scheduling | Enables enterprise-grade unified deployment across environments
Table 2: KubeFleet Key Features and Use Cases
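To make the placement model concrete, here is a minimal sketch using the ClusterResourcePlacement API from the upstream fleet project that KubeFleet builds on. The exact API version should be checked against the KubeFleet documentation, and the namespace and cluster label are hypothetical:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacement
metadata:
  name: qwen2-7b-placement
spec:
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: inference             # propagate everything in this namespace
  policy:
    placementType: PickN          # choose the N clusters that best match the policy
    numberOfClusters: 2
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
            - labelSelector:
                matchLabels:
                  gpu.example.com/pool: "a100"   # hypothetical label on member clusters

Applied on the hub cluster, a policy like this lets the scheduler pick, say, the two regional clusters with matching GPU pools, rather than an operator hand-picking clusters per deployment.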

KAITO × KubeFleet: Layered Design of AI Inference Infrastructure

The following table summarizes the layered positioning of KAITO and KubeFleet in AI inference infrastructure:

Layer | Responsibility | Representative Project
Global Placement Layer | Decides which cluster | KubeFleet
Cluster Orchestration Layer | Defines model deployment | KAITO
Runtime Layer | Executes inference engine | vLLM / TGI / SGLang / Triton
Infra Layer | Provides compute and scheduling | Kubernetes / GPU / CNI / Storage
Table 3: Layered Design of AI Inference Infrastructure

This layered approach reflects CNCF’s consistent philosophy: abstracting complex infrastructure through declarative and pluggable methods to lower the entry barrier for AI inference platforms.
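For a sense of what sits beneath these abstractions, the sketch below runs vLLM's OpenAI-compatible server as a plain Kubernetes Deployment, which is roughly the kind of runtime-layer workload that KAITO generates on a user's behalf. The image tag, flags, and labels are illustrative rather than taken from either project:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen2-7b-vllm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: qwen2-7b-vllm
  template:
    metadata:
      labels:
        app: qwen2-7b-vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest       # vLLM OpenAI-compatible serving image
          args:
            - --model=Qwen/Qwen2-7B-Instruct   # Hugging Face model ID
            - --tensor-parallel-size=2         # shard the model across 2 GPUs
          ports:
            - containerPort: 8000              # OpenAI-compatible HTTP API
          resources:
            limits:
              nvidia.com/gpu: 2                # matches gpu: 2 in the earlier example

Declarative layers like KAITO exist precisely so platform teams do not have to hand-maintain manifests like this for every model, engine, and GPU combination.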

Ecosystem Significance and Trend Analysis

AI Infra is undergoing a cloud native transformation, and several trends stand out:

  • CNCF is integrating AI workloads into its governance system, which will drive AI platforms to gradually adopt a standardized stack aligned with cloud native principles.
  • Multi-cluster scheduling is becoming the new battleground: GPU heterogeneity and cross-region compliance are pushing enterprises toward multi-cluster inference architectures, and KubeFleet may become the "AI federation" successor to Karmada / Clusternet.
  • Declarative AI operations will replace manual script-based deployments, and KAITO's CRD model could become the standard semantic layer for future ML serving.
  • The strategic collaboration between Microsoft and CNCF is strengthening: both projects originate from the Azure team, signaling that cloud vendors are participating in the AI ecosystem through open infrastructure standards.

Comparison with Existing Projects

The table below compares KAITO, KubeFleet, and mainstream AI inference infrastructure projects:

Feature | KAITO | KubeFleet | Kubeflow | KServe | HAMi
Declarative Model Deployment | | | | |
Multi-cluster Scheduling | Partial | | | |
GPU Heterogeneity Awareness | Partial | | | |
Telemetry / Metrics | | | | |
Cloud Vendor Support | Microsoft / CNCF | Microsoft / CNCF | Google | IBM / Red Hat | AWS
Table 4: Feature Comparison of AI Inference Infrastructure Projects

Summary

The emergence of KAITO and KubeFleet marks a pivotal moment in the evolution of AI Infra. They represent the cloud native community’s formal engagement with AI inference and reveal future trends:

  • The complexity of AI inference will be absorbed by Kubernetes’ declarative and multi-cluster systems.
  • Both projects should be considered essential references for anyone researching AI-native infrastructure.
  • For developers and platform teams, they are not just new tools but signals of AI infrastructure standardization.
