KubeAI

An AI inferencing operator for Kubernetes that simplifies deploying LLMs, embeddings, and speech-to-text services.

Author: SubstratusAI

Since: 2023-10-21

Overview

KubeAI is a Kubernetes-native inferencing operator designed to streamline deploying and running LLMs, embeddings, and speech-to-text services at scale. It combines a model proxy, an operator for model lifecycle management, and routing/caching optimizations to improve throughput and latency. Note: the project is marked as no longer actively maintained; evaluate continuity needs before production use.

Key Features

OpenAI-compatible API endpoints for chat, completions, and embeddings.
Optimized routing and cache-aware load balancing to improve KV cache utilization.
Automated model management with support for downloading, mounting, and dynamic LoRA adapters.

Use Cases

Hosting low-latency model inference services and chat UIs on Kubernetes.
Large-scale batch inference and embedding pipelines across clusters.
Researching cache-aware routing and distributed inference strategies (noting maintenance status).

Technical Details

Written primarily in Go with supporting Jupyter/Notebook examples and Python tooling.
Deploys via Helm charts and uses Bazel/Makefile for builds and testing.
Includes quickstart examples and comprehensive docs at https://www.kubeai.org/ .

KubeAI

Overview

Key Features

Use Cases

Technical Details

Resource Info

Related Resources

Awex

AReaL

CellARC