Kthena

Kthena is a Kubernetes-native LLM inference platform designed for production deployments and lifecycle management of large language models.

Volcano · Since 2025-05-08

Overview

Kthena is a Kubernetes-native platform for LLM inference that provides declarative model lifecycle management and intelligent request routing. It separates control-plane operations from data-plane routing, so teams can deploy, scale, and update models with cloud-native workflows while supporting multiple inference backends and heterogeneous accelerators.

Key features

Production-ready LLM serving with support for vLLM, SGLang, Triton, and other inference engines.
Prefill–decode disaggregation to optimize hardware utilization and meet latency SLOs.
Cost-driven autoscaling, canary releases, weighted traffic distribution, and token-based rate limiting.
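
To make token-based rate limiting concrete, here is a minimal, generic token-bucket sketch in which the budget is measured in LLM tokens rather than in requests. It is not Kthena's implementation or API: the class name TokenBucket, its parameters, and the method try_consume are hypothetical names chosen for illustration.

```python
import time


class TokenBucket:
    """Generic token bucket whose budget is counted in LLM tokens.

    Illustrative only: the names here are hypothetical, not Kthena's API.
    """

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = float(capacity)                  # maximum stored budget
        self.refill_per_second = float(refill_per_second)
        self._clock = clock                              # injectable for testing
        self._available = float(capacity)                # start with a full bucket
        self._last = clock()

    def _refill(self):
        # Restore budget in proportion to elapsed time, capped at capacity.
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._available = min(
            self.capacity, self._available + elapsed * self.refill_per_second
        )

    def try_consume(self, tokens):
        # Admit the request only if its token cost fits in the current budget.
        self._refill()
        if tokens <= self._available:
            self._available -= tokens
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=1000, refill_per_second=100)
    print(bucket.try_consume(800))  # a large prompt fits in a full bucket
    print(bucket.try_consume(300))  # the next one exceeds the remaining budget
```

Charging by tokens rather than by request count means one long prompt consumes more of a tenant's budget than many short ones, which tracks accelerator cost more closely.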

Use cases

Serving large language models in production with high throughput and low latency requirements.
Hybrid multi-backend deployments where intelligent routing and traffic policies are required.
Kubernetes clusters that integrate topology-aware scheduling and gang scheduling for distributed inference workloads.

Technical highlights

Kubernetes CRD-based control plane for declarative model lifecycle and zero-downtime updates.
Dedicated router for high-performance request classification and multi-model routing.
Pluggable scheduling and topology-aware placement, with LoRA adapter hot-swap support.
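
As a rough illustration of weighted traffic distribution during a canary release, the sketch below picks a backend with probability proportional to its weight. It is a generic technique under stated assumptions, not Kthena's router: the function pick_backend and the backend names "stable" and "canary" are hypothetical.

```python
import random


def pick_backend(weights, rng=random):
    """Return a backend name chosen with probability proportional to its weight.

    `weights` maps hypothetical backend names to non-negative weights.
    """
    names = list(weights)
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one backend needs a positive weight")
    # random.choices performs the weighted draw; k=1 returns a one-item list.
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    # Send roughly 10% of requests to the canary model version.
    split = {"stable": 90, "canary": 10}
    sample = [pick_backend(split) for _ in range(10_000)]
    print(sample.count("canary") / len(sample))
```

Raising the canary weight step by step while the stable weight falls shifts traffic gradually, and setting a weight to 0 removes that backend from rotation without deleting it.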

Related Resources

Mini-SGLang

Osaurus

vLLM Production Stack