Overview
Kthena is a Kubernetes-native platform for LLM inference that provides declarative model lifecycle management and intelligent request routing. It separates control-plane operations from data-plane routing, enabling teams to deploy, scale, and update models with cloud-native workflows while supporting multiple backends and heterogeneous accelerators.
Key features
- Production-ready LLM serving with support for vLLM, SGLang, Triton, and other inference engines.
- Prefill–decode disaggregation to optimize hardware utilization and meet latency SLOs.
- Cost-driven autoscaling, canary releases, weighted traffic distribution, and token-based rate limiting.
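As a rough sketch of how canary releases, weighted traffic distribution, and token-based rate limiting could be expressed declaratively, consider the manifest below. The `apiVersion`, `kind`, and all field names are illustrative assumptions for this document, not Kthena's actual API; consult the project's CRD reference for the real schema.

```yaml
# Illustrative sketch only: the kind and field names here are
# assumptions, not Kthena's actual API. It shows the general shape
# of a weighted canary split with a token-based rate limit attached.
apiVersion: example.kthena.io/v1alpha1
kind: ModelRoute
metadata:
  name: chat-route
spec:
  modelName: llama-3-8b
  rules:
    - targets:
        - modelServer: llama-3-8b-stable
          weight: 90        # 90% of requests stay on the stable revision
        - modelServer: llama-3-8b-canary
          weight: 10        # 10% canary slice for the new revision
  rateLimit:
    tokensPerMinute: 60000  # budget measured in tokens, not requests
```

Shifting traffic is then a matter of editing the weights and re-applying the manifest, with the router reconciling the split rather than clients changing endpoints.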
Use cases
- Serving large language models in production under high-throughput, low-latency requirements.
- Hybrid multi-backend deployments where intelligent routing and traffic policies are required.
- Distributed inference workloads on Kubernetes clusters that require topology-aware scheduling and gang scheduling.
Technical highlights
- Kubernetes CRD-based control plane for declarative model lifecycle and zero-downtime updates.
- Dedicated router for high-performance request classification and multi-model routing.
- Pluggable scheduling and topology-aware placement, with LoRA adapter hot-swap support.
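To make the CRD-based control plane concrete, a declarative model deployment might look like the sketch below. All names and fields are hypothetical placeholders (not Kthena's real CRDs); the point is that desired state — backend, model source, update policy — lives in a manifest the controller reconciles.

```yaml
# Illustrative sketch only: kind and field names are assumptions,
# not Kthena's actual API. Shows a vLLM-backed model declared as a
# custom resource with a zero-downtime rolling update policy.
apiVersion: example.kthena.io/v1alpha1
kind: ModelServer
metadata:
  name: llama-3-8b
spec:
  replicas: 4
  backend: vllm
  model:
    source: huggingface
    name: meta-llama/Meta-Llama-3-8B-Instruct
  updateStrategy:
    type: RollingUpdate   # replace pods gradually instead of all at once
    maxUnavailable: 0     # new pods must be ready before old ones drain
  resources:
    limits:
      nvidia.com/gpu: 1
```

Because the manifest is the source of truth, updates such as swapping the model version or engine image are a `kubectl apply`, and the controller performs the rollout without dropping in-flight traffic.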