
Kthena

Kthena is a Kubernetes-native LLM inference platform designed for production deployments and lifecycle management of large language models.

Overview

Kthena is a Kubernetes-native platform for LLM inference that provides declarative model lifecycle management and intelligent request routing. It separates control-plane operations from data-plane routing, enabling teams to deploy, scale, and update models with cloud-native workflows while supporting multiple backends and heterogeneous accelerators.
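To make the declarative workflow concrete, the sketch below creates a model resource with the official Kubernetes Python client. The kthena.io group and version, the Model kind, and every spec field are assumptions invented for this example; the actual names come from Kthena's CRD schema.

```python
# Sketch: declaring a model with the Kubernetes Python client.
# All Kthena-specific names below (group, version, kind, spec fields)
# are illustrative assumptions, not the project's real schema.
from kubernetes import client, config

def deploy_model() -> None:
    config.load_kube_config()  # use load_incluster_config() inside a pod
    api = client.CustomObjectsApi()

    model = {
        "apiVersion": "kthena.io/v1alpha1",  # assumed API group/version
        "kind": "Model",                     # assumed kind
        "metadata": {"name": "llama-3-8b", "namespace": "inference"},
        "spec": {
            "backend": "vllm",               # one of the supported engines
            "modelURI": "hf://meta-llama/Meta-Llama-3-8B-Instruct",
            "replicas": 2,
        },
    }

    # The control plane reconciles the declared state: pulling the model,
    # rolling out serving pods, and updating them without downtime on change.
    api.create_namespaced_custom_object(
        group="kthena.io",
        version="v1alpha1",
        namespace="inference",
        plural="models",
        body=model,
    )

if __name__ == "__main__":
    deploy_model()
```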

Key features

  • Production-ready LLM serving with support for vLLM, SGLang, Triton, and other inference engines.
  • Prefill–decode disaggregation to optimize hardware utilization and meet latency SLOs.
  • Cost-driven autoscaling, canary releases, weighted traffic distribution, and token-based rate limiting (see the rate-limiting sketch after this list).
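As a minimal sketch of token-based rate limiting, here is a token bucket denominated in LLM tokens per second rather than requests per second. This is a generic illustration of the technique, not Kthena's implementation.

```python
import time

class TokenBucket:
    """Token bucket measured in LLM tokens per second: one common way to
    implement token-based rate limiting. A generic sketch, not Kthena's code."""

    def __init__(self, tokens_per_second: float, burst: float) -> None:
        self.rate = tokens_per_second   # refill rate (tokens/s)
        self.capacity = burst           # maximum burst budget
        self.available = burst
        self.last = time.monotonic()

    def allow(self, requested_tokens: int) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst capacity.
        self.available = min(self.capacity,
                             self.available + (now - self.last) * self.rate)
        self.last = now
        if requested_tokens <= self.available:
            self.available -= requested_tokens
            return True
        return False  # caller should reject or queue the request

limiter = TokenBucket(tokens_per_second=1000, burst=4000)
print(limiter.allow(512))    # True: fits within the budget
print(limiter.allow(10000))  # False: exceeds the remaining budget
```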

Use cases

  • Serving large language models in production under high-throughput, low-latency requirements.
  • Hybrid multi-backend deployments where intelligent routing and traffic policies are required.
  • Kubernetes clusters that integrate topology-aware scheduling and gang scheduling for distributed inference workloads.

Technical highlights

  • Kubernetes CRD-based control plane for declarative model lifecycle and zero-downtime updates.
  • Dedicated router for high-performance request classification and multi-model routing (see the routing sketch after this list).
  • Pluggable scheduling and topology-aware placement, with LoRA adapter hot-swap support.
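To illustrate the routing behavior, below is a minimal sketch of weighted traffic distribution as a router might apply it during a canary release. The variant names and the 90/10 split are invented for the example; a real router would read them from its traffic policy.

```python
import random

# Assumed model variants and weights for a 90/10 canary split.
WEIGHTS = {
    "llama-3-8b-stable": 0.9,
    "llama-3-8b-canary": 0.1,
}

def pick_backend(weights: dict[str, float]) -> str:
    """Choose a model variant in proportion to its configured weight."""
    targets = list(weights)
    return random.choices(targets, weights=[weights[t] for t in targets], k=1)[0]

# Rough check of the split over many simulated requests.
counts = {name: 0 for name in WEIGHTS}
for _ in range(10_000):
    counts[pick_backend(WEIGHTS)] += 1
print(counts)  # roughly 9000 stable / 1000 canary
```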

Resource Info
🌱 Open Source · 🔮 Inference · 🚀 Deployment · 🛠️ Dev Tools