Overview
llmaz (pronounced /lima:z/) is a Kubernetes-native inference platform from InftyAI. It provides production-ready tooling and control-plane components for deploying, orchestrating, and serving large language models at scale.
Key features
- Support for many inference backends (vLLM, Text Generation Inference, llama.cpp, TensorRT-LLM, and more).
- Heterogeneous cluster and device support, with model routing and scheduling across accelerator types (see the flavor sketch after this list).
- Built-in integrations such as Open WebUI for chat, RAG, and other common workflows.
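To make the heterogeneous-device point concrete, below is a minimal sketch of a model declaring two accelerator "flavors" so that replicas can be scheduled onto whichever GPU type is available. It is modeled on the upstream OpenModel examples; the exact field names (inferenceConfig, flavors, limits) should be verified against the CRD version you install.

```yaml
# Sketch only: an OpenModel with two GPU flavors. Field names follow the
# llmaz.io/v1alpha1 examples upstream and may differ between releases.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct   # loaded automatically at startup
  inferenceConfig:
    flavors:
      - name: a100          # preferred flavor: schedule on A100 nodes first
        limits:
          nvidia.com/gpu: 1
      - name: t4            # fallback flavor: a cheaper GPU if A100s are scarce
        limits:
          nvidia.com/gpu: 1
```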
Use cases
- Deploy LLM inference services on Kubernetes with standardized APIs for applications (see the Playground sketch after this list).
- Distributed and elastic inference across GPUs, CPUs, and mixed hardware environments.
- Model sourcing from multiple providers, with automatic loading into serving workloads.
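As an illustration of the first use case, deploying a service can be as small as a Playground that claims a registered model; llmaz then provisions the inference backend and a Service in front of it. This sketch reuses the OpenModel defined above and follows the shape of the upstream quick start; defaults such as the chosen backend are version-dependent.

```yaml
# Sketch only: a one-replica Playground serving the OpenModel declared earlier.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b   # must match the OpenModel's metadata.name
```

Once the backing pods are ready, applications talk to the generated Service using the backend's standard HTTP API (OpenAI-compatible for backends such as vLLM); the exact Service name is generated by the controller.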
Technical notes
- A CRD-based control plane for declarative model and service definitions (e.g., OpenModel and Playground resources).
- Integrations with model hubs and Kubernetes Secrets for private model access.
- Production-oriented features: HPA integration, Karpenter node autoscaling, and observability hooks (see the scaling sketch below).
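To ground the autoscaling note, the sketch below puts horizontal scaling bounds on a Playground. Upstream llmaz documents HPA integration, but the elasticConfig block and its field names here are assumptions to be checked against your installed CRDs; node-level capacity is then handled by a cluster autoscaler such as Karpenter.

```yaml
# Sketch only: a Playground with assumed elasticConfig scaling bounds; verify
# the field names against the inference.llmaz.io CRD you have installed.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  modelClaim:
    modelName: qwen2-0--5b
  elasticConfig:
    minReplicas: 1     # keep at least one replica warm
    maxReplicas: 3     # allow scale-out to three replicas under load
```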