
llmaz

llmaz is an inference platform for large language models on Kubernetes that simplifies model deployment, routing and autoscaling across heterogeneous clusters.

Overview

llmaz (pronounced /lima:z/) is a Kubernetes-native inference platform from InftyAI. It provides production-ready tooling and control plane components to deploy, orchestrate and serve large language models at scale.
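As a concrete sketch of that declarative workflow, the manifests below register a model and stand up an inference service against it. This is a minimal illustration based on the project's published examples; the API groups, kinds and field names (OpenModel, Playground, modelClaim) should be verified against the llmaz release you are running.

```yaml
# Minimal llmaz deployment sketch (schema per the project's examples;
# verify field names against your installed CRD versions).
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b-instruct
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct  # fetched from Hugging Face by default
---
# A Playground claims the model and rolls out serving pods for it.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b-instruct
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b-instruct
```

Once both objects are applied with kubectl, the control plane pulls the model artifacts and exposes an inference endpoint, so applications talk to the service rather than to individual backend pods.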

Key features

  • Support for many inference backends (vLLM, Text-Generation-Inference, llama.cpp, TensorRT-LLM, etc.); a backend-selection sketch follows this list.
  • Heterogeneous cluster and device support with model routing and scheduling.
  • Built-in integrations such as Open WebUI for chat, RAG and other common workflows.
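As a hedged sketch of backend selection, the Playground below pins the service to one runtime instead of the default. The backendRuntimeConfig block and the backend name string are assumptions taken from llmaz's examples and may differ between versions; check the API reference before relying on them.

```yaml
# Pin a Playground to a specific inference backend. The field names here
# (backendRuntimeConfig, backendName) are assumptions drawn from upstream
# examples; confirm them against the llmaz API reference.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-on-llamacpp
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b-instruct
  backendRuntimeConfig:
    backendName: llamacpp  # e.g. vllm (the documented default), llamacpp
```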

Use cases

  • Deploy LLM inference services on Kubernetes with standardized APIs for applications.
  • Distributed and elastic inference across GPUs/CPUs and mixed environments (see the flavor sketch after this list).
  • Multi-provider model sourcing and automatic model loading, so operators do not manage model artifacts by hand.
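One way the heterogeneous story shows up in practice is the flavor list sketched below: the model declares acceptable device profiles in priority order, and the scheduler places replicas on whichever pool has capacity. The inferenceConfig/flavors shape follows the project's examples, and the GPU names and resource counts are illustrative assumptions.

```yaml
# Device flavors in priority order, so one model definition can run on
# different accelerator pools. Flavor names and resource counts are
# illustrative; verify the flavors schema against your installed CRDs.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama3-8b-instruct
spec:
  familyName: llama3
  source:
    modelHub:
      modelID: meta-llama/Meta-Llama-3-8B-Instruct
  inferenceConfig:
    flavors:
      - name: a100        # preferred: a single A100
        limits:
          nvidia.com/gpu: 1
      - name: t4          # fallback: two T4s when A100s are scarce
        limits:
          nvidia.com/gpu: 2
```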

Technical notes

  • CRD-based control plane for declarative model and service definitions.
  • Integrations for model hubs and secret management for private model access; a Secret sketch follows this list.
  • Production-oriented features: HPA integration, Karpenter autoscaling and observability hooks.
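For private model access, the access token typically travels in an ordinary Kubernetes Secret that the model loader reads when pulling gated weights. The secret name, namespace and key below are illustrative assumptions rather than names mandated by llmaz; the project's documentation specifies the exact ones it expects.

```yaml
# An ordinary Kubernetes Secret carrying a model-hub token for gated or
# private models. Name, namespace and key are illustrative assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: modelhub-secret
  namespace: llmaz-system
type: Opaque
stringData:
  HF_TOKEN: <your-hugging-face-access-token>
```

Rotating the token then amounts to updating one Secret rather than touching every serving workload.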

Resource Info
🌱 Open Source · 🛠️ Dev Tools · 🛰️ Inference Service