llmaz

llmaz is an advanced inference platform for large language models on Kubernetes that simplifies model deployment, routing and autoscaling across heterogeneous clusters.

Author: InftyAI

Since: 2023-11-20


Overview

llmaz (pronounced /lima:z/) is a Kubernetes-native inference platform from InftyAI. It provides production-ready tooling and control plane components to deploy, orchestrate and serve large language models at scale.

Key features

Support for many inference backends (vLLM, Text-Generation-Inference, llama.cpp, TensorRT-LLM, etc.).
Heterogeneous cluster and device support with model routing and scheduling.
Built-in integrations such as Open WebUI for chat, RAG and other common workflows.

Use cases

Deploy LLM inference services on Kubernetes with standardized APIs for applications.
Run distributed, elastic inference across GPUs, CPUs and mixed environments.
Source models from multiple providers and have them loaded automatically.

Technical notes

CRD-based control plane for declarative model and service definitions.
Integrations for model hubs and secret management for private model access.
Production-oriented features: HPA integration, Karpenter autoscaling and observability hooks.
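
To make the CRD-based, declarative style concrete, here is a minimal sketch in Python that builds two manifests and prints them as JSON that kubectl can read. The resource kinds (OpenModel, Playground), the API groups, and every field name are assumptions recalled from the project's public examples and may not match the installed version, so check them against the CRDs in your cluster. The model opt-125m and the helper function names are purely illustrative.

```python
import json


def open_model(name, family, model_id):
    """Build an OpenModel-style manifest.

    ASSUMPTION: kind, apiVersion and spec field names are recalled from
    llmaz examples and are not confirmed by this page.
    """
    return {
        "apiVersion": "llmaz.io/v1alpha1",
        "kind": "OpenModel",
        "metadata": {"name": name},
        "spec": {
            "familyName": family,
            # Where the weights come from (a model hub ID in this sketch).
            "source": {"modelHub": {"modelID": model_id}},
        },
    }


def playground(name, model_name, replicas=1):
    """Build a Playground-style manifest that claims a model by name.

    ASSUMPTION: same caveat as above; field names may differ by version.
    """
    return {
        "apiVersion": "inference.llmaz.io/v1alpha1",
        "kind": "Playground",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "modelClaim": {"modelName": model_name},
        },
    }


def render(*manifests):
    """Wrap manifests in a v1 List so they can be applied in one call."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": list(manifests)},
        indent=2,
    )


if __name__ == "__main__":
    model = open_model("opt-125m", "opt", "facebook/opt-125m")
    service = playground("opt-125m", model["metadata"]["name"])
    print(render(model, service))
```

Saved as a script (the filename manifests.py is hypothetical), the output can be piped to the cluster with: python manifests.py | kubectl apply -f -. Keeping the model definition separate from the serving definition mirrors the declarative split described above: one resource says what the model is, the other says how it is served.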

