
llm-d

A Kubernetes-native distributed inference stack providing well‑lit paths for high-performance LLM serving across diverse accelerators.

Overview

llm-d is a Kubernetes-native distributed inference stack that offers tested “well‑lit paths” for serving large generative models at scale. It integrates vLLM, Inference Gateway, and optimized routing and scheduling to reduce time-to-first-token and improve throughput across multi-vendor accelerators.
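
As a concrete illustration of what "reduce time-to-first-token" means in practice, the sketch below streams a completion from the OpenAI-compatible API that a vLLM-backed deployment exposes behind the gateway and times the first streamed chunk. The gateway URL and model name are placeholder assumptions, not values taken from the project.

```python
# Minimal sketch: measure time-to-first-token against an OpenAI-compatible
# completions endpoint (as served by vLLM) reached through the gateway.
# GATEWAY_URL and MODEL are illustrative assumptions.
import json
import time

import requests

GATEWAY_URL = "http://llm-d-gateway.example.internal/v1/completions"  # assumed address
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model name

payload = {
    "model": MODEL,
    "prompt": "Explain KV-cache-aware routing in one sentence.",
    "max_tokens": 64,
    "stream": True,  # stream tokens so the first one can be timed
}

start = time.monotonic()
first_token_at = None

with requests.post(GATEWAY_URL, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # server-sent events: each payload line is prefixed with "data: "
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        if first_token_at is None:
            first_token_at = time.monotonic()
        print(json.loads(chunk)["choices"][0]["text"], end="", flush=True)

print(f"\ntime to first token: {first_token_at - start:.3f}s")
```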

Key Features

  • Intelligent scheduling that is cache- and workload-aware to maximize KV cache utilization (a toy scoring sketch follows this list).
  • Disaggregated serving patterns (prefill/decode) to reduce latency and improve predictability.
  • Multi-accelerator support, with production-ready Helm charts and deployment guides.
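
To make the cache-aware idea concrete, here is a toy scorer that prefers the replica whose cached prefix blocks overlap most with an incoming prompt, discounted by current load. It is a simplified sketch of the general technique, not llm-d's actual scheduler; the block size, hashing scheme, and replica bookkeeping are assumptions.

```python
# Toy illustration of KV-cache-aware endpoint picking: prefer the replica whose
# cached prefix blocks overlap most with the incoming prompt, minus a load penalty.
# Simplified sketch only; block size and bookkeeping are assumed.
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed)


@dataclass
class Replica:
    name: str
    load: int = 0                                     # in-flight requests
    cached_blocks: set = field(default_factory=set)   # hashes of cached prefix blocks


def prefix_block_hashes(tokens: list[int]) -> list[int]:
    """Hash each aligned prompt prefix so equal prefixes map to equal hashes."""
    return [
        hash(tuple(tokens[: i + BLOCK_SIZE]))
        for i in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE)
    ]


def pick_replica(replicas: list[Replica], prompt_tokens: list[int]) -> Replica:
    """Score = cache overlap minus a load penalty; highest score wins."""
    wanted = prefix_block_hashes(prompt_tokens)

    def score(r: Replica) -> float:
        overlap = sum(1 for h in wanted if h in r.cached_blocks)
        return overlap - 0.5 * r.load

    return max(replicas, key=score)


# Example: replica "a" already holds the prompt's prefix blocks, so it wins
# despite carrying slightly more load.
prompt = list(range(40))
a = Replica("a", load=2, cached_blocks=set(prefix_block_hashes(prompt)[:2]))
b = Replica("b", load=1)
print(pick_replica([a, b], prompt).name)  # -> "a"
```

The load penalty reflects the "workload-aware" half of the bullet above: cache reuse alone is not enough if the cache-warm replica is already saturated.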

Use Cases

  • High-throughput, low-latency online LLM serving and conversational interfaces.
  • Large-scale batch inference and embedding pipelines (a minimal batch client sketch follows this list).
  • Research and benchmarking of distributed inference strategies and cache-aware routing.
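
For the batch-inference use case, a minimal client-side sketch shows the shape of such a pipeline: fan prompts out over the same OpenAI-compatible completions endpoint with bounded concurrency. The gateway URL, model name, and concurrency level are illustrative assumptions.

```python
# Minimal sketch of a batch inference pass: fan prompts out to the
# OpenAI-compatible completions endpoint with a small thread pool.
# GATEWAY_URL, MODEL, and max_workers are assumptions to tune per deployment.
from concurrent.futures import ThreadPoolExecutor

import requests

GATEWAY_URL = "http://llm-d-gateway.example.internal/v1/completions"  # assumed
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed

prompts = [f"Summarize document {i} in one line." for i in range(8)]


def complete(prompt: str) -> str:
    resp = requests.post(
        GATEWAY_URL,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 64},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]


with ThreadPoolExecutor(max_workers=4) as pool:
    for prompt, completion in zip(prompts, pool.map(complete, prompts)):
        print(prompt, "->", completion.strip())
```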

Technical Details

  • Integrates with vLLM and the Inference Gateway (IGW), leveraging high-performance transports (e.g., NIXL) for inter-component communication.
  • Provides Helm charts, guides, and reproducible examples for quick production adoption.
  • Maintains active documentation and a CI-driven engineering workflow to support multiple deployment scales.

Resource Info

Open-source ML inference platform