LoRAX

A high-performance LoRA inference server that supports on-demand loading of thousands of fine-tuned adapters for production deployment.

Author: Predibase

Since: 2023-10-20

Visit Website GitHub

Overview

LoRAX (LoRA eXchange) is a LoRA-focused inference server that supports dynamic loading and merging of adapters, enabling efficient inference for thousands of fine-tuned models in GPU/CPU mixed environments.

Key features

Dynamic adapter loading: load adapters from Hugging Face, Predibase, or local files on demand and merge adapters per request.
High throughput and low latency: asynchronous batching, adapter prefetch/offload, and CUDA performance optimizations (flash-attention, paged attention).
Production-ready: prebuilt Docker images, Helm charts, Prometheus metrics, and distributed tracing support.

Use cases

Unified inference and management platform for many fine-tuned models in multi-tenant or personalization scenarios.
Cost-effective online serving when needing to support many adapters or task-specific models concurrently.

Technical notes

Supports FP16 and multiple quantization backends (bitsandbytes, GPT-Q, AWQ) and is compatible with mainstream base models.
Provides an OpenAI-compatible API and a Python client, supports token streaming and structured JSON outputs.

LoRAX

Overview

Key features

Use cases

Technical notes

Resource Info

Related Resources

Kata Containers

Golem

Aspire