Introduction
LLaMA Box (V2) is a lightweight inference server based on llama.cpp and stable-diffusion.cpp. It provides OpenAI-compatible RESTful APIs, supports both text and multimodal (image/audio) models, and runs on a variety of hardware backends (CUDA / ROCm / Apple Metal / CPU).
Key Features
- OpenAI-compatible endpoints: Supports /v1/chat/completions, /v1/embeddings, /v1/images, and more (see the curl examples after this list).
- Multi-model and multi-device: Supports GGUF models, multi-GPU sharding, RPC server mode, and remote offload.
- Multimodal support: Image generation, image understanding, and audio processing (requires the corresponding modules to be enabled).
- Inference optimization: Supports speculative decoding, KV cache, and various samplers.
- Rich tooling scripts: Built-in scripts such as chat.sh, image_generate.sh, image_edit.sh, and batch_chat for quick validation and testing.
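The requests below illustrate the OpenAI-compatible surface described above; this is a minimal sketch, and the host, port, and model names are assumptions for a locally running instance, so adjust them to match your deployment.

```bash
# Minimal sketch: call a locally running LLaMA Box through its
# OpenAI-compatible endpoints. Host, port, and model names are assumptions.

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-7b-instruct",
        "messages": [{"role": "user", "content": "Describe LLaMA Box in one sentence."}]
      }'

# Embeddings (requires an embedding-capable model to be loaded)
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "llama-box quick start"}'
```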
Use Cases
- Local or private-cloud model inference services and microservice integration, as a local replacement for the OpenAI API (see the client configuration sketch after this list).
- Distributed inference across multiple devices and model serving on resource-constrained hardware (RPC offload).
- Wrapping model capabilities as APIs for internal applications (chat, retrieval-augmented generation, image generation, etc.).
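Because the API surface mirrors OpenAI's, existing OpenAI-compatible clients can usually be redirected to a local LLaMA Box instance by overriding the base URL. The URL and placeholder key below are assumptions; recent OpenAI SDKs typically read these environment variables.

```bash
# Minimal sketch: point an existing OpenAI-compatible client at a local
# LLaMA Box instance instead of api.openai.com. The URL and key value are
# assumptions; recent OpenAI SDKs typically honor these environment variables.
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="not-needed-locally"   # many local deployments accept any non-empty key

# Internal applications (chat, RAG, image generation) can then keep calling
# the familiar /v1/* routes with no code changes beyond the base URL.
```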
Technical Highlights
- Language & Implementation: Mainly C++/Shell, built with CMake, tightly integrated with llama.cpp and stable-diffusion.cpp.
- Backend compatibility: Supports NVIDIA CUDA, AMD ROCm, Apple Metal, Intel oneAPI, and various runtimes/devices.
- Flexible configuration: Rich command-line parameters to control context size, concurrency, memory allocation, sampling strategies, and more (see the launch sketch below).
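As an illustration of that configuration surface, the launch sketch below uses llama.cpp-style flag names for the context window, concurrency, GPU offload, and sampling. The exact flags, model path, and values are assumptions, so consult `llama-box --help` for the authoritative parameter list.

```bash
# Hypothetical launch sketch: flag names mirror llama.cpp-style conventions
# (--ctx-size = context window, --parallel = concurrent request slots,
# --gpu-layers = layers offloaded to the GPU backend, --temp = sampling
# temperature). They are assumptions here; see `llama-box --help`.
llama-box \
  --port 8080 \
  -m ./models/qwen2.5-7b-instruct-q4_k_m.gguf \
  --ctx-size 8192 \
  --parallel 4 \
  --gpu-layers 99 \
  --temp 0.7
```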