Introduction
LLaMA Box (V2) is a lightweight inference server based on llama.cpp and stable-diffusion.cpp. It provides OpenAI-compatible RESTful APIs, supports both text and multimodal (image/audio) models, and runs on a variety of hardware backends (CUDA / ROCm / Apple Metal / CPU).
Key Features
- OpenAI-compatible endpoints: Supports /v1/chat/completions, /v1/embeddings, /v1/images, and more (see the curl examples after this list).
- Multi-model and multi-device: Supports GGUF models, multi-GPU sharding, RPC server mode, and remote offload.
- Multimodal support: Image generation, image understanding, and audio processing (requires the corresponding modules to be enabled).
- Inference optimization: Supports speculative decoding, KV cache, and various samplers.
- Rich tooling scripts: Built-in scripts such as chat.sh, image_generate.sh, image_edit.sh, and batch_chat for quick validation and testing.
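The requests below illustrate the OpenAI-compatible surface described above; this is a minimal sketch, and the host, port, and model names are assumptions for a locally running instance, so adjust them to match your deployment.

```bash
# Minimal sketch: call a locally running LLaMA Box through its
# OpenAI-compatible endpoints. Host, port, and model names are assumptions.

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-7b-instruct",
        "messages": [{"role": "user", "content": "Describe LLaMA Box in one sentence."}]
      }'

# Embeddings (requires an embedding-capable model to be loaded)
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "llama-box quick start"}'
```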
Use Cases
- Local or private-cloud model inference services and microservice integration, as a local replacement for the OpenAI API (see the client configuration sketch after this list).
- Distributed inference across multiple devices and model serving on resource-constrained hardware (RPC offload).
- Wrapping model capabilities as APIs for internal applications (chat, retrieval-augmented generation, image generation, etc.).
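Because the API surface mirrors OpenAI's, existing OpenAI-compatible clients can usually be redirected to a local LLaMA Box instance by overriding the base URL. The URL and placeholder key below are assumptions; recent OpenAI SDKs typically read these environment variables.

```bash
# Minimal sketch: point an existing OpenAI-compatible client at a local
# LLaMA Box instance instead of api.openai.com. The URL and key value are
# assumptions; recent OpenAI SDKs typically honor these environment variables.
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="not-needed-locally"   # many local deployments accept any non-empty key

# Internal applications (chat, RAG, image generation) can then keep calling
# the familiar /v1/* routes with no code changes beyond the base URL.
```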
Technical Highlights
- Language & Implementation: Mainly C++/Shell, built with CMake, tightly integrated with llama.cpp and stable-diffusion.cpp.
- Backend compatibility: Supports NVIDIA CUDA, AMD ROCm, Apple Metal, Intel oneAPI, and various runtimes/devices.
- Flexible configuration: Rich command-line parameters to control context size, concurrency, memory allocation, sampling strategies, and more (see the launch sketch below).
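As an illustration of that configuration surface, the launch sketch below uses llama.cpp-style flag names for the context window, concurrency, GPU offload, and sampling. The exact flags, model path, and values are assumptions, so consult `llama-box --help` for the authoritative parameter list.

```bash
# Hypothetical launch sketch: flag names mirror llama.cpp-style conventions
# (--ctx-size = context window, --parallel = concurrent request slots,
# --gpu-layers = layers offloaded to the GPU backend, --temp = sampling
# temperature). They are assumptions here; see `llama-box --help`.
llama-box \
  --port 8080 \
  -m ./models/qwen2.5-7b-instruct-q4_k_m.gguf \
  --ctx-size 8192 \
  --parallel 4 \
  --gpu-layers 99 \
  --temp 0.7
```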