LLaMA Box (V2)

LLaMA Box is an inference service based on llama.cpp that provides an OpenAI-compatible API and supports multi-model and multi-device deployment as well as image generation.

Introduction

LLaMA Box is a lightweight inference server (V2) based on llama.cpp and stable-diffusion.cpp. It provides OpenAI-compatible RESTful APIs, supports both text and multimodal (image/audio) models, and can run on various hardware backends (CUDA / ROCm / Apple Metal / CPU).
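
For a first run, here is a minimal launch-and-query sketch. The model path, model name, and tuning flags are placeholders; llama-box inherits most of its CLI conventions from llama.cpp's server, so confirm the exact flags with llama-box --help on your build.

    # Start the server with a local GGUF model (path and flags are illustrative).
    llama-box --host 0.0.0.0 --port 8080 \
      -m ./models/qwen2.5-7b-instruct-q4_k_m.gguf \
      -c 8192 -np 4

    # In another terminal, query the OpenAI-compatible chat endpoint.
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen2.5-7b-instruct", "messages": [{"role": "user", "content": "Hello!"}]}'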

Key Features

  • OpenAI-compatible endpoints: Supports /v1/chat/completions, /v1/embeddings, /v1/images, and more.
  • Multi-model and multi-device: Supports GGUF models, multi-GPU sharding, RPC server mode, and remote offload.
  • Multimodal support: Image generation, image understanding, and audio processing (requires enabling the corresponding modules; an image-generation sketch follows this list).
  • Inference optimization: Supports speculative decoding, KV caching, and a variety of samplers.
  • Rich tooling scripts: Built-in scripts such as chat.sh, image_generate.sh, image_edit.sh, and batch_chat.sh for quick validation and testing.
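
As a sketch of the image endpoint mentioned above: the request below follows the OpenAI images API shape that the server mirrors, but treat the exact route and field names as assumptions to verify against the project README. The server must be started with a stable-diffusion.cpp image model loaded.

    # Generate one image via the OpenAI-style images endpoint
    # (route and fields assumed from the OpenAI images API).
    curl http://localhost:8080/v1/images/generations \
      -H "Content-Type: application/json" \
      -d '{"prompt": "a watercolor lighthouse at dawn", "n": 1, "size": "512x512", "response_format": "b64_json"}'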

Use Cases

  • Local or private-cloud model inference services and microservice integration, as a drop-in local replacement for the OpenAI API (see the sketch after this list).
  • Distributed inference across multiple devices and model serving on resource-constrained hardware (RPC offload).
  • Wrapping model capabilities as APIs for internal applications (chat, retrieval-augmented generation, image generation, etc.).
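
Because the API surface mirrors OpenAI's, existing OpenAI SDKs and tools can usually be redirected to a local llama-box instance by overriding the base URL. The environment variables below are the ones honored by the official OpenAI clients; the model name is a placeholder.

    # Redirect OpenAI clients to the local server instead of api.openai.com.
    export OPENAI_BASE_URL="http://localhost:8080/v1"
    export OPENAI_API_KEY="unused-locally"   # local servers typically ignore the key

    # Example: request embeddings for retrieval-augmented generation.
    curl "$OPENAI_BASE_URL/embeddings" \
      -H "Content-Type: application/json" \
      -d '{"model": "my-embedding-model", "input": "hello world"}'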

Technical Highlights

  • Language & Implementation: Mainly C++/Shell, built with CMake, tightly integrated with llama.cpp and stable-diffusion.cpp.
  • Backend compatibility: Supports NVIDIA CUDA, AMD ROCm, Apple Metal, Intel oneAPI, and various runtimes/devices.
  • Flexible configuration: Rich command-line parameters control context size, concurrency, memory allocation, sampling strategies, and more (see the sketch below).
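
As a rough illustration of how those knobs surface on the command line (flag names follow llama.cpp conventions, which llama-box builds on; confirm with llama-box --help):

    # -c     context window size in tokens
    # -np    number of parallel request slots (concurrency)
    # -ngl   number of model layers to offload to the GPU backend
    # --temp / --top-p   default sampling parameters
    llama-box -m ./models/model.gguf -c 16384 -np 8 -ngl 99 --temp 0.7 --top-p 0.9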

Resource Info

  • Author: gpustack
  • Added: 2025-09-27
  • Open source since: 2024-06-19
  • Tags: Open Source, Inference, Inference Service, Dev Tools, CLI, Image Generation