Overview
llama.cpp is a portable C/C++ library for LLM inference that runs large language models locally or in the cloud, on CPUs, GPUs and other accelerators. It supports the GGUF model format and multiple quantization schemes, and ships tools for serving, benchmarking and running models.
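To give a feel for how the library is used from C/C++, here is a minimal sketch of loading a GGUF model through the llama.h C API. The function names shown (llama_backend_init, llama_load_model_from_file, llama_new_context_with_model, and friends) exist in the public header but have shifted across releases, so treat this as an illustrative sketch rather than a pinned-version example.

```cpp
// Minimal llama.cpp usage sketch. Exact symbol names and signatures
// vary between releases of llama.h; adjust to the version you build against.
#include "llama.h"
#include <cstdio>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();  // initialize backends once per process

    llama_model_params mparams = llama_model_default_params();
    // mparams.n_gpu_layers = 99;  // offload layers if a GPU backend is compiled in

    llama_model *model = llama_load_model_from_file(argv[1], mparams);
    if (!model) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    llama_context *ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize a prompt, call llama_decode(), and sample tokens here ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```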
Key features
- Minimal dependencies and portable C/C++ implementation.
- Broad backend support: AVX/NEON/AMX (CPU), CUDA, HIP, Metal, Vulkan, MUSA.
- Multiple quantization options and GGUF compatibility.
- OpenAI-compatible llama-server and command-line utilities (llama-cli, llama-bench, llama-run); see the example request after this list.
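Because llama-server exposes an OpenAI-compatible chat-completions endpoint, any ordinary HTTP client can talk to it. The sketch below posts a request with libcurl; it assumes a server is already running locally on the default port 8080 and that libcurl is installed, neither of which is part of llama.cpp itself.

```cpp
// Post a chat request to a locally running llama-server (assumed at localhost:8080).
#include <curl/curl.h>
#include <iostream>
#include <string>

// Append response bytes into a std::string.
static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata) {
    static_cast<std::string *>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    const char *body =
        "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}";

    std::string response;
    curl_slist *headers = curl_slist_append(nullptr, "Content-Type: application/json");

    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8080/v1/chat/completions");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    CURLcode res = curl_easy_perform(curl);
    if (res == CURLE_OK) std::cout << response << std::endl;
    else                 std::cerr << curl_easy_strerror(res) << std::endl;

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```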
Use cases
- Local experimentation and offline inference.
- Private on-premise deployment for data-sensitive scenarios.
- Benchmarking and research on different backends and quantization setups.
Technical notes
- Implementation: primarily C/C++ with auxiliary Python tooling.
- Models & format: native GGUF support and conversion/quantization scripts in the repo.
- Extensibility: modular tools, extensive CLI options, RPC server, KV cache, speculative decoding.