Mooncake

Mooncake is a KVCache-centric disaggregated architecture for LLM serving, providing a high-performance Transfer Engine and distributed KVCache storage.

Author: kvcache-ai

Since: 2024-06-25


Mooncake separates the prefill and decode clusters and uses otherwise underutilized CPU, DRAM, and SSD resources to improve throughput and resource utilization for large-model inference. The project includes a high-performance Transfer Engine, P2P Store, and Mooncake Store, and integrates with systems such as vLLM and SGLang.
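
To make the prefill/decode split concrete, here is a minimal toy sketch in Python. It is not Mooncake's API: the store class, the key function, and the placeholder "KV" entries are all hypothetical stand-ins, chosen only to show that the prefill stage publishes a prompt's KVCache once and the decode stage reuses it rather than recomputing the prompt.

```python
import hashlib


class ToyKVCacheStore:
    """In-memory stand-in for a shared KVCache pool (hypothetical, not Mooncake Store)."""

    def __init__(self):
        self._blocks = {}

    def put(self, key, kv_block):
        self._blocks[key] = kv_block

    def get(self, key):
        return self._blocks.get(key)


def prompt_key(tokens):
    """Derive a stable lookup key from the prompt tokens."""
    text = " ".join(str(t) for t in tokens)
    return hashlib.sha256(text.encode()).hexdigest()


def prefill(tokens, store):
    """Prefill stage: build the prompt's KV entries once and publish them."""
    key = prompt_key(tokens)
    if store.get(key) is None:
        # Placeholder entries; a real model would produce attention tensors.
        store.put(key, [(f"k{t}", f"v{t}") for t in tokens])
    return key


def decode(key, store, steps):
    """Decode stage: fetch the published KV entries and extend a private copy."""
    cached = store.get(key)
    if cached is None:
        raise KeyError("KVCache missing; prefill must run first")
    kv = list(cached)  # copy so the shared entry stays unchanged
    generated = []
    for _ in range(steps):
        token = len(kv)  # dummy next-token rule, for illustration only
        kv.append((f"k{token}", f"v{token}"))
        generated.append(token)
    return generated


store = ToyKVCacheStore()
key = prefill([1, 2, 3], store)
tokens_out = decode(key, store, steps=2)
```

In a real deployment the two stages would run on different clusters and the store would be a networked service; the dictionary here only marks where that hand-off happens.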

Key features

Transfer Engine: a unified data transfer interface supporting TCP, RDMA, CXL/shared memory, and NVMe-oF, optimized for low latency and high bandwidth in AI workloads.
Mooncake Store: distributed KVCache storage for LLM inference, with multiple replicas, striping, and parallel I/O to improve large-object performance.
P2P Store: decentralized sharing of temporary objects, useful for checkpoint transfer because it avoids saturating a single node's bandwidth.
Integrations: works with vLLM, SGLang, and LMCache to enable disaggregated prefill-decode scenarios.
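
The striping and multi-replica ideas mentioned for Mooncake Store can be sketched as follows. This is an illustrative placement planner under assumed rules (fixed-size stripes, round-robin replica assignment); the function name, parameters, and node names are hypothetical and do not reflect Mooncake Store's actual placement policy or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    stripe_index: int
    offset: int    # byte offset of the stripe within the object
    length: int    # stripe length in bytes (the last stripe may be shorter)
    nodes: tuple   # nodes holding a replica of this stripe


def plan_stripes(object_size, stripe_size, nodes, replicas=2):
    """Split an object into stripes and assign each to `replicas` distinct nodes."""
    if stripe_size <= 0:
        raise ValueError("stripe_size must be positive")
    if not 1 <= replicas <= len(nodes):
        raise ValueError("replicas must be between 1 and the number of nodes")
    placements = []
    for index, offset in enumerate(range(0, object_size, stripe_size)):
        length = min(stripe_size, object_size - offset)
        # Round-robin: consecutive stripes start on consecutive nodes, so
        # reads of a large object can proceed from several nodes in parallel.
        holders = tuple(nodes[(index + r) % len(nodes)] for r in range(replicas))
        placements.append(Placement(index, offset, length, holders))
    return placements


MIB = 2**20
plan = plan_stripes(
    object_size=10 * MIB,
    stripe_size=4 * MIB,
    nodes=["node-a", "node-b", "node-c"],
    replicas=2,
)
```

With these example inputs the 10 MiB object is split into three stripes (4 MiB, 4 MiB, 2 MiB), each held by two different nodes, so losing one node still leaves a copy of every stripe.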

Use cases

Large-scale distributed online LLM inference and resource orchestration.
High-bandwidth, low-latency KVCache sharing and migration.
Research and reproduction of experiments from the Mooncake paper, using the open-sourced benchmark traces.

Technical details

Languages & bindings: primarily C++ with Python bindings and examples; optional CUDA support.
Deployment & requirements: RDMA networks are recommended for best performance; Docker images and a pip package (mooncake-transfer-engine) are available.
Performance: on high-bandwidth networks (e.g., 4×200 Gbps), Transfer Engine achieves very high transfer bandwidth and significantly outperforms TCP-based transports.
Resources: See the project website and documentation at https://kvcache-ai.github.io/Mooncake/ for more details.
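
As a rough sense of scale for the 4×200 Gbps example above, the following back-of-the-envelope calculation converts the aggregate line rate to bytes per second and derives a lower bound on transfer time. The 40 GB payload is an assumed figure chosen for illustration, and the result is a theoretical ceiling from the link speed alone, not a measured Transfer Engine number; real throughput is lower once protocol overhead is included.

```python
def aggregate_line_rate_gbytes(nics, gbps_per_nic):
    """Aggregate line rate in GB/s for `nics` links of `gbps_per_nic` Gbit/s each."""
    return nics * gbps_per_nic / 8  # 8 bits per byte


def min_transfer_seconds(payload_gb, nics, gbps_per_nic):
    """Lower bound on transfer time, ignoring protocol overhead and contention."""
    return payload_gb / aggregate_line_rate_gbytes(nics, gbps_per_nic)


peak_gbytes = aggregate_line_rate_gbytes(nics=4, gbps_per_nic=200)
best_case_seconds = min_transfer_seconds(payload_gb=40, nics=4, gbps_per_nic=200)
```

Four 200 Gbps links give a ceiling of 100 GB/s, so the assumed 40 GB payload cannot move in less than 0.4 seconds regardless of transport.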

For more information, refer to the project repository and technical report.
