DeepEP

An efficient expert-parallel communication library that provides low-overhead communication primitives for large-scale distributed training.

Author: DeepSeek

Since: 2025-02-17

GitHub

Overview

DeepEP is a communication library for expert parallelism, the scheme that spreads the experts of a Mixture-of-Experts (MoE) model across devices. It is designed to reduce communication latency and bandwidth overhead in large-scale distributed training, and it offers a compact set of communication primitives optimized for expert-parallel and hybrid-parallel scenarios.

Key Features

Communication primitives tailored for expert-parallel setups to reduce redundant transfers and aggregation overhead.
CUDA-optimized implementations that enable asynchronous transfers and compute-communication overlap for higher throughput.
Lightweight, integration-friendly API surface for use with PyTorch and custom training stacks.
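
To make "reducing redundant transfers" concrete, here is a minimal pure-Python sketch of one way an expert-parallel dispatch step could plan its sends. It is not DeepEP's API: the function names, the contiguous expert-to-rank layout, and the deduplication rule (a token is sent at most once to each rank, even if several of its selected experts live there) are assumptions chosen for illustration.

```python
from collections import defaultdict

def dispatch(token_experts, experts_per_rank):
    """Group token ids by destination rank (hypothetical helper).

    token_experts[i] lists the expert ids chosen for token i.
    Assumes expert e lives on rank e // experts_per_rank.
    Each token is listed at most once per destination rank.
    """
    per_rank = defaultdict(list)
    for token_id, experts in enumerate(token_experts):
        dest_ranks = {e // experts_per_rank for e in experts}
        for rank in sorted(dest_ranks):
            per_rank[rank].append(token_id)
    return dict(per_rank)

def naive_transfer_count(token_experts):
    """Transfers needed if a token is sent once per selected expert."""
    return sum(len(experts) for experts in token_experts)

def dedup_transfer_count(per_rank):
    """Transfers needed if a token is sent once per destination rank."""
    return sum(len(tokens) for tokens in per_rank.values())

# Four tokens, each routed to its top-2 experts; experts 0-1 sit on rank 0, 2-3 on rank 1.
token_experts = [[0, 1], [1, 2], [3, 0], [2, 3]]
plan = dispatch(token_experts, experts_per_rank=2)
print(plan)
print(naive_transfer_count(token_experts), dedup_transfer_count(plan))
```

In this toy routing, tokens 0 and 3 each pick two experts on the same rank, so sending them once per rank rather than once per expert saves two transfers.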

Use Cases

Efficient transfer of token activations to and from experts in expert-parallel training schemes.
Improving scalability and training efficiency across multi-GPU and multi-node clusters under constrained bandwidth.
Serving as a drop-in replacement communication layer to optimize existing training pipelines.

Technical Details

Focuses on compute-communication overlap and bandwidth efficiency via packing, distribution, and asynchronous transport strategies.
Uses CUDA primitives and memory layouts optimized for GPU communication parallelism.
Supports composable parallel strategies and takes diverse hardware topologies into account for deployment.
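
The idea of compute-communication overlap can be sketched without a GPU. The example below is an illustration only, not DeepEP code: `fake_transfer` and `fake_expert_compute` are hypothetical stand-ins, and a background thread plays the role that an asynchronous transport (for example, a separate CUDA stream) would play in a real system. While one chunk is being processed, the transfer of the next chunk is already in flight.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_transfer(chunk):
    """Stand-in for an asynchronous receive; the sleep mimics link latency."""
    time.sleep(0.01)
    return list(chunk)

def fake_expert_compute(chunk):
    """Stand-in for expert computation on a received chunk."""
    return [2 * x for x in chunk]

def overlapped_pipeline(chunks):
    """Process chunks in order while prefetching the next one in the background."""
    if not chunks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        in_flight = pool.submit(fake_transfer, chunks[0])
        for next_chunk in chunks[1:]:
            received = in_flight.result()                       # wait for the current chunk
            in_flight = pool.submit(fake_transfer, next_chunk)  # start the next transfer
            results.append(fake_expert_compute(received))       # compute while it runs
        results.append(fake_expert_compute(in_flight.result()))
    return results

chunks = [[1, 2], [3, 4], [5, 6]]
outputs = overlapped_pipeline(chunks)
print(outputs)
```

The output order matches the input order because each result is awaited before its chunk is processed; only the waiting time of transfer i+1 is hidden behind the compute of chunk i.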

Related Resources

DeepSeek-OCR

FlashMLA

EPLB