Overview
DeepEP is a communication library built for expert parallelism (EP) in Mixture-of-Experts (MoE) models, designed to reduce communication latency and bandwidth overhead in large-scale distributed training. It offers a compact set of communication primitives optimized for expert-parallel and hybrid-parallel scenarios.
Key Features
- Token dispatch/combine (all-to-all) primitives tailored to expert-parallel setups, reducing redundant transfers and aggregation overhead (the sketch after this list shows the underlying pattern).
- CUDA-optimized implementations that enable asynchronous transfers and compute-communication overlap for higher throughput.
- Lightweight, integration-friendly API surface for use with PyTorch and custom training stacks.
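For orientation, here is a minimal sketch of the dispatch/combine pattern that such primitives accelerate, written with stock torch.distributed collectives rather than DeepEP's own kernels; the helper names (simple_dispatch, simple_combine) are illustrative only and are not part of DeepEP's API.

```python
# Illustrative only: the token dispatch/combine pattern behind expert-parallel
# communication, expressed with stock torch.distributed collectives. A tuned
# library replaces these generic calls with kernels specialized for this pattern.
import torch
import torch.distributed as dist


def simple_dispatch(tokens: torch.Tensor, dest_rank: torch.Tensor, world_size: int):
    """Send each token to the rank that hosts its target expert.

    tokens:    [num_tokens, hidden] activations on this rank
    dest_rank: [num_tokens] destination rank per token (from the router)
    """
    # Group tokens by destination rank so each peer receives one contiguous slab.
    order = torch.argsort(dest_rank)
    send_buf = tokens[order].contiguous()
    send_counts = torch.bincount(dest_rank, minlength=world_size)

    # Exchange per-rank token counts so every rank can size its receive buffer.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    recv_buf = tokens.new_empty(int(recv_counts.sum()), tokens.shape[1])
    dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return recv_buf, order, send_counts, recv_counts


def simple_combine(expert_out: torch.Tensor, order, send_counts, recv_counts):
    """Return expert outputs to their source ranks and restore token order."""
    combined = expert_out.new_empty(int(send_counts.sum()), expert_out.shape[1])
    dist.all_to_all_single(
        combined, expert_out.contiguous(),
        output_split_sizes=send_counts.tolist(),
        input_split_sizes=recv_counts.tolist(),
    )
    # Undo the sort applied before dispatch so outputs align with the inputs.
    restored = torch.empty_like(combined)
    restored[order] = combined
    return restored
```

Fusing this bookkeeping (count exchange, packing, buffer sizing) with the transfer itself is the kind of optimization a specialized kernel can make, which is where much of the latency saving comes from.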
Use Cases
- Efficient transfer of routed token activations and expert parameters in expert-parallel training schemes.
- Improving scalability and training efficiency across multi-GPU and multi-node clusters under constrained bandwidth.
- Acting as a drop-in replacement for the communication layer in existing training pipelines (see the interface sketch after this list).
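One way to realize the drop-in idea is to hide expert-parallel communication behind a small interface so that a tuned backend can replace a generic one without touching the model code. The EPComm protocol and the dispatch/combine/handle names below are illustrative assumptions, not DeepEP's actual API.

```python
# Hypothetical integration pattern: the MoE layer depends only on a narrow
# communication interface, so the backend (generic collectives, DeepEP, ...)
# can be swapped without changing the layer. All names here are illustrative.
from typing import Any, Protocol, Tuple

import torch


class EPComm(Protocol):
    def dispatch(self, tokens: torch.Tensor,
                 dest_rank: torch.Tensor) -> Tuple[torch.Tensor, Any]: ...
    def combine(self, expert_out: torch.Tensor, handle: Any) -> torch.Tensor: ...


class MoELayer(torch.nn.Module):
    """Simplified: one expert per rank; real deployments host several per GPU."""

    def __init__(self, local_expert: torch.nn.Module,
                 router: torch.nn.Module, comm: EPComm):
        super().__init__()
        self.local_expert = local_expert
        self.router = router
        self.comm = comm

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        dest_rank = self.router(tokens)                  # rank hosting each token's expert
        routed, handle = self.comm.dispatch(tokens, dest_rank)
        expert_out = self.local_expert(routed)           # compute stays purely local
        return self.comm.combine(expert_out, handle)     # results return to source ranks
```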
Technical Details
- Focuses on compute-communication overlap and bandwidth efficiency via packing, distribution, and asynchronous-transport strategies (an overlap sketch follows this list).
- Uses CUDA primitives and memory layouts optimized for parallel GPU-to-GPU communication.
- Composes with other parallelism strategies and accounts for deployment on diverse hardware topologies.
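As a rough illustration of the overlap idea, the sketch below uses an asynchronous torch.distributed collective so that independent compute proceeds while routed tokens are in flight. DeepEP pursues the same goal with its own CUDA kernels, so treat this as the principle rather than the library's mechanism; the function and callback names are illustrative.

```python
# Principle of compute-communication overlap with a stock asynchronous
# collective: launch the all-to-all, do unrelated work, wait only when the
# routed tokens are needed. Names and the compute callbacks are illustrative.
import torch
import torch.distributed as dist


def overlapped_moe_step(dispatch_in: torch.Tensor, dispatch_out: torch.Tensor,
                        independent_compute, dependent_compute):
    # Kick off the token exchange without blocking the caller.
    work = dist.all_to_all_single(dispatch_out, dispatch_in, async_op=True)

    # Work that does not need the routed tokens (e.g., a shared-expert FFN or
    # the attention of another micro-batch) runs while the transfer is in flight.
    partial = independent_compute()

    # Synchronize only at the point where the routed tokens are consumed.
    work.wait()
    return dependent_compute(dispatch_out, partial)
```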