NCCL

High-performance collective communication primitives for GPUs, optimized for PCIe, NVLink, NVSwitch and RDMA networks.

Author: NVIDIA

Since: 2015-06-01

Introduction

NCCL (NVIDIA Collective Communications Library) implements collective communication primitives for GPUs: all-reduce, all-gather, reduce, broadcast and reduce-scatter, plus point-to-point send/receive patterns. It is tuned for high bandwidth over PCIe, NVLink, NVSwitch and RDMA-based networks, so GPUs can exchange data efficiently in both single-node and multi-node configurations.
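To make the five collectives named above concrete, the following is a minimal pure-Python sketch of their semantics only. It is not NCCL's API or implementation: each "rank" is a plain list standing in for one GPU buffer, and the function names are illustrative assumptions chosen to mirror the operation names.

```python
from functools import reduce as fold
from operator import add

# Semantics-only model: bufs[r] is the send buffer of rank r (one list per GPU).
# All function names here are hypothetical, not NCCL calls.

def all_reduce(bufs, op=add):
    """Every rank ends up with the element-wise reduction of all buffers."""
    result = [fold(op, column) for column in zip(*bufs)]
    return [list(result) for _ in bufs]

def reduce(bufs, root=0, op=add):
    """Only the root receives the reduction; other ranks are modelled as None."""
    result = [fold(op, column) for column in zip(*bufs)]
    return [result if rank == root else None for rank in range(len(bufs))]

def broadcast(bufs, root=0):
    """Every rank receives a copy of the root's buffer."""
    return [list(bufs[root]) for _ in bufs]

def all_gather(bufs):
    """Every rank receives the concatenation of all buffers, in rank order."""
    gathered = [x for buf in bufs for x in buf]
    return [list(gathered) for _ in bufs]

def reduce_scatter(bufs, op=add):
    """The reduction is split into equal chunks; rank r receives chunk r."""
    n = len(bufs)
    result = [fold(op, column) for column in zip(*bufs)]
    chunk = len(result) // n
    return [result[r * chunk:(r + 1) * chunk] for r in range(n)]

if __name__ == "__main__":
    bufs = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four elements each
    print(all_reduce(bufs))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
    print(reduce_scatter(bufs))  # [[11, 22], [33, 44]]
    print(all_gather([[1, 2], [3, 4]]))  # [[1, 2, 3, 4], [1, 2, 3, 4]]
```

Note that reduce-scatter followed by all-gather produces the same result as all-reduce, which is why the three are often discussed together.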

Key features

High-bandwidth communication optimized for GPU interconnects.
A full set of collective and point-to-point primitives for distributed training and communication.
Scales to an arbitrary number of GPUs, in single-process or multi-process (for example, MPI) applications.
Companion test suite (nccl-tests), integration examples, and build scripts for packaging.
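Bandwidth figures from a benchmark such as nccl-tests are easier to compare when they are normalized per collective. The sketch below computes an "algorithm bandwidth" (bytes divided by time) and a "bus bandwidth" scaled by a per-collective factor. The factors follow the convention described in the nccl-tests performance notes as understood here; treat them, and the hypothetical function name `bus_bandwidth_gbs`, as assumptions to check against that documentation.

```python
def bus_bandwidth_gbs(collective, nbytes, seconds, nranks):
    """Return (algorithm bandwidth, bus bandwidth) in GB/s.

    Assumed correction factors (to be verified against nccl-tests docs):
    all-reduce moves 2*(n-1)/n of the data per rank, all-gather and
    reduce-scatter move (n-1)/n, broadcast and reduce use a factor of 1.
    """
    factors = {
        "all_reduce": 2 * (nranks - 1) / nranks,
        "all_gather": (nranks - 1) / nranks,
        "reduce_scatter": (nranks - 1) / nranks,
        "broadcast": 1.0,
        "reduce": 1.0,
    }
    alg_bw = nbytes / seconds / 1e9  # plain size over time
    return alg_bw, alg_bw * factors[collective]

if __name__ == "__main__":
    # Hypothetical measurement: a 4 GB all-reduce over 8 ranks taking 0.5 s.
    alg, bus = bus_bandwidth_gbs("all_reduce", 4e9, 0.5, 8)
    print(alg, bus)  # 8.0 14.0
```

The point of the scaling is that bus bandwidth can be compared against the hardware link speed regardless of how many ranks took part.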

Use cases

Distributed training: serves as the low-level communication layer for gradient aggregation and parameter synchronization in data-parallel and model-parallel training.
Multi-GPU inference: coordinates data movement for model-parallel or distributed inference at scale.
High-performance computing: supports scientific and engineering workloads that require low-latency, high-throughput GPU communication.
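The gradient-aggregation use case above amounts to an all-reduce followed by a division by the number of workers. The sketch below simulates that with a ring all-reduce, a well-known algorithm for this operation, in plain Python. It is an illustration under stated assumptions, not NCCL's implementation: which algorithm NCCL selects for a given topology is internal to the library, and `ring_all_reduce` and `average_gradients` are hypothetical names.

```python
def ring_all_reduce(bufs):
    """Sum buffers across ranks by passing chunks around a ring.

    bufs[r] is rank r's buffer; its length must be divisible by the rank count.
    """
    n = len(bufs)
    size = len(bufs[0])
    assert size % n == 0, "buffer length must be divisible by the rank count"
    chunk = size // n
    data = [list(b) for b in bufs]  # work on copies

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1 (reduce-scatter): after n-1 steps, rank r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step are simultaneous.
        sends = [((r - step) % n, None) for r in range(n)]
        sends = [(c, [data[r][i] for i in span(c)]) for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] += v

    # Phase 2 (all-gather): circulate the finished chunks to every rank.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, None) for r in range(n)]
        sends = [(c, [data[r][i] for i in span(c)]) for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] = v
    return data

def average_gradients(grads):
    """Data-parallel gradient averaging: all-reduce (sum), then divide."""
    n = len(grads)
    return [[v / n for v in buf] for buf in ring_all_reduce(grads)]

if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]  # 3 workers
    print(average_gradients(grads)[0])  # [4.0, 5.0, 6.0]
```

Each rank sends and receives only one chunk per step, so the per-rank traffic stays roughly constant as the number of ranks grows, which is the property that makes ring-style schedules attractive for large gradient buffers.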

Technical characteristics

GPU-centric optimizations for CUDA and interconnect topologies.
Topology-aware routing to exploit NVLink/NVSwitch when available.
Lightweight C/C++ API and Make/CMake-based build and packaging.
