
NCCL

High-performance collective communication primitives for GPUs, optimized for PCIe, NVLink, NVSwitch and RDMA networks.

Introduction

NCCL (NVIDIA Collective Communication Library) provides high-performance collective communication primitives for GPUs, including all-reduce, all-gather, reduce, broadcast and reduce-scatter, as well as point-to-point patterns. It is optimized for high bandwidth across PCIe, NVLink, NVSwitch and RDMA-based networks, enabling efficient data exchange and model-parallel communication across single-node and multi-node GPU configurations.
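For orientation, the sketch below shows the common single-process pattern with the public C API (ncclCommInitAll plus ncclAllReduce): one host thread drives every visible GPU and sums a buffer across all of them. The GPU cap, buffer size, and dummy input data are illustrative assumptions, and error handling is omitted for brevity.

```c
/* Minimal single-process all-reduce across all visible GPUs.
 * A sketch only: error handling is omitted and sizes are illustrative. */
#include <nccl.h>
#include <cuda_runtime.h>

#define MAX_GPUS 8   /* cap for the fixed-size arrays below (assumption) */

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  if (ndev > MAX_GPUS) ndev = MAX_GPUS;

  ncclComm_t comms[MAX_GPUS];
  cudaStream_t streams[MAX_GPUS];
  float *sendbuf[MAX_GPUS], *recvbuf[MAX_GPUS];
  const size_t count = 1 << 20;              /* 1M floats per GPU */

  /* Allocate device buffers and a stream per GPU. */
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void**)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuf[i], count * sizeof(float));
    cudaMemset(sendbuf[i], 0, count * sizeof(float));  /* dummy input */
    cudaStreamCreate(&streams[i]);
  }

  /* One communicator per GPU, all owned by this process. */
  ncclCommInitAll(comms, ndev, NULL);

  /* Sum sendbuf across all GPUs; every GPU ends up with the result.
   * The group wrapper lets one thread issue the call on every comm
   * without deadlocking. */
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  /* Wait for completion, then clean up. */
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```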

Key features

  • High-bandwidth communication tuned for GPU interconnects (PCIe, NVLink, NVSwitch, RDMA).
  • Comprehensive collective primitives (all-reduce, all-gather, reduce, broadcast, reduce-scatter) plus point-to-point send/receive.
  • Scales to an arbitrary number of GPUs; supports both single-process and multi-process (e.g., MPI) workflows (see the initialization sketch after this list).
  • Companion examples and test suites (e.g., nccl-tests) for validation and benchmarking, plus straightforward build scripts for packaging.
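The multi-process path referenced above typically bootstraps with ncclGetUniqueId and ncclCommInitRank. The sketch below assumes MPI is used only to distribute the unique id across ranks and that each rank drives one GPU, which is a common but not mandatory mapping.

```c
/* One-GPU-per-process initialization; MPI distributes the NCCL unique id.
 * A sketch: the rank-to-device mapping and 8-GPU node size are assumptions. */
#include <nccl.h>
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  /* Rank 0 creates the unique id; everyone else receives it via MPI. */
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  /* Bind each rank to one GPU (simple modulo mapping, assumed 8 GPUs/node). */
  cudaSetDevice(rank % 8);

  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  /* ... collectives on `comm` go here ... */

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```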

Use cases

  • Distributed training: as a low-level communication layer for gradient aggregation and parameter synchronization in data/model parallel training.
  • Multi-GPU inference: coordinate data movement for model-parallel or distributed inference at scale (see the point-to-point sketch after this list).
  • High-performance computing: scientific and engineering workloads that require low-latency, high-throughput GPU communication.
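For the model-parallel inference case, NCCL's point-to-point calls (ncclSend and ncclRecv, available since NCCL 2.7) can move activations between ranks. The helper below is a hypothetical sketch; the communicator, stream, buffers, and peer ranks are assumed to have been created as in the earlier initialization examples.

```c
/* Hypothetical helper: exchange activations between pipeline stages.
 * Grouping the send and recv lets NCCL schedule them without deadlocking. */
#include <nccl.h>
#include <cuda_runtime.h>

void exchange_activations(const float* send_act, float* recv_act, size_t count,
                          int next_rank, int prev_rank,
                          ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  ncclSend(send_act, count, ncclFloat, next_rank, comm, stream);
  ncclRecv(recv_act, count, ncclFloat, prev_rank, comm, stream);
  ncclGroupEnd();

  /* Block until the exchange has completed on this stream. */
  cudaStreamSynchronize(stream);
}
```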

Technical characteristics

  • CUDA-native implementation with optimizations tuned to the GPU interconnect topology.
  • Topology-aware routing to exploit NVLink/NVSwitch when available.
  • Lightweight C/C++ API and Make/CMake-based build and packaging.
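As a small illustration of that C API surface, the snippet below queries the library version and wraps calls in an error-checking macro; the NCCLCHECK name is a common convention in sample code, not something NCCL defines itself.

```c
/* Illustrative error-handling convention around the NCCL C API. */
#include <nccl.h>
#include <stdio.h>
#include <stdlib.h>

#define NCCLCHECK(call) do {                                  \
  ncclResult_t res = (call);                                  \
  if (res != ncclSuccess) {                                   \
    fprintf(stderr, "NCCL error %s:%d: %s\n",                 \
            __FILE__, __LINE__, ncclGetErrorString(res));     \
    exit(EXIT_FAILURE);                                       \
  }                                                           \
} while (0)

int main(void) {
  int version = 0;
  NCCLCHECK(ncclGetVersion(&version));  /* single integer encoding of major/minor/patch */
  printf("NCCL version code: %d\n", version);
  return 0;
}
```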

Resource Info
🌱 Open Source AI Kernel Library 🔮 Inference