Detailed Introduction
PatrickStar is a PyTorch-based parallel training framework for large pretrained models (PTMs). It uses chunk-based dynamic memory management and a heterogeneous training strategy that offloads data not needed by the current computation to CPU memory, allowing larger models to be trained on fewer GPUs and reducing the risk of out-of-memory (OOM) errors. Maintained by Tencent's NLP / WeChat AI teams, PatrickStar aims to democratize access to large-scale model training.
Main Features
- Chunk-based dynamic memory scheduling: activations and parameters are managed in chunks and scheduled by computation window, lowering GPU memory usage.
- Heterogeneous offloading: data not needed for the current computation is moved to CPU memory, so training can draw on CPU and GPU memory together.
- Efficient communication and scalability: optimized collective operations for multi-GPU and multi-node setups.
- PyTorch compatibility: the configuration style is similar to DeepSpeed, easing migration; see the initialization sketch after this list.
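A minimal training-loop sketch of this DeepSpeed-like configuration style, based on the usage example in the upstream README. The exact config keys and the `initialize_engine` signature may vary between versions; `MyModel` is a placeholder for any `torch.nn.Module`, and distributed process-group setup (normally handled by the launcher) is omitted.

```python
import torch
from patrickstar.runtime import initialize_engine

# Placeholder model; any torch.nn.Module works here.
class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1024, 1024)

    def forward(self, x):
        return self.linear(x).sum()

# DeepSpeed-style configuration dict: optimizer, fp16, and chunk size.
config = {
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-3, "betas": (0.9, 0.999), "eps": 1e-6, "weight_decay": 0},
    },
    "fp16": {"enabled": True, "loss_scale": 0, "initial_scale_power": 10},
    "default_chunk_size": 64 * 1024 * 1024,  # elements per chunk
}

# The engine wraps the model and optimizer and takes over memory placement.
model, optimizer = initialize_engine(
    model_func=lambda: MyModel(), local_rank=0, config=config
)

for step in range(10):
    batch = torch.randn(8, 1024, device=torch.cuda.current_device())
    optimizer.zero_grad()
    loss = model(batch)
    model.backward(loss)  # engine-managed backward instead of loss.backward()
    optimizer.step()
```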
Use Cases
Suitable for pretraining and large-scale fine-tuning, especially when hardware is constrained and teams need to train models with tens to hundreds of billions of parameters. Also useful for benchmarking, framework research, and academic research or teaching environments.
Technical Features
PatrickStar implements chunk-based memory management with runtime dynamic scheduling: only the chunks needed for the current computation are kept on the GPU, while the rest are migrated asynchronously to CPU memory. It also optimizes collective communication for multi-GPU efficiency and provides benchmarks and examples for V100/A100 clusters. The project is released under the BSD 3-Clause license.
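To make the chunk idea concrete, here is a toy sketch (not PatrickStar's internal code) of packing parameter tensors into fixed-size chunks and moving whole chunks between CPU and GPU on demand; the `Chunk` class and the 1M-element capacity are illustrative assumptions.

```python
import torch

class Chunk:
    """Fixed-size buffer holding several flattened parameter tensors.
    Toy illustration of chunk-based management, not PatrickStar internals."""

    def __init__(self, capacity: int):
        self.buffer = torch.zeros(capacity)  # starts resident in CPU memory
        self.offset = 0
        self.views = []  # (start, numel) for each tensor packed into this chunk

    def append(self, tensor: torch.Tensor) -> bool:
        n = tensor.numel()
        if self.offset + n > self.buffer.numel():
            return False  # chunk full; caller opens a new chunk
        self.buffer[self.offset:self.offset + n] = tensor.detach().flatten()
        self.views.append((self.offset, n))
        self.offset += n
        return True

    def to_gpu(self):
        # Whole-chunk migration: one large transfer instead of many small ones.
        self.buffer = self.buffer.to("cuda", non_blocking=True)

    def to_cpu(self):
        self.buffer = self.buffer.to("cpu", non_blocking=True)


# Pack the parameters of a small model into 1M-element chunks.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Linear(512, 512))
chunks, current = [], Chunk(1 << 20)
for p in model.parameters():
    if not current.append(p):
        chunks.append(current)
        current = Chunk(1 << 20)
        current.append(p)
chunks.append(current)

# During training, only the chunks needed by the current layer would sit on
# the GPU; the rest stay in CPU memory and are moved back asynchronously.
chunks[0].to_gpu()
```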