DLRover

DLRover is an automatic distributed deep learning system that provides elastic scheduling, flash checkpointing and auto-scaling to simplify large-scale model training on Kubernetes and Ray.

Intelligent Machine Learning · Since 2022-06-24



Introduction

DLRover is an industrial-grade automatic distributed deep learning system designed to reduce training downtime, improve resource utilization, and accelerate failure recovery for large-scale model training on Kubernetes or Ray clusters.

Key features

Fault tolerance and recovery: automatic failure diagnosis and process restart to minimize training interruptions.
Flash Checkpoint: asynchronous checkpoint persistence and fast in-memory recovery, so large models can resume within seconds.
Auto-scaling and scheduling: dynamic scaling and data sharding to mitigate stragglers and improve throughput.
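
To make the data sharding idea concrete, here is a minimal sketch of a dynamic shard queue. It is not DLRover's implementation or API: the class `ShardQueue` and its methods are hypothetical names chosen for illustration, and the sketch assumes each worker holds at most one shard at a time. The point is that workers pull small index ranges on demand, so a fast worker naturally takes more shards than a straggler, and a failed worker's unfinished shard is handed to someone else.

```python
from collections import deque


class ShardQueue:
    """Hypothetical sketch: hands out index ranges and requeues unfinished ones."""

    def __init__(self, dataset_size, shard_size):
        # Split [0, dataset_size) into consecutive (start, end) ranges.
        self._todo = deque(
            (start, min(start + shard_size, dataset_size))
            for start in range(0, dataset_size, shard_size)
        )
        # worker_id -> shard in flight (assumes one shard per worker at a time).
        self._doing = {}

    def get_shard(self, worker_id):
        """Return the next (start, end) range, or None if none is pending."""
        if not self._todo:
            return None
        shard = self._todo.popleft()
        self._doing[worker_id] = shard
        return shard

    def report_done(self, worker_id):
        """Mark the worker's in-flight shard as completed."""
        self._doing.pop(worker_id, None)

    def recover(self, worker_id):
        """Requeue the in-flight shard of a worker that failed."""
        shard = self._doing.pop(worker_id, None)
        if shard is not None:
            self._todo.appendleft(shard)

    def finished(self):
        return not self._todo and not self._doing


# Usage: worker-0 fails mid-shard, worker-1 picks up its range.
queue = ShardQueue(dataset_size=10, shard_size=4)
first = queue.get_shard("worker-0")
second = queue.get_shard("worker-1")
queue.recover("worker-0")
queue.report_done("worker-1")
retry = queue.get_shard("worker-1")
```

Keeping shards small relative to the dataset is the design choice that matters here: it bounds how much work is repeated after a failure and how long a slow worker can hold up the end of an epoch.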

Use cases

Production orchestration and operation of large-scale LLM and other model training jobs.
Distributed training tasks on Kubernetes or Ray that require elasticity, fault tolerance, and fast recovery.
Scenarios that need to reduce I/O overhead and speed up checkpoint/recovery processes.
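
The I/O saving in the last use case comes from taking the slow disk write off the training loop's critical path. The sketch below illustrates that general pattern with the Python standard library only. It is an assumption-laden illustration, not the Flash Checkpoint implementation: `AsyncCheckpointer` is a hypothetical name, the state is a plain dictionary rather than model tensors, and the in-memory copy lives in the same process, whereas a real system would need to keep it somewhere that survives a training process restart.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical sketch: snapshot in memory, persist in the background."""

    def __init__(self, path):
        self._path = path
        self._latest = None  # (step, snapshot) kept in memory for fast restore
        self._writer = None  # background thread performing the disk write

    def save(self, step, state):
        # Blocking part: only an in-memory copy of the state.
        snapshot = copy.deepcopy(state)
        self._latest = (step, snapshot)
        self.wait()  # allow at most one disk write in flight
        self._writer = threading.Thread(
            target=self._persist, args=(step, snapshot)
        )
        self._writer.start()

    def _persist(self, step, snapshot):
        # Write to a temporary file, then rename atomically so a crash
        # never leaves a half-written checkpoint at the final path.
        tmp_path = self._path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump((step, snapshot), f)
        os.replace(tmp_path, self._path)

    def wait(self):
        """Block until the pending disk write, if any, has finished."""
        if self._writer is not None:
            self._writer.join()
            self._writer = None

    def load(self):
        """Prefer the in-memory snapshot; fall back to the file on disk."""
        if self._latest is not None:
            return self._latest
        if os.path.exists(self._path):
            with open(self._path, "rb") as f:
                return pickle.load(f)
        return None


# Usage: training keeps mutating state while the write happens.
ckpt_path = os.path.join(tempfile.mkdtemp(), "model.ckpt")
ckpt = AsyncCheckpointer(ckpt_path)
state = {"weights": [0.1, 0.2], "lr": 0.01}
ckpt.save(step=100, state=state)
state["weights"][0] = 9.9  # does not affect the saved snapshot
ckpt.wait()
step, restored = ckpt.load()  # served from memory
disk_step, disk_state = AsyncCheckpointer(ckpt_path).load()  # served from disk
```

The training loop pays only for the in-memory copy in `save`; the file write overlaps with subsequent training steps, and `load` touches storage only when no in-memory snapshot is available.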

Technical details

Primarily implemented in Python with supporting Go/C++ components; integrates with DDP, FSDP, DeepSpeed, and Megatron-LM.
Provides tutorials and examples (elastic scheduling, node health checks, Flash Checkpoint) for easy integration into existing training pipelines.

