Overview
Dask is a Python library for parallel and distributed computing. It provides task scheduling and delayed execution that scale NumPy, Pandas, and Scikit-learn workflows from a single machine to a cluster. Dask is widely used for data processing, feature engineering, and preparing large datasets for model training.
Key features
- Task graphs and distributed schedulers for parallel execution.
- Seamless integration with the PyData ecosystem (NumPy, Pandas, Scikit-learn).
- Scales from single-node to large clusters with flexible deployment options.
Use cases
- Large-scale data preprocessing and feature engineering.
- Distributed training data preparation and batch jobs.
- Parallel scientific computing and analytics workloads.
License
- BSD-3-Clause — a permissive open-source license suitable for many commercial and academic uses.