A guide to building long-term compounding knowledge infrastructure. See details on GitHub .

Dask

Dask is a Python library for parallel computing and task scheduling, suited for scaling NumPy, Pandas and machine learning workloads across clusters.

Overview

Dask is a Python library for parallel and distributed computing. It provides task scheduling and delayed execution that scale NumPy, Pandas, and Scikit-learn workflows from a single machine to a cluster. Dask is widely used for data processing, feature engineering, and preparing large datasets for model training.

Key features

  • Task graphs and distributed schedulers for parallel execution.
  • Seamless integration with the PyData ecosystem (NumPy, Pandas, Scikit-learn).
  • Scales from single-node to large clusters with flexible deployment options.

Use cases

  • Large-scale data preprocessing and feature engineering.
  • Distributed training data preparation and batch jobs.
  • Parallel scientific computing and analytics workloads.

License

  • BSD-3-Clause — a permissive open-source license suitable for many commercial and academic uses.

Comments

Dask
Resource Info
🛠️ Dev Tools 🖥️ ML Platform 🌱 Open Source