DataTrove

DataTrove provides composable, platform-agnostic pipelines for large-scale text data processing, including extraction, filtering, deduplication and saving.

Hugging Face · Since 2023-06-14

Overview

DataTrove is an open-source library offering composable pipeline blocks to process, filter and deduplicate large-scale text datasets. It supports various executors and runtime backends to scale from local runs to cluster deployments.

Key features

Modular pipeline blocks: readers, writers, extractors, filters and stats.
Multiple executors for different scales: LocalPipelineExecutor (single machine), SlurmPipelineExecutor (Slurm clusters) and RayPipelineExecutor (Ray clusters).
Examples and quickstarts for Common Crawl processing, deduplication, and synthetic data generation.
Integration with Hugging Face datasets and tooling, plus detailed documentation and an active contributor community.
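
The block-and-executor idea in the list above can be illustrated with a minimal sketch in plain Python. This is not DataTrove's API: `read_docs`, `min_length_filter`, `run_pipeline` and `shard` are hypothetical names chosen for illustration, and a document is assumed to be a plain dictionary with a "text" field. Each block takes a stream of documents and yields a stream of documents, which is what lets blocks be chained freely. The `shard` helper shows one simple way an executor could split input across workers.

```python
from typing import Callable, Iterable, Iterator, List

Doc = dict  # assumption: a document is a dict with "id" and "text" keys


def read_docs(texts: Iterable[str]) -> Iterator[Doc]:
    """Reader block (hypothetical): wrap raw strings as documents."""
    for i, text in enumerate(texts):
        yield {"id": i, "text": text}


def min_length_filter(min_chars: int) -> Callable[[Iterable[Doc]], Iterator[Doc]]:
    """Filter block factory (hypothetical): keep docs with at least min_chars characters."""
    def block(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            if len(doc["text"]) >= min_chars:
                yield doc
    return block


def run_pipeline(source: Iterable[Doc], blocks: List[Callable]) -> List[Doc]:
    """Chain the blocks lazily, then materialize the surviving documents."""
    stream = source
    for block in blocks:
        stream = block(stream)
    return list(stream)


def shard(items: List[str], rank: int, world_size: int) -> List[str]:
    """Give worker `rank` every world_size-th item, starting at index `rank`."""
    return items[rank::world_size]


texts = ["hi", "a longer document", "ok", "another long document"]

# Single-process run over all inputs.
kept = run_pipeline(read_docs(texts), [min_length_filter(5)])

# The same pipeline run independently on two shards, as two workers might.
per_worker = [
    run_pipeline(read_docs(shard(texts, rank, 2)), [min_length_filter(5)])
    for rank in range(2)
]
```

Because every worker runs the same chain of blocks on its own shard, moving from a local run to a cluster run changes only how shards are assigned, not the pipeline definition.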

Use cases

Preparing and cleaning corpora for model pretraining or fine-tuning.
Building preprocessing pipelines for retrieval datasets used in RAG systems.
Large-scale deduplication and data profiling for dataset hygiene.
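
As a rough illustration of the deduplication use case, the sketch below removes exact duplicates by hashing a normalized form of each text. It is a deliberately simple technique and not DataTrove's implementation: `normalize` and `dedup_exact` are hypothetical helpers, and the normalization rule (lowercasing and collapsing whitespace) is an assumption made for the example.

```python
import hashlib
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization rule)."""
    return " ".join(text.lower().split())


def dedup_exact(texts: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each text, comparing by hash of the normalized form."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


result = dedup_exact(["Hello   world", "hello world", "Something else"])
```

Storing fixed-size digests instead of full texts keeps memory use bounded per document, although a single in-memory set like this one would not scale to very large corpora without sharding or an external store.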

Technical details

Primary language: Python, with small Rust components.
License: Apache-2.0.
Installable via pip with optional extras: datatrove[io], datatrove[processing], datatrove[ray], datatrove[cli].


