Overview
RedPajama Data is an open project that supplies curated corpora and processing pipelines for large-scale language model training. It aims to make dataset preparation reproducible and auditable for researchers and practitioners.
Key features
- End-to-end preprocessing scripts for cleaning, deduplication, and sharding; a minimal sketch of such a pass follows this list.
- Formats and outputs compatible with major training frameworks and dataset hubs, as the loading example after this list illustrates.
- Apache-2.0 license enabling community reuse and downstream research.
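To make the preprocessing step concrete, here is a minimal, hedged sketch of a clean → deduplicate → shard pass over JSONL documents. The file paths, document schema, and shard size are illustrative assumptions, not the project's actual scripts or configuration:

```python
import hashlib
import json
from pathlib import Path

# Hypothetical input/output locations and shard size; the real pipeline's
# paths, schema, and thresholds may differ.
INPUT_FILES = [Path("raw/part-000.jsonl"), Path("raw/part-001.jsonl")]
OUTPUT_DIR = Path("clean_shards")
DOCS_PER_SHARD = 10_000


def clean(text: str) -> str:
    """Minimal cleaning: normalize whitespace and drop very short documents."""
    text = " ".join(text.split())
    return text if len(text) >= 100 else ""


def iter_documents(files):
    """Yield one JSON document per line from each input file."""
    for path in files:
        with path.open() as f:
            for line in f:
                yield json.loads(line)


def write_shard(shard_idx: int, records: list) -> None:
    """Write a batch of records to a numbered JSONL shard."""
    shard_path = OUTPUT_DIR / f"shard-{shard_idx:05d}.jsonl"
    with shard_path.open("w") as out:
        out.writelines(json.dumps(rec) + "\n" for rec in records)


def main():
    OUTPUT_DIR.mkdir(exist_ok=True)
    seen_hashes = set()          # exact-duplicate detection via content hash
    shard_idx, buffer = 0, []

    for doc in iter_documents(INPUT_FILES):
        text = clean(doc.get("text", ""))
        if not text:
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # skip exact duplicates
            continue
        seen_hashes.add(digest)
        buffer.append({"text": text, "sha256": digest})

        if len(buffer) >= DOCS_PER_SHARD:
            write_shard(shard_idx, buffer)
            shard_idx, buffer = shard_idx + 1, []

    if buffer:  # flush the final partial shard
        write_shard(shard_idx, buffer)


if __name__ == "__main__":
    main()
```

A production pipeline would typically add fuzzy deduplication (e.g. MinHash) and richer quality filters; the sketch only shows the overall shape of the pass.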
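Because the emitted shards in the sketch above are plain JSONL, they can be read directly by common dataset tooling. For example, assuming the hypothetical `clean_shards/` output and the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Stream the hypothetical JSONL shards produced above without loading
# everything into memory; "json" is the generic JSON/JSONL builder.
dataset = load_dataset(
    "json",
    data_files="clean_shards/shard-*.jsonl",
    split="train",
    streaming=True,
)

for example in dataset.take(3):
    print(example["text"][:80])
```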
Use cases
- Source data for pretraining or fine-tuning large language models.
- Reference implementation for reproducible dataset preparation and auditing.
- Educational and analytical settings for studying the composition and quality of large corpora.
Technical highlights
- Modular, parallelizable pipeline design that supports large-scale processing, as sketched after this list.
- Clear metadata and provenance information for auditability and compliance.
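As one hedged illustration of both points, the sketch below processes shards independently in parallel while attaching provenance metadata to each output record. The field names, helper functions, and directory layout are assumptions for illustration, not the project's actual schema:

```python
import hashlib
import json
from multiprocessing import Pool
from pathlib import Path

# Hypothetical shard layout and metadata fields; the real pipeline's layout
# and schema may differ.
INPUT_DIR = Path("clean_shards")
OUTPUT_DIR = Path("annotated_shards")
PIPELINE_VERSION = "example-0.1"


def annotate_shard(shard_path: Path) -> str:
    """Process one shard independently, attaching provenance metadata."""
    out_path = OUTPUT_DIR / shard_path.name
    with shard_path.open() as src, out_path.open("w") as dst:
        for line in src:
            record = json.loads(line)
            record["meta"] = {
                "source_file": shard_path.name,
                "content_sha256": hashlib.sha256(
                    record["text"].encode("utf-8")
                ).hexdigest(),
                "pipeline_version": PIPELINE_VERSION,
            }
            dst.write(json.dumps(record) + "\n")
    return out_path.name


if __name__ == "__main__":
    OUTPUT_DIR.mkdir(exist_ok=True)
    shards = sorted(INPUT_DIR.glob("shard-*.jsonl"))
    # Each shard is independent, so a process pool spreads the work across
    # cores; the same structure extends to a cluster scheduler for larger runs.
    with Pool() as pool:
        for name in pool.imap_unordered(annotate_shard, shards):
            print("finished", name)
```

Keeping per-record metadata of this kind is what makes downstream auditing possible: any training example can be traced back to the shard, content hash, and pipeline version that produced it.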