Detailed Introduction
Open-dLLM is an open-source project for diffusion-based large language models (LLMs). It provides an end-to-end stack covering raw data processing, pretraining, evaluation, inference, and distribution of checkpoints. The repository includes Open-dCoder, a code-generation variant, along with training pipelines, evaluation harnesses, and published model weights on Hugging Face for reproducibility.
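Because the weights are published on Hugging Face, getting started can be as simple as a standard Transformers load. Below is a minimal sketch, assuming the checkpoints follow the usual Transformers repo layout; the repo id here is a placeholder, not the actual model card, so substitute the id listed in the project's documentation.

```python
# Minimal loading sketch. Assumptions: the checkpoint is a standard
# Transformers repo, and any custom diffusion modeling code ships with it
# (hence trust_remote_code). The repo id is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "open-dllm/open-dcoder"  # hypothetical repo id, replace with the real one

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
```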
Main Features
- Reproducible, end-to-end training pipelines, from data preparation through large-scale pretraining.
- An open evaluation suite covering HumanEval, MBPP, code infilling, and custom metrics for diffusion LLMs.
- Easy-to-use inference and sampling scripts for experimentation and deployment (a generic sampling sketch follows this list).
- Published checkpoints on Hugging Face to enable reproduction and transfer learning.
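To make the inference bullet concrete, here is a generic iterative-unmasking sampler of the kind diffusion LLMs use: start from a fully masked completion and reveal the most confident tokens a fraction at a time. This is a sketch of the general technique, not the repo's actual script, and it assumes a Transformers-style model whose tokenizer defines a mask token and whose attention is bidirectional over the sequence.

```python
import torch

@torch.no_grad()
def diffusion_sample(model, tokenizer, prompt, gen_len=64, steps=16, device="cpu"):
    """Generic iterative-unmasking loop for a masked diffusion LM.
    Assumes tokenizer.mask_token_id is defined; the repo's sampler may differ."""
    mask_id = tokenizer.mask_token_id
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    # Completion starts as gen_len [MASK] tokens appended to the prompt.
    x = torch.cat(
        [prompt_ids, torch.full((1, gen_len), mask_id, dtype=torch.long, device=device)],
        dim=1,
    )
    n_prompt = prompt_ids.shape[1]

    for step in range(steps):
        still_masked = x == mask_id
        n_left = int(still_masked.sum())
        if n_left == 0:
            break
        logits = model(input_ids=x).logits
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Only masked positions compete for being revealed this step.
        conf = conf.masked_fill(~still_masked, -1.0)
        # Reveal an equal share of the remaining masked tokens each step.
        k = max(1, n_left // (steps - step))
        reveal = conf.flatten().topk(k).indices
        flat = x.flatten()
        flat[reveal] = pred.flatten()[reveal]
        x = flat.view(1, -1)
    return tokenizer.decode(x[0, n_prompt:], skip_special_tokens=True)
```

Confidence-ordered unmasking is one common schedule; samplers also vary in how many tokens they reveal per step and whether they allow remasking of low-confidence tokens.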
Use Cases
- Researchers exploring diffusion-based generation for LLMs and how such models are trained and optimized.
- Engineering teams reproducing experiments, training custom models, or fine-tuning existing checkpoints for specific tasks.
- Teaching and benchmarking: a reproducible open benchmark for code generation and infilling tasks.
Technical Features
- Training objective based on the Masked Diffusion Model (MDM), adapted for code generation and infilling (a loss sketch follows this list).
- Integration with VeOmni for dataset handling and with lm-eval-harness for benchmarking.
- Transparent configs and experiment specifications that ease migration across compute environments and streamline checkpoint uploads to Hugging Face.
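For concreteness, here is a minimal sketch of the MDM objective in its common discrete form: sample a mask ratio t uniformly, corrupt that fraction of tokens to [MASK], and score the model's reconstruction of the masked positions with a 1/t weighting. This is the standard formulation from the masked-diffusion literature; the repo's exact masking schedule and loss weighting may differ.

```python
import torch
import torch.nn.functional as F

def mdm_loss(model, input_ids, mask_id):
    """One step of a masked-diffusion training objective (common formulation,
    not necessarily the repo's exact loss)."""
    b, seq_len = input_ids.shape
    # Per-sample mask ratio t ~ U(0, 1]; clamp away from 0 for the 1/t weight.
    t = torch.rand(b, 1, device=input_ids.device).clamp(min=1e-3)
    masked = torch.rand(b, seq_len, device=input_ids.device) < t
    x_t = input_ids.masked_fill(masked, mask_id)

    logits = model(input_ids=x_t).logits  # (b, seq_len, vocab)
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), input_ids.view(-1), reduction="none"
    ).view(b, seq_len)
    # Only masked positions contribute, scaled by 1/t (ELBO-style weighting).
    per_sample = (ce * masked.float()).sum(dim=1) / (t.squeeze(1) * seq_len)
    return per_sample.mean()
```

Infilling falls out of the same objective: at inference time the known prefix and suffix are left unmasked, and only the gap is initialized as [MASK] tokens.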