Detailed Introduction
Open-dLLM is an open-source project for diffusion-based large language models (LLMs). It provides an end-to-end stack covering raw data processing, pretraining, evaluation, inference, and distribution of checkpoints. The repository includes Open-dCoder, a code-generation variant, along with training pipelines, evaluation harnesses, and published model weights on Hugging Face for reproducibility.
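Because the weights are published on Hugging Face, getting started can be as simple as a standard Transformers load. Below is a minimal sketch, assuming the checkpoints follow the usual Transformers repo layout; the repo id here is a placeholder, not the actual model card, so substitute the id listed in the project's documentation.

```python
# Minimal loading sketch. Assumptions: the checkpoint is a standard
# Transformers repo, and any custom diffusion modeling code ships with it
# (hence trust_remote_code). The repo id is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "open-dllm/open-dcoder"  # hypothetical repo id, replace with the real one

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
```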
Main Features
- Reproducible, end-to-end training pipelines, from data preparation through large-scale pretraining.
- An open evaluation suite covering HumanEval, MBPP, code infilling, and custom metrics for diffusion LLMs.
- Easy-to-use inference and sampling scripts for experimentation and deployment (a generic sampling sketch follows this list).
- Published checkpoints on Hugging Face to enable reproduction and transfer learning.
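To make the inference bullet concrete, here is a generic iterative-unmasking sampler of the kind diffusion LLMs use: start from a fully masked completion and reveal the most confident tokens a fraction at a time. This is a sketch of the general technique, not the repo's actual script, and it assumes a Transformers-style model whose tokenizer defines a mask token and whose attention is bidirectional over the sequence.

```python
import torch

@torch.no_grad()
def diffusion_sample(model, tokenizer, prompt, gen_len=64, steps=16, device="cpu"):
    """Generic iterative-unmasking loop for a masked diffusion LM.
    Assumes tokenizer.mask_token_id is defined; the repo's sampler may differ."""
    mask_id = tokenizer.mask_token_id
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    # Completion starts as gen_len [MASK] tokens appended to the prompt.
    x = torch.cat(
        [prompt_ids, torch.full((1, gen_len), mask_id, dtype=torch.long, device=device)],
        dim=1,
    )
    n_prompt = prompt_ids.shape[1]

    for step in range(steps):
        still_masked = x == mask_id
        n_left = int(still_masked.sum())
        if n_left == 0:
            break
        logits = model(input_ids=x).logits
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Only masked positions compete for being revealed this step.
        conf = conf.masked_fill(~still_masked, -1.0)
        # Reveal an equal share of the remaining masked tokens each step.
        k = max(1, n_left // (steps - step))
        reveal = conf.flatten().topk(k).indices
        flat = x.flatten()
        flat[reveal] = pred.flatten()[reveal]
        x = flat.view(1, -1)
    return tokenizer.decode(x[0, n_prompt:], skip_special_tokens=True)
```

Confidence-ordered unmasking is one common schedule; samplers also vary in how many tokens they reveal per step and whether they allow remasking of low-confidence tokens.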
Use Cases
- Researchers exploring diffusion-based generation for LLMs and how such models are trained and optimized.
- Engineering teams reproducing experiments, training custom models, or fine-tuning existing checkpoints for specific tasks.
- Teaching and benchmarking: a reproducible open benchmark for code generation and infilling tasks.
Technical Features
- Training objective based on the Masked Diffusion Model (MDM), adapted for code generation and infilling (a loss sketch follows this list).
- Integration with VeOmni for dataset handling and with lm-eval-harness for benchmarking.
- Transparent configs and experiment specifications that ease migration across compute environments and streamline checkpoint uploads to Hugging Face.
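For concreteness, here is a minimal sketch of the MDM objective in its common discrete form: sample a mask ratio t uniformly, corrupt that fraction of tokens to [MASK], and score the model's reconstruction of the masked positions with a 1/t weighting. This is the standard formulation from the masked-diffusion literature; the repo's exact masking schedule and loss weighting may differ.

```python
import torch
import torch.nn.functional as F

def mdm_loss(model, input_ids, mask_id):
    """One step of a masked-diffusion training objective (common formulation,
    not necessarily the repo's exact loss)."""
    b, seq_len = input_ids.shape
    # Per-sample mask ratio t ~ U(0, 1]; clamp away from 0 for the 1/t weight.
    t = torch.rand(b, 1, device=input_ids.device).clamp(min=1e-3)
    masked = torch.rand(b, seq_len, device=input_ids.device) < t
    x_t = input_ids.masked_fill(masked, mask_id)

    logits = model(input_ids=x_t).logits  # (b, seq_len, vocab)
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), input_ids.view(-1), reduction="none"
    ).view(b, seq_len)
    # Only masked positions contribute, scaled by 1/t (ELBO-style weighting).
    per_sample = (ce * masked.float()).sum(dim=1) / (t.squeeze(1) * seq_len)
    return per_sample.mean()
```

Infilling falls out of the same objective: at inference time the known prefix and suffix are left unmasked, and only the gap is initialized as [MASK] tokens.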