
dInfer

dInfer is an efficient inference framework for diffusion language models, focusing on decoding algorithms and KV-cache management to improve throughput and quality.

InclusionAI · Since 2025-09-29

Overview

dInfer is an efficient and extensible inference framework for diffusion language models (dLLMs). It decomposes inference into four pluggable components: the model, a diffusion iteration manager, a decoding strategy, and KV-cache management. Flexible APIs let these be combined with system-level optimizations to maximize GPU utilization and throughput.
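
The four-way split can be pictured with a short, self-contained sketch. The interfaces below are illustrative stand-ins only, not dInfer's actual API:

```python
# Illustrative sketch of the four-module decomposition described above.
# All names here are hypothetical; dInfer's real interfaces differ.
from typing import Protocol
import torch

class DecodingStrategy(Protocol):
    def step(self, logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor: ...

class KVCachePolicy(Protocol):
    def update(self, tokens: torch.Tensor) -> None: ...

class IterationManager:
    """Drives diffusion iterations, delegating each concern to a pluggable module."""
    def __init__(self, model, strategy: DecodingStrategy, cache: KVCachePolicy):
        self.model, self.strategy, self.cache = model, strategy, cache

    def generate(self, tokens: torch.Tensor, iters: int) -> torch.Tensor:
        for _ in range(iters):
            logits = self.model(tokens)                  # model forward pass
            tokens = self.strategy.step(logits, tokens)  # commit some positions
            self.cache.update(tokens)                    # keep the KV cache fresh
        return tokens
```

Because each module sits behind its own interface, a new decoding algorithm or cache policy can be swapped in without touching the iteration loop.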

Key features

  • Multiple decoding algorithms: soft diffusion iterations plus hierarchical and parallel decoding strategies that raise throughput while preserving output quality.
  • KV-cache strategies: vicinity refresh and cache management that mitigate staleness and improve cache hit rates (parallel decoding and vicinity refresh are sketched after this list).
  • System-level optimizations: tensor and expert parallelism, PyTorch compilation, CUDA Graphs, and loop unrolling to reduce kernel-launch overhead.
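
To make the first two features concrete, here is a toy sketch of threshold-based parallel decoding and a vicinity cache refresh. The function names, the mask id, the confidence threshold, and the refresh radius are all illustrative assumptions, not dInfer's implementation:

```python
import torch

MASK_ID = -1  # placeholder id for a masked (not-yet-decoded) position

def parallel_decode_step(logits: torch.Tensor, tokens: torch.Tensor,
                         threshold: float = 0.9):
    """Commit, in parallel, every masked position whose top-1 probability
    clears the confidence threshold (toy version of parallel decoding)."""
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)          # per-position confidence and argmax
    masked = tokens == MASK_ID
    accept = masked & (conf > threshold)    # several tokens can commit at once
    out = tokens.clone()
    out[accept] = pred[accept]
    return out, accept

def vicinity_refresh(cache_valid: torch.Tensor, accepted: torch.Tensor,
                     radius: int = 4) -> torch.Tensor:
    """Mark cache entries within `radius` of a newly committed token as stale,
    so only a small neighborhood is recomputed on the next iteration."""
    valid = cache_valid.clone()
    for i in accepted.nonzero(as_tuple=True)[0].tolist():
        valid[max(0, i - radius): i + radius + 1] = False
    return valid

# Toy usage: a 4-token sequence with two masked slots and a vocab of 32.
seq = torch.tensor([5, MASK_ID, MASK_ID, 7])
seq, accepted = parallel_decode_step(torch.randn(4, 32), seq, threshold=0.1)
cache_valid = vicinity_refresh(torch.ones(4, dtype=torch.bool), accepted)
```

The point of the sketch is the shape of the trade-off: committing many positions per step cuts the number of full forward passes, while refreshing only a vicinity of each commit keeps the KV cache mostly reusable.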

Use cases

  • High-performance inference services that need higher throughput and lower latency than standard autoregressive decoding delivers.
  • Benchmarking and system-level optimization when comparing model variants or deploying new decoding algorithms.
  • Integration into containerized and distributed inference pipelines for production deployment.

Technical notes

  • Implemented in Python, with modular APIs that support different model backends and parallel configurations.
  • Designed to combine algorithmic and system-level improvements for practical deployment on GPU clusters (a minimal sketch of the system-level knobs follows).
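
The PyTorch-level optimizations mentioned above can be approximated with standard APIs. In this sketch, `inner_loop` is a hypothetical stand-in for one diffusion iteration, not dInfer code:

```python
import torch

def inner_loop(x: torch.Tensor, w: torch.Tensor, steps: int = 4) -> torch.Tensor:
    # A fixed, known step count lets the compiler unroll this loop
    # into a single fused graph instead of relaunching kernels per step.
    for _ in range(steps):
        x = torch.relu(x @ w)
    return x

# mode="reduce-overhead" enables CUDA Graphs under the hood, amortizing
# kernel-launch cost across the many short steps of diffusion decoding.
compiled = torch.compile(inner_loop, mode="reduce-overhead")

if torch.cuda.is_available():
    x = torch.randn(8, 1024, device="cuda")
    w = torch.randn(1024, 1024, device="cuda")
    y = compiled(x, w)  # first call compiles; later calls replay the captured graph
```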
