Overview
dInfer is an efficient and extensible inference framework for diffusion language models (dLLMs). It modularizes inference into four components: the model, the diffusion iteration manager, the decoding strategy, and KV-cache management. Flexible APIs let these modules be combined with algorithmic and system-level optimizations to maximize GPU utilization and throughput.
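The modular split can be pictured with a small, self-contained toy example. The sketch below uses assumed names (ToyModel, IterationManager, ThresholdDecoder, MASK_ID) and a simplified confidence-threshold decoder; it is not dInfer's actual API, only an illustration of how a model backend, an iteration manager, and a decoding strategy compose into one generate loop.

```python
import torch

# Toy sketch of the modular split: all names and the threshold-based decoder
# are illustrative assumptions, not dInfer's actual API.

MASK_ID = -1  # sentinel id for still-masked positions

class ToyModel:
    """Stand-in for a dLLM backend: emits random per-position logits."""
    def __init__(self, vocab_size: int = 100):
        self.vocab_size = vocab_size

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return torch.randn(tokens.shape[0], self.vocab_size)

class IterationManager:
    """Decides when the diffusion loop has finished (no masked positions left)."""
    def done(self, tokens: torch.Tensor) -> bool:
        return bool((tokens != MASK_ID).all())

class ThresholdDecoder:
    """Parallel decoding: commit every masked position whose confidence clears a threshold."""
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold

    def step(self, tokens: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)
        masked = tokens == MASK_ID
        accept = masked & (conf >= self.threshold)
        if not accept.any():
            # Guarantee progress: commit the single most confident masked position.
            best = torch.where(masked)[0][conf[masked].argmax()]
            accept[best] = True
        out = tokens.clone()
        out[accept] = pred[accept]
        return out

def generate(model, manager, decoder, prompt: torch.Tensor, new_tokens: int = 8) -> torch.Tensor:
    tokens = torch.cat([prompt, torch.full((new_tokens,), MASK_ID)])
    while not manager.done(tokens):
        tokens = decoder.step(tokens, model.forward(tokens))
    return tokens

print(generate(ToyModel(), IterationManager(), ThresholdDecoder(), torch.tensor([1, 2, 3])))
```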
Key features
- Multiple decoding algorithms: soft diffusion iterations plus hierarchical and parallel decoding strategies that raise throughput while maintaining quality (the threshold-based decoder in the sketch above is one simple example of parallel decoding).
- KV-cache strategies: vicinity refresh and cache management to mitigate staleness and improve cache hit rates (see the cache sketch after this list).
- System-level optimizations: tensor and expert parallelism, PyTorch compilation, CUDA Graphs, and loop unrolling to reduce kernel launch overhead (see the compilation sketch after this list).
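A minimal sketch of what a vicinity-refresh policy can look like, assuming block-wise decoding: cached keys and values are reused for most positions, while positions around the block currently being decoded are recomputed each iteration to limit staleness. The VicinityKVCache class, its window parameter, and the usage notes are illustrative assumptions, not dInfer's implementation.

```python
import torch

class VicinityKVCache:
    """Assumed sketch: reuse cached K/V, but recompute entries near the active block."""
    def __init__(self, window: int = 32):
        self.window = window
        self.keys = None    # e.g. [seq_len, num_heads, head_dim]
        self.values = None

    def positions_to_recompute(self, block_start: int, block_end: int, seq_len: int) -> torch.Tensor:
        # Recompute the active block plus a window of neighbours on each side;
        # everything else is served from the (possibly stale) cache.
        lo = max(0, block_start - self.window)
        hi = min(seq_len, block_end + self.window)
        return torch.arange(lo, hi)

    def refresh(self, positions: torch.Tensor, new_keys: torch.Tensor, new_values: torch.Tensor):
        # Overwrite only the refreshed positions, keeping the rest of the cache intact.
        self.keys[positions] = new_keys
        self.values[positions] = new_values

# Usage idea: per iteration, run attention from cache everywhere except
# positions_to_recompute(...), then call refresh(...) with the recomputed entries.
cache = VicinityKVCache(window=16)
print(cache.positions_to_recompute(block_start=128, block_end=160, seq_len=512))
```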
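On the kernel-overhead side, the sketch below shows one common way to apply PyTorch compilation with CUDA Graph capture to a fixed-shape step; decode_step is a hypothetical placeholder for one diffusion iteration, not a dInfer function.

```python
import torch

# mode="reduce-overhead" asks PyTorch 2.x to capture the step into a CUDA Graph,
# so repeated identical-shape iterations avoid per-kernel launch overhead.
@torch.compile(mode="reduce-overhead", dynamic=False)
def decode_step(hidden: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Placeholder computation standing in for the model forward of one iteration.
    return torch.relu(hidden @ weight)

if torch.cuda.is_available():
    h = torch.randn(64, 1024, device="cuda")
    w = torch.randn(1024, 1024, device="cuda")
    for _ in range(4):
        out = decode_step(h, w)  # same shapes every call, so the captured graph is replayed
    print(out.shape)
```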
Use cases
- High-performance inference services that need higher throughput and lower latency than standard autoregressive decoding.
- Benchmarking and system-level optimization when comparing model variants or deploying new decoding algorithms.
- Integration into containerized and distributed inference pipelines for production deployment.
Technical notes
- Implemented in Python with modular APIs to support different model backends and parallel configurations.
- Designed to leverage both algorithmic and system-level improvements for practical deployment on GPU clusters.