Overview
MInference accelerates prompt (pre-fill) inference for long-context LLMs at the million-token scale. It combines dynamic sparse attention, pattern-specific custom kernels, and KV-cache strategies to reduce pre-fill latency while preserving accuracy.
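As a concrete integration sketch, the snippet below shows how an off-the-shelf Hugging Face model could be patched before pre-filling. The `MInference("minference", model_name)` patch call follows the usage pattern described in the project README; the model name, prompt source, and generation settings are placeholder assumptions.

```python
# Minimal sketch of patching a Hugging Face model with MInference.
# The MInference(...) patch interface mirrors the project's documented usage;
# the model name and generation settings here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # any supported long-context model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Apply the dynamic-sparse-attention patch to the pre-fill path.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

# Pre-fill a very long prompt, then generate as usual.
long_prompt = open("long_document.txt").read()
inputs = tokenizer(long_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```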
Key features
- Dynamic sparse attention and pattern-based kernel selection for fast pre-filling (illustrated in the sketch after this list).
- Compatible with the Hugging Face Transformers and vLLM ecosystems; includes SCBench for standardized long-context evaluation.
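To make the attention patterns concrete, here is a toy, dense-math illustration of a "vertical-slash" pattern: a few globally attended key columns plus a sliding diagonal band of recent keys. It sketches only the sparsity shape, not the library's CUDA kernel, and the parameter names are made up for the example.

```python
# Toy illustration of a "vertical-slash" sparse-attention pattern:
# each query attends to a few global key columns (vertical lines) plus a
# sliding band of recent keys (the slash). This is dense math for clarity,
# not the optimized kernel; n_vertical/slash_width are illustrative knobs.
import torch

def vertical_slash_mask(seq_len, n_vertical=4, slash_width=64):
    idx = torch.arange(seq_len)
    dist = idx[:, None] - idx[None, :]            # query index minus key index
    vertical = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    vertical[:, :n_vertical] = True               # globally attended key columns
    slash = (dist >= 0) & (dist < slash_width)    # recent-diagonal band
    return (dist >= 0) & (vertical | slash)       # keep the mask causal

def masked_attention(q, k, v, mask):
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

seq_len, dim = 1024, 64
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))
out = masked_attention(q, k, v, vertical_slash_mask(seq_len))
print(out.shape)  # torch.Size([1024, 64])
```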
Use cases
- Long-document QA, repository/code understanding, and other tasks requiring very large context windows.
Technical notes
- Detects sparse attention patterns offline and online, and provides CUDA-accelerated kernels, KV-cache compression, and retrieval utilities for efficient long-context inference (a toy sketch of pattern detection follows).
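As a rough picture of what pattern detection involves, the sketch below scores two candidate masks against a head's attention map from a calibration pass and keeps the one that recalls the most attention mass. The candidate shapes and the recall criterion are assumptions for illustration, not MInference's actual search procedure.

```python
# Toy sketch of offline sparse-pattern detection: score candidate masks on a
# head's calibration attention map and keep the one that recalls the most
# attention mass. Candidate shapes and the criterion are illustrative only.
import torch

def a_shape_mask(n, sink=4, window=64):
    """Attention sinks (first `sink` keys) plus a causal local window."""
    idx = torch.arange(n)
    dist = idx[:, None] - idx[None, :]
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :sink] = True
    mask |= (dist >= 0) & (dist < window)
    return mask & (dist >= 0)

def local_mask(n, window=128):
    """Causal sliding window only."""
    idx = torch.arange(n)
    dist = idx[:, None] - idx[None, :]
    return (dist >= 0) & (dist < window)

def recalled_mass(attn, mask):
    """Fraction of total attention probability kept by the mask."""
    return float((attn * mask).sum() / attn.sum())

def detect_pattern(attn):
    n = attn.shape[-1]
    candidates = {"a_shape": a_shape_mask(n), "local": local_mask(n)}
    return max(candidates, key=lambda name: recalled_mass(attn, candidates[name]))

# Random scores stand in for one head's softmax-ed attention map.
n = 512
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
attn = torch.softmax(torch.randn(n, n).masked_fill(~causal, float("-inf")), dim=-1)
print(detect_pattern(attn))  # e.g. "local"
```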