Overview
MInference accelerates prompt (pre-fill) inference for long-context LLMs at the million-token scale. It combines dynamic sparse attention, pattern-specific custom kernels, and KV-cache strategies to reduce pre-fill latency while preserving accuracy.
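As a concrete integration sketch, the snippet below shows how an off-the-shelf Hugging Face model could be patched before pre-filling. The `MInference("minference", model_name)` patch call follows the usage pattern described in the project README; the model name, prompt source, and generation settings are placeholder assumptions.

```python
# Minimal sketch of patching a Hugging Face model with MInference.
# The MInference(...) patch interface mirrors the project's documented usage;
# the model name and generation settings here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # any supported long-context model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Apply the dynamic-sparse-attention patch to the pre-fill path.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

# Pre-fill a very long prompt, then generate as usual.
long_prompt = open("long_document.txt").read()
inputs = tokenizer(long_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```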
Key features
- Dynamic sparse attention and pattern-based kernel selection for fast pre-filling (illustrated in the sketch after this list).
- Compatible with the Hugging Face Transformers and vLLM ecosystems; includes SCBench for standardized long-context evaluation.
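To make the attention patterns concrete, here is a toy, dense-math illustration of a "vertical-slash" pattern: a few globally attended key columns plus a sliding diagonal band of recent keys. It sketches only the sparsity shape, not the library's CUDA kernel, and the parameter names are made up for the example.

```python
# Toy illustration of a "vertical-slash" sparse-attention pattern:
# each query attends to a few global key columns (vertical lines) plus a
# sliding band of recent keys (the slash). This is dense math for clarity,
# not the optimized kernel; n_vertical/slash_width are illustrative knobs.
import torch

def vertical_slash_mask(seq_len, n_vertical=4, slash_width=64):
    idx = torch.arange(seq_len)
    dist = idx[:, None] - idx[None, :]            # query index minus key index
    vertical = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    vertical[:, :n_vertical] = True               # globally attended key columns
    slash = (dist >= 0) & (dist < slash_width)    # recent-diagonal band
    return (dist >= 0) & (vertical | slash)       # keep the mask causal

def masked_attention(q, k, v, mask):
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

seq_len, dim = 1024, 64
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))
out = masked_attention(q, k, v, vertical_slash_mask(seq_len))
print(out.shape)  # torch.Size([1024, 64])
```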
Use cases
- Long-document QA, repository/code understanding, and other tasks requiring very large context windows.
Technical notes
- Detects sparse attention patterns offline and online, and provides CUDA-accelerated kernels, KV-cache compression, and retrieval utilities for efficient long-context inference (a toy sketch of pattern detection follows).
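As a rough picture of what pattern detection involves, the sketch below scores two candidate masks against a head's attention map from a calibration pass and keeps the one that recalls the most attention mass. The candidate shapes and the recall criterion are assumptions for illustration, not MInference's actual search procedure.

```python
# Toy sketch of offline sparse-pattern detection: score candidate masks on a
# head's calibration attention map and keep the one that recalls the most
# attention mass. Candidate shapes and the criterion are illustrative only.
import torch

def a_shape_mask(n, sink=4, window=64):
    """Attention sinks (first `sink` keys) plus a causal local window."""
    idx = torch.arange(n)
    dist = idx[:, None] - idx[None, :]
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :sink] = True
    mask |= (dist >= 0) & (dist < window)
    return mask & (dist >= 0)

def local_mask(n, window=128):
    """Causal sliding window only."""
    idx = torch.arange(n)
    dist = idx[:, None] - idx[None, :]
    return (dist >= 0) & (dist < window)

def recalled_mass(attn, mask):
    """Fraction of total attention probability kept by the mask."""
    return float((attn * mask).sum() / attn.sum())

def detect_pattern(attn):
    n = attn.shape[-1]
    candidates = {"a_shape": a_shape_mask(n), "local": local_mask(n)}
    return max(candidates, key=lambda name: recalled_mass(attn, candidates[name]))

# Random scores stand in for one head's softmax-ed attention map.
n = 512
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
attn = torch.softmax(torch.randn(n, n).masked_fill(~causal, float("-inf")), dim=-1)
print(detect_pattern(attn))  # e.g. "local"
```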