
MInference

MInference is a framework for long-context LLM inference that accelerates pre-filling and large-context processing using dynamic sparse attention and optimized kernels.

Overview

MInference optimizes prompt inference for long-context LLMs at the million-token scale. It combines dynamic sparse attention, custom GPU kernels, and KV-cache strategies to reduce pre-fill latency while preserving accuracy.
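The pre-fill cost that MInference targets can be isolated by timing the generation of a single token over a long prompt. Below is a minimal sketch using Hugging Face Transformers; the model name and prompt are illustrative placeholders, not anything mandated by MInference:

```python
# Minimal sketch: isolate pre-fill latency (time to first token) for a long
# prompt. Model name and prompt are illustrative placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gradientai/Llama-3-8B-Instruct-Gradient-1048k"  # example long-context model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "<a very long document>"  # placeholder for a million-token-scale input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=1)  # one new token ~= pure pre-fill cost
print(f"time to first token: {time.perf_counter() - start:.2f}s")
```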

Key features

  • Dynamic sparse attention and pattern-based kernel selection for fast pre-filling.
  • Compatible with the Hugging Face Transformers and vLLM ecosystems; ships SCBench for standardized long-context evaluation (a patching sketch follows this list).
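A minimal sketch of the Hugging Face integration, following the patching pattern in the project's documentation; exact argument names may differ across versions:

```python
# Patch a loaded Transformers model with MInference's dynamic sparse
# attention. Follows the project's documented pattern; details may vary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-Gradient-1048k"  # example supported model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

minference_patch = MInference("minference", model_name)  # choose the MInference attention type
model = minference_patch(model)  # swaps in the sparse attention kernels
```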

Use cases

  • Long-document QA, repository/code understanding, and other tasks requiring very large context windows.
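For illustration, a long-document QA call with the patched model from the sketch above might look like this; the source document and question are placeholders:

```python
# Hypothetical long-document QA flow. Reuses `model` and `tokenizer` from the
# patching sketch above; document text and question are placeholders.
long_document = open("contract.txt").read()  # placeholder source document
question = "What obligations does the agreement impose on the vendor?"
prompt = f"{long_document}\n\nQuestion: {question}\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
answer = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```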

Technical notes

  • Implements offline/online sparse pattern detection and offers CUDA-accelerated kernels, KV-cache compression, and retrieval utilities for efficient long-context inference.
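To make the idea concrete, here is a deliberately simplified, hypothetical illustration of block-level dynamic sparse attention in PyTorch. It is not MInference's actual kernel (which is CUDA-accelerated and uses richer pattern detection); it only shows the core trick of scoring key blocks cheaply and attending to the top-k of them:

```python
# Pedagogical sketch of block-level dynamic sparse attention, NOT MInference's
# actual CUDA kernels: for each query block, estimate the importance of every
# key block from pooled scores, then attend only to the top-k blocks.
# Causal masking is omitted for brevity.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, topk=4):
    # q, k, v: (seq, dim); assumes seq is divisible by block for simplicity
    seq, dim = q.shape
    nb = seq // block
    qb = q.view(nb, block, dim).mean(dim=1)   # pooled query per block
    kb = k.view(nb, block, dim).mean(dim=1)   # pooled key per block
    scores = qb @ kb.T / dim**0.5             # (nb, nb) block-level scores
    keep = scores.topk(topk, dim=-1).indices  # top-k key blocks per query block

    out = torch.empty_like(q)
    for i in range(nb):
        qi = q[i * block:(i + 1) * block]     # (block, dim) queries
        ks = torch.cat([k[j * block:(j + 1) * block] for j in keep[i]])
        vs = torch.cat([v[j * block:(j + 1) * block] for j in keep[i]])
        attn = F.softmax(qi @ ks.T / dim**0.5, dim=-1)
        out[i * block:(i + 1) * block] = attn @ vs
    return out

# quick check on random tensors
q = torch.randn(512, 64); k = torch.randn(512, 64); v = torch.randn(512, 64)
print(block_sparse_attention(q, k, v).shape)  # torch.Size([512, 64])
```

Restricting each query block to a handful of key blocks turns the quadratic attention cost into something closer to linear in sequence length, which is what makes million-token pre-fill tractable.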

Resource Info

🌱 Open Source · 🚀 Deployment