Overview
oLLM is a lightweight Python library for large-context LLM inference built on Hugging Face Transformers and PyTorch. It targets very long contexts on resource-constrained GPUs by loading model weights from disk on demand, offloading the KV cache to disk or CPU, and applying FlashAttention-2 and chunked MLP optimizations.
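A typical session, sketched below, follows the usage pattern shown in the repository README: the model is initialized from on-disk weights and the KV cache is redirected to disk. Class and method names such as `Inference`, `ini_model`, and `DiskCache` are taken from that example and may change between versions, so treat this as an illustrative sketch rather than a definitive API reference.

```python
# Illustrative sketch of oLLM usage; names follow the README example and may
# differ between versions -- check the repository for the current API.
from ollm import Inference, TextStreamer

o = Inference("llama3-1B-chat", device="cuda:0")            # pick a supported model
o.ini_model(models_dir="./models/", force_download=False)   # weights are loaded from disk on demand
past_key_values = o.DiskCache(cache_dir="./kv_cache/")      # offload the KV cache to disk; None for short contexts

streamer = TextStreamer(o.tokenizer, skip_prompt=True)
messages = [{"role": "user", "content": "Summarize the attached report."}]
input_ids = o.tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(o.device)

outputs = o.model.generate(
    input_ids=input_ids,
    past_key_values=past_key_values,
    max_new_tokens=500,
    streamer=streamer,
).cpu()
print(o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```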
Key Features
- Support for multiple models and ultra-long contexts (examples include Qwen3-Next, gpt-oss, and Llama 3).
- On-demand weight loading and disk/CPU offloading of the KV cache and model layers to reduce the GPU memory footprint (see the offloading sketch after this list).
- Memory- and performance-oriented techniques: FlashAttention-2, chunked MLP, DiskCache for KV storage.
- Examples, connectors, and batch/streaming modes are included in the repository.
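oLLM ships its own on-demand loader, but the underlying idea of capping GPU memory and spilling layers to CPU or disk can be illustrated with standard Hugging Face Transformers/Accelerate APIs; the model name and memory budgets below are placeholders, not oLLM defaults.

```python
# Generic layer-offloading sketch using Hugging Face Transformers + Accelerate,
# not oLLM's own loader; model name and memory budgets are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                       # let Accelerate place layers across devices
    max_memory={0: "7GiB", "cpu": "30GiB"},  # cap GPU use, spill the rest to CPU
    offload_folder="./offload",              # layers that fit nowhere go to disk
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```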
Use Cases
- Local inference of large-context models on consumer GPUs (e.g., 8 GB devices); see the sketch after this list.
- Analyzing large documents, logs, or clinical records in one pass for summarization or extraction.
- Research and engineering workflows that require controllable offload strategies and reproducible offline inference.
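For the single-pass document use case, the same effect can be approximated with the offloaded KV cache built into recent Transformers releases; the sketch below uses that generic mechanism rather than oLLM's DiskCache, and the model name and file path are placeholders.

```python
# Generic long-document sketch using Transformers' offloaded KV cache,
# not oLLM's DiskCache; model name and file path are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="cuda:0",
)

document = open("big_report.txt").read()  # one long document, processed in a single pass
prompt = f"Summarize the following report:\n\n{document}\n\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=300,
    cache_implementation="offloaded",  # keep per-layer KV tensors on CPU, prefetch to GPU as needed
)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```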
Technical Highlights
- Language: Python, built on Hugging Face Transformers and PyTorch.
- Memory strategies: layer-wise weight loading, KV cache offload to disk/CPU, chunked MLP, and FlashAttention-2 (see the chunked-MLP sketch after this list).
- See the repository README for examples and detailed configuration.
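Chunked MLP execution is a general technique rather than an oLLM-specific API: the feed-forward block is applied to slices of the sequence so that the large intermediate activation never materializes for the whole context at once. A minimal, self-contained sketch of the idea:

```python
# Minimal sketch of chunked MLP execution: process the sequence in slices so
# the large intermediate activation exists only for one chunk at a time.
import torch
import torch.nn as nn

class ChunkedMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, chunk_size: int = 1024):
        super().__init__()
        self.up = nn.Linear(hidden_size, intermediate_size)
        self.down = nn.Linear(intermediate_size, hidden_size)
        self.act = nn.GELU()
        self.chunk_size = chunk_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size); split along the sequence dimension
        outputs = []
        for chunk in x.split(self.chunk_size, dim=1):
            outputs.append(self.down(self.act(self.up(chunk))))
        return torch.cat(outputs, dim=1)

mlp = ChunkedMLP(hidden_size=256, intermediate_size=1024, chunk_size=512)
x = torch.randn(1, 4096, 256)   # a long "context" of 4096 positions
print(mlp(x).shape)             # torch.Size([1, 4096, 256])
```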