Overview
oLLM is a lightweight Python library for large-context LLM inference built on Hugging Face Transformers and PyTorch. It targets very long contexts on resource-constrained GPUs by loading model weights from disk on demand, offloading the KV cache to disk or CPU, and applying FlashAttention-2 and chunked MLP optimizations.
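A typical session, sketched below, follows the usage pattern shown in the repository README: the model is initialized from on-disk weights and the KV cache is redirected to disk. Class and method names such as `Inference`, `ini_model`, and `DiskCache` are taken from that example and may change between versions, so treat this as an illustrative sketch rather than a definitive API reference.

```python
# Illustrative sketch of oLLM usage; names follow the README example and may
# differ between versions -- check the repository for the current API.
from ollm import Inference, TextStreamer

o = Inference("llama3-1B-chat", device="cuda:0")            # pick a supported model
o.ini_model(models_dir="./models/", force_download=False)   # weights are loaded from disk on demand
past_key_values = o.DiskCache(cache_dir="./kv_cache/")      # offload the KV cache to disk; None for short contexts

streamer = TextStreamer(o.tokenizer, skip_prompt=True)
messages = [{"role": "user", "content": "Summarize the attached report."}]
input_ids = o.tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(o.device)

outputs = o.model.generate(
    input_ids=input_ids,
    past_key_values=past_key_values,
    max_new_tokens=500,
    streamer=streamer,
).cpu()
print(o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```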
Key Features
- Support for multiple models and ultra-long contexts (examples include Qwen3-Next, gpt-oss, and Llama 3).
- On-demand weight loading and disk/CPU offloading of the KV cache and model layers to reduce the GPU memory footprint (see the offloading sketch after this list).
- Memory- and performance-oriented techniques: FlashAttention-2, chunked MLP, DiskCache for KV storage.
- Examples, connectors, and batch/streaming modes are included in the repository.
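oLLM ships its own on-demand loader, but the underlying idea of capping GPU memory and spilling layers to CPU or disk can be illustrated with standard Hugging Face Transformers/Accelerate APIs; the model name and memory budgets below are placeholders, not oLLM defaults.

```python
# Generic layer-offloading sketch using Hugging Face Transformers + Accelerate,
# not oLLM's own loader; model name and memory budgets are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                       # let Accelerate place layers across devices
    max_memory={0: "7GiB", "cpu": "30GiB"},  # cap GPU use, spill the rest to CPU
    offload_folder="./offload",              # layers that fit nowhere go to disk
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```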
Use Cases
- Local inference of large-context models on consumer GPUs (e.g., 8 GB devices); see the sketch after this list.
- Analyzing large documents, logs, or clinical records in one pass for summarization or extraction.
- Research and engineering workflows that require controllable offload strategies and reproducible offline inference.
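For the single-pass document use case, the same effect can be approximated with the offloaded KV cache built into recent Transformers releases; the sketch below uses that generic mechanism rather than oLLM's DiskCache, and the model name and file path are placeholders.

```python
# Generic long-document sketch using Transformers' offloaded KV cache,
# not oLLM's DiskCache; model name and file path are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="cuda:0",
)

document = open("big_report.txt").read()  # one long document, processed in a single pass
prompt = f"Summarize the following report:\n\n{document}\n\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=300,
    cache_implementation="offloaded",  # keep per-layer KV tensors on CPU, prefetch to GPU as needed
)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```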
Technical Highlights
- Language: Python, built on Hugging Face Transformers and PyTorch.
- Memory strategies: layer-wise weight loading, KV cache offload to disk/CPU, chunked MLP, and FlashAttention-2 (see the chunked-MLP sketch after this list).
- See the repository README for examples and detailed configuration.
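Chunked MLP execution is a general technique rather than an oLLM-specific API: the feed-forward block is applied to slices of the sequence so that the large intermediate activation never materializes for the whole context at once. A minimal, self-contained sketch of the idea:

```python
# Minimal sketch of chunked MLP execution: process the sequence in slices so
# the large intermediate activation exists only for one chunk at a time.
import torch
import torch.nn as nn

class ChunkedMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, chunk_size: int = 1024):
        super().__init__()
        self.up = nn.Linear(hidden_size, intermediate_size)
        self.down = nn.Linear(intermediate_size, hidden_size)
        self.act = nn.GELU()
        self.chunk_size = chunk_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size); split along the sequence dimension
        outputs = []
        for chunk in x.split(self.chunk_size, dim=1):
            outputs.append(self.down(self.act(self.up(chunk))))
        return torch.cat(outputs, dim=1)

mlp = ChunkedMLP(hidden_size=256, intermediate_size=1024, chunk_size=512)
x = torch.randn(1, 4096, 256)   # a long "context" of 4096 positions
print(mlp(x).shape)             # torch.Size([1, 4096, 256])
```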