
oLLM

Lightweight offline LLM inference library for very long contexts on low-memory GPUs.

Overview

oLLM is a lightweight Python library for large-context LLM inference built on Hugging Face Transformers and PyTorch. It targets inference over very long contexts on resource-constrained GPUs by loading weights from disk on demand, offloading the KV cache to disk or CPU, and applying FlashAttention-2 and chunked-MLP optimizations.
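
To make the offloading idea concrete, here is a minimal sketch of the general weight-offload pattern as exposed by stock Hugging Face Transformers and Accelerate. This is not oLLM's own API; the model id and offload folder are placeholders chosen for illustration.

```python
# Not oLLM's API: a generic sketch of disk/CPU weight offloading with
# plain Transformers + Accelerate, shown only to illustrate the technique.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets Accelerate split layers across GPU, CPU, and disk;
# offload_folder is where layers that do not fit in RAM are spilled to disk.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="./offload",  # disk spill directory (placeholder path)
)

inputs = tokenizer("Summarize the following report: ...", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```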

Key Features

  • Support for multiple models and ultra-long contexts (examples include qwen3-next, gpt-oss, Llama3).
  • On-demand weight loading and disk/CPU offloading of the KV cache and layers to reduce the GPU memory footprint.
  • Memory- and performance-oriented techniques: FlashAttention-2, chunked MLP, and DiskCache for KV storage (see the chunked-MLP sketch after this list).
  • Examples, connectors, and batch/streaming modes are included in the repository.
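
The chunked-MLP feature refers to a general memory-saving technique: running the transformer feed-forward block over the sequence in slices so the large intermediate activation never materializes all at once. The sketch below is a generic PyTorch illustration of that idea, not oLLM's internal code; the class name and sizes are made up for the example.

```python
# Generic illustration of chunked MLP (not oLLM internals): processing the
# sequence in slices bounds peak activation memory to (batch, chunk, d_ff)
# instead of (batch, seq_len, d_ff).
import torch
import torch.nn as nn

class ChunkedMLP(nn.Module):
    def __init__(self, d_model: int, d_ff: int, chunk_size: int = 1024):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        self.act = nn.GELU()
        self.chunk_size = chunk_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); split along the sequence dimension
        # so only one chunk's intermediate buffer is alive at a time.
        outputs = []
        for chunk in x.split(self.chunk_size, dim=1):
            outputs.append(self.down(self.act(self.up(chunk))))
        return torch.cat(outputs, dim=1)

mlp = ChunkedMLP(d_model=1024, d_ff=4096, chunk_size=512)
y = mlp(torch.randn(1, 8192, 1024))  # long sequence, bounded peak memory
```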

Use Cases

  • Local inference of large-context models on consumer GPUs (e.g., 8GB devices).
  • Analyzing large documents, logs, or clinical records in one pass for summarization or extraction.
  • Research and engineering workflows that require controllable offload strategies and reproducible offline inference.

Technical Highlights

  • Language: Python, built on Hugging Face Transformers and PyTorch.
  • Memory strategies: layer-wise loading, KV-cache offload to disk/CPU, chunked MLP, and FlashAttention-2 (see the sketch below this list).
  • See the repository README for examples and detailed configuration.
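
For orientation, the sketch below shows how two of these techniques look in stock Transformers rather than in oLLM's wrapper: FlashAttention-2 via the attn_implementation argument and a CPU-offloaded KV cache via cache_implementation. Availability depends on the installed Transformers version and on the flash-attn package being built for your GPU; the model id is a placeholder.

```python
# A sketch of FlashAttention-2 plus an offloaded KV cache in plain
# Transformers (not oLLM's API); version- and hardware-dependent.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="cuda:0",
)

prompt = tok("Long document goes here ...", return_tensors="pt").to(model.device)
out = model.generate(
    **prompt,
    max_new_tokens=256,
    cache_implementation="offloaded",  # keep the KV cache on CPU, moved layer by layer
)
print(tok.decode(out[0], skip_special_tokens=True))
```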

Resource Info

  • Author: Mega4alik
  • Added: 2025-09-29
  • Tags: LLM Inference, Dev Tools, OSS