olmOCR

A toolkit for linearizing PDFs and image-based documents into readable plain text and Markdown, aimed at LLM dataset creation and large-scale document processing.

Allen Institute for AI (AI2) · Since 2024-09-17

Introduction

olmOCR is a toolkit developed by the Allen Institute for AI to linearize PDFs and image-based documents into structured plain text or Markdown. It targets LLM dataset creation and industrial-scale document processing, preserving equations, tables and layout while producing natural reading-order text suitable for downstream use.

Key features

Support for multiple input formats (PDF, PNG, JPEG), with recognition of tables, equations and handwriting.
Automatic header/footer removal and reading-order reconstruction, with Markdown output for easy tooling integration.
Benchmark suite (olmOCR-Bench) and Docker images to evaluate performance and scale deployments.
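
To make the header/footer removal idea concrete, here is a minimal sketch of a simple stand-in heuristic: drop the first or last line of a page when the same text repeats across most pages. This is not olmOCR's method (the toolkit relies on a vision-language model), and the function name strip_repeated_margins and the min_fraction parameter are hypothetical, chosen only for illustration.

```python
from collections import Counter


def strip_repeated_margins(pages, min_fraction=0.6):
    """Remove first/last lines that repeat across most pages.

    pages: list of pages, each a list of text lines (hypothetical input shape).
    min_fraction: share of pages a line must appear on to count as a
    running header or footer (illustrative default, not from olmOCR).
    """
    if not pages:
        return []

    # Count how often each text appears as the first or last line of a page.
    first = Counter(p[0].strip() for p in pages if p)
    last = Counter(p[-1].strip() for p in pages if p)

    # Require at least two occurrences so a single page never matches itself.
    threshold = max(2, int(len(pages) * min_fraction))
    headers = {text for text, n in first.items() if n >= threshold}
    footers = {text for text, n in last.items() if n >= threshold}

    cleaned = []
    for page in pages:
        lines = list(page)
        if lines and lines[0].strip() in headers:
            lines = lines[1:]
        if lines and lines[-1].strip() in footers:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned
```

A heuristic like this only catches margins whose text is identical on every page; footers that carry a changing page number would need normalization first, which is one reason a model-based approach is attractive.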

Use cases

Converting academic papers and reports into corpora for LLM training or retrieval.
Batch-processing institutional archives and compliance documents for indexing and archival.
Preprocessing and baseline generation in data annotation and quality evaluation workflows.

Technical highlights

Vision-language model based decoding with vLLM/SGLang inference backends.
Multi-node, S3-coordinated pipelines to process millions of PDFs at scale.
Reproducible training/finetuning code, synthetic data generation, and benchmark tooling.
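
One way to picture a pipeline coordinated through shared storage is a deterministic work queue: every worker derives the same set of work items from the same input list, so workers on different nodes can claim items without talking to each other directly. The sketch below is an assumption about how such grouping could look, not olmOCR's actual code; build_work_items, group_size and the s3:// paths in the usage note are hypothetical.

```python
import hashlib


def build_work_items(pdf_paths, group_size=500):
    """Group PDF paths into fixed-size work items keyed by a stable hash.

    Sorting and de-duplicating first makes the result independent of input
    order, so any worker that sees the same paths computes the same item IDs.
    """
    paths = sorted(set(pdf_paths))
    items = {}
    for start in range(0, len(paths), group_size):
        group = paths[start:start + group_size]
        # The ID depends only on the group's contents, never on the worker.
        digest = hashlib.sha1("\n".join(group).encode("utf-8")).hexdigest()
        items[digest] = group
    return items
```

For example, build_work_items(["s3://bucket/b.pdf", "s3://bucket/a.pdf"], group_size=1) yields two items, and calling it again with the paths in a different order yields the same IDs. A worker could then record a finished item by writing an output object named after its ID, letting other workers skip it.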
