Introduction
dots.ocr is a multilingual document parser built on a compact 1.7B-parameter vision-language model (VLM). It unifies layout detection and content recognition in a single model while preserving reading order, and achieves strong end-to-end results on benchmarks such as OmniDocBench. The project ships CLI tools, scripts for downloading model weights, and multiple deployment options (vLLM, Hugging Face Transformers, Docker).
Key Features
- Single-model approach for both layout detection and recognition, simplifying pipelines
- Strong end-to-end performance on layout and text recognition benchmarks
- Support for multilingual parsing, table and formula recognition
- Provides a Gradio web demo, a Docker image, and multiple inference backends (vLLM, Hugging Face Transformers)
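The reading-order preservation mentioned above can be illustrated with a small post-processing sketch. Note that the box schema (`bbox`, `category`) and the sort heuristic here are illustrative assumptions, not the model's actual output format:

```python
# Hypothetical layout blocks: a bbox (x0, y0, x1, y1) in page coordinates
# plus a category label. Sorting top-to-bottom, then left-to-right,
# approximates single-column reading order.

def reading_order(blocks, row_tolerance=10):
    """Sort layout blocks into approximate reading order.

    Blocks whose top edges fall within `row_tolerance` pixels of each
    other are treated as one row and ordered left-to-right.
    """
    def key(block):
        x0, y0, _, _ = block["bbox"]
        return (round(y0 / row_tolerance), x0)
    return sorted(blocks, key=key)

blocks = [
    {"bbox": (300, 12, 580, 40), "category": "Text"},
    {"bbox": (20, 10, 280, 40), "category": "Title"},
    {"bbox": (20, 60, 580, 300), "category": "Table"},
]
ordered = reading_order(blocks)
```

A real parser handles multi-column pages and nested regions, which a plain coordinate sort cannot; this sketch only shows the kind of ordering the model is expected to preserve end to end.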
Use Cases
- Research and benchmarking for document understanding models
- Building RAG pipelines by converting PDFs and scans into retrievable chunks
- Bulk extraction of metadata and sections from academic papers or reports
- Local/private deployment for privacy-sensitive document processing
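For the RAG use case above, the parser's per-block output can be grouped into retrievable chunks. The block schema (`category`, `text`) and the `Section-header` label are assumptions for illustration; the splitting heuristic is a generic sketch, not part of dots.ocr:

```python
# Sketch: turn hypothetical parser output (one dict per layout block)
# into text chunks for a RAG index, splitting at section headings or
# when a chunk exceeds a character budget.

def chunk_blocks(blocks, max_chars=800):
    chunks, current, size = [], [], 0
    for block in blocks:
        text = block["text"].strip()
        if not text:
            continue
        # Flush the current chunk at each new heading or on budget overflow.
        if current and (block["category"] == "Section-header"
                        or size + len(text) > max_chars):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks

parsed = [
    {"category": "Section-header", "text": "1 Introduction"},
    {"category": "Text", "text": "Document parsing converts scans to structure."},
    {"category": "Section-header", "text": "2 Method"},
    {"category": "Text", "text": "We train a compact vision-language model."},
]
chunks = chunk_blocks(parsed)
```

Chunking on layout boundaries rather than raw character offsets keeps headings attached to their body text, which tends to improve retrieval quality.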
Technical Highlights
- Python implementation, cross-platform, pip-installable and Docker-friendly
- Built on a single 1.7B VLM; tasks are switched simply by changing the prompt sent to the model
- Integrates with vLLM for high-throughput inference, with Hugging Face Transformers as an alternative backend
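The prompt-switching design above can be sketched as a small task registry. The task names and prompt texts here are illustrative placeholders, not the prompts dots.ocr actually ships, and the chat-style message shape is a common VLM convention rather than the project's confirmed request format:

```python
# Sketch: one model, many tasks -- each task is just a different text
# prompt paired with the page image. Prompt strings below are hypothetical.

TASK_PROMPTS = {
    "layout": "Output the layout of the document as JSON (bbox + category).",
    "ocr": "Transcribe all text in the image in reading order.",
    "table": "Extract every table from the image as HTML.",
}

def build_request(task, image_path):
    """Build a chat-style request pairing one image with one task prompt."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": TASK_PROMPTS[task]},
            ],
        }]
    }

req = build_request("layout", "page_1.png")
```

Keeping tasks as prompts rather than separate model heads is what lets a single checkpoint serve layout detection, OCR, and table extraction from the same deployment.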