Overview
Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) is a lightweight multimodal model for document image parsing. It follows an analyze-then-parse paradigm to balance accuracy and efficiency, supporting both page-level and element-level parsing tasks.
Key Features
- Two-stage analyze-then-parse workflow: generate element sequences in natural reading order, then parse elements in parallel.
- Heterogeneous anchor prompting enables task-specific parsing strategies for paragraphs, tables, formulas, and images.
- Efficient parallel decoding to improve throughput for large-scale document processing.
- Native integration with Hugging Face; pretrained models and demos are provided.
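The analyze-then-parse workflow described above can be sketched roughly as follows. This is a minimal illustration, not Dolphin's actual API: `Element`, `ANCHOR_PROMPTS`, `analyze_layout`, `parse_element`, and `parse_page` are hypothetical names, the prompt strings are placeholders, both model calls are stubbed out, and a thread pool stands in for the model's batched parallel decoding.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Element:
    kind: str    # "paragraph", "table", "formula", or "image"
    bbox: tuple  # (x0, y0, x1, y1) in page pixels


# Hypothetical anchor prompts, one per element type (placeholder wording).
ANCHOR_PROMPTS = {
    "paragraph": "Read the text in this region.",
    "table": "Parse the table in this region.",
    "formula": "Read the formula in this region.",
    "image": "Describe the figure in this region.",
}


def analyze_layout(page):
    """Stage 1 (stub): return the page's elements in reading order.

    A real system would query the model with the full page image.
    """
    return [
        Element("paragraph", (40, 40, 560, 120)),
        Element("table", (40, 140, 560, 320)),
        Element("formula", (40, 340, 560, 380)),
        Element("image", (40, 400, 560, 700)),
    ]


def parse_element(page, element):
    """Stage 2 (stub): pick the type-specific prompt and parse one element.

    A real system would crop `element.bbox` from the page and send the
    crop plus the prompt to the model.
    """
    prompt = ANCHOR_PROMPTS[element.kind]
    return {
        "type": element.kind,
        "bbox": element.bbox,
        "prompt": prompt,
        "content": f"<parsed {element.kind}>",
    }


def parse_page(page, max_workers=4):
    """Run stage 1 once, then parse all elements concurrently."""
    elements = analyze_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the reading
        # order from stage 1 survives the parallel stage.
        return list(pool.map(lambda el: parse_element(page, el), elements))
```

Because the second stage only depends on the element list from the first, each element can be parsed independently, which is what makes parallel execution possible.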
Use Cases
- Convert scanned or photographed documents to structured JSON/Markdown (OCR + structural parsing).
- Extract tables and formulas from academic papers, reports, and invoices.
- Batch processing and information extraction for large PDF collections (indexing, dataset building).
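As a rough illustration of the structured JSON/Markdown output mentioned above, the sketch below serializes a list of parsed elements. The element schema (`type` and `content` keys) and the helper names `to_markdown` and `to_json` are assumptions made for this example, not Dolphin's documented output format.

```python
import json


def to_markdown(elements):
    """Render parsed elements (assumed schema) as a Markdown string."""
    parts = []
    for el in elements:
        kind, content = el["type"], el["content"]
        if kind == "formula":
            # Wrap LaTeX in display-math delimiters.
            parts.append(f"$$\n{content}\n$$")
        elif kind == "image":
            # Assume `content` holds a path to the cropped figure.
            parts.append(f"![figure]({content})")
        else:
            # Paragraphs and tables are assumed to be Markdown-ready text.
            parts.append(content)
    return "\n\n".join(parts)


def to_json(elements):
    """Serialize parsed elements as JSON, preserving reading order."""
    return json.dumps({"elements": elements}, ensure_ascii=False, indent=2)
```

Keeping the elements as an ordered list means the same data can feed both outputs: JSON for indexing or dataset building, Markdown for human-readable conversion.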
Technical Highlights
- Uses a single vision-language model (VLM) for both stages: it generates the element sequence for a page, then parses each element under a task-specific prompt.
- Parses the elements of a page in parallel rather than one at a time, which increases throughput in document pipelines.
- Offers multiple deployment options: the original framework, the Hugging Face model format, and accelerated inference with TensorRT-LLM or vLLM.