Dolphin

Document image parsing model using heterogeneous anchor prompting; provides efficient page-level and element-level parsing capabilities.

Author: ByteDance

Added Date: 2025-07-09

Open Source Since: 2025-05-13


Overview

Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) is a lightweight multimodal model for document image parsing. It follows an analyze-then-parse paradigm to balance accuracy and efficiency, supporting both page-level and element-level parsing tasks.

Key Features

Two-stage analyze-then-parse workflow: generate element sequences in natural reading order, then parse elements in parallel.
Heterogeneous anchor prompting enables task-specific parsing strategies for paragraphs, tables, formulas, and images.
Efficient parallel decoding to improve throughput for large-scale document processing.
Native integration with Hugging Face; pretrained models and demos are provided.
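
The two-stage workflow above can be illustrated with a minimal sketch. This is not Dolphin's code or API: `Element`, `analyze_layout`, `parse_element`, `parse_page`, and the prompt strings are hypothetical stand-ins, the "page" is just a list of (type, bounding box) pairs, and a thread pool stands in for the model's parallel decoding. It only shows the control flow: get elements in reading order, choose a prompt by element type, parse the elements concurrently, and keep the original order.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical prompt templates, one per element type (not Dolphin's actual prompts).
PROMPTS = {
    "paragraph": "Read the text in the image.",
    "table": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
}


@dataclass
class Element:
    kind: str    # element type, e.g. "paragraph", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) region on the page, used as the anchor


def analyze_layout(page):
    """Stage 1 stand-in: return the page's elements in reading order."""
    return [Element(kind, bbox) for kind, bbox in page]


def parse_element(element):
    """Stage 2 stand-in: pair the anchor with its type-specific prompt."""
    prompt = PROMPTS.get(element.kind, PROMPTS["paragraph"])
    return {"kind": element.kind, "bbox": element.bbox, "prompt": prompt}


def parse_page(page, max_workers=4):
    """Analyze the page once, then parse all elements concurrently."""
    elements = analyze_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so reading order is preserved.
        return list(pool.map(parse_element, elements))


page = [
    ("paragraph", (0, 0, 100, 20)),
    ("table", (0, 30, 100, 80)),
    ("formula", (0, 90, 100, 110)),
]
results = parse_page(page)
```

In a real pipeline the two stand-in functions would call the model, and the thread pool would typically be replaced by batched inference over the cropped element images.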

Use Cases

Convert scanned or photographed documents to structured JSON/Markdown (OCR + structural parsing).
Extract tables and formulas from academic papers, reports, and invoices.
Batch processing and information extraction for large PDF collections (indexing, dataset building).
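
As a rough illustration of the first use case, the sketch below turns a list of parsed elements into Markdown and JSON. The element schema (`type`, `text`, `reading_order`) and the functions `to_markdown` and `to_json` are assumptions made for this example, not Dolphin's documented output format.

```python
import json


def to_markdown(elements):
    """Join parsed elements into Markdown, following reading order."""
    blocks = []
    for el in sorted(elements, key=lambda e: e["reading_order"]):
        if el["type"] == "formula":
            # Wrap formulas as display math; tables and text pass through unchanged.
            blocks.append("$$\n" + el["text"] + "\n$$")
        else:
            blocks.append(el["text"])
    return "\n\n".join(blocks)


def to_json(elements):
    """Serialize the same elements as JSON, keeping non-ASCII text readable."""
    ordered = sorted(elements, key=lambda e: e["reading_order"])
    return json.dumps(ordered, ensure_ascii=False, indent=2)


# Hypothetical parser output, deliberately out of order.
elements = [
    {"type": "formula", "text": "E = mc^2", "reading_order": 1},
    {"type": "paragraph", "text": "Mass and energy are related.", "reading_order": 0},
]
markdown = to_markdown(elements)
```

Sorting by an explicit order field keeps the output stable even when elements are parsed in parallel and arrive out of order.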

Technical Highlights

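Uses a single vision-language model (VLM) for both stages: it first generates the element sequence, then parses each element under a task-specific prompt.
Parsing elements in parallel, rather than decoding the whole page in one sequential pass, increases throughput for document pipelines.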
Offers multiple deployment options (original framework, Hugging Face model format, TensorRT/vLLM acceleration).


Related Resources

MineContext

Eino

BAGEL