
Dolphin

A document image parsing model built on heterogeneous anchor prompting, with efficient page-level and element-level parsing.

Overview

Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) is a lightweight multimodal model for document image parsing. It follows an analyze-then-parse paradigm to balance accuracy and efficiency, supporting both page-level and element-level parsing tasks.

Key Features

  • Two-stage analyze-then-parse workflow: generate element sequences in natural reading order, then parse elements in parallel.
  • Heterogeneous anchor prompting enables task-specific parsing strategies for paragraphs, tables, formulas, and images.
  • Efficient parallel decoding to improve throughput for large-scale document processing.
  • Native integration with Hugging Face; pretrained models and demos are provided.
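
The analyze-then-parse workflow can be illustrated with a minimal sketch. Everything here is a stand-in rather than Dolphin's actual API: `Element`, `PROMPTS`, `analyze_layout`, `parse_element`, and `parse_page` are hypothetical names, the prompt strings are invented for illustration, and a thread pool takes the place of the model's parallel decoding. The page is faked as a list of (type, bounding box) pairs so the example runs without a model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Element:
    kind: str    # "paragraph", "table", "formula", or "image"
    bbox: tuple  # (x0, y0, x1, y1) region that acts as the anchor


# Hypothetical prompts, one per element type (not Dolphin's real prompts).
PROMPTS = {
    "paragraph": "Read the text in this region.",
    "table": "Parse this table into rows and cells.",
    "formula": "Transcribe this formula as LaTeX.",
    "image": "Describe this figure region.",
}


def analyze_layout(page):
    """Stage 1 stand-in: return the page's elements in reading order."""
    return [Element(kind, bbox) for kind, bbox in page]


def parse_element(element):
    """Stage 2 stand-in: pair an anchor with its type-specific prompt."""
    return {
        "kind": element.kind,
        "bbox": element.bbox,
        "prompt": PROMPTS[element.kind],
    }


def parse_page(page, max_workers=4):
    elements = analyze_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map() yields results in input order, so the reading order
        # from stage 1 is preserved even though elements run concurrently.
        return list(pool.map(parse_element, elements))


# A fake page: (element type, bounding box) pairs in reading order.
page = [
    ("paragraph", (0, 0, 100, 20)),
    ("table", (0, 30, 100, 80)),
    ("formula", (0, 90, 100, 110)),
]
results = parse_page(page)
```

The point of the sketch is the shape of the pipeline: one ordered pass to find elements, then independent per-element work whose results are reassembled in the original order.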

Use Cases

  • Convert scanned or photographed documents to structured JSON/Markdown (OCR + structural parsing).
  • Extract tables and formulas from academic papers, reports, and invoices.
  • Batch processing and information extraction for large PDF collections (indexing, dataset building).
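
As a rough illustration of the structured-output use case, the sketch below turns a list of parsed elements into both JSON and Markdown. The element schema (`type` and `content` keys) and the `to_markdown` helper are assumptions made for this example; the actual output format of the model may differ.

```python
import json


def to_markdown(elements):
    """Join parsed elements into Markdown, keeping their reading order."""
    blocks = []
    for el in elements:
        if el["type"] == "formula":
            # Wrap formulas in display-math delimiters.
            blocks.append(f"$$\n{el['content']}\n$$")
        elif el["type"] == "image":
            # Here an image's content is assumed to be a file path.
            blocks.append(f"![figure]({el['content']})")
        else:
            # Paragraphs and tables are assumed to be Markdown already.
            blocks.append(el["content"])
    return "\n\n".join(blocks)


# Hypothetical parser output for one page.
elements = [
    {"type": "paragraph", "content": "Quarterly revenue grew."},
    {"type": "table", "content": "| Item | Total |\n| --- | --- |\n| A | 10 |"},
    {"type": "formula", "content": "E = mc^2"},
]

markdown = to_markdown(elements)
as_json = json.dumps(elements, indent=2)
```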

Technical Highlights

  • Uses a single vision-language model (VLM) for both stages: it generates the element sequence, then parses each element with a task-specific prompt.
  • Parses elements in parallel rather than decoding the whole page sequentially, which raises throughput in document pipelines.
  • Offers multiple deployment options (original framework, Hugging Face model format, TensorRT/vLLM acceleration).
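
One common way to realize parallel parsing is to group cropped elements into fixed-size batches, so that each batch is decoded in a single model call. The sketch below shows only that grouping logic, under stated assumptions: `batched` and `fake_decode_batch` are hypothetical helpers, the batch size of 3 is arbitrary, and the decode step is a placeholder where a real batched model call would go.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_decode_batch(batch):
    """Placeholder for one batched model call over several element crops."""
    return [f"parsed:{name}" for name in batch]


# Seven hypothetical element crops from one page, in reading order.
crops = [f"element_{i}" for i in range(7)]

outputs = []
for batch in batched(crops, batch_size=3):
    # Extending in batch order keeps outputs aligned with the input order.
    outputs.extend(fake_decode_batch(batch))
```

A larger batch size generally means fewer model calls at the cost of more memory per call, so the right value depends on the hardware and the deployment option in use.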

Resource Info
🌱 Open Source 🧲 Utility