Dolphin

A novel multimodal document image parsing model following an analyze-then-parse paradigm.

Dolphin stands for Document Image Parsing via Heterogeneous Anchor Prompting. This repository contains the demo code and pre-trained models for Dolphin.

Overview

Document image parsing is challenging because a single page interleaves elements such as text paragraphs, figures, formulas, and tables. Dolphin addresses this with a two-stage approach:

  • Stage 1: Page-level layout analysis, which generates a sequence of layout elements in natural reading order
  • Stage 2: Parallel parsing of those elements, using heterogeneous anchors and task-specific prompts
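
The two stages above can be sketched as a small pipeline. This is only an illustration of the analyze-then-parse idea, not Dolphin's actual code: the `Element` type, the `PROMPTS` strings, and the `analyze_layout` and `parse_element` callables are all hypothetical stand-ins for the model calls, and a thread pool is used here as a simple substitute for whatever parallel decoding the real model performs.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Element:
    """One layout element found in stage 1 (hypothetical structure)."""
    kind: str   # e.g. "text", "table", "formula"
    bbox: BBox  # region of the page that stage 2 will parse


# Hypothetical task-specific prompts, keyed by element kind.
PROMPTS: Dict[str, str] = {
    "text": "Read the text in the image.",
    "table": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
}


def parse_page(
    page: object,
    analyze_layout: Callable[[object], List[Element]],
    parse_element: Callable[[object, BBox, str], str],
    max_workers: int = 4,
) -> List[str]:
    """Run stage 1, then parse every element concurrently in stage 2."""
    # Stage 1: elements are assumed to come back in reading order.
    elements = analyze_layout(page)

    def parse_one(element: Element) -> str:
        # Pick the prompt for this element kind; unknown kinds
        # fall back to the plain-text prompt in this sketch.
        prompt = PROMPTS.get(element.kind, PROMPTS["text"])
        return parse_element(page, element.bbox, prompt)

    # Stage 2: Executor.map yields results in input order, so the
    # reading order from stage 1 is preserved in the output.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_one, elements))
```

Passing the two model calls in as plain callables keeps the sketch independent of any particular model API; in practice they would wrap whatever inference interface the repository provides.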

Dolphin achieves promising performance across diverse page-level and element-level parsing tasks, and its lightweight architecture and parallel parsing mechanism keep it efficient.

Resource Info
Author ByteDance
Added Date 2025-08-09
Type Tool
Tags Development