A guide to building long-term compounding knowledge infrastructure. See details on GitHub.

Marker

Marker converts PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files to markdown, JSON, chunks, and HTML quickly and accurately.

Tool Features

Marker accepts the following inputs:

  • PDF files
  • Image files
  • PPTX, DOCX, XLSX files
  • HTML files
  • EPUB files
  • Documents in any language
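
As a rough illustration of how the inputs above map onto a conversion call, the sketch below builds an argument list for Marker's `marker_single` command-line tool. The helper name `build_marker_command` and the extension table are hypothetical (the image extensions in particular are an assumption), and the flags `--output_format`, `--output_dir`, and `--use_llm` should be checked against the version of Marker you have installed.

```python
from pathlib import Path

# Extensions inferred from the input list above.
# The image extensions are an assumption, not taken from Marker's documentation.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".png", ".jpg", ".jpeg",
    ".pptx", ".docx", ".xlsx", ".html", ".epub",
}

# Output formats named in the description of Marker.
OUTPUT_FORMATS = {"markdown", "json", "chunks", "html"}


def build_marker_command(path, output_format="markdown", output_dir=None, use_llm=False):
    """Return an argv list for `marker_single` (hypothetical helper).

    Nothing is executed here; pass the result to subprocess.run() yourself.
    """
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported input type: {suffix or '(none)'}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"Unknown output format: {output_format}")

    cmd = ["marker_single", str(path), "--output_format", output_format]
    if output_dir is not None:
        cmd += ["--output_dir", str(output_dir)]
    if use_llm:
        # Optional LLM pass; flag name assumed from Marker's CLI.
        cmd.append("--use_llm")
    return cmd
```

Building the argument list separately from running it keeps the validation testable without Marker installed; `subprocess.run(build_marker_command("report.pdf"), check=True)` would then perform the conversion.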

Formatting Capabilities

Marker handles various document elements:

  • Formats tables, forms, equations, and inline math
  • Extracts links, references, and code blocks
  • Extracts and saves images
  • Removes headers/footers and other artifacts

Extensibility

Marker can be extended and configured in several ways:

  • Extensible with your own formatting and logic
  • Performs structured extraction against a JSON schema you supply (beta)
  • Optionally boost accuracy with LLMs (and your own prompt)
  • Works on GPU, CPU, or MPS
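
For custom logic, Marker can also be driven from Python. The sketch below follows the usage pattern shown in Marker's README at the time of writing (`PdfConverter`, `create_model_dict`, `text_from_rendered`); treat those import paths as an assumption to verify against your installed version. The wrapper name `convert_to_markdown` is hypothetical.

```python
from pathlib import Path


def convert_to_markdown(path):
    """Convert one document to markdown text with Marker (hypothetical wrapper).

    Returns (text, images) as produced by Marker's text_from_rendered helper.
    """
    source = Path(path)
    if not source.is_file():
        raise FileNotFoundError(f"No such document: {source}")

    # Imported lazily so the path check above works even without Marker installed,
    # and so model loading only happens when a conversion is requested.
    # These import paths are assumed from Marker's README.
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    converter = PdfConverter(artifact_dict=create_model_dict())
    rendered = converter(str(source))
    text, _, images = text_from_rendered(rendered)
    return text, images
```

Post-processing hooks (for example, rewriting headings or filtering images) can then operate on the returned `text` and `images` before anything is written to disk.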

Use Cases

Marker is suitable for scenarios that require converting various document formats to structured text, such as:

  • Converting PDF documents to editable Markdown format
  • Extracting structured data from documents
  • Preparing training data for machine learning projects
  • Document digitization and archiving
  • Automating document processing workflows

Resource Info

  • Author: Datalab.to
  • Added Date: 2025-08-21
  • Tags: Utility