Marker
Marker converts PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files to markdown, JSON, chunks, and HTML quickly and accurately.
Tool Features
Marker accepts the following kinds of input:
- PDF files
- Image files
- PPTX, DOCX, XLSX files
- HTML files
- EPUB files
- Documents in all languages
Formatting Capabilities
Marker handles the following document elements:
- Formats tables, forms, equations, and inline math
- Extracts links, references, and code blocks
- Extracts and saves images
- Removes headers/footers and other artifacts
Extensibility
Marker can be extended and configured in several ways:
- Extensible with your own formatting and logic
- Performs structured extraction, given a JSON schema (beta)
- Optionally boosts accuracy with LLMs (and your own prompt)
- Works on GPU, CPU, or MPS
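To make the structured-extraction item more concrete, here is a minimal sketch of what a JSON schema for a hypothetical invoice might look like, written as a plain Python dict. The field names (`vendor`, `invoice_number`, `total`) are invented for illustration, and the way a schema is actually passed to Marker is not shown here; consult the project's documentation for that step.

```python
import json

# Hypothetical schema: the field names are illustrative, not part of Marker.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["vendor", "invoice_number"],
}

# Serialize to the JSON text that a schema-driven extractor would consume.
schema_json = json.dumps(invoice_schema, indent=2)
print(schema_json)
```

Keeping the schema small and marking only the essential fields as required tends to make extraction results easier to validate.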
Use Cases
Marker suits workflows that need documents converted to structured text, such as:
- Converting PDF documents to editable Markdown
- Extracting structured data from documents
- Preparing training data for machine learning projects
- Document digitization and archiving
- Automating document processing workflows
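An automated workflow like the last item could be sketched as follows. The helper `build_commands` is hypothetical and only assembles command lines without running anything. The `marker_single` command and its `--output_format` and `--output_dir` options are written as I recall them from the project's README, so treat them as assumptions and verify them against your installed version. The image extensions in the filter are also an assumption.

```python
from pathlib import Path

# Extensions based on the input formats listed earlier; image types are assumed.
SUPPORTED = {".pdf", ".pptx", ".docx", ".xlsx", ".html", ".epub",
             ".png", ".jpg", ".jpeg"}


def build_commands(paths, output_dir, output_format="markdown"):
    """Return one argv list per supported file (hypothetical helper; nothing runs)."""
    commands = []
    for p in sorted(Path(x) for x in paths):
        if p.suffix.lower() not in SUPPORTED:
            continue  # skip files outside the listed input formats
        commands.append([
            "marker_single", str(p),          # CLI name assumed from the README
            "--output_format", output_format,  # assumed option name
            "--output_dir", str(output_dir),   # assumed option name
        ])
    return commands


cmds = build_commands(["b.pdf", "a.docx", "notes.txt"], "out")
for c in cmds:
    print(" ".join(c))
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` once the option names have been confirmed; building the commands separately makes the batch easy to inspect or log before anything is executed.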