pdfplumber

An open-source Python library built on pdfminer.six that exposes detailed PDF objects, table extraction, and visual debugging features.

jsvine · Since 2015-08-24

Loading score...

GitHub

Overview

pdfplumber is an open-source Python library built on top of pdfminer.six that provides access to low-level PDF objects (chars, lines, rects, images) and higher-level utilities for text extraction, table detection/extraction, and visual debugging. It is optimized for machine-generated PDFs rather than scanned documents.

Key features

Fine-grained object access to characters, lines, rectangles, and their coordinates.
Robust table extraction with configurable strategies and settings to handle diverse layouts.
Visual debugging tools that render pages with overlays for detected tables and objects to aid tuning and development.

Use cases

Extracting structured table data from machine-generated PDFs for ETL pipelines.
Analyzing PDF layout and coordinates for downstream text processing and annotation extraction.
Batch-processing large corpora of PDFs in scripting workflows and integrating into data pipelines.

Technical highlights

Leverages pdfminer.six for layout analysis and implements custom table-detection algorithms.
Offers both CLI and Python API usage with flexible parameters for advanced extraction scenarios.
Well-documented repository with examples, notebooks, and active community maintenance.

Core Content

Core Content

Technology

Technology

More

More

AI Infrastructure

AI Infrastructure

Explore

Explore

Connect

Connect

Quick Links

Quick Links

LinkedIn

LinkedIn

Follow on X

Follow on X

pdfplumber

Overview

Key features

Use cases

Technical highlights

Score Breakdown

Related Resources

PicoClaw

Agent Development Kit Web (ADK Web)

Claude Code Agents & Plugins