Overview
Crawlee is an open-source library for production-grade web crawling and browser automation. It provides a unified interface for HTTP and headless-browser crawlers and supports concurrency, proxy rotation, retries, and persistent request queues.
Key Features
- Multiple crawler types: high-performance HTTP crawlers and Playwright-based browser crawlers.
- Async-first with type hints for improved developer experience and IDE support.
- Built-in retries and request routing, plus proxy and session management to reduce the chance of being blocked.
- Persistent storage options for datasets and key-value stores.
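
To make the retry and concurrency features above more concrete, here is a minimal standard-library sketch of the general pattern: a bounded number of concurrent workers, each retrying a failed request a few times before giving up. This is not Crawlee's API. The names `crawl_with_retries`, `fetch`, `max_concurrency`, `max_attempts`, and `backoff` are hypothetical and chosen only for illustration; Crawlee provides this behavior through its own crawler classes and configuration.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def crawl_with_retries(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],  # hypothetical fetch coroutine supplied by the caller
    max_concurrency: int = 4,
    max_attempts: int = 3,
    backoff: float = 0.5,
) -> dict[str, Optional[str]]:
    """Fetch every URL with bounded concurrency; map failed URLs to None."""
    semaphore = asyncio.Semaphore(max_concurrency)
    results: dict[str, Optional[str]] = {}

    async def worker(url: str) -> None:
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            for attempt in range(1, max_attempts + 1):
                try:
                    results[url] = await fetch(url)
                    return
                except Exception:
                    if attempt == max_attempts:
                        # Out of attempts: record the failure and stop.
                        results[url] = None
                        return
                    # Wait a little longer after each failed attempt.
                    await asyncio.sleep(backoff * attempt)

    await asyncio.gather(*(worker(url) for url in urls))
    return results
```

A caller would pass its own `fetch` coroutine, for example `asyncio.run(crawl_with_retries(urls, fetch))`. In this sketch a worker keeps its concurrency slot while backing off, which keeps the code short; a production crawler would more likely put the request back on a queue so that other requests can proceed in the meantime.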
Use Cases
- Large-scale web scraping for training data, RAG pipelines, or analytics.
- Scraping JavaScript-heavy pages and simulating user interaction with PlaywrightCrawler.
- Long-running crawlers deployed on the Apify platform or in self-hosted environments.
Technical Details
- Python implementation that integrates with Playwright, BeautifulSoup, and modern async libraries.
- CLI templates and quickstart tools to bootstrap crawler projects.
- Extensible storage backends and robust error handling for production deployments.
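
As a rough illustration of what an extensible storage backend can look like, the sketch below defines a small dataset interface with two interchangeable implementations, one in memory and one backed by a JSON Lines file. It uses only the standard library and does not reproduce Crawlee's storage classes. `DatasetBackend`, `MemoryDataset`, `JsonlDataset`, and `save_titles` are hypothetical names introduced for this example.

```python
import json
from pathlib import Path
from typing import Any, Protocol


class DatasetBackend(Protocol):
    """Hypothetical interface that any storage backend must satisfy."""

    def push(self, item: dict[str, Any]) -> None: ...
    def items(self) -> list[dict[str, Any]]: ...


class MemoryDataset:
    """Keeps records in a list; contents are lost when the process exits."""

    def __init__(self) -> None:
        self._items: list[dict[str, Any]] = []

    def push(self, item: dict[str, Any]) -> None:
        self._items.append(dict(item))

    def items(self) -> list[dict[str, Any]]:
        return list(self._items)


class JsonlDataset:
    """Appends one JSON object per line, so records survive restarts."""

    def __init__(self, path: Path) -> None:
        self._path = Path(path)
        self._path.parent.mkdir(parents=True, exist_ok=True)

    def push(self, item: dict[str, Any]) -> None:
        with self._path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

    def items(self) -> list[dict[str, Any]]:
        if not self._path.exists():
            return []
        with self._path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def save_titles(backend: DatasetBackend, pages: dict[str, str]) -> int:
    """Store one record per page and return the backend's total record count."""
    for url, title in pages.items():
        backend.push({"url": url, "title": title})
    return len(backend.items())
```

Because `save_titles` depends only on the interface, the same crawler code can write to memory during tests and to disk in a deployment, which is the main point of a pluggable storage design.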