
Crawlee

An open-source Python library for building reliable crawlers and browser automation with async support, proxy rotation, and persistent storage.

Overview

Crawlee is an open-source library for production-grade web crawling and browser automation. It provides a unified interface for HTTP and headless-browser crawlers and supports concurrency, proxy rotation, retries, and persistent queues.

Key Features

  • Multiple crawler types: high-performance HTTP crawlers and Playwright-based browser crawlers.
  • Async-first with type hints for improved developer experience and IDE support.
  • Built-in retries and proxy/session management to reduce blocking, plus request routing for dispatching pages to handlers.
  • Persistent storage options for datasets and key-value stores.
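
The retry and proxy-rotation idea from the list above can be sketched with the standard library alone. This is not Crawlee's API: `fetch_with_retries`, `flaky_fetch`, the `Fetch` type alias, and the proxy URLs are all hypothetical names made up for illustration, and the fetch function is a stand-in that makes no network calls.

```python
import asyncio
import itertools
from typing import Awaitable, Callable, List, Optional

# Hypothetical signature: an async callable taking (url, proxy) and returning a body.
Fetch = Callable[[str, str], Awaitable[str]]


async def fetch_with_retries(
    url: str,
    proxies: List[str],
    fetch: Fetch,
    max_retries: int = 3,
    base_delay: float = 0.0,
) -> str:
    """Attempt `url` through successive proxies until one attempt succeeds."""
    rotation = itertools.cycle(proxies)  # round-robin over the proxy list
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        proxy = next(rotation)
        try:
            return await fetch(url, proxy)
        except ConnectionError as exc:
            last_error = exc
            # Exponential backoff; a base_delay of 0 retries immediately.
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"{url} failed after {max_retries + 1} attempts") from last_error


async def flaky_fetch(url: str, proxy: str) -> str:
    """Stand-in fetcher (assumption): the first proxy is always 'blocked'."""
    if proxy == "http://proxy-a.example:8000":
        raise ConnectionError("blocked")
    return f"{url} via {proxy}"


if __name__ == "__main__":
    demo_proxies = ["http://proxy-a.example:8000", "http://proxy-b.example:8000"]
    print(asyncio.run(fetch_with_retries("https://example.com", demo_proxies, flaky_fetch)))
```

A library such as Crawlee bundles this kind of logic with session handling, so the sketch only shows why rotating the proxy on each retry helps a blocked request get through.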

Use Cases

  • Large-scale web scraping for training data, RAG pipelines, or analytics.
  • JavaScript-heavy pages and user-interaction simulation (PlaywrightCrawler).
  • Long-running crawlers on the Apify platform or in self-hosted environments.

Technical Details

  • Python implementation that integrates with Playwright, BeautifulSoup, and modern async libraries.
  • CLI templates and quickstart tools to bootstrap crawler projects.
  • Extensible storage backends and robust error handling for production deployments.
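
One way to picture an extensible storage backend is a small interface with interchangeable implementations. The sketch below is a generic illustration under that assumption, not Crawlee's storage API: `DatasetBackend`, `MemoryDataset`, `JsonLinesDataset`, and `save_results` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable, List, Protocol


class DatasetBackend(Protocol):
    """Hypothetical interface that any storage backend must satisfy."""

    def push(self, item: Dict[str, Any]) -> None: ...
    def items(self) -> List[Dict[str, Any]]: ...


class MemoryDataset:
    """Keeps items in a list; contents are lost when the process exits."""

    def __init__(self) -> None:
        self._items: List[Dict[str, Any]] = []

    def push(self, item: Dict[str, Any]) -> None:
        self._items.append(dict(item))

    def items(self) -> List[Dict[str, Any]]:
        return list(self._items)


class JsonLinesDataset:
    """Appends one JSON object per line to a file, so data survives restarts."""

    def __init__(self, path: Path) -> None:
        self._path = path

    def push(self, item: Dict[str, Any]) -> None:
        with self._path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(item) + "\n")

    def items(self) -> List[Dict[str, Any]]:
        if not self._path.exists():
            return []
        with self._path.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]


def save_results(backend: DatasetBackend, results: Iterable[Dict[str, Any]]) -> int:
    """Store every result and return the total number of items in the backend."""
    for result in results:
        backend.push(result)
    return len(backend.items())
```

Because `save_results` depends only on the interface, crawler code written against it would not need to change when the in-memory backend is swapped for the file-backed one.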

Resource Info

  • Author: Apify
  • Added Date: 2025-10-01
  • Tags: Dev Tools, OSS