Crawlee

A comprehensive web scraping and browser automation library for Python, designed to help developers build reliable scrapers that appear human-like and bypass modern bot protections. Crawlee covers the end-to-end workflow: crawling the web for links, scraping data, and persistently storing it in machine-readable formats.

Package Information

  • Package Name: crawlee
  • Language: Python
  • Installation: pip install 'crawlee[all]' (full features) or pip install crawlee (core only)
  • Python Version: ≥3.9

Core Imports

import crawlee
from crawlee import Request, service_locator

Common patterns for crawlers:

from crawlee.crawlers import (
    BasicCrawler, HttpCrawler,
    BeautifulSoupCrawler, ParselCrawler, PlaywrightCrawler,
    AdaptivePlaywrightCrawler
)

For specific functionality:

from crawlee.storages import Dataset, KeyValueStore, RequestQueue
from crawlee.sessions import Session, SessionPool
from crawlee.http_clients import HttpxHttpClient, CurlImpersonateHttpClient
from crawlee import ConcurrencySettings, HttpHeaders, EnqueueStrategy

Basic Usage

import asyncio
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push data to storage
        await context.push_data(data)

        # Enqueue all links found on the page
        await context.enqueue_links()

    # Run the crawler
    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

Architecture

Crawlee follows a modular architecture with clear separation of concerns:

  • Crawlers: Main orchestrators that manage crawling workflows and provide specialized parsing capabilities
  • Storage: Persistent data management with support for datasets, key-value stores, and request queues
  • HTTP Clients: Pluggable HTTP implementations with support for different libraries and browser impersonation
  • Sessions: Session management with cookie persistence and rotation
  • Request Management: Advanced request queuing, deduplication, and lifecycle management
  • Browser Automation: Optional Playwright integration for JavaScript-heavy sites
  • Fingerprinting: Browser fingerprint generation for enhanced anti-detection capabilities

This design enables Crawlee to handle everything from simple HTTP scraping to complex browser automation while maintaining human-like behavior patterns.
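
The practical payoff of this modularity is that components can be swapped without touching handler code. A minimal sketch, using constructor options documented in the sections below:

import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.http_clients import HttpxHttpClient

async def main() -> None:
    # Swap the HTTP client or concurrency profile here; the handler
    # below stays unchanged.
    crawler = HttpCrawler(
        http_client=HttpxHttpClient(),
        concurrency_settings=ConcurrencySettings(max_concurrency=10),
    )

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Fetched {context.request.url}')

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())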

Core API

Core Types and Request Handling

Essential types and request-management functionality that form the foundation of all crawling operations.

class Request:
    @classmethod
    def from_url(cls, url: str, **options) -> Request: ...

class ConcurrencySettings:
    def __init__(
        self,
        min_concurrency: int = 1,
        max_concurrency: int = 200,
        max_tasks_per_minute: float = float('inf'),
        desired_concurrency: int | None = None
    ): ...

class HttpHeaders(Mapping[str, str]):
    def __init__(self, headers: dict[str, str] | None = None): ...

service_locator: ServiceLocator
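
A short sketch of these types in use; label and user_data are standard Request.from_url options:

from crawlee import ConcurrencySettings, HttpHeaders, Request

# A request carrying a routing label and arbitrary user data.
request = Request.from_url(
    'https://example.com/product/1',
    label='PRODUCT',
    user_data={'category': 'demo'},
)

# Cap parallelism at 5 and overall throughput at 60 requests per minute.
settings = ConcurrencySettings(max_concurrency=5, max_tasks_per_minute=60)

# Immutable header mapping.
headers = HttpHeaders({'Accept-Language': 'en-US'})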

See docs/core-types.md.

Capabilities

Web Crawlers

Specialized crawler implementations for different scraping needs, from HTTP-only to full browser automation with intelligent adaptation between modes.

class BasicCrawler:
    def __init__(self, **options): ...
    async def run(self, requests: Sequence[str | Request] | None = None) -> FinalStatistics: ...

class BeautifulSoupCrawler(AbstractHttpCrawler):
    def __init__(self, **options): ...

class PlaywrightCrawler:
    def __init__(self, **options): ...

class AdaptivePlaywrightCrawler:
    def __init__(self, **options): ...
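
The adaptive crawler decides per page whether plain HTTP parsing suffices or a real browser is needed. A minimal sketch, assuming the with_beautifulsoup_static_parser factory:

import asyncio

from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
)

async def main() -> None:
    # Static HTTP parsing via BeautifulSoup, with automatic fallback
    # to Playwright when a page turns out to need JavaScript.
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=5,
    )

    @crawler.router.default_handler
    async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())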

See docs/crawlers.md.

Data Storage

Persistent storage solutions for structured data, key-value pairs, and request queue management with built-in export capabilities.

class Dataset:
    async def push_data(self, data: dict | list[dict]): ...
    async def export_to(self, key: str, content_type: str = 'json', **options): ...

class KeyValueStore:
    async def set_value(self, key: str, value: Any): ...
    async def get_value(self, key: str, default_value: Any = None) -> Any: ...

class RequestQueue:
    async def add_request(self, request: str | Request): ...
    async def fetch_next_request(self) -> Request | None: ...
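
Storages are opened asynchronously and shared by name. A minimal sketch of direct access outside a crawler:

import asyncio

from crawlee.storages import Dataset, KeyValueStore, RequestQueue

async def main() -> None:
    # Named storages are created on first open and reused afterwards.
    dataset = await Dataset.open(name='products')
    await dataset.push_data({'url': 'https://example.com', 'title': 'Example'})

    kvs = await KeyValueStore.open()
    await kvs.set_value('crawl-config', {'max_depth': 2})
    config = await kvs.get_value('crawl-config')

    queue = await RequestQueue.open()
    await queue.add_request('https://example.com')

if __name__ == '__main__':
    asyncio.run(main())

Inside a request handler, context.push_data writes to the default dataset without opening it explicitly.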

See docs/storage.md.

HTTP Clients

Pluggable HTTP client implementations supporting different libraries and browser impersonation for enhanced anti-detection capabilities.

class HttpxHttpClient(HttpClient):
    def __init__(self, **options): ...

class CurlImpersonateHttpClient(HttpClient):
    def __init__(self, **options): ...

class HttpResponse:
    status_code: int
    headers: HttpHeaders
    text: str
    content: bytes
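
A sketch of plugging an impersonating client into a crawler; the impersonate value is passed through to the underlying curl-impersonate bindings, and the exact browser token here is illustrative (requires pip install 'crawlee[curl-impersonate]'):

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.http_clients import CurlImpersonateHttpClient

async def main() -> None:
    crawler = HttpCrawler(
        # Mimic the TLS and header fingerprint of a real Chrome build.
        http_client=CurlImpersonateHttpClient(impersonate='chrome124'),
    )

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(
            f'{context.request.url} -> {context.http_response.status_code}'
        )

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())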

See docs/http-clients.md.

Session Management

Session and cookie management with rotation capabilities for maintaining state across requests and avoiding detection.

class Session:
    def __init__(self, **options): ...

class SessionPool:
    def __init__(self, max_pool_size: int = 1000): ...
    async def get_session(self) -> Session: ...

class SessionCookies:
    def add_cookie(self, cookie: CookieParam): ...
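
A sketch of session rotation inside a crawler; use_session_pool and session_pool are standard crawler options:

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.sessions import SessionPool

async def main() -> None:
    crawler = HttpCrawler(
        use_session_pool=True,
        session_pool=SessionPool(max_pool_size=50),
    )

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        # Each request is bound to a session that carries its cookies;
        # blocked sessions can be retired to force rotation.
        if context.session is not None:
            context.log.info(f'Using session {context.session.id}')

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())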

See docs/sessions.md.

Browser Automation

Optional Playwright integration for full browser automation with support for JavaScript-heavy sites and complex user interactions.

class BrowserPool:
    def __init__(self, **options): ...

class PlaywrightBrowserController:
    def __init__(self, **options): ...
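
In a PlaywrightCrawler handler, context.page is an ordinary Playwright page, so any Playwright interaction works. A minimal sketch:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(headless=True, browser_type='chromium')

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        # context.page is a regular Playwright page object.
        title = await context.page.title()
        await context.push_data({'url': context.request.url, 'title': title})

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())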

See docs/browser-automation.md.

Fingerprinting and Anti-Detection

Browser fingerprint generation and header randomization for enhanced stealth capabilities and bot protection bypass.

class FingerprintGenerator:
    def generate_fingerprint(self) -> dict: ...

class HeaderGenerator:
    def get_headers(self, **options) -> HttpHeaders: ...

class DefaultFingerprintGenerator(FingerprintGenerator):
    def __init__(self, **options): ...
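
A sketch wiring a fingerprint generator into a browser crawler; the option names follow the crawlee.fingerprint_suite module, and the specific values are illustrative:

from crawlee.crawlers import PlaywrightCrawler
from crawlee.fingerprint_suite import (
    DefaultFingerprintGenerator,
    HeaderGeneratorOptions,
    ScreenOptions,
)

# Generate consistent Chromium-like fingerprints for every browser page.
fingerprint_generator = DefaultFingerprintGenerator(
    header_options=HeaderGeneratorOptions(browsers=['chromium']),
    screen_options=ScreenOptions(min_width=400),
)

crawler = PlaywrightCrawler(fingerprint_generator=fingerprint_generator)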

See docs/fingerprinting.md.

Configuration and Routing

Global configuration management and request routing systems for fine-tuned control over crawling behavior.

class Configuration:
    def __init__(self, **settings): ...

class Router:
    def default_handler(self, handler): ...
    def handler(self, label: str): ...

class ProxyConfiguration:
    def __init__(self, proxy_urls: list[str]): ...
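
Labeled routing sends different page types to different handlers, and a proxy configuration can be attached at construction time. A sketch (the proxy URL is a placeholder):

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

async def main() -> None:
    crawler = BeautifulSoupCrawler(
        proxy_configuration=ProxyConfiguration(
            proxy_urls=['http://proxy.example.com:8000'],
        ),
    )

    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Send category pages to the dedicated handler below.
        await context.enqueue_links(selector='a.category', label='CATEGORY')

    @crawler.router.handler('CATEGORY')
    async def category_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.push_data({'category_url': context.request.url})

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())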

See docs/configuration.md.

Statistics and Monitoring

Performance monitoring and statistics collection for tracking crawling progress and system resource usage.

class Statistics:
    def __init__(self): ...
    def get_state(self) -> StatisticsState: ...

class FinalStatistics:
    requests_finished: int
    requests_failed: int
    retry_histogram: list[int]
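
run() resolves to a FinalStatistics once the crawl ends, so the counters above can be inspected directly:

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

async def main() -> None:
    crawler = HttpCrawler(max_requests_per_crawl=20)

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})

    stats = await crawler.run(['https://example.com'])
    print(f'finished={stats.requests_finished} failed={stats.requests_failed}')

if __name__ == '__main__':
    asyncio.run(main())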

See docs/statistics.md.

Error Handling

Comprehensive exception hierarchy for handling various crawling scenarios and failure modes.

class HttpStatusCodeError(Exception): ...
class ProxyError(Exception): ...
class SessionError(Exception): ...
class RequestHandlerError(Exception): ...
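
These exceptions are importable from crawlee.errors. A common complement to catching them is a failed-request handler, invoked once a request has exhausted its retries; a minimal sketch:

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

async def main() -> None:
    crawler = HttpCrawler(max_request_retries=2)

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    @crawler.failed_request_handler
    async def failed(context: HttpCrawlingContext, error: Exception) -> None:
        # Runs only after all retries for this request have failed.
        context.log.error(f'{context.request.url} failed: {error!r}')

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())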

See docs/error-handling.md.

Request Management

Advanced request lifecycle management with support for static lists, dynamic queues, and tandem operations.

class RequestList:
    def __init__(self, requests: list[str | Request]): ...

class RequestManager:
    def __init__(self, **options): ...

class RequestManagerTandem:
    def __init__(self, request_loader: RequestLoader, request_manager: RequestManager): ...
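
A sketch of the tandem pattern, assuming the to_tandem() convenience helper on request loaders: a static RequestList feeds start URLs while the default queue absorbs newly enqueued links:

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.request_loaders import RequestList

async def main() -> None:
    request_list = RequestList(
        requests=['https://example.com', 'https://example.org'],
    )
    # Combine the read-only list with the default request queue.
    tandem = await request_list.to_tandem()

    crawler = HttpCrawler(request_manager=tandem)

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        await context.enqueue_links()

    await crawler.run()

if __name__ == '__main__':
    asyncio.run(main())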

See docs/request-management.md.

Events System

Event-driven architecture for hooking into crawler lifecycle events and implementing custom behaviors.

class EventManager:
    def emit(self, event: Event, data: EventData): ...
    def on(self, event: Event, listener: EventListener): ...

class LocalEventManager(EventManager): ...
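
Listeners can be attached through the event manager held by the service locator; a sketch, assuming Event is importable from crawlee.events:

from crawlee import service_locator
from crawlee.events import Event

event_manager = service_locator.get_event_manager()

# Invoked periodically when components are asked to persist their state.
async def on_persist_state(event_data) -> None:
    print(f'Persist state requested: {event_data}')

event_manager.on(event=Event.PERSIST_STATE, listener=on_persist_state)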

See docs/events.md.

CLI Tools

Command-line interface for project scaffolding and development workflow automation.

# Command line usage:
# crawlee create my-project
# crawlee --version

See docs/cli-tools.md.
