Crawlee

A comprehensive web scraping and browser automation library for Python, designed to help developers build reliable scrapers that appear human-like and bypass modern bot protections. Crawlee covers the end-to-end workflow: crawling the web for links, scraping data, and persistently storing it in machine-readable formats.

Package Information

  • Package Name: crawlee
  • Language: Python
  • Installation: pip install 'crawlee[all]' (full features) or pip install crawlee (core only)
  • Python Version: ≥3.9

Core Imports

import crawlee
from crawlee import Request, service_locator

Common patterns for crawlers:

from crawlee.crawlers import (
    BasicCrawler, HttpCrawler,
    BeautifulSoupCrawler, ParselCrawler, PlaywrightCrawler,
    AdaptivePlaywrightCrawler
)

For specific functionality:

from crawlee.storages import Dataset, KeyValueStore, RequestQueue
from crawlee.sessions import Session, SessionPool
from crawlee.http_clients import HttpxHttpClient, CurlImpersonateHttpClient
from crawlee import ConcurrencySettings, HttpHeaders, EnqueueStrategy

Basic Usage

import asyncio
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push data to storage
        await context.push_data(data)

        # Enqueue all links found on the page
        await context.enqueue_links()

    # Run the crawler
    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

Architecture

Crawlee follows a modular architecture with clear separation of concerns:

  • Crawlers: Main orchestrators that manage crawling workflows and provide specialized parsing capabilities
  • Storage: Persistent data management with support for datasets, key-value stores, and request queues
  • HTTP Clients: Pluggable HTTP implementations with support for different libraries and browser impersonation
  • Sessions: Session management with cookie persistence and rotation
  • Request Management: Advanced request queuing, deduplication, and lifecycle management
  • Browser Automation: Optional Playwright integration for JavaScript-heavy sites
  • Fingerprinting: Browser fingerprint generation for enhanced anti-detection capabilities

This design enables Crawlee to handle everything from simple HTTP scraping to complex browser automation while maintaining human-like behavior patterns.
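
The practical payoff of this modularity is that components can be swapped without touching handler code. A minimal sketch, using constructor options documented in the sections below:

import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.http_clients import HttpxHttpClient

async def main() -> None:
    # Swap the HTTP client or concurrency profile here; the handler
    # below stays unchanged.
    crawler = HttpCrawler(
        http_client=HttpxHttpClient(),
        concurrency_settings=ConcurrencySettings(max_concurrency=10),
    )

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Fetched {context.request.url}')

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())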

Core API

Core Types and Request Handling

Essential types and request-management functionality that form the foundation of all crawling operations.

class Request:
    @classmethod
    def from_url(cls, url: str, **options) -> Request: ...

class ConcurrencySettings:
    def __init__(
        self,
        min_concurrency: int = 1,
        max_concurrency: int = 200,
        max_tasks_per_minute: float = float('inf'),
        desired_concurrency: int | None = None
    ): ...

class HttpHeaders(Mapping[str, str]):
    def __init__(self, headers: dict[str, str] | None = None): ...

service_locator: ServiceLocator
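
A short sketch of these types in use; label and user_data are standard Request.from_url options:

from crawlee import ConcurrencySettings, HttpHeaders, Request

# A request carrying a routing label and arbitrary user data.
request = Request.from_url(
    'https://example.com/product/1',
    label='PRODUCT',
    user_data={'category': 'demo'},
)

# Cap parallelism at 5 and overall throughput at 60 requests per minute.
settings = ConcurrencySettings(max_concurrency=5, max_tasks_per_minute=60)

# Immutable header mapping.
headers = HttpHeaders({'Accept-Language': 'en-US'})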

See docs/core-types.md.

Capabilities

Web Crawlers

Specialized crawler implementations for different scraping needs, from HTTP-only to full browser automation with intelligent adaptation between modes.

class BasicCrawler:
    def __init__(self, **options): ...
    async def run(self, requests: Sequence[str | Request] | None = None) -> FinalStatistics: ...

class BeautifulSoupCrawler(AbstractHttpCrawler):
    def __init__(self, **options): ...

class PlaywrightCrawler:
    def __init__(self, **options): ...

class AdaptivePlaywrightCrawler:
    def __init__(self, **options): ...
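
The adaptive crawler decides per page whether plain HTTP parsing suffices or a real browser is needed. A minimal sketch, assuming the with_beautifulsoup_static_parser factory:

import asyncio

from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
)

async def main() -> None:
    # Static HTTP parsing via BeautifulSoup, with automatic fallback
    # to Playwright when a page turns out to need JavaScript.
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=5,
    )

    @crawler.router.default_handler
    async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())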

See docs/crawlers.md.

Data Storage

Persistent storage solutions for structured data, key-value pairs, and request queue management with built-in export capabilities.

class Dataset:
    async def push_data(self, data: dict | list[dict]): ...
    async def export_to(self, key: str, content_type: str = 'json', **options): ...

class KeyValueStore:
    async def set_value(self, key: str, value: Any): ...
    async def get_value(self, key: str, default_value: Any = None) -> Any: ...

class RequestQueue:
    async def add_request(self, request: str | Request): ...
    async def fetch_next_request(self) -> Request | None: ...
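
Storages are opened asynchronously and shared by name. A minimal sketch of direct access outside a crawler:

import asyncio

from crawlee.storages import Dataset, KeyValueStore, RequestQueue

async def main() -> None:
    # Named storages are created on first open and reused afterwards.
    dataset = await Dataset.open(name='products')
    await dataset.push_data({'url': 'https://example.com', 'title': 'Example'})

    kvs = await KeyValueStore.open()
    await kvs.set_value('crawl-config', {'max_depth': 2})
    config = await kvs.get_value('crawl-config')

    queue = await RequestQueue.open()
    await queue.add_request('https://example.com')

if __name__ == '__main__':
    asyncio.run(main())

Inside a request handler, context.push_data writes to the default dataset without opening it explicitly.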

See docs/storage.md.

HTTP Clients

Pluggable HTTP client implementations supporting different libraries and browser impersonation for enhanced anti-detection capabilities.

class HttpxHttpClient(HttpClient):
    def __init__(self, **options): ...

class CurlImpersonateHttpClient(HttpClient):
    def __init__(self, **options): ...

class HttpResponse:
    status_code: int
    headers: HttpHeaders
    text: str
    content: bytes
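
A sketch of plugging an impersonating client into a crawler; the impersonate value is passed through to the underlying curl-impersonate bindings, and the exact browser token here is illustrative (requires pip install 'crawlee[curl-impersonate]'):

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.http_clients import CurlImpersonateHttpClient

async def main() -> None:
    crawler = HttpCrawler(
        # Mimic the TLS and header fingerprint of a real Chrome build.
        http_client=CurlImpersonateHttpClient(impersonate='chrome124'),
    )

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(
            f'{context.request.url} -> {context.http_response.status_code}'
        )

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())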

See docs/http-clients.md.

Session Management

Session and cookie management with rotation capabilities for maintaining state across requests and avoiding detection.

class Session:
    def __init__(self, **options): ...

class SessionPool:
    def __init__(self, max_pool_size: int = 1000): ...
    async def get_session(self) -> Session: ...

class SessionCookies:
    def add_cookie(self, cookie: CookieParam): ...
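
A sketch of session rotation inside a crawler; use_session_pool and session_pool are standard crawler options:

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.sessions import SessionPool

async def main() -> None:
    crawler = HttpCrawler(
        use_session_pool=True,
        session_pool=SessionPool(max_pool_size=50),
    )

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        # Each request is bound to a session that carries its cookies;
        # blocked sessions can be retired to force rotation.
        if context.session is not None:
            context.log.info(f'Using session {context.session.id}')

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())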

See docs/sessions.md.

Browser Automation

Optional Playwright integration for full browser automation with support for JavaScript-heavy sites and complex user interactions.

class BrowserPool:
    def __init__(self, **options): ...

class PlaywrightBrowserController:
    def __init__(self, **options): ...
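
In a PlaywrightCrawler handler, context.page is an ordinary Playwright page, so any Playwright interaction works. A minimal sketch:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(headless=True, browser_type='chromium')

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        # context.page is a regular Playwright page object.
        title = await context.page.title()
        await context.push_data({'url': context.request.url, 'title': title})

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())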

See docs/browser-automation.md.

Fingerprinting and Anti-Detection

Browser fingerprint generation and header randomization for enhanced stealth capabilities and bot protection bypass.

class FingerprintGenerator:
    def generate_fingerprint(self) -> dict: ...

class HeaderGenerator:
    def get_headers(self, **options) -> HttpHeaders: ...

class DefaultFingerprintGenerator(FingerprintGenerator):
    def __init__(self, **options): ...
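
A sketch wiring a fingerprint generator into a browser crawler; the option names follow the crawlee.fingerprint_suite module, and the specific values are illustrative:

from crawlee.crawlers import PlaywrightCrawler
from crawlee.fingerprint_suite import (
    DefaultFingerprintGenerator,
    HeaderGeneratorOptions,
    ScreenOptions,
)

# Generate consistent Chromium-like fingerprints for every browser page.
fingerprint_generator = DefaultFingerprintGenerator(
    header_options=HeaderGeneratorOptions(browsers=['chromium']),
    screen_options=ScreenOptions(min_width=400),
)

crawler = PlaywrightCrawler(fingerprint_generator=fingerprint_generator)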

See docs/fingerprinting.md.

Configuration and Routing

Global configuration management and request routing systems for fine-tuned control over crawling behavior.

class Configuration:
    def __init__(self, **settings): ...

class Router:
    def default_handler(self, handler): ...
    def handler(self, label: str): ...

class ProxyConfiguration:
    def __init__(self, proxy_urls: list[str]): ...
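
Labeled routing sends different page types to different handlers, and a proxy configuration can be attached at construction time. A sketch (the proxy URL is a placeholder):

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

async def main() -> None:
    crawler = BeautifulSoupCrawler(
        proxy_configuration=ProxyConfiguration(
            proxy_urls=['http://proxy.example.com:8000'],
        ),
    )

    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Send category pages to the dedicated handler below.
        await context.enqueue_links(selector='a.category', label='CATEGORY')

    @crawler.router.handler('CATEGORY')
    async def category_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.push_data({'category_url': context.request.url})

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())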

See docs/configuration.md.

Statistics and Monitoring

Performance monitoring and statistics collection for tracking crawling progress and system resource usage.

class Statistics:
    def __init__(self): ...
    def get_state(self) -> StatisticsState: ...

class FinalStatistics:
    requests_finished: int
    requests_failed: int
    retry_histogram: list[int]
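
run() resolves to a FinalStatistics once the crawl ends, so the counters above can be inspected directly:

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

async def main() -> None:
    crawler = HttpCrawler(max_requests_per_crawl=20)

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})

    stats = await crawler.run(['https://example.com'])
    print(f'finished={stats.requests_finished} failed={stats.requests_failed}')

if __name__ == '__main__':
    asyncio.run(main())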

See docs/statistics.md.

Error Handling

Comprehensive exception hierarchy for handling various crawling scenarios and failure modes.

class HttpStatusCodeError(Exception): ...
class ProxyError(Exception): ...
class SessionError(Exception): ...
class RequestHandlerError(Exception): ...
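
These exceptions are importable from crawlee.errors. A common complement to catching them is a failed-request handler, invoked once a request has exhausted its retries; a minimal sketch:

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

async def main() -> None:
    crawler = HttpCrawler(max_request_retries=2)

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    @crawler.failed_request_handler
    async def failed(context: HttpCrawlingContext, error: Exception) -> None:
        # Runs only after all retries for this request have failed.
        context.log.error(f'{context.request.url} failed: {error!r}')

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())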

See docs/error-handling.md.

Request Management

Advanced request lifecycle management with support for static lists, dynamic queues, and tandem operations.

class RequestList:
    def __init__(self, requests: list[str | Request]): ...

class RequestManager:
    def __init__(self, **options): ...

class RequestManagerTandem:
    def __init__(self, request_loader: RequestLoader, request_manager: RequestManager): ...
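
A sketch of the tandem pattern, assuming the to_tandem() convenience helper on request loaders: a static RequestList feeds start URLs while the default queue absorbs newly enqueued links:

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.request_loaders import RequestList

async def main() -> None:
    request_list = RequestList(
        requests=['https://example.com', 'https://example.org'],
    )
    # Combine the read-only list with the default request queue.
    tandem = await request_list.to_tandem()

    crawler = HttpCrawler(request_manager=tandem)

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        await context.enqueue_links()

    await crawler.run()

if __name__ == '__main__':
    asyncio.run(main())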

See docs/request-management.md.

Events System

Event-driven architecture for hooking into crawler lifecycle events and implementing custom behaviors.

class EventManager:
    def emit(self, event: Event, data: EventData): ...
    def on(self, event: Event, listener: EventListener): ...

class LocalEventManager(EventManager): ...
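
Listeners can be attached through the event manager held by the service locator; a sketch, assuming Event is importable from crawlee.events:

from crawlee import service_locator
from crawlee.events import Event

event_manager = service_locator.get_event_manager()

# Invoked periodically when components are asked to persist their state.
async def on_persist_state(event_data) -> None:
    print(f'Persist state requested: {event_data}')

event_manager.on(event=Event.PERSIST_STATE, listener=on_persist_state)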

See docs/events.md.

CLI Tools

Command-line interface for project scaffolding and development workflow automation.

# Command line usage:
# crawlee create my-project
# crawlee --version

See docs/cli-tools.md.
