
# Crawlee

A comprehensive web scraping and browser automation library for Python designed to help developers build reliable scrapers that appear human-like and bypass modern bot protections. Crawlee provides end-to-end crawling and scraping capabilities with tools to crawl the web for links, scrape data, and persistently store it in machine-readable formats.

## Package Information

- **Package Name**: crawlee
- **Language**: Python
- **Installation**: `pip install 'crawlee[all]'` (full features) or `pip install crawlee` (core only)
- **Python Version**: ≥3.9

## Core Imports

```python
import crawlee
from crawlee import Request, service_locator
```

Common patterns for crawlers:

```python
from crawlee.crawlers import (
    BasicCrawler, HttpCrawler,
    BeautifulSoupCrawler, ParselCrawler, PlaywrightCrawler,
    AdaptivePlaywrightCrawler,
)
```

For specific functionality:

```python
from crawlee.storages import Dataset, KeyValueStore, RequestQueue
from crawlee.sessions import Session, SessionPool
from crawlee.http_clients import HttpxHttpClient, CurlImpersonateHttpClient
from crawlee import ConcurrencySettings, HttpHeaders, EnqueueStrategy
```

## Basic Usage

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push data to storage
        await context.push_data(data)

        # Enqueue all links found on the page
        await context.enqueue_links()

    # Run the crawler
    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```

## Architecture

Crawlee follows a modular architecture with clear separation of concerns:

- **Crawlers**: Main orchestrators that manage crawling workflows and provide specialized parsing capabilities
- **Storage**: Persistent data management with support for datasets, key-value stores, and request queues
- **HTTP Clients**: Pluggable HTTP implementations with support for different libraries and browser impersonation
- **Sessions**: Session management with cookie persistence and rotation
- **Request Management**: Advanced request queuing, deduplication, and lifecycle management
- **Browser Automation**: Optional Playwright integration for JavaScript-heavy sites
- **Fingerprinting**: Browser fingerprint generation for enhanced anti-detection capabilities

This design enables Crawlee to handle everything from simple HTTP scraping to complex browser automation while maintaining human-like behavior patterns.

## Core API

### Core Types and Request Handling

Essential types and request management functionality that forms the foundation of all crawling operations.

```python { .api }
class Request:
    @classmethod
    def from_url(cls, url: str, **options) -> Request: ...

class ConcurrencySettings:
    def __init__(
        self,
        min_concurrency: int = 1,
        max_concurrency: int = 200,
        max_tasks_per_minute: float = float('inf'),
        desired_concurrency: int | None = None
    ): ...

class HttpHeaders(Mapping[str, str]):
    def __init__(self, headers: dict[str, str] | None = None): ...

service_locator: ServiceLocator
```

[Core Types](./core-types.md)

## Capabilities

### Web Crawlers

Specialized crawler implementations for different scraping needs, from HTTP-only to full browser automation with intelligent adaptation between modes.

```python { .api }
class BasicCrawler:
    def __init__(self, **options): ...
    async def run(self, requests: list[str | Request]): ...

class BeautifulSoupCrawler(AbstractHttpCrawler):
    def __init__(self, **options): ...

class PlaywrightCrawler:
    def __init__(self, **options): ...

class AdaptivePlaywrightCrawler:
    def __init__(self, **options): ...
```

[Crawlers](./crawlers.md)

### Data Storage

Persistent storage solutions for structured data, key-value pairs, and request queue management with built-in export capabilities.

```python { .api }
class Dataset:
    def push_data(self, data: dict | list[dict]): ...
    def export_to(self, format: str, path: str): ...

class KeyValueStore:
    def set_value(self, key: str, value: Any): ...
    def get_value(self, key: str): ...

class RequestQueue:
    def add_request(self, request: Request): ...
    def fetch_next_request(self) -> Request | None: ...
```

[Storage](./storage.md)

### HTTP Clients

Pluggable HTTP client implementations supporting different libraries and browser impersonation for enhanced anti-detection capabilities.

```python { .api }
class HttpxHttpClient(HttpClient):
    def __init__(self, **options): ...

class CurlImpersonateHttpClient(HttpClient):
    def __init__(self, **options): ...

class HttpResponse:
    status_code: int
    headers: HttpHeaders
    text: str
    content: bytes
```

[HTTP Clients](./http-clients.md)

### Session Management

Session and cookie management with rotation capabilities for maintaining state across requests and avoiding detection.

```python { .api }
class Session:
    def __init__(self, session_pool: SessionPool): ...

class SessionPool:
    def __init__(self, max_pool_size: int = 1000): ...
    def get_session(self) -> Session: ...

class SessionCookies:
    def add_cookie(self, cookie: CookieParam): ...
```

[Sessions](./sessions.md)

### Browser Automation

Optional Playwright integration for full browser automation with support for JavaScript-heavy sites and complex user interactions.

```python { .api }
class BrowserPool:
    def __init__(self, **options): ...

class PlaywrightBrowserController:
    def __init__(self, **options): ...
```

[Browser Automation](./browser-automation.md)

### Fingerprinting and Anti-Detection

Browser fingerprint generation and header randomization for enhanced stealth capabilities and bot protection bypass.

```python { .api }
class FingerprintGenerator:
    def generate_fingerprint(self) -> dict: ...

class HeaderGenerator:
    def get_headers(self, **options: HeaderGeneratorOptions) -> HttpHeaders: ...

class DefaultFingerprintGenerator(FingerprintGenerator):
    def __init__(self, **options): ...
```

[Fingerprinting](./fingerprinting.md)

224

225

### Configuration and Routing

226

227

Global configuration management and request routing systems for fine-tuned control over crawling behavior.

228

229

```python { .api }

230

class Configuration:

231

def __init__(self, **settings): ...

232

233

class Router:

234

def default_handler(self, handler): ...

235

def route(self, label: str, handler): ...

236

237

class ProxyConfiguration:

238

def __init__(self, proxy_urls: list[str]): ...

239

```

240

241

[Configuration](./configuration.md)

### Statistics and Monitoring

Performance monitoring and statistics collection for tracking crawling progress and system resource usage.

```python { .api }
class Statistics:
    def __init__(self): ...
    def get_state(self) -> StatisticsState: ...

class FinalStatistics:
    requests_finished: int
    requests_failed: int
    retry_histogram: list[int]
```

[Statistics](./statistics.md)

259

260

### Error Handling

261

262

Comprehensive exception hierarchy for handling various crawling scenarios and failure modes.

263

264

```python { .api }

265

class HttpStatusCodeError(Exception): ...

266

class ProxyError(Exception): ...

267

class SessionError(Exception): ...

268

class RequestHandlerError(Exception): ...

269

```

270

271

[Error Handling](./error-handling.md)

### Request Management

Advanced request lifecycle management with support for static lists, dynamic queues, and tandem operations.

```python { .api }
class RequestList:
    def __init__(self, requests: list[str | Request]): ...

class RequestManager:
    def __init__(self, **options): ...

class RequestManagerTandem:
    def __init__(self, request_list: RequestList, request_queue: RequestQueue): ...
```

[Request Management](./request-management.md)

289

290

### Events System

291

292

Event-driven architecture for hooking into crawler lifecycle events and implementing custom behaviors.

293

294

```python { .api }

295

class EventManager:

296

def emit(self, event: Event, data: EventData): ...

297

def on(self, event: Event, listener: EventListener): ...

298

299

class LocalEventManager(EventManager): ...

300

```

301

302

[Events](./events.md)

### CLI Tools

Command-line interface for project scaffolding and development workflow automation.

```python { .api }
# Command line usage:
# crawlee create my-project
# crawlee --version
```

[CLI Tools](./cli-tools.md)