A comprehensive web scraping and browser automation library for Python with human-like behavior and bot protection bypass
npx @tessl/cli install tessl/pypi-crawlee@0.6.0
# Crawlee

A comprehensive web scraping and browser automation library for Python, designed to help developers build reliable scrapers that appear human-like and bypass modern bot protections. Crawlee provides end-to-end crawling and scraping capabilities, with tools to crawl the web for links, scrape data, and persistently store it in machine-readable formats.

## Package Information

- **Package Name**: crawlee
- **Language**: Python
- **Installation**: `pip install 'crawlee[all]'` (full features) or `pip install crawlee` (core only)
- **Python Version**: ≥3.9

## Core Imports

```python
import crawlee
from crawlee import Request, service_locator
```

Common patterns for crawlers:

```python
from crawlee.crawlers import (
    BasicCrawler, HttpCrawler,
    BeautifulSoupCrawler, ParselCrawler, PlaywrightCrawler,
    AdaptivePlaywrightCrawler,
)
```

For specific functionality:

```python
from crawlee.storages import Dataset, KeyValueStore, RequestQueue
from crawlee.sessions import Session, SessionPool
from crawlee.http_clients import HttpxHttpClient, CurlImpersonateHttpClient
from crawlee import ConcurrencySettings, HttpHeaders, EnqueueStrategy
```

## Basic Usage

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push data to storage
        await context.push_data(data)

        # Enqueue all links found on the page
        await context.enqueue_links()

    # Run the crawler
    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```

## Architecture

Crawlee follows a modular architecture with a clear separation of concerns:

- **Crawlers**: Main orchestrators that manage crawling workflows and provide specialized parsing capabilities
- **Storage**: Persistent data management with support for datasets, key-value stores, and request queues
- **HTTP Clients**: Pluggable HTTP implementations with support for different libraries and browser impersonation
- **Sessions**: Session management with cookie persistence and rotation
- **Request Management**: Advanced request queuing, deduplication, and lifecycle management
- **Browser Automation**: Optional Playwright integration for JavaScript-heavy sites
- **Fingerprinting**: Browser fingerprint generation for enhanced anti-detection capabilities

This design enables Crawlee to handle everything from simple HTTP scraping to complex browser automation while maintaining human-like behavior patterns.

## Core API

### Core Types and Request Handling

Essential types and request management functionality that form the foundation of all crawling operations.

```python { .api }
class Request:
    @classmethod
    def from_url(cls, url: str, **options) -> Request: ...

class ConcurrencySettings:
    def __init__(
        self,
        min_concurrency: int = 1,
        max_concurrency: int = 200,
        max_tasks_per_minute: float = float('inf'),
        desired_concurrency: int | None = None,
    ): ...

class HttpHeaders(Mapping[str, str]):
    def __init__(self, headers: dict[str, str] | None = None): ...

service_locator: ServiceLocator
```

[Core Types](./core-types.md)

## Capabilities

### Web Crawlers

Specialized crawler implementations for different scraping needs, from HTTP-only to full browser automation with intelligent adaptation between modes.

```python { .api }
class BasicCrawler:
    def __init__(self, **options): ...
    async def run(self, requests: list[str | Request]): ...

class BeautifulSoupCrawler(AbstractHttpCrawler):
    def __init__(self, **options): ...

class PlaywrightCrawler:
    def __init__(self, **options): ...

class AdaptivePlaywrightCrawler:
    def __init__(self, **options): ...
```

[Crawlers](./crawlers.md)

### Data Storage

Persistent storage solutions for structured data, key-value pairs, and request queue management with built-in export capabilities.

```python { .api }
class Dataset:
    async def push_data(self, data: dict | list[dict]): ...
    async def export_to(self, format: str, path: str): ...

class KeyValueStore:
    async def set_value(self, key: str, value: Any): ...
    async def get_value(self, key: str): ...

class RequestQueue:
    async def add_request(self, request: str | Request): ...
    async def fetch_next_request(self) -> Request | None: ...
```

[Storage](./storage.md)

### HTTP Clients

Pluggable HTTP client implementations supporting different libraries and browser impersonation for enhanced anti-detection capabilities.

```python { .api }
class HttpxHttpClient(HttpClient):
    def __init__(self, **options): ...

class CurlImpersonateHttpClient(HttpClient):
    def __init__(self, **options): ...

class HttpResponse:
    status_code: int
    headers: HttpHeaders
    text: str
    content: bytes
```

[HTTP Clients](./http-clients.md)

### Session Management

Session and cookie management with rotation capabilities for maintaining state across requests and avoiding detection.

```python { .api }
class Session:
    def __init__(self, **options): ...

class SessionPool:
    def __init__(self, max_pool_size: int = 1000): ...
    async def get_session(self) -> Session: ...

class SessionCookies:
    def add_cookie(self, cookie: CookieParam): ...
```

[Sessions](./sessions.md)

### Browser Automation

Optional Playwright integration for full browser automation with support for JavaScript-heavy sites and complex user interactions.

```python { .api }
class BrowserPool:
    def __init__(self, **options): ...

class PlaywrightBrowserController:
    def __init__(self, **options): ...
```

[Browser Automation](./browser-automation.md)

### Fingerprinting and Anti-Detection

Browser fingerprint generation and header randomization for enhanced stealth capabilities and bot protection bypass.

```python { .api }
class FingerprintGenerator:
    def generate_fingerprint(self) -> dict: ...

class HeaderGenerator:
    def get_headers(self, **options) -> HttpHeaders: ...

class DefaultFingerprintGenerator(FingerprintGenerator):
    def __init__(self, **options): ...
```

[Fingerprinting](./fingerprinting.md)

### Configuration and Routing

Global configuration management and request routing systems for fine-tuned control over crawling behavior.

```python { .api }
class Configuration:
    def __init__(self, **settings): ...

class Router:
    def default_handler(self, handler): ...
    def handler(self, label: str): ...

class ProxyConfiguration:
    def __init__(self, proxy_urls: list[str]): ...
```

[Configuration](./configuration.md)

### Statistics and Monitoring

Performance monitoring and statistics collection for tracking crawling progress and system resource usage.

```python { .api }
class Statistics:
    def __init__(self): ...
    def get_state(self) -> StatisticsState: ...

class FinalStatistics:
    requests_finished: int
    requests_failed: int
    retry_histogram: list[int]
```

[Statistics](./statistics.md)

### Error Handling

Comprehensive exception hierarchy for handling various crawling scenarios and failure modes.

```python { .api }
class HttpStatusCodeError(Exception): ...
class ProxyError(Exception): ...
class SessionError(Exception): ...
class RequestHandlerError(Exception): ...
```

[Error Handling](./error-handling.md)

### Request Management

Advanced request lifecycle management with support for static lists, dynamic queues, and tandem operations.

```python { .api }
class RequestList:
    def __init__(self, requests: list[str | Request]): ...

class RequestManager:
    def __init__(self, **options): ...

class RequestManagerTandem:
    def __init__(self, request_list: RequestList, request_queue: RequestQueue): ...
```

[Request Management](./request-management.md)

### Events System

Event-driven architecture for hooking into crawler lifecycle events and implementing custom behaviors.

```python { .api }
class EventManager:
    def emit(self, event: Event, data: EventData): ...
    def on(self, event: Event, listener: EventListener): ...

class LocalEventManager(EventManager): ...
```

[Events](./events.md)

### CLI Tools

Command-line interface for project scaffolding and development workflow automation.

```python { .api }
# Command line usage:
# crawlee create my-project
# crawlee --version
```

[CLI Tools](./cli-tools.md)