# Crawlers

Specialized crawler implementations for different scraping needs, from simple HTTP requests to full browser automation. Crawlee provides a unified interface across different crawler types while offering specialized capabilities for specific use cases.

## Capabilities

### Basic Crawler

Foundation crawler providing core functionality, including autoscaling, session management, and request lifecycle handling. All other crawlers build on this base implementation.

```python { .api }
class BasicCrawler:
    def __init__(
        self,
        *,
        max_requests_per_crawl: int | None = None,
        max_request_retries: int = 3,
        request_handler_timeout: timedelta | None = None,
        session_pool: SessionPool | None = None,
        use_session_pool: bool = True,
        retry_on_blocked: bool = True,
        statistics: Statistics | None = None,
        **options: BasicCrawlerOptions
    ): ...

    async def run(self, requests: list[str | Request]) -> FinalStatistics: ...

    async def add_requests(
        self,
        requests: list[str | Request],
        **kwargs
    ) -> None: ...

    @property
    def router(self) -> Router: ...

    @property
    def stats(self) -> Statistics: ...
```
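
As a stand-alone illustration of the retry options above, the sketch below (plain Python, no Crawlee imports; `FlakyHandler` and `run_with_retries` are invented names) shows how `max_request_retries` bounds the number of attempts, assuming retries are counted on top of the initial attempt:

```python
import asyncio

class FlakyHandler:
    """Stand-in request handler that fails a fixed number of times, then succeeds."""

    def __init__(self, fail_times: int):
        self.fail_times = fail_times
        self.attempts = 0

    async def __call__(self, url: str) -> str:
        self.attempts += 1
        if self.attempts <= self.fail_times:
            raise RuntimeError(f'transient error on {url}')
        return f'ok: {url}'

async def run_with_retries(handler, url: str, max_request_retries: int = 3):
    # One initial attempt plus up to `max_request_retries` retries
    # (assumed semantics, for illustration only).
    for _ in range(1 + max_request_retries):
        try:
            return await handler(url)
        except RuntimeError:
            continue
    return None  # give up: the request would be marked as failed

flaky = FlakyHandler(fail_times=2)
result = asyncio.run(run_with_retries(flaky, 'https://example.com'))
print(result)  # ok: https://example.com
```

With two transient failures and the default of three retries, the third attempt succeeds; with `fail_times=4` the same call would return `None`.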

### HTTP Crawler

HTTP-based crawler for web scraping using configurable HTTP clients. Ideal for sites that don't require JavaScript execution.

```python { .api }
class HttpCrawler(AbstractHttpCrawler):
    def __init__(
        self,
        *,
        http_client: HttpClient | None = None,
        ignore_http_error_status_codes: list[int] | None = None,
        **options
    ): ...
```

### BeautifulSoup Crawler

HTML parsing crawler using BeautifulSoup for content extraction. Combines HTTP requests with BeautifulSoup's parsing and CSS selector capabilities.

```python { .api }
class BeautifulSoupCrawler(AbstractHttpCrawler):
    def __init__(
        self,
        *,
        parser_type: BeautifulSoupParserType = BeautifulSoupParserType.HTML_PARSER,
        **options
    ): ...
```

```python { .api }
class BeautifulSoupParserType(str, Enum):
    HTML_PARSER = "html.parser"
    LXML = "lxml"
    HTML5LIB = "html5lib"
```
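
Since the enum subclasses `str`, its values map directly onto BeautifulSoup's `features` argument. A minimal sketch using the `beautifulsoup4` package directly, with a local stand-in for the enum above:

```python
from enum import Enum
from bs4 import BeautifulSoup

# Local stand-in mirroring the enum above; each value is a valid
# `features` argument for BeautifulSoup.
class BeautifulSoupParserType(str, Enum):
    HTML_PARSER = 'html.parser'
    LXML = 'lxml'
    HTML5LIB = 'html5lib'

markup = '<html><head><title>Demo</title></head><body><p class="lead">Hello</p></body></html>'

# html.parser ships with the standard library; lxml and html5lib must be
# installed separately but are generally faster / more lenient respectively.
soup = BeautifulSoup(markup, BeautifulSoupParserType.HTML_PARSER.value)
print(soup.title.string)                          # Demo
print(soup.find('p', class_='lead').get_text())   # Hello
```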

### Parsel Crawler

CSS selector and XPath-based crawler using the Parsel library for structured data extraction from HTML and XML documents.

```python { .api }
class ParselCrawler(AbstractHttpCrawler):
    def __init__(self, **options): ...
```

### Playwright Crawler

Full browser automation crawler using Playwright for JavaScript-heavy sites and complex user interactions. Supports headless and headful modes.

```python { .api }
class PlaywrightCrawler:
    def __init__(
        self,
        *,
        browser_type: Literal["chromium", "firefox", "webkit"] = "chromium",
        browser_pool: BrowserPool | None = None,
        headless: bool = True,
        **options
    ): ...
```

### Adaptive Playwright Crawler

Intelligent crawler that automatically chooses between HTTP and browser modes on a per-page basis, guided by a pluggable rendering-type predictor.

```python { .api }
class AdaptivePlaywrightCrawler:
    def __init__(
        self,
        *,
        rendering_type_predictor: RenderingTypePredictor | None = None,
        **options
    ): ...
```

```python { .api }
class RenderingType(str, Enum):
    CLIENT_SIDE_ONLY = "client_side_only"
    SERVER_SIDE_ONLY = "server_side_only"
    CLIENT_SERVER_SIDE = "client_server_side"
```

```python { .api }
class RenderingTypePrediction:
    rendering_type: RenderingType
    probability: float
```

```python { .api }
class RenderingTypePredictor:
    def predict(self, url: str) -> RenderingTypePrediction: ...
```
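
A custom predictor only needs to implement `predict`. The sketch below is a toy path-based heuristic with local stand-ins for the types above; the real default predictor is more sophisticated:

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse

# Local stand-ins mirroring the shapes documented above.
class RenderingType(str, Enum):
    CLIENT_SIDE_ONLY = 'client_side_only'
    SERVER_SIDE_ONLY = 'server_side_only'
    CLIENT_SERVER_SIDE = 'client_server_side'

@dataclass
class RenderingTypePrediction:
    rendering_type: RenderingType
    probability: float

class PathHeuristicPredictor:
    """Toy predictor: assume URL paths under /app need a browser."""

    def predict(self, url: str) -> RenderingTypePrediction:
        path = urlparse(url).path
        if path.startswith('/app'):
            return RenderingTypePrediction(RenderingType.CLIENT_SIDE_ONLY, 0.9)
        return RenderingTypePrediction(RenderingType.SERVER_SIDE_ONLY, 0.7)

prediction = PathHeuristicPredictor().predict('https://example.com/app/dashboard')
print(prediction.rendering_type.value)  # client_side_only
```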

## Crawling Contexts

### Basic Crawling Context

Base context available in all crawler request handlers, providing core functionality for data extraction and request management.

```python { .api }
class BasicCrawlingContext:
    request: Request
    session: Session
    log: Logger

    async def push_data(self, data: dict | list[dict]) -> None: ...
    async def enqueue_links(
        self,
        *,
        selector: str = "a[href]",
        base_url: str | None = None,
        **kwargs
    ) -> None: ...
    async def add_requests(self, requests: list[str | Request]) -> None: ...
    async def get_key_value_store(self, name: str | None = None) -> KeyValueStore: ...
    async def use_state(self, default_value: Any = None) -> Any: ...
```

### HTTP Crawling Context

Context for HTTP-based crawlers providing access to response data and HTTP-specific functionality.

```python { .api }
class HttpCrawlingContext(BasicCrawlingContext):
    response: HttpResponse

    @property
    def body(self) -> str: ...

    @property
    def content_type(self) -> str | None: ...

    @property
    def encoding(self) -> str: ...
```
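
The `encoding` property is typically derived from the response's `Content-Type` header, with a fallback when no charset is declared. A stand-alone sketch of that logic using only the standard library (the function name and `utf-8` fallback are assumptions, not Crawlee internals):

```python
from email.message import Message

def encoding_from_content_type(content_type, default: str = 'utf-8') -> str:
    # Parse `text/html; charset=ISO-8859-1` style headers with the stdlib's
    # MIME machinery instead of hand-rolled string splitting.
    if not content_type:
        return default
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset() or default

print(encoding_from_content_type('text/html; charset=ISO-8859-1'))  # iso-8859-1
print(encoding_from_content_type('application/json'))               # utf-8
```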

### BeautifulSoup Crawling Context

Context with BeautifulSoup-parsed HTML content and CSS selector capabilities for easy data extraction.

```python { .api }
class BeautifulSoupCrawlingContext(ParsedHttpCrawlingContext):
    soup: BeautifulSoup

    def css(self, selector: str) -> list: ...
    def xpath(self, xpath: str) -> list: ...
```

### Parsel Crawling Context

Context with Parsel selector objects for advanced CSS and XPath-based data extraction from HTML and XML.

```python { .api }
class ParselCrawlingContext(ParsedHttpCrawlingContext):
    selector: Selector

    def css(self, selector: str) -> SelectorList: ...
    def xpath(self, xpath: str) -> SelectorList: ...
```

### Playwright Crawling Context

Context with Playwright page objects for full browser automation and JavaScript interaction capabilities.

```python { .api }
class PlaywrightCrawlingContext(BasicCrawlingContext):
    page: Page
    response: Response | None

    async def infinite_scroll(
        self,
        *,
        max_scroll_height: int | None = None,
        button_selector: str | None = None,
        wait_for_selector: str | None = None,
    ) -> None: ...

    async def save_snapshot(self, *, key: str | None = None) -> None: ...
```

### Playwright Pre-Navigation Context

Context available before page navigation in Playwright crawlers, used to set up page configuration and listeners.

```python { .api }
class PlaywrightPreNavCrawlingContext:
    page: Page
    request: Request
    session: Session
    log: Logger
```
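
Pre-navigation hooks run before each navigation, in registration order. The sketch below imitates that dispatch with a stand-in context and an invented `pre_navigation_hook` decorator; it is not Crawlee's registration API:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class FakePreNavContext:
    url: str
    # Collected page-setup actions, standing in for real Playwright calls.
    actions: list = field(default_factory=list)

hooks = []

def pre_navigation_hook(func):
    # Register a coroutine to run before every navigation.
    hooks.append(func)
    return func

@pre_navigation_hook
async def set_viewport(context: FakePreNavContext):
    context.actions.append('set viewport 1280x720')

@pre_navigation_hook
async def block_images(context: FakePreNavContext):
    context.actions.append('route **/*.{png,jpg} -> abort')

async def navigate(url: str) -> FakePreNavContext:
    context = FakePreNavContext(url=url)
    for hook in hooks:  # hooks run in registration order
        await hook(context)
    context.actions.append(f'goto {url}')
    return context

ctx = asyncio.run(navigate('https://example.com'))
print(ctx.actions)
```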

## Crawler Configuration

### Basic Crawler Options

Configuration options for customizing crawler behavior, performance, and resource management.

```python { .api }
class BasicCrawlerOptions:
    request_provider: RequestProvider | None = None
    request_handler: RequestHandler | None = None
    failed_request_handler: ErrorHandler | None = None
    max_requests_per_crawl: int | None = None
    max_request_retries: int = 3
    request_handler_timeout: timedelta | None = None
    navigation_timeout: timedelta | None = None
    session_pool: SessionPool | None = None
    use_session_pool: bool = True
    statistics: Statistics | None = None
    event_manager: EventManager | None = None
```

### Context Pipeline

Middleware pipeline system for processing crawling contexts, with support for initialization, processing, and cleanup phases.

```python { .api }
class ContextPipeline:
    def __init__(self): ...

    def use(self, middleware: Callable) -> None: ...

    async def compose(self, context: BasicCrawlingContext) -> None: ...
```
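
The initialization/processing/cleanup flow can be modeled with async generators that yield exactly once: each middleware runs up to its `yield`, the final consumer executes, and the middlewares then finish in reverse order. A stand-alone sketch (not Crawlee's implementation):

```python
import asyncio

class MiniContextPipeline:
    """Stand-in pipeline: middlewares are async generators that yield once."""

    def __init__(self):
        self._middlewares = []

    def use(self, middleware):
        self._middlewares.append(middleware)
        return self

    async def compose(self, context: dict, consumer) -> None:
        started = []
        # Initialization phase: run each middleware up to its yield,
        # threading the (possibly enriched) context forward.
        for mw in self._middlewares:
            gen = mw(context)
            context = await gen.__anext__()
            started.append(gen)
        await consumer(context)
        # Cleanup phase: resume middlewares in reverse order.
        for gen in reversed(started):
            try:
                await gen.__anext__()
            except StopAsyncIteration:
                pass

trace = []

async def timing(context):
    trace.append('timing: start')
    yield {**context, 'timed': True}
    trace.append('timing: stop')

async def parse(context):
    trace.append('parse: before handler')
    yield {**context, 'parsed': True}
    trace.append('parse: after handler')

async def handler(context):
    trace.append(f'handler: {sorted(context)}')

pipeline = MiniContextPipeline().use(timing).use(parse)
asyncio.run(pipeline.compose({'url': 'https://example.com'}, handler))
print(trace)
```

Note the symmetry: `timing` starts first and finishes last, which is what makes the pipeline suitable for setup/teardown concerns such as timers or resource handles.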

### HTTP Crawling Result

Result object containing response data and metadata from HTTP-based crawling operations.

```python { .api }
class HttpCrawlingResult:
    http_response: HttpResponse
    encoding: str | None = None
```

### Abstract HTTP Parser

Base parser interface for implementing custom response parsing in HTTP-based crawlers.

```python { .api }
class AbstractHttpParser:
    async def parse(
        self,
        crawling_context: HttpCrawlingContext
    ) -> ParsedHttpCrawlingContext: ...
```
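
Implementing the interface mostly means converting a raw response body into a parsed document attached to a richer context. A sketch with stand-in context types and a toy JSON parser (the real interface has more methods than shown here):

```python
import asyncio
import json
from dataclasses import dataclass

# Stand-ins for the context types used by the interface above.
@dataclass
class FakeHttpCrawlingContext:
    url: str
    body: str

@dataclass
class FakeParsedHttpCrawlingContext:
    url: str
    parsed: object

class JsonHttpParser:
    """Toy AbstractHttpParser analogue that parses JSON bodies."""

    async def parse(self, crawling_context: FakeHttpCrawlingContext) -> FakeParsedHttpCrawlingContext:
        return FakeParsedHttpCrawlingContext(
            url=crawling_context.url,
            parsed=json.loads(crawling_context.body),
        )

context = FakeHttpCrawlingContext(url='https://api.example.com/item', body='{"id": 7}')
parsed_context = asyncio.run(JsonHttpParser().parse(context))
print(parsed_context.parsed['id'])  # 7
```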

## Usage Examples

### Basic HTTP Scraping

```python
import asyncio
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

async def main():
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext):
        context.log.info(f'Processing {context.request.url}')

        data = {
            'url': context.request.url,
            'status': context.response.status_code,
            'length': len(context.body)
        }

        await context.push_data(data)

    await crawler.run(['https://example.com'])

asyncio.run(main())
```

### Browser Automation with Playwright

```python
import asyncio
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler(headless=True)

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext):
        await context.page.wait_for_load_state('networkidle')

        title = await context.page.title()

        data = {
            'url': context.request.url,
            'title': title
        }

        await context.push_data(data)

    await crawler.run(['https://example.com'])

asyncio.run(main())
```

### Adaptive Crawling

```python
import asyncio
from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

async def main():
    crawler = AdaptivePlaywrightCrawler()

    @crawler.router.default_handler
    async def handler(context: AdaptivePlaywrightCrawlingContext):
        # Context automatically switches between HTTP and browser modes
        # based on page rendering requirements.
        if hasattr(context, 'page'):
            # Browser mode - page requires JavaScript
            title = await context.page.title()
        else:
            # HTTP mode - static content
            title = context.soup.title.string if context.soup.title else None

        data = {
            'url': context.request.url,
            'title': title
        }

        await context.push_data(data)

    await crawler.run(['https://example.com'])

asyncio.run(main())
```