# Crawlers

Specialized crawler implementations for different scraping needs, from simple HTTP requests to full browser automation. Crawlee provides a unified interface across different crawler types while offering specialized capabilities for specific use cases.

## Capabilities

### Basic Crawler

Foundation crawler providing core functionality, including autoscaling, session management, and request lifecycle handling. All other crawlers build on this base implementation.

```python { .api }
class BasicCrawler:
    def __init__(
        self,
        *,
        max_requests_per_crawl: int | None = None,
        max_request_retries: int = 3,
        request_handler_timeout: timedelta | None = None,
        session_pool: SessionPool | None = None,
        use_session_pool: bool = True,
        retry_on_blocked: bool = True,
        statistics: Statistics | None = None,
        **options: BasicCrawlerOptions
    ): ...

    async def run(self, requests: list[str | Request]) -> FinalStatistics: ...

    async def add_requests(
        self,
        requests: list[str | Request],
        **kwargs
    ) -> None: ...

    @property
    def router(self) -> Router: ...

    @property
    def stats(self) -> Statistics: ...
```
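
As a stand-alone illustration of the retry options above, the sketch below (plain Python, no Crawlee imports; `FlakyHandler` and `run_with_retries` are invented names) shows how `max_request_retries` bounds the number of attempts, assuming retries are counted on top of the initial attempt:

```python
import asyncio

class FlakyHandler:
    """Stand-in request handler that fails a fixed number of times, then succeeds."""

    def __init__(self, fail_times: int):
        self.fail_times = fail_times
        self.attempts = 0

    async def __call__(self, url: str) -> str:
        self.attempts += 1
        if self.attempts <= self.fail_times:
            raise RuntimeError(f'transient error on {url}')
        return f'ok: {url}'

async def run_with_retries(handler, url: str, max_request_retries: int = 3):
    # One initial attempt plus up to `max_request_retries` retries
    # (assumed semantics, for illustration only).
    for _ in range(1 + max_request_retries):
        try:
            return await handler(url)
        except RuntimeError:
            continue
    return None  # give up: the request would be marked as failed

flaky = FlakyHandler(fail_times=2)
result = asyncio.run(run_with_retries(flaky, 'https://example.com'))
print(result)  # ok: https://example.com
```

With two transient failures and the default of three retries, the third attempt succeeds; with `fail_times=4` the same call would return `None`.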

### HTTP Crawler

HTTP-based crawler for web scraping using configurable HTTP clients. Ideal for sites that don't require JavaScript execution.

```python { .api }
class HttpCrawler(AbstractHttpCrawler):
    def __init__(
        self,
        *,
        http_client: HttpClient | None = None,
        ignore_http_error_status_codes: list[int] | None = None,
        **options
    ): ...
```

### BeautifulSoup Crawler

HTML parsing crawler using BeautifulSoup for content extraction. Combines HTTP requests with BeautifulSoup's parsing and CSS selector capabilities.

```python { .api }
class BeautifulSoupCrawler(AbstractHttpCrawler):
    def __init__(
        self,
        *,
        parser_type: BeautifulSoupParserType = BeautifulSoupParserType.HTML_PARSER,
        **options
    ): ...
```

```python { .api }
class BeautifulSoupParserType(str, Enum):
    HTML_PARSER = "html.parser"
    LXML = "lxml"
    HTML5LIB = "html5lib"
```
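
Since the enum subclasses `str`, its values map directly onto BeautifulSoup's `features` argument. A minimal sketch using the `beautifulsoup4` package directly, with a local stand-in for the enum above:

```python
from enum import Enum
from bs4 import BeautifulSoup

# Local stand-in mirroring the enum above; each value is a valid
# `features` argument for BeautifulSoup.
class BeautifulSoupParserType(str, Enum):
    HTML_PARSER = 'html.parser'
    LXML = 'lxml'
    HTML5LIB = 'html5lib'

markup = '<html><head><title>Demo</title></head><body><p class="lead">Hello</p></body></html>'

# html.parser ships with the standard library; lxml and html5lib must be
# installed separately but are generally faster / more lenient respectively.
soup = BeautifulSoup(markup, BeautifulSoupParserType.HTML_PARSER.value)
print(soup.title.string)                          # Demo
print(soup.find('p', class_='lead').get_text())   # Hello
```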

### Parsel Crawler

CSS selector and XPath-based crawler using the Parsel library for structured data extraction from HTML and XML documents.

```python { .api }
class ParselCrawler(AbstractHttpCrawler):
    def __init__(self, **options): ...
```

### Playwright Crawler

Full browser automation crawler using Playwright for JavaScript-heavy sites and complex user interactions. Supports headless and headful modes.

```python { .api }
class PlaywrightCrawler:
    def __init__(
        self,
        *,
        browser_type: Literal["chromium", "firefox", "webkit"] = "chromium",
        browser_pool: BrowserPool | None = None,
        headless: bool = True,
        **options
    ): ...
```

### Adaptive Playwright Crawler

Intelligent crawler that automatically chooses between HTTP and browser modes on a per-page basis, guided by a pluggable rendering-type predictor.

```python { .api }
class AdaptivePlaywrightCrawler:
    def __init__(
        self,
        *,
        rendering_type_predictor: RenderingTypePredictor | None = None,
        **options
    ): ...
```

```python { .api }
class RenderingType(str, Enum):
    CLIENT_SIDE_ONLY = "client_side_only"
    SERVER_SIDE_ONLY = "server_side_only"
    CLIENT_SERVER_SIDE = "client_server_side"
```

```python { .api }
class RenderingTypePrediction:
    rendering_type: RenderingType
    probability: float
```

```python { .api }
class RenderingTypePredictor:
    def predict(self, url: str) -> RenderingTypePrediction: ...
```
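
A custom predictor only needs to implement `predict`. The sketch below is a toy path-based heuristic with local stand-ins for the types above; the real default predictor is more sophisticated:

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse

# Local stand-ins mirroring the shapes documented above.
class RenderingType(str, Enum):
    CLIENT_SIDE_ONLY = 'client_side_only'
    SERVER_SIDE_ONLY = 'server_side_only'
    CLIENT_SERVER_SIDE = 'client_server_side'

@dataclass
class RenderingTypePrediction:
    rendering_type: RenderingType
    probability: float

class PathHeuristicPredictor:
    """Toy predictor: assume URL paths under /app need a browser."""

    def predict(self, url: str) -> RenderingTypePrediction:
        path = urlparse(url).path
        if path.startswith('/app'):
            return RenderingTypePrediction(RenderingType.CLIENT_SIDE_ONLY, 0.9)
        return RenderingTypePrediction(RenderingType.SERVER_SIDE_ONLY, 0.7)

prediction = PathHeuristicPredictor().predict('https://example.com/app/dashboard')
print(prediction.rendering_type.value)  # client_side_only
```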

## Crawling Contexts

### Basic Crawling Context

Base context available in all crawler request handlers, providing core functionality for data extraction and request management.

```python { .api }
class BasicCrawlingContext:
    request: Request
    session: Session
    log: Logger

    async def push_data(self, data: dict | list[dict]) -> None: ...
    async def enqueue_links(
        self,
        *,
        selector: str = "a[href]",
        base_url: str | None = None,
        **kwargs
    ) -> None: ...
    async def add_requests(self, requests: list[str | Request]) -> None: ...
    async def get_key_value_store(self, name: str | None = None) -> KeyValueStore: ...
    async def use_state(self, default_value: Any = None) -> Any: ...
```

### HTTP Crawling Context

Context for HTTP-based crawlers providing access to response data and HTTP-specific functionality.

```python { .api }
class HttpCrawlingContext(BasicCrawlingContext):
    response: HttpResponse

    @property
    def body(self) -> str: ...

    @property
    def content_type(self) -> str | None: ...

    @property
    def encoding(self) -> str: ...
```
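
The `encoding` property is typically derived from the response's `Content-Type` header, with a fallback when no charset is declared. A stand-alone sketch of that logic using only the standard library (the function name and `utf-8` fallback are assumptions, not Crawlee internals):

```python
from email.message import Message

def encoding_from_content_type(content_type, default: str = 'utf-8') -> str:
    # Parse `text/html; charset=ISO-8859-1` style headers with the stdlib's
    # MIME machinery instead of hand-rolled string splitting.
    if not content_type:
        return default
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset() or default

print(encoding_from_content_type('text/html; charset=ISO-8859-1'))  # iso-8859-1
print(encoding_from_content_type('application/json'))               # utf-8
```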

### BeautifulSoup Crawling Context

Context with BeautifulSoup-parsed HTML content and CSS selector capabilities for easy data extraction.

```python { .api }
class BeautifulSoupCrawlingContext(ParsedHttpCrawlingContext):
    soup: BeautifulSoup

    def css(self, selector: str) -> list: ...
    def xpath(self, xpath: str) -> list: ...
```

### Parsel Crawling Context

Context with Parsel selector objects for advanced CSS and XPath-based data extraction from HTML and XML.

```python { .api }
class ParselCrawlingContext(ParsedHttpCrawlingContext):
    selector: Selector

    def css(self, selector: str) -> SelectorList: ...
    def xpath(self, xpath: str) -> SelectorList: ...
```

### Playwright Crawling Context

Context with Playwright page objects for full browser automation and JavaScript interaction capabilities.

```python { .api }
class PlaywrightCrawlingContext(BasicCrawlingContext):
    page: Page
    response: Response | None

    async def infinite_scroll(
        self,
        *,
        max_scroll_height: int | None = None,
        button_selector: str | None = None,
        wait_for_selector: str | None = None,
    ) -> None: ...

    async def save_snapshot(self, *, key: str | None = None) -> None: ...
```

### Playwright Pre-Navigation Context

Context available before page navigation in Playwright crawlers, used to set up page configuration and listeners.

```python { .api }
class PlaywrightPreNavCrawlingContext:
    page: Page
    request: Request
    session: Session
    log: Logger
```
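
Pre-navigation hooks run before each navigation, in registration order. The sketch below imitates that dispatch with a stand-in context and an invented `pre_navigation_hook` decorator; it is not Crawlee's registration API:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class FakePreNavContext:
    url: str
    # Collected page-setup actions, standing in for real Playwright calls.
    actions: list = field(default_factory=list)

hooks = []

def pre_navigation_hook(func):
    # Register a coroutine to run before every navigation.
    hooks.append(func)
    return func

@pre_navigation_hook
async def set_viewport(context: FakePreNavContext):
    context.actions.append('set viewport 1280x720')

@pre_navigation_hook
async def block_images(context: FakePreNavContext):
    context.actions.append('route **/*.{png,jpg} -> abort')

async def navigate(url: str) -> FakePreNavContext:
    context = FakePreNavContext(url=url)
    for hook in hooks:  # hooks run in registration order
        await hook(context)
    context.actions.append(f'goto {url}')
    return context

ctx = asyncio.run(navigate('https://example.com'))
print(ctx.actions)
```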

## Crawler Configuration

### Basic Crawler Options

Configuration options for customizing crawler behavior, performance, and resource management.

```python { .api }
class BasicCrawlerOptions:
    request_provider: RequestProvider | None = None
    request_handler: RequestHandler | None = None
    failed_request_handler: ErrorHandler | None = None
    max_requests_per_crawl: int | None = None
    max_request_retries: int = 3
    request_handler_timeout: timedelta | None = None
    navigation_timeout: timedelta | None = None
    session_pool: SessionPool | None = None
    use_session_pool: bool = True
    statistics: Statistics | None = None
    event_manager: EventManager | None = None
```

### Context Pipeline

Middleware pipeline system for processing crawling contexts, with support for initialization, processing, and cleanup phases.

```python { .api }
class ContextPipeline:
    def __init__(self): ...

    def use(self, middleware: Callable) -> None: ...

    async def compose(self, context: BasicCrawlingContext) -> None: ...
```
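
The initialization/processing/cleanup flow can be modeled with async generators that yield exactly once: each middleware runs up to its `yield`, the final consumer executes, and the middlewares then finish in reverse order. A stand-alone sketch (not Crawlee's implementation):

```python
import asyncio

class MiniContextPipeline:
    """Stand-in pipeline: middlewares are async generators that yield once."""

    def __init__(self):
        self._middlewares = []

    def use(self, middleware):
        self._middlewares.append(middleware)
        return self

    async def compose(self, context: dict, consumer) -> None:
        started = []
        # Initialization phase: run each middleware up to its yield,
        # threading the (possibly enriched) context forward.
        for mw in self._middlewares:
            gen = mw(context)
            context = await gen.__anext__()
            started.append(gen)
        await consumer(context)
        # Cleanup phase: resume middlewares in reverse order.
        for gen in reversed(started):
            try:
                await gen.__anext__()
            except StopAsyncIteration:
                pass

trace = []

async def timing(context):
    trace.append('timing: start')
    yield {**context, 'timed': True}
    trace.append('timing: stop')

async def parse(context):
    trace.append('parse: before handler')
    yield {**context, 'parsed': True}
    trace.append('parse: after handler')

async def handler(context):
    trace.append(f'handler: {sorted(context)}')

pipeline = MiniContextPipeline().use(timing).use(parse)
asyncio.run(pipeline.compose({'url': 'https://example.com'}, handler))
print(trace)
```

Note the symmetry: `timing` starts first and finishes last, which is what makes the pipeline suitable for setup/teardown concerns such as timers or resource handles.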

### HTTP Crawling Result

Result object containing response data and metadata from HTTP-based crawling operations.

```python { .api }
class HttpCrawlingResult:
    http_response: HttpResponse
    encoding: str | None = None
```

### Abstract HTTP Parser

Base parser interface for implementing custom response parsing in HTTP-based crawlers.

```python { .api }
class AbstractHttpParser:
    async def parse(
        self,
        crawling_context: HttpCrawlingContext
    ) -> ParsedHttpCrawlingContext: ...
```
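
Implementing the interface mostly means converting a raw response body into a parsed document attached to a richer context. A sketch with stand-in context types and a toy JSON parser (the real interface has more methods than shown here):

```python
import asyncio
import json
from dataclasses import dataclass

# Stand-ins for the context types used by the interface above.
@dataclass
class FakeHttpCrawlingContext:
    url: str
    body: str

@dataclass
class FakeParsedHttpCrawlingContext:
    url: str
    parsed: object

class JsonHttpParser:
    """Toy AbstractHttpParser analogue that parses JSON bodies."""

    async def parse(self, crawling_context: FakeHttpCrawlingContext) -> FakeParsedHttpCrawlingContext:
        return FakeParsedHttpCrawlingContext(
            url=crawling_context.url,
            parsed=json.loads(crawling_context.body),
        )

context = FakeHttpCrawlingContext(url='https://api.example.com/item', body='{"id": 7}')
parsed_context = asyncio.run(JsonHttpParser().parse(context))
print(parsed_context.parsed['id'])  # 7
```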

## Usage Examples

### Basic HTTP Scraping

```python
import asyncio
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

async def main():
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext):
        context.log.info(f'Processing {context.request.url}')

        data = {
            'url': context.request.url,
            'status': context.response.status_code,
            'length': len(context.body)
        }

        await context.push_data(data)

    await crawler.run(['https://example.com'])

asyncio.run(main())
```

### Browser Automation with Playwright

```python
import asyncio
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler(headless=True)

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext):
        await context.page.wait_for_load_state('networkidle')

        title = await context.page.title()

        data = {
            'url': context.request.url,
            'title': title
        }

        await context.push_data(data)

    await crawler.run(['https://example.com'])

asyncio.run(main())
```

### Adaptive Crawling

```python
import asyncio
from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

async def main():
    crawler = AdaptivePlaywrightCrawler()

    @crawler.router.default_handler
    async def handler(context: AdaptivePlaywrightCrawlingContext):
        # Context automatically switches between HTTP and browser modes
        # based on page rendering requirements.
        if hasattr(context, 'page'):
            # Browser mode - page requires JavaScript
            title = await context.page.title()
        else:
            # HTTP mode - static content
            title = context.soup.title.string if context.soup.title else None

        data = {
            'url': context.request.url,
            'title': title
        }

        await context.push_data(data)

    await crawler.run(['https://example.com'])

asyncio.run(main())
```