
# Crawlers

Specialized crawler implementations for different scraping needs, from simple HTTP requests to full browser automation. Crawlee provides a unified interface across different crawler types while offering specialized capabilities for specific use cases.

## Capabilities

### Basic Crawler

Foundation crawler providing core functionality including autoscaling, session management, and request lifecycle handling. All other crawlers extend from this base implementation.

```python { .api }
class BasicCrawler:
    def __init__(
        self,
        *,
        max_requests_per_crawl: int | None = None,
        max_request_retries: int = 3,
        request_handler_timeout: timedelta | None = None,
        session_pool: SessionPool | None = None,
        use_session_pool: bool = True,
        retry_on_blocked: bool = True,
        statistics: Statistics | None = None,
        **options: BasicCrawlerOptions
    ): ...

    async def run(self, requests: list[str | Request]) -> FinalStatistics: ...

    async def add_requests(
        self,
        requests: list[str | Request],
        **kwargs
    ) -> None: ...

    @property
    def router(self) -> Router: ...

    @property
    def stats(self) -> Statistics: ...
```

### HTTP Crawler

HTTP-based crawler for web scraping using configurable HTTP clients. Ideal for sites that don't require JavaScript execution.

```python { .api }
class HttpCrawler(AbstractHttpCrawler):
    def __init__(
        self,
        *,
        http_client: HttpClient | None = None,
        ignore_http_error_status_codes: list[int] | None = None,
        **options
    ): ...
```

### BeautifulSoup Crawler

HTML parsing crawler using BeautifulSoup for content extraction. Combines HTTP requests with powerful CSS selector and BeautifulSoup parsing capabilities.

```python { .api }
class BeautifulSoupCrawler(AbstractHttpCrawler):
    def __init__(
        self,
        *,
        parser_type: BeautifulSoupParserType = BeautifulSoupParserType.HTML_PARSER,
        **options
    ): ...
```

```python { .api }
class BeautifulSoupParserType(str, Enum):
    HTML_PARSER = "html.parser"
    LXML = "lxml"
    HTML5LIB = "html5lib"
```

### Parsel Crawler

CSS selector and XPath-based crawler using the Parsel library for structured data extraction from HTML and XML documents.

```python { .api }
class ParselCrawler(AbstractHttpCrawler):
    def __init__(self, **options): ...
```

### Playwright Crawler

Full browser automation crawler using Playwright for JavaScript-heavy sites and complex user interactions. Supports headless and headful modes.

```python { .api }
class PlaywrightCrawler:
    def __init__(
        self,
        *,
        browser_type: Literal["chromium", "firefox", "webkit"] = "chromium",
        browser_pool: BrowserPool | None = None,
        headless: bool = True,
        **options
    ): ...
```

### Adaptive Playwright Crawler

Intelligent crawler that automatically decides between HTTP and browser modes for each page, based on a learned rendering-type prediction.

```python { .api }
class AdaptivePlaywrightCrawler:
    def __init__(
        self,
        *,
        rendering_type_predictor: RenderingTypePredictor | None = None,
        **options
    ): ...
```

```python { .api }
class RenderingType(str, Enum):
    CLIENT_SIDE_ONLY = "client_side_only"
    SERVER_SIDE_ONLY = "server_side_only"
    CLIENT_SERVER_SIDE = "client_server_side"
```

```python { .api }
class RenderingTypePrediction:
    rendering_type: RenderingType
    probability: float
```

```python { .api }
class RenderingTypePredictor:
    def predict(self, url: str) -> RenderingTypePrediction: ...
```
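
A custom predictor only needs to implement `predict`. The following standalone sketch mirrors the interface above in plain Python without importing Crawlee; the `HeuristicPredictor` name and its URL heuristic are purely illustrative:

```python
from dataclasses import dataclass
from enum import Enum


class RenderingType(str, Enum):
    CLIENT_SIDE_ONLY = "client_side_only"
    SERVER_SIDE_ONLY = "server_side_only"
    CLIENT_SERVER_SIDE = "client_server_side"


@dataclass
class RenderingTypePrediction:
    rendering_type: RenderingType
    probability: float


class HeuristicPredictor:
    """Illustrative predictor: treat known SPA-style paths as client-rendered."""

    def __init__(self, spa_markers: tuple[str, ...] = ("#/", "/app/")):
        self.spa_markers = spa_markers

    def predict(self, url: str) -> RenderingTypePrediction:
        # URLs matching an SPA marker are likely to need a real browser;
        # everything else is assumed to be server-rendered static HTML.
        if any(marker in url for marker in self.spa_markers):
            return RenderingTypePrediction(RenderingType.CLIENT_SIDE_ONLY, 0.9)
        return RenderingTypePrediction(RenderingType.SERVER_SIDE_ONLY, 0.6)


predictor = HeuristicPredictor()
print(predictor.predict("https://example.com/app/dashboard").rendering_type.value)
```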

## Crawling Contexts

### Basic Crawling Context

Base context available in all crawler request handlers with core functionality for data extraction and request management.

```python { .api }
class BasicCrawlingContext:
    request: Request
    session: Session
    log: Logger

    async def push_data(self, data: dict | list[dict]) -> None: ...
    async def enqueue_links(
        self,
        *,
        selector: str = "a[href]",
        base_url: str | None = None,
        **kwargs
    ) -> None: ...
    async def add_requests(self, requests: list[str | Request]) -> None: ...
    async def get_key_value_store(self, name: str | None = None) -> KeyValueStore: ...
    async def use_state(self, default_value: Any = None) -> Any: ...
```

### HTTP Crawling Context

Context for HTTP-based crawlers providing access to response data and HTTP-specific functionality.

```python { .api }
class HttpCrawlingContext(BasicCrawlingContext):
    response: HttpResponse

    @property
    def body(self) -> str: ...

    @property
    def content_type(self) -> str | None: ...

    @property
    def encoding(self) -> str: ...
```

### BeautifulSoup Crawling Context

Context with BeautifulSoup-parsed HTML content and CSS selector capabilities for easy data extraction.

```python { .api }
class BeautifulSoupCrawlingContext(ParsedHttpCrawlingContext):
    soup: BeautifulSoup

    def css(self, selector: str) -> list: ...
    def xpath(self, xpath: str) -> list: ...
```

### Parsel Crawling Context

Context with Parsel selector objects for advanced CSS and XPath-based data extraction from HTML and XML.

```python { .api }
class ParselCrawlingContext(ParsedHttpCrawlingContext):
    selector: Selector

    def css(self, selector: str) -> SelectorList: ...
    def xpath(self, xpath: str) -> SelectorList: ...
```

### Playwright Crawling Context

Context with Playwright page objects for full browser automation and JavaScript interaction capabilities.

```python { .api }
class PlaywrightCrawlingContext(BasicCrawlingContext):
    page: Page
    response: Response | None

    async def infinite_scroll(
        self,
        *,
        max_scroll_height: int | None = None,
        button_selector: str | None = None,
        wait_for_selector: str | None = None,
    ) -> None: ...

    async def save_snapshot(self, *, key: str | None = None) -> None: ...
```

### Playwright Pre-Navigation Context

Context available before page navigation in Playwright crawlers for setting up page configuration and listeners.

```python { .api }
class PlaywrightPreNavCrawlingContext:
    page: Page
    request: Request
    session: Session
    log: Logger
```

## Crawler Configuration

### Basic Crawler Options

Configuration options for customizing crawler behavior, performance, and resource management.

```python { .api }
class BasicCrawlerOptions:
    request_provider: RequestProvider | None = None
    request_handler: RequestHandler | None = None
    failed_request_handler: ErrorHandler | None = None
    max_requests_per_crawl: int | None = None
    max_request_retries: int = 3
    request_handler_timeout: timedelta | None = None
    navigation_timeout: timedelta | None = None
    session_pool: SessionPool | None = None
    use_session_pool: bool = True
    statistics: Statistics | None = None
    event_manager: EventManager | None = None
```

### Context Pipeline

Middleware pipeline system for processing crawling contexts with support for initialization, processing, and cleanup phases.

```python { .api }
class ContextPipeline:
    def __init__(self): ...

    def use(self, middleware: Callable) -> None: ...

    async def compose(self, context: BasicCrawlingContext) -> None: ...
```
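
The use/compose pattern can be sketched in plain Python. This standalone toy pipeline (not Crawlee's actual implementation; the explicit `handler` argument to `compose` is an illustrative simplification) shows how async middlewares wrap a final handler, giving each one an initialization phase before the chain runs and a cleanup phase after:

```python
import asyncio
from typing import Any, Awaitable, Callable

Handler = Callable[[dict[str, Any]], Awaitable[None]]


class ToyPipeline:
    """Illustrative middleware pipeline: each middleware receives the context
    and a next-step callable, so it can run code before and after the rest
    of the chain."""

    def __init__(self) -> None:
        self._middlewares: list[Callable[[dict[str, Any], Handler], Awaitable[None]]] = []

    def use(self, middleware: Callable[[dict[str, Any], Handler], Awaitable[None]]) -> None:
        self._middlewares.append(middleware)

    async def compose(self, context: dict[str, Any], handler: Handler) -> None:
        async def run(index: int, ctx: dict[str, Any]) -> None:
            if index == len(self._middlewares):
                await handler(ctx)  # end of chain: invoke the request handler
            else:
                await self._middlewares[index](ctx, lambda c: run(index + 1, c))

        await run(0, context)


async def timing(ctx, next_step):
    ctx["events"].append("start")   # initialization phase
    await next_step(ctx)
    ctx["events"].append("end")     # cleanup phase


async def handler(ctx):
    ctx["events"].append("handle")  # the request handler itself


pipeline = ToyPipeline()
pipeline.use(timing)
context = {"events": []}
asyncio.run(pipeline.compose(context, handler))
print(context["events"])  # ['start', 'handle', 'end']
```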

### HTTP Crawling Result

Result object containing response data and metadata from HTTP-based crawling operations.

```python { .api }
class HttpCrawlingResult:
    http_response: HttpResponse
    encoding: str | None = None
```

### Abstract HTTP Parser

Base parser interface for implementing custom response parsing in HTTP-based crawlers.

```python { .api }
class AbstractHttpParser:
    async def parse(
        self,
        crawling_context: HttpCrawlingContext
    ) -> ParsedHttpCrawlingContext: ...
```

## Usage Examples

### Basic HTTP Scraping

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

async def main():
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext):
        context.log.info(f'Processing {context.request.url}')

        data = {
            'url': context.request.url,
            'status': context.response.status_code,
            'length': len(context.body),
        }

        await context.push_data(data)

    await crawler.run(['https://example.com'])

asyncio.run(main())
```

### Browser Automation with Playwright

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler(headless=True)

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext):
        await context.page.wait_for_load_state('networkidle')

        title = await context.page.title()

        data = {
            'url': context.request.url,
            'title': title,
        }

        await context.push_data(data)

    await crawler.run(['https://example.com'])

asyncio.run(main())
```

### Adaptive Crawling

```python
import asyncio

from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext

async def main():
    crawler = AdaptivePlaywrightCrawler()

    @crawler.router.default_handler
    async def handler(context: AdaptivePlaywrightCrawlingContext):
        # The context automatically switches between HTTP and browser modes
        # based on the page's rendering requirements.
        if hasattr(context, 'page'):
            # Browser mode - the page requires JavaScript.
            title = await context.page.title()
        else:
            # HTTP mode - static content.
            title = context.soup.title.string if context.soup.title else None

        data = {
            'url': context.request.url,
            'title': title,
        }

        await context.push_data(data)

    await crawler.run(['https://example.com'])

asyncio.run(main())
```