
# Crawlee

A comprehensive web scraping and browser automation library for Python designed to help developers build reliable scrapers that appear human-like and bypass modern bot protections. Crawlee provides end-to-end crawling and scraping capabilities with tools to crawl the web for links, scrape data, and persistently store it in machine-readable formats.

## Package Information

- **Package Name**: crawlee
- **Language**: Python
- **Installation**: `pip install 'crawlee[all]'` (full features) or `pip install crawlee` (core only)
- **Python Version**: ≥3.9

## Core Imports

```python
import crawlee
from crawlee import Request, service_locator
```

Common patterns for crawlers:

```python
from crawlee.crawlers import (
    BasicCrawler, HttpCrawler,
    BeautifulSoupCrawler, ParselCrawler, PlaywrightCrawler,
    AdaptivePlaywrightCrawler,
)
```

For specific functionality:

```python
from crawlee.storages import Dataset, KeyValueStore, RequestQueue
from crawlee.sessions import Session, SessionPool
from crawlee.http_clients import HttpxHttpClient, CurlImpersonateHttpClient
from crawlee import ConcurrencySettings, HttpHeaders, EnqueueStrategy
```

## Basic Usage

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push data to storage
        await context.push_data(data)

        # Enqueue all links found on the page
        await context.enqueue_links()

    # Run the crawler
    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```

## Architecture

Crawlee follows a modular architecture with clear separation of concerns:

- **Crawlers**: Main orchestrators that manage crawling workflows and provide specialized parsing capabilities
- **Storage**: Persistent data management with support for datasets, key-value stores, and request queues
- **HTTP Clients**: Pluggable HTTP implementations with support for different libraries and browser impersonation
- **Sessions**: Session management with cookie persistence and rotation
- **Request Management**: Advanced request queuing, deduplication, and lifecycle management
- **Browser Automation**: Optional Playwright integration for JavaScript-heavy sites
- **Fingerprinting**: Browser fingerprint generation for enhanced anti-detection capabilities

This design enables Crawlee to handle everything from simple HTTP scraping to complex browser automation while maintaining human-like behavior patterns.

## Core API

### Core Types and Request Handling

Essential types and request management functionality that forms the foundation of all crawling operations.

```python { .api }
class Request:
    @classmethod
    def from_url(cls, url: str, **options) -> Request: ...

class ConcurrencySettings:
    def __init__(
        self,
        min_concurrency: int = 1,
        max_concurrency: int = 200,
        max_tasks_per_minute: float = float('inf'),
        desired_concurrency: int | None = None
    ): ...

class HttpHeaders(Mapping[str, str]):
    def __init__(self, headers: dict[str, str] | None = None): ...

service_locator: ServiceLocator
```

[Core Types](./core-types.md)

## Capabilities

### Web Crawlers

Specialized crawler implementations for different scraping needs, from HTTP-only to full browser automation with intelligent adaptation between modes.

```python { .api }
class BasicCrawler:
    def __init__(self, **options): ...
    async def run(self, requests: list[str | Request]): ...

class BeautifulSoupCrawler(AbstractHttpCrawler):
    def __init__(self, **options): ...

class PlaywrightCrawler:
    def __init__(self, **options): ...

class AdaptivePlaywrightCrawler:
    def __init__(self, **options): ...
```

[Crawlers](./crawlers.md)

### Data Storage

Persistent storage solutions for structured data, key-value pairs, and request queue management with built-in export capabilities.

```python { .api }
class Dataset:
    def push_data(self, data: dict | list[dict]): ...
    def export_to(self, format: str, path: str): ...

class KeyValueStore:
    def set_value(self, key: str, value: Any): ...
    def get_value(self, key: str): ...

class RequestQueue:
    def add_request(self, request: Request): ...
    def fetch_next_request(self) -> Request | None: ...
```

[Storage](./storage.md)

### HTTP Clients

Pluggable HTTP client implementations supporting different libraries and browser impersonation for enhanced anti-detection capabilities.

```python { .api }
class HttpxHttpClient(HttpClient):
    def __init__(self, **options): ...

class CurlImpersonateHttpClient(HttpClient):
    def __init__(self, **options): ...

class HttpResponse:
    status_code: int
    headers: HttpHeaders
    text: str
    content: bytes
```

[HTTP Clients](./http-clients.md)

### Session Management

Session and cookie management with rotation capabilities for maintaining state across requests and avoiding detection.

```python { .api }
class Session:
    def __init__(self, session_pool: SessionPool): ...

class SessionPool:
    def __init__(self, max_pool_size: int = 1000): ...
    def get_session(self) -> Session: ...

class SessionCookies:
    def add_cookie(self, cookie: CookieParam): ...
```

[Sessions](./sessions.md)

### Browser Automation

Optional Playwright integration for full browser automation with support for JavaScript-heavy sites and complex user interactions.

```python { .api }
class BrowserPool:
    def __init__(self, **options): ...

class PlaywrightBrowserController:
    def __init__(self, **options): ...
```

[Browser Automation](./browser-automation.md)

### Fingerprinting and Anti-Detection

Browser fingerprint generation and header randomization for enhanced stealth capabilities and bot protection bypass.

```python { .api }
class FingerprintGenerator:
    def generate_fingerprint(self) -> dict: ...

class HeaderGenerator:
    def get_headers(self, **options: HeaderGeneratorOptions) -> HttpHeaders: ...

class DefaultFingerprintGenerator(FingerprintGenerator):
    def __init__(self, **options): ...
```

[Fingerprinting](./fingerprinting.md)

224

225

### Configuration and Routing

226

227

Global configuration management and request routing systems for fine-tuned control over crawling behavior.

228

229

```python { .api }

230

class Configuration:

231

def __init__(self, **settings): ...

232

233

class Router:

234

def default_handler(self, handler): ...

235

def route(self, label: str, handler): ...

236

237

class ProxyConfiguration:

238

def __init__(self, proxy_urls: list[str]): ...

239

```

240

241

[Configuration](./configuration.md)

### Statistics and Monitoring

Performance monitoring and statistics collection for tracking crawling progress and system resource usage.

```python { .api }
class Statistics:
    def __init__(self): ...
    def get_state(self) -> StatisticsState: ...

class FinalStatistics:
    requests_finished: int
    requests_failed: int
    retry_histogram: list[int]
```

[Statistics](./statistics.md)

259

260

### Error Handling

261

262

Comprehensive exception hierarchy for handling various crawling scenarios and failure modes.

263

264

```python { .api }

265

class HttpStatusCodeError(Exception): ...

266

class ProxyError(Exception): ...

267

class SessionError(Exception): ...

268

class RequestHandlerError(Exception): ...

269

```

270

271

[Error Handling](./error-handling.md)

### Request Management

Advanced request lifecycle management with support for static lists, dynamic queues, and tandem operations.

```python { .api }
class RequestList:
    def __init__(self, requests: list[str | Request]): ...

class RequestManager:
    def __init__(self, **options): ...

class RequestManagerTandem:
    def __init__(self, request_list: RequestList, request_queue: RequestQueue): ...
```

[Request Management](./request-management.md)

289

290

### Events System

291

292

Event-driven architecture for hooking into crawler lifecycle events and implementing custom behaviors.

293

294

```python { .api }

295

class EventManager:

296

def emit(self, event: Event, data: EventData): ...

297

def on(self, event: Event, listener: EventListener): ...

298

299

class LocalEventManager(EventManager): ...

300

```

301

302

[Events](./events.md)

### CLI Tools

Command-line interface for project scaffolding and development workflow automation.

```python { .api }
# Command line usage:
# crawlee create my-project
# crawlee --version
```

[CLI Tools](./cli-tools.md)