tessl/npm-crawlee

The scalable web crawling and scraping library for JavaScript/Node.js that enables development of data extraction and web automation jobs with headless Chrome and Puppeteer.

Describes: npmpkg:npm/crawlee@3.15.x

To install, run

npx @tessl/cli install tessl/npm-crawlee@3.15.0

# Crawlee

Crawlee is a comprehensive web crawling and scraping library for Node.js that enables development of robust data extraction and web automation jobs. It provides a unified interface for various crawling strategies, from simple HTTP requests to full browser automation with headless Chrome, Puppeteer, and Playwright.

## Package Information

- **Package Name**: crawlee
- **Package Type**: npm
- **Language**: TypeScript/JavaScript
- **Installation**: `npm install crawlee`

## Core Imports

```typescript
import {
  // Core crawlers
  BasicCrawler,
  HttpCrawler,
  CheerioCrawler,
  JSDOMCrawler,
  LinkedOMCrawler,
  PuppeteerCrawler,
  PlaywrightCrawler,
  FileDownload,

  // Storage
  Dataset,
  KeyValueStore,
  RequestQueue,
  RequestList,
  RecoverableState,

  // Session management
  SessionPool,
  Session,

  // Configuration and proxies
  Configuration,
  ProxyConfiguration,

  // Error handling
  NonRetryableError,
  CriticalError,
  MissingRouteError,
  RetryRequestError,
  SessionError,
  BrowserLaunchError,

  // State management
  useState,
  purgeDefaultStorages,

  // Utilities
  utils,
  enqueueLinks,
  sleep,
} from "crawlee";
```

For CommonJS:

```javascript
const {
  // Core crawlers
  BasicCrawler,
  HttpCrawler,
  CheerioCrawler,
  JSDOMCrawler,
  LinkedOMCrawler,
  PuppeteerCrawler,
  PlaywrightCrawler,
  FileDownload,

  // Storage
  Dataset,
  KeyValueStore,
  RequestQueue,
  RequestList,
  RecoverableState,

  // Session management
  SessionPool,
  Session,

  // Configuration and proxies
  Configuration,
  ProxyConfiguration,

  // Error handling
  NonRetryableError,
  CriticalError,
  MissingRouteError,
  RetryRequestError,
  SessionError,
  BrowserLaunchError,

  // State management
  useState,
  purgeDefaultStorages,

  // Utilities
  utils,
  enqueueLinks,
  sleep,
} = require("crawlee");
```

## Basic Usage

```typescript
import { CheerioCrawler, Dataset } from "crawlee";

const crawler = new CheerioCrawler({
  requestHandler: async ({ $, request, enqueueLinks }) => {
    // Extract data from the page
    const title = $('title').text();
    const price = $('.price').text();

    // Save data to dataset
    await Dataset.pushData({
      url: request.loadedUrl,
      title,
      price,
    });

    // Find and enqueue new links
    await enqueueLinks({
      selector: 'a[href^="/products/"]',
      label: 'PRODUCT',
    });
  },
});

// Add initial URLs
await crawler.addRequests(['https://example.com/products']);

// Run the crawler
await crawler.run();
```

## Architecture

Crawlee is built around several key architectural components:

- **Crawler Hierarchy**: Specialized crawlers built on a common foundation (`BasicCrawler` → `HttpCrawler` → `CheerioCrawler`/`JSDOMCrawler`/`LinkedOMCrawler`, and `BasicCrawler` → `BrowserCrawler` → `PuppeteerCrawler`/`PlaywrightCrawler`)
- **Storage System**: Unified storage interfaces for datasets, key-value stores, and request queues
- **Autoscaling**: Automatic concurrency management based on system resources
- **Session Management**: Session rotation and proxy handling for large-scale crawling
- **Request Routing**: URL pattern-based request routing with handlers
- **Browser Pool**: Efficient browser instance management and reuse

## Capabilities

### Core Crawling

Foundation classes for building custom crawlers with autoscaling, request management, and error handling.

```typescript { .api }
class BasicCrawler<Context = BasicCrawlingContext> {
  constructor(options: BasicCrawlerOptions<Context>);
  run(): Promise<FinalStatistics>;
  addRequests(requests: (string | RequestOptions)[]): Promise<void>;
}

class AutoscaledPool {
  constructor(options: AutoscaledPoolOptions);
  run(): Promise<void>;
  abort(): Promise<void>;
  pause(): Promise<void>;
  resume(): Promise<void>;
}
```

[Core Crawling](./core-crawling.md)

### HTTP Crawling

Server-side HTML parsing crawlers for efficient data extraction without browser overhead.

```typescript { .api }
class HttpCrawler extends BasicCrawler<HttpCrawlingContext> {
  constructor(options: HttpCrawlerOptions);
}

class CheerioCrawler extends HttpCrawler {
  constructor(options: CheerioCrawlerOptions);
}

class JSDOMCrawler extends HttpCrawler {
  constructor(options: JSDOMCrawlerOptions);
}
```

[HTTP Crawling](./http-crawling.md)

### Browser Crawling

Full browser automation with Puppeteer and Playwright for JavaScript-heavy websites.

```typescript { .api }
class BrowserCrawler extends BasicCrawler<BrowserCrawlingContext> {
  constructor(options: BrowserCrawlerOptions);
}

class PuppeteerCrawler extends BrowserCrawler {
  constructor(options: PuppeteerCrawlerOptions);
}

class PlaywrightCrawler extends BrowserCrawler {
  constructor(options: PlaywrightCrawlerOptions);
}
```

[Browser Crawling](./browser-crawling.md)

### Storage

Persistent storage solutions for structured data, key-value pairs, and request management.

```typescript { .api }
class Dataset {
  static open(idOrName?: string): Promise<Dataset>;
  pushData(data: Dictionary | Dictionary[]): Promise<void>;
  getData(options?: DatasetDataOptions): Promise<DatasetData>;
}

class KeyValueStore {
  static open(idOrName?: string): Promise<KeyValueStore>;
  setValue(key: string, value: any, options?: RecordOptions): Promise<void>;
  getValue<T>(key: string): Promise<T | null>;
}

class RequestQueue {
  static open(idOrName?: string): Promise<RequestQueue>;
  addRequest(request: RequestOptions | string): Promise<QueueOperationInfo>;
  fetchNextRequest(): Promise<Request | null>;
}
```

[Storage](./storage.md)

### Utilities

Helper functions for URL extraction, social media parsing, and system detection.

```typescript { .api }
const utils: {
  sleep(millis?: number): Promise<void>;
  enqueueLinks(options: EnqueueLinksOptions): Promise<BatchAddRequestsResult>;
  social: {
    parseHandlesFromHtml(html: string): SocialHandles;
    emailsFromText(text: string): string[];
    phonesFromText(text: string): string[];
  };
  downloadListOfUrls(options: DownloadListOfUrlsOptions): Promise<string[]>;
  parseOpenGraph(html: string): Dictionary;
  isDocker(): boolean;
  isLambda(): boolean;
};
```

[Utilities](./utilities.md)

### Session Management

Session rotation and proxy management for handling anti-bot measures.

```typescript { .api }
class SessionPool {
  constructor(options?: SessionPoolOptions);
  getSession(request?: Request): Promise<Session>;
  markSessionBad(session: Session): Promise<void>;
}

class Session {
  constructor(options: SessionOptions);
  getCookieString(url: string): string;
  setPuppeteerCookies(page: Page, domain?: string): Promise<void>;
}
```

[Session Management](./session-management.md)

### Configuration and Proxies

Global configuration management and proxy handling for distributed crawling.

```typescript { .api }
class Configuration {
  static getGlobalConfig(): Configuration;
  get(key: string): any;
  set(key: string, value: any): void;
}

class ProxyConfiguration {
  constructor(options?: ProxyConfigurationOptions);
  newUrl(sessionId?: number | string): Promise<string | undefined>;
  newProxyInfo(sessionId?: number | string): Promise<ProxyInfo | undefined>;
}
```

[Configuration and Proxies](./configuration-proxies.md)

## Error Handling

Comprehensive error handling system with specialized error types for different failure scenarios.

```typescript { .api }
/**
 * Base error for requests that should not be retried
 */
class NonRetryableError extends Error {
  constructor(message?: string);
}

/**
 * Critical error that extends NonRetryableError
 */
class CriticalError extends NonRetryableError {
  constructor(message?: string);
}

/**
 * Error indicating a missing route handler
 */
class MissingRouteError extends CriticalError {
  constructor(message?: string);
}

/**
 * Error indicating that a request should be retried
 */
class RetryRequestError extends Error {
  constructor(message?: string, options?: { retryAfter?: number });
}

/**
 * Session-related error extending RetryRequestError
 */
class SessionError extends RetryRequestError {
  constructor(session: Session, message?: string, options?: { retryAfter?: number });
}

/**
 * Browser launch error for browser pool issues
 */
class BrowserLaunchError extends CriticalError {
  constructor(message?: string);
}

/**
 * Cookie parsing error for session management
 */
class CookieParseError extends Error {
  constructor(message?: string);
}
```

## Common Types

```typescript { .api }
interface BasicCrawlerOptions<Context> {
  requestList?: RequestList;
  requestQueue?: RequestQueue;
  requestHandler: (context: Context) => Promise<void>;
  maxRequestRetries?: number;
  maxRequestsPerCrawl?: number;
  maxConcurrency?: number;
  autoscaledPoolOptions?: AutoscaledPoolOptions;
  sessionPoolOptions?: SessionPoolOptions;
  useSessionPool?: boolean;
  persistCookiesPerSession?: boolean;
}

interface BasicCrawlingContext<UserData = Dictionary> {
  request: Request<UserData>;
  session?: Session;
  proxyInfo?: ProxyInfo;
  response?: IncomingMessage;
  crawler: BasicCrawler;
  log: Log;
  sendRequest<T>(overrideOptions?: Partial<OptionsInit>): Promise<T>;
  enqueueLinks(options?: EnqueueLinksOptions): Promise<BatchAddRequestsResult>;
  pushData(data: Dictionary | Dictionary[]): Promise<void>;
  setValue(key: string, value: any, options?: RecordOptions): Promise<void>;
  getValue<T>(key: string): Promise<T | null>;
}

interface Request<UserData = Dictionary> {
  url: string;
  loadedUrl?: string;
  uniqueKey: string;
  method?: HttpMethod;
  payload?: string;
  noRetry?: boolean;
  retryCount?: number;
  errorMessages?: string[];
  headers?: Dictionary;
  userData?: UserData;
  handledAt?: Date;
  label?: string;
  keepUrlFragment?: boolean;
}

interface ProxyInfo {
  url: string;
  hostname: string;
  port: number;
  auth?: {
    username: string;
    password: string;
  };
  protocol: string;
  sessionId?: string | number;
}

interface FinalStatistics {
  requestsFinished: number;
  requestsFailed: number;
  requestsRetries: number;
  requestsFailedPerMinute: number;
  requestsFinishedPerMinute: number;
  requestMinDurationMillis: number;
  requestMaxDurationMillis: number;
  requestTotalDurationMillis: number;
  crawlerStartedAt: Date;
  crawlerFinishedAt: Date;
  statsId: string;
}
```