The scalable web crawling and scraping library for JavaScript/Node.js that enables development of data extraction and web automation jobs with headless Chrome and Puppeteer.
npx @tessl/cli install tessl/npm-crawlee@3.15.00
# Crawlee

Crawlee is a comprehensive web crawling and scraping library for Node.js that enables development of robust data extraction and web automation jobs. It provides a unified interface for various crawling strategies, from simple HTTP requests to full browser automation with headless Chrome, Puppeteer, and Playwright.

## Package Information

- **Package Name**: crawlee
- **Package Type**: npm
- **Language**: TypeScript/JavaScript
- **Installation**: `npm install crawlee`

## Core Imports

```typescript
import {
    // Core crawlers
    BasicCrawler,
    HttpCrawler,
    CheerioCrawler,
    JSDOMCrawler,
    LinkedOMCrawler,
    PuppeteerCrawler,
    PlaywrightCrawler,
    FileDownload,

    // Storage
    Dataset,
    KeyValueStore,
    RequestQueue,
    RequestList,
    RecoverableState,

    // Session management
    SessionPool,
    Session,

    // Configuration and proxies
    Configuration,
    ProxyConfiguration,

    // Error handling
    NonRetryableError,
    CriticalError,
    MissingRouteError,
    RetryRequestError,
    SessionError,
    BrowserLaunchError,

    // State management
    useState,
    purgeDefaultStorages,

    // Utilities
    utils,
    enqueueLinks,
    sleep
} from "crawlee";
```

For CommonJS:

```javascript
const {
    // Core crawlers
    BasicCrawler,
    HttpCrawler,
    CheerioCrawler,
    JSDOMCrawler,
    LinkedOMCrawler,
    PuppeteerCrawler,
    PlaywrightCrawler,
    FileDownload,

    // Storage
    Dataset,
    KeyValueStore,
    RequestQueue,
    RequestList,
    RecoverableState,

    // Session management
    SessionPool,
    Session,

    // Configuration and proxies
    Configuration,
    ProxyConfiguration,

    // Error handling
    NonRetryableError,
    CriticalError,
    MissingRouteError,
    RetryRequestError,
    SessionError,
    BrowserLaunchError,

    // State management
    useState,
    purgeDefaultStorages,

    // Utilities
    utils,
    enqueueLinks,
    sleep
} = require("crawlee");
```

## Basic Usage

```typescript
import { CheerioCrawler, Dataset } from "crawlee";

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, enqueueLinks }) => {
        // Extract data from the page
        const title = $('title').text();
        const price = $('.price').text();

        // Save data to dataset
        await Dataset.pushData({
            url: request.loadedUrl,
            title,
            price,
        });

        // Find and enqueue new links
        await enqueueLinks({
            selector: 'a[href^="/products/"]',
            label: 'PRODUCT',
        });
    },
});

// Add initial URLs
await crawler.addRequests(['https://example.com/products']);

// Run the crawler
await crawler.run();
```

## Architecture

Crawlee is built around several key architectural components:

- **Crawler Hierarchy**: Specialized crawlers built on a common foundation (`BasicCrawler` → `HttpCrawler` → `CheerioCrawler`/`JSDOMCrawler`/`LinkedOMCrawler`, and `BasicCrawler` → `BrowserCrawler` → `PuppeteerCrawler`/`PlaywrightCrawler`)
- **Storage System**: Unified storage interfaces for datasets, key-value stores, and request queues
- **Autoscaling**: Automatic concurrency management based on system resources
- **Session Management**: Session rotation and proxy handling for large-scale crawling
- **Request Routing**: URL pattern-based request routing with handlers
- **Browser Pool**: Efficient browser instance management and reuse

## Capabilities

### Core Crawling

Foundation classes for building custom crawlers with autoscaling, request management, and error handling.

```typescript { .api }
class BasicCrawler<Context = BasicCrawlingContext> {
    constructor(options: BasicCrawlerOptions<Context>);
    run(): Promise<FinalStatistics>;
    addRequests(requests: (string | RequestOptions)[]): Promise<void>;
}

class AutoscaledPool {
    constructor(options: AutoscaledPoolOptions);
    run(): Promise<void>;
    abort(): Promise<void>;
    pause(): Promise<void>;
    resume(): Promise<void>;
}
```

[Core Crawling](./core-crawling.md)

### HTTP Crawling

Server-side HTML parsing crawlers for efficient data extraction without browser overhead.

```typescript { .api }
class HttpCrawler extends BasicCrawler<HttpCrawlingContext> {
    constructor(options: HttpCrawlerOptions);
}

class CheerioCrawler extends HttpCrawler {
    constructor(options: CheerioCrawlerOptions);
}

class JSDOMCrawler extends HttpCrawler {
    constructor(options: JSDOMCrawlerOptions);
}
```

[HTTP Crawling](./http-crawling.md)

### Browser Crawling

Full browser automation with Puppeteer and Playwright for JavaScript-heavy websites.

```typescript { .api }
class BrowserCrawler extends BasicCrawler<BrowserCrawlingContext> {
    constructor(options: BrowserCrawlerOptions);
}

class PuppeteerCrawler extends BrowserCrawler {
    constructor(options: PuppeteerCrawlerOptions);
}

class PlaywrightCrawler extends BrowserCrawler {
    constructor(options: PlaywrightCrawlerOptions);
}
```

[Browser Crawling](./browser-crawling.md)

### Storage

Persistent storage solutions for structured data, key-value pairs, and request management.

```typescript { .api }
class Dataset {
    static open(idOrName?: string): Promise<Dataset>;
    pushData(data: Dictionary | Dictionary[]): Promise<void>;
    getData(options?: DatasetDataOptions): Promise<DatasetData>;
}

class KeyValueStore {
    static open(idOrName?: string): Promise<KeyValueStore>;
    setValue(key: string, value: any, options?: RecordOptions): Promise<void>;
    getValue<T>(key: string): Promise<T | null>;
}

class RequestQueue {
    static open(idOrName?: string): Promise<RequestQueue>;
    addRequest(request: RequestOptions | string): Promise<QueueOperationInfo>;
    fetchNextRequest(): Promise<Request | null>;
}
```

[Storage](./storage.md)

### Utilities

Helper functions for URL extraction, social media parsing, and system environment detection.

```typescript { .api }
const utils: {
    sleep(millis?: number): Promise<void>;
    enqueueLinks(options: EnqueueLinksOptions): Promise<BatchAddRequestsResult>;
    social: {
        parseHandlesFromHtml(html: string): SocialHandles;
        emailsFromText(text: string): string[];
        phonesFromText(text: string): string[];
    };
    downloadListOfUrls(options: DownloadListOfUrlsOptions): Promise<string[]>;
    parseOpenGraph(html: string): Dictionary;
    isDocker(): boolean;
    isLambda(): boolean;
};
```

[Utilities](./utilities.md)

### Session Management

Session rotation and proxy management for handling anti-bot measures.

```typescript { .api }
class SessionPool {
    constructor(options?: SessionPoolOptions);
    getSession(request?: Request): Promise<Session>;
    markSessionBad(session: Session): Promise<void>;
}

class Session {
    constructor(options: SessionOptions);
    getCookieString(url: string): string;
    setPuppeteerCookies(page: Page, domain?: string): Promise<void>;
}
```

[Session Management](./session-management.md)

### Configuration and Proxies

Global configuration management and proxy handling for distributed crawling.

```typescript { .api }
class Configuration {
    static getGlobalConfig(): Configuration;
    get(key: string): any;
    set(key: string, value: any): void;
}

class ProxyConfiguration {
    constructor(options?: ProxyConfigurationOptions);
    newUrl(sessionId?: number | string): Promise<string | undefined>;
    newProxyInfo(sessionId?: number | string): Promise<ProxyInfo | undefined>;
}
```

[Configuration and Proxies](./configuration-proxies.md)

## Error Handling

Comprehensive error handling system with specialized error types for different failure scenarios.

```typescript { .api }
/**
 * Error that marks a request as non-retryable; the request fails immediately
 */
class NonRetryableError extends Error {
    constructor(message?: string);
}

/**
 * Error that aborts the entire crawler run, not just the current request
 */
class CriticalError extends NonRetryableError {
    constructor(message?: string);
}

/**
 * Error thrown when a request's label has no matching route handler
 */
class MissingRouteError extends CriticalError {
    constructor(message?: string);
}

/**
 * Error signalling that the current request should be retried
 */
class RetryRequestError extends Error {
    constructor(message?: string, options?: { retryAfter?: number });
}

/**
 * Error indicating a blocked or broken session; triggers session rotation and a retry
 */
class SessionError extends RetryRequestError {
    constructor(session: Session, message?: string, options?: { retryAfter?: number });
}

/**
 * Error thrown when a browser instance fails to launch
 */
class BrowserLaunchError extends CriticalError {
    constructor(message?: string);
}

/**
 * Error thrown when a cookie string cannot be parsed
 */
class CookieParseError extends Error {
    constructor(message?: string);
}
```

## Common Types

```typescript { .api }
interface BasicCrawlerOptions<Context> {
    requestList?: RequestList;
    requestQueue?: RequestQueue;
    requestHandler: (context: Context) => Promise<void>;
    maxRequestRetries?: number;
    maxRequestsPerCrawl?: number;
    maxConcurrency?: number;
    autoscaledPoolOptions?: AutoscaledPoolOptions;
    sessionPoolOptions?: SessionPoolOptions;
    useSessionPool?: boolean;
    persistCookiesPerSession?: boolean;
}

interface BasicCrawlingContext<UserData = Dictionary> {
    request: Request<UserData>;
    session?: Session;
    proxyInfo?: ProxyInfo;
    response?: IncomingMessage;
    crawler: BasicCrawler;
    log: Log;
    sendRequest<T>(overrideOptions?: Partial<OptionsInit>): Promise<T>;
    enqueueLinks(options?: EnqueueLinksOptions): Promise<BatchAddRequestsResult>;
    pushData(data: Dictionary | Dictionary[]): Promise<void>;
    setValue(key: string, value: any, options?: RecordOptions): Promise<void>;
    getValue<T>(key: string): Promise<T | null>;
}

interface Request<UserData = Dictionary> {
    url: string;
    loadedUrl?: string;
    uniqueKey: string;
    method?: HttpMethod;
    payload?: string;
    noRetry?: boolean;
    retryCount?: number;
    errorMessages?: string[];
    headers?: Dictionary;
    userData?: UserData;
    handledAt?: Date;
    label?: string;
    keepUrlFragment?: boolean;
}

interface ProxyInfo {
    url: string;
    hostname: string;
    port: number;
    auth?: {
        username: string;
        password: string;
    };
    protocol: string;
    sessionId?: string | number;
}

interface FinalStatistics {
    requestsFinished: number;
    requestsFailed: number;
    requestsRetries: number;
    requestsFailedPerMinute: number;
    requestsFinishedPerMinute: number;
    requestMinDurationMillis: number;
    requestMaxDurationMillis: number;
    requestTotalDurationMillis: number;
    crawlerStartedAt: Date;
    crawlerFinishedAt: Date;
    statsId: string;
}
```