tessl/npm-crawlee

The scalable web crawling and scraping library for JavaScript/Node.js that enables development of data extraction and web automation jobs with headless Chrome and Puppeteer.

Describes: npmpkg:npm/crawlee@3.15.x

To install, run

npx @tessl/cli install tessl/npm-crawlee@3.15.0

# Crawlee

Crawlee is a comprehensive web crawling and scraping library for Node.js that enables development of robust data extraction and web automation jobs. It provides a unified interface for various crawling strategies, from simple HTTP requests to full browser automation with headless Chrome, Puppeteer, and Playwright.

## Package Information

- **Package Name**: crawlee
- **Package Type**: npm
- **Language**: TypeScript/JavaScript
- **Installation**: `npm install crawlee`

## Core Imports

```typescript
import {
  // Core crawlers
  BasicCrawler,
  HttpCrawler,
  CheerioCrawler,
  JSDOMCrawler,
  LinkedOMCrawler,
  PuppeteerCrawler,
  PlaywrightCrawler,
  FileDownload,

  // Storage
  Dataset,
  KeyValueStore,
  RequestQueue,
  RequestList,
  RecoverableState,

  // Session management
  SessionPool,
  Session,

  // Configuration and proxies
  Configuration,
  ProxyConfiguration,

  // Error handling
  NonRetryableError,
  CriticalError,
  MissingRouteError,
  RetryRequestError,
  SessionError,
  BrowserLaunchError,

  // State management
  useState,
  purgeDefaultStorages,

  // Utilities
  utils,
  enqueueLinks,
  sleep,
} from "crawlee";
```

For CommonJS:

```javascript
const {
  // Core crawlers
  BasicCrawler,
  HttpCrawler,
  CheerioCrawler,
  JSDOMCrawler,
  LinkedOMCrawler,
  PuppeteerCrawler,
  PlaywrightCrawler,
  FileDownload,

  // Storage
  Dataset,
  KeyValueStore,
  RequestQueue,
  RequestList,
  RecoverableState,

  // Session management
  SessionPool,
  Session,

  // Configuration and proxies
  Configuration,
  ProxyConfiguration,

  // Error handling
  NonRetryableError,
  CriticalError,
  MissingRouteError,
  RetryRequestError,
  SessionError,
  BrowserLaunchError,

  // State management
  useState,
  purgeDefaultStorages,

  // Utilities
  utils,
  enqueueLinks,
  sleep,
} = require("crawlee");
```

## Basic Usage

```typescript
import { CheerioCrawler, Dataset } from "crawlee";

const crawler = new CheerioCrawler({
  requestHandler: async ({ $, request, enqueueLinks }) => {
    // Extract data from the page
    const title = $('title').text();
    const price = $('.price').text();

    // Save data to dataset
    await Dataset.pushData({
      url: request.loadedUrl,
      title,
      price,
    });

    // Find and enqueue new links
    await enqueueLinks({
      selector: 'a[href^="/products/"]',
      label: 'PRODUCT',
    });
  },
});

// Add initial URLs
await crawler.addRequests(['https://example.com/products']);

// Run the crawler
await crawler.run();
```

## Architecture

Crawlee is built around several key architectural components:

- **Crawler Hierarchy**: Specialized crawlers built on a common foundation (`BasicCrawler` → `HttpCrawler` → `CheerioCrawler`/`JSDOMCrawler`/`LinkedOMCrawler`, and `BasicCrawler` → `BrowserCrawler` → `PuppeteerCrawler`/`PlaywrightCrawler`)
- **Storage System**: Unified storage interfaces for datasets, key-value stores, and request queues
- **Autoscaling**: Automatic concurrency management based on system resources
- **Session Management**: Session rotation and proxy handling for large-scale crawling
- **Request Routing**: URL pattern-based request routing with handlers
- **Browser Pool**: Efficient browser instance management and reuse

## Capabilities

### Core Crawling

Foundation classes for building custom crawlers with autoscaling, request management, and error handling.

```typescript { .api }
class BasicCrawler<Context = BasicCrawlingContext> {
  constructor(options: BasicCrawlerOptions<Context>);
  run(): Promise<FinalStatistics>;
  addRequests(requests: (string | RequestOptions)[]): Promise<void>;
}

class AutoscaledPool {
  constructor(options: AutoscaledPoolOptions);
  run(): Promise<void>;
  abort(): Promise<void>;
  pause(): Promise<void>;
  resume(): Promise<void>;
}
```

[Core Crawling](./core-crawling.md)

### HTTP Crawling

Server-side HTML parsing crawlers for efficient data extraction without browser overhead.

```typescript { .api }
class HttpCrawler extends BasicCrawler<HttpCrawlingContext> {
  constructor(options: HttpCrawlerOptions);
}

class CheerioCrawler extends HttpCrawler {
  constructor(options: CheerioCrawlerOptions);
}

class JSDOMCrawler extends HttpCrawler {
  constructor(options: JSDOMCrawlerOptions);
}
```

[HTTP Crawling](./http-crawling.md)

### Browser Crawling

Full browser automation with Puppeteer and Playwright for JavaScript-heavy websites.

```typescript { .api }
class BrowserCrawler extends BasicCrawler<BrowserCrawlingContext> {
  constructor(options: BrowserCrawlerOptions);
}

class PuppeteerCrawler extends BrowserCrawler {
  constructor(options: PuppeteerCrawlerOptions);
}

class PlaywrightCrawler extends BrowserCrawler {
  constructor(options: PlaywrightCrawlerOptions);
}
```

[Browser Crawling](./browser-crawling.md)

### Storage

Persistent storage solutions for structured data, key-value pairs, and request management.

```typescript { .api }
class Dataset {
  static open(idOrName?: string): Promise<Dataset>;
  pushData(data: Dictionary | Dictionary[]): Promise<void>;
  getData(options?: DatasetDataOptions): Promise<DatasetData>;
}

class KeyValueStore {
  static open(idOrName?: string): Promise<KeyValueStore>;
  setValue(key: string, value: any, options?: RecordOptions): Promise<void>;
  getValue<T>(key: string): Promise<T | null>;
}

class RequestQueue {
  static open(idOrName?: string): Promise<RequestQueue>;
  addRequest(request: RequestOptions | string): Promise<QueueOperationInfo>;
  fetchNextRequest(): Promise<Request | null>;
}
```

[Storage](./storage.md)

### Utilities

Helper functions for URL extraction, social media parsing, and system detection.

```typescript { .api }
const utils: {
  sleep(millis?: number): Promise<void>;
  enqueueLinks(options: EnqueueLinksOptions): Promise<BatchAddRequestsResult>;
  social: {
    parseHandlesFromHtml(html: string): SocialHandles;
    emailsFromText(text: string): string[];
    phonesFromText(text: string): string[];
  };
  downloadListOfUrls(options: DownloadListOfUrlsOptions): Promise<string[]>;
  parseOpenGraph(html: string): Dictionary;
  isDocker(): boolean;
  isLambda(): boolean;
};
```

[Utilities](./utilities.md)

### Session Management

Session rotation and proxy management for handling anti-bot measures.

```typescript { .api }
class SessionPool {
  constructor(options?: SessionPoolOptions);
  getSession(request?: Request): Promise<Session>;
  markSessionBad(session: Session): Promise<void>;
}

class Session {
  constructor(options: SessionOptions);
  getCookieString(url: string): string;
  setPuppeteerCookies(page: Page, domain?: string): Promise<void>;
}
```

[Session Management](./session-management.md)

### Configuration and Proxies

Global configuration management and proxy handling for distributed crawling.

```typescript { .api }
class Configuration {
  static getGlobalConfig(): Configuration;
  get(key: string): any;
  set(key: string, value: any): void;
}

class ProxyConfiguration {
  constructor(options?: ProxyConfigurationOptions);
  newUrl(sessionId?: number | string): Promise<string | undefined>;
  newProxyInfo(sessionId?: number | string): Promise<ProxyInfo | undefined>;
}
```

[Configuration and Proxies](./configuration-proxies.md)

## Error Handling

Comprehensive error handling system with specialized error types for different failure scenarios.

```typescript { .api }
/**
 * Base error for requests that should not be retried
 */
class NonRetryableError extends Error {
  constructor(message?: string);
}

/**
 * Critical error that extends NonRetryableError
 */
class CriticalError extends NonRetryableError {
  constructor(message?: string);
}

/**
 * Error indicating a missing route handler
 */
class MissingRouteError extends CriticalError {
  constructor(message?: string);
}

/**
 * Error indicating that a request should be retried
 */
class RetryRequestError extends Error {
  constructor(message?: string, options?: { retryAfter?: number });
}

/**
 * Session-related error extending RetryRequestError
 */
class SessionError extends RetryRequestError {
  constructor(session: Session, message?: string, options?: { retryAfter?: number });
}

/**
 * Browser launch error for browser pool issues
 */
class BrowserLaunchError extends CriticalError {
  constructor(message?: string);
}

/**
 * Cookie parsing error for session management
 */
class CookieParseError extends Error {
  constructor(message?: string);
}
```

## Common Types

```typescript { .api }
interface BasicCrawlerOptions<Context> {
  requestList?: RequestList;
  requestQueue?: RequestQueue;
  requestHandler: (context: Context) => Promise<void>;
  maxRequestRetries?: number;
  maxRequestsPerCrawl?: number;
  maxConcurrency?: number;
  autoscaledPoolOptions?: AutoscaledPoolOptions;
  sessionPoolOptions?: SessionPoolOptions;
  useSessionPool?: boolean;
  persistCookiesPerSession?: boolean;
}

interface BasicCrawlingContext<UserData = Dictionary> {
  request: Request<UserData>;
  session?: Session;
  proxyInfo?: ProxyInfo;
  response?: IncomingMessage;
  crawler: BasicCrawler;
  log: Log;
  sendRequest<T>(overrideOptions?: Partial<OptionsInit>): Promise<T>;
  enqueueLinks(options?: EnqueueLinksOptions): Promise<BatchAddRequestsResult>;
  pushData(data: Dictionary | Dictionary[]): Promise<void>;
  setValue(key: string, value: any, options?: RecordOptions): Promise<void>;
  getValue<T>(key: string): Promise<T | null>;
}

interface Request<UserData = Dictionary> {
  url: string;
  loadedUrl?: string;
  uniqueKey: string;
  method?: HttpMethod;
  payload?: string;
  noRetry?: boolean;
  retryCount?: number;
  errorMessages?: string[];
  headers?: Dictionary;
  userData?: UserData;
  handledAt?: Date;
  label?: string;
  keepUrlFragment?: boolean;
}

interface ProxyInfo {
  url: string;
  hostname: string;
  port: number;
  auth?: {
    username: string;
    password: string;
  };
  protocol: string;
  sessionId?: string | number;
}

interface FinalStatistics {
  requestsFinished: number;
  requestsFailed: number;
  requestsRetries: number;
  requestsFailedPerMinute: number;
  requestsFinishedPerMinute: number;
  requestMinDurationMillis: number;
  requestMaxDurationMillis: number;
  requestTotalDurationMillis: number;
  crawlerStartedAt: Date;
  crawlerFinishedAt: Date;
  statsId: string;
}
```