tessl/npm-crawler

A ready-to-use web spider that works with proxies, asynchrony, rate limiting, configurable request pools, jQuery, and HTTP/2 support.

Workspace: tessl
Visibility: Public
Describes: pkg:npm/crawler@2.0.x (npm)

To install, run

npx @tessl/cli install tessl/npm-crawler@2.0.0


Crawler

Crawler is a ready-to-use web spider that works with proxies, asynchrony, rate limiting, configurable request pools, jQuery, and HTTP/2 support. Written in TypeScript, it is a comprehensive web crawling and scraping library for building sophisticated spiders, with configurable connection pools, rate limiting, priority queues, and automatic charset detection.

Package Information

  • Package Name: crawler
  • Package Type: npm
  • Language: TypeScript
  • Installation: npm install crawler
  • Node.js Version: Requires Node.js 18 or above
  • Module Type: ESM (no CommonJS export)

Core Imports

import Crawler from "crawler";

Basic Usage

import Crawler from "crawler";

// Create crawler with configuration
const c = new Crawler({
    maxConnections: 10,
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            const $ = res.$;
            // $ is Cheerio by default - jQuery-like server-side DOM manipulation
            console.log($("title").text());
        }
        done();
    },
});

// Add URLs to crawl
c.add("http://www.example.com");
c.add(["http://www.google.com/", "http://www.yahoo.com"]);

// Add URLs with custom options
c.add({
    url: "http://example.org/",
    jQuery: false,
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            console.log("Grabbed", res.body.length, "bytes");
        }
        done();
    },
});

Architecture

Crawler is built around several key components:

  • Main Crawler Class: Core crawler extending EventEmitter with request management
  • Rate Limiting System: Cluster and RateLimiter classes for controlling request frequency and concurrency
  • Queue System: Multi-priority queues for managing request execution order
  • Request Processing: Integration with Got HTTP client and Cheerio for DOM manipulation
  • Event System: EventEmitter-based architecture for request lifecycle management

Capabilities

Main Crawler Interface

Primary web crawler class with request queuing, rate limiting, and response processing. Handles concurrent requests with configurable limits and provides jQuery integration for DOM manipulation.

class Crawler extends EventEmitter {
  constructor(options?: CrawlerOptions);
  add(options: RequestConfig): void;
  send(options: RequestConfig): Promise<CrawlerResponse>;
  setLimiter(rateLimiterId: number, property: string, value: unknown): void;
  readonly queueSize: number;
}

type RequestConfig = string | RequestOptions | RequestOptions[];
type CrawlerResponse = any;
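
A short usage sketch of the interface above, assuming the response object exposes the same body field shown in Basic Usage:

import Crawler from "crawler";

const crawler = new Crawler({ maxConnections: 5 });

// send() returns a promise that resolves with the response
const res = await crawler.send({ url: "http://www.example.com", jQuery: false });
console.log("Fetched", res.body.length, "bytes");

// add() enqueues work; queueSize reports what is still pending
crawler.add("http://www.example.com");
console.log("Pending requests:", crawler.queueSize);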


Configuration Options

Comprehensive configuration system supporting global crawler settings and per-request options. Includes rate limiting, proxy support, retry mechanisms, and request customization.

interface CrawlerOptions extends Partial<GlobalOnlyOptions>, RequestOptions {}

interface GlobalOnlyOptions {
  maxConnections: number;
  priorityLevels: number;
  rateLimit: number;
  skipDuplicates: boolean;
  homogeneous: boolean;
  userAgents?: string | string[];
  silence?: boolean;
}

interface RequestOptions {
  url?: string | Function;
  method?: string;
  headers?: Record<string, unknown>;
  body?: string | Record<string, unknown>;
  jQuery?: boolean;
  timeout?: number;
  retries?: number;
  proxy?: string;
  proxies?: string[];
  callback?: (error: unknown, response: CrawlerResponse, done?: unknown) => void;
  // ... additional options
}
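
A hedged configuration sketch combining global-only options with per-request defaults from the interfaces above; the values are illustrative, and the comment interpreting rateLimit as a millisecond delay is an assumption:

import Crawler from "crawler";

const crawler = new Crawler({
    // Global-only options
    maxConnections: 2,
    rateLimit: 1000,        // assumed: delay in ms between requests on the default limiter
    priorityLevels: 3,
    skipDuplicates: true,
    userAgents: ["Mozilla/5.0 (compatible; ExampleBot/1.0)"],
    // Request options used as defaults for every queued request
    headers: { "accept-language": "en" },
    timeout: 15000,
    retries: 2,
    jQuery: true,
    callback: (error, res, done) => {
        if (error) console.error(error);
        else console.log(res.$("title").text());
        done();
    },
});

crawler.add("http://www.example.com");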


Rate Limiting System

Advanced rate limiting with multiple limiters, priority queues, and cluster management. Supports per-domain rate limiting and dynamic task reallocation.

class RateLimiter {
  constructor(options: RateLimiterOptions);
  submit(options: {priority: number} | number, task: Task): void;
  setRateLimit(rateLimit: number): void;
  readonly waitingSize: number;
  readonly runningSize: number;
}

class Cluster {
  constructor(options: ClusterOptions);
  getRateLimiter(id?: number): RateLimiter;
  hasRateLimiter(id: number): boolean;
  deleteRateLimiter(id: number): boolean;
  readonly waitingSize: number;
  readonly unfinishedSize: number;
}
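
A sketch of adjusting a limiter at runtime through the documented setLimiter method; the "rateLimit" property name and the per-request rateLimiterId option are assumptions rather than confirmed API details:

import Crawler from "crawler";

const crawler = new Crawler({ maxConnections: 10, rateLimit: 500 });

// Slow down the default limiter (id 0) while crawling is underway.
// The "rateLimit" property name is an assumption.
crawler.setLimiter(0, "rateLimit", 2000);

// Route hosts to separate limiters so one slow site does not throttle the rest.
// The per-request rateLimiterId option is an assumption.
crawler.add({ url: "http://www.example.com/a", rateLimiterId: 1 });
crawler.add({ url: "http://www.example.org/b", rateLimiterId: 2 });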


Queue Management

Multi-priority queue system for managing request execution order with configurable priority levels and efficient task scheduling.

class multiPriorityQueue<T> {
  constructor(priorities: number);
  enqueue(value: T, priority: number): void;
  dequeue(): T | undefined;
  size(): number;
}

class Queue<T> {
  constructor();
  enqueue(value: T): number;
  dequeue(): T | undefined;
  isEmpty(): boolean;
  front(): T | undefined;
  back(): T | undefined;
  readonly length: number;
}
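
The queue classes are internal, so no import path is shown; the sketch only exercises the methods listed above and assumes lower priority numbers dequeue first:

// The queue classes are internal modules, not a documented entry point,
// so this sketch assumes multiPriorityQueue is already in scope.
const queue = new multiPriorityQueue<string>(3); // three priority levels: 0..2

queue.enqueue("http://example.com/low", 2);
queue.enqueue("http://example.com/high", 0);

console.log(queue.dequeue()); // expected: "http://example.com/high"
console.log(queue.size());    // 1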


Utility Functions

Helper functions for type checking, object manipulation, URL validation, and data processing used throughout the crawler system.

function getType(value: unknown): string;
function isNumber(value: unknown): boolean;
function isFunction(value: unknown): boolean;
function isBoolean(value: unknown): boolean;
function setDefaults(target: Record<string, unknown>, source: Record<string, unknown>): Record<string, unknown>;
function isValidUrl(url: string): boolean;
function flattenDeep(array: any[]): any[];
function cleanObject(obj: Record<string, unknown>): Record<string, unknown>;
function lowerObjectKeys(obj: Record<string, unknown>): Record<string, unknown>;
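
The helpers are likewise internal and not a documented entry point; the behaviors sketched in the comments below are inferred from the names and signatures, not confirmed:

// Assumes the helpers are already in scope; no public import path is documented.
isValidUrl("http://www.example.com"); // presumably true
isValidUrl("not a url");              // presumably false

// Presumably fills missing keys on the target from the source
setDefaults({ timeout: 5000 }, { timeout: 10000, retries: 2 });

// Presumably lower-cases header-style keys
lowerObjectKeys({ "Content-Type": "text/html" });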


Events

The Crawler class extends EventEmitter and emits the following events:

  • 'schedule': Emitted when a request is scheduled for execution
  • 'request': Emitted when a request is about to be sent (unless skipEventRequest is true)
  • 'limiterChange': Emitted when rate limiter changes for a request
  • 'drain': Emitted when all requests are completed and the queue is empty
  • '_release': Internal event for task completion (private)
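
A brief listener sketch over the events above; the exact arguments passed to each listener are not specified here, so the handlers only log what they receive:

import Crawler from "crawler";

const crawler = new Crawler({ maxConnections: 5 });

// Fired when a request is scheduled for execution
crawler.on("schedule", (options: unknown) => console.log("scheduled:", options));

// Fired once every queued request has completed
crawler.on("drain", () => console.log("all requests finished"));

crawler.add("http://www.example.com");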

Dependencies

Crawler integrates with several key dependencies:

  • cheerio: Server-side DOM manipulation with jQuery-like API
  • got: Modern HTTP client for making requests
  • seenreq: Duplicate request detection and management
  • tslog: Structured logging with configurable output formats
  • iconv-lite: Character encoding detection and conversion
  • hpagent: HTTP/HTTPS proxy agent support
  • http2-wrapper: HTTP/2 protocol support

Error Handling

The crawler provides comprehensive error handling through:

  • Callback error parameters in request processing
  • Retry mechanisms with configurable attempts and intervals
  • Event-based error reporting
  • Proper exception propagation for promise-based operations
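
A minimal sketch combining the callback error parameter with retries, plus a try/catch around the promise-based send(); the statusCode field is assumed to follow the underlying Got response shape:

import Crawler from "crawler";

const crawler = new Crawler({
    retries: 2, // retry failed requests before reporting the error
    callback: (error, res, done) => {
        if (error) {
            console.error("request failed:", error);
        } else {
            console.log("status:", res.statusCode);
        }
        done();
    },
});

crawler.add("http://www.example.com");

// Promise-based requests reject on failure, so wrap them in try/catch
try {
    await crawler.send({ url: "http://www.example.com" });
} catch (err) {
    console.error("send failed:", err);
}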

Migration Notes

For users migrating from Crawler v1:

  • ESM Only: v2 is native ESM and no longer supports CommonJS
  • Method Changes: queue() renamed to add(), direct() renamed to send()
  • Option Updates: Several options have been renamed or deprecated
  • Node.js Requirement: Minimum Node.js version is now 18
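
A side-by-side sketch of the renamed methods; the v1 calls are shown as comments for comparison only:

// v1 (CommonJS)
// const Crawler = require("crawler");
// crawler.queue("http://www.example.com");            // now add()
// crawler.direct({ url: "http://www.example.com" });  // now send()

// v2 (ESM only, Node.js 18+)
import Crawler from "crawler";

const crawler = new Crawler();
crawler.add("http://www.example.com");
const res = await crawler.send({ url: "http://www.example.com" });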