tessl/npm-crawler

A ready-to-use web spider that works with proxies, asynchrony, rate limiting, configurable request pools, jQuery, and HTTP/2 support.

Workspace: tessl
Visibility: Public
Describes: pkg:npm/crawler@2.0.x (npm)

To install, run

npx @tessl/cli install tessl/npm-crawler@2.0.0


Crawler

Crawler is a ready-to-use web spider that works with proxies, asynchrony, rate limiting, configurable request pools, jQuery, and HTTP/2 support. Written in TypeScript, it is a comprehensive web crawling and scraping library for building sophisticated spiders, with configurable connection pools, rate limiting, priority queues, and automatic charset detection.

Package Information

  • Package Name: crawler
  • Package Type: npm
  • Language: TypeScript
  • Installation: npm install crawler
  • Node.js Version: Requires Node.js 18 or above
  • Module Type: ESM (no CommonJS export)

Core Imports

import Crawler from "crawler";

Basic Usage

import Crawler from "crawler";

// Create crawler with configuration
const c = new Crawler({
    maxConnections: 10,
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            const $ = res.$;
            // $ is Cheerio by default - jQuery-like server-side DOM manipulation
            console.log($("title").text());
        }
        done();
    },
});

// Add URLs to crawl
c.add("http://www.example.com");
c.add(["http://www.google.com/", "http://www.yahoo.com"]);

// Add URLs with custom options
c.add({
    url: "http://example.org/",
    jQuery: false,
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            console.log("Grabbed", res.body.length, "bytes");
        }
        done();
    },
});

Architecture

Crawler is built around several key components:

  • Main Crawler Class: Core crawler extending EventEmitter with request management
  • Rate Limiting System: Cluster and RateLimiter classes for controlling request frequency and concurrency
  • Queue System: Multi-priority queues for managing request execution order
  • Request Processing: Integration with Got HTTP client and Cheerio for DOM manipulation
  • Event System: EventEmitter-based architecture for request lifecycle management

Capabilities

Main Crawler Interface

Primary web crawler class with request queuing, rate limiting, and response processing. Handles concurrent requests with configurable limits and provides jQuery integration for DOM manipulation.

class Crawler extends EventEmitter {
  constructor(options?: CrawlerOptions);
  add(options: RequestConfig): void;
  send(options: RequestConfig): Promise<CrawlerResponse>;
  setLimiter(rateLimiterId: number, property: string, value: unknown): void;
  readonly queueSize: number;
}

type RequestConfig = string | RequestOptions | RequestOptions[];
type CrawlerResponse = any;
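
A short usage sketch of the interface above, assuming the response object exposes the same body field shown in Basic Usage:

import Crawler from "crawler";

const crawler = new Crawler({ maxConnections: 5 });

// send() returns a promise that resolves with the response
const res = await crawler.send({ url: "http://www.example.com", jQuery: false });
console.log("Fetched", res.body.length, "bytes");

// add() enqueues work; queueSize reports what is still pending
crawler.add("http://www.example.com");
console.log("Pending requests:", crawler.queueSize);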


Configuration Options

Comprehensive configuration system supporting global crawler settings and per-request options. Includes rate limiting, proxy support, retry mechanisms, and request customization.

interface CrawlerOptions extends Partial<GlobalOnlyOptions>, RequestOptions {}

interface GlobalOnlyOptions {
  maxConnections: number;
  priorityLevels: number;
  rateLimit: number;
  skipDuplicates: boolean;
  homogeneous: boolean;
  userAgents?: string | string[];
  silence?: boolean;
}

interface RequestOptions {
  url?: string | Function;
  method?: string;
  headers?: Record<string, unknown>;
  body?: string | Record<string, unknown>;
  jQuery?: boolean;
  timeout?: number;
  retries?: number;
  proxy?: string;
  proxies?: string[];
  callback?: (error: unknown, response: CrawlerResponse, done?: unknown) => void;
  // ... additional options
}
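
A hedged configuration sketch combining global-only options with per-request defaults from the interfaces above; the values are illustrative, and the comment interpreting rateLimit as a millisecond delay is an assumption:

import Crawler from "crawler";

const crawler = new Crawler({
    // Global-only options
    maxConnections: 2,
    rateLimit: 1000,        // assumed: delay in ms between requests on the default limiter
    priorityLevels: 3,
    skipDuplicates: true,
    userAgents: ["Mozilla/5.0 (compatible; ExampleBot/1.0)"],
    // Request options used as defaults for every queued request
    headers: { "accept-language": "en" },
    timeout: 15000,
    retries: 2,
    jQuery: true,
    callback: (error, res, done) => {
        if (error) console.error(error);
        else console.log(res.$("title").text());
        done();
    },
});

crawler.add("http://www.example.com");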


Rate Limiting System

Advanced rate limiting with multiple limiters, priority queues, and cluster management. Supports per-domain rate limiting and dynamic task reallocation.

class RateLimiter {
  constructor(options: RateLimiterOptions);
  submit(options: {priority: number} | number, task: Task): void;
  setRateLimit(rateLimit: number): void;
  readonly waitingSize: number;
  readonly runningSize: number;
}

class Cluster {
  constructor(options: ClusterOptions);
  getRateLimiter(id?: number): RateLimiter;
  hasRateLimiter(id: number): boolean;
  deleteRateLimiter(id: number): boolean;
  readonly waitingSize: number;
  readonly unfinishedSize: number;
}
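
A sketch of adjusting a limiter at runtime through the documented setLimiter method; the "rateLimit" property name and the per-request rateLimiterId option are assumptions rather than confirmed API details:

import Crawler from "crawler";

const crawler = new Crawler({ maxConnections: 10, rateLimit: 500 });

// Slow down the default limiter (id 0) while crawling is underway.
// The "rateLimit" property name is an assumption.
crawler.setLimiter(0, "rateLimit", 2000);

// Route hosts to separate limiters so one slow site does not throttle the rest.
// The per-request rateLimiterId option is an assumption.
crawler.add({ url: "http://www.example.com/a", rateLimiterId: 1 });
crawler.add({ url: "http://www.example.org/b", rateLimiterId: 2 });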


Queue Management

Multi-priority queue system for managing request execution order with configurable priority levels and efficient task scheduling.

class multiPriorityQueue<T> {
  constructor(priorities: number);
  enqueue(value: T, priority: number): void;
  dequeue(): T | undefined;
  size(): number;
}

class Queue<T> {
  constructor();
  enqueue(value: T): number;
  dequeue(): T | undefined;
  isEmpty(): boolean;
  front(): T | undefined;
  back(): T | undefined;
  readonly length: number;
}
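
The queue classes are internal, so no import path is shown; the sketch only exercises the methods listed above and assumes lower priority numbers dequeue first:

// The queue classes are internal modules, not a documented entry point,
// so this sketch assumes multiPriorityQueue is already in scope.
const queue = new multiPriorityQueue<string>(3); // three priority levels: 0..2

queue.enqueue("http://example.com/low", 2);
queue.enqueue("http://example.com/high", 0);

console.log(queue.dequeue()); // expected: "http://example.com/high"
console.log(queue.size());    // 1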


Utility Functions

Helper functions for type checking, object manipulation, URL validation, and data processing used throughout the crawler system.

function getType(value: unknown): string;
function isNumber(value: unknown): boolean;
function isFunction(value: unknown): boolean;
function isBoolean(value: unknown): boolean;
function setDefaults(target: Record<string, unknown>, source: Record<string, unknown>): Record<string, unknown>;
function isValidUrl(url: string): boolean;
function flattenDeep(array: any[]): any[];
function cleanObject(obj: Record<string, unknown>): Record<string, unknown>;
function lowerObjectKeys(obj: Record<string, unknown>): Record<string, unknown>;
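
The helpers are likewise internal and not a documented entry point; the behaviors sketched in the comments below are inferred from the names and signatures, not confirmed:

// Assumes the helpers are already in scope; no public import path is documented.
isValidUrl("http://www.example.com"); // presumably true
isValidUrl("not a url");              // presumably false

// Presumably fills missing keys on the target from the source
setDefaults({ timeout: 5000 }, { timeout: 10000, retries: 2 });

// Presumably lower-cases header-style keys
lowerObjectKeys({ "Content-Type": "text/html" });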


Events

The Crawler class extends EventEmitter and emits the following events:

  • 'schedule': Emitted when a request is scheduled for execution
  • 'request': Emitted when a request is about to be sent (unless skipEventRequest is true)
  • 'limiterChange': Emitted when rate limiter changes for a request
  • 'drain': Emitted when all requests are completed and the queue is empty
  • '_release': Internal event for task completion (private)
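
A brief listener sketch over the events above; the exact arguments passed to each listener are not specified here, so the handlers only log what they receive:

import Crawler from "crawler";

const crawler = new Crawler({ maxConnections: 5 });

// Fired when a request is scheduled for execution
crawler.on("schedule", (options: unknown) => console.log("scheduled:", options));

// Fired once every queued request has completed
crawler.on("drain", () => console.log("all requests finished"));

crawler.add("http://www.example.com");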

Dependencies

Crawler integrates with several key dependencies:

  • cheerio: Server-side DOM manipulation with jQuery-like API
  • got: Modern HTTP client for making requests
  • seenreq: Duplicate request detection and management
  • tslog: Structured logging with configurable output formats
  • iconv-lite: Character encoding detection and conversion
  • hpagent: HTTP/HTTPS proxy agent support
  • http2-wrapper: HTTP/2 protocol support

Error Handling

The crawler provides comprehensive error handling through:

  • Callback error parameters in request processing
  • Retry mechanisms with configurable attempts and intervals
  • Event-based error reporting
  • Proper exception propagation for promise-based operations
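
A minimal sketch combining the callback error parameter with retries, plus a try/catch around the promise-based send(); the statusCode field is assumed to follow the underlying Got response shape:

import Crawler from "crawler";

const crawler = new Crawler({
    retries: 2, // retry failed requests before reporting the error
    callback: (error, res, done) => {
        if (error) {
            console.error("request failed:", error);
        } else {
            console.log("status:", res.statusCode);
        }
        done();
    },
});

crawler.add("http://www.example.com");

// Promise-based requests reject on failure, so wrap them in try/catch
try {
    await crawler.send({ url: "http://www.example.com" });
} catch (err) {
    console.error("send failed:", err);
}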

Migration Notes

For users migrating from Crawler v1:

  • ESM Only: v2 is native ESM and no longer supports CommonJS
  • Method Changes: queue() renamed to add(), direct() renamed to send()
  • Option Updates: Several options have been renamed or deprecated
  • Node.js Requirement: Minimum Node.js version is now 18
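
A side-by-side sketch of the renamed methods; the v1 calls are shown as comments for comparison only:

// v1 (CommonJS)
// const Crawler = require("crawler");
// crawler.queue("http://www.example.com");            // now add()
// crawler.direct({ url: "http://www.example.com" });  // now send()

// v2 (ESM only, Node.js 18+)
import Crawler from "crawler";

const crawler = new Crawler();
crawler.add("http://www.example.com");
const res = await crawler.send({ url: "http://www.example.com" });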