Main Crawler Interface

The primary web crawler class providing request queuing, rate limiting, and response processing capabilities.

Capabilities

Crawler Class

Main crawler class extending EventEmitter for web scraping and crawling operations.

/**
 * Main web crawler class with request management and event handling
 */
class Crawler extends EventEmitter {
  constructor(options?: CrawlerOptions);
  
  /** Current crawler configuration */
  options: CrawlerOptions;
  
  /** Seenreq instance for duplicate detection */
  seen: any;
  
  /** Current queue size (readonly) */
  readonly queueSize: number;
}
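
The configuration passed to the constructor is exposed on options, and queueSize reflects pending work. A minimal sketch reading both, using only the members documented above:

import Crawler from "crawler";

const crawler = new Crawler({ maxConnections: 2 });

// Inspect the stored configuration and the (initially empty) queue
console.log(crawler.options.maxConnections); // 2
console.log(crawler.queueSize); // 0 until requests are added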

Constructor

Creates a new Crawler instance with optional configuration.

/**
 * Creates a new Crawler instance
 * @param options - Optional crawler configuration
 */
constructor(options?: CrawlerOptions);

Usage Example:

import Crawler from "crawler";

// Basic crawler
const crawler = new Crawler();

// Configured crawler
const configuredCrawler = new Crawler({
    maxConnections: 5,
    rateLimit: 1000, // 1 second between requests
    callback: (error, res, done) => {
        if (!error) {
            console.log(res.$("title").text());
        }
        done();
    },
});

Add Method

Adds requests to the crawler queue for processing.

/**
 * Add requests to the crawler queue
 * @param options - Request configuration (string URL, options object, or array)
 */
add(options: RequestConfig): void;

type RequestConfig = string | RequestOptions | RequestOptions[];

Usage Examples:

// Add single URL
crawler.add("https://example.com");

// Add multiple URLs
crawler.add([
    "https://example.com",
    "https://google.com",
    "https://github.com"
]);

// Add with custom options
crawler.add({
    url: "https://api.example.com/data",
    method: "POST",
    body: JSON.stringify({ key: "value" }),
    headers: { "Content-Type": "application/json" },
    callback: (error, res, done) => {
        if (!error) {
            console.log("API response:", res.body);
        }
        done();
    }
});

// Add with different priorities
crawler.add({
    url: "https://high-priority.com",
    priority: 1, // Higher priority
    callback: (error, res, done) => {
        console.log("High priority request completed");
        done();
    }
});

Send Method

Sends a request directly without adding to the queue, returning a Promise.

/**
 * Send a request directly and return a Promise
 * @param options - Request configuration
 * @returns Promise resolving to crawler response
 */
send(options: RequestConfig): Promise<CrawlerResponse>;

Usage Examples:

// Promise-based direct request
try {
    const response = await crawler.send("https://example.com");
    console.log(response.$("title").text());
} catch (error) {
    console.error("Request failed:", error);
}

// Direct request with options
const response = await crawler.send({
    url: "https://api.example.com/data",
    method: "POST",
    body: { key: "value" },
    isJson: true
});
console.log("API data:", response.body);

Set Limiter Method

Configures rate limiter properties dynamically.

/**
 * Set rate limiter property
 * @param rateLimiterId - ID of the rate limiter to modify
 * @param property - Property name to change (currently only "rateLimit" supported)
 * @param value - New value for the property
 */
setLimiter(rateLimiterId: number, property: string, value: unknown): void;

Usage Example:

// Change rate limit for default limiter
crawler.setLimiter(0, "rateLimit", 2000); // 2 seconds between requests

// Change rate limit for specific limiter
crawler.setLimiter(5, "rateLimit", 500); // 0.5 seconds for limiter ID 5
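
The limiter ID matches the one a request was assigned to. The sketch below assumes a rateLimiterId request option routes a request to the numbered limiter (an assumption here; verify against the rate-limiting documentation), so setLimiter() can then retune that group independently:

// Assumption: a rateLimiterId request option assigns the request to
// the numbered limiter; check the rate-limiting docs before relying on it
crawler.add({
    url: "https://slow-api.example.com/data",
    rateLimiterId: 5,
    callback: (error, res, done) => {
        done();
    },
});

// Retune limiter 5 without affecting the default limiter (ID 0)
crawler.setLimiter(5, "rateLimit", 500);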

Queue Size Property

Gets the current size of the request queue.

/**
 * Current number of requests in the queue (readonly)
 */
readonly queueSize: number;

Usage Example:

console.log(`Current queue size: ${crawler.queueSize}`);

// Monitor queue size
setInterval(() => {
    if (crawler.queueSize === 0) {
        console.log("Queue is empty");
    }
}, 1000);

Deprecated Methods

Legacy methods maintained for backward compatibility.

/**
 * @deprecated Use add() instead
 */
queue(options: RequestConfig): void;

/**
 * @deprecated Use send() instead
 */
direct(options: RequestConfig): Promise<CrawlerResponse>;
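
Since the replacements take the same arguments, migration is a rename; for example:

// Before (deprecated)
crawler.queue("https://example.com");
const legacy = await crawler.direct("https://example.com");

// After
crawler.add("https://example.com");
const current = await crawler.send("https://example.com");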

Response Object

The crawler response object contains the processed HTTP response with additional metadata.

/**
 * Crawler response object (typed as 'any' in implementation)
 * Contains processed HTTP response with additional metadata
 */
type CrawlerResponse = {
  /** Response body (string or parsed JSON) */
  body: string | any;
  
  /** HTTP response headers */
  headers: Record<string, unknown>;
  
  /** Request options used */
  options: RequestOptions;
  
  /** Detected character encoding */
  charset: string | null;
  
  /** Cheerio jQuery function (if jQuery is enabled) */
  $?: any;
} & Record<string, any>;
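
A typical callback reads several of these fields together. A short sketch, assuming the jQuery option is enabled so that res.$ is populated:

crawler.add({
    url: "https://example.com",
    callback: (error, res, done) => {
        if (!error) {
            console.log("Content-Type:", res.headers["content-type"]);
            console.log("Charset:", res.charset);
            console.log("Requested URL:", res.options.url);
            console.log("Title:", res.$("title").text());
        }
        done();
    },
});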

Event Handling

The Crawler class emits various events during operation:

// Request scheduled
crawler.on("schedule", (options) => {
    console.log("Request scheduled:", options.url);
});

// Request about to be sent
crawler.on("request", (options) => {
    console.log("Sending request to:", options.url);
});

// Rate limiter changed
crawler.on("limiterChange", (options, rateLimiterId) => {
    console.log("Limiter changed for:", options.url, "to limiter:", rateLimiterId);
});

// All requests completed
crawler.on("drain", () => {
    console.log("All requests completed");
});
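
Combining these events with a callback covers the full crawl lifecycle. A minimal sketch using only the events and options documented above:

import Crawler from "crawler";

const crawler = new Crawler({
    maxConnections: 2,
    callback: (error, res, done) => {
        if (!error) console.log("Fetched:", res.options.url);
        done();
    },
});

// Fires once the queue empties and all in-flight requests finish
crawler.on("drain", () => console.log("Crawl finished"));

crawler.add(["https://example.com", "https://example.org"]);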

Error Handling

Errors are handled through callback functions and promise rejections:

// Callback-based error handling
crawler.add({
    url: "https://invalid-url",
    callback: (error, response, done) => {
        if (error) {
            console.error("Request failed:", error.message);
            // Handle error (retry, log, etc.)
        } else {
            // Process successful response
            console.log("Success:", response.body.length, "bytes");
        }
        done(); // Always call done() to release the connection
    }
});

// Promise-based error handling
try {
    const response = await crawler.send("https://example.com");
    console.log("Success:", response.body);
} catch (error) {
    console.error("Request failed:", error.message);
}
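
For transient failures, one option is to retry manually inside the callback. The sketch below instead assumes retry-related request options (retries, retryInterval; the names are assumptions here, so verify them against the configuration documentation):

crawler.add({
    url: "https://flaky.example.com",
    retries: 3, // assumed option: retry a failed request up to 3 times
    retryInterval: 2000, // assumed option: wait 2s between attempts
    callback: (error, res, done) => {
        if (error) {
            console.error("Failed after retries:", error.message);
        }
        done();
    },
});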