The primary web crawler class for web scraping and crawling operations. It extends EventEmitter and provides request queuing, rate limiting, and response processing capabilities.
/**
* Main web crawler class with request management and event handling
*/
class Crawler extends EventEmitter {
constructor(options?: CrawlerOptions);
/** Current crawler configuration */
options: CrawlerOptions;
/** Seenreq instance for duplicate detection */
seen: any;
/** Current queue size (readonly) */
readonly queueSize: number;
}

Creates a new Crawler instance with optional configuration.
/**
* Creates a new Crawler instance
* @param options - Optional crawler configuration
*/
constructor(options?: CrawlerOptions);

Usage Example:
import Crawler from "crawler";
// Basic crawler with default options
const basicCrawler = new Crawler();
// Configured crawler
const crawler = new Crawler({
maxConnections: 5,
rateLimit: 1000, // 1 second between requests
callback: (error, res, done) => {
if (!error) {
console.log(res.$("title").text());
}
done();
},
});

Adds requests to the crawler queue for processing.
/**
* Add requests to the crawler queue
* @param options - Request configuration (string URL, options object, or array)
*/
add(options: RequestConfig): void;
type RequestConfig = string | RequestOptions | RequestOptions[];

Usage Examples:
// Add single URL
crawler.add("https://example.com");
// Add multiple URLs
crawler.add([
"https://example.com",
"https://google.com",
"https://github.com"
]);
// Add with custom options
crawler.add({
url: "https://api.example.com/data",
method: "POST",
body: JSON.stringify({ key: "value" }),
headers: { "Content-Type": "application/json" },
callback: (error, res, done) => {
console.log("API response:", res.body);
done();
}
});
// Add with different priorities
crawler.add({
url: "https://high-priority.com",
priority: 1, // Higher priority
callback: (error, res, done) => {
console.log("High priority request completed");
done();
}
});

Sends a request directly without adding to the queue, returning a Promise.
/**
* Send a request directly and return a Promise
* @param options - Request configuration
* @returns Promise resolving to crawler response
*/
send(options: RequestConfig): Promise<CrawlerResponse>;

Usage Examples:
// Promise-based direct request
try {
const response = await crawler.send("https://example.com");
console.log(response.$("title").text());
} catch (error) {
console.error("Request failed:", error);
}
// Direct request with options
const response = await crawler.send({
url: "https://api.example.com/data",
method: "POST",
body: { key: "value" },
isJson: true
});
console.log("API data:", response.body);Configures rate limiter properties dynamically.
/**
* Set rate limiter property
* @param rateLimiterId - ID of the rate limiter to modify
* @param property - Property name to change (currently only "rateLimit" supported)
* @param value - New value for the property
*/
setLimiter(rateLimiterId: number, property: string, value: unknown): void;

Usage Example:
// Change rate limit for default limiter
crawler.setLimiter(0, "rateLimit", 2000); // 2 seconds between requests
// Change rate limit for specific limiter
crawler.setLimiter(5, "rateLimit", 500); // 0.5 seconds for limiter ID 5
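The setLimiter() signature and the limiterChange event suggest that individual requests can be routed to a specific limiter. The sketch below assumes a hypothetical per-request rateLimiterId option for that purpose; this option is not confirmed by this section.
// Sketch only: configure a slower limiter and (assuming a per-request
// "rateLimiterId" option exists) route selected requests through it.
crawler.setLimiter(1, "rateLimit", 3000); // limiter 1: 3 seconds between requests
crawler.add({
  url: "https://rate-limited-api.example.com/items",
  rateLimiterId: 1, // hypothetical per-request option
  callback: (error, res, done) => {
    if (!error) {
      console.log("Fetched via limiter 1");
    }
    done();
  },
});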
Gets the current size of the request queue.

/**
* Current number of requests in the queue (readonly)
*/
readonly queueSize: number;

Usage Example:
console.log(`Current queue size: ${crawler.queueSize}`);
// Monitor queue size
setInterval(() => {
if (crawler.queueSize === 0) {
console.log("Queue is empty");
}
}, 1000);

Legacy methods maintained for backward compatibility.
/**
* @deprecated Use add() instead
*/
queue(options: RequestConfig): void;
/**
* @deprecated Use send() instead
*/
direct(options: RequestConfig): Promise<CrawlerResponse>;
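For migration, the deprecated methods map directly onto their replacements; the following is a sketch based on the deprecation notes above.
// Deprecated style
crawler.queue("https://example.com");
const legacy = await crawler.direct("https://example.com");
// Current style
crawler.add("https://example.com");
const current = await crawler.send("https://example.com");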
The crawler response object contains the processed HTTP response with additional metadata.

/**
* Crawler response object (typed as 'any' in implementation)
* Contains processed HTTP response with additional metadata
*/
type CrawlerResponse = any & {
/** Response body (string or parsed JSON) */
body: string | any;
/** HTTP response headers */
headers: Record<string, unknown>;
/** Request options used */
options: RequestOptions;
/** Detected character encoding */
charset: string | null;
/** Cheerio jQuery function (if jQuery is enabled) */
$?: any;
};

The Crawler class emits various events during operation:
// Request scheduled
crawler.on('schedule', (options) => {
console.log('Request scheduled:', options.url);
});
// Request about to be sent
crawler.on('request', (options) => {
console.log('Sending request to:', options.url);
});
// Rate limiter changed
crawler.on('limiterChange', (options, rateLimiterId) => {
console.log('Limiter changed for:', options.url, 'to limiter:', rateLimiterId);
});
// All requests completed
crawler.on('drain', () => {
console.log('All requests completed');
});
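Because 'drain' fires once the queue is empty, it can be combined with add() to await completion of a batch of requests. The helper below is a sketch, not part of the crawler API.
// Sketch: turn queue completion into a Promise using the 'drain' event.
function crawlAll(urls) {
  return new Promise((resolve) => {
    crawler.once("drain", () => resolve()); // fires once the queue is empty
    crawler.add(urls);
  });
}
await crawlAll(["https://example.com", "https://github.com"]);
console.log("All queued requests finished");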
Errors are handled through callback functions and promise rejections:

// Callback-based error handling
crawler.add({
url: "https://invalid-url",
callback: (error, response, done) => {
if (error) {
console.error("Request failed:", error.message);
// Handle error (retry, log, etc.)
} else {
// Process successful response
console.log("Success:", response.body.length, "bytes");
}
done(); // Always call done() to release the connection
}
});
// Promise-based error handling
try {
const response = await crawler.send("https://example.com");
console.log("Success:", response.body);
} catch (error) {
console.error("Request failed:", error.message);
}
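A common follow-up is retrying failed requests. The sketch below re-adds a failed URL a bounded number of times using only add() and the per-request callback; maxRetries and retryCounts are illustrative helpers, and the library may also offer built-in retry options not covered in this section.
// Sketch: re-add failed URLs up to a fixed number of attempts.
const maxRetries = 3;
const retryCounts = new Map();
function addWithRetry(url) {
  crawler.add({
    url,
    callback: (error, res, done) => {
      if (error) {
        const attempts = (retryCounts.get(url) || 0) + 1;
        retryCounts.set(url, attempts);
        if (attempts <= maxRetries) {
          console.warn(`Retrying ${url} (attempt ${attempts})`);
          addWithRetry(url); // put the URL back on the queue
        } else {
          console.error(`Giving up on ${url}:`, error.message);
        }
      } else {
        console.log("Fetched:", url, res.body.length, "bytes");
      }
      done(); // always release the connection
    },
  });
}
addWithRetry("https://example.com");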