A ready-to-use web spider that works with proxies, asynchrony, rate limit, configurable request pools, jQuery, and HTTP/2 support.
npx @tessl/cli install tessl/npm-crawler@2.0.0

Crawler is a comprehensive web crawling and scraping library built with TypeScript that lets developers build sophisticated web spiders with configurable connection pools, rate limiting, priority queues, and automatic charset detection.
npm install crawler

import Crawler from "crawler";
// Create crawler with configuration
const c = new Crawler({
maxConnections: 10,
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
const $ = res.$;
// $ is Cheerio by default - jQuery-like server-side DOM manipulation
console.log($("title").text());
}
done();
},
});
// Add URLs to crawl
c.add("http://www.example.com");
c.add(["http://www.google.com/", "http://www.yahoo.com"]);
// Add URLs with custom options
c.add({
url: "http://example.org/",
jQuery: false,
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
console.log("Grabbed", res.body.length, "bytes");
}
done();
},
});

Crawler is built around several key components:
Primary web crawler class with request queuing, rate limiting, and response processing. Handles concurrent requests with configurable limits and provides jQuery integration for DOM manipulation.
class Crawler extends EventEmitter {
constructor(options?: CrawlerOptions);
add(options: RequestConfig): void;
send(options: RequestConfig): Promise<CrawlerResponse>;
setLimiter(rateLimiterId: number, property: string, value: unknown): void;
readonly queueSize: number;
}
type RequestConfig = string | RequestOptions | RequestOptions[];
type CrawlerResponse = any;
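Beyond the queue-driven add() flow shown above, send() issues a single request outside the queue and resolves with the response. A minimal sketch, assuming limiter id 0 refers to the crawler's default rate limiter:

```typescript
import Crawler from "crawler";

const crawler = new Crawler({ maxConnections: 5 });

// send() resolves with the response instead of going through the queue/callback flow.
const res = await crawler.send({ url: "http://www.example.com", jQuery: false });
console.log("Fetched", res.body.length, "bytes");

// Adjust a limiter at runtime; id 0 is assumed to be the default limiter.
crawler.setLimiter(0, "rateLimit", 2000);

// queueSize reports how many queued requests have not yet completed.
console.log("Pending requests:", crawler.queueSize);
```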
Comprehensive configuration system supporting global crawler settings and per-request options. Includes rate limiting, proxy support, retry mechanisms, and request customization.

interface CrawlerOptions extends Partial<GlobalOnlyOptions>, RequestOptions {}
interface GlobalOnlyOptions {
maxConnections: number;
priorityLevels: number;
rateLimit: number;
skipDuplicates: boolean;
homogeneous: boolean;
userAgents?: string | string[];
silence?: boolean;
}
interface RequestOptions {
url?: string | Function;
method?: string;
headers?: Record<string, unknown>;
body?: string | Record<string, unknown>;
jQuery?: boolean;
timeout?: number;
retries?: number;
proxy?: string;
proxies?: string[];
callback?: (error: unknown, response: CrawlerResponse, done?: unknown) => void;
// ... additional options
}
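As a sketch of how the two layers combine, global defaults go on the constructor and can be overridden per request; the header, proxy address, and timeout values below are placeholders:

```typescript
import Crawler from "crawler";

// Global options (GlobalOnlyOptions plus request defaults) are passed to the constructor.
const crawler = new Crawler({
  maxConnections: 2,
  rateLimit: 1000,        // minimum delay in milliseconds between requests on a limiter
  skipDuplicates: true,   // skip URLs that have already been queued
  userAgents: ["Mozilla/5.0 (compatible; MyCrawler/1.0)"],
  retries: 2,
  callback: (error, res, done) => {
    if (error) {
      console.error(error);
    } else {
      console.log(res.$("title").text());
    }
    done();
  },
});

// Per-request options override the global defaults for a single URL.
crawler.add({
  url: "http://www.example.com/slow-page",
  headers: { "Accept-Language": "en-US" },
  proxy: "http://127.0.0.1:8080",
  timeout: 30000,
});
```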
Advanced rate limiting with multiple limiters, priority queues, and cluster management. Supports per-domain rate limiting and dynamic task reallocation.

class RateLimiter {
constructor(options: RateLimiterOptions);
submit(options: {priority: number} | number, task: Task): void;
setRateLimit(rateLimit: number): void;
readonly waitingSize: number;
readonly runningSize: number;
}
class Cluster {
constructor(options: ClusterOptions);
getRateLimiter(id?: number): RateLimiter;
hasRateLimiter(id: number): boolean;
deleteRateLimiter(id: number): boolean;
readonly waitingSize: number;
readonly unfinishedSize: number;
}
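These classes are managed internally by Crawler; from user code the same machinery is usually reached through setLimiter() and a per-request limiter id. A sketch, assuming rateLimiterId is one of the elided per-request options above:

```typescript
import Crawler from "crawler";

const crawler = new Crawler({ maxConnections: 4 });

// Give each host its own limiter and pace them independently.
crawler.setLimiter(1, "rateLimit", 500);   // limiter 1: one request every 500 ms
crawler.setLimiter(2, "rateLimit", 3000);  // limiter 2: one request every 3 s

// rateLimiterId (assumed option) routes each request to the limiter it should wait on.
crawler.add({ url: "http://www.example.com/fast", rateLimiterId: 1 });
crawler.add({ url: "http://www.example.org/slow", rateLimiterId: 2 });
```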
Multi-priority queue system for managing request execution order with configurable priority levels and efficient task scheduling.

class multiPriorityQueue<T> {
constructor(priorities: number);
enqueue(value: T, priority: number): void;
dequeue(): T | undefined;
size(): number;
}
class Queue<T> {
constructor();
enqueue(value: T): number;
dequeue(): T | undefined;
isEmpty(): boolean;
front(): T | undefined;
back(): T | undefined;
readonly length: number;
}
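The queues themselves are internal, but their effect is visible through the priorityLevels option and the per-request priority option (the latter is among the elided options above; lower numbers are assumed to be served first). A sketch:

```typescript
import Crawler from "crawler";

// priorityLevels controls how many buckets the internal multiPriorityQueue keeps.
const crawler = new Crawler({ maxConnections: 1, priorityLevels: 3 });

const logDone =
  (label: string) => (error: unknown, res: unknown, done: any) => {
    if (error) console.error(error);
    else console.log("Finished:", label);
    done();
  };

// Requests still waiting in the queue are dequeued by priority (assumed: lower value first).
crawler.add({ url: "http://www.example.com/low", priority: 2, callback: logDone("low") });
crawler.add({ url: "http://www.example.com/high", priority: 0, callback: logDone("high") });
```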
Helper functions for type checking, object manipulation, URL validation, and data processing used throughout the crawler system.

function getType(value: unknown): string;
function isNumber(value: unknown): boolean;
function isFunction(value: unknown): boolean;
function isBoolean(value: unknown): boolean;
function setDefaults(target: Record<string, unknown>, source: Record<string, unknown>): Record<string, unknown>;
function isValidUrl(url: string): boolean;
function flattenDeep(array: any[]): any[];
function cleanObject(obj: Record<string, unknown>): Record<string, unknown>;
function lowerObjectKeys(obj: Record<string, unknown>): Record<string, unknown>;

The Crawler class extends EventEmitter and emits the following events:
- 'schedule': Emitted when a request is scheduled for execution
- 'request': Emitted when a request is about to be sent (unless skipEventRequest is true)
- 'limiterChange': Emitted when the rate limiter changes for a request
- 'drain': Emitted when all requests are completed and the queue is empty
- '_release': Internal event for task completion (private)
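Because Crawler extends EventEmitter, these can be observed with on(). A short sketch (the 'schedule' handler argument is assumed to be the request's options object):

```typescript
import Crawler from "crawler";

const crawler = new Crawler({ maxConnections: 2 });

// Fired once every queued request has completed.
crawler.on("drain", () => {
  console.log("All requests completed");
});

// Fired when a request is scheduled; the argument is assumed to be its options object.
crawler.on("schedule", (options) => {
  console.log("Scheduled:", options.url);
});

crawler.add(["http://www.example.com", "http://www.example.org"]);
```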
Crawler integrates with several key dependencies:

The crawler provides comprehensive error handling through:
For users migrating from Crawler v1:
queue() has been renamed to add(), and direct() has been renamed to send().
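For example, the renamed calls side by side (a sketch; the URL is a placeholder):

```typescript
import Crawler from "crawler";

const crawler = new Crawler();

// v1: crawler.queue("http://www.example.com");
crawler.add("http://www.example.com");

// v1: crawler.direct(...); in v2, send() returns a promise with the response.
const res = await crawler.send("http://www.example.com");
console.log("Fetched", res.body.length, "bytes");
```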