
tessl/npm-crawler

A ready-to-use web spider that supports proxies, asynchronous crawling, rate limiting, configurable request pools, server-side jQuery, and HTTP/2.


evals/scenario-2/task.md

# News Feed Aggregator

Build a news feed aggregator that collects article titles from multiple news websites with proper request prioritization and concurrency control.

## Requirements

Create a function that:

  1. Accepts an array of news sources (each with a URL, name, and priority)
  2. Accepts a maximum concurrent connection limit
  3. Crawls each source website to extract article titles
  4. Processes sources in priority order (lower priority number = higher priority)
  5. Returns collected articles grouped by source name

## Functionality

### Priority-Based Processing

The system must respect priority levels where:

  • Priority 0 = highest priority (process first)
  • Priority 5 = normal priority (process after higher priorities)

When multiple requests are queued, higher priority requests (lower priority number) should be processed before lower priority requests.
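As a sketch of this ordering rule (illustrative code, not the crawler dependency's internal queue), a stable priority queue keeps lower priority numbers at the front while preserving insertion order for ties:

```javascript
// Minimal stable priority queue: lower number = higher priority;
// items with equal priority keep the order they were added in.
function createPriorityQueue() {
  const items = [];
  return {
    enqueue(item, priority) {
      // Insert after all items whose priority is <= this one (stable ties).
      let i = items.length;
      while (i > 0 && items[i - 1].priority > priority) i--;
      items.splice(i, 0, { item, priority });
    },
    dequeue() {
      const head = items.shift();
      return head ? head.item : undefined;
    },
  };
}
```

With priorities 5, 0, 5 queued in that order, dequeuing yields the priority-0 item first, then the two priority-5 items in insertion order, matching the behavior required above.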

### Concurrency Control

The crawler must limit the number of simultaneous active requests to the specified `maxConnections` value to avoid overwhelming servers.
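One way to sketch such a limit (a hypothetical helper, not the crawler dependency's API) is a counter plus a waiting list: a task runs only while fewer than `maxConnections` tasks are active, and each completion releases a slot:

```javascript
// Concurrency limiter sketch: at most maxConnections tasks run at once.
// Each task is a function returning a Promise.
function createLimiter(maxConnections) {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active >= maxConnections || waiting.length === 0) return;
    active++;
    const { task, resolve } = waiting.shift();
    task().then((result) => {
      active--;   // release the slot...
      next();     // ...and start the next waiting task, if any
      resolve(result);
    });
  };
  return (task) =>
    new Promise((resolve) => {
      waiting.push({ task, resolve });
      next();
    });
}
```

Queuing five jobs through `createLimiter(2)` never lets more than two run concurrently, which is exactly the invariant Test 2 below checks.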

### Data Extraction

For each website:

  • Extract the text content of all `h2.article-title` elements
  • Collect the extracted titles in an array
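In a real solution the crawler dependency's server-side jQuery handle would do this selection; as a dependency-free illustration over mock HTML, a regex-based extractor might look like (illustrative only, not a robust HTML parser):

```javascript
// Extract the text of every <h2 class="article-title"> from an HTML string.
// A regex stand-in for a proper selector; fine for controlled mock HTML.
function extractTitles(html) {
  const titles = [];
  const re = /<h2[^>]*class="[^"]*\barticle-title\b[^"]*"[^>]*>([\s\S]*?)<\/h2>/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    // Strip any nested tags and surrounding whitespace from the title text.
    titles.push(m[1].replace(/<[^>]+>/g, '').trim());
  }
  return titles;
}
```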

### Result Format

Return results as an array of objects, where each object contains:

  • source: the name of the news source
  • articles: array of article title strings from that source
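Concretely, the returned value described above might look like this (source names are hypothetical):

```javascript
// Example of the expected return shape: one object per source,
// each grouping that source's extracted article titles.
const exampleResult = [
  { source: 'Daily Planet', articles: ['Headline one', 'Headline two'] },
  { source: 'The Gazette', articles: ['Another headline'] },
];
```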

## Test Cases

### Test 1: Basic priority ordering @test

Given a crawler with maxConnections set to 1:

  • When three URLs are queued with priorities: 5, 0, 5
  • Then the URLs should be processed in priority order: priority 0 first, then the two priority 5 URLs in the order they were added

### Test 2: Concurrent request limiting @test

Given a crawler with maxConnections set to 2:

  • When five URLs are queued simultaneously
  • Then at most 2 requests should be active at any given time
  • And all five requests should eventually complete

### Test 3: Data extraction and collection @test

Given a mock HTML page with three article titles:

  • When the page is crawled
  • Then all three article titles should be extracted correctly
  • And stored in the results array

## Implementation Notes

  • Use the provided callback pattern to handle asynchronous request completion
  • Ensure proper cleanup by calling the completion callback after processing each request
  • Structure your code to be testable with mock HTML content
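The completion-callback discipline above can be sketched in isolation (illustrative names, not the crawler package's internals): each handler receives a `done` callback and must invoke it so the pool knows the request has finished:

```javascript
// Callback-pattern sketch: run a handler for each task; the handler signals
// completion by calling done(). onFinished fires once every task has done so.
function processQueue(tasks, handler, onFinished) {
  let remaining = tasks.length;
  if (remaining === 0) return onFinished();
  tasks.forEach((task) => {
    handler(task, function done() {
      remaining--;
      if (remaining === 0) onFinished();
    });
  });
}
```

Forgetting to call `done()` in a handler would leave `remaining` above zero forever, which is exactly the "proper cleanup" failure mode the note above warns about.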

## Dependencies { .dependencies }

### crawler { .dependency }

Web crawling framework with priority queue support and concurrent request management.
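Assuming the npm `crawler` package's documented API (a `Crawler` constructor taking `maxConnections` and a `callback(error, res, done)`, plus `queue()` accepting a `priority` option), basic usage might look like the following sketch (requires the package to be installed and network access, so it is illustrative rather than runnable here):

```javascript
const Crawler = require('crawler'); // npm "crawler" package

const c = new Crawler({
  maxConnections: 2, // cap on simultaneous requests
  callback: (error, res, done) => {
    if (!error) {
      const $ = res.$; // server-side jQuery (cheerio) over the fetched page
      $('h2.article-title').each((_, el) => console.log($(el).text()));
    }
    done(); // release the connection slot so queued requests can proceed
  },
});

// Lower priority number = processed sooner.
c.queue({ uri: 'https://example.com/news', priority: 0 });
```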

@generates

## API

```javascript
/**
 * Aggregates news from multiple sources with priority-based crawling
 *
 * @param {Array<{url: string, name: string, priority: number}>} sources - Array of news sources with URLs, names, and priorities
 * @param {number} maxConnections - Maximum number of concurrent requests
 * @returns {Promise<Array<{source: string, articles: Array<string>}>>} Promise that resolves with collected articles grouped by source
 */
function aggregateNews(sources, maxConnections) {
  // Implementation here
}

module.exports = { aggregateNews };
```
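A dependency-free sketch of how the pieces above could compose follows. The `fetchPage` parameter is an addition made here purely for testability and is not part of the required signature; a real solution would delegate fetching, connection pooling, and priority ordering to the crawler dependency:

```javascript
// Illustrative, self-contained sketch of the aggregator.
// fetchPage(url) -> Promise<string> is injected so mock HTML can be used.
function aggregateNews(sources, maxConnections, fetchPage) {
  // Order by priority: lower number first; ties keep insertion order.
  const queue = sources
    .map((s, i) => ({ ...s, i }))
    .sort((a, b) => a.priority - b.priority || a.i - b.i);
  const results = sources.map((s) => ({ source: s.name, articles: [] }));
  const byName = new Map(results.map((r) => [r.source, r]));
  let active = 0;
  let finished = 0;

  return new Promise((resolve) => {
    if (sources.length === 0) return resolve(results);
    const next = () => {
      // Start requests while slots are free, never exceeding maxConnections.
      while (active < maxConnections && queue.length > 0) {
        const src = queue.shift();
        active++;
        fetchPage(src.url).then((html) => {
          // Regex stand-in for a real HTML parser (see Data Extraction).
          const re = /<h2[^>]*class="[^"]*\barticle-title\b[^"]*"[^>]*>([\s\S]*?)<\/h2>/g;
          let m;
          while ((m = re.exec(html)) !== null) {
            byName.get(src.name).articles.push(m[1].replace(/<[^>]+>/g, '').trim());
          }
          active--;
          finished++;
          if (finished === sources.length) resolve(results);
          else next();
        });
      }
    };
    next();
  });
}
```

With `maxConnections` set to 1 and mixed priorities, the fetch order follows the priority rule from Test 1, and the resolved array groups titles per source as the Result Format section specifies.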

Install with Tessl CLI:

```shell
npx tessl i tessl/npm-crawler
```
