
tessl/npm-crawler

A ready-to-use web spider that supports proxies, asynchronous crawling, rate limiting, configurable request pools, server-side jQuery, and HTTP/2.


evals/scenario-2/task.md

# News Feed Aggregator

Build a news feed aggregator that collects article titles from multiple news websites with proper request prioritization and concurrency control.

## Requirements

Create a function that:

  1. Accepts an array of news sources (each with a URL, name, and priority)
  2. Accepts a maximum concurrent connection limit
  3. Crawls each source website to extract article titles
  4. Processes sources in priority order (lower priority number = higher priority)
  5. Returns collected articles grouped by source name

## Functionality

### Priority-Based Processing

The system must respect priority levels where:

  • Priority 0 = highest priority (process first)
  • Priority 5 = normal priority (process after higher priorities)

When multiple requests are queued, higher priority requests (lower priority number) should be processed before lower priority requests.
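As a sketch of this ordering rule (illustrative code, not the crawler dependency's internal queue), a stable priority queue keeps lower priority numbers at the front while preserving insertion order for ties:

```javascript
// Minimal stable priority queue: lower number = higher priority;
// items with equal priority keep the order they were added in.
function createPriorityQueue() {
  const items = [];
  return {
    enqueue(item, priority) {
      // Insert after all items whose priority is <= this one (stable ties).
      let i = items.length;
      while (i > 0 && items[i - 1].priority > priority) i--;
      items.splice(i, 0, { item, priority });
    },
    dequeue() {
      const head = items.shift();
      return head ? head.item : undefined;
    },
  };
}
```

With priorities 5, 0, 5 queued in that order, dequeuing yields the priority-0 item first, then the two priority-5 items in insertion order, matching the behavior required above.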

### Concurrency Control

The crawler must limit the number of simultaneous active requests to the specified `maxConnections` value to avoid overwhelming servers.
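One way to sketch such a limit (a hypothetical helper, not the crawler dependency's API) is a counter plus a waiting list: a task runs only while fewer than `maxConnections` tasks are active, and each completion releases a slot:

```javascript
// Concurrency limiter sketch: at most maxConnections tasks run at once.
// Each task is a function returning a Promise.
function createLimiter(maxConnections) {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active >= maxConnections || waiting.length === 0) return;
    active++;
    const { task, resolve } = waiting.shift();
    task().then((result) => {
      active--;   // release the slot...
      next();     // ...and start the next waiting task, if any
      resolve(result);
    });
  };
  return (task) =>
    new Promise((resolve) => {
      waiting.push({ task, resolve });
      next();
    });
}
```

Queuing five jobs through `createLimiter(2)` never lets more than two run concurrently, which is exactly the invariant Test 2 below checks.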

### Data Extraction

For each website:

  • Extract the text content of all `h2.article-title` elements
  • Collect the extracted titles in an array
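In a real solution the crawler dependency's server-side jQuery handle would do this selection; as a dependency-free illustration over mock HTML, a regex-based extractor might look like (illustrative only, not a robust HTML parser):

```javascript
// Extract the text of every <h2 class="article-title"> from an HTML string.
// A regex stand-in for a proper selector; fine for controlled mock HTML.
function extractTitles(html) {
  const titles = [];
  const re = /<h2[^>]*class="[^"]*\barticle-title\b[^"]*"[^>]*>([\s\S]*?)<\/h2>/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    // Strip any nested tags and surrounding whitespace from the title text.
    titles.push(m[1].replace(/<[^>]+>/g, '').trim());
  }
  return titles;
}
```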

### Result Format

Return results as an array of objects, where each object contains:

  • source: the name of the news source
  • articles: array of article title strings from that source
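Concretely, the returned value described above might look like this (source names are hypothetical):

```javascript
// Example of the expected return shape: one object per source,
// each grouping that source's extracted article titles.
const exampleResult = [
  { source: 'Daily Planet', articles: ['Headline one', 'Headline two'] },
  { source: 'The Gazette', articles: ['Another headline'] },
];
```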

## Test Cases

### Test 1: Basic priority ordering @test

Given a crawler with maxConnections set to 1:

  • When three URLs are queued with priorities: 5, 0, 5
  • Then the URLs should be processed in priority order: priority 0 first, then the two priority 5 URLs in the order they were added

### Test 2: Concurrent request limiting @test

Given a crawler with maxConnections set to 2:

  • When five URLs are queued simultaneously
  • Then at most 2 requests should be active at any given time
  • And all five requests should eventually complete

### Test 3: Data extraction and collection @test

Given a mock HTML page with three article titles:

  • When the page is crawled
  • Then all three article titles should be extracted correctly
  • And stored in the results array

## Implementation Notes

  • Use the provided callback pattern to handle asynchronous request completion
  • Ensure proper cleanup by calling the completion callback after processing each request
  • Structure your code to be testable with mock HTML content
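The completion-callback discipline above can be sketched in isolation (illustrative names, not the crawler package's internals): each handler receives a `done` callback and must invoke it so the pool knows the request has finished:

```javascript
// Callback-pattern sketch: run a handler for each task; the handler signals
// completion by calling done(). onFinished fires once every task has done so.
function processQueue(tasks, handler, onFinished) {
  let remaining = tasks.length;
  if (remaining === 0) return onFinished();
  tasks.forEach((task) => {
    handler(task, function done() {
      remaining--;
      if (remaining === 0) onFinished();
    });
  });
}
```

Forgetting to call `done()` in a handler would leave `remaining` above zero forever, which is exactly the "proper cleanup" failure mode the note above warns about.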

## Dependencies { .dependencies }

### crawler { .dependency }

Web crawling framework with priority queue support and concurrent request management.
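Assuming the npm `crawler` package's documented API (a `Crawler` constructor taking `maxConnections` and a `callback(error, res, done)`, plus `queue()` accepting a `priority` option), basic usage might look like the following sketch (requires the package to be installed and network access, so it is illustrative rather than runnable here):

```javascript
const Crawler = require('crawler'); // npm "crawler" package

const c = new Crawler({
  maxConnections: 2, // cap on simultaneous requests
  callback: (error, res, done) => {
    if (!error) {
      const $ = res.$; // server-side jQuery (cheerio) over the fetched page
      $('h2.article-title').each((_, el) => console.log($(el).text()));
    }
    done(); // release the connection slot so queued requests can proceed
  },
});

// Lower priority number = processed sooner.
c.queue({ uri: 'https://example.com/news', priority: 0 });
```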

@generates

## API

```javascript
/**
 * Aggregates news from multiple sources with priority-based crawling
 *
 * @param {Array<{url: string, name: string, priority: number}>} sources - Array of news sources with URLs, names, and priorities
 * @param {number} maxConnections - Maximum number of concurrent requests
 * @returns {Promise<Array<{source: string, articles: Array<string>}>>} Promise that resolves with collected articles grouped by source
 */
function aggregateNews(sources, maxConnections) {
  // Implementation here
}

module.exports = { aggregateNews };
```
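A dependency-free sketch of how the pieces above could compose follows. The `fetchPage` parameter is an addition made here purely for testability and is not part of the required signature; a real solution would delegate fetching, connection pooling, and priority ordering to the crawler dependency:

```javascript
// Illustrative, self-contained sketch of the aggregator.
// fetchPage(url) -> Promise<string> is injected so mock HTML can be used.
function aggregateNews(sources, maxConnections, fetchPage) {
  // Order by priority: lower number first; ties keep insertion order.
  const queue = sources
    .map((s, i) => ({ ...s, i }))
    .sort((a, b) => a.priority - b.priority || a.i - b.i);
  const results = sources.map((s) => ({ source: s.name, articles: [] }));
  const byName = new Map(results.map((r) => [r.source, r]));
  let active = 0;
  let finished = 0;

  return new Promise((resolve) => {
    if (sources.length === 0) return resolve(results);
    const next = () => {
      // Start requests while slots are free, never exceeding maxConnections.
      while (active < maxConnections && queue.length > 0) {
        const src = queue.shift();
        active++;
        fetchPage(src.url).then((html) => {
          // Regex stand-in for a real HTML parser (see Data Extraction).
          const re = /<h2[^>]*class="[^"]*\barticle-title\b[^"]*"[^>]*>([\s\S]*?)<\/h2>/g;
          let m;
          while ((m = re.exec(html)) !== null) {
            byName.get(src.name).articles.push(m[1].replace(/<[^>]+>/g, '').trim());
          }
          active--;
          finished++;
          if (finished === sources.length) resolve(results);
          else next();
        });
      }
    };
    next();
  });
}
```

With `maxConnections` set to 1 and mixed priorities, the fetch order follows the priority rule from Test 1, and the resolved array groups titles per source as the Result Format section specifies.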

Install with Tessl CLI:

```shell
npx tessl i tessl/npm-crawler
```
