
tessl/npm-crawler

A ready-to-use web spider that works with proxies, asynchrony, rate limiting, configurable request pools, jQuery, and HTTP/2 support.


evals/scenario-1/task.md

Product Catalog Scraper

Build a web scraping tool that collects product information from a list of URLs while avoiding duplicate requests to the same URLs.

Requirements

Your scraper should:

  1. Accept URLs to crawl: Accept a list of product page URLs to scrape
  2. Prevent duplicate requests: Ensure that each unique URL is only requested once, even if the same URL appears multiple times in the input
  3. Extract product information: For each product page, extract the product title and price using the HTML structure provided below
  4. Handle multiple batches: Support adding URLs to crawl in multiple separate batches without re-requesting already visited pages
  5. Report results: Output all collected product data when crawling completes
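Requirements 2 and 4 boil down to visited-URL bookkeeping that persists across batches. A minimal sketch of that logic using a plain `Set` (the names here are illustrative, not part of the required API):

```javascript
// Track every URL ever enqueued so later batches never re-request a page.
const visited = new Set();

// Return only the URLs not seen before, recording them as seen.
function filterNewUrls(urls) {
  const fresh = [];
  for (const url of urls) {
    if (!visited.has(url)) {
      visited.add(url);
      fresh.push(url);
    }
  }
  return fresh;
}

// First batch: the duplicate within the batch is dropped.
console.log(filterNewUrls([
  "http://example.com/product/1",
  "http://example.com/product/2",
  "http://example.com/product/1",
])); // → [ 'http://example.com/product/1', 'http://example.com/product/2' ]

// Second batch: the previously visited URL is dropped.
console.log(filterNewUrls([
  "http://example.com/product/2",
  "http://example.com/product/3",
])); // → [ 'http://example.com/product/3' ]
```

The same `Set` is consulted by every batch, which is what makes requirement 4 work without any extra state.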

Input Format

The scraper will receive product URLs as input. Each URL represents a product page with the following HTML structure:

<html>
  <head><title>Product Name</title></head>
  <body>
    <h1 class="product-title">Product Name</h1>
    <span class="price">$XX.XX</span>
  </body>
</html>
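Because the markup is fixed, extraction is just two lookups. With the crawler dependency you would use its server-side jQuery handle (e.g. `$('.product-title').text()`); the dependency-free sketch below uses regular expressions instead, which is adequate only for this fixed structure (all names are illustrative):

```javascript
// Pull the product title and price out of a page matching the fixed
// structure above. Regexes suffice only because the markup is fixed; a
// real implementation would use the jQuery-style selectors the crawler
// dependency provides.
function extractProduct(url, html) {
  const title = /<h1 class="product-title">([^<]*)<\/h1>/.exec(html);
  const price = /<span class="price">([^<]*)<\/span>/.exec(html);
  return {
    url,
    title: title ? title[1] : null,
    price: price ? price[1] : null,
  };
}

const page = `
<html>
  <head><title>Blue Widget</title></head>
  <body>
    <h1 class="product-title">Blue Widget</h1>
    <span class="price">$19.99</span>
  </body>
</html>`;

console.log(extractProduct("http://example.com/product/1", page));
// → { url: 'http://example.com/product/1', title: 'Blue Widget', price: '$19.99' }
```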

Output Format

When all crawling completes, your scraper should output all collected products in JSON format:

[
  {
    "url": "http://example.com/product/1",
    "title": "Product Name",
    "price": "$XX.XX"
  }
]
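Producing that output is a single serialization of the collected records once crawling drains; a minimal sketch (field names as specified above, sample data illustrative):

```javascript
// Collected records, serialized once all crawling completes.
const products = [
  { url: "http://example.com/product/1", title: "Blue Widget", price: "$19.99" },
];

// Two-space indentation matches the sample output above.
console.log(JSON.stringify(products, null, 2));
```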

Test Cases

  • When given a list of URLs that includes duplicates, the scraper only requests each unique URL once @test
  • When URLs are added in separate batches, previously visited URLs are not re-requested @test
  • When crawling completes, the scraper outputs all collected product data in the correct JSON format @test

Implementation

@generates

API

/**
 * Creates a new product scraper that prevents duplicate URL requests
 *
 * @param {Function} onComplete - Callback invoked when all crawling completes, receives array of products
 * @returns {Object} Scraper instance with addUrls method
 */
function createScraper(onComplete) {
  // Implementation
}

/**
 * Scraper instance
 * @typedef {Object} Scraper
 * @property {Function} addUrls - Adds URLs to crawl. Signature: addUrls(urls: string[])
 */

module.exports = { createScraper };
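The API above needs only a visited set, a result buffer, and an in-flight counter. A minimal sketch follows; the extra `fetchPage(url, cb)` parameter is an illustration-only injection point (not part of the required API) so the sketch runs without a network, and a real implementation would instead delegate requesting and deduplication to the crawler dependency (e.g. its `skipDuplicates` option):

```javascript
// Illustrative sketch only. `fetchPage(url, cb)` stands in for the HTTP
// request the crawler dependency would perform.
function createScraper(onComplete, fetchPage) {
  const visited = new Set();   // every URL ever queued, across batches
  const products = [];         // collected results, in completion order
  let pending = 0;             // in-flight requests

  function extractProduct(url, html) {
    const title = /<h1 class="product-title">([^<]*)<\/h1>/.exec(html);
    const price = /<span class="price">([^<]*)<\/span>/.exec(html);
    return { url, title: title && title[1], price: price && price[1] };
  }

  return {
    addUrls(urls) {
      for (const url of urls) {
        if (visited.has(url)) continue; // requirements 2 and 4: request once
        visited.add(url);
        pending++;
        fetchPage(url, (html) => {
          products.push(extractProduct(url, html));
          if (--pending === 0) onComplete(products); // queue drained
        });
      }
    },
  };
}

// Usage with a canned fetch so the sketch runs without a network.
const pages = {
  "http://example.com/product/1":
    '<h1 class="product-title">Widget</h1><span class="price">$9.99</span>',
};
const scraper = createScraper(
  (all) => console.log(JSON.stringify(all)),
  (url, cb) => setImmediate(() => cb(pages[url] || "")),
);
scraper.addUrls([
  "http://example.com/product/1",
  "http://example.com/product/1", // duplicate: requested only once
]);
// prints [{"url":"http://example.com/product/1","title":"Widget","price":"$9.99"}]
```

The `pending` counter plays the role of the crawler package's queue-drain signal: `onComplete` fires only when no requests remain in flight.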

Dependencies { .dependencies }

crawler { .dependency }

Provides web crawling and HTML parsing capabilities with built-in duplicate URL detection.

Install with Tessl CLI

npx tessl i tessl/npm-crawler
