Library of web-related functions for HTML manipulation, HTTP processing, URL handling, and encoding detection
Build a URL deduplication system that normalizes URLs to identify and eliminate duplicates in a web crawling context.
When crawling websites, the same page is often accessible through multiple URL variations. Your task is to implement a URL deduplicator that can identify when different URLs point to the same resource by converting them to a canonical form.
Implement a function that normalizes URLs to identify duplicates. The canonical form must satisfy the following criteria:
- URLs with reordered query parameters are recognized as identical: http://example.com?b=2&a=1 and http://example.com?a=1&b=2 should canonicalize to the same form @test
- URLs with mixed-case percent-encoding are normalized: http://example.com/path%2fto and http://example.com/path%2Fto should canonicalize to the same form @test
- URLs with default ports are normalized: http://example.com:80/path and http://example.com/path should canonicalize to the same form @test
- URL fragments are removed by default: http://example.com/page#section should canonicalize to http://example.com/page @test
@generates
def canonicalize_url(url: str) -> str:
    """
    Convert a URL to its canonical form for deduplication purposes.

    Args:
        url: The URL string to canonicalize

    Returns:
        The canonical form of the URL
    """
    pass

Provides web-related utility functions for URL manipulation and normalization.
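One way to satisfy the criteria above is a minimal sketch built on the standard library's urllib.parse; the helper names and the DEFAULT_PORTS table are assumptions for illustration, not part of the spec:

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed mapping of schemes to their default ports (not given in the spec)
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url: str) -> str:
    """Sketch: convert a URL to a canonical form for deduplication."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()

    # Lowercase the host and drop the port when it is the scheme's default
    netloc = netloc.lower()
    host, sep, port = netloc.rpartition(":")
    if sep and port.isdigit() and int(port) == DEFAULT_PORTS.get(scheme):
        netloc = host

    # Uppercase the hex digits of percent-escapes: %2f -> %2F
    path = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), path)

    # Sort query parameters so ordering does not matter
    query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))

    # Rebuild without the fragment; use "/" for an empty path
    return urlunsplit((scheme, netloc, path or "/", query, ""))
```

With this sketch, http://example.com?b=2&a=1 and http://example.com?a=1&b=2 both canonicalize to http://example.com/?a=1&b=2, and http://example.com:80/path becomes http://example.com/path. Edge cases such as IPv6 host literals are deliberately ignored here.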
Install with Tessl CLI
npx tessl i tessl/pypi-w3libevals
scenario-1
scenario-2
scenario-3
scenario-4
scenario-5
scenario-6
scenario-7
scenario-8
scenario-9
scenario-10