Ctrl + K
DocumentationLog inGet started

tessl/pypi-w3lib

tessl install tessl/pypi-w3lib@2.3.0

Library of web-related functions for HTML manipulation, HTTP processing, URL handling, and encoding detection

Agent Success

Agent success rate when using this tile

84%

Improvement

Agent success rate improvement when using this tile compared to baseline

0.91x

Baseline

Agent success rate without this tile

92%

task.mdevals/scenario-2/

URL Deduplicator

Build a URL deduplication system that normalizes URLs to identify and eliminate duplicates in a web crawling context.

Problem Description

When crawling websites, the same page is often accessible through multiple URL variations. Your task is to implement a URL deduplicator that can identify when different URLs point to the same resource by converting them to a canonical form.

Requirements

URL Normalization

Implement a function that normalizes URLs to identify duplicates. The normalization should:

  • Convert the URL to a consistent canonical form
  • Sort query parameters alphabetically
  • Normalize percent-encoding (uppercase hex digits)
  • Remove default port numbers (80 for HTTP, 443 for HTTPS)
  • Handle URL fragments appropriately (remove by default)

Test Cases

  • URLs with reordered query parameters are recognized as identical: http://example.com?b=2&a=1 and http://example.com?a=1&b=2 should canonicalize to the same form @test

  • URLs with mixed case percent-encoding are normalized: http://example.com/path%2fto and http://example.com/path%2Fto should canonicalize to the same form @test

  • URLs with default ports are normalized: http://example.com:80/path and http://example.com/path should canonicalize to the same form @test

  • URL fragments are removed by default: http://example.com/page#section should canonicalize to http://example.com/page @test

Implementation

@generates

API

def canonicalize_url(url: str) -> str:
    """
    Convert a URL to its canonical form for deduplication purposes.

    Args:
        url: The URL string to canonicalize

    Returns:
        The canonical form of the URL
    """
    pass

Dependencies { .dependencies }

w3lib { .dependency }

Provides web-related utility functions for URL manipulation and normalization.

Version

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
pypipkg:pypi/w3lib@2.3.x
tile.json