tessl/pypi-w3lib

tessl install tessl/pypi-w3lib@2.3.0

Library of web-related functions for HTML manipulation, HTTP processing, URL handling, and encoding detection

Agent Success: 84% (agent success rate when using this tile)

Improvement: 0.91x (agent success rate improvement when using this tile compared to baseline)

Baseline: 92% (agent success rate without this tile)

URL Deduplicator

Build a URL deduplication system that normalizes URLs to identify and eliminate duplicates in a web crawling context.

Problem Description

When crawling websites, the same page is often accessible through multiple URL variations. Your task is to implement a URL deduplicator that can identify when different URLs point to the same resource by converting them to a canonical form.
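To make the deduplication idea concrete, here is a minimal sketch of how canonical forms could serve as set keys. The names `deduplicate` and `strip_fragment` are hypothetical and not part of the task's API; `strip_fragment` is only a stand-in canonicalizer for demonstration, where the real task would pass its own `canonicalize_url`.

```python
from typing import Callable, Iterable, List


def strip_fragment(url: str) -> str:
    """Stand-in canonicalizer (assumption): only drops the #fragment."""
    return url.split("#", 1)[0]


def deduplicate(urls: Iterable[str], canonicalize: Callable[[str], str]) -> List[str]:
    """Keep the first URL seen for each distinct canonical form, in input order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonicalize(url)  # URLs with equal keys count as duplicates
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


crawled = [
    "http://example.com/page#section",
    "http://example.com/page",
    "http://example.com/other",
]
unique_urls = deduplicate(crawled, strip_fragment)
```

Passing the canonicalizer as a parameter keeps the duplicate-tracking logic independent of whichever normalization rules are eventually chosen.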

Requirements

URL Normalization

Implement a function that normalizes URLs to identify duplicates. The normalization should:

Convert the URL to a consistent canonical form
Sort query parameters alphabetically
Normalize percent-encoding (uppercase hex digits)
Remove default port numbers (80 for HTTP, 443 for HTTPS)
Handle URL fragments appropriately (remove by default)
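The task expects the w3lib dependency to do this work. The sketch below instead uses only the Python standard library, purely to make the listed steps visible; it is not w3lib's implementation. The name `canonicalize_url_sketch` and the `keep_fragment` parameter are assumptions for illustration, and the sketch ignores userinfo and IPv6 host literals.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Default ports listed in the requirements.
DEFAULT_PORTS = {"http": 80, "https": 443}
_PERCENT_ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")


def _upper_escapes(text: str) -> str:
    """Uppercase the hex digits of each percent-escape, e.g. %2f -> %2F."""
    return _PERCENT_ESCAPE.sub(lambda m: m.group(0).upper(), text)


def canonicalize_url_sketch(url: str, keep_fragment: bool = False) -> str:
    """Illustrative normalization only (hypothetical helper, not w3lib)."""
    parts = urlsplit(url)
    scheme = parts.scheme        # urlsplit already lowercases the scheme
    host = parts.hostname or ""  # .hostname is returned in lowercase
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{port}"
    else:
        netloc = host
    path = _upper_escapes(parts.path)
    # Normalize escapes first, drop empty pairs, then sort by (name, value).
    pairs = [p for p in _upper_escapes(parts.query).split("&") if p]
    pairs.sort(key=lambda p: p.partition("=")[::2])
    query = "&".join(pairs)
    fragment = _upper_escapes(parts.fragment) if keep_fragment else ""
    return urlunsplit((scheme, netloc, path, query, fragment))
```

Escapes are normalized before sorting so that two spellings of the same parameter name cannot end up in different positions.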

Test Cases

URLs with reordered query parameters are recognized as identical: http://example.com?b=2&a=1 and http://example.com?a=1&b=2 should canonicalize to the same form @test
URLs with mixed case percent-encoding are normalized: http://example.com/path%2fto and http://example.com/path%2Fto should canonicalize to the same form @test
URLs with default ports are normalized: http://example.com:80/path and http://example.com/path should canonicalize to the same form @test
URL fragments are removed by default: http://example.com/page#section should canonicalize to http://example.com/page @test

Implementation

@generates

API

def canonicalize_url(url: str) -> str:
    """
    Convert a URL to its canonical form for deduplication purposes.

    Args:
        url: The URL string to canonicalize

    Returns:
        The canonical form of the URL
    """
    pass

Dependencies { .dependencies }

w3lib { .dependency }

Provides web-related utility functions for URL manipulation and normalization.
