tessl install tessl/pypi-w3lib@2.3.0
Library of web-related functions for HTML manipulation, HTTP processing, URL handling, and encoding detection
Agent Success: 84% (agent success rate when using this tile)
Improvement: 0.91x (agent success rate improvement when using this tile compared to baseline)
Baseline: 92% (agent success rate without this tile)
Build a URL deduplication system that normalizes URLs to identify and eliminate duplicates in a web crawling context.
When crawling websites, the same page is often accessible through multiple URL variations. Your task is to implement a URL deduplicator that can identify when different URLs point to the same resource by converting them to a canonical form.
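The deduplication idea described above can be sketched as a filter that remembers which canonical forms it has already seen. This is only an illustrative outline: the name `dedupe_urls` and its `canonicalize` parameter are hypothetical and not part of the task's API, and the canonicalizer in the usage lines is a deliberately trivial stand-in that only strips fragments.

```python
from typing import Callable, Iterable, Iterator


def dedupe_urls(
    urls: Iterable[str], canonicalize: Callable[[str], str]
) -> Iterator[str]:
    """Yield each URL whose canonical form has not been seen before.

    `canonicalize` is any function mapping a URL to its canonical form;
    it is a hypothetical parameter used here to keep the sketch generic.
    """
    seen = set()
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            yield url  # first URL seen for this canonical form


# Usage with a trivial stand-in canonicalizer that only drops the fragment.
strip_fragment = lambda u: u.split("#", 1)[0]
unique = list(
    dedupe_urls(
        [
            "http://example.com/page#a",
            "http://example.com/page#b",
            "http://example.com/other",
        ],
        strip_fragment,
    )
)
```

Keeping the canonicalizer as a parameter means the crawl loop does not need to change if the normalization rules do.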
Implement a function that normalizes URLs so that duplicates can be identified. The normalization should satisfy the following requirements:
URLs with reordered query parameters are recognized as identical: http://example.com?b=2&a=1 and http://example.com?a=1&b=2 should canonicalize to the same form @test
URLs with mixed case percent-encoding are normalized: http://example.com/path%2fto and http://example.com/path%2Fto should canonicalize to the same form @test
URLs with default ports are normalized: http://example.com:80/path and http://example.com/path should canonicalize to the same form @test
URL fragments are removed by default: http://example.com/page#section should canonicalize to http://example.com/page @test
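To make the four requirements above concrete, here is a minimal sketch that uses only the Python standard library. It is not w3lib's implementation and is not guaranteed to match its output; the name `canonicalize_url_sketch` and the `DEFAULT_PORTS` table are assumptions introduced for illustration. It ignores userinfo and IPv6 hosts for brevity.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed default ports for the schemes this sketch cares about.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url_sketch(url: str) -> str:
    """Illustrative normalization; simplified, standard library only."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Drop the port when it is the default for the scheme.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"

    # Uppercase the hex digits of percent-escapes in the path (%2f -> %2F).
    path = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), parts.path)

    # Sort query parameters so that ordering does not matter.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(pairs)

    # An empty fragment component removes the fragment.
    return urlunsplit((scheme, netloc, path, query, ""))
```

Under these assumptions, each pair of URLs in the requirements maps to a single string, and `http://example.com/page#section` becomes `http://example.com/page`.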
@generates
def canonicalize_url(url: str) -> str:
    """
    Convert a URL to its canonical form for deduplication purposes.

    Args:
        url: The URL string to canonicalize

    Returns:
        The canonical form of the URL
    """
    pass

Provides web-related utility functions for URL manipulation and normalization.