Library of web-related functions for HTML manipulation, HTTP processing, URL handling, and encoding detection
Build a URL deduplication system that normalizes URLs to identify and eliminate duplicates in a web crawling context.
When crawling websites, the same page is often accessible through multiple URL variations. Your task is to implement a URL deduplicator that can identify when different URLs point to the same resource by converting them to a canonical form.
Implement a function that normalizes URLs to identify duplicates. The canonical form must satisfy the following criteria:
- URLs with reordered query parameters are recognized as identical: http://example.com?b=2&a=1 and http://example.com?a=1&b=2 should canonicalize to the same form @test
- URLs with mixed-case percent-encoding are normalized: http://example.com/path%2fto and http://example.com/path%2Fto should canonicalize to the same form @test
- URLs with default ports are normalized: http://example.com:80/path and http://example.com/path should canonicalize to the same form @test
- URL fragments are removed by default: http://example.com/page#section should canonicalize to http://example.com/page @test
@generates
def canonicalize_url(url: str) -> str:
    """
    Convert a URL to its canonical form for deduplication purposes.

    Args:
        url: The URL string to canonicalize

    Returns:
        The canonical form of the URL
    """
    pass

Provides web-related utility functions for URL manipulation and normalization.
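One way to satisfy the criteria above is a minimal sketch built on the standard library's urllib.parse; the helper names and the DEFAULT_PORTS table are assumptions for illustration, not part of the spec:

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed mapping of schemes to their default ports (not given in the spec)
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url: str) -> str:
    """Sketch: convert a URL to a canonical form for deduplication."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()

    # Lowercase the host and drop the port when it is the scheme's default
    netloc = netloc.lower()
    host, sep, port = netloc.rpartition(":")
    if sep and port.isdigit() and int(port) == DEFAULT_PORTS.get(scheme):
        netloc = host

    # Uppercase the hex digits of percent-escapes: %2f -> %2F
    path = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), path)

    # Sort query parameters so ordering does not matter
    query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))

    # Rebuild without the fragment; use "/" for an empty path
    return urlunsplit((scheme, netloc, path or "/", query, ""))
```

With this sketch, http://example.com?b=2&a=1 and http://example.com?a=1&b=2 both canonicalize to http://example.com/?a=1&b=2, and http://example.com:80/path becomes http://example.com/path. Edge cases such as IPv6 host literals are deliberately ignored here.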
Install with Tessl CLI
npx tessl i tessl/pypi-w3libevals
scenario-1
scenario-2
scenario-3
scenario-4
scenario-5
scenario-6
scenario-7
scenario-8
scenario-9
scenario-10