tessl/pypi-url-normalize

URL normalization for Python with support for internationalized domain names (IDN)

URL Normalize

Name: tessl/pypi-url-normalize
Author: tessl

A Python library for normalizing URLs, with support for internationalized domain names (IDN). It lowercases the scheme and host, applies consistent percent-encoding, removes default ports and dot-segments, and offers optional query parameter filtering and a configurable default scheme.

Package Information

Package Name: url-normalize
Language: Python
Installation: pip install url-normalize

Core Imports

from url_normalize import url_normalize

Basic Usage

from url_normalize import url_normalize

# Basic normalization (uses https by default)
normalized = url_normalize("www.foo.com:80/foo")
print(normalized)  # https://www.foo.com/foo

# With custom default scheme
normalized = url_normalize("www.foo.com/foo", default_scheme="http")
print(normalized)  # http://www.foo.com/foo

# With query parameter filtering enabled
normalized = url_normalize(
    "www.google.com/search?q=test&utm_source=test", 
    filter_params=True
)
print(normalized)  # https://www.google.com/search?q=test

# With custom parameter allowlist
normalized = url_normalize(
    "example.com?page=1&id=123&ref=test",
    filter_params=True,
    param_allowlist=["page", "id"]
)
print(normalized)  # https://example.com/?page=1&id=123

# With default domain for absolute paths
normalized = url_normalize(
    "/images/logo.png", 
    default_domain="example.com"
)
print(normalized)  # https://example.com/images/logo.png

Capabilities

URL Normalization

The core URL normalization function that standardizes URLs according to RFC 3986 and related standards. It handles IDN domains, ensures proper encoding, normalizes case, removes redundant components, and provides configurable options for schemes, domains, and query parameters.

def url_normalize(
    url: str | None,
    *,
    charset: str = "utf-8",
    default_scheme: str = "https", 
    default_domain: str | None = None,
    filter_params: bool = False,
    param_allowlist: dict | list | None = None,
) -> str | None:
    """
    URI normalization routine.

    Sometimes you get a URL from a user that just isn't a real
    URL because it contains unsafe characters like ' ' and so on.
    This function can fix some of the problems in a way similar to
    how browsers handle data entered by the user.

    Parameters:
    - url (str | None): URL to normalize
    - charset (str): The target charset for the URL if the url was given as unicode string. Default: "utf-8"
    - default_scheme (str): Default scheme to use if none present. Default: "https"
    - default_domain (str | None): Default domain to use for absolute paths (starting with '/'). Default: None
    - filter_params (bool): Whether to filter non-allowlisted parameters. Default: False
    - param_allowlist (dict | list | None): Override for the parameter allowlist. Can be a list of allowed parameters for all domains, or a dict mapping domains to allowed parameters. Default: None

    Returns:
    str | None: A normalized URL, or None if input was None/empty

    Raises:
    Various exceptions may be raised for malformed URLs or encoding errors
    """

Parameter Allowlist Formats

The param_allowlist parameter supports multiple formats for flexible parameter filtering:

List format - applies to all domains:

param_allowlist = ["q", "id", "page"]

Dictionary format - domain-specific rules:

param_allowlist = {
    "google.com": ["q", "ie"],
    "example.com": ["page", "id"]
}

When filter_params=True and no param_allowlist is provided, the library uses built-in allowlists for common domains:

google.com: ["q", "ie"] (search query and input encoding)
baidu.com: ["wd", "ie"] (word search and input encoding)
bing.com: ["q"] (search query)
youtube.com: ["v", "search_query"] (video ID and search query)
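The list and dictionary formats can be pictured with a small standard-library sketch. This is not the library's implementation: `BUILTIN_ALLOWLIST`, `allowed_params`, and `filter_query` are hypothetical names, and stripping a leading "www." before the domain lookup is an assumption made so that "www.google.com" matches the "google.com" entry.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical stand-in for the built-in allowlists listed above.
BUILTIN_ALLOWLIST = {
    "google.com": ["q", "ie"],
    "baidu.com": ["wd", "ie"],
    "bing.com": ["q"],
    "youtube.com": ["v", "search_query"],
}


def allowed_params(host, param_allowlist=None):
    """Return the set of query parameter names to keep for `host`."""
    # List format: the same names are allowed on every domain.
    if isinstance(param_allowlist, list):
        return set(param_allowlist)
    # Dict format (or the built-in table when nothing was supplied).
    rules = BUILTIN_ALLOWLIST if param_allowlist is None else param_allowlist
    domain = host.lower()
    if domain.startswith("www."):  # assumption: "www." is ignored for lookup
        domain = domain[4:]
    return set(rules.get(domain, []))


def filter_query(url, param_allowlist=None):
    """Drop every query parameter that is not allowlisted for the URL's host."""
    parts = urlsplit(url)
    keep = allowed_params(parts.hostname or "", param_allowlist)
    pairs = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in keep
    ]
    return urlunsplit(parts._replace(query=urlencode(pairs)))
```

In this sketch a domain with no matching entry loses all of its query parameters, which is one plausible reading of "filter non-allowlisted parameters"; check the library's behavior before relying on it.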

Command Line Interface

The package provides a command-line interface for URL normalization that exposes the same options as the Python API.

# Basic usage
url-normalize "www.foo.com:80/foo"

# With options
url-normalize -s http -f -p q,id "example.com?q=test&utm_source=bad"

# Available options:
# -v, --version: Show version information
# -c, --charset: Charset (default: utf-8)  
# -s, --default-scheme: Default scheme (default: https)
# -f, --filter-params: Filter tracking parameters
# -d, --default-domain: Default domain for absolute paths
# -p, --param-allowlist: Comma-separated allowlist

The CLI is available as:

Console script: url-normalize
Module execution: python -m url_normalize.cli
Via uv/uvx: uvx url-normalize
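To make the correspondence between the CLI options and the Python keyword arguments concrete, here is a minimal argparse sketch. It only mirrors the flags listed above and is not the package's actual `cli` module; `build_parser` and `options_to_kwargs` are hypothetical names, and the `-v` flag is left out.

```python
import argparse


def build_parser():
    """Parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="url-normalize")
    parser.add_argument("-c", "--charset", default="utf-8")
    parser.add_argument("-s", "--default-scheme", default="https")
    parser.add_argument("-f", "--filter-params", action="store_true")
    parser.add_argument("-d", "--default-domain", default=None)
    parser.add_argument("-p", "--param-allowlist", default=None,
                        help="comma-separated parameter names")
    parser.add_argument("url")
    return parser


def options_to_kwargs(args):
    """Translate parsed options into url_normalize-style keyword arguments."""
    allowlist = args.param_allowlist.split(",") if args.param_allowlist else None
    return {
        "charset": args.charset,
        "default_scheme": args.default_scheme,
        "default_domain": args.default_domain,
        "filter_params": args.filter_params,
        "param_allowlist": allowlist,
    }
```

With this mapping, the invocation `url-normalize -s http -f -p q,id "example.com?q=test&utm_source=bad"` corresponds to calling `url_normalize(..., default_scheme="http", filter_params=True, param_allowlist=["q", "id"])`.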

Normalization Features

The library performs comprehensive URL normalization including:

IDN Support: Full internationalized domain name handling using IDNA2008 with UTS46 transitional processing
Case Normalization: Scheme and host converted to lowercase
Port Normalization: Default ports removed (80 for http, 443 for https)
Path Normalization: Dot-segments removed, empty paths converted to "/"
Percent-Encoding: Only essential encoding performed, uppercase hex digits used
Query Parameter Filtering: Optional removal of tracking parameters
Fragment Normalization: Proper encoding and normalization of URL fragments
Unicode Normalization: UTF-8 NFC encoding throughout
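Several of these steps can be illustrated with the standard library alone. The sketch below covers case normalization, default-port removal, dot-segment removal, the empty-path rule, and uppercase hex digits in percent-escapes. It is not the library's code: `sketch_normalize` and `remove_dot_segments` are hypothetical helpers, userinfo is dropped for brevity, paths are assumed to be absolute, and IDN handling, query filtering, and Unicode normalization are left out.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path):
    """Resolve "." and ".." segments in an absolute path (RFC 3986, 5.2.4)."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == "..":
            if len(output) > 1:  # never pop the leading "" that marks the root
                output.pop()
        elif segment != ".":
            output.append(segment)
    if segments[-1] in (".", ".."):
        output.append("")  # a trailing dot-segment leaves a trailing slash
    return "/".join(output)


def sketch_normalize(url):
    """Apply a subset of the normalization steps to a URL that has a scheme."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # urlsplit returns the hostname lowercased
    netloc = host
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # An empty path becomes "/", otherwise dot-segments are resolved.
    path = remove_dot_segments(parts.path) if parts.path else "/"
    # Percent-escapes are rewritten with uppercase hex digits.
    path = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), path)
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
```

For example, `sketch_normalize("HTTP://WWW.Example.COM:80/a/b/../c/./d")` yields `http://www.example.com/a/c/d` in this sketch.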

Error Handling

url_normalize may raise an exception in the following cases:

Malformed URLs that cannot be parsed
Encoding errors during character set conversion
IDNA processing errors for invalid internationalized domains

When used via CLI, errors are caught and reported to stderr with exit code 1.
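Because the exact exception types are not specified here, callers that process untrusted input may want a defensive wrapper. The sketch below assumes that failures surface as `ValueError` or one of its subclasses (Python's `UnicodeError` is one); that is an assumption, not documented library behavior, so widen the `except` clause if other types appear in practice. `normalize_or_none` is a hypothetical helper that takes the normalizer as an argument.

```python
from typing import Callable, Optional


def normalize_or_none(
    url: Optional[str],
    normalizer: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return normalizer(url), or None when the input is empty or invalid."""
    if not url:
        return None
    try:
        return normalizer(url)
    except ValueError:
        # UnicodeError and its encode/decode subclasses derive from
        # ValueError, so encoding failures are caught here as well.
        return None
```

Typical use would be `normalize_or_none(raw_url, url_normalize)`, which keeps a batch job running when a single URL cannot be normalized.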

Package Metadata

__version__ = "2.2.1"
__license__ = "MIT"
__all__ = ["url_normalize"]

Install with Tessl CLI

npx tessl i tessl/pypi-url-normalize

Workspace: tessl
Visibility: Public
Created: 6 months ago
Last updated: about 1 month ago
Describes: pkg:pypi/url-normalize@2.2.x