URL normalization for Python with support for internationalized domain names (IDN)
npx @tessl/cli install tessl/pypi-url-normalize@2.2.0

A Python library for standardizing and normalizing URLs with support for internationalized domain names (IDN). It handles various URL formats, ensures proper percent-encoding, normalizes case, and offers configurable options for query parameter filtering and default schemes.
pip install url-normalize

from url_normalize import url_normalize
# Basic normalization (uses https by default)
normalized = url_normalize("www.foo.com:80/foo")
print(normalized)  # https://www.foo.com:80/foo (port 80 is not the https default, so it is kept)
# With custom default scheme
normalized = url_normalize("www.foo.com/foo", default_scheme="http")
print(normalized) # http://www.foo.com/foo
# With query parameter filtering enabled
normalized = url_normalize(
    "www.google.com/search?q=test&utm_source=test",
    filter_params=True,
)
print(normalized) # https://www.google.com/search?q=test
# With custom parameter allowlist
normalized = url_normalize(
    "example.com?page=1&id=123&ref=test",
    filter_params=True,
    param_allowlist=["page", "id"],
)
print(normalized)  # https://example.com/?page=1&id=123 (an empty path is normalized to "/")
# With default domain for absolute paths
normalized = url_normalize(
    "/images/logo.png",
    default_domain="example.com",
)
print(normalized)  # https://example.com/images/logo.png

The core URL normalization function standardizes URLs according to RFC 3986 and related standards. It handles IDN domains, ensures proper encoding, normalizes case, removes redundant components, and provides configurable options for schemes, domains, and query parameters.
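To make a few of these steps concrete, here is a minimal sketch using only the standard library. It is not the library's implementation: `sketch_normalize` and `DEFAULT_PORTS` are hypothetical names introduced for illustration, and the sketch covers only scheme defaulting, case normalization, IDN conversion, default-port removal, and the empty-path rule. It drops any userinfo and does no percent-encoding work.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports for the two schemes this sketch handles (an assumption of the sketch).
DEFAULT_PORTS = {"http": 80, "https": 443}


def sketch_normalize(url: str, default_scheme: str = "https") -> str:
    """Hypothetical helper: a simplified illustration, not the library's code."""
    # Prepend a scheme when none is present, much as a browser would.
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    parts = urlsplit(url)

    # urlsplit already lowercases the scheme; .lower() keeps the intent explicit.
    scheme = parts.scheme.lower()
    # parts.hostname is lowercased; the idna codec converts IDN labels to ASCII.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    # Keep the port only when it differs from the scheme's default.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # For http and https, an empty path is equivalent to "/".
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(sketch_normalize("HTTP://WWW.Foo.com:80/foo"))  # http://www.foo.com/foo
```

For real inputs, prefer `url_normalize` itself, which covers many more cases than this sketch.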
def url_normalize(
    url: str | None,
    *,
    charset: str = "utf-8",
    default_scheme: str = "https",
    default_domain: str | None = None,
    filter_params: bool = False,
    param_allowlist: dict | list | None = None,
) -> str | None:
    """
    URI normalization routine.

    Sometimes you get a URL from a user that just isn't a real
    URL because it contains unsafe characters like ' ' and so on.
    This function can fix some of the problems in a similar way
    to how browsers handle data entered by the user.

    Parameters:
    - url (str | None): URL to normalize
    - charset (str): The target charset for the URL if the url was given as unicode string. Default: "utf-8"
    - default_scheme (str): Default scheme to use if none present. Default: "https"
    - default_domain (str | None): Default domain to use for absolute paths (starting with '/'). Default: None
    - filter_params (bool): Whether to filter non-allowlisted parameters. Default: False
    - param_allowlist (dict | list | None): Override for the parameter allowlist. Can be a list of allowed parameters for all domains, or a dict mapping domains to allowed parameters. Default: None

    Returns:
    str | None: A normalized URL, or None if input was None/empty

    Raises:
    Various exceptions may be raised for malformed URLs or encoding errors
    """

The param_allowlist parameter supports multiple formats for flexible parameter filtering:
List format - applies to all domains:
param_allowlist = ["q", "id", "page"]

Dictionary format - domain-specific rules:
param_allowlist = {
    "google.com": ["q", "ie"],
    "example.com": ["page", "id"],
}

When filter_params=True and no param_allowlist is provided, the library uses built-in allowlists for common domains:
- ["q", "ie"] (search query and input encoding)
- ["wd", "ie"] (word search and input encoding)
- ["q"] (search query)
- ["v", "search_query"] (video ID and search query)

The package provides a command-line interface for URL normalization that supports the same options as the Python API.
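As a rough illustration of how the CLI flags documented in this section line up with the keyword arguments of `url_normalize`, the sketch below builds an `argparse` parser and converts its result into a keyword dictionary. The parser, `build_parser`, and `to_kwargs` are hypothetical reconstructions for illustration only, not the package's actual `cli` module, and the `-v/--version` flag is left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented for the CLI."""
    parser = argparse.ArgumentParser(prog="url-normalize")
    parser.add_argument("url")
    parser.add_argument("-c", "--charset", default="utf-8")
    parser.add_argument("-s", "--default-scheme", default="https")
    parser.add_argument("-f", "--filter-params", action="store_true")
    parser.add_argument("-d", "--default-domain", default=None)
    parser.add_argument("-p", "--param-allowlist", default=None)
    return parser


def to_kwargs(argv: list[str]) -> dict:
    """Translate CLI arguments into keyword arguments for url_normalize."""
    args = build_parser().parse_args(argv)
    # The -p value is a comma-separated string; the API expects a list.
    allowlist = args.param_allowlist.split(",") if args.param_allowlist else None
    return {
        "charset": args.charset,
        "default_scheme": args.default_scheme,
        "default_domain": args.default_domain,
        "filter_params": args.filter_params,
        "param_allowlist": allowlist,
    }


print(to_kwargs(["-s", "http", "-f", "-p", "q,id", "example.com?q=test"]))
```

Under these assumptions, the resulting dictionary could be passed as `url_normalize(url, **kwargs)`.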
# Basic usage
url-normalize "www.foo.com:80/foo"
# With options
url-normalize -s http -f -p q,id "example.com?q=test&utm_source=bad"
# Available options:
# -v, --version: Show version information
# -c, --charset: Charset (default: utf-8)
# -s, --default-scheme: Default scheme (default: https)
# -f, --filter-params: Filter tracking parameters
# -d, --default-domain: Default domain for absolute paths
# -p, --param-allowlist: Comma-separated allowlist

The CLI is available as:
url-normalize
python -m url_normalize.cli
uvx url-normalize

The library performs comprehensive URL normalization including:
The function may raise various exceptions for:
When used via CLI, errors are caught and reported to stderr with exit code 1.
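The catch-and-report behavior described for the CLI can be approximated in your own scripts. The sketch below is an assumption about the general pattern, not the package's code: `run` is a hypothetical wrapper, the message format is invented, and the broad `except Exception` reflects that this document does not name specific exception types.

```python
import sys
from typing import Callable, Optional


def run(normalize: Callable[[str], Optional[str]], url: str) -> int:
    """Hypothetical wrapper: print the result, or report the error and return 1."""
    try:
        result = normalize(url)
    except Exception as exc:  # broad on purpose: no specific types are documented here
        print(f"Error: {exc}", file=sys.stderr)
        return 1
    print(result)
    return 0
```

A script could then end with `sys.exit(run(url_normalize, sys.argv[1]))` so that failures surface as a non-zero exit code.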
__version__ = "2.2.1"
__license__ = "MIT"
__all__ = ["url_normalize"]