URL normalization for Python with support for internationalized domain names (IDN)
npx @tessl/cli install tessl/pypi-url-normalize@2.2.00
# URL Normalize
1
2
A Python library for standardizing and normalizing URLs with support for internationalized domain names (IDN). The library provides robust URL normalization that handles various URL formats, ensures proper percent-encoding, performs case normalization, and provides configurable options for query parameter filtering and default schemes.
3
4
## Package Information
5
6
- **Package Name**: url-normalize
7
- **Language**: Python
8
- **Installation**: `pip install url-normalize`
9
10
## Core Imports
11
12
```python
13
from url_normalize import url_normalize
14
```
15
16
## Basic Usage
17
18
```python
19
from url_normalize import url_normalize
20
21
# Basic normalization (uses https by default)
22
normalized = url_normalize("www.foo.com:80/foo")
23
print(normalized) # https://www.foo.com/foo
24
25
# With custom default scheme
26
normalized = url_normalize("www.foo.com/foo", default_scheme="http")
27
print(normalized) # http://www.foo.com/foo
28
29
# With query parameter filtering enabled
30
normalized = url_normalize(
31
"www.google.com/search?q=test&utm_source=test",
32
filter_params=True
33
)
34
print(normalized) # https://www.google.com/search?q=test
35
36
# With custom parameter allowlist
37
normalized = url_normalize(
38
"example.com?page=1&id=123&ref=test",
39
filter_params=True,
40
param_allowlist=["page", "id"]
41
)
42
print(normalized) # https://example.com?page=1&id=123
43
44
# With default domain for absolute paths
45
normalized = url_normalize(
46
"/images/logo.png",
47
default_domain="example.com"
48
)
49
print(normalized) # https://example.com/images/logo.png
50
```
51
52
## Capabilities
53
54
### URL Normalization
55
56
The core URL normalization function that standardizes URLs according to RFC 3986 and related standards. It handles IDN domains, ensures proper encoding, normalizes case, removes redundant components, and provides configurable options for schemes, domains, and query parameters.
57
58
```python { .api }
59
def url_normalize(
60
url: str | None,
61
*,
62
charset: str = "utf-8",
63
default_scheme: str = "https",
64
default_domain: str | None = None,
65
filter_params: bool = False,
66
param_allowlist: dict | list | None = None,
67
) -> str | None:
68
"""
69
URI normalization routine.
70
71
Sometimes you get an URL by a user that just isn't a real
72
URL because it contains unsafe characters like ' ' and so on.
73
This function can fix some of the problems in a similar way
74
browsers handle data entered by the user.
75
76
Parameters:
77
- url (str | None): URL to normalize
78
- charset (str): The target charset for the URL if the url was given as unicode string. Default: "utf-8"
79
- default_scheme (str): Default scheme to use if none present. Default: "https"
80
- default_domain (str | None): Default domain to use for absolute paths (starting with '/'). Default: None
81
- filter_params (bool): Whether to filter non-allowlisted parameters. Default: False
82
- param_allowlist (dict | list | None): Override for the parameter allowlist. Can be a list of allowed parameters for all domains, or a dict mapping domains to allowed parameters. Default: None
83
84
Returns:
85
str | None: A normalized URL, or None if input was None/empty
86
87
Raises:
88
Various exceptions may be raised for malformed URLs or encoding errors
89
"""
90
```
91
92
#### Parameter Allowlist Formats
93
94
The `param_allowlist` parameter supports multiple formats for flexible parameter filtering:
95
96
**List format** - applies to all domains:
97
```python
98
param_allowlist = ["q", "id", "page"]
99
```
100
101
**Dictionary format** - domain-specific rules:
102
```python
103
param_allowlist = {
104
"google.com": ["q", "ie"],
105
"example.com": ["page", "id"]
106
}
107
```
108
109
When `filter_params=True` and no `param_allowlist` is provided, the library uses built-in allowlists for common domains:
110
111
- **google.com**: `["q", "ie"]` (search query and input encoding)
112
- **baidu.com**: `["wd", "ie"]` (word search and input encoding)
113
- **bing.com**: `["q"]` (search query)
114
- **youtube.com**: `["v", "search_query"]` (video ID and search query)
115
116
### Command Line Interface
117
118
The package provides a command-line interface for URL normalization with support for all the same options as the Python API.
119
120
```bash { .api }
121
# Basic usage
122
url-normalize "www.foo.com:80/foo"
123
124
# With options
125
url-normalize -s http -f -p q,id "example.com?q=test&utm_source=bad"
126
127
# Available options:
128
# -v, --version: Show version information
129
# -c, --charset: Charset (default: utf-8)
130
# -s, --default-scheme: Default scheme (default: https)
131
# -f, --filter-params: Filter tracking parameters
132
# -d, --default-domain: Default domain for absolute paths
133
# -p, --param-allowlist: Comma-separated allowlist
134
```
135
136
The CLI is available as:
137
- Console script: `url-normalize`
138
- Module execution: `python -m url_normalize.cli`
139
- Via uv/uvx: `uvx url-normalize`
140
141
## Normalization Features
142
143
The library performs comprehensive URL normalization including:
144
145
- **IDN Support**: Full internationalized domain name handling using IDNA2008 with UTS46 transitional processing
146
- **Case Normalization**: Scheme and host converted to lowercase
147
- **Port Normalization**: Default ports removed (80 for http, 443 for https)
148
- **Path Normalization**: Dot-segments removed, empty paths converted to "/"
149
- **Percent-Encoding**: Only essential encoding performed, uppercase hex digits used
150
- **Query Parameter Filtering**: Optional removal of tracking parameters
151
- **Fragment Normalization**: Proper encoding and normalization of URL fragments
152
- **Unicode Normalization**: UTF-8 NFC encoding throughout
153
154
## Error Handling
155
156
The function may raise various exceptions for:
157
- Malformed URLs that cannot be parsed
158
- Encoding errors during character set conversion
159
- IDNA processing errors for invalid internationalized domains
160
161
When used via CLI, errors are caught and reported to stderr with exit code 1.
162
163
## Package Metadata
164
165
```python { .api }
166
__version__ = "2.2.1"
167
__license__ = "MIT"
168
__all__ = ["url_normalize"]
169
```