or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

tessl/pypi-url-normalize

URL normalization for Python with support for internationalized domain names (IDN)

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
pypipkg:pypi/url-normalize@2.2.x

To install, run

npx @tessl/cli install tessl/pypi-url-normalize@2.2.0

0

# URL Normalize

1

2

A Python library for standardizing and normalizing URLs with support for internationalized domain names (IDN). The library provides robust URL normalization that handles various URL formats, ensures proper percent-encoding, performs case normalization, and provides configurable options for query parameter filtering and default schemes.

3

4

## Package Information

5

6

- **Package Name**: url-normalize

7

- **Language**: Python

8

- **Installation**: `pip install url-normalize`

9

10

## Core Imports

11

12

```python

13

from url_normalize import url_normalize

14

```

15

16

## Basic Usage

17

18

```python

19

from url_normalize import url_normalize

20

21

# Basic normalization (uses https by default)

22

normalized = url_normalize("www.foo.com:80/foo")

23

print(normalized) # https://www.foo.com/foo

24

25

# With custom default scheme

26

normalized = url_normalize("www.foo.com/foo", default_scheme="http")

27

print(normalized) # http://www.foo.com/foo

28

29

# With query parameter filtering enabled

30

normalized = url_normalize(

31

"www.google.com/search?q=test&utm_source=test",

32

filter_params=True

33

)

34

print(normalized) # https://www.google.com/search?q=test

35

36

# With custom parameter allowlist

37

normalized = url_normalize(

38

"example.com?page=1&id=123&ref=test",

39

filter_params=True,

40

param_allowlist=["page", "id"]

41

)

42

print(normalized) # https://example.com?page=1&id=123

43

44

# With default domain for absolute paths

45

normalized = url_normalize(

46

"/images/logo.png",

47

default_domain="example.com"

48

)

49

print(normalized) # https://example.com/images/logo.png

50

```

51

52

## Capabilities

53

54

### URL Normalization

55

56

The core URL normalization function that standardizes URLs according to RFC 3986 and related standards. It handles IDN domains, ensures proper encoding, normalizes case, removes redundant components, and provides configurable options for schemes, domains, and query parameters.

57

58

```python { .api }

59

def url_normalize(

60

url: str | None,

61

*,

62

charset: str = "utf-8",

63

default_scheme: str = "https",

64

default_domain: str | None = None,

65

filter_params: bool = False,

66

param_allowlist: dict | list | None = None,

67

) -> str | None:

68

"""

69

URI normalization routine.

70

71

Sometimes you get an URL by a user that just isn't a real

72

URL because it contains unsafe characters like ' ' and so on.

73

This function can fix some of the problems in a similar way

74

browsers handle data entered by the user.

75

76

Parameters:

77

- url (str | None): URL to normalize

78

- charset (str): The target charset for the URL if the url was given as unicode string. Default: "utf-8"

79

- default_scheme (str): Default scheme to use if none present. Default: "https"

80

- default_domain (str | None): Default domain to use for absolute paths (starting with '/'). Default: None

81

- filter_params (bool): Whether to filter non-allowlisted parameters. Default: False

82

- param_allowlist (dict | list | None): Override for the parameter allowlist. Can be a list of allowed parameters for all domains, or a dict mapping domains to allowed parameters. Default: None

83

84

Returns:

85

str | None: A normalized URL, or None if input was None/empty

86

87

Raises:

88

Various exceptions may be raised for malformed URLs or encoding errors

89

"""

90

```

91

92

#### Parameter Allowlist Formats

93

94

The `param_allowlist` parameter supports multiple formats for flexible parameter filtering:

95

96

**List format** - applies to all domains:

97

```python

98

param_allowlist = ["q", "id", "page"]

99

```

100

101

**Dictionary format** - domain-specific rules:

102

```python

103

param_allowlist = {

104

"google.com": ["q", "ie"],

105

"example.com": ["page", "id"]

106

}

107

```

108

109

When `filter_params=True` and no `param_allowlist` is provided, the library uses built-in allowlists for common domains:

110

111

- **google.com**: `["q", "ie"]` (search query and input encoding)

112

- **baidu.com**: `["wd", "ie"]` (word search and input encoding)

113

- **bing.com**: `["q"]` (search query)

114

- **youtube.com**: `["v", "search_query"]` (video ID and search query)

115

116

### Command Line Interface

117

118

The package provides a command-line interface for URL normalization with support for all the same options as the Python API.

119

120

```bash { .api }

121

# Basic usage

122

url-normalize "www.foo.com:80/foo"

123

124

# With options

125

url-normalize -s http -f -p q,id "example.com?q=test&utm_source=bad"

126

127

# Available options:

128

# -v, --version: Show version information

129

# -c, --charset: Charset (default: utf-8)

130

# -s, --default-scheme: Default scheme (default: https)

131

# -f, --filter-params: Filter tracking parameters

132

# -d, --default-domain: Default domain for absolute paths

133

# -p, --param-allowlist: Comma-separated allowlist

134

```

135

136

The CLI is available as:

137

- Console script: `url-normalize`

138

- Module execution: `python -m url_normalize.cli`

139

- Via uv/uvx: `uvx url-normalize`

140

141

## Normalization Features

142

143

The library performs comprehensive URL normalization including:

144

145

- **IDN Support**: Full internationalized domain name handling using IDNA2008 with UTS46 transitional processing

146

- **Case Normalization**: Scheme and host converted to lowercase

147

- **Port Normalization**: Default ports removed (80 for http, 443 for https)

148

- **Path Normalization**: Dot-segments removed, empty paths converted to "/"

149

- **Percent-Encoding**: Only essential encoding performed, uppercase hex digits used

150

- **Query Parameter Filtering**: Optional removal of tracking parameters

151

- **Fragment Normalization**: Proper encoding and normalization of URL fragments

152

- **Unicode Normalization**: UTF-8 NFC encoding throughout

153

154

## Error Handling

155

156

The function may raise various exceptions for:

157

- Malformed URLs that cannot be parsed

158

- Encoding errors during character set conversion

159

- IDNA processing errors for invalid internationalized domains

160

161

When used via CLI, errors are caught and reported to stderr with exit code 1.

162

163

## Package Metadata

164

165

```python { .api }

166

__version__ = "2.2.1"

167

__license__ = "MIT"

168

__all__ = ["url_normalize"]

169

```