# w3lib

A comprehensive Python library providing essential web-related utility functions for HTML manipulation, HTTP header processing, URL handling, and character encoding detection. Originally developed as a foundational component of the Scrapy web scraping framework, w3lib offers production-tested utilities for web crawlers, data extraction tools, and content processing pipelines.

## Package Information

- **Package Name**: w3lib
- **Language**: Python
- **Installation**: `pip install w3lib`
- **Version**: 2.3.1
- **License**: BSD
- **Documentation**: https://w3lib.readthedocs.io/en/latest/
- **Repository**: https://github.com/scrapy/w3lib

## Core Imports

```python
import w3lib
```

Module-specific imports:

```python
from w3lib.html import replace_entities, remove_tags, get_base_url
from w3lib.http import basic_auth_header, headers_raw_to_dict
from w3lib.url import safe_url_string, url_query_parameter, canonicalize_url
from w3lib.encoding import html_to_unicode, resolve_encoding
from w3lib.util import to_unicode, to_bytes
```

## Basic Usage

```python
from w3lib.html import replace_entities, remove_tags, get_base_url
from w3lib.url import safe_url_string, url_query_parameter
from w3lib.http import basic_auth_header
from w3lib.encoding import html_to_unicode

# HTML processing - clean up HTML content
html = '<p>Price: &pound;100 <b>only!</b></p>'
clean_text = replace_entities(html)  # 'Price: £100 <b>only!</b>'
text_only = remove_tags(clean_text)  # 'Price: £100 only!'

# URL handling - make URLs safe and extract parameters
unsafe_url = 'http://example.com/search?q=hello world&price=£100'
safe_url = safe_url_string(unsafe_url)  # Properly encoded URL
query_param = url_query_parameter(safe_url, 'q')  # 'hello world'

# HTTP utilities - create authentication headers
auth_header = basic_auth_header('user', 'password')  # b'Basic dXNlcjpwYXNzd29yZA=='

# Encoding detection - convert HTML to Unicode
raw_html = b'<html><meta charset="utf-8"><body>Caf\xc3\xa9</body></html>'
encoding, unicode_html = html_to_unicode(None, raw_html)  # ('utf-8', '<html>...')
```

## Architecture

w3lib is organized into focused modules, each handling a specific web processing task:

- **HTML Module**: Entity translation, tag manipulation, base URL extraction, meta refresh parsing
- **HTTP Module**: Header format conversion, authentication header generation
- **URL Module**: URL sanitization, parameter manipulation, encoding normalization, data URI parsing
- **Encoding Module**: Character encoding detection from HTTP headers, HTML meta tags, and BOMs
- **Utilities Module**: Core string/bytes conversion functions used throughout the library

This modular design allows developers to import only the functionality they need while maintaining consistent interfaces and error handling across all components.

## Capabilities

### HTML Processing

Comprehensive HTML manipulation including entity conversion, tag removal, comment stripping, base URL extraction, and meta refresh parsing. Handles both string and bytes input with robust encoding support.

```python { .api }
def replace_entities(text, keep=(), remove_illegal=True, encoding='utf-8'): ...
def remove_tags(text, which_ones=(), keep=(), encoding=None): ...
def remove_comments(text, encoding=None): ...
def get_base_url(text, baseurl='', encoding='utf-8'): ...
def get_meta_refresh(text, baseurl='', encoding='utf-8', ignore_tags=('script', 'noscript')): ...
```

[HTML Processing](./html-processing.md)

### HTTP Utilities

HTTP header processing utilities for converting between raw header formats and dictionaries, plus HTTP Basic Authentication header generation.

```python { .api }
def headers_raw_to_dict(headers_raw): ...
def headers_dict_to_raw(headers_dict): ...
def basic_auth_header(username, password, encoding='ISO-8859-1'): ...
```

[HTTP Utilities](./http-utilities.md)

### URL Handling

Comprehensive URL processing including browser-compatible URL sanitization, query parameter manipulation, data URI parsing, and canonicalization with support for various URL standards.

```python { .api }
def safe_url_string(url, encoding='utf8', path_encoding='utf8', quote_path=True): ...
def url_query_parameter(url, parameter, default=None, keep_blank_values=0): ...
def url_query_cleaner(url, parameterlist=(), sep='&', kvsep='=', remove=False, unique=True, keep_fragments=False): ...
def canonicalize_url(url, keep_blank_values=True, keep_fragments=False, encoding=None): ...
def parse_data_uri(uri): ...
```

[URL Handling](./url-handling.md)

### Encoding Detection

Character encoding detection from HTTP Content-Type headers, HTML meta tags, XML declarations, and byte order marks, with smart fallback handling and encoding alias resolution.

```python { .api }
def html_to_unicode(content_type_header, html_body_str, default_encoding='utf8', auto_detect_fun=None): ...
def http_content_type_encoding(content_type): ...
def html_body_declared_encoding(html_body_str): ...
def resolve_encoding(encoding_alias): ...
```

[Encoding Detection](./encoding-detection.md)

### Utilities

Core utility functions for converting between string and bytes representations with robust encoding support and error handling.

```python { .api }
def to_unicode(text, encoding=None, errors='strict'): ...
def to_bytes(text, encoding=None, errors='strict'): ...
```

[Utilities](./utilities.md)

## Common Types

```python { .api }
from typing import Any, Mapping, MutableMapping, NamedTuple, Sequence, Union

# Type aliases used across the library
StrOrBytes = Union[str, bytes]

# HTTP header types
HeadersDictInput = Mapping[bytes, Union[Any, Sequence[bytes]]]
HeadersDictOutput = MutableMapping[bytes, list[bytes]]

# Data URI parsing result
class ParseDataURIResult(NamedTuple):
    media_type: str
    media_type_parameters: dict[str, str]
    data: bytes
```

## Error Handling

w3lib functions follow consistent error handling patterns:

- Invalid input types raise `TypeError`
- Encoding errors are handled gracefully with replacement characters (`\ufffd`)
- URL parsing errors may raise `ValueError` for malformed input
- Most functions return safe defaults (empty strings, `None`) rather than raising exceptions
- Functions accept both string and bytes input to minimize conversion overhead

## Performance Considerations

- Compiled regular expressions are cached and reused across function calls
- Functions are optimized for web scraping workloads with large volumes of content
- Memory-efficient processing of HTML content avoids unnecessary string duplication
- Support for both string and bytes inputs reduces encoding/decoding overhead
- Character encoding detection uses fast heuristics before falling back to comprehensive analysis