# w3lib

A comprehensive Python library providing essential web-related utility functions for HTML manipulation, HTTP header processing, URL handling, and character encoding detection. Originally developed as a foundational component of the Scrapy web scraping framework, w3lib offers production-tested utilities for web crawlers, data extraction tools, and content processing pipelines.

## Package Information

- **Package Name**: w3lib
- **Language**: Python
- **Installation**: `pip install w3lib`
- **Version**: 2.3.1
- **License**: BSD
- **Documentation**: https://w3lib.readthedocs.io/en/latest/
- **Repository**: https://github.com/scrapy/w3lib

## Core Imports

```python
import w3lib
```

Module-specific imports:

```python
from w3lib.html import replace_entities, remove_tags, get_base_url
from w3lib.http import basic_auth_header, headers_raw_to_dict
from w3lib.url import safe_url_string, url_query_parameter, canonicalize_url
from w3lib.encoding import html_to_unicode, resolve_encoding
from w3lib.util import to_unicode, to_bytes
```

## Basic Usage

```python
from w3lib.html import replace_entities, remove_tags, get_base_url
from w3lib.url import safe_url_string, url_query_parameter
from w3lib.http import basic_auth_header
from w3lib.encoding import html_to_unicode

# HTML processing - decode entities, then strip tags
html = '<p>Price: &pound;100 <b>only!</b></p>'
clean_html = replace_entities(html)  # '<p>Price: £100 <b>only!</b></p>'
text_only = remove_tags(clean_html)  # 'Price: £100 only!'

# URL handling - make URLs safe and extract parameters
unsafe_url = 'http://example.com/search?q=hello world&price=£100'
safe_url = safe_url_string(unsafe_url)            # space and £ percent-encoded
query_param = url_query_parameter(safe_url, 'q')  # 'hello world'

# HTTP utilities - create authentication headers
auth_header = basic_auth_header('user', 'password')  # b'Basic dXNlcjpwYXNzd29yZA=='

# Encoding detection - convert HTML bytes to Unicode
raw_html = b'<html><meta charset="utf-8"><body>Caf\xc3\xa9</body></html>'
encoding, unicode_html = html_to_unicode(None, raw_html)  # ('utf-8', '<html>...')
```

## Architecture

w3lib is organized into focused modules, each handling specific web processing tasks:

- **HTML Module**: Entity translation, tag manipulation, base URL extraction, meta refresh parsing
- **HTTP Module**: Header format conversion, authentication header generation
- **URL Module**: URL sanitization, parameter manipulation, encoding normalization, data URI parsing
- **Encoding Module**: Character encoding detection from HTTP headers, HTML meta tags, and BOMs
- **Utilities Module**: Core string/bytes conversion functions used throughout the library

This modular design allows developers to import only the functionality they need while maintaining consistent interfaces and error handling across all components.

## Capabilities

### HTML Processing

Comprehensive HTML manipulation including entity conversion, tag removal, comment stripping, base URL extraction, and meta refresh parsing. Handles both string and bytes input with robust encoding support.

```python { .api }
def replace_entities(text, keep=(), remove_illegal=True, encoding='utf-8'): ...
def remove_tags(text, which_ones=(), keep=(), encoding=None): ...
def remove_comments(text, encoding=None): ...
def get_base_url(text, baseurl='', encoding='utf-8'): ...
def get_meta_refresh(text, baseurl='', encoding='utf-8', ignore_tags=('script', 'noscript')): ...
```

[HTML Processing](./html-processing.md)

### HTTP Utilities

HTTP header processing utilities for converting between raw header formats and dictionaries, plus HTTP Basic Authentication header generation.

```python { .api }
def headers_raw_to_dict(headers_raw): ...
def headers_dict_to_raw(headers_dict): ...
def basic_auth_header(username, password, encoding='ISO-8859-1'): ...
```

[HTTP Utilities](./http-utilities.md)

### URL Handling

Comprehensive URL processing including browser-compatible URL sanitization, query parameter manipulation, data URI parsing, and canonicalization with support for various URL standards.

```python { .api }
def safe_url_string(url, encoding='utf8', path_encoding='utf8', quote_path=True): ...
def url_query_parameter(url, parameter, default=None, keep_blank_values=0): ...
def url_query_cleaner(url, parameterlist=(), sep='&', kvsep='=', remove=False, unique=True, keep_fragments=False): ...
def canonicalize_url(url, keep_blank_values=True, keep_fragments=False, encoding=None): ...
def parse_data_uri(uri): ...
```

[URL Handling](./url-handling.md)

### Encoding Detection

Character encoding detection from HTTP Content-Type headers, HTML meta tags, XML declarations, and byte order marks, with smart fallback handling and encoding alias resolution.

```python { .api }
def html_to_unicode(content_type_header, html_body_str, default_encoding='utf8', auto_detect_fun=None): ...
def http_content_type_encoding(content_type): ...
def html_body_declared_encoding(html_body_str): ...
def resolve_encoding(encoding_alias): ...
```

[Encoding Detection](./encoding-detection.md)

### Utilities

Core utility functions for converting between string and bytes representations with robust encoding support and error handling.

```python { .api }
def to_unicode(text, encoding=None, errors='strict'): ...
def to_bytes(text, encoding=None, errors='strict'): ...
```

[Utilities](./utilities.md)

## Common Types

```python { .api }
# Type aliases used across the library
StrOrBytes = Union[str, bytes]

# HTTP header types
HeadersDictInput = Mapping[bytes, Union[Any, Sequence[bytes]]]
HeadersDictOutput = MutableMapping[bytes, list[bytes]]

# Data URI parsing result
class ParseDataURIResult(NamedTuple):
    media_type: str
    media_type_parameters: dict[str, str]
    data: bytes
```

## Error Handling

w3lib functions follow consistent error handling patterns:

- Invalid input types raise `TypeError`
- Encoding errors are handled gracefully with replacement characters (`\ufffd`)
- URL parsing errors may raise `ValueError` for malformed input
- Most functions return safe defaults (empty strings, `None`) rather than raising exceptions
- Functions accept both string and bytes input to minimize conversion overhead

## Performance Considerations

- Compiled regular expressions are cached and reused across function calls
- Functions are optimized for web scraping workloads with large volumes of content
- Memory-efficient processing of HTML content avoids unnecessary string duplication
- Support for both string and bytes inputs reduces encoding/decoding overhead
- Character encoding detection uses fast heuristics before falling back to comprehensive analysis