# w3lib

A comprehensive Python library providing essential web-related utility functions for HTML manipulation, HTTP header processing, URL handling, and character encoding detection. Originally developed as a foundational component of the Scrapy web scraping framework, w3lib offers production-tested utilities for web crawlers, data extraction tools, and content processing pipelines.

## Package Information

- **Package Name**: w3lib
- **Language**: Python
- **Installation**: `pip install w3lib`
- **Version**: 2.3.1
- **License**: BSD
- **Documentation**: https://w3lib.readthedocs.io/en/latest/
- **Repository**: https://github.com/scrapy/w3lib

## Core Imports

```python
import w3lib
```

Module-specific imports:

```python
from w3lib.html import replace_entities, remove_tags, get_base_url
from w3lib.http import basic_auth_header, headers_raw_to_dict
from w3lib.url import safe_url_string, url_query_parameter, canonicalize_url
from w3lib.encoding import html_to_unicode, resolve_encoding
from w3lib.util import to_unicode, to_bytes
```

## Basic Usage

```python
from w3lib.html import replace_entities, remove_tags, get_base_url
from w3lib.url import safe_url_string, url_query_parameter
from w3lib.http import basic_auth_header
from w3lib.encoding import html_to_unicode

# HTML processing - clean up HTML content
html = '<p>Price: &pound;100 <b>only!</b></p>'
clean_text = replace_entities(html)  # 'Price: £100 <b>only!</b>'
text_only = remove_tags(clean_text)  # 'Price: £100 only!'

# URL handling - make URLs safe and extract parameters
unsafe_url = 'http://example.com/search?q=hello world&price=£100'
safe_url = safe_url_string(unsafe_url)  # Properly encoded URL
query_param = url_query_parameter(safe_url, 'q')  # 'hello world'

# HTTP utilities - create authentication headers
auth_header = basic_auth_header('user', 'password')  # b'Basic dXNlcjpwYXNzd29yZA=='

# Encoding detection - convert HTML to Unicode
raw_html = b'<html><meta charset="utf-8"><body>Caf\xc3\xa9</body></html>'
encoding, unicode_html = html_to_unicode(None, raw_html)  # ('utf-8', '<html>...')
```

## Architecture

w3lib is organized into focused modules, each handling a specific web processing task:

- **HTML Module**: Entity translation, tag manipulation, base URL extraction, meta refresh parsing
- **HTTP Module**: Header format conversion, authentication header generation
- **URL Module**: URL sanitization, parameter manipulation, encoding normalization, data URI parsing
- **Encoding Module**: Character encoding detection from HTTP headers, HTML meta tags, and BOMs
- **Utilities Module**: Core string/bytes conversion functions used throughout the library

This modular design allows developers to import only the functionality they need while maintaining consistent interfaces and error handling across all components.

## Capabilities

### HTML Processing

Comprehensive HTML manipulation including entity conversion, tag removal, comment stripping, base URL extraction, and meta refresh parsing. Handles both string and bytes input with robust encoding support.

```python { .api }
def replace_entities(text, keep=(), remove_illegal=True, encoding='utf-8'): ...
def remove_tags(text, which_ones=(), keep=(), encoding=None): ...
def remove_comments(text, encoding=None): ...
def get_base_url(text, baseurl='', encoding='utf-8'): ...
def get_meta_refresh(text, baseurl='', encoding='utf-8', ignore_tags=('script', 'noscript')): ...
```

[HTML Processing](./html-processing.md)

### HTTP Utilities

HTTP header processing utilities for converting between raw header formats and dictionaries, plus HTTP Basic Authentication header generation.

```python { .api }
def headers_raw_to_dict(headers_raw): ...
def headers_dict_to_raw(headers_dict): ...
def basic_auth_header(username, password, encoding='ISO-8859-1'): ...
```

[HTTP Utilities](./http-utilities.md)

### URL Handling

Comprehensive URL processing including browser-compatible URL sanitization, query parameter manipulation, data URI parsing, and canonicalization with support for various URL standards.

```python { .api }
def safe_url_string(url, encoding='utf8', path_encoding='utf8', quote_path=True): ...
def url_query_parameter(url, parameter, default=None, keep_blank_values=0): ...
def url_query_cleaner(url, parameterlist=(), sep='&', kvsep='=', remove=False, unique=True, keep_fragments=False): ...
def canonicalize_url(url, keep_blank_values=True, keep_fragments=False, encoding=None): ...
def parse_data_uri(uri): ...
```

[URL Handling](./url-handling.md)

### Encoding Detection

Character encoding detection from HTTP Content-Type headers, HTML meta tags, XML declarations, and byte order marks, with smart fallback handling and encoding alias resolution.

```python { .api }
def html_to_unicode(content_type_header, html_body_str, default_encoding='utf8', auto_detect_fun=None): ...
def http_content_type_encoding(content_type): ...
def html_body_declared_encoding(html_body_str): ...
def resolve_encoding(encoding_alias): ...
```

[Encoding Detection](./encoding-detection.md)

### Utilities

Core utility functions for converting between string and bytes representations with robust encoding support and error handling.

```python { .api }
def to_unicode(text, encoding=None, errors='strict'): ...
def to_bytes(text, encoding=None, errors='strict'): ...
```

[Utilities](./utilities.md)

## Common Types

```python { .api }
from typing import Any, Mapping, MutableMapping, NamedTuple, Sequence, Union

# Type aliases used across the library
StrOrBytes = Union[str, bytes]

# HTTP header types
HeadersDictInput = Mapping[bytes, Union[Any, Sequence[bytes]]]
HeadersDictOutput = MutableMapping[bytes, list[bytes]]

# Data URI parsing result
class ParseDataURIResult(NamedTuple):
    media_type: str
    media_type_parameters: dict[str, str]
    data: bytes
```

## Error Handling

w3lib functions follow consistent error handling patterns:

- Invalid input types raise `TypeError`
- Encoding errors are handled gracefully with replacement characters (`\ufffd`)
- URL parsing errors may raise `ValueError` for malformed input
- Most functions return safe defaults (empty strings, `None`) rather than raising exceptions
- Functions accept both string and bytes input to minimize conversion overhead

## Performance Considerations

- Compiled regular expressions are cached and reused across function calls
- Functions are optimized for web scraping workloads with large volumes of content
- Memory-efficient processing of HTML content avoids unnecessary string duplication
- Support for both string and bytes inputs reduces encoding/decoding overhead
- Character encoding detection uses fast heuristics before falling back to comprehensive analysis