# String Processing

Simple encoding and decoding functions for processing individual strings. These functions cover the most common use case for encoding and decoding, with proper BOM detection and WHATWG-compliant behavior.

## Capabilities

### Single String Decoding

Decode a byte string to Unicode. A BOM, if present, takes precedence over the fallback encoding declaration.

```python { .api }
def decode(input: bytes, fallback_encoding: Encoding | str, errors: str = 'replace') -> tuple[str, Encoding]:
    """
    Decode a single byte string with BOM detection.

    Args:
        input: Byte string to decode
        fallback_encoding: Encoding object or label string to use if no BOM detected
        errors: Error handling strategy ('replace', 'strict', 'ignore', etc.)

    Returns:
        Tuple of (decoded_unicode_string, encoding_used)

    Raises:
        LookupError: If fallback_encoding label is unknown
    """
```

The function first checks the start of the input for a UTF-8, UTF-16LE, or UTF-16BE BOM. If one is found, the BOM is removed and the detected encoding is used, regardless of `fallback_encoding`. Otherwise, the input is decoded with the fallback encoding.
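
The BOM-first rule described above can be sketched with the standard library alone. This is an illustration of the idea, not the library's implementation: `sniff_bom`, `decode_sketch`, and `_BOMS` are hypothetical names, and the sketch works with Python codec names rather than WHATWG labels or `Encoding` objects.

```python
import codecs

# Hypothetical table for illustration; not part of webencodings.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
)


def sniff_bom(data: bytes, fallback: str) -> tuple[str, bytes]:
    """Return (python_codec_name, payload) with any leading BOM removed."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            # A BOM wins over the caller's fallback and is stripped.
            return codec_name, data[len(bom):]
    # No BOM: keep the input untouched and use the fallback.
    return fallback, data


def decode_sketch(data: bytes, fallback: str, errors: str = "replace") -> tuple[str, str]:
    """Decode `data`, letting a BOM override `fallback` (assumed behavior)."""
    codec_name, payload = sniff_bom(data, fallback)
    return payload.decode(codec_name, errors), codec_name


print(decode_sketch(b"\xef\xbb\xbfHello", "cp1252"))
print(decode_sketch(b"caf\xe9", "cp1252"))
```

The first call is decoded as UTF-8 because of the BOM. The second has no BOM, so the fallback codec is used.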

### Single String Encoding

Encode a Unicode string to bytes using the specified encoding.

```python { .api }
def encode(input: str, encoding: Encoding | str = UTF8, errors: str = 'strict') -> bytes:
    """
    Encode a Unicode string to bytes.

    Args:
        input: Unicode string to encode
        encoding: Encoding object or label string (defaults to UTF-8)
        errors: Error handling strategy ('strict', 'replace', 'ignore', etc.)

    Returns:
        Encoded byte string

    Raises:
        LookupError: If encoding label is unknown
    """
```
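
To make the `errors` parameter concrete, the sketch below compares several handler names using Python's built-in `cp1252` codec. It assumes that `encode` hands `errors` to Python's codec machinery unchanged, which this page does not state. `try_encode` is a hypothetical helper, and the code does not call webencodings at all.

```python
from typing import Optional

# U+2603 SNOWMAN has no byte in windows-1252 (Python codec name: cp1252).
text = "snow \u2603"


def try_encode(errors: str) -> Optional[bytes]:
    """Hypothetical helper: encode `text` as cp1252 with the given handler."""
    try:
        return text.encode("cp1252", errors=errors)
    except UnicodeEncodeError:
        # 'strict' raises instead of substituting or dropping the character.
        return None


handlers = ("strict", "replace", "ignore", "xmlcharrefreplace")
results = {name: try_encode(name) for name in handlers}

for name, value in results.items():
    print(f"{name}: {value!r}")
```

With `'strict'` (the default for `encode`) the unencodable character raises an exception. The other handlers substitute it, drop it, or emit a numeric character reference.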

## Usage Examples

```python
import webencodings

# Decode with BOM detection
utf8_bom_data = b'\xef\xbb\xbfHello World'
text, encoding = webencodings.decode(utf8_bom_data, 'iso-8859-1')
print(text)  # 'Hello World'
print(encoding.name)  # 'utf-8' (BOM detected, fallback ignored)

# Decode without BOM uses fallback
latin_data = b'caf\xe9'  # 'café' in latin-1
text, encoding = webencodings.decode(latin_data, 'iso-8859-1')
print(text)  # 'café'
print(encoding.name)  # 'windows-1252' (iso-8859-1 maps to windows-1252)

# Handle UTF-16 BOM
utf16_data = b'\xff\xfeH\x00e\x00l\x00l\x00o\x00'  # UTF-16LE BOM + 'Hello'
text, encoding = webencodings.decode(utf16_data, 'utf-8')
print(text)  # 'Hello'
print(encoding.name)  # 'utf-16le'

# Encoding strings
text = "Hello World"
data = webencodings.encode(text, 'utf-8')
print(data)  # b'Hello World'

# Use predefined UTF8 constant
data = webencodings.encode(text, webencodings.UTF8)
print(data)  # b'Hello World'

# Handle encoding errors
# Per the WHATWG label table, 'ascii' resolves to windows-1252, so 'é' is
# encodable; the snowman (U+2603) is not and gets replaced.
text = "café ☃"
data = webencodings.encode(text, 'ascii', errors='replace')
print(data)  # b'caf\xe9 ?'

# Encode with different encodings
# 'latin1' is a WHATWG label for windows-1252; 'latin-1' (hyphenated) is not
# a recognized label and would raise LookupError.
text = "café"
utf8_data = webencodings.encode(text, 'utf-8')
latin1_data = webencodings.encode(text, 'latin1')
print(utf8_data)  # b'caf\xc3\xa9'
print(latin1_data)  # b'caf\xe9'
```