# Configuration and Types

Configuration classes and types for controlling ftfy's behavior, including options for each fix step and the data structures used to explain fixes.

## Capabilities

### Text Fixer Configuration

`TextFixerConfig` is a named tuple that controls every step of ftfy's text processing. Every field has a sensible default.

```python { .api }
from typing import Literal, NamedTuple

class TextFixerConfig(NamedTuple):
    """
    Configuration for all ftfy text processing options.

    Implemented as a NamedTuple with defaults. Instantiate it with keyword
    arguments for the values you want to change from their defaults.

    Attributes:
        unescape_html: HTML entity handling ("auto"|True|False)
        remove_terminal_escapes: Remove ANSI escape sequences (bool)
        fix_encoding: Detect and fix mojibake (bool)
        restore_byte_a0: Allow a space to stand for a non-breaking space in mojibake (bool)
        replace_lossy_sequences: Fix mojibake partially replaced by � or ? (bool)
        decode_inconsistent_utf8: Fix inconsistent UTF-8 sequences (bool)
        fix_c1_controls: Replace C1 controls with Windows-1252 equivalents (bool)
        fix_latin_ligatures: Replace ligatures with their letters (bool)
        fix_character_width: Normalize fullwidth/halfwidth chars (bool)
        uncurl_quotes: Convert curly quotes to straight quotes (bool)
        fix_line_breaks: Standardize line breaks to \\n (bool)
        fix_surrogates: Fix UTF-16 surrogate sequences (bool)
        remove_control_chars: Remove unnecessary control chars (bool)
        normalization: Unicode normalization form (str|None)
        max_decode_length: Maximum segment size for processing (int)
        explain: Whether to compute explanations (bool)
    """
    unescape_html: str | bool = "auto"
    remove_terminal_escapes: bool = True
    fix_encoding: bool = True
    restore_byte_a0: bool = True
    replace_lossy_sequences: bool = True
    decode_inconsistent_utf8: bool = True
    fix_c1_controls: bool = True
    fix_latin_ligatures: bool = True
    fix_character_width: bool = True
    uncurl_quotes: bool = True
    fix_line_breaks: bool = True
    fix_surrogates: bool = True
    remove_control_chars: bool = True
    normalization: Literal["NFC", "NFD", "NFKC", "NFKD"] | None = "NFC"
    max_decode_length: int = 1000000
    explain: bool = True
```

### Explanation Types

Data structures that represent the explanation of a fix: the overall result and the individual transformation steps.

```python { .api }
from __future__ import annotations  # allows the forward reference to ExplanationStep

from typing import NamedTuple

class ExplainedText(NamedTuple):
    """
    Return type for ftfy functions that provide explanations.

    Contains both the fixed text and an optional explanation of the
    transformations applied. When explain=False, explanation is None.

    Attributes:
        text: The processed text result (str)
        explanation: List of transformation steps or None (list[ExplanationStep]|None)
    """
    text: str
    explanation: list[ExplanationStep] | None

class ExplanationStep(NamedTuple):
    """
    Single step in a text transformation explanation.

    Describes one transformation applied during text processing: an
    action type plus a parameter specifying the operation performed.

    Attributes:
        action: Type of transformation (str)
        parameter: Encoding name or function name (str)

    Actions:
        "encode": Convert string to bytes with the specified encoding
        "decode": Convert bytes to string with the specified encoding
        "transcode": Convert bytes to bytes with the named function
        "apply": Convert string to string with the named function
        "normalize": Apply Unicode normalization
    """
    action: str
    parameter: str
```
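To make the step format concrete, here is a minimal stdlib-only sketch that replays a list of steps. The names `Step` and `replay` are hypothetical stand-ins and not part of ftfy, and the sketch handles only the `"encode"`, `"decode"`, and `"normalize"` actions, since `"transcode"` and `"apply"` refer to functions inside ftfy.

```python
import unicodedata
from typing import List, NamedTuple


class Step(NamedTuple):
    """Hypothetical stand-in with the same shape as ExplanationStep."""
    action: str
    parameter: str


def replay(text: str, steps: List[Step]) -> str:
    """Re-apply encode/decode/normalize steps, in order, to a string."""
    value = text  # str, or bytes between an "encode" and a "decode" step
    for step in steps:
        if step.action == "encode":
            value = value.encode(step.parameter)
        elif step.action == "decode":
            value = value.decode(step.parameter)
        elif step.action == "normalize":
            value = unicodedata.normalize(step.parameter, value)
        else:
            # "transcode" and "apply" name ftfy functions; out of scope here
            raise ValueError(f"unsupported action: {step.action}")
    return value


# "só" misread as Latin-1 shows up as "sÃ³"; these two steps undo the misreading
plan = [Step("encode", "latin-1"), Step("decode", "utf-8")]
print(replay("s\u00c3\u00b3", plan))  # só
```

For real use, ftfy provides its own `apply_plan` function for re-applying an explanation; the sketch only illustrates what the action and parameter pairs mean.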

## Configuration Details

### HTML Entity Handling

The `unescape_html` option controls HTML entity processing:

- `"auto"` (default): Decode entities unless a literal `<` appears in the text (which suggests the input is actual HTML)
- `True`: Always decode HTML entities, such as `&amp;` → `&`
- `False`: Never decode HTML entities

```python
from ftfy import TextFixerConfig, fix_text

# Auto mode - detects HTML context
config = TextFixerConfig(unescape_html="auto")
fix_text("&amp; text", config)    # → "& text" (no < detected)
fix_text("<p>&amp;</p>", config)  # → "<p>&amp;</p>" (< detected, preserve entities)

# Always decode entities
config = TextFixerConfig(unescape_html=True)
fix_text("<p>&amp;</p>", config)  # → "<p>&</p>"

# Never decode entities
config = TextFixerConfig(unescape_html=False)
fix_text("&amp; text", config)    # → "&amp; text"
```
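The `"auto"` rule described above can be approximated in a few lines of standard-library Python. This is only a sketch of the decision rule: `unescape_auto` is a hypothetical helper, and `html.unescape` is the stdlib decoder, which may not treat every entity the same way ftfy does.

```python
import html


def unescape_auto(text: str) -> str:
    """Decode entities only when no literal '<' suggests real HTML markup."""
    if "<" in text:
        # Looks like markup: leave the entities alone
        return text
    return html.unescape(text)


print(unescape_auto("&amp; text"))    # & text
print(unescape_auto("<p>&amp;</p>"))  # <p>&amp;</p>
```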

### Encoding Detection Options

Several options control how aggressively ftfy detects and repairs mojibake:

```python
from ftfy import TextFixerConfig

# Conservative encoding fixing - fewer false positives
conservative = TextFixerConfig(
    restore_byte_a0=False,          # Don't interpret spaces as non-breaking spaces
    replace_lossy_sequences=False,  # Don't fix partial mojibake
    decode_inconsistent_utf8=False  # Don't fix inconsistent UTF-8
)

# Aggressive encoding fixing - more corrections (these are the default values)
aggressive = TextFixerConfig(
    restore_byte_a0=True,           # Allow space → non-breaking space
    replace_lossy_sequences=True,   # Fix sequences with � or ?
    decode_inconsistent_utf8=True   # Fix inconsistent UTF-8 patterns
)
```
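To see why an option like `restore_byte_a0` exists, it helps to reproduce the damage with the standard library. The sketch below assumes a simple Latin-1 misreading of UTF-8 bytes; `try_repair` is a hypothetical helper, and ftfy's actual detection uses more encodings and heuristics than this.

```python
from typing import Optional

# "à" is the bytes C3 A0 in UTF-8; misread as Latin-1 it becomes "Ã" + U+00A0
mojibake = "\u00e0".encode("utf-8").decode("latin-1")
assert mojibake == "\u00c3\u00a0"

# Some later cleanup step turns the non-breaking space into a plain space
damaged = mojibake.replace("\u00a0", " ")


def try_repair(text: str) -> Optional[str]:
    """Reverse the Latin-1 misreading; return None if the bytes are not valid UTF-8."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return None


print(try_repair(mojibake))  # à
print(try_repair(damaged))   # None, because C3 20 is not valid UTF-8

# Treating the space as the lost byte A0 makes the text repairable again
restored = damaged.replace("\u00c3 ", "\u00c3\u00a0")
print(try_repair(restored))  # à
```

The trade-off is visible in the last step: treating a plain space as byte A0 recovers more text, but it also risks changing a space that was meant to be a space.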

### Character Normalization Options

These options control individual character-level fixes:

```python
from ftfy import TextFixerConfig

# Minimal character normalization
minimal = TextFixerConfig(
    fix_latin_ligatures=False,  # Keep ligatures like ﬁ
    fix_character_width=False,  # Keep fullwidth characters
    uncurl_quotes=False,        # Keep curly quotes
    fix_line_breaks=False       # Keep original line endings
)

# Text cleaning for terminal display (all four are the default values)
terminal = TextFixerConfig(
    remove_terminal_escapes=True,  # Remove ANSI escapes
    remove_control_chars=True,     # Remove control characters
    fix_character_width=True,      # Normalize character widths
    normalization="NFC"            # Canonical Unicode form
)
```
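As a rough picture of what these character-level fixes do, the sketch below builds a small translation table by hand. `SIMPLIFY` and `simplify` are hypothetical names, and the table covers only a handful of characters; ftfy's own tables and line-break handling are broader.

```python
# Hypothetical, deliberately tiny table; not ftfy's actual mapping
SIMPLIFY = {
    0x2018: "'", 0x2019: "'",    # curly single quotes
    0x201C: '"', 0x201D: '"',    # curly double quotes
    0xFB01: "fi", 0xFB02: "fl",  # Latin ligatures
}
# Fullwidth ASCII variants U+FF01..U+FF5E sit exactly 0xFEE0 above U+0021..U+007E
SIMPLIFY.update({cp: chr(cp - 0xFEE0) for cp in range(0xFF01, 0xFF5F)})


def simplify(text: str) -> str:
    """Normalize CRLF/CR line breaks, then apply the translation table."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.translate(SIMPLIFY)


# Curly quotes, an "ﬁ" ligature, fullwidth "ＡＢＣ", and a Windows line ending
print(repr(simplify("\u201c\ufb01le\u201d \uff21\uff22\uff23\r\n")))
```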

### Unicode Normalization

The `normalization` option selects which Unicode normalization form ftfy applies, or disables normalization:

```python
from ftfy import TextFixerConfig, fix_text

# NFC - Canonical decomposition followed by canonical composition (default)
nfc_config = TextFixerConfig(normalization="NFC")
fix_text("cafe\u0301", nfc_config)  # Combines e + combining acute into a single é

# NFD - Canonical decomposition
nfd_config = TextFixerConfig(normalization="NFD")
fix_text("caf\u00e9", nfd_config)   # Separates é into e + combining acute (U+0301)

# NFKC - Compatibility normalization (can change meaning)
nfkc_config = TextFixerConfig(normalization="NFKC")
fix_text("10³", nfkc_config)        # → "103" (loses superscript)

# No normalization
no_norm = TextFixerConfig(normalization=None)
```
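The difference between the forms is easiest to see with Python's `unicodedata` module directly. This sketch shows the normalization forms themselves, independent of ftfy:

```python
import unicodedata

composed = "caf\u00e9"     # é as a single code point (U+00E9)
decomposed = "cafe\u0301"  # e followed by a combining acute accent (U+0301)

# NFC composes, NFD decomposes; both strings display identically
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(len(composed), len(decomposed))                        # 4 5

# Compatibility forms also fold characters such as superscripts
print(unicodedata.normalize("NFKC", "10\u00b3"))             # 103
print(unicodedata.normalize("NFC", "10\u00b3"))              # 10³
```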

## Usage Examples

### Basic Configuration

```python
from ftfy import TextFixerConfig, fix_text

# Use defaults
config = TextFixerConfig()

# Change specific options (fix_encoding=True is already the default)
config = TextFixerConfig(uncurl_quotes=False, fix_encoding=True)

# Create variations; _replace returns a new config and leaves the original unchanged
no_html = config._replace(unescape_html=False)
conservative = config._replace(restore_byte_a0=False, replace_lossy_sequences=False)
```
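Because the configuration is a named tuple, the standard named-tuple helpers work on it. The sketch below uses a hypothetical `DemoConfig` stand-in with three of the fields so that it runs without ftfy installed; `changed_options` is likewise a hypothetical helper that lists which options differ from their defaults.

```python
from typing import NamedTuple


class DemoConfig(NamedTuple):
    """Hypothetical stand-in with three of TextFixerConfig's fields."""
    unescape_html: object = "auto"
    uncurl_quotes: bool = True
    normalization: object = "NFC"


_MISSING = object()  # sentinel for fields that have no default


def changed_options(config) -> dict:
    """Return only the fields whose values differ from the class defaults."""
    defaults = type(config)._field_defaults
    return {
        name: value
        for name, value in config._asdict().items()
        if defaults.get(name, _MISSING) != value
    }


base = DemoConfig(uncurl_quotes=False)
variant = base._replace(normalization=None)  # new tuple; base is untouched

print(changed_options(base))     # {'uncurl_quotes': False}
print(changed_options(variant))  # {'uncurl_quotes': False, 'normalization': None}
```

The same helper should work on any named tuple with defaults, which is a convenient way to log the non-default options a job ran with.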

### Keyword Arguments

```python
from ftfy import TextFixerConfig, fix_text

text = "some text to fix"  # placeholder input

# Pass config options as kwargs (equivalent to building a config object)
result = fix_text(text, uncurl_quotes=False, normalization="NFD")

# Mix config object and kwargs (kwargs override config)
config = TextFixerConfig(uncurl_quotes=False)
result = fix_text(text, config, normalization="NFD")  # NFD overrides config
```
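The override rule (keyword arguments win over the config object) can be sketched with `_replace`. Everything here is illustrative: `DemoConfig` and `resolve_config` are hypothetical names, and the unknown-option check is an assumption about reasonable behavior rather than a description of ftfy's internals.

```python
from typing import NamedTuple, Optional


class DemoConfig(NamedTuple):
    """Hypothetical stand-in with two of TextFixerConfig's fields."""
    uncurl_quotes: bool = True
    normalization: Optional[str] = "NFC"


def resolve_config(config: Optional[DemoConfig] = None, **kwargs) -> DemoConfig:
    """Start from config (or the defaults) and let keyword arguments override it."""
    if config is None:
        config = DemoConfig()
    unknown = set(kwargs) - set(config._fields)
    if unknown:
        raise TypeError(f"unknown option(s): {sorted(unknown)}")
    return config._replace(**kwargs)


base = DemoConfig(uncurl_quotes=False)
print(resolve_config(base, normalization="NFD"))
# DemoConfig(uncurl_quotes=False, normalization='NFD')
print(resolve_config(uncurl_quotes=False))
# DemoConfig(uncurl_quotes=False, normalization='NFC')
```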

### Working with Explanations

```python
from ftfy import TextFixerConfig, fix_and_explain

# Get detailed explanations
result = fix_and_explain("sÃ³")
print(f"Text: {result.text}")  # Text: só
print(f"Steps: {len(result.explanation)} transformations")

for step in result.explanation:
    print(f"  {step.action}: {step.parameter}")

# Disable explanations for performance
config = TextFixerConfig(explain=False)
result = fix_and_explain("sÃ³", config)
print(result.explanation)  # None
```
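Since `explanation` can be `None`, code that reports on fixes needs to handle that case. A small sketch of a formatter follows; `describe` is a hypothetical helper, and it accepts plain `(action, parameter)` tuples, which is the shape an `ExplanationStep` unpacks to.

```python
from typing import Optional, Sequence, Tuple


def describe(explanation: Optional[Sequence[Tuple[str, str]]]) -> str:
    """Render an explanation as one line, handling None and empty lists."""
    if explanation is None:
        return "explanations disabled"
    if not explanation:
        return "no steps recorded"
    return " -> ".join(f"{action}({parameter})" for action, parameter in explanation)


print(describe([("encode", "latin-1"), ("decode", "utf-8")]))
# encode(latin-1) -> decode(utf-8)
print(describe([]))    # no steps recorded
print(describe(None))  # explanations disabled
```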

### Performance Tuning

```python
from ftfy import TextFixerConfig

# Process large texts in smaller segments
large_text_config = TextFixerConfig(max_decode_length=500000)

# Skip expensive operations for simple cleaning
fast_config = TextFixerConfig(
    fix_encoding=False,   # Skip mojibake detection
    unescape_html=False,  # Skip HTML processing
    explain=False         # Skip explanation generation
)

# Text-only cleaning (no encoding fixes)
text_only = TextFixerConfig(
    fix_encoding=False,
    unescape_html=False,
    remove_terminal_escapes=True,
    fix_character_width=True,
    uncurl_quotes=True,
    fix_line_breaks=True
)
```
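To illustrate what a segment limit means in practice, here is a sketch of one way to cut text into segments: line by line, with overly long lines capped at a maximum length. `iter_segments` is a hypothetical helper, and the line-based splitting is an assumption made for illustration, not a statement about how ftfy segments text internally.

```python
from typing import Iterator


def iter_segments(text: str, max_length: int) -> Iterator[str]:
    """Yield line-sized segments, capping each one at max_length characters."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    pos = 0
    while pos < len(text):
        end = text.find("\n", pos) + 1  # keep the newline with its line
        if end == 0:                    # no newline left: take the rest
            end = len(text)
        if end - pos > max_length:      # line too long: cut it at the limit
            end = pos + max_length
        yield text[pos:end]
        pos = end


segments = list(iter_segments("ab\ncdefgh\nij", max_length=4))
print(segments)  # ['ab\n', 'cdef', 'gh\n', 'ij']
```

Joining the segments reproduces the input exactly, so each piece can be processed independently and the results concatenated.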