# Individual Text Fixes

Individual transformation functions for specific text problems such as HTML entities, terminal escapes, character width, quotes, and line breaks. They can be called independently, and the main text-fixing functions also apply them automatically.

## Capabilities

### HTML and Markup Processing

Functions for handling HTML entities and markup-related text issues.

```python { .api }
def unescape_html(text: str) -> str:
    """
    Convert HTML entities to Unicode characters.

    Robust replacement for html.unescape that handles malformed entities
    and common entity mistakes such as all-caps names. Converts entities
    like &amp; → &, &lt; → <.

    Args:
        text: String potentially containing HTML entities

    Returns:
        String with HTML entities converted to Unicode characters

    Examples:
        >>> unescape_html("&amp; &lt;tag&gt;")
        '& <tag>'
        >>> unescape_html("&EACUTE;")  # Handles incorrect capitalization
        'É'
    """
```
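
To make the capitalization handling concrete, here is a minimal standard-library sketch. It is not ftfy's implementation: the helper name `unescape_with_caps_fallback` and its regular expression are hypothetical, and it assumes that an all-caps entity name with no HTML5 definition should decode to the uppercase form of its lowercase counterpart.

```python
import html
import re
from html.entities import html5

# Hypothetical pattern: entity names written entirely in capitals, e.g. &EACUTE;
ALL_CAPS_ENTITY = re.compile(r"&([A-Z][A-Z0-9]*);")


def unescape_with_caps_fallback(text: str) -> str:
    """Decode HTML entities, tolerating all-caps names (illustrative sketch)."""

    def replace(match):
        name = match.group(1)
        lowered = name.lower() + ";"
        # Only step in when the all-caps name is not a real HTML5 entity
        # but its lowercase form is.
        if name + ";" not in html5 and lowered in html5:
            return html5[lowered].upper()
        return match.group(0)

    # Resolve the all-caps names first, then let the standard library
    # decode everything else.
    return html.unescape(ALL_CAPS_ENTITY.sub(replace, text))


print(unescape_with_caps_fallback("&amp; &lt;tag&gt;"))  # & <tag>
print(unescape_with_caps_fallback("P&EACUTE;REZ"))       # PÉREZ
```

Plain `html.unescape` leaves `&EACUTE;` untouched, which is the gap this kind of fallback is meant to cover.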

### Terminal and Control Characters

Functions for cleaning terminal escapes and control characters.

```python { .api }
def remove_terminal_escapes(text: str) -> str:
    """
    Remove ANSI terminal escape sequences.

    Strips color codes, cursor positioning, and other ANSI escape
    sequences commonly found in terminal output or log files.

    Args:
        text: String potentially containing ANSI escape sequences

    Returns:
        String with terminal escapes removed

    Examples:
        >>> remove_terminal_escapes("\\x1b[31mRed text\\x1b[0m")
        'Red text'
        >>> remove_terminal_escapes("\\x1b[2J\\x1b[HClear screen")
        'Clear screen'
    """

def remove_control_chars(text: str) -> str:
    """
    Remove unnecessary Unicode control characters.

    Removes control characters that have no visual effect and are
    typically unwanted artifacts in text processing.

    Args:
        text: String potentially containing control characters

    Returns:
        String with control characters removed
    """

def remove_bom(text: str) -> str:
    """
    Remove byte order marks (BOM) from text.

    Strips Unicode BOM characters that sometimes appear at the
    beginning of text files or strings.

    Args:
        text: String potentially starting with BOM

    Returns:
        String with BOM removed
    """
```
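
`remove_control_chars` and `remove_bom` have no examples above, so the following standard-library sketch illustrates the idea behind them. The helper names are hypothetical, and the set of characters removed here (ASCII control characters other than tab, line feed, and carriage return) is a simplifying assumption; the set the library actually removes may differ.

```python
# Assumption: these whitespace controls carry meaning and should be kept.
KEPT_CONTROLS = {"\t", "\n", "\r"}


def strip_leading_bom(text: str) -> str:
    """Drop U+FEFF characters from the start of a string."""
    return text.lstrip("\ufeff")


def strip_ascii_controls(text: str) -> str:
    """Drop C0 control characters and DEL, except tab, LF, and CR."""
    return "".join(
        ch
        for ch in text
        if ch in KEPT_CONTROLS or not (ord(ch) < 0x20 or ord(ch) == 0x7F)
    )


print(repr(strip_leading_bom("\ufeffhello")))       # 'hello'
print(repr(strip_ascii_controls("a\x00b\x07c\n")))  # 'abc\n'
```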

### Quote and Punctuation Fixes

Functions for normalizing quotes and punctuation characters.

```python { .api }
def uncurl_quotes(text: str) -> str:
    """
    Convert curly quotes to straight ASCII quotes.

    Replaces Unicode quotation marks with ASCII equivalents:
    ‘ ’ → ', “ ” → ". Useful for systems requiring ASCII-only text.

    Args:
        text: String containing curly quotes

    Returns:
        String with straight ASCII quotes

    Examples:
        >>> uncurl_quotes("It’s “quoted” text")
        'It\\'s "quoted" text'
        >>> uncurl_quotes("‘single’ and “double” quotes")
        '\\'single\\' and "double" quotes'
    """
```

### Character Width and Typography

Functions for normalizing character width and typographic elements.

```python { .api }
def fix_character_width(text: str) -> str:
    """
    Normalize fullwidth and halfwidth characters.

    Converts fullwidth Latin characters to normal width and halfwidth
    Katakana to normal width for consistent display and processing.

    Args:
        text: String containing width-variant characters

    Returns:
        String with normalized character widths

    Examples:
        >>> fix_character_width("ＬＯＵＤ　ＮＯＩＳＥＳ")
        'LOUD NOISES'
        >>> fix_character_width("ﾊﾝｶｸ")  # Halfwidth Katakana
        'ハンカク'
    """

def fix_latin_ligatures(text: str) -> str:
    """
    Replace Latin ligatures with individual letters.

    Converts typographic ligatures like ﬁ, ﬂ back to individual
    characters (fi, fl) for searchability and processing.

    Args:
        text: String containing Latin ligatures

    Returns:
        String with ligatures replaced by letter sequences

    Examples:
        >>> fix_latin_ligatures("ﬁle and ﬂower")
        'file and flower'
        >>> fix_latin_ligatures("oﬃce")
        'office'
    """
```
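
One way to picture width normalization is as a translation table built from Unicode compatibility mappings. The sketch below uses only the standard library; `build_width_map` and `normalize_width` are hypothetical names, and restricting the table to the Halfwidth and Fullwidth Forms block plus the ideographic space is an assumption made for illustration.

```python
import unicodedata


def build_width_map() -> dict:
    """Map width-variant code points to their NFKC-normalized form."""
    width_map = {0x3000: " "}  # ideographic space -> ASCII space
    for codepoint in range(0xFF01, 0xFFF0):
        char = chr(codepoint)
        normalized = unicodedata.normalize("NFKC", char)
        if normalized != char:
            width_map[codepoint] = normalized
    return width_map


WIDTH_MAP = build_width_map()


def normalize_width(text: str) -> str:
    """Replace only width variants, leaving all other characters alone."""
    return text.translate(WIDTH_MAP)


print(normalize_width("ＬＯＵＤ　ＮＯＩＳＥＳ"))  # LOUD NOISES
print(normalize_width("ﾊﾝｶｸ"))                  # ハンカク
```

Applying NFKC to the whole string would also expand ligatures such as ﬁ, but it rewrites many unrelated characters too, which is why a targeted table is used here.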

### Line Break and Whitespace Normalization

Functions for standardizing line breaks and whitespace.

```python { .api }
def fix_line_breaks(text: str) -> str:
    """
    Standardize line breaks to Unix format (\\n).

    Converts Windows (\\r\\n), classic Mac OS (\\r), and other line
    ending variations to standard Unix newlines. Also handles Unicode
    line separators and paragraph separators.

    Args:
        text: String with various line break formats

    Returns:
        String with standardized \\n line breaks

    Examples:
        >>> fix_line_breaks("line1\\r\\nline2\\rline3")
        'line1\\nline2\\nline3'
        >>> fix_line_breaks("para1\\u2029para2")  # Unicode paragraph sep
        'para1\\npara2'
    """
```

### Advanced Character Processing

Functions for handling complex Unicode issues.

```python { .api }
def fix_surrogates(text: str) -> str:
    """
    Fix UTF-16 surrogate pair sequences.

    Converts UTF-16 surrogate codepoints back to the original
    high-numbered Unicode characters, such as emoji. Fixes text that
    was decoded with the obsolete UCS-2 standard.

    Args:
        text: String containing UTF-16 surrogates

    Returns:
        String with surrogates converted to proper characters

    Examples:
        >>> fix_surrogates("\\ud83d\\ude00")  # Surrogate pair
        '😀'
    """

def fix_c1_controls(text: str) -> str:
    """
    Replace C1 control characters with Windows-1252 equivalents.

    Converts Latin-1 control characters (U+0080 to U+009F) to their
    Windows-1252 interpretations, following the HTML5 standard.

    Args:
        text: String containing C1 control characters

    Returns:
        String with C1 controls replaced

    Examples:
        >>> fix_c1_controls("\\x80")  # C1 control
        '€'  # Windows-1252 Euro sign
    """
```
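
The arithmetic behind both fixes is compact enough to sketch with the standard library. The helper names below are hypothetical and this is not the library's code. The surrogate formula is the standard UTF-16 decoding rule; leaving bytes that Python's `windows-1252` codec does not define unchanged is an assumption of this sketch.

```python
import re

SURROGATE_PAIR = re.compile("[\ud800-\udbff][\udc00-\udfff]")
C1_CONTROL = re.compile("[\x80-\x9f]")


def join_surrogates(text: str) -> str:
    """Combine each high/low surrogate pair into one code point."""

    def combine(match):
        high, low = match.group(0)
        return chr(0x10000 + (ord(high) - 0xD800) * 0x400 + (ord(low) - 0xDC00))

    return SURROGATE_PAIR.sub(combine, text)


def c1_to_windows1252(text: str) -> str:
    """Reinterpret C1 control characters as Windows-1252 bytes."""

    def convert(match):
        char = match.group(0)
        try:
            return char.encode("latin-1").decode("windows-1252")
        except UnicodeDecodeError:
            # Byte has no Windows-1252 meaning in Python's codec; keep it.
            return char

    return C1_CONTROL.sub(convert, text)


print(join_surrogates("\ud83d\ude00"))        # 😀
print(c1_to_windows1252("\x80\x85\x91\x92"))  # €…‘’
```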

### Byte-Level Processing

Functions for processing byte sequences during encoding correction.

```python { .api }
def restore_byte_a0(byts: bytes) -> bytes:
    """
    Restore byte 0xA0 in potential UTF-8 mojibake.

    Replaces literal space (0x20) with non-breaking space (0xA0)
    when it would make the bytes valid UTF-8. Used during encoding
    detection to handle common mojibake patterns.

    Args:
        byts: Byte sequence potentially containing altered UTF-8

    Returns:
        Byte sequence with 0xA0 restored where appropriate
    """

def replace_lossy_sequences(byts: bytes) -> bytes:
    """
    Replace lossy byte sequences in mojibake correction.

    Identifies and replaces sequences where information was lost
    during encoding/decoding, typically involving � or ? characters.

    Args:
        byts: Byte sequence from encoding detection

    Returns:
        Byte sequence with lossy sequences replaced
    """

def decode_inconsistent_utf8(text: str) -> str:
    """
    Handle inconsistent UTF-8 sequences in text.

    Fixes text where UTF-8 mojibake patterns exist but there's no
    consistent way to reinterpret the string in a single encoding.
    Replaces problematic sequences with proper UTF-8.

    Args:
        text: String with inconsistent UTF-8 sequences

    Returns:
        String with UTF-8 sequences corrected
    """
```
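
The situation `restore_byte_a0` targets is easier to see with a worked case. The sketch below uses only the standard library and a deliberately naive, hypothetical helper (`naive_restore_a0`) that handles a single byte pair; a real implementation would need to be far more selective about when a space is really a damaged 0xA0.

```python
def naive_restore_a0(byts: bytes) -> bytes:
    """Turn the byte pair C3 20 back into C3 A0 (illustration only)."""
    return byts.replace(b"\xc3 ", b"\xc3\xa0")


original = "voilà"                                     # 'à' is C3 A0 in UTF-8
mojibake = original.encode("utf-8").decode("latin-1")  # 'voilÃ' + NBSP
damaged = mojibake.replace("\xa0", " ")                # NBSP became a plain space
damaged_bytes = damaged.encode("latin-1")              # b'voil\xc3 '

try:
    damaged_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("damaged bytes are not valid UTF-8")

print(naive_restore_a0(damaged_bytes).decode("utf-8"))  # voilà
```

Once the space is put back as 0xA0, the bytes decode as UTF-8 again and the original text is recovered.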

### Utility Functions

Additional text processing utilities.

```python { .api }
def decode_escapes(text: str) -> str:
    """
    Decode backslash escape sequences in text.

    More robust alternative to Python's built-in escape-decoding
    codecs. Handles escape sequence formats including \\n, \\t,
    \\uXXXX, and \\xXX.

    Args:
        text: String containing escape sequences

    Returns:
        String with escape sequences decoded

    Examples:
        >>> decode_escapes("Hello\\\\nWorld\\\\t!")
        'Hello\\nWorld\\t!'
        >>> decode_escapes("Unicode: \\\\u00e9")
        'Unicode: é'
    """
```
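
A common way to decode escapes safely is to match each escape sequence with a regular expression and decode only the matched piece, so that non-ASCII text elsewhere in the string is left intact. The following is a standard-library sketch of that approach; `ESCAPE_RE` and `decode_backslash_escapes` are hypothetical names, and the set of escapes covered is an assumption.

```python
import codecs
import re

ESCAPE_RE = re.compile(
    r"""
    \\(?: U[0-9A-Fa-f]{8}      # 8-digit Unicode escape
        | u[0-9A-Fa-f]{4}      # 4-digit Unicode escape
        | x[0-9A-Fa-f]{2}      # 2-digit hex escape
        | [0-7]{1,3}           # octal escape
        | [\\'"abfnrtv]        # single-character escapes
    )
    """,
    re.VERBOSE,
)


def decode_backslash_escapes(text: str) -> str:
    """Decode each matched escape on its own; leave other text untouched."""

    def decode_match(match):
        return codecs.decode(match.group(0), "unicode-escape")

    return ESCAPE_RE.sub(decode_match, text)


print(repr(decode_backslash_escapes("Hello\\nWorld\\t!")))  # 'Hello\nWorld\t!'
print(decode_backslash_escapes("Unicode: \\u00e9"))         # Unicode: é
```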

## Usage Examples

### Individual Fix Application

```python
from ftfy.fixes import unescape_html, remove_terminal_escapes, uncurl_quotes

# Apply individual fixes
html_text = "&lt;p&gt;Hello &amp; goodbye&lt;/p&gt;"
clean_html = unescape_html(html_text)
print(clean_html)  # "<p>Hello & goodbye</p>"

# Clean terminal output
terminal_output = "\x1b[31mError:\x1b[0m File not found"
clean_output = remove_terminal_escapes(terminal_output)
print(clean_output)  # "Error: File not found"

# Normalize quotes for ASCII systems
curly_text = "It’s “perfectly” fine"
straight_quotes = uncurl_quotes(curly_text)
print(straight_quotes)  # 'It\'s "perfectly" fine'
```

### Character Width Normalization

```python
from ftfy.fixes import fix_character_width, fix_latin_ligatures

# Fix fullwidth characters
wide_text = "ＨＥＬＬＯ　ＷＯＲＬＤ"
normal_text = fix_character_width(wide_text)
print(normal_text)  # "HELLO WORLD"

# Decompose ligatures
ligature_text = "The oﬃce ﬁle"
decomposed = fix_latin_ligatures(ligature_text)
print(decomposed)  # "The office file"
```

### Line Break Standardization

```python
from ftfy.fixes import fix_line_breaks

# Standardize mixed line endings
mixed_lines = "Line 1\r\nLine 2\rLine 3\nLine 4"
unix_lines = fix_line_breaks(mixed_lines)
print(repr(unix_lines))  # 'Line 1\nLine 2\nLine 3\nLine 4'

# Handle Unicode line separators
unicode_lines = "Para 1\u2029Para 2\u2028Line break"
standard_lines = fix_line_breaks(unicode_lines)
print(repr(standard_lines))  # 'Para 1\nPara 2\nLine break'
```

### Advanced Character Processing

```python
from ftfy.fixes import fix_surrogates, fix_c1_controls

# Fix emoji from surrogate pairs
surrogate_emoji = "\ud83d\ude00\ud83d\ude01"  # Encoded emoji
real_emoji = fix_surrogates(surrogate_emoji)
print(real_emoji)  # "😀😁"

# Fix C1 control characters
latin1_controls = "\x80\x85\x91\x92"  # C1 controls
windows1252 = fix_c1_controls(latin1_controls)
print(windows1252)  # "€…‘’"
```

### Combining Multiple Fixes

```python
from ftfy.fixes import (
    unescape_html, remove_terminal_escapes,
    uncurl_quotes, fix_character_width, fix_line_breaks
)

def custom_clean(text):
    """Custom text cleaning pipeline."""
    text = remove_terminal_escapes(text)
    text = unescape_html(text)
    text = uncurl_quotes(text)
    text = fix_character_width(text)
    text = fix_line_breaks(text)
    return text

# Apply custom cleaning
messy_text = "\x1b[32m&lt;ＨＥＬＬＯ&gt;\x1b[0m “world”\r\n"
clean_text = custom_clean(messy_text)
print(repr(clean_text))  # '<HELLO> "world"\n'
```