Tessl Tile for pypi/ftfy@6.3.0

# Individual Text Fixes

Individual transformation functions for specific text problems such as HTML entities, terminal escapes, character width, quotes, and line breaks. Each function can be called on its own, and most are also applied automatically by the main text-fixing functions.

## Capabilities

### HTML and Markup Processing

Functions for handling HTML entities and markup-related text issues.

```python { .api }
def unescape_html(text: str) -> str:
    """
    Convert HTML entities to Unicode characters.

    Decodes entities that end in a semicolon, such as &amp; → & and
    &lt; → <, and also accepts some nonstandard all-caps entity names.
    More conservative than html.unescape, which also decodes entities
    that lack the trailing semicolon.

    Args:
        text: String potentially containing HTML entities

    Returns:
        String with HTML entities converted to Unicode characters

    Examples:
        >>> unescape_html("&amp; &lt;tag&gt;")
        '& <tag>'
        >>> unescape_html("&EACUTE;")  # All-caps entity name
        'É'
    """
```

### Terminal and Control Characters

Functions for cleaning terminal escapes and control characters.

```python { .api }
def remove_terminal_escapes(text: str) -> str:
    """
    Remove ANSI terminal escape sequences.

    Strips color codes, cursor positioning, and other ANSI escape
    sequences commonly found in terminal output or log files.

    Args:
        text: String potentially containing ANSI escape sequences

    Returns:
        String with terminal escapes removed

    Examples:
        >>> remove_terminal_escapes("\x1b[31mRed text\x1b[0m")
        'Red text'
        >>> remove_terminal_escapes("\x1b[2J\x1b[HClear screen")
        'Clear screen'
    """

def remove_control_chars(text: str) -> str:
    """
    Remove unnecessary Unicode control characters.

    Removes control characters that have no visual effect and are
    typically unwanted artifacts in text processing.

    Args:
        text: String potentially containing control characters

    Returns:
        String with control characters removed
    """

def remove_bom(text: str) -> str:
    """
    Remove byte order marks (BOM) from text.

    Strips Unicode BOM characters that sometimes appear at the
    beginning of text files or strings.

    Args:
        text: String potentially starting with BOM

    Returns:
        String with BOM removed
    """
```

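To make the behaviour described above concrete, here is a minimal standard-library sketch of the same ideas. It is not ftfy's implementation: `ANSI_CSI_RE`, `strip_ansi`, and `strip_leading_bom` are hypothetical names, and the pattern only covers simple `ESC [ ... letter` sequences.

```python
import re

# Hypothetical pattern: ESC, "[", optional digits/semicolons, one final letter.
ANSI_CSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def strip_ansi(text: str) -> str:
    """Remove simple CSI escape sequences such as colour codes."""
    return ANSI_CSI_RE.sub("", text)


def strip_leading_bom(text: str) -> str:
    """Drop any U+FEFF characters at the start of the string."""
    return text.lstrip("\ufeff")


print(strip_ansi("\x1b[31mRed text\x1b[0m"))  # Red text
print(strip_leading_bom("\ufeffhello"))       # hello
```

In real code the ftfy functions documented above are the better choice; the sketch only shows the kind of pattern matching involved.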
### Quote and Punctuation Fixes

Functions for normalizing quotes and punctuation characters.

```python { .api }
def uncurl_quotes(text: str) -> str:
    """
    Convert curly quotes to straight ASCII quotes.

    Replaces Unicode quotation marks with ASCII equivalents:
    ‘ ’ → ', “ ” → ". Useful for systems requiring ASCII-only text.

    Args:
        text: String containing curly quotes

    Returns:
        String with straight ASCII quotes

    Examples:
        >>> uncurl_quotes("It’s “quoted” text")
        'It\'s "quoted" text'
        >>> uncurl_quotes("‘single’ and “double” quotes")
        '\'single\' and "double" quotes'
    """
```

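Curly quotes are easy to lose when examples pass through editors or web pages, so it can help to spell them as `\u` escapes. The following is a rough sketch of the replacement idea using `str.translate`; `QUOTE_MAP` and `straighten_quotes` are hypothetical names, and the mapping covers only four common marks rather than everything ftfy handles.

```python
# Hypothetical mapping for illustration; it covers only the four most common marks.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Replace the mapped curly quotes with their ASCII counterparts."""
    return text.translate(QUOTE_MAP)


print(straighten_quotes("It\u2019s \u201cquoted\u201d text"))  # It's "quoted" text
```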
### Character Width and Typography

Functions for normalizing character width and typographic elements.

```python { .api }
def fix_character_width(text: str) -> str:
    """
    Normalize fullwidth and halfwidth characters.

    Converts fullwidth Latin characters to normal width and halfwidth
    Katakana to normal width for consistent display and processing.

    Args:
        text: String containing width-variant characters

    Returns:
        String with normalized character widths

    Examples:
        >>> fix_character_width("ＬＯＵＤ　ＮＯＩＳＥＳ")
        'LOUD NOISES'
        >>> fix_character_width("ﾊﾝｶｸ")  # Halfwidth Katakana
        'ハンカク'
    """

def fix_latin_ligatures(text: str) -> str:
    """
    Replace Latin ligatures with individual letters.

    Converts typographic ligatures like ﬁ, ﬂ back to individual
    characters (fi, fl) for searchability and processing.

    Args:
        text: String containing Latin ligatures

    Returns:
        String with ligatures replaced by letter sequences

    Examples:
        >>> fix_latin_ligatures("ﬁle and ﬂower")
        'file and flower'
        >>> fix_latin_ligatures("ofﬁce")
        'office'
    """
```

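One way to think about width normalization is as NFKC normalization restricted to the Halfwidth and Fullwidth Forms block, so that unrelated characters (superscripts, for example) are left alone. The sketch below illustrates that idea with the standard library only; `build_width_map` and `normalize_width` are hypothetical names, and this is an assumption about the approach rather than a copy of ftfy's code.

```python
import unicodedata


def build_width_map() -> dict:
    """Map width-variant code points to their NFKC forms."""
    width_map = {0x3000: " "}  # ideographic space becomes an ASCII space
    for cp in range(0xFF01, 0xFFF0):  # Halfwidth and Fullwidth Forms block
        ch = chr(cp)
        folded = unicodedata.normalize("NFKC", ch)
        if folded != ch:
            width_map[cp] = folded
    return width_map


WIDTH_MAP = build_width_map()


def normalize_width(text: str) -> str:
    """Apply the width map; characters outside the map are untouched."""
    return text.translate(WIDTH_MAP)


print(normalize_width("ＬＯＵＤ　ＮＯＩＳＥＳ"))  # LOUD NOISES
print(normalize_width("x\u00b2"))               # x² (full NFKC would give x2)
```

Restricting the table to one block is the design point: applying NFKC to the whole string would also rewrite characters that are not width variants.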
### Line Break and Whitespace Normalization

Functions for standardizing line breaks and whitespace.

```python { .api }
def fix_line_breaks(text: str) -> str:
    """
    Standardize line breaks to Unix format (\n).

    Converts Windows (\r\n), classic Mac OS (\r), and other line
    ending variations to standard Unix newlines. Handles Unicode line
    separators and paragraph separators.

    Args:
        text: String with various line break formats

    Returns:
        String with standardized \n line breaks

    Examples:
        >>> fix_line_breaks("line1\r\nline2\rline3")
        'line1\nline2\nline3'
        >>> fix_line_breaks("para1\u2029para2")  # Unicode paragraph separator
        'para1\npara2'
    """
```

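The conversions listed in the docstring can be sketched as a chain of `str.replace` calls. This is a simplified illustration, not ftfy's source; `normalize_newlines` is a hypothetical name, and it handles only the four sequences named above. The order matters: `\r\n` has to be replaced before the bare `\r`, otherwise each Windows line ending would turn into two newlines.

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF, CR, U+2028 and U+2029 to a single LF."""
    return (
        text.replace("\r\n", "\n")   # Windows first, so CR+LF stays one break
        .replace("\r", "\n")         # remaining bare carriage returns
        .replace("\u2028", "\n")     # Unicode line separator
        .replace("\u2029", "\n")     # Unicode paragraph separator
    )


print(repr(normalize_newlines("line1\r\nline2\rline3")))  # 'line1\nline2\nline3'

# Replacing the bare CR first doubles the break:
print(repr("a\r\nb".replace("\r", "\n")))                  # 'a\n\nb'
```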
### Advanced Character Processing

Functions for handling complex Unicode issues.

```python { .api }
def fix_surrogates(text: str) -> str:
    """
    Fix UTF-16 surrogate pair sequences.

    Converts UTF-16 surrogate codepoints back to the original
    high-numbered Unicode characters, such as emoji. Fixes text that
    was decoded with the obsolete UCS-2 standard.

    Args:
        text: String containing UTF-16 surrogates

    Returns:
        String with surrogates converted to proper characters

    Examples:
        >>> fix_surrogates("\ud83d\ude00")  # Surrogate pair
        '😀'
    """

def fix_c1_controls(text: str) -> str:
    """
    Replace C1 control characters with Windows-1252 equivalents.

    Converts Latin-1 control characters (U+0080 to U+009F) to their
    Windows-1252 interpretations, following the HTML5 standard.

    Args:
        text: String containing C1 control characters

    Returns:
        String with C1 controls replaced

    Examples:
        >>> fix_c1_controls("\x80")  # Byte 0x80 is the euro sign in Windows-1252
        '€'
    """
```

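Both fixes above can be approximated with standard codecs, which may help when reasoning about what they do. The sketch below is an illustration under stated assumptions, not ftfy's implementation: `join_surrogates` and `c1_to_cp1252` are hypothetical names, and leaving the five bytes that Python's `cp1252` codec does not define unchanged is a choice made for this sketch.

```python
import re


def join_surrogates(text: str) -> str:
    """Pair up surrogate code points by round-tripping through UTF-16."""
    # "surrogatepass" lets the encoder emit surrogate code points as-is;
    # decoding the result as UTF-16 then combines each valid pair.
    raw = text.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "replace")


C1_RE = re.compile("[\x80-\x9f]")


def c1_to_cp1252(text: str) -> str:
    """Reinterpret C1 control characters as Windows-1252 bytes."""
    def convert(match):
        byte = match.group(0).encode("latin-1")
        try:
            return byte.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few bytes (e.g. 0x81) undefined.
            return match.group(0)
    return C1_RE.sub(convert, text)


print(join_surrogates("\ud83d\ude00") == "\U0001F600")  # True
print(c1_to_cp1252("\x80\x85\x91\x92"))                 # €…‘’
```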
### Byte-Level Processing

Functions for processing byte sequences during encoding correction.

```python { .api }
def restore_byte_a0(byts: bytes) -> bytes:
    """
    Restore byte 0xA0 in potential UTF-8 mojibake.

    Replaces literal space (0x20) with non-breaking space (0xA0)
    when it would make the bytes valid UTF-8. Used during encoding
    detection to handle common mojibake patterns.

    Args:
        byts: Byte sequence potentially containing altered UTF-8

    Returns:
        Byte sequence with 0xA0 restored where appropriate
    """

def replace_lossy_sequences(byts: bytes) -> bytes:
    """
    Replace lossy byte sequences in mojibake correction.

    Identifies and replaces sequences where information was lost
    during encoding/decoding, typically involving � or ? characters.

    Args:
        byts: Byte sequence from encoding detection

    Returns:
        Byte sequence with lossy sequences replaced
    """

def decode_inconsistent_utf8(text: str) -> str:
    """
    Handle inconsistent UTF-8 sequences in text.

    Fixes text where UTF-8 mojibake patterns exist but there's no
    consistent way to reinterpret the string in a single encoding.
    Replaces problematic sequences with proper UTF-8.

    Args:
        text: String with inconsistent UTF-8 sequences

    Returns:
        String with UTF-8 sequences corrected
    """
```

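The byte 0xA0 problem is easier to see with a worked case. The sketch below reproduces the damage by hand with the standard library and then undoes it with a deliberately naive rule; `is_valid_utf8` and `naive_restore_a0` are hypothetical helpers, and the real `restore_byte_a0` is more careful about when a restoration is plausible.

```python
import re

original = "Škoda"
utf8_bytes = original.encode("utf-8")        # b'\xc5\xa0koda'
mojibake = utf8_bytes.decode("latin-1")      # UTF-8 bytes misread as Latin-1
damaged = mojibake.replace("\xa0", " ")      # NBSP flattened to a plain space
damaged_bytes = damaged.encode("latin-1")    # b'\xc5 koda'


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Naive rule: a two-byte UTF-8 lead byte followed by a space probably lost 0xA0.
LEAD_THEN_SPACE = re.compile(rb"([\xc2-\xdf]) ")


def naive_restore_a0(data: bytes) -> bytes:
    """Put 0xA0 back after any two-byte lead byte that is followed by a space."""
    return LEAD_THEN_SPACE.sub(lambda m: m.group(1) + b"\xa0", data)


print(is_valid_utf8(damaged_bytes))                      # False
print(naive_restore_a0(damaged_bytes).decode("utf-8"))   # Škoda
```

A rule this blunt would also rewrite text that legitimately contains such a character followed by a space, which is why it is only suitable as an illustration.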
### Utility Functions

Additional text processing utilities.

```python { .api }
def decode_escapes(text: str) -> str:
    """
    Decode backslash escape sequences in text.

    A more robust alternative to Python's unicode-escape codec that
    decodes sequences such as \n, \t, \uXXXX, and \xXX, even when
    the text also contains non-ASCII characters.

    Args:
        text: String containing escape sequences

    Returns:
        String with escape sequences decoded

    Examples:
        >>> decode_escapes("Hello\\nWorld\\t!")
        'Hello\nWorld\t!'
        >>> decode_escapes("Unicode: \\u00e9")
        'Unicode: é'
    """
```

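To show why a dedicated function is useful here, the sketch below compares the naive `unicode_escape` round trip, which garbles non-ASCII text, with a regex that decodes only the escape sequences themselves. `ESCAPE_RE` and `decode_escapes_sketch` are hypothetical names, and the pattern covers only a subset of Python's escapes.

```python
import codecs
import re

# Hypothetical pattern: \UXXXXXXXX, \uXXXX, \xXX, and common one-letter escapes.
ESCAPE_RE = re.compile(
    r"\\(?:U[0-9a-fA-F]{8}|u[0-9a-fA-F]{4}|x[0-9a-fA-F]{2}|[\\abfnrtv])"
)


def decode_escapes_sketch(text: str) -> str:
    """Decode each matched escape on its own, leaving other text untouched."""
    return ESCAPE_RE.sub(
        lambda m: codecs.decode(m.group(0), "unicode_escape"), text
    )


# The naive approach reads the UTF-8 bytes of "é" as two Latin-1 characters.
naive = "café \\n".encode("utf-8").decode("unicode_escape")
print(repr(naive))                               # 'cafÃ© \n'
print(repr(decode_escapes_sketch("café \\n")))   # 'café \n'
```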
## Usage Examples

### Individual Fix Application

```python
from ftfy.fixes import unescape_html, remove_terminal_escapes, uncurl_quotes

# Apply individual fixes
html_text = "&lt;p&gt;Hello &amp; goodbye&lt;/p&gt;"
clean_html = unescape_html(html_text)
print(clean_html)  # <p>Hello & goodbye</p>

# Clean terminal output
terminal_output = "\x1b[31mError:\x1b[0m File not found"
clean_output = remove_terminal_escapes(terminal_output)
print(clean_output)  # Error: File not found

# Normalize quotes for ASCII systems
curly_text = "It’s “perfectly” fine"
straight_quotes = uncurl_quotes(curly_text)
print(straight_quotes)  # It's "perfectly" fine
```

### Character Width Normalization

```python
from ftfy.fixes import fix_character_width, fix_latin_ligatures

# Fix fullwidth characters
wide_text = "ＨＥＬＬＯ　ＷＯＲＬＤ"
normal_text = fix_character_width(wide_text)
print(normal_text)  # "HELLO WORLD"

# Decompose ligatures
ligature_text = "The ofﬁce ﬁle"
decomposed = fix_latin_ligatures(ligature_text)
print(decomposed)  # "The office file"
```

### Line Break Standardization

```python
from ftfy.fixes import fix_line_breaks

# Standardize mixed line endings
mixed_lines = "Line 1\r\nLine 2\rLine 3\nLine 4"
unix_lines = fix_line_breaks(mixed_lines)
print(repr(unix_lines))  # 'Line 1\nLine 2\nLine 3\nLine 4'

# Handle Unicode line separators
unicode_lines = "Para 1\u2029Para 2\u2028Line break"
standard_lines = fix_line_breaks(unicode_lines)
print(repr(standard_lines))  # 'Para 1\nPara 2\nLine break'
```

### Advanced Character Processing

```python
from ftfy.fixes import fix_surrogates, fix_c1_controls

# Fix emoji from surrogate pairs
surrogate_emoji = "\ud83d\ude00\ud83d\ude01"  # Two UTF-16 surrogate pairs
real_emoji = fix_surrogates(surrogate_emoji)
print(real_emoji)  # "😀😁"

# Fix C1 control characters
latin1_controls = "\x80\x85\x91\x92"  # C1 controls
windows1252 = fix_c1_controls(latin1_controls)
print(windows1252)  # "€…‘’"
```

### Combining Multiple Fixes

```python
from ftfy.fixes import (
    unescape_html, remove_terminal_escapes,
    uncurl_quotes, fix_character_width, fix_line_breaks
)

def custom_clean(text):
    """Custom text cleaning pipeline."""
    text = remove_terminal_escapes(text)
    text = unescape_html(text)
    text = uncurl_quotes(text)
    text = fix_character_width(text)
    text = fix_line_breaks(text)
    return text

# Apply custom cleaning
messy_text = "\x1b[32m&lt;ＨＥＬＬＯ&gt;\x1b[0m “world”\r\n"
clean_text = custom_clean(messy_text)
print(repr(clean_text))  # '<HELLO> "world"\n'
```
