
# Individual Text Fixes

Individual transformation functions for specific text problems like HTML entities, terminal escapes, character width, quotes, and line breaks. These functions can be used independently; they are also applied automatically by the main text fixing functions.

## Capabilities

### HTML and Markup Processing

Functions for handling HTML entities and markup-related text issues.

```python { .api }
def unescape_html(text: str) -> str:
    """
    Convert HTML entities to Unicode characters.

    Robust replacement for html.unescape that handles malformed entities
    and common entity mistakes. Converts entities like &amp; → &, &lt; → <.

    Args:
        text: String potentially containing HTML entities

    Returns:
        String with HTML entities converted to Unicode characters

    Examples:
        >>> unescape_html("&amp; &lt;tag&gt;")
        '& <tag>'
        >>> unescape_html("&EACUTE;") # Handles incorrect capitalization
        'É'
    """
```

### Terminal and Control Characters

Functions for cleaning terminal escapes and control characters.

```python { .api }
def remove_terminal_escapes(text: str) -> str:
    """
    Remove ANSI terminal escape sequences.

    Strips color codes, cursor positioning, and other ANSI escape
    sequences commonly found in terminal output or log files.

    Args:
        text: String potentially containing ANSI escape sequences

    Returns:
        String with terminal escapes removed

    Examples:
        >>> remove_terminal_escapes("\\x1b[31mRed text\\x1b[0m")
        'Red text'
        >>> remove_terminal_escapes("\\x1b[2J\\x1b[HClear screen")
        'Clear screen'
    """

def remove_control_chars(text: str) -> str:
    """
    Remove unnecessary Unicode control characters.

    Removes control characters that have no visual effect and are
    typically unwanted artifacts in text processing.

    Args:
        text: String potentially containing control characters

    Returns:
        String with control characters removed
    """

def remove_bom(text: str) -> str:
    """
    Remove byte order marks (BOM) from text.

    Strips Unicode BOM characters that sometimes appear at the
    beginning of text files or strings.

    Args:
        text: String potentially starting with BOM

    Returns:
        String with BOM removed
    """
```
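
Neither `remove_control_chars` nor `remove_bom` appears in the usage examples later on, so here is a minimal sketch. The sample strings and the expected outputs in the comments are illustrative assumptions based on the descriptions above, not captured library output.

```python
from ftfy.fixes import remove_bom, remove_control_chars

# Strip a leading byte order mark that leaked into a decoded string
with_bom = "\ufeffHello, world"
print(repr(remove_bom(with_bom)))         # expected: 'Hello, world'

# Drop invisible control characters (e.g. NUL) while keeping normal whitespace
noisy = "col1\tcol2\x00end"
print(repr(remove_control_chars(noisy)))  # expected: 'col1\tcol2end'
```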

### Quote and Punctuation Fixes

Functions for normalizing quotes and punctuation characters.

```python { .api }
def uncurl_quotes(text: str) -> str:
    """
    Convert curly quotes to straight ASCII quotes.

    Replaces Unicode quotation marks with ASCII equivalents:
    ' ' → ', " " → ". Useful for systems requiring ASCII-only text.

    Args:
        text: String containing curly quotes

    Returns:
        String with straight ASCII quotes

    Examples:
        >>> uncurl_quotes("It's "quoted" text")
        'It\\'s "quoted" text'
        >>> uncurl_quotes("'single' and "double" quotes")
        '\\'single\\' and "double" quotes'
    """
```

### Character Width and Typography

Functions for normalizing character width and typographic elements.

```python { .api }
def fix_character_width(text: str) -> str:
    """
    Normalize fullwidth and halfwidth characters.

    Converts fullwidth Latin characters to normal width and halfwidth
    Katakana to normal width for consistent display and processing.

    Args:
        text: String containing width-variant characters

    Returns:
        String with normalized character widths

    Examples:
        >>> fix_character_width("ＬＯＵＤ　ＮＯＩＳＥＳ")
        'LOUD NOISES'
        >>> fix_character_width("ﾊﾝｶｸ") # Halfwidth Katakana
        'ハンカク'
    """

def fix_latin_ligatures(text: str) -> str:
    """
    Replace Latin ligatures with individual letters.

    Converts typographic ligatures like ﬁ, ﬂ back to individual
    characters (fi, fl) for searchability and processing.

    Args:
        text: String containing Latin ligatures

    Returns:
        String with ligatures replaced by letter sequences

    Examples:
        >>> fix_latin_ligatures("ﬁle and ﬂower")
        'file and flower'
        >>> fix_latin_ligatures("oﬃce")
        'office'
    """
```

### Line Break and Whitespace Normalization

Functions for standardizing line breaks and whitespace.

```python { .api }
def fix_line_breaks(text: str) -> str:
    """
    Standardize line breaks to Unix format (\\n).

    Converts Windows (\\r\\n), Mac (\\r), and other line ending
    variations to standard Unix newlines. Handles Unicode line
    separators and paragraph separators.

    Args:
        text: String with various line break formats

    Returns:
        String with standardized \\n line breaks

    Examples:
        >>> fix_line_breaks("line1\\r\\nline2\\rline3")
        'line1\\nline2\\nline3'
        >>> fix_line_breaks("para1\\u2029para2") # Unicode paragraph sep
        'para1\\npara2'
    """
```

### Advanced Character Processing

Functions for handling complex Unicode issues.

```python { .api }
def fix_surrogates(text: str) -> str:
    """
    Fix UTF-16 surrogate pair sequences.

    Converts UTF-16 surrogate codepoints back to the original high-
    numbered Unicode characters like emoji. Fixes text decoded with
    the obsolete UCS-2 standard.

    Args:
        text: String containing UTF-16 surrogates

    Returns:
        String with surrogates converted to proper characters

    Examples:
        >>> fix_surrogates("\\ud83d\\ude00") # Surrogate pair
        '😀'
    """

def fix_c1_controls(text: str) -> str:
    """
    Replace C1 control characters with Windows-1252 equivalents.

    Converts Latin-1 control characters (U+80-U+9F) to their
    Windows-1252 interpretations, following the HTML5 standard.

    Args:
        text: String containing C1 control characters

    Returns:
        String with C1 controls replaced

    Examples:
        >>> fix_c1_controls("\\x80") # C1 control
        '€' # Windows-1252 Euro sign
    """
```

### Byte-Level Processing

Functions for processing byte sequences during encoding correction.

```python { .api }
def restore_byte_a0(byts: bytes) -> bytes:
    """
    Restore byte 0xA0 in potential UTF-8 mojibake.

    Replaces literal space (0x20) with non-breaking space (0xA0)
    when it would make the bytes valid UTF-8. Used during encoding
    detection to handle common mojibake patterns.

    Args:
        byts: Byte sequence potentially containing altered UTF-8

    Returns:
        Byte sequence with 0xA0 restored where appropriate
    """

def replace_lossy_sequences(byts: bytes) -> bytes:
    """
    Replace lossy byte sequences in mojibake correction.

    Identifies and replaces sequences where information was lost
    during encoding/decoding, typically involving � or ? characters.

    Args:
        byts: Byte sequence from encoding detection

    Returns:
        Byte sequence with lossy sequences replaced
    """

def decode_inconsistent_utf8(text: str) -> str:
    """
    Handle inconsistent UTF-8 sequences in text.

    Fixes text where UTF-8 mojibake patterns exist but there's no
    consistent way to reinterpret the string in a single encoding.
    Replaces problematic sequences with proper UTF-8.

    Args:
        text: String with inconsistent UTF-8 sequences

    Returns:
        String with UTF-8 sequences corrected
    """
```
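
These byte-level helpers are normally invoked internally during encoding repair rather than called by end users, but they can be exercised directly. A rough sketch, with mojibake inputs constructed by hand and outputs hedged as expectations rather than captured results:

```python
from ftfy.fixes import restore_byte_a0, decode_inconsistent_utf8

# UTF-8 for "à" is the bytes C3 A0; a process that treated the 0xA0 byte as a
# space leaves C3 20 behind, which no longer decodes as UTF-8.
damaged = b"caf\xc3 "
print(restore_byte_a0(damaged))                # expected: b'caf\xc3\xa0'

# decode_inconsistent_utf8 works on str, re-decoding embedded
# UTF-8-read-as-Latin-1 runs such as "Ã©" standing in for "é".
print(decode_inconsistent_utf8("cafÃ© time"))  # expected: 'café time'
```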

### Utility Functions

Additional text processing utilities.

```python { .api }
def decode_escapes(text: str) -> str:
    """
    Decode backslash escape sequences in text.

    More robust version of string decode that handles various escape
    sequence formats including \\n, \\t, \\uXXXX, \\xXX patterns.

    Args:
        text: String containing escape sequences

    Returns:
        String with escape sequences decoded

    Examples:
        >>> decode_escapes("Hello\\nWorld\\t!")
        'Hello\\nWorld\\t!'
        >>> decode_escapes("Unicode: \\u00e9")
        'Unicode: é'
    """
```
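
`decode_escapes` is not shown in the usage examples below, so here is a small illustrative example; the input string is an assumption chosen to contain literal backslash sequences, as text read from config files or log lines often does.

```python
from ftfy.fixes import decode_escapes

# The doubled backslashes mean this string literally contains "\n" and "\u00e9"
raw = "Line one\\nCaf\\u00e9 time"
print(decode_escapes(raw))
# expected:
# Line one
# Café time
```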

## Usage Examples

### Individual Fix Application

```python
from ftfy.fixes import unescape_html, remove_terminal_escapes, uncurl_quotes

# Apply individual fixes
html_text = "&lt;p&gt;Hello &amp; goodbye&lt;/p&gt;"
clean_html = unescape_html(html_text)
print(clean_html) # "<p>Hello & goodbye</p>"

# Clean terminal output
terminal_output = "\x1b[31mError:\x1b[0m File not found"
clean_output = remove_terminal_escapes(terminal_output)
print(clean_output) # "Error: File not found"

# Normalize quotes for ASCII systems
curly_text = "It's "perfectly" fine"
straight_quotes = uncurl_quotes(curly_text)
print(straight_quotes) # 'It\'s "perfectly" fine'
```

### Character Width Normalization

```python
from ftfy.fixes import fix_character_width, fix_latin_ligatures

# Fix fullwidth characters
wide_text = "ＨＥＬＬＯ ＷＯＲＬＤ"
normal_text = fix_character_width(wide_text)
print(normal_text) # "HELLO WORLD"

# Decompose ligatures
ligature_text = "The oﬃce ﬁle"
decomposed = fix_latin_ligatures(ligature_text)
print(decomposed) # "The office file"
```

### Line Break Standardization

```python
from ftfy.fixes import fix_line_breaks

# Standardize mixed line endings
mixed_lines = "Line 1\r\nLine 2\rLine 3\nLine 4"
unix_lines = fix_line_breaks(mixed_lines)
print(repr(unix_lines)) # 'Line 1\nLine 2\nLine 3\nLine 4'

# Handle Unicode line separators
unicode_lines = "Para 1\u2029Para 2\u2028Line break"
standard_lines = fix_line_breaks(unicode_lines)
print(repr(standard_lines)) # 'Para 1\nPara 2\nLine break'
```

### Advanced Character Processing

```python
from ftfy.fixes import fix_surrogates, fix_c1_controls

# Fix emoji from surrogate pairs
surrogate_emoji = "\ud83d\ude00\ud83d\ude01" # Encoded emoji
real_emoji = fix_surrogates(surrogate_emoji)
print(real_emoji) # "😀😁"

# Fix C1 control characters
latin1_controls = "\x80\x85\x91\x92" # C1 controls
windows1252 = fix_c1_controls(latin1_controls)
print(windows1252) # "€…''"
```

### Combining Multiple Fixes

```python
from ftfy.fixes import (
    unescape_html, remove_terminal_escapes,
    uncurl_quotes, fix_character_width, fix_line_breaks
)

def custom_clean(text):
    """Custom text cleaning pipeline."""
    text = remove_terminal_escapes(text)
    text = unescape_html(text)
    text = uncurl_quotes(text)
    text = fix_character_width(text)
    text = fix_line_breaks(text)
    return text

# Apply custom cleaning
messy_text = "\x1b[32m&lt;HELLO&gt;\x1b[0m "world"\r\n"
clean_text = custom_clean(messy_text)
print(clean_text) # '<HELLO> "world"\n'
```
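
When the goal is general cleanup rather than a specific ordering of fixes, the top-level `ftfy.fix_text` applies most of these individual fixes (along with encoding repair) under its default configuration, so a hand-rolled pipeline like `custom_clean` is mainly useful when you need a custom subset or order. A minimal sketch, assuming default settings; the output comment is an expectation, not captured output:

```python
import ftfy

messy_text = "\x1b[32m&lt;HELLO&gt;\x1b[0m "world"\r\n"
print(ftfy.fix_text(messy_text))  # expected to match custom_clean's output here
```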