# Configuration and Types

Configuration classes and types for controlling ftfy behavior, including options for each fix step and the data structures used to explain fixes.

## Capabilities

### Text Fixer Configuration

`TextFixerConfig` controls every ftfy processing step. It is a named tuple in which each option has a sensible default.

```python { .api }
class TextFixerConfig(NamedTuple):
    """
    Configuration for all ftfy text processing options.

    Implemented as NamedTuple with defaults, instantiate with keyword
    arguments for values to change from defaults.

    Attributes:
        unescape_html: HTML entity handling ("auto"|True|False)
        remove_terminal_escapes: Remove ANSI escape sequences (bool)
        fix_encoding: Detect and fix mojibake (bool)
        restore_byte_a0: Allow space as non-breaking space in mojibake (bool)
        replace_lossy_sequences: Fix partial mojibake with � or ? (bool)
        decode_inconsistent_utf8: Fix inconsistent UTF-8 sequences (bool)
        fix_c1_controls: Replace C1 controls with Windows-1252 (bool)
        fix_latin_ligatures: Replace ligatures with letters (bool)
        fix_character_width: Normalize fullwidth/halfwidth chars (bool)
        uncurl_quotes: Convert curly quotes to straight quotes (bool)
        fix_line_breaks: Standardize line breaks to \\n (bool)
        fix_surrogates: Fix UTF-16 surrogate sequences (bool)
        remove_control_chars: Remove unnecessary control chars (bool)
        normalization: Unicode normalization type (str|None)
        max_decode_length: Maximum segment size for processing (int)
        explain: Whether to compute explanations (bool)
    """
    unescape_html: str | bool = "auto"
    remove_terminal_escapes: bool = True
    fix_encoding: bool = True
    restore_byte_a0: bool = True
    replace_lossy_sequences: bool = True
    decode_inconsistent_utf8: bool = True
    fix_c1_controls: bool = True
    fix_latin_ligatures: bool = True
    fix_character_width: bool = True
    uncurl_quotes: bool = True
    fix_line_breaks: bool = True
    fix_surrogates: bool = True
    remove_control_chars: bool = True
    normalization: Literal["NFC", "NFD", "NFKC", "NFKD"] | None = "NFC"
    max_decode_length: int = 1000000
    explain: bool = True
```

### Explanation Types

Data structures for representing text transformation explanations and individual transformation steps.

```python { .api }
class ExplainedText(NamedTuple):
    """
    Return type for ftfy functions that provide explanations.

    Contains both the fixed text result and optional explanation of
    transformations applied. When explain=False, explanation is None.

    Attributes:
        text: The processed text result (str)
        explanation: List of transformation steps or None (list[ExplanationStep]|None)
    """
    text: str
    explanation: list[ExplanationStep] | None

class ExplanationStep(NamedTuple):
    """
    Single step in text transformation explanation.

    Describes one transformation applied during text processing with
    action type and parameter specifying the operation performed.

    Attributes:
        action: Type of transformation (str)
        parameter: Encoding name or function name (str)

    Actions:
        "encode": Convert string to bytes with specified encoding
        "decode": Convert bytes to string with specified encoding
        "transcode": Convert bytes to bytes with named function
        "apply": Convert string to string with named function
        "normalize": Apply Unicode normalization
    """
    action: str
    parameter: str
```
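
An explanation can be read as a recipe that could be replayed on the original text. The sketch below uses only the standard library to show the idea. `Step` is a stand-in for `ExplanationStep`, and `replay_steps` is a hypothetical helper, not part of ftfy. It handles only the `encode`, `decode`, and `normalize` actions; `transcode` and `apply` name ftfy-internal functions and are deliberately left out.

```python
from __future__ import annotations

import unicodedata
from typing import NamedTuple


class Step(NamedTuple):
    """Stand-in for ExplanationStep (hypothetical, for illustration only)."""
    action: str
    parameter: str


def replay_steps(text: str, steps: list[Step]) -> str:
    """Replay encode/decode/normalize steps on text (illustrative helper)."""
    value = text
    for step in steps:
        if step.action == "encode":
            value = value.encode(step.parameter)   # str -> bytes
        elif step.action == "decode":
            value = value.decode(step.parameter)   # bytes -> str
        elif step.action == "normalize":
            value = unicodedata.normalize(step.parameter, value)
        else:
            raise ValueError(f"unsupported action: {step.action}")
    return value


# "s\u00c3\u00b3" is "só" encoded as UTF-8 and then misread as Latin-1.
steps = [Step("encode", "latin-1"), Step("decode", "utf-8")]
repaired = replay_steps("s\u00c3\u00b3", steps)
print(repaired)
```

Each step changes the value between `str` and `bytes`, which is why an `encode` step is normally followed by a matching `decode` step.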

## Configuration Details

### HTML Entity Handling

The `unescape_html` option controls HTML entity processing:

- `"auto"` (default): Decode entities unless a literal `<` appears in the text, which suggests the input is HTML
- `True`: Always decode HTML entities, for example `&amp;` → `&`
- `False`: Never decode HTML entities

```python
from ftfy import TextFixerConfig, fix_text

# Auto mode - detects HTML context
config = TextFixerConfig(unescape_html="auto")
fix_text("&amp; text", config)    # → "& text" (no < detected)
fix_text("<p>&amp;</p>", config)  # → "<p>&amp;</p>" (< detected, preserve entities)

# Always decode entities
config = TextFixerConfig(unescape_html=True)
fix_text("<p>&amp;</p>", config)  # → "<p>&</p>"

# Never decode entities
config = TextFixerConfig(unescape_html=False)
fix_text("&amp; text", config)    # → "&amp; text"
```
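
The `"auto"` rule can be approximated with the standard library. The following is a minimal sketch, assuming the rule is exactly "leave the text alone if it contains `<`". `unescape_auto` is a hypothetical helper, and ftfy's own entity handling may differ in its details.

```python
import html


def unescape_auto(text: str) -> str:
    """Hypothetical helper mirroring the "auto" rule described above."""
    if "<" in text:
        # A literal "<" suggests real HTML, so entities are left alone.
        return text
    return html.unescape(text)


print(unescape_auto("&amp; text"))      # & text
print(unescape_auto("<p>&amp;</p>"))    # <p>&amp;</p>
```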

### Encoding Detection Options

Several options control encoding detection and mojibake fixing:

```python
from ftfy import TextFixerConfig

# Conservative encoding fixing - fewer false positives
conservative = TextFixerConfig(
    restore_byte_a0=False,          # Don't interpret spaces as non-breaking spaces
    replace_lossy_sequences=False,  # Don't fix partial mojibake
    decode_inconsistent_utf8=False  # Don't fix inconsistent UTF-8
)

# Aggressive encoding fixing - more corrections (these are the default values)
aggressive = TextFixerConfig(
    restore_byte_a0=True,           # Allow space → non-breaking space
    replace_lossy_sequences=True,   # Fix sequences with � or ?
    decode_inconsistent_utf8=True   # Fix inconsistent UTF-8 patterns
)
```
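
To see what these options are working against, it can help to create mojibake by hand. The sketch below uses only the standard library and does not call ftfy. It shows a UTF-8 string misread as Latin-1, the round trip that undoes the damage, and the byte `A0` case that `restore_byte_a0` refers to.

```python
original = "caf\u00e9"  # "café" with é as a single code point

# Encode as UTF-8, then decode the bytes with the wrong codec.
mojibake = original.encode("utf-8").decode("latin-1")
print(mojibake)  # cafÃ©

# Reversing the two steps recovers the original text.
restored = mojibake.encode("latin-1").decode("utf-8")
print(restored)  # café

# "à" is the bytes C3 A0 in UTF-8. Under Windows-1252 the second byte
# becomes a non-breaking space, which is easily turned into a plain space.
broken_a0 = "\u00e0".encode("utf-8").decode("windows-1252")
print(repr(broken_a0))  # 'Ã\xa0'
```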

### Character Normalization Options

Control various character formatting fixes:

```python
from ftfy import TextFixerConfig

# Minimal character normalization
minimal = TextFixerConfig(
    fix_latin_ligatures=False,  # Keep ligatures like ﬁ
    fix_character_width=False,  # Keep fullwidth characters
    uncurl_quotes=False,        # Keep curly quotes
    fix_line_breaks=False       # Keep original line endings
)

# Text cleaning for terminal display
terminal = TextFixerConfig(
    remove_terminal_escapes=True,  # Remove ANSI escapes
    remove_control_chars=True,     # Remove control characters
    fix_character_width=True,      # Normalize character widths
    normalization="NFC"            # Canonical Unicode form
)
```
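
As a rough illustration of what two of these fixes do, the sketch below implements simplified versions of quote uncurling and line-break unification with the standard library. Both functions are hypothetical stand-ins written for this example, not ftfy's implementation, and ftfy may handle more characters than the few listed here.

```python
# Hypothetical stand-ins for two of the fixes; not ftfy's implementation.
QUOTE_TABLE = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
})


def uncurl(text: str) -> str:
    """Replace curly quotes with their straight ASCII equivalents."""
    return text.translate(QUOTE_TABLE)


def unify_line_breaks(text: str) -> str:
    """Convert CRLF and bare CR line endings to LF."""
    # Handle "\r\n" first so it does not become two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(uncurl("\u201cHello\u201d"))            # "Hello"
print(repr(unify_line_breaks("a\r\nb\rc")))   # 'a\nb\nc'
```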

### Unicode Normalization

The `normalization` option selects the Unicode normalization form applied to the text, or disables normalization when set to `None`:

```python
from ftfy import TextFixerConfig, fix_text

# NFC - canonical decomposition followed by composition (default)
nfc_config = TextFixerConfig(normalization="NFC")
fix_text("cafe\u0301", nfc_config)  # Combines e + U+0301 into the single character é

# NFD - canonical decomposition
nfd_config = TextFixerConfig(normalization="NFD")
fix_text("café", nfd_config)  # Separates é into e + combining acute accent (U+0301)

# NFKC - compatibility normalization (can change meaning)
nfkc_config = TextFixerConfig(normalization="NFKC")
fix_text("10³", nfkc_config)  # → "103" (loses superscript)

# No normalization
no_norm = TextFixerConfig(normalization=None)
```
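
The forms themselves are defined by Unicode, so their effect can be checked without ftfy. This short sketch uses the standard library's `unicodedata.normalize` to compare the composed and decomposed spellings of "café" and to show the lossy NFKC case.

```python
import unicodedata

composed = "caf\u00e9"        # é as one code point
decomposed = "cafe\u0301"     # e followed by a combining acute accent

print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
print(len(composed), len(decomposed))                          # 4 5
print(unicodedata.normalize("NFKC", "10\u00b3"))               # 103
```

The two spellings render identically but compare unequal, which is why picking one form matters when text is compared or used as a key.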

## Usage Examples

### Basic Configuration

```python
from ftfy import TextFixerConfig, fix_text

# Use defaults
config = TextFixerConfig()

# Change specific options
config = TextFixerConfig(uncurl_quotes=False, fix_encoding=True)

# Create variations
no_html = config._replace(unescape_html=False)
conservative = config._replace(restore_byte_a0=False, replace_lossy_sequences=False)
```
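
The `_replace` calls above rely on standard named tuple behavior: `_replace` returns a new tuple and leaves the original untouched. The sketch below demonstrates this with `MiniConfig`, a hypothetical two-option stand-in for `TextFixerConfig`, so it runs without ftfy installed.

```python
from typing import NamedTuple


class MiniConfig(NamedTuple):
    """Hypothetical two-option stand-in for TextFixerConfig."""
    uncurl_quotes: bool = True
    normalization: str = "NFC"


base = MiniConfig(uncurl_quotes=False)
variant = base._replace(normalization="NFD")

print(base)     # MiniConfig(uncurl_quotes=False, normalization='NFC')
print(variant)  # MiniConfig(uncurl_quotes=False, normalization='NFD')
```

Because the base configuration is never mutated, it can safely be shared and specialized in several places.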

### Keyword Arguments

```python
from ftfy import TextFixerConfig, fix_text

text = "“smart quotes” and café"  # sample input; any string works

# Pass config options as kwargs (equivalent to config object)
result = fix_text(text, uncurl_quotes=False, normalization="NFD")

# Mix config object and kwargs (kwargs override config)
config = TextFixerConfig(uncurl_quotes=False)
result = fix_text(text, config, normalization="NFD")  # NFD overrides the config's "NFC"
```

### Working with Explanations

```python
from ftfy import TextFixerConfig, fix_and_explain

# Get detailed explanations
result = fix_and_explain("sÃ³")  # mojibake for "só"
print(f"Text: {result.text}")
print(f"Steps: {len(result.explanation)} transformations")

for step in result.explanation:
    print(f"  {step.action}: {step.parameter}")

# Disable explanations for performance
config = TextFixerConfig(explain=False)
result = fix_and_explain("sÃ³", config)
print(result.explanation)  # None
```

### Performance Tuning

```python
from ftfy import TextFixerConfig

# Process large texts in smaller segments (the default limit is 1000000)
large_text_config = TextFixerConfig(max_decode_length=500000)

# Skip expensive operations for simple cleaning
fast_config = TextFixerConfig(
    fix_encoding=False,   # Skip mojibake detection
    unescape_html=False,  # Skip HTML processing
    explain=False         # Skip explanation generation
)

# Text-only cleaning (no encoding fixes)
text_only = TextFixerConfig(
    fix_encoding=False,
    unescape_html=False,
    remove_terminal_escapes=True,
    fix_character_width=True,
    uncurl_quotes=True,
    fix_line_breaks=True
)
```