Fixes mojibake and other problems with Unicode, after the fact
npx @tessl/cli install tessl/pypi-ftfy@6.3.00
# ftfy

ftfy fixes mojibake and other problems with Unicode text after the fact. It detects and corrects common encoding errors, normalizes character formatting, and provides text-cleaning utilities for text from unreliable sources with mixed or unknown encodings.

## Package Information

- **Package Name**: ftfy
- **Language**: Python
- **Installation**: `pip install ftfy`

## Core Imports

```python
import ftfy
```

Common import patterns:

```python
from ftfy import fix_text, fix_and_explain, TextFixerConfig
```

For individual text fixers:

```python
from ftfy.fixes import unescape_html, remove_terminal_escapes, uncurl_quotes
```

For formatting utilities:

```python
from ftfy.formatting import display_ljust, character_width
```

## Basic Usage

```python
import ftfy

# Fix encoding problems (mojibake)
broken_text = "âœ” No problems"
fixed = ftfy.fix_text(broken_text)
print(fixed)  # "✔ No problems"

# Fix multiple layers of mojibake
broken = "The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows."
fixed = ftfy.fix_text(broken)
print(fixed)  # "The Mona Lisa doesn't have eyebrows."

# Get explanation of what was fixed
text, explanation = ftfy.fix_and_explain("sÃ³")
print(text)  # "só"
print(explanation)  # two steps: ('encode', 'latin-1'), then ('decode', 'utf-8')

# Configure specific fixes
from ftfy import TextFixerConfig
config = TextFixerConfig(uncurl_quotes=False)
result = ftfy.fix_text(broken, config)  # same repair, but curly quotes are kept
```

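To make the `fix_and_explain` example above more concrete, the following sketch reproduces the same kind of damage and its repair using only the standard library. It does not call ftfy, and the helper names `make_mojibake` and `undo_mojibake` are hypothetical, chosen for illustration.

```python
def make_mojibake(text: str) -> str:
    """Simulate the bug: UTF-8 bytes decoded with the wrong codec (Latin-1)."""
    return text.encode("utf-8").decode("latin-1")


def undo_mojibake(text: str) -> str:
    """Reverse it with the two steps ('encode', 'latin-1') and ('decode', 'utf-8')."""
    return text.encode("latin-1").decode("utf-8")


damaged = make_mojibake("só")
print(damaged)                 # sÃ³
print(undo_mojibake(damaged))  # só
```

The repair is only safe when the text really was damaged this way, which is why a detection step has to come first.
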
## Architecture

ftfy operates through a multi-step pipeline that detects and corrects text problems:

- **Heuristic Detection**: Uses heuristics to flag likely mojibake while keeping false positives rare
- **Encoding Analysis**: Systematically tests encoding combinations to reverse encoding errors
- **Character Normalization**: Applies format fixes for quotes, ligatures, width, and line breaks
- **Configurable Pipeline**: Each fix step can be individually enabled or disabled via `TextFixerConfig`
- **Explanation System**: Provides a step-by-step log of the transformations applied, for debugging and auditing

This design lets ftfy process text from unknown sources while avoiding overcorrection of correctly encoded text.

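The idea of a configurable pipeline with an explanation log can be sketched in a few lines of standard-library Python. This is a toy model, not ftfy's implementation: `DemoConfig`, `QUOTE_MAP`, and `run_pipeline` are hypothetical names, and only three steps are modeled.

```python
import html
import unicodedata
from typing import NamedTuple, Optional


class DemoConfig(NamedTuple):
    """Hypothetical stand-in for a config object; not ftfy's TextFixerConfig."""
    unescape_html: bool = True
    uncurl_quotes: bool = True
    normalization: Optional[str] = "NFC"


# Curly quotes mapped to their straight ASCII equivalents.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def run_pipeline(text: str, config: DemoConfig = DemoConfig()):
    """Apply each enabled step in order and log only the steps that changed the text."""
    steps = [
        ("unescape_html", html.unescape, config.unescape_html),
        ("uncurl_quotes", lambda s: s.translate(QUOTE_MAP), config.uncurl_quotes),
    ]
    log = []
    for name, func, enabled in steps:
        if not enabled:
            continue
        fixed = func(text)
        if fixed != text:
            log.append(("apply", name))
            text = fixed
    if config.normalization is not None:
        fixed = unicodedata.normalize(config.normalization, text)
        if fixed != text:
            log.append(("normalize", config.normalization))
            text = fixed
    return text, log


print(run_pipeline("\u201cCaf\u00e9 &amp; bar\u201d"))
print(run_pipeline("\u201cCaf\u00e9 &amp; bar\u201d", DemoConfig(uncurl_quotes=False)))
```

Logging a step only when it changes the text keeps the explanation short for input that was already clean.
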
## Capabilities

### Text Fixing Functions

Core functions for detecting and fixing text encoding problems, including the main `fix_text` function and variants that provide explanations of applied transformations.

```python { .api }
def fix_text(text: str, config: TextFixerConfig | None = None, **kwargs) -> str: ...
def fix_and_explain(text: str, config: TextFixerConfig | None = None, **kwargs) -> ExplainedText: ...
def fix_encoding(text: str, config: TextFixerConfig | None = None, **kwargs) -> str: ...
def fix_encoding_and_explain(text: str, config: TextFixerConfig | None = None, **kwargs) -> ExplainedText: ...

# Alias for fix_text
ftfy = fix_text
```

[Text Fixing Functions](./text-fixing.md)

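Text can be mis-decoded more than once, so one repair pass is not always enough. The sketch below shows the idea with a naive standard-library loop. It is not how ftfy decides what to fix: `undo_layers` is a hypothetical helper, it assumes the damage was always "UTF-8 read as Windows-1252", and it has no safeguard against changing text that was correct to begin with.

```python
def undo_layers(text: str, max_layers: int = 5):
    """Repeatedly undo 'UTF-8 read as Windows-1252' until it no longer applies.

    Returns the repaired text and the number of layers removed.
    """
    layers = 0
    while layers < max_layers:
        try:
            candidate = text.encode("cp1252").decode("utf-8")
        except UnicodeError:
            break  # not encodable, or the bytes are not valid UTF-8: stop here
        if candidate == text:
            break  # pure ASCII or otherwise unchanged
        text = candidate
        layers += 1
    return text, layers


# Build a triple-damaged sample programmatically instead of pasting mojibake.
sample = "doesn\u2019t"
for _ in range(3):
    sample = sample.encode("utf-8").decode("cp1252")

print(undo_layers(sample))  # ('doesn’t', 3)
```

Unlike `fix_text` with its default settings, this sketch leaves the curly apostrophe in place, because it models only the encoding repair.
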
### Configuration and Types

Configuration classes and types for controlling ftfy behavior, including comprehensive options for each fix step and explanation data structures.

```python { .api }
class TextFixerConfig(NamedTuple): ...
class ExplainedText(NamedTuple): ...
class ExplanationStep(NamedTuple): ...
```

[Configuration and Types](./configuration.md)

### Individual Text Fixes

Individual transformation functions for specific text problems like HTML entities, terminal escapes, character width, quotes, and line breaks.

```python { .api }
def unescape_html(text: str) -> str: ...
def remove_terminal_escapes(text: str) -> str: ...
def uncurl_quotes(text: str) -> str: ...
def fix_character_width(text: str) -> str: ...
def fix_line_breaks(text: str) -> str: ...
```

[Individual Text Fixes](./individual-fixes.md)

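As a rough picture of what width and line-break fixes involve, here is a standard-library sketch. It is an approximation written for this guide, not the library's code: `narrow_fullwidth` and `unify_line_breaks` are hypothetical names, and ftfy's own functions may well cover more characters than these two tables do.

```python
# Fullwidth ASCII variants (U+FF01..U+FF5E) sit exactly 0xFEE0 above their ASCII forms.
WIDTH_MAP = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
WIDTH_MAP[0x3000] = 0x20  # ideographic space -> ordinary space

# Line-break sequences to unify; "\r\n" comes first so it becomes one "\n", not two.
LINE_BREAKS = ("\r\n", "\r", "\u2028", "\u2029", "\u0085")


def narrow_fullwidth(text: str) -> str:
    """Replace fullwidth ASCII variants with their ordinary ASCII characters."""
    return text.translate(WIDTH_MAP)


def unify_line_breaks(text: str) -> str:
    """Convert several line-break conventions to a plain newline."""
    for sequence in LINE_BREAKS:
        text = text.replace(sequence, "\n")
    return text


print(narrow_fullwidth("\uff2c\uff2f\uff35\uff24\u3000\uff01"))  # LOUD !
print(repr(unify_line_breaks("a\r\nb\rc\u2028d")))              # 'a\nb\nc\nd'
```
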
### File and Byte Processing

Functions for processing files and handling bytes of unknown encoding, including streaming file processing and encoding detection utilities.

```python { .api }
def fix_file(input_file, encoding: str | None = None, config: TextFixerConfig | None = None, **kwargs) -> Iterator[str]: ...
def guess_bytes(bstring: bytes) -> tuple[str, str]: ...
```

[File and Byte Processing](./file-processing.md)

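Decoding bytes of unknown encoding always involves guessing. The sketch below shows the general shape of such a guess with a short, fixed list of candidates. It is a simplified stand-in rather than the logic of `guess_bytes`; `guess_decode` is a hypothetical name, and the candidate list (UTF-16 by byte order mark, then UTF-8, then Latin-1) is an assumption made for illustration.

```python
import codecs


def guess_decode(data: bytes):
    """Return (text, encoding_name) using a short, fixed list of guesses."""
    # A UTF-16 byte order mark is a strong signal, so check it first.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"
    try:
        # Text in other encodings rarely happens to be valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this fallback cannot fail.
        return data.decode("latin-1"), "latin-1"


print(guess_decode("só".encode("utf-8")))  # ('só', 'utf-8')
print(guess_decode(b"s\xf3"))              # ('só', 'latin-1')
```

Returning the encoding name alongside the text lets the caller log or double-check the guess instead of trusting it blindly.
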
### Display and Formatting

Unicode-aware text formatting for terminal display, including width calculation and justification functions that handle fullwidth characters and zero-width characters correctly.

```python { .api }
def character_width(char: str) -> int: ...
def display_ljust(text: str, width: int, fillchar: str = " ") -> str: ...
def display_center(text: str, width: int, fillchar: str = " ") -> str: ...
```

[Display and Formatting](./formatting.md)

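Plain `str.ljust` counts code points, which misaligns columns when text contains wide or zero-width characters. The following sketch approximates display-aware padding with `unicodedata` alone. The names `approx_width` and `approx_ljust` are hypothetical, and the width rules are a simplification (control characters, for instance, are not treated specially), so results may differ from the library functions listed above.

```python
import unicodedata


def approx_width(char: str) -> int:
    """Approximate terminal cells for one character: 0, 1, or 2."""
    if unicodedata.combining(char) or unicodedata.category(char) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters take no cell
    if unicodedata.east_asian_width(char) in ("W", "F"):
        return 2  # wide and fullwidth characters take two cells
    return 1


def approx_ljust(text: str, width: int, fillchar: str = " ") -> str:
    """Left-justify by display cells rather than by code point count."""
    used = sum(approx_width(ch) for ch in text)
    return text + fillchar * max(0, width - used)


# Each row should end at the same terminal column.
for word in ("Table", "\u30c6\u30fc\u30d6\u30eb", "cafe\u0301"):
    print(approx_ljust(word, 10) + "|")
```
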
### Utilities and Debugging

Debugging and utility functions for understanding Unicode text and applying transformation plans manually.

```python { .api }
def explain_unicode(text: str) -> None: ...
def apply_plan(text: str, plan: list[tuple[str, str]]) -> str: ...
def badness(text: str) -> int: ...
def is_bad(text: str) -> bool: ...
```

[Utilities and Debugging](./utilities.md)

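When a string looks wrong, listing its code points is often the quickest diagnosis. The sketch below does this with `unicodedata`. The helper `describe` is hypothetical and returns lines instead of printing them; its output format is an assumption made for this example, not the format `explain_unicode` uses.

```python
import unicodedata


def describe(text: str):
    """One line per code point: U+XXXX, the character, its category, its name."""
    lines = []
    for char in text:
        shown = char if char.isprintable() else " "
        name = unicodedata.name(char, "<unknown>")
        lines.append(f"U+{ord(char):04X}  {shown}  [{unicodedata.category(char)}] {name}")
    return lines


# The damaged form of "só": the second and third characters give the problem away.
print("\n".join(describe("s\u00c3\u00b3")))
```

Seeing `LATIN CAPITAL LETTER A WITH TILDE` followed by a symbol such as `SUPERSCRIPT THREE` is a typical sign of UTF-8 bytes that were decoded as Latin-1.
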
### Command Line Interface

Command-line tool for batch text processing with configurable options for encoding, normalization, and entity handling.

```python { .api }
def main() -> None: ...
```

[Command Line Interface](./cli.md)

## Constants

```python { .api }
__version__ = "6.3.1"  # Package version string
```

## Core Types

```python { .api }
class TextFixerConfig(NamedTuple):
    """Configuration for all ftfy text processing options."""
    unescape_html: str | bool = "auto"
    remove_terminal_escapes: bool = True
    fix_encoding: bool = True
    restore_byte_a0: bool = True
    replace_lossy_sequences: bool = True
    decode_inconsistent_utf8: bool = True
    fix_c1_controls: bool = True
    fix_latin_ligatures: bool = True
    fix_character_width: bool = True
    uncurl_quotes: bool = True
    fix_line_breaks: bool = True
    fix_surrogates: bool = True
    remove_control_chars: bool = True
    normalization: Literal["NFC", "NFD", "NFKC", "NFKD"] | None = "NFC"
    max_decode_length: int = 1000000
    explain: bool = True

class ExplainedText(NamedTuple):
    """Result containing fixed text and explanation of changes."""
    text: str
    explanation: list[ExplanationStep] | None

class ExplanationStep(NamedTuple):
    """Single step in text transformation explanation."""
    action: str     # "encode", "decode", "transcode", "apply", "normalize"
    parameter: str  # encoding name or function name
```
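
Because these types are declared as `NamedTuple` classes, the usual `NamedTuple` behavior applies: instances are immutable, and `_replace` returns a modified copy. The sketch below demonstrates this with a hypothetical three-field stand-in, `DemoConfig`, so that it runs without ftfy installed. The real class has the fields listed above.

```python
from typing import NamedTuple, Optional


class DemoConfig(NamedTuple):
    """Hypothetical three-field stand-in for a NamedTuple-based config."""
    fix_encoding: bool = True
    uncurl_quotes: bool = True
    normalization: Optional[str] = "NFC"


base = DemoConfig()

# _replace returns a new instance and leaves the original untouched.
keep_quotes = base._replace(uncurl_quotes=False)
print(keep_quotes)         # DemoConfig(fix_encoding=True, uncurl_quotes=False, normalization='NFC')
print(base.uncurl_quotes)  # True

# _asdict is handy for logging which options were in effect.
print(keep_quotes._asdict())
```

Deriving variants from one base configuration keeps shared settings in a single place.
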