Fixes mojibake and other problems with Unicode, after the fact
Command-line tool for batch text processing with configurable options for encoding, normalization, and entity handling.
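To make the term concrete: mojibake typically arises when bytes written in one encoding are decoded as another. The following is a minimal sketch using only Python's standard library, not ftfy itself; the sample string and variable names are illustrative assumptions. It produces mojibake by decoding UTF-8 bytes as Windows-1252, then undoes that one specific mistake by hand.

```python
# Illustrative only: shows how mojibake arises, without calling ftfy.
original = "✔ done"

# Writing the text as UTF-8 and reading it back as Windows-1252 garbles it.
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)

# Reversing that specific wrong decoding step recovers the original text.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == original)
```

The manual repair only works here because the wrong decoding step is known in advance; ftfy's job is to detect and undo mistakes of this kind automatically.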
Main function providing command-line access to ftfy text processing.
def main() -> None:
    """
    Run ftfy as a command-line utility.

    Processes files or standard input with configurable text fixing options.
    Handles encoding detection, normalization settings, and HTML entity processing.

    Command line usage:
        ftfy [filename] [options]

    Options:
        -o, --output: Output file (default: stdout)
        -g, --guess: Guess input encoding (risky; overrides -e)
        -e, --encoding: Specify input encoding (default: utf-8)
        -n, --normalization: Unicode normalization (default: NFC; "none" disables it)
        --preserve-entities: Don't decode HTML entities

    Examples:
        ftfy input.txt -o output.txt
        ftfy -g mystery.txt
        cat file.txt | ftfy > cleaned.txt
    """

# Fix a single file, output to stdout
ftfy broken_text.txt
# Fix file and save to new file
ftfy input.txt -o fixed_output.txt
# Process standard input
cat messy_file.txt | ftfy > clean_file.txt
echo "âœ” mojibake" | ftfy

# Specify input encoding explicitly
ftfy --encoding latin-1 oldfile.txt
# Let ftfy guess the encoding (not recommended)
ftfy --guess mystery_encoding.txt
# Process file with unknown encoding
ftfy -g -o output.txt unknown_file.txt

# Disable Unicode normalization
ftfy --normalization none input.txt
# Use NFD normalization instead of default NFC
ftfy --normalization NFD input.txt
# Preserve HTML entities (don't decode them)
ftfy --preserve-entities html_file.txt
# Combine options
ftfy -e latin-1 -n NFD --preserve-entities input.txt -o output.txt

# Process all .txt files in directory
for file in *.txt; do
    ftfy "$file" -o "fixed_$file"
done
# Process files preserving directory structure
find . -name "*.txt" -exec sh -c 'ftfy "$1" -o "${1%.txt}_fixed.txt"' _ {} \;
# Process with encoding detection for mixed files
find . -name "*.txt" -exec ftfy -g -o {}.fixed {} \;

You can also access CLI functionality programmatically:
from ftfy.cli import main
import sys
# Simulate command line arguments
sys.argv = ['ftfy', 'input.txt', '-o', 'output.txt', '--encoding', 'latin-1']
main()

from ftfy import fix_file, TextFixerConfig
import sys

def cli_equivalent(input_file, output_file=None, encoding='utf-8',
                   normalization='NFC', preserve_entities=False, guess=False):
    """Replicate CLI behavior in Python."""
    if guess:
        encoding = None

    unescape_html = False if preserve_entities else "auto"
    normalization = None if normalization.lower() == 'none' else normalization

    config = TextFixerConfig(
        unescape_html=unescape_html,
        normalization=normalization
    )

    # Open input file
    if input_file == '-':
        infile = sys.stdin.buffer
    else:
        infile = open(input_file, 'rb')

    # Open output file
    if output_file is None or output_file == '-':
        outfile = sys.stdout
    else:
        outfile = open(output_file, 'w', encoding='utf-8')

    try:
        for line in fix_file(infile, encoding=encoding, config=config):
            outfile.write(line)
    finally:
        if input_file != '-':
            infile.close()
        if output_file not in (None, '-'):
            outfile.close()

# Usage examples
cli_equivalent('messy.txt', 'clean.txt')
cli_equivalent('latin1.txt', encoding='latin-1', preserve_entities=True)
cli_equivalent('unknown.txt', guess=True)

The CLI handles various error conditions:
import sys
from ftfy.cli import main
# Test error conditions
test_cases = [
    # Same input and output file
    ['ftfy', 'test.txt', '-o', 'test.txt'],
    # Invalid encoding
    ['ftfy', 'test.txt', '-e', 'invalid-encoding'],
    # Non-existent input file
    ['ftfy', 'nonexistent.txt']
]

for args in test_cases:
    print(f"Testing: {' '.join(args)}")
    sys.argv = args
    try:
        main()
        print("Success")
    except SystemExit as e:
        print(f"Exit code: {e.code}")
    except Exception as e:
        print(f"Error: {e}")
    print()

#!/bin/bash
# Script to clean up text files from various sources
FTFY_OPTIONS="--encoding utf-8 --normalization NFC"
# Function to process file with error handling
process_file() {
    local input="$1"
    local output="$2"

    if ftfy $FTFY_OPTIONS "$input" -o "$output" 2>/dev/null; then
        echo "✓ Processed: $input → $output"
    else
        echo "✗ Failed to process: $input"
        # Try with encoding detection as fallback
        if ftfy --guess "$input" -o "$output" 2>/dev/null; then
            echo "✓ Processed with encoding detection: $input → $output"
        else
            echo "✗ Complete failure: $input"
            return 1
        fi
    fi
}

# Process all text files (IFS= and -r keep unusual filenames intact)
find . -name "*.txt" | while IFS= read -r file; do
    process_file "$file" "${file%.txt}_clean.txt"
done

# Integration with common text processing pipelines
# Clean web scraping results
curl -s "https://example.com" | html2text | ftfy > clean_content.txt
# Process CSV files with text cleaning
csvcut -c description messy_data.csv | ftfy > clean_descriptions.txt
# Clean up log files
tail -f application.log | ftfy --preserve-entities > clean.log
# Database export cleaning
pg_dump --data-only mytable | ftfy -g > clean_export.sql
# Clean and normalize for analysis
cat survey_responses.txt | ftfy --normalization NFKC > normalized.txt

# Process files with specific configurations for different use cases
# Web content: preserve HTML entities, normalize for display
ftfy --preserve-entities --normalization NFC web_content.txt
# Database text: aggressive cleaning, compatibility normalization
ftfy --normalization NFKC --encoding utf-8 database_dump.txt
# Log processing: preserve structure, clean terminal escapes
ftfy --preserve-entities log_file.txt | grep -v "^\s*$" > clean_log.txt
# Scientific text: canonical decomposition (NFD), no compatibility folding
ftfy --normalization NFD scientific_paper.txt
# Legacy system integration: guess encoding, normalize for compatibility
ftfy --guess --normalization NFKC legacy_export.txt
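
The normalization forms named in the configurations above (NFC, NFD, NFKC) differ in ways that are easy to see with Python's standard `unicodedata` module. This is a small sketch that does not call ftfy, and the sample string is an illustrative assumption; ftfy may apply further fixes on top of the chosen normalization, so treat this only as a picture of what the forms themselves do.

```python
import unicodedata

# "é" as one precomposed code point, plus a superscript two (U+00B2).
sample = "caf\u00e9 x\u00b2"

nfc = unicodedata.normalize("NFC", sample)
nfd = unicodedata.normalize("NFD", sample)
nfkc = unicodedata.normalize("NFKC", sample)

# NFC keeps "é" as a single code point; NFD splits it into
# "e" followed by a combining acute accent, so the string grows by one.
print(len(nfc), len(nfd))

# NFKC additionally folds compatibility characters: the superscript
# two becomes a plain "2", which changes the meaning of the text.
print(nfkc)
```

NFKC is the only one of the three that discards distinctions such as superscripts, which is why it suits search and analysis pipelines better than text that must be displayed as written.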