tessl/pypi-ftfy

Fixes mojibake and other problems with Unicode, after the fact

Command Line Interface

Name: tessl/pypi-ftfy
Author: tessl

Command-line tool for batch text processing with configurable options for encoding, normalization, and entity handling.

Capabilities

Command Line Entry Point

Main function providing command-line access to ftfy text processing.

def main() -> None:
    """
    Run ftfy as command-line utility.
    
    Processes files or standard input with configurable text fixing options.
    Handles encoding detection, normalization settings, and HTML entity processing.
    
    Command line usage:
        ftfy [filename] [options]
        
    Options:
        -o, --output: Output file (default: stdout)
        -g, --guess: Guess input encoding (risky; overrides -e)
        -e, --encoding: Specify input encoding (default: utf-8)
        -n, --normalization: Unicode normalization (default: NFC; "none" disables it)
        --preserve-entities: Don't decode HTML entities
        
    Examples:
        ftfy input.txt -o output.txt
        ftfy -g mystery.txt
        cat file.txt | ftfy > cleaned.txt
    """

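The option list above maps naturally onto an argparse parser. The following is a minimal sketch of such a parser, not ftfy's actual source: build_parser is a hypothetical name, and the "-" defaults for the filename and output (meaning standard input and standard output) are assumptions based on how the examples later in this page treat "-".

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented ftfy options."""
    parser = argparse.ArgumentParser(prog="ftfy")
    # Optional positional argument; "-" is assumed to mean standard input.
    parser.add_argument("filename", nargs="?", default="-")
    # "-" is assumed to mean standard output.
    parser.add_argument("-o", "--output", default="-")
    parser.add_argument("-g", "--guess", action="store_true")
    parser.add_argument("-e", "--encoding", default="utf-8")
    parser.add_argument("-n", "--normalization", default="NFC")
    parser.add_argument("--preserve-entities", action="store_true")
    return parser


args = build_parser().parse_args(["input.txt", "-o", "output.txt", "-n", "NFD"])
print(args.filename, args.output, args.encoding, args.normalization)
```

Flags that are not given keep their documented defaults, so the encoding in this run stays utf-8.
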
Command Line Usage

Basic File Processing

# Fix a single file, output to stdout
ftfy broken_text.txt

# Fix file and save to new file
ftfy input.txt -o fixed_output.txt

# Process standard input
cat messy_file.txt | ftfy > clean_file.txt
echo "âœ” mojibake" | ftfy
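
To see where text like the echo example comes from, the sketch below reproduces the mistake with the standard library only: UTF-8 bytes read back as Windows-1252. It illustrates the byte-level mechanism and is not ftfy's detection algorithm, which has to work out heuristically which mistake was made.

```python
# The check mark U+2714, written with an escape so the source stays ASCII.
original = "\u2714"

# Simulate the mistake: UTF-8 bytes decoded as Windows-1252.
garbled = original.encode("utf-8").decode("windows-1252")

# Undo it by reversing the two steps in the opposite order.
restored = garbled.encode("windows-1252").decode("utf-8")

print(repr(garbled))
print(restored == original)
```
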

Encoding Options

# Specify input encoding explicitly  
ftfy --encoding latin-1 oldfile.txt

# Let ftfy guess the encoding (not recommended)
ftfy --guess mystery_encoding.txt

# Process file with unknown encoding
ftfy -g -o output.txt unknown_file.txt
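
The reason an explicit encoding matters can be shown with plain Python byte decoding. This sketch uses only the standard library and a made-up sample string; it shows Latin-1 bytes failing to decode as UTF-8 and succeeding once the right encoding is named.

```python
# "café" stored as Latin-1: the last byte is 0xE9, which is not valid UTF-8 here.
raw = "caf\u00e9".encode("latin-1")

try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Naming the correct encoding decodes the same bytes without error.
text = raw.decode("latin-1")

print(decoded_as_utf8)
print(text == "caf\u00e9")
```
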

Normalization and Entity Options

# Disable Unicode normalization
ftfy --normalization none input.txt

# Use NFD normalization instead of default NFC
ftfy --normalization NFD input.txt

# Preserve HTML entities (don't decode them)
ftfy --preserve-entities html_file.txt

# Combine options
ftfy -e latin-1 -n NFD --preserve-entities input.txt -o output.txt
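
What the normalization and entity options change can be previewed with the standard library. The sketch below uses unicodedata.normalize and html.unescape as stand-ins; ftfy applies its own logic for entities, so treat this as an illustration of the concepts rather than of ftfy's exact behavior.

```python
import html
import unicodedata

# NFC composes "e" + combining acute accent into one code point.
composed = unicodedata.normalize("NFC", "e\u0301")

# NFD does the reverse: one code point becomes base letter + combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e9")

# Decoding HTML entities, which is what --preserve-entities switches off.
decoded = html.unescape("&lt;b&gt; &amp; &eacute;")

print(len(composed), len(decomposed))
print(decoded)
```
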

Batch Processing Examples

# Process all .txt files in directory
for file in *.txt; do
    ftfy "$file" -o "fixed_$file"
done

# Process files preserving directory structure
find . -name "*.txt" -exec sh -c 'ftfy "$1" -o "${1%.txt}_fixed.txt"' _ {} \;

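# Process with encoding detection for mixed files
# (sh -c keeps {} a standalone argument, which is all POSIX find guarantees to substitute)
find . -name "*.txt" -exec sh -c 'ftfy -g -o "$1.fixed" "$1"' _ {} \;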
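
The same batch pattern can be written in Python. This is a sketch under stated assumptions: fixed_name and batch_fix are hypothetical helpers, and the fixer argument is any function from str to str (in real use you would likely pass ftfy.fix_text).

```python
from pathlib import Path
from typing import Callable, List


def fixed_name(path: Path) -> Path:
    """Return the sibling path 'name_fixed.ext' for 'name.ext'."""
    return path.with_name(f"{path.stem}_fixed{path.suffix}")


def batch_fix(root: Path, fixer: Callable[[str], str],
              encoding: str = "utf-8") -> List[Path]:
    """Apply fixer to every .txt file under root, writing *_fixed.txt siblings."""
    written = []
    # sorted() materialises the listing, so files written below are not revisited.
    for source in sorted(root.rglob("*.txt")):
        if source.stem.endswith("_fixed"):
            continue  # skip outputs left by an earlier run
        target = fixed_name(source)
        text = source.read_text(encoding=encoding)
        target.write_text(fixer(text), encoding="utf-8")
        written.append(target)
    return written


# Example call (commented out because it touches the filesystem):
# batch_fix(Path("."), fixer=some_text_fixing_function)
```

Skipping names that already end in _fixed keeps a second run from producing files such as a_fixed_fixed.txt.
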

Python API Access

main() takes no parameters and reads its options from sys.argv, so calling it from Python means setting sys.argv first:

from ftfy.cli import main
import sys

# Simulate command line arguments
sys.argv = ['ftfy', 'input.txt', '-o', 'output.txt', '--encoding', 'latin-1']
main()
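
Assigning to sys.argv changes it for the rest of the process. A small context manager can restore the previous value afterwards; patched_argv below is a hypothetical helper built on the standard library, not something ftfy provides.

```python
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(argv):
    """Temporarily replace sys.argv, restoring the old list on exit."""
    saved = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = saved


with patched_argv(["ftfy", "input.txt", "-o", "output.txt"]):
    # A call such as ftfy.cli.main() would go here.
    print(sys.argv[1:])
```

The finally clause restores sys.argv even if the wrapped call raises SystemExit.
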

Usage Examples from Python

Replicating CLI Behavior

from ftfy import fix_file, TextFixerConfig
import sys

def cli_equivalent(input_file, output_file=None, encoding='utf-8', 
                  normalization='NFC', preserve_entities=False, guess=False):
    """Replicate CLI behavior in Python."""
    
    if guess:
        encoding = None
        
    unescape_html = False if preserve_entities else "auto"
    normalization = None if normalization.lower() == 'none' else normalization
    
    config = TextFixerConfig(
        unescape_html=unescape_html,
        normalization=normalization
    )
    
    # Open input file
    if input_file == '-':
        infile = sys.stdin.buffer
    else:
        infile = open(input_file, 'rb')
    
    # Open output file  
    if output_file is None or output_file == '-':
        outfile = sys.stdout
    else:
        outfile = open(output_file, 'w', encoding='utf-8')
    
    try:
        for line in fix_file(infile, encoding=encoding, config=config):
            outfile.write(line)
    finally:
        if input_file != '-':
            infile.close()
        if output_file not in (None, '-'):
            outfile.close()

# Usage examples
cli_equivalent('messy.txt', 'clean.txt')
cli_equivalent('latin1.txt', encoding='latin-1', preserve_entities=True)
cli_equivalent('unknown.txt', guess=True)

Error Handling

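The CLI exits with a nonzero status when it detects a problem it knows about, such as the output path being the same file as the input or input that cannot be decoded in the chosen encoding. Other failures, such as a missing input file, may surface as ordinary Python exceptions. The script below exercises a few cases and reports how each one ends: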

import sys
from ftfy.cli import main

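# Test error conditions (test.txt should exist; otherwise a missing-file
# error is likely to occur before the condition under test is reached)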
test_cases = [
    # Same input and output file
    ['ftfy', 'test.txt', '-o', 'test.txt'],
    
    # Invalid encoding 
    ['ftfy', 'test.txt', '-e', 'invalid-encoding'],
    
    # Non-existent input file
    ['ftfy', 'nonexistent.txt']
]

for args in test_cases:
    print(f"Testing: {' '.join(args)}")
    sys.argv = args
    try:
        main()
        print("Success")
    except SystemExit as e:
        print(f"Exit code: {e.code}")
    except Exception as e:
        print(f"Error: {e}")
    print()
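
One plausible way to detect the same-file case is to compare resolved paths rather than the raw strings. The sketch below assumes a path-based check; is_same_file is a hypothetical helper and is not part of ftfy's public API.

```python
import os


def is_same_file(input_path: str, output_path: str) -> bool:
    """Return True if both paths resolve to the same location on disk."""
    # realpath normalises "./", "..", and symlinks; the files need not exist.
    return os.path.realpath(input_path) == os.path.realpath(output_path)


print(is_same_file("test.txt", "./test.txt"))
print(is_same_file("input.txt", "output.txt"))
```

Comparing resolved paths catches spellings such as ./test.txt that a plain string comparison would miss.
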

Integration with Shell Scripts

#!/bin/bash
# Script to clean up text files from various sources

FTFY_OPTIONS=(--encoding utf-8 --normalization NFC)

# Function to process file with error handling
process_file() {
    local input="$1"
    local output="$2"
    
    if ftfy "${FTFY_OPTIONS[@]}" "$input" -o "$output" 2>/dev/null; then
        echo "✓ Processed: $input → $output"
    else
        echo "✗ Failed as UTF-8, retrying with encoding detection: $input"
        # Try with encoding detection as fallback
        if ftfy --guess "$input" -o "$output" 2>/dev/null; then
            echo "✓ Processed with encoding detection: $input → $output"
        else
            echo "✗ Complete failure: $input"
            return 1
        fi
    fi
}

# Process all text files, skipping outputs from earlier runs
find . -name "*.txt" ! -name "*_clean.txt" -print0 | while IFS= read -r -d '' file; do
    process_file "$file" "${file%.txt}_clean.txt"
done
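
The try-then-fall-back logic of the script can be sketched in Python as well. Note the substitution: this version tries a fixed list of candidate encodings in order, which is a simplified stand-in for ftfy's --guess and not how ftfy guesses. decode_with_fallback and its default candidate list are hypothetical.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    data: bytes,
    encodings: Sequence[str] = ("utf-8", "windows-1252", "latin-1"),
) -> Tuple[str, str]:
    """Decode data with the first candidate encoding that succeeds.

    Returns the decoded text and the name of the encoding that worked.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


print(decode_with_fallback("caf\u00e9".encode("utf-8"))[1])
print(decode_with_fallback(b"caf\xe9")[1])
```

Because latin-1 maps every byte to a character, keeping it last means the default list never raises. It also means a successful decode does not prove the guess was right.
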

Pipeline Integration

# Integration with common text processing pipelines

# Clean web scraping results
curl -s "https://example.com" | html2text | ftfy > clean_content.txt

# Process CSV files with text cleaning
csvcut -c description messy_data.csv | ftfy > clean_descriptions.txt

# Clean up log files 
tail -f application.log | ftfy --preserve-entities > clean.log

# Database export cleaning (mydb is a placeholder database name)
pg_dump --data-only -t mytable mydb | ftfy -g > clean_export.sql

# Clean and normalize for analysis
cat survey_responses.txt | ftfy --normalization NFKC > normalized.txt
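
Each pipeline above treats ftfy as a line-oriented filter between two streams. The sketch below shows that shape in Python with a pluggable fixer; filter_stream is a hypothetical helper, and the whitespace-collapsing lambda only stands in for a real text fixer.

```python
import io
from typing import Callable, TextIO


def filter_stream(infile: TextIO, outfile: TextIO,
                  fixer: Callable[[str], str]) -> int:
    """Copy infile to outfile line by line, passing each line through fixer.

    Returns the number of lines processed.
    """
    count = 0
    for line in infile:
        outfile.write(fixer(line))
        count += 1
    return count


source = io.StringIO("first  line\nsecond\tline\n")
sink = io.StringIO()
# Stand-in fixer: collapse runs of whitespace and re-add the newline.
filter_stream(source, sink, lambda line: " ".join(line.split()) + "\n")
print(sink.getvalue(), end="")
```

Working line by line keeps memory use flat, which is what lets such a filter sit behind a long-running producer.
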

Advanced CLI Usage

# Process files with specific configurations for different use cases

# Web content: preserve HTML entities, normalize for display
ftfy --preserve-entities --normalization NFC web_content.txt

# Database text: aggressive cleaning, compatibility normalization  
ftfy --normalization NFKC --encoding utf-8 database_dump.txt

# Log processing: preserve structure, clean terminal escapes
ftfy --preserve-entities log_file.txt | grep -v "^\s*$" > clean_log.txt

# Scientific text: keep superscripts, ligatures, and similar symbols distinct
# (NFD decomposes accented letters but applies no compatibility folding)
ftfy --normalization NFD scientific_paper.txt

# Legacy system integration: guess encoding, normalize for compatibility
ftfy --guess --normalization NFKC legacy_export.txt
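
The choice between the canonical forms (NFC, NFD) and the compatibility form NFKC matters most for symbols that carry meaning through their formatting. This standard-library sketch uses a made-up sample containing a superscript two and an "fi" ligature to show what NFKC folds away.

```python
import unicodedata

# "x" + SUPERSCRIPT TWO, a space, then LATIN SMALL LIGATURE FI + "t".
sample = "x\u00b2 \ufb01t"

nfc = unicodedata.normalize("NFC", sample)
nfd = unicodedata.normalize("NFD", sample)
nfkc = unicodedata.normalize("NFKC", sample)

# Canonical forms leave both characters alone; NFKC replaces them
# with their plain ASCII equivalents.
print(nfc == sample, nfd == sample)
print(nfkc)
```

After NFKC the exponent is indistinguishable from an ordinary digit, which is why it suits search and matching better than text where notation carries meaning.
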
