tessl/pypi-charset-normalizer

The Real First Universal Charset Detector providing modern, fast, and reliable character encoding detection as an alternative to chardet.

CLI Interface

charset-normalizer ships a command-line interface for charset detection and file normalization. The same functionality is available as a shell command (normalizer) and as importable Python functions.

Capabilities

Command-Line Detection

Primary CLI entry point. It parses an argv-style list of arguments (file paths plus option flags), analyzes each file, and prints the results to stdout as JSON.

def cli_detect(argv: list[str] | None = None) -> int:
    """
    CLI entry point built on argparse; accepts the same arguments as the normalizer command.
    
    Parameters:
    - argv: List of command-line arguments (file paths followed by flags). When None, arguments are read from sys.argv.
    
    Supported flags:
    - -a, --with-alternative: Output complementary possibilities if any (JSON list format)
    - -n, --normalize: Permit normalization of input files
    - -m, --minimal: Only output charset to STDOUT, disabling JSON output
    - -r, --replace: Replace files when normalizing instead of creating new ones (requires --normalize)
    - -f, --force: Replace files without asking for confirmation (requires --replace)
    - -t, --threshold: Custom maximum chaos allowed in decoded content (0.0-1.0, default 0.2)
    - -v, --verbose: Display complementary information and detection logs
    
    Returns:
    int: 0 on success, non-zero if something went wrong
    
    Note: Results are written to stdout, not returned. Invalid arguments make argparse raise SystemExit.
    """

Usage Example:

from charset_normalizer.cli import cli_detect

# cli_detect takes an argv-style list: file paths followed by flags

# Analyze single file
cli_detect(['document.txt'])

# Analyze with alternatives and verbose output
cli_detect(['data.csv', '--with-alternative', '--verbose'])

# Normalize files with replacement
cli_detect(['file1.txt', 'file2.csv', '--normalize', '--replace', '--force'])

# Use custom detection threshold
cli_detect(['mixed_encoding.txt', '--threshold', '0.15', '--verbose'])

Interactive Confirmation

Helper function for interactive yes/no prompts in CLI operations.

def query_yes_no(question: str, default: str = "yes") -> bool:
    """
    Ask a yes/no question via input() and return the answer.
    
    Parameters:
    - question: Question string presented to the user
    - default: Presumed answer if user just hits Enter ("yes", "no", or None)
    
    Returns:
    bool: True for "yes", False for "no"
    
    Raises:
    ValueError: If default is not "yes", "no", or None
    
    Note: Used internally by CLI for confirmation prompts
    """

Usage Example:

from charset_normalizer.cli import query_yes_no

# Basic yes/no prompt
if query_yes_no("Do you want to continue?"):
    print("Proceeding...")
else:
    print("Cancelled")

# Default to "no"
if query_yes_no("Delete all files?", default="no"):
    print("Files deleted")

# Require explicit answer
answer = query_yes_no("Are you sure?", default=None)

Shell Command Usage

The charset-normalizer package provides the normalizer command-line tool:

# Basic detection
normalizer document.txt

# Multiple files with alternatives
normalizer file1.txt file2.csv --with-alternative

# Normalize files in place
normalizer data.txt --normalize --replace --force

# Verbose detection with custom threshold
normalizer mixed_encoding.txt --verbose --threshold 0.15

# Minimal output (encoding name only)
normalizer simple.txt --minimal
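
The flags above can also be assembled in Python before being passed to the normalizer command (for example through subprocess) or to an argv-style entry point. The following is a minimal sketch: build_normalizer_args is a hypothetical helper, not part of charset-normalizer, and its checks assume the flag dependencies implied by the examples (--replace goes with --normalize, --force goes with --replace, threshold within 0.0-1.0).

```python
from typing import List, Sequence


def build_normalizer_args(
    paths: Sequence[str],
    alternatives: bool = False,
    normalize: bool = False,
    minimal: bool = False,
    replace: bool = False,
    force: bool = False,
    threshold: float = 0.2,
    verbose: bool = False,
) -> List[str]:
    """Hypothetical helper: turn keyword options into normalizer CLI arguments."""
    if not paths:
        raise ValueError("at least one file path is required")
    if replace and not normalize:
        raise ValueError("--replace is only meaningful together with --normalize")
    if force and not replace:
        raise ValueError("--force is only meaningful together with --replace")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")

    # File paths come first, followed by the enabled boolean flags.
    args = list(paths)
    flags = [
        (alternatives, "--with-alternative"),
        (normalize, "--normalize"),
        (minimal, "--minimal"),
        (replace, "--replace"),
        (force, "--force"),
        (verbose, "--verbose"),
    ]
    args.extend(flag for enabled, flag in flags if enabled)

    # Only pass --threshold when it differs from the documented default of 0.2.
    if threshold != 0.2:
        args.extend(["--threshold", str(threshold)])
    return args
```

Validating before the call keeps mistakes such as --force without --replace out of the shell invocation entirely.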

JSON Output Format

The CLI outputs structured JSON results for programmatic consumption:

{
    "path": "/path/to/document.txt",
    "encoding": "utf_8",
    "encoding_aliases": ["utf-8", "u8", "utf8"],
    "alternative_encodings": ["ascii"],
    "language": "English",
    "alphabets": ["Basic Latin"],
    "has_sig_or_bom": false,
    "chaos": 0.02,
    "coherence": 0.85,
    "unicode_path": null,
    "is_preferred": true
}

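When a run produces more than one result (several input files, or --with-alternative with additional candidates), the output becomes an array of results. The entries below are abbreviated for readability: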

[
    {
        "path": "/path/to/document.txt",
        "encoding": "utf_8",
        "language": "English",
        "chaos": 0.02,
        "coherence": 0.85,
        "is_preferred": true
    },
    {
        "path": "/path/to/document.txt",
        "encoding": "latin_1",
        "language": "English",
        "chaos": 0.05,
        "coherence": 0.82,
        "is_preferred": false
    }
]
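
Because the top-level JSON value can be either a single object or an array, consumers usually need to normalize it before use. The following is a minimal sketch using only the standard library; load_results and preferred_encodings are hypothetical helper names, and the field names are taken from the sample output above.

```python
import json
from typing import Any, Dict, List, Optional


def load_results(raw: str) -> List[Dict[str, Any]]:
    """Parse CLI JSON output and always return a list of result objects."""
    data = json.loads(raw)
    return data if isinstance(data, list) else [data]


def preferred_encodings(results: List[Dict[str, Any]]) -> Dict[str, Optional[str]]:
    """Map each path to the encoding of its preferred result (None if undetected)."""
    return {
        entry["path"]: entry.get("encoding")
        for entry in results
        if entry.get("is_preferred")
    }
```

Calling preferred_encodings(load_results(output)) then works the same way regardless of which shape the CLI emitted.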

Integration Patterns

Script Integration

import json
from contextlib import redirect_stdout
from io import StringIO

from charset_normalizer.cli import cli_detect

# Capture CLI output programmatically; stdout is restored when the block exits
buffer = StringIO()
with redirect_stdout(buffer):
    exit_code = cli_detect(['document.txt'])  # returns 0 on success

# Print only after stdout is restored, otherwise the message lands in the buffer
if exit_code == 0:
    result = json.loads(buffer.getvalue())
    print(f"Detected encoding: {result['encoding']}")

Batch Processing

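import os

from charset_normalizer.cli import cli_detect

# Process all text files in the current directory
text_files = [f for f in os.listdir('.') if f.endswith('.txt')]

# The CLI requires at least one file, so skip the call when nothing matched
if text_files:
    cli_detect(text_files + ['--with-alternative', '--verbose'])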

Safe File Normalization

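from charset_normalizer.cli import cli_detect, query_yes_no

def safe_normalize_files(file_paths):
    """Safely normalize files with user confirmation."""
    # First, detect encodings (read-only pass)
    cli_detect(list(file_paths) + ['--verbose'])
    
    # Ask once for confirmation, defaulting to "no" for a destructive action
    if query_yes_no(f"Normalize {len(file_paths)} files?", default="no"):
        # --force skips the per-file prompt, since the user has already confirmed
        exit_code = cli_detect(list(file_paths) + ['--normalize', '--replace', '--force'])
        if exit_code == 0:
            print("Files normalized successfully")
        else:
            print(f"Normalization failed (exit code {exit_code})")
    else:
        print("Normalization cancelled")

# Usage
safe_normalize_files(['doc1.txt', 'doc2.csv'])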
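
Because --replace overwrites the original files, taking a backup copy beforehand gives a way back if a detection turns out to be wrong. The following is a minimal sketch using only the standard library; backup_files and restore_files are hypothetical helpers, not part of charset-normalizer, and the .bak suffix is an arbitrary choice.

```python
import shutil
from pathlib import Path
from typing import Dict, Iterable


def backup_files(paths: Iterable[str], suffix: str = ".bak") -> Dict[str, str]:
    """Copy each file to '<name><suffix>' next to it and return {original: backup}."""
    backups: Dict[str, str] = {}
    for path in paths:
        source = Path(path)
        target = source.with_name(source.name + suffix)
        # copy2 copies the bytes unchanged and keeps metadata where possible
        shutil.copy2(source, target)
        backups[path] = str(target)
    return backups


def restore_files(backups: Dict[str, str]) -> None:
    """Copy every backup over its original file."""
    for original, backup in backups.items():
        shutil.copy2(backup, original)
```

A caller could run backup_files(file_paths) before the normalizing pass and restore_files(...) if the result is not what was expected.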

Error Handling

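The CLI functions report problems through exit codes, stderr messages, and argparse exits rather than by silently skipping files:

  • File not found or permission errors: argparse opens each file up front; if one cannot be opened, it prints a usage error and raises SystemExit before any detection runs
  • Invalid flag combinations: --replace without --normalize, --force without --replace, or a --threshold outside 0.0-1.0 print a message to stderr and return a non-zero code
  • Undetectable content (for example binary files): a message is printed to stderr and the result for that file has a null encoding
  • Write failures during normalization: the error is printed to stderr and a non-zero code is returned
  • User interruption: Ctrl+C surfaces as KeyboardInterrupt in the caller

For programmatic usage, check the return value and wrap CLI calls in try/except blocks. SystemExit is not a subclass of Exception, so it must be caught explicitly:

try:
    exit_code = cli_detect(['problematic_file.bin'])
    if exit_code != 0:
        print(f"CLI reported failure (exit code {exit_code})")
except KeyboardInterrupt:
    print("Detection interrupted by user")
except SystemExit as e:
    # Raised by argparse for invalid arguments or files that cannot be opened
    print(f"CLI argument error (exit code {e.code})")
except Exception as e:
    print(f"CLI error: {e}")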
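
The same handling can be wrapped once and reused. The following is a minimal sketch of a generic wrapper for argv-style CLI entry points; run_cli is a hypothetical helper, and passing cli_detect to it assumes that cli_detect accepts a list of arguments and returns an integer exit code.

```python
import contextlib
import io
from typing import Callable, List, Optional, Tuple


def run_cli(
    cli_func: Callable[[List[str]], Optional[int]], argv: List[str]
) -> Tuple[int, str, str]:
    """Call an argv-style CLI function; return (exit_code, stdout, stderr)."""
    out, err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        try:
            code = cli_func(argv)
        except SystemExit as exc:
            # argparse exits this way on invalid arguments
            code = exc.code
    if code is None:
        code = 0  # a bare return or sys.exit() counts as success
    elif not isinstance(code, int):
        code = 1  # sys.exit("message") style exits count as failure
    return code, out.getvalue(), err.getvalue()
```

Under that assumption, usage would look like code, out, err = run_cli(cli_detect, ['document.txt']), after which out can be parsed as JSON when code is 0.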

Install with Tessl CLI

npx tessl i tessl/pypi-charset-normalizer
