tessl/pypi-charset-normalizer

The Real First Universal Charset Detector providing modern, fast, and reliable character encoding detection as an alternative to chardet.

Legacy Compatibility

Chardet-compatible detection function that provides backward compatibility for applications migrating from chardet to charset-normalizer. This function maintains the same API interface and return format as chardet while leveraging charset-normalizer's improved detection algorithms.

Capabilities

Chardet-Compatible Detection

Drop-in replacement for chardet.detect() with improved accuracy and performance while maintaining the same return format.

def detect(
    byte_str: bytes,
    should_rename_legacy: bool = False,
    **kwargs: Any
) -> ResultDict:
    """
    Chardet-compatible charset detection function.
    
    Provides backward compatibility with chardet API while using
    charset-normalizer's advanced detection algorithms. Maintained
    for migration purposes but not recommended for new projects.
    
    Parameters:
    - byte_str: Raw bytes to analyze for encoding detection
    - should_rename_legacy: Whether to rename legacy encodings to modern equivalents
    - **kwargs: Additional arguments (ignored with warning for compatibility)
    
    Returns:
    dict with keys:
    - 'encoding': str | None - Detected encoding name (chardet-compatible)
    - 'language': str - Detected language or empty string
    - 'confidence': float | None - Confidence score (0.0-1.0)
    
    Raises:
    TypeError: If byte_str is not bytes or bytearray
    
    Note: This function is deprecated for new code. Use from_bytes() instead.
    """

Usage Example:

import charset_normalizer

# Basic chardet-compatible usage
raw_data = b'\xe4\xb8\xad\xe6\x96\x87'  # Chinese text
result = charset_normalizer.detect(raw_data)

print(f"Encoding: {result['encoding']}")      # utf_8 or utf-8
print(f"Language: {result['language']}")      # Chinese or empty string  
print(f"Confidence: {result['confidence']}")  # 0.99 (0.0-1.0 scale)

# Handle None results
if result['encoding']:
    try:
        decoded_text = raw_data.decode(result['encoding'])
        print(f"Text: {decoded_text}")
    except UnicodeDecodeError:
        print("Decoding failed despite detection")
else:
    print("No encoding detected")
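As noted in the signature above, detect() raises TypeError when given anything other than bytes or bytearray. A minimal pre-check can make that failure mode explicit before calling the library; ensure_bytes below is an illustrative helper, not part of charset-normalizer:

```python
def ensure_bytes(data):
    """Illustrative guard: detect() accepts only bytes or bytearray."""
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError(
            f"Expected bytes or bytearray, got {type(data).__name__}"
        )
    return bytes(data)

# str input must be encoded before detection
payload = ensure_bytes("中文".encode("utf-8"))
```

This mirrors the validation the library performs itself, but lets you raise a clearer error at your own API boundary.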

Migration from Chardet

Direct replacement patterns for common chardet usage:

# Old chardet code:
# import chardet
# result = chardet.detect(raw_bytes)

# Direct replacement:
import charset_normalizer
result = charset_normalizer.detect(raw_bytes)

# Access same result structure
encoding = result['encoding']
confidence = result['confidence']
language = result['language']
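For codebases that must keep working before and after the switch, a guarded import lets both libraries satisfy the same call sites. This is a common migration pattern, sketched here without assuming either package is installed:

```python
try:
    from charset_normalizer import detect
except ImportError:
    try:
        from chardet import detect  # fallback for older environments
    except ImportError:
        detect = None  # neither library available

if detect is not None:
    result = detect(b"hello world")
```

Once all environments have charset-normalizer, the fallback branch can be deleted and the import simplified.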

Legacy Encoding Names

Control whether legacy encoding names are modernized:

import charset_normalizer

# With legacy names (default - matches chardet output)
result = charset_normalizer.detect(raw_data, should_rename_legacy=False)
print(result['encoding'])  # May be 'ISO-8859-1' (chardet style)

# With modern names
result = charset_normalizer.detect(raw_data, should_rename_legacy=True)
print(result['encoding'])  # e.g. 'iso8859-1' (modern Python codec name)
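Whichever naming style detect() returns, Python's codec machinery treats the variants as aliases of the same codec, so the result can be passed straight to bytes.decode() either way. The stdlib codecs module shows the normalization:

```python
import codecs

# 'ISO-8859-1', 'iso8859_1' and 'latin-1' all resolve to the same codec
names = ["ISO-8859-1", "iso8859_1", "latin-1"]
resolved = {codecs.lookup(name).name for name in names}
print(resolved)  # {'iso8859-1'}
```

This is why the naming-style choice rarely matters in practice: it only affects how the name is displayed, not which codec decodes your data.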

Compatibility Notes

Return Format Differences

While the basic structure matches chardet, there are subtle differences:

# Chardet typical result:
{
    'encoding': 'utf-8',
    'confidence': 0.99,
    'language': ''
}

# Charset-normalizer detect() result:
{
    'encoding': 'utf_8',      # Python codec-style names (underscores) by default
    'confidence': 0.98,       # May differ due to improved algorithms
    'language': 'English'     # More comprehensive language detection
}
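Because of these naming differences, comparing encoding names as raw strings across the two libraries is fragile. A small helper (hypothetical, stdlib-only) can normalize both sides through codecs.lookup() before comparing:

```python
import codecs

def same_encoding(a: str, b: str) -> bool:
    """Compare two encoding names by their canonical codec, ignoring style."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        return False

print(same_encoding("utf_8", "UTF-8"))      # True
print(same_encoding("utf_8", "utf_8_sig"))  # False: the BOM variant is a distinct codec
```

Useful in test suites that assert chardet and charset-normalizer agree on a corpus of files.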

BOM Handling

Charset-normalizer handles BOM (Byte Order Mark) differently:

# UTF-8 with BOM
utf8_bom_data = b'\xef\xbb\xbfHello World'

# Chardet returns: 'UTF-8-SIG'
# Charset-normalizer detect() returns: 'utf_8_sig' (when BOM detected)
result = charset_normalizer.detect(utf8_bom_data)
print(result['encoding'])  # 'utf_8_sig'
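The practical upshot: decoding with the returned 'utf_8_sig' name strips the BOM automatically, while plain 'utf-8' keeps it as a leading U+FEFF character. This is stdlib codec behavior, independent of either detection library:

```python
utf8_bom_data = b"\xef\xbb\xbfHello World"

with_sig = utf8_bom_data.decode("utf_8_sig")  # BOM stripped
plain = utf8_bom_data.decode("utf-8")         # BOM kept as U+FEFF

print(repr(with_sig))  # 'Hello World'
print(repr(plain))     # '\ufeffHello World'
```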

Confidence Scoring

Confidence calculation differs between libraries:

# For comparison with modern API
modern_result = charset_normalizer.from_bytes(raw_data).best()
legacy_result = charset_normalizer.detect(raw_data)

if modern_result is not None:
    # Modern confidence (inverse of chaos ratio)
    modern_confidence = 1.0 - modern_result.chaos

    # Legacy confidence (direct from detect)
    legacy_confidence = legacy_result['confidence']

    # Values may differ due to different calculation methods
    print(f"Modern: {modern_confidence:.3f}")
    print(f"Legacy: {legacy_confidence:.3f}")

Migration Recommendations

Gradual Migration Strategy

1. Phase 1: Direct replacement
# Replace import only
# from chardet import detect
from charset_normalizer import detect

# Keep existing code unchanged
result = detect(raw_bytes)
2. Phase 2: Enhanced error handling
import charset_normalizer

def safe_detect(raw_bytes):
    """Enhanced wrapper with better error handling."""
    try:
        result = charset_normalizer.detect(raw_bytes)
        if result['encoding'] and result['confidence'] > 0.7:
            return result
        else:
            # Fallback to modern API for better results
            modern_result = charset_normalizer.from_bytes(raw_bytes).best()
            if modern_result:
                return {
                    'encoding': modern_result.encoding,
                    'confidence': 1.0 - modern_result.chaos,
                    'language': modern_result.language
                }
    except Exception:
        pass
    
    return {'encoding': None, 'confidence': None, 'language': ''}
3. Phase 3: Modern API adoption
import charset_normalizer

# Migrate to modern API for new code
results = charset_normalizer.from_bytes(raw_bytes)
best = results.best()

if best:
    # More detailed information available
    encoding = best.encoding
    confidence = 1.0 - best.chaos
    language = best.language
    alphabets = best.alphabets
    text = str(best)

Performance Considerations

The legacy detect() function is a thin wrapper around from_bytes(), so both paths run the same underlying analysis:

import time
import charset_normalizer

large_data = b"Sample text for benchmarking. " * 10_000

# Legacy function (single result)
start = time.perf_counter()
result = charset_normalizer.detect(large_data)
legacy_time = time.perf_counter() - start

# Modern API (multiple candidates)
start = time.perf_counter()
results = charset_normalizer.from_bytes(large_data)
best = results.best()
modern_time = time.perf_counter() - start

# Timings are typically similar since detect() delegates to from_bytes();
# the modern API simply exposes richer results from the same analysis

Debugging Legacy Issues

The legacy detect() accepts extra keyword arguments only for chardet compatibility and ignores them with a warning, so passing explain=True to detect() has no effect.

For actual debugging, use the modern API:

import charset_normalizer

# explain=True logs each step of the detection process
results = charset_normalizer.from_bytes(raw_data, explain=True)
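To see those explanations, the library's logger must be allowed through your logging configuration. The sketch below assumes charset-normalizer logs under the "charset_normalizer" logger name and that no custom logging setup is in place:

```python
import logging

# Route charset_normalizer's debug output to the console
logging.basicConfig(format="%(levelname)s:%(name)s:%(message)s")
logger = logging.getLogger("charset_normalizer")
logger.setLevel(logging.DEBUG)

# Subsequent from_bytes(..., explain=True) calls will now emit
# step-by-step detection records through this logger.
```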

Install with Tessl CLI

npx tessl i tessl/pypi-charset-normalizer
