The Real First Universal Charset Detector providing modern, fast, and reliable character encoding detection as an alternative to chardet.
Chardet-compatible detection function that provides backward compatibility for applications migrating from chardet to charset-normalizer. It keeps the same API and return format as chardet while using charset-normalizer's improved detection algorithms.
A drop-in replacement for chardet.detect() with improved accuracy and performance.
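The return shape described above can be sketched as a TypedDict. This is an illustrative stand-in written from the field descriptions on this page, not charset-normalizer's own definition of `ResultDict`:

```python
from typing import Optional, TypedDict

# Illustrative sketch of the detect() return contract described on this page;
# not the library's own type definition
class ResultDict(TypedDict):
    encoding: Optional[str]      # e.g. 'utf-8', or None when nothing was detected
    language: str                # e.g. 'English', or '' when unknown
    confidence: Optional[float]  # 0.0-1.0, or None when nothing was detected

sample: ResultDict = {"encoding": "utf-8", "language": "", "confidence": 0.99}
print(sorted(sample))  # the three keys every result carries
```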
```python
def detect(
    byte_str: bytes,
    should_rename_legacy: bool = False,
    **kwargs: Any
) -> ResultDict:
    """
    Chardet-compatible charset detection function.

    Provides backward compatibility with the chardet API while using
    charset-normalizer's advanced detection algorithms. Maintained
    for migration purposes but not recommended for new projects.

    Parameters:
    - byte_str: Raw bytes to analyze for encoding detection
    - should_rename_legacy: Whether to rename legacy encodings to modern equivalents
    - **kwargs: Additional arguments (ignored with a warning, for compatibility)

    Returns:
    dict with keys:
    - 'encoding': str | None - Detected encoding name (chardet-compatible)
    - 'language': str - Detected language, or an empty string
    - 'confidence': float | None - Confidence score (0.0-1.0)

    Raises:
    TypeError: If byte_str is not bytes or bytearray

    Note: This function is deprecated for new code. Use from_bytes() instead.
    """
```

Usage Example:
```python
import charset_normalizer

# Basic chardet-compatible usage
raw_data = b'\xe4\xb8\xad\xe6\x96\x87'  # Chinese text
result = charset_normalizer.detect(raw_data)
print(f"Encoding: {result['encoding']}")      # utf_8 or utf-8
print(f"Language: {result['language']}")      # Chinese or empty string
print(f"Confidence: {result['confidence']}")  # 0.99 (0.0-1.0 scale)

# Handle None results
if result['encoding']:
    try:
        decoded_text = raw_data.decode(result['encoding'])
        print(f"Text: {decoded_text}")
    except UnicodeDecodeError:
        print("Decoding failed despite detection")
else:
    print("No encoding detected")
```

Direct replacement patterns for common chardet usage:
```python
# Old chardet code:
# import chardet
# result = chardet.detect(raw_bytes)

# Direct replacement:
import charset_normalizer

result = charset_normalizer.detect(raw_bytes)

# Access the same result structure
encoding = result['encoding']
confidence = result['confidence']
language = result['language']
```

Control whether legacy encoding names are modernized:
```python
import charset_normalizer

# With legacy names (default - matches chardet output)
result = charset_normalizer.detect(raw_data, should_rename_legacy=False)
print(result['encoding'])  # May be 'ISO-8859-1' (chardet style)

# With modern names
result = charset_normalizer.detect(raw_data, should_rename_legacy=True)
print(result['encoding'])  # Legacy names are replaced by modern equivalents
```

While the basic structure matches chardet's, there are subtle differences:
```python
# Typical chardet result:
{
    'encoding': 'utf-8',
    'confidence': 0.99,
    'language': ''
}

# charset-normalizer detect() result:
{
    'encoding': 'utf_8',    # Python-style codec names by default
    'confidence': 0.98,     # May differ due to improved algorithms
    'language': 'English'   # More comprehensive language detection
}
```

Charset-normalizer also handles the BOM (Byte Order Mark) differently:
```python
# UTF-8 with BOM
utf8_bom_data = b'\xef\xbb\xbfHello World'

# chardet returns: 'UTF-8-SIG'
# charset-normalizer detect() returns: 'utf_8_sig' (when a BOM is detected)
result = charset_normalizer.detect(utf8_bom_data)
print(result['encoding'])  # 'utf_8_sig'
```

Confidence calculation differs between the libraries:
```python
# For comparison with the modern API
modern_result = charset_normalizer.from_bytes(raw_data).best()
legacy_result = charset_normalizer.detect(raw_data)

# Modern confidence (inverse of the chaos ratio)
modern_confidence = 1.0 - modern_result.chaos

# Legacy confidence (direct from detect)
legacy_confidence = legacy_result['confidence']

# Values may differ due to different calculation methods
print(f"Modern: {modern_confidence:.3f}")
print(f"Legacy: {legacy_confidence:.3f}")
```

```python
# Replace the import only
# from chardet import detect
from charset_normalizer import detect

# Keep existing code unchanged
result = detect(raw_bytes)
```
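Because the two libraries can spell the same encoding differently ('utf_8' vs 'utf-8' vs 'UTF-8'), migration code that compares names can canonicalize them with the standard library. This is a stdlib helper sketch, not part of charset-normalizer:

```python
import codecs

def canonical(name: str) -> str:
    """Resolve any spelling variant of an encoding name to Python's canonical codec name."""
    return codecs.lookup(name).name

# All spelling variants of UTF-8 resolve to the same codec
print(canonical("utf_8"), canonical("UTF-8"), canonical("utf-8"))  # utf-8 utf-8 utf-8
```

This makes equality checks between chardet-style and charset-normalizer-style names robust, e.g. `canonical(a) == canonical(b)`.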
```python
import charset_normalizer

def safe_detect(raw_bytes):
    """Enhanced wrapper with better error handling."""
    try:
        result = charset_normalizer.detect(raw_bytes)
        # Guard against confidence being None before comparing
        if result['encoding'] and result['confidence'] and result['confidence'] > 0.7:
            return result
        # Fall back to the modern API for better results
        modern_result = charset_normalizer.from_bytes(raw_bytes).best()
        if modern_result:
            return {
                'encoding': modern_result.encoding,
                'confidence': 1.0 - modern_result.chaos,
                'language': modern_result.language,
            }
    except Exception:
        pass
    return {'encoding': None, 'confidence': None, 'language': ''}
```
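In environments where even the legacy function is unavailable, the contract that safe_detect returns can be mimicked with a crude stdlib-only sketch that simply tries candidate codecs in order. This is illustrative only and far weaker than real detection:

```python
def crude_detect(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate codec that decodes raw without error.

    Note: 'latin-1' decodes any byte string, so with it in the list a result
    is always found; confidence stays None because nothing is measured here.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
            return {"encoding": enc, "confidence": None, "language": ""}
        except UnicodeDecodeError:
            continue
    return {"encoding": None, "confidence": None, "language": ""}

print(crude_detect(b"\xe4\xb8\xad\xe6\x96\x87")["encoding"])  # utf-8
```

Unlike charset-normalizer, this cannot rank plausible decodings; it only reports the first codec that does not raise.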
```python
import charset_normalizer

# Migrate to the modern API for new code
results = charset_normalizer.from_bytes(raw_bytes)
best = results.best()
if best:
    # More detailed information is available
    encoding = best.encoding
    confidence = 1.0 - best.chaos
    language = best.language
    alphabets = best.alphabets
    text = str(best)
```

The legacy detect() function has different performance characteristics:
```python
import time
import charset_normalizer

# Legacy function (single result)
start = time.time()
result = charset_normalizer.detect(large_data)
legacy_time = time.time() - start

# Modern API (multiple candidates)
start = time.time()
results = charset_normalizer.from_bytes(large_data)
best = results.best()
modern_time = time.time() - start

# Legacy is typically faster for simple detection;
# the modern API provides a more comprehensive analysis
```

When migrating from chardet, enable detailed logging:
```python
import logging
import charset_normalizer

# Route the library's warnings and log records through standard logging.
# Note: detect() itself ignores extra keyword arguments such as explain=
# (a compatibility warning is emitted instead of honoring them).
logging.basicConfig(level=logging.DEBUG)
result = charset_normalizer.detect(raw_data)
```

For actual debugging, use the modern API:
```python
# Better debugging with the modern API
results = charset_normalizer.from_bytes(raw_data, explain=True)
# This will show the detailed detection process
```

Install with Tessl CLI:

```shell
npx tessl i tessl/pypi-charset-normalizer
```