
tessl/pypi-charset-normalizer

The Real First Universal Charset Detector: modern, fast, and reliable character-encoding detection, offered as an alternative to chardet.


Core Detection Functions

Primary charset detection methods that analyze raw bytes, file pointers, or file paths to determine character encoding. These functions form the core of charset-normalizer's detection capabilities and support extensive customization through parameters.

Capabilities

Bytes Detection

Detects character encoding from raw bytes or bytearray sequences using advanced heuristic analysis.

def from_bytes(
    sequences: bytes | bytearray,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Detect charset from raw bytes sequence.
    
    Parameters:
    - sequences: Raw bytes or bytearray to analyze
    - steps: Number of analysis steps (default: 5)
    - chunk_size: Size of data chunks for analysis (default: 512)
    - threshold: Mess ratio threshold for encoding rejection (default: 0.2)
    - cp_isolation: List of encodings to test exclusively
    - cp_exclusion: List of encodings to exclude from testing
    - preemptive_behaviour: Enable BOM/signature priority detection (default: True)
    - explain: Enable detailed logging for debugging (default: False)
    - language_threshold: Minimum coherence for language detection (default: 0.1)
    - enable_fallback: Enable fallback to common encodings (default: True)
    
    Returns:
    CharsetMatches: Ordered collection of detection results
    
    Raises:
    TypeError: If sequences is not bytes or bytearray
    """

Usage Example:

import charset_normalizer

# Basic detection
raw_data = b'\xe4\xb8\xad\xe6\x96\x87'  # Chinese text in UTF-8
results = charset_normalizer.from_bytes(raw_data)
best_match = results.best()
if best_match is not None:  # best() returns None when no encoding qualifies
    print(f"Encoding: {best_match.encoding}")  # e.g. utf_8
    print(f"Language: {best_match.language}")  # e.g. Chinese

# Advanced detection with custom parameters
results = charset_normalizer.from_bytes(
    raw_data,
    steps=10,  # More thorough analysis
    threshold=0.1,  # Stricter mess threshold
    cp_isolation=['utf_8', 'gb2312', 'big5'],  # Test only these candidate encodings
    explain=True  # Enable debug logging
)

File Pointer Detection

Detects character encoding from an open file pointer without closing it.

def from_fp(
    fp: BinaryIO,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Detect charset from file pointer.
    
    Parameters:
    - fp: Open binary file pointer
    - Other parameters: Same as from_bytes
    
    Returns:
    CharsetMatches: Ordered collection of detection results
    
    Note: Does not close the file pointer
    """

Usage Example:

import charset_normalizer

with open('document.txt', 'rb') as fp:
    results = charset_normalizer.from_fp(fp)
    best_match = results.best()
    if best_match:
        print(f"File encoding: {best_match.encoding}")
        # File pointer remains open for further operations

File Path Detection

Detects character encoding by opening and reading a file from its path.

def from_path(
    path: str | bytes | PathLike,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Detect charset from file path.
    
    Parameters:
    - path: Path to file (string, bytes, or PathLike object)
    - Other parameters: Same as from_bytes
    
    Returns:
    CharsetMatches: Ordered collection of detection results
    
    Raises:
    IOError: If file cannot be opened or read
    """

Usage Example:

import charset_normalizer
from pathlib import Path

# Using string path
results = charset_normalizer.from_path('data/sample.txt')

# Using Path object
file_path = Path('documents/report.csv')
results = charset_normalizer.from_path(file_path)

# With custom settings for CSV files
results = charset_normalizer.from_path(
    'data.csv',
    cp_isolation=['utf_8', 'iso-8859-1', 'windows-1252'],  # Common for CSV
    threshold=0.15  # Slightly stricter for structured data
)

Binary Detection

Determines whether input data represents binary (non-text) content.

def is_binary(
    fp_or_path_or_payload: PathLike | str | BinaryIO | bytes,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = False,
) -> bool:
    """
    Detect if input is binary (non-text) content.
    
    Parameters:
    - fp_or_path_or_payload: File path, file pointer, or raw bytes
    - Other parameters: Same as from_bytes (enable_fallback defaults to False)
    
    Returns:
    bool: True if content appears to be binary, False if text
    
    Note: Uses stricter criteria than text detection to avoid false positives
    """

Usage Example:

import charset_normalizer

# Check if file is binary
if charset_normalizer.is_binary('image.jpg'):
    print("Binary file detected")
else:
    print("Text file detected")

# Check raw bytes
data = b'\x89PNG\r\n\x1a\n'  # PNG file header
if charset_normalizer.is_binary(data):
    print("Binary data")

# Check with file pointer
with open('document.pdf', 'rb') as fp:
    if charset_normalizer.is_binary(fp):
        print("Binary document")

Parameter Guidelines

Performance Tuning

  • steps: Higher values (7-10) for more accuracy, lower (3-5) for speed
  • chunk_size: Larger chunks (1024-2048) for large files, smaller (256-512) for small files
  • threshold: Lower values (0.1-0.15) for stricter detection, higher (0.25-0.3) for permissive
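The two ends of these tuning ranges can be contrasted in a short sketch. The sample payload below is hypothetical; the parameters match the guidelines above.

```python
import charset_normalizer

# A reasonably long single-byte-encoded sample (hypothetical data)
payload = ("Vous êtes déjà préparés pour la démonstration. " * 40).encode("cp1252")

# Fast pass: fewer steps over larger chunks
fast = charset_normalizer.from_bytes(payload, steps=3, chunk_size=1024)

# Thorough pass: more steps with a stricter mess threshold
thorough = charset_normalizer.from_bytes(payload, steps=10, chunk_size=512, threshold=0.1)

for label, matches in (("fast", fast), ("thorough", thorough)):
    best = matches.best()
    if best is not None:
        print(label, best.encoding)
```

For small payloads the two passes usually agree; the thorough settings mainly pay off on large or noisy inputs.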

Encoding Control

  • cp_isolation: Use when you know the likely encoding family (e.g., ['utf_8', 'utf_16'] for Unicode)
  • cp_exclusion: Exclude problematic encodings that cause false positives
  • preemptive_behaviour: Disable (False) for pure heuristic analysis without BOM priority
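The effect of `preemptive_behaviour` can be seen with a BOM-prefixed payload (an illustrative sample, not from the library's own docs):

```python
import charset_normalizer

# UTF-8 BOM followed by plain text
payload = b"\xef\xbb\xbf" + ("BOM-prefixed sample text. " * 30).encode("utf_8")

# Default: the UTF-8 BOM signature is honoured up front
with_sig = charset_normalizer.from_bytes(payload).best()

# Pure heuristic analysis, ignoring signature priority
no_sig = charset_normalizer.from_bytes(payload, preemptive_behaviour=False).best()

if with_sig is not None:
    print("with signature:", with_sig.encoding)
if no_sig is not None:
    print("without signature:", no_sig.encoding)
```

With the signature honoured, the match is pinned to UTF-8; the heuristic pass may instead report a narrower encoding (such as ASCII) that also decodes the body.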

Language Detection

  • language_threshold: Lower values (0.05) for better language detection, higher (0.2) to reduce false positives
  • enable_fallback: Keep True for safety, set False for stricter binary detection

Install with Tessl CLI

npx tessl i tessl/pypi-charset-normalizer
