tessl/pypi-webencodings

Character encoding aliases for legacy web content implementing the WHATWG Encoding standard

String Processing

Name: tessl/pypi-webencodings
Author: tessl

Simple encoding and decoding functions for processing individual strings. They cover the most common use case: converting one complete byte string to or from text, with BOM detection on decode and encoding labels resolved according to the WHATWG Encoding standard.

Capabilities

Single String Decoding

Decode a byte string to Unicode. A byte order mark (BOM) at the start of the input takes precedence over the fallback encoding.

def decode(input: bytes, fallback_encoding: Encoding | str, errors: str = 'replace') -> tuple[str, Encoding]:
    """
    Decode a single byte string with BOM detection.
    
    Args:
        input: Byte string to decode
        fallback_encoding: Encoding object or label string to use if no BOM detected
        errors: Error handling strategy ('replace', 'strict', 'ignore', etc.)
        
    Returns:
        Tuple of (decoded_unicode_string, encoding_used)
        
    Raises:
        LookupError: If fallback_encoding label is unknown
    """

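The function resolves fallback_encoding first, so an unknown label raises LookupError even when the input starts with a BOM. It then checks the input for a UTF-8, UTF-16LE, or UTF-16BE BOM. If one is found, the BOM is stripped and the corresponding encoding is used; otherwise the input is decoded with the fallback encoding.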
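The precedence rule can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the library's implementation: `sniff_bom` and `decode_with_bom` are hypothetical names, the fallback is a Python codec name rather than a WHATWG label, and the function returns the codec name instead of an Encoding object.

```python
import codecs

# BOM prefixes paired with the Python codec to use once the BOM is stripped.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data: bytes):
    """Return (codec name or None, data with any leading BOM removed)."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name, data[len(bom):]
    return None, data


def decode_with_bom(data: bytes, fallback_codec: str, errors: str = "replace"):
    """Decode data, letting a BOM override fallback_codec (hypothetical helper)."""
    codecs.lookup(fallback_codec)  # raises LookupError for an unknown codec name
    bom_codec, payload = sniff_bom(data)
    codec_name = bom_codec or fallback_codec
    return payload.decode(codec_name, errors), codec_name
```

Calling `decode_with_bom(b'\xef\xbb\xbfHello', 'cp1252')` in this sketch yields the text `'Hello'` decoded as UTF-8, because the BOM wins over the fallback.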

Single String Encoding

Encode a Unicode string to bytes using the specified encoding.

def encode(input: str, encoding: Encoding | str = UTF8, errors: str = 'strict') -> bytes:
    """
    Encode a Unicode string to bytes.
    
    Args:
        input: Unicode string to encode
        encoding: Encoding object or label string (defaults to UTF-8)
        errors: Error handling strategy ('strict', 'replace', 'ignore', etc.)
        
    Returns:
        Encoded byte string
        
    Raises:
        LookupError: If encoding label is unknown
    """

Usage Examples

import webencodings

# Decode with BOM detection
utf8_bom_data = b'\xef\xbb\xbfHello World'
text, encoding = webencodings.decode(utf8_bom_data, 'iso-8859-1')
print(text)  # 'Hello World'
print(encoding.name)  # 'utf-8' (BOM detected, fallback ignored)

# Decode without BOM uses fallback
latin_data = b'caf\xe9'  # 'café' in latin-1
text, encoding = webencodings.decode(latin_data, 'iso-8859-1')
print(text)  # 'café'
print(encoding.name)  # 'windows-1252' (iso-8859-1 maps to windows-1252)

# Handle UTF-16 BOM
utf16_data = b'\xff\xfeH\x00e\x00l\x00l\x00o\x00'  # UTF-16LE BOM + 'Hello'
text, encoding = webencodings.decode(utf16_data, 'utf-8')
print(text)  # 'Hello'
print(encoding.name)  # 'utf-16le'

# Encoding strings
text = "Hello World"
data = webencodings.encode(text, 'utf-8')
print(data)  # b'Hello World'

# Use predefined UTF8 constant
data = webencodings.encode(text, webencodings.UTF8)
print(data)  # b'Hello World'

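# Handle encoding errors
# Note: 'ascii' is a WHATWG label for windows-1252, so 'é' encodes without error;
# a character outside windows-1252 is needed to trigger the error handler.
text = "café ☃"
data = webencodings.encode(text, 'ascii', errors='replace')
print(data)  # b'caf\xe9 ?'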

# Encode with different encodings
text = "café"
utf8_data = webencodings.encode(text, 'utf-8')
latin1_data = webencodings.encode(text, 'latin1')  # 'latin1' is the WHATWG label; 'latin-1' is not one
print(utf8_data)  # b'caf\xc3\xa9'
print(latin1_data)  # b'caf\xe9' (encoded as windows-1252)
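Encoding labels follow the WHATWG Encoding standard, so several legacy names resolve to the same encoding. The sketch below illustrates the idea with a small hand-written table. `LABELS` and `encode_with_label` are hypothetical and cover only a handful of labels; the real library ships the full label table and returns Encoding objects.

```python
# A few WHATWG labels mapped to (canonical name, Python codec). Illustrative subset only.
LABELS = {
    "utf-8": ("utf-8", "utf-8"),
    "utf8": ("utf-8", "utf-8"),
    "windows-1252": ("windows-1252", "cp1252"),
    "iso-8859-1": ("windows-1252", "cp1252"),
    "latin1": ("windows-1252", "cp1252"),
    "ascii": ("windows-1252", "cp1252"),
}


def encode_with_label(text: str, label: str = "utf-8", errors: str = "strict") -> bytes:
    """Encode text using a WHATWG-style label (hypothetical helper)."""
    # Labels are matched after stripping ASCII whitespace and lowercasing.
    key = label.strip("\t\n\f\r ").lower()
    try:
        _name, codec_name = LABELS[key]
    except KeyError:
        raise LookupError(f"Unknown encoding label: {label!r}") from None
    return text.encode(codec_name, errors)
```

In this sketch, 'ascii' and 'latin1' both encode 'é' as the single byte 0xE9, because both labels resolve to windows-1252.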
