tessl/pypi-webencodings

Character encoding aliases for legacy web content implementing the WHATWG Encoding standard

Core Objects

Name: tessl/pypi-webencodings
Author: tessl

The Encoding class and the lookup function form the foundation of the webencodings package. They provide the core abstractions for working with character encodings according to the WHATWG Encoding standard.

Capabilities

Encoding Lookup

Look up character encodings by their labels using WHATWG-standard label matching rules. Handles encoding aliases and normalization according to the specification.

def lookup(label: str) -> Encoding | None:
    """
    Look for an encoding by its label following WHATWG Encoding standard.
    
    Args:
        label: An encoding label string. Matching is ASCII case-insensitive and
            ignores leading and trailing ASCII whitespace.
        
    Returns:
        An Encoding object for the canonical encoding, or None if unknown
        
    Examples:
        - lookup('utf-8') -> UTF-8 Encoding
        - lookup('latin1') -> windows-1252 Encoding  
        - lookup('unknown') -> None
    """

The lookup function strips leading and trailing ASCII whitespace (tabs, newlines, form feeds, carriage returns, and spaces) from the label, lowercases its ASCII letters, and then matches the result against the standard label mappings.
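The normalization step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's source: the names normalize_label and ASCII_WHITESPACE are hypothetical, and the sketch covers only normalization, not the label table that the result is matched against.

```python
# Hypothetical sketch of the label normalization step; names are illustrative.
ASCII_WHITESPACE = "\t\n\f\r "  # U+0009, U+000A, U+000C, U+000D, U+0020


def normalize_label(label: str) -> str:
    """Strip leading/trailing ASCII whitespace, then lowercase A-Z only."""
    stripped = label.strip(ASCII_WHITESPACE)
    # str.lower() would also fold some non-ASCII characters (for example the
    # Kelvin sign U+212A becomes "k"), so only the ASCII range is mapped here.
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in stripped
    )


print(normalize_label("  UTF-8\n"))  # utf-8
# U+00A0 (no-break space) is not ASCII whitespace, so it is not stripped.
print(normalize_label("\u00a0utf-8") == "utf-8")  # False
```

Restricting both steps to ASCII keeps the matching independent of locale and of Unicode case-folding rules, which is the behaviour the standard describes for labels.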

Encoding Class

Represents a character encoding with both a canonical name and the underlying Python codec implementation.

class Encoding:
    """
    Represents a character encoding such as UTF-8.
    
    Attributes:
        name: Canonical name of the encoding according to WHATWG standard
        codec_info: Python CodecInfo object providing the actual implementation
    """
    
    def __init__(self, name: str, codec_info: codecs.CodecInfo) -> None: ...

The Encoding class is a thin wrapper around Python's codec system: name carries the canonical name from the WHATWG standard, while codec_info supplies Python's existing encoder and decoder implementations. Code can therefore report standard-compliant names while still using Python's built-in codecs for the actual conversion.
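To illustrate how a canonical name and a codec can be paired, here is a minimal standard-library sketch. EncodingSketch is a hypothetical stand-in written for this example, not the package's Encoding class, and the choice of Python's cp1252 codec for the name windows-1252 is made by hand here rather than by the package.

```python
import codecs


class EncodingSketch:
    """Hypothetical stand-in that pairs a canonical name with a CodecInfo."""

    def __init__(self, name: str, codec_info: codecs.CodecInfo) -> None:
        self.name = name
        self.codec_info = codec_info

    def __repr__(self) -> str:
        return f"<EncodingSketch {self.name}>"


# Python registers the windows-1252 code page under the codec name "cp1252".
win1252 = EncodingSketch("windows-1252", codecs.lookup("cp1252"))

# The stateless decode function returns a (text, bytes_consumed) tuple.
text, consumed = win1252.codec_info.decode(b"\x80")
print(win1252.name)             # windows-1252
print(win1252.codec_info.name)  # cp1252
print(text, consumed)           # € 1
```

The two names differ on purpose: the wrapper's name is what a web-facing API would report, while codec_info.name is only Python's internal identifier.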

Usage Examples

import webencodings

# Look up encodings by various labels
utf8 = webencodings.lookup('utf-8')
print(utf8.name)  # 'utf-8'

# Handle aliases - latin1 maps to windows-1252 per WHATWG spec
latin1 = webencodings.lookup('latin1')  
print(latin1.name)  # 'windows-1252'

# Case insensitive and whitespace handling
encoding = webencodings.lookup('  UTF-8  ')
print(encoding.name)  # 'utf-8'

# Unknown labels return None
unknown = webencodings.lookup('made-up-encoding')
print(unknown)  # None

# Access the underlying Python codec; decode() returns a (text, bytes_consumed) tuple
utf8 = webencodings.lookup('utf-8')
decoded_text = utf8.codec_info.decode(b'Hello')[0]
print(decoded_text)  # 'Hello'
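Because lookup returns None for an unknown label, accessing .name or .codec_info on the result without a check raises AttributeError. The sketch below shows one way to guard against that. decode_with_label and fallback_label are hypothetical names, and the lookup function is passed in as a parameter (for example webencodings.lookup) so the block itself depends only on the standard library.

```python
from typing import Any, Callable, Optional


def decode_with_label(
    data: bytes,
    label: str,
    lookup: Callable[[str], Optional[Any]],
    fallback_label: str = "utf-8",
) -> str:
    """Decode data using label, falling back when the label is unknown.

    `lookup` is assumed to behave like webencodings.lookup: it returns an
    object with a codec_info attribute, or None for an unknown label.
    """
    encoding = lookup(label)
    if encoding is None:
        # Unknown label: use the fallback instead of failing on None.
        encoding = lookup(fallback_label)
    if encoding is None:
        raise LookupError(
            f"unknown encoding labels: {label!r}, {fallback_label!r}"
        )
    # The stateless decode returns (text, bytes_consumed); "replace"
    # substitutes U+FFFD for malformed input instead of raising.
    text, _consumed = encoding.codec_info.decode(data, "replace")
    return text
```

With the package installed, a call such as decode_with_label(raw_bytes, 'made-up-encoding', webencodings.lookup) would take the fallback path and decode as UTF-8. Whether a silent fallback or an error is the better choice depends on the application.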
