tessl/pypi-w3lib

Library of web-related functions for HTML manipulation, HTTP processing, URL handling, and encoding detection

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/w3lib@2.3.x

To install, run

npx @tessl/cli install tessl/pypi-w3lib@2.3.0


w3lib

A comprehensive Python library providing essential web-related utility functions for HTML manipulation, HTTP header processing, URL handling, and character encoding detection. Originally developed as a foundational component of the Scrapy web scraping framework, w3lib offers production-tested utilities for web crawlers, data extraction tools, and content processing pipelines.

Package Information

  • Package Name: w3lib
  • Language: Python
  • Installation: pip install w3lib
  • Version: 2.3.1
  • License: BSD
  • Documentation: https://w3lib.readthedocs.io/en/latest/
  • Repository: https://github.com/scrapy/w3lib

Core Imports

import w3lib

Module-specific imports:

from w3lib.html import replace_entities, remove_tags, get_base_url
from w3lib.http import basic_auth_header, headers_raw_to_dict  
from w3lib.url import safe_url_string, url_query_parameter, canonicalize_url
from w3lib.encoding import html_to_unicode, resolve_encoding
from w3lib.util import to_unicode, to_bytes

Basic Usage

from w3lib.html import replace_entities, remove_tags, get_base_url
from w3lib.url import safe_url_string, url_query_parameter
from w3lib.http import basic_auth_header
from w3lib.encoding import html_to_unicode

# HTML processing - clean up HTML content
html = '<p>Price: &pound;100 <b>only!</b></p>'
clean_text = replace_entities(html)  # '<p>Price: £100 <b>only!</b></p>'
text_only = remove_tags(clean_text)  # 'Price: £100 only!'

# URL handling - make URLs safe and extract parameters
unsafe_url = 'http://example.com/search?q=hello world&price=£100'
safe_url = safe_url_string(unsafe_url)  # Properly encoded URL
query_param = url_query_parameter(safe_url, 'q')  # 'hello world'

# HTTP utilities - create authentication headers
auth_header = basic_auth_header('user', 'password')  # b'Basic dXNlcjpwYXNzd29yZA=='

# Encoding detection - convert HTML to Unicode  
raw_html = b'<html><meta charset="utf-8"><body>Caf\xc3\xa9</body></html>'
encoding, unicode_html = html_to_unicode(None, raw_html)  # ('utf-8', '<html>...')

Architecture

w3lib is organized into focused modules, each handling specific web processing tasks:

  • HTML Module: Entity translation, tag manipulation, base URL extraction, meta refresh parsing
  • HTTP Module: Header format conversion, authentication header generation
  • URL Module: URL sanitization, parameter manipulation, encoding normalization, data URI parsing
  • Encoding Module: Character encoding detection from HTTP headers, HTML meta tags, and BOMs
  • Utilities Module: Core string/bytes conversion functions used throughout the library

This modular design allows developers to import only the functionality they need while maintaining consistent interfaces and error handling across all components.

Capabilities

HTML Processing

Comprehensive HTML manipulation including entity conversion, tag removal, comment stripping, base URL extraction, and meta refresh parsing. Handles both string and bytes input with robust encoding support.

def replace_entities(text, keep=(), remove_illegal=True, encoding='utf-8'): ...
def remove_tags(text, which_ones=(), keep=(), encoding=None): ...  
def remove_comments(text, encoding=None): ...
def get_base_url(text, baseurl='', encoding='utf-8'): ...
def get_meta_refresh(text, baseurl='', encoding='utf-8', ignore_tags=('script', 'noscript')): ...
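
A short sketch of how these helpers combine on a single document; the input is invented and the commented results are illustrative:

from w3lib.html import remove_comments, remove_tags, get_base_url, get_meta_refresh

page = ('<html><head><base href="http://example.com/static/">'
        '<meta http-equiv="refresh" content="5; url=http://example.com/next">'
        '</head><body><!-- ad banner --><p>Hello <b>world</b></p></body></html>')

remove_comments(page)                 # drops '<!-- ad banner -->'
remove_tags(page, which_ones=('b',))  # strips only <b> tags, keeps everything else
get_base_url(page, baseurl='http://example.com/page')      # 'http://example.com/static/'
get_meta_refresh(page, baseurl='http://example.com/page')  # (5.0, 'http://example.com/next')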


HTTP Utilities

HTTP header processing utilities for converting between raw header formats and dictionaries, plus HTTP Basic Authentication header generation.

def headers_raw_to_dict(headers_raw): ...
def headers_dict_to_raw(headers_dict): ...
def basic_auth_header(username, password, encoding='ISO-8859-1'): ...
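
A brief sketch of the raw/dict round trip; the header values are invented and the commented outputs are illustrative:

from w3lib.http import headers_raw_to_dict, headers_dict_to_raw, basic_auth_header

raw = b'Content-Type: text/html\r\nSet-Cookie: a=1\r\nSet-Cookie: b=2'
headers = headers_raw_to_dict(raw)
# {b'Content-Type': [b'text/html'], b'Set-Cookie': [b'a=1', b'b=2']}

headers_dict_to_raw(headers)
# b'Content-Type: text/html\r\nSet-Cookie: a=1\r\nSet-Cookie: b=2'

basic_auth_header('user', 'password')
# b'Basic dXNlcjpwYXNzd29yZA=='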


URL Handling

Comprehensive URL processing including browser-compatible URL sanitization, query parameter manipulation, data URI parsing, and canonicalization with support for various URL standards.

def safe_url_string(url, encoding='utf8', path_encoding='utf8', quote_path=True): ...
def url_query_parameter(url, parameter, default=None, keep_blank_values=0): ...
def url_query_cleaner(url, parameterlist=(), sep='&', kvsep='=', remove=False, unique=True, keep_fragments=False): ...
def canonicalize_url(url, keep_blank_values=True, keep_fragments=False, encoding=None): ...
def parse_data_uri(uri): ...
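
A small sketch of query cleaning, canonicalization, and data URI parsing; the commented results are illustrative:

from w3lib.url import url_query_cleaner, canonicalize_url, parse_data_uri

url = 'http://www.example.com/do?b=2&a=1&a=1&c=3'

url_query_cleaner(url, parameterlist=('a', 'b'))
# 'http://www.example.com/do?b=2&a=1' - keeps only the listed parameters, drops duplicates

canonicalize_url(url)
# 'http://www.example.com/do?a=1&a=1&b=2&c=3' - query arguments sorted for comparison

parse_data_uri('data:text/plain;charset=US-ASCII,Hello')
# ParseDataURIResult(media_type='text/plain',
#                    media_type_parameters={'charset': 'US-ASCII'}, data=b'Hello')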


Encoding Detection

Character encoding detection from HTTP Content-Type headers, HTML meta tags, XML declarations, and byte order marks, with smart fallback handling and encoding alias resolution.

def html_to_unicode(content_type_header, html_body_str, default_encoding='utf8', auto_detect_fun=None): ...
def http_content_type_encoding(content_type): ...
def html_body_declared_encoding(html_body_str): ...
def resolve_encoding(encoding_alias): ...
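
An illustrative sketch of the detection helpers; the commented return values assume w3lib's usual alias resolution (for example, latin-1 reported as cp1252) and may vary by version:

from w3lib.encoding import (
    html_to_unicode, http_content_type_encoding,
    html_body_declared_encoding, resolve_encoding,
)

http_content_type_encoding('text/html; charset=ISO-8859-1')      # 'cp1252'
html_body_declared_encoding(b'<meta charset="utf-8"><p>hi</p>')  # 'utf-8'
resolve_encoding('latin1')                                       # 'cp1252'

body = b'<html><body>\xa3 100</body></html>'
html_to_unicode('text/html; charset=ISO-8859-1', body)
# ('cp1252', '<html><body>£ 100</body></html>')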


Utilities

Core utility functions for converting between string and bytes representations with robust encoding support and error handling.

def to_unicode(text, encoding=None, errors='strict'): ...
def to_bytes(text, encoding=None, errors='strict'): ...
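
A minimal sketch of the conversion helpers; the commented results are illustrative:

from w3lib.util import to_unicode, to_bytes

to_unicode(b'caf\xc3\xa9')      # 'café' - bytes decoded, utf-8 by default
to_unicode('already text')      # returned unchanged
to_bytes('café')                # b'caf\xc3\xa9' - str encoded, utf-8 by default
to_bytes(b'already bytes')      # returned unchanged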


Common Types

from collections.abc import Mapping, MutableMapping, Sequence
from typing import Any, NamedTuple, Union

# Type aliases used across the library
StrOrBytes = Union[str, bytes]

# HTTP header types
HeadersDictInput = Mapping[bytes, Union[Any, Sequence[bytes]]]
HeadersDictOutput = MutableMapping[bytes, list[bytes]]

# Data URI parsing result
class ParseDataURIResult(NamedTuple):
    media_type: str
    media_type_parameters: dict[str, str]
    data: bytes

Error Handling

w3lib functions follow consistent error handling patterns (illustrated in the sketch after this list):

  • Invalid input types raise TypeError
  • Encoding errors are handled gracefully with replacement characters (\ufffd)
  • URL parsing errors may raise ValueError for malformed input
  • Most functions return safe defaults (empty strings, None) rather than raising exceptions
  • Functions accept both string and bytes input to minimize conversion overhead
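
As a brief illustration of the replacement-character and safe-default behaviors above (outputs in comments are illustrative):

from w3lib.util import to_unicode
from w3lib.url import url_query_parameter

# Malformed bytes decode to replacement characters instead of raising
to_unicode(b'caf\xe9 au lait', encoding='utf-8', errors='replace')
# 'caf\ufffd au lait'

# A missing query parameter returns the safe default instead of raising
url_query_parameter('http://example.com/?a=1', 'missing')              # None
url_query_parameter('http://example.com/?a=1', 'missing', default='')  # ''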

Performance Considerations

  • Compiled regular expressions are cached and reused across function calls
  • Functions are optimized for web scraping workloads with large volumes of content
  • Memory-efficient processing of HTML content avoids unnecessary string duplication
  • Support for both string and bytes inputs reduces encoding/decoding overhead
  • Character encoding detection uses fast heuristics before falling back to comprehensive analysis