tessl/pypi-tldextract

Accurately separates a URL's subdomain, domain, and public suffix using the Public Suffix List


URL Extraction

Core functionality for extracting URL components via the convenience extract() function, which covers the most common use cases with sensible defaults and handles the majority of URL parsing scenarios.

Capabilities

Basic Extraction

The primary extraction function that separates any URL-like string into its subdomain, domain, and public suffix components.

def extract(
    url: str,
    include_psl_private_domains: bool | None = False,
    session: requests.Session | None = None
) -> ExtractResult:
    """
    Extract subdomain, domain, and suffix from a URL string.
    
    Parameters:
    - url: URL string to parse (can include protocol, port, path)
    - include_psl_private_domains: Include PSL private domains like 'blogspot.com'
    - session: Optional requests.Session for HTTP customization
    
    Returns:
    ExtractResult with parsed components and metadata
    """

Usage Examples:

import tldextract

# Standard domains
result = tldextract.extract('http://www.google.com')
print(result)
# ExtractResult(subdomain='www', domain='google', suffix='com', is_private=False)

# Complex country code TLDs
result = tldextract.extract('http://forums.bbc.co.uk/')
print(result)
# ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk', is_private=False)

# Subdomains with multiple levels
result = tldextract.extract('http://forums.news.cnn.com/')
print(result)
# ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)

# International domains
result = tldextract.extract('http://www.worldbank.org.kg/')
print(result)
# ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg', is_private=False)

Private Domain Handling

Control how PSL private domains are handled during extraction. Private domains are organizational domains like 'blogspot.com' that allow third parties to register their own subdomains.

# Default behavior - treat private domains as regular domains
result = tldextract.extract('waiterrant.blogspot.com')
print(result)
# ExtractResult(subdomain='waiterrant', domain='blogspot', suffix='com', is_private=False)

# Include private domains in suffix
result = tldextract.extract('waiterrant.blogspot.com', include_psl_private_domains=True)
print(result)
# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)

Edge Case Handling

The library gracefully handles various edge cases including IP addresses, invalid suffixes, and malformed URLs.

# IP addresses
result = tldextract.extract('http://127.0.0.1:8080/deployed/')
print(result)
# ExtractResult(subdomain='', domain='127.0.0.1', suffix='', is_private=False)

# IPv6 addresses
result = tldextract.extract('http://[2001:db8::1]/path')
print(result.domain)  # '[2001:db8::1]'

# No subdomain
result = tldextract.extract('google.com')
print(result)
# ExtractResult(subdomain='', domain='google', suffix='com', is_private=False)

# Invalid suffixes
result = tldextract.extract('google.notavalidsuffix')
print(result)
# ExtractResult(subdomain='google', domain='notavalidsuffix', suffix='', is_private=False)

Session Customization

Provide a custom HTTP session for PSL fetching to support proxies, authentication, or other HTTP customization.

import requests
import tldextract

# Create custom session with proxy
session = requests.Session()
session.proxies = {'http': 'http://proxy.example.com:8080'}

# Use custom session for PSL fetching
result = tldextract.extract('http://example.com', session=session)

Update Functionality

Force update of the cached Public Suffix List data to get the latest TLD definitions.

def update(fetch_now: bool = False, session: requests.Session | None = None) -> None:
    """
    Force update of cached PSL data.
    
    Parameters:
    - fetch_now: Whether to fetch immediately rather than on next extraction
    - session: Optional requests.Session for HTTP customization
    """

Usage Example:

import tldextract

# Force update of PSL data
tldextract.update(fetch_now=True)

# Use after update
result = tldextract.extract('http://example.new-tld')

Return Value

All extraction functions return an ExtractResult object with the following structure:

@dataclass
class ExtractResult:
    subdomain: str  # All subdomains, empty string if none
    domain: str     # Main domain name
    suffix: str     # Public suffix (TLD), empty string if none/invalid
    is_private: bool  # Whether suffix is from PSL private domains
    registry_suffix: str  # Suffix per the registry section of the PSL, ignoring private domains

The ExtractResult provides additional properties and methods for working with the parsed components - see Result Processing for complete details.

Error Handling

The extraction functions are designed to never raise exceptions for malformed input. Invalid or unparseable URLs will return sensible fallback values:

  • Unrecognized suffixes leave the suffix empty; the last label becomes the domain and any remaining labels become the subdomain (see the 'google.notavalidsuffix' example above)
  • IP addresses are detected and returned as the domain with empty suffix
  • Network errors during PSL fetching fall back to the bundled snapshot
  • Malformed PSL data is handled gracefully with logging warnings

Install with Tessl CLI

npx tessl i tessl/pypi-tldextract
