
tessl/pypi-tldextract

Accurately separates a URL's subdomain, domain, and public suffix using the Public Suffix List

Command Line Interface

Command-line tool for URL parsing with options for output formatting, cache management, PSL updates, and batch processing. The CLI provides access to tldextract functionality from shell scripts and command-line workflows.

Capabilities

Basic Command Structure

The CLI accepts URLs as positional arguments and provides various options for customizing behavior and output.

tldextract [options] <url1> [url2] ...

Options:
  --version                    Show version information
  -j, --json                   Output in JSON format
  -u, --update                 Force fetch latest TLD definitions
  --suffix_list_url URL        Use alternate PSL URL/file (can specify multiple)
  -c DIR, --cache_dir DIR      Use alternate cache directory
  -p, --include_psl_private_domains, --private_domains
                              Include PSL private domains
  --no_fallback_to_snapshot   Don't fall back to bundled PSL snapshot

Basic Usage

Extract URL components with default space-separated output:

# Single URL
tldextract 'http://forums.bbc.co.uk'
# Output: forums bbc co.uk

# Multiple URLs
tldextract 'google.com' 'http://forums.news.cnn.com/' 'https://www.example.co.uk'
# Output:
#  google com
# forums.news cnn com
# www example co.uk

# Complex domains
tldextract 'http://www.worldbank.org.kg/'
# Output: www worldbank org.kg

JSON Output

Get structured JSON output for programmatic processing:

# Single URL with JSON output
tldextract --json 'http://forums.bbc.co.uk'
# Output: {"subdomain": "forums", "domain": "bbc", "suffix": "co.uk", "is_private": false, "registry_suffix": "co.uk", "fqdn": "forums.bbc.co.uk", "ipv4": "", "ipv6": "", "registered_domain": "bbc.co.uk", "reverse_domain_name": "co.uk.bbc.forums", "top_domain_under_public_suffix": "bbc.co.uk", "top_domain_under_registry_suffix": "bbc.co.uk"}

# Multiple URLs with JSON output
tldextract --json 'google.com' 'http://127.0.0.1:8080'
# Output:
# {"subdomain": "", "domain": "google", "suffix": "com", "is_private": false, "registry_suffix": "com", "fqdn": "google.com", "ipv4": "", "ipv6": "", "registered_domain": "google.com", "reverse_domain_name": "com.google", "top_domain_under_public_suffix": "google.com", "top_domain_under_registry_suffix": "google.com"}
# {"subdomain": "", "domain": "127.0.0.1", "suffix": "", "is_private": false, "registry_suffix": "", "fqdn": "", "ipv4": "127.0.0.1", "ipv6": "", "registered_domain": "", "reverse_domain_name": "127.0.0.1", "top_domain_under_public_suffix": "", "top_domain_under_registry_suffix": ""}

Private Domain Handling

Control how PSL private domains are processed:

# Default behavior - private domains as regular domains
tldextract 'waiterrant.blogspot.com'
# Output: waiterrant blogspot com

# Include private domains in suffix
tldextract --include_psl_private_domains 'waiterrant.blogspot.com'
# Output:  waiterrant blogspot.com

# Short form of the option
tldextract -p 'waiterrant.blogspot.com'
# Output:  waiterrant blogspot.com

PSL Data Management

Update and manage Public Suffix List data:

# Force update PSL data from remote sources
tldextract --update

# Update and then process URLs
tldextract --update 'http://example.new-tld'

# Show the installed tldextract version
tldextract --version

Custom PSL Sources

Use alternative or local PSL data sources:

# Use custom remote PSL source
tldextract --suffix_list_url 'http://custom.psl.mirror.com/list.dat' 'example.com'

# Use local PSL file
tldextract --suffix_list_url 'file:///path/to/custom/suffix_list.dat' 'example.com'

# Use multiple PSL sources (tried in order)
tldextract --suffix_list_url 'http://primary.psl.com/list.dat' --suffix_list_url 'http://backup.psl.com/list.dat' 'example.com'

# Disable fallback to bundled snapshot
tldextract --suffix_list_url 'http://custom.psl.com/list.dat' --no_fallback_to_snapshot 'example.com'

Cache Management

Control PSL data caching behavior:

# Use custom cache directory
tldextract --cache_dir '/path/to/custom/cache' 'example.com'

# Use environment variable for cache location
export TLDEXTRACT_CACHE="/path/to/cache"
tldextract 'example.com'

# Use environment variable for cache timeout 
export TLDEXTRACT_CACHE_TIMEOUT="10.0"
tldextract 'example.com'

Integration Examples

Shell Scripting

Extract specific components for shell scripts:

#!/bin/bash
# Extract just the domain name
URL="http://forums.news.cnn.com/"
DOMAIN=$(tldextract "$URL" | awk '{print $2}')
echo "Domain: $DOMAIN"  # Output: Domain: cnn

# Extract all components (note: an empty subdomain shifts the fields left,
# because read skips leading whitespace)
read SUBDOMAIN DOMAIN SUFFIX <<< "$(tldextract "$URL")"
echo "Subdomain: $SUBDOMAIN"
echo "Domain: $DOMAIN" 
echo "Suffix: $SUFFIX"

Batch Processing

Process multiple URLs from files or pipes:

# Process URLs from file
cat urls.txt | xargs tldextract

# Process with JSON output for further processing
cat urls.txt | xargs tldextract --json | jq -r '.domain' | sort | uniq

# Count domains in proxy-style access logs (field 7 holds the full request URL)
grep "GET" access.log | awk '{print $7}' | xargs tldextract | awk '{print $2}' | sort | uniq -c

Combined with Other Tools

Use with standard Unix tools for data processing:

# Count domains by TLD
tldextract --json 'site1.com' 'site2.org' 'site3.com' | jq -r '.suffix' | sort | uniq -c

# Extract and validate domains
echo "http://example.com" | xargs tldextract --json | jq -r 'select(.suffix != "") | .top_domain_under_public_suffix'

# Check for private domains
tldextract --json --include_psl_private_domains 'waiterrant.blogspot.com' | jq '.is_private'

Error Handling

The CLI handles various error conditions gracefully:

Invalid URLs

# Invalid URLs are processed without errors
tldextract 'not-a-url' 'google.notavalidsuffix'
# Output:
# not-a-url  
# google notavalidsuffix

Network Errors

# Network errors during PSL update are logged but don't prevent operation
tldextract --update --suffix_list_url 'http://nonexistent.example.com/list.dat' 'example.com'
# Will fall back to cached data or bundled snapshot

Missing Arguments

# No URLs provided shows usage
tldextract
# Output: usage: tldextract [-h] [--version] [-j] [-u] [--suffix_list_url SUFFIX_LIST_URL] [-c CACHE_DIR] [-p] [--no_fallback_to_snapshot] [fqdn|url ...]

# Help is available
tldextract --help

Output Format Details

Standard Output Format

Space-separated: subdomain domain suffix

  • Empty fields are emitted as empty strings, so a line may begin or end with a space
  • IPv4/IPv6 addresses appear in the domain field with empty subdomain and suffix
  • Unrecognized suffixes result in an empty suffix field
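Because empty fields collapse to bare spaces in the default output, whitespace-splitting tools can miscount fields. A minimal illustration (the sample line mimics the output for an IP address, where the subdomain and suffix are empty):

```shell
# ' 127.0.0.1 ' is the kind of line tldextract prints for an IP: empty
# subdomain, the IP in the domain field, empty suffix. awk collapses the
# blanks, so the "domain" lands in field 1, not field 2:
echo ' 127.0.0.1 ' | awk '{print $1}'
# Output: 127.0.0.1
```

When any field may be empty, `--json` output is the more robust choice for scripting.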

JSON Output Format

Complete ExtractResult data including all properties:

{
  "subdomain": "forums",
  "domain": "bbc", 
  "suffix": "co.uk",
  "is_private": false,
  "registry_suffix": "co.uk",
  "fqdn": "forums.bbc.co.uk",
  "ipv4": "",
  "ipv6": "",
  "registered_domain": "bbc.co.uk",
  "reverse_domain_name": "co.uk.bbc.forums",
  "top_domain_under_public_suffix": "bbc.co.uk",
  "top_domain_under_registry_suffix": "bbc.co.uk"
}
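Each URL produces one JSON object per line, so the output can also be consumed without jq, for example with Python's standard json module. A sketch, using a sample line in place of a live `tldextract --json <url>` call:

```shell
# One JSON object per line; this sample stands in for `tldextract --json <url>`.
echo '{"subdomain": "forums", "domain": "bbc", "suffix": "co.uk", "registered_domain": "bbc.co.uk"}' \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["registered_domain"])'
# Output: bbc.co.uk
```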

Environment Variables

The CLI respects the following environment variables:

  • TLDEXTRACT_CACHE: Cache directory path (overrides default)
  • TLDEXTRACT_CACHE_TIMEOUT: HTTP timeout for PSL fetching (seconds)
# Set cache location
export TLDEXTRACT_CACHE="/tmp/tldextract_cache"

# Set timeout
export TLDEXTRACT_CACHE_TIMEOUT="5.0"

# Use with settings
tldextract 'example.com'
