# tldextract

Accurately separates a URL's subdomain, domain, and public suffix using the Public Suffix List (PSL). This library provides robust URL parsing that handles complex domain structures including country code TLDs (ccTLDs), generic TLDs (gTLDs), and their exceptions that naive string splitting cannot parse correctly.

## Package Information

- **Package Name**: tldextract
- **Language**: Python
- **Installation**: `pip install tldextract`

## Core Imports

```python
import tldextract
```

For basic usage, all functionality is available through the main module:

```python
from tldextract import extract, TLDExtract, ExtractResult, __version__
```

## Basic Usage

```python
import tldextract

# Basic URL extraction
result = tldextract.extract('http://forums.news.cnn.com/')
print(result)
# ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)

# Access individual components
print(f"Subdomain: {result.subdomain}")  # 'forums.news'
print(f"Domain: {result.domain}")        # 'cnn'
print(f"Suffix: {result.suffix}")        # 'com'

# Reconstruct full domain name
print(result.fqdn)  # 'forums.news.cnn.com'

# Handle complex TLDs
uk_result = tldextract.extract('http://forums.bbc.co.uk/')
print(uk_result)
# ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk', is_private=False)

# Handle edge cases
ip_result = tldextract.extract('http://127.0.0.1:8080/path')
print(ip_result)
# ExtractResult(subdomain='', domain='127.0.0.1', suffix='', is_private=False)
```

## Architecture

The tldextract library uses the authoritative Public Suffix List (PSL) to make parsing decisions:

- **Public Suffix List (PSL)**: Maintained list of all known public suffixes under which domain registration is possible
- **Caching System**: Local caching of PSL data to avoid repeated HTTP requests
- **Fallback Mechanism**: Built-in snapshot for offline operation
- **Private Domains**: Optional support for PSL private domains (like blogspot.com)

The library automatically fetches and caches the latest PSL data on first use, falling back to a bundled snapshot if network access is unavailable.

## Capabilities

### URL Extraction

Core functionality for extracting URL components using the convenience `extract()` function. This provides the most common use case with sensible defaults.

```python { .api }
def extract(
    url: str,
    include_psl_private_domains: bool | None = False,
    session: requests.Session | None = None
) -> ExtractResult
```

[URL Extraction](./url-extraction.md)

### Configurable Extraction

Advanced extraction with custom configuration options including cache settings, custom suffix lists, and private domain handling through the `TLDExtract` class.

```python { .api }
class TLDExtract:
    def __init__(
        self,
        cache_dir: str | None = None,
        suffix_list_urls: Sequence[str] = PUBLIC_SUFFIX_LIST_URLS,
        fallback_to_snapshot: bool = True,
        include_psl_private_domains: bool = False,
        extra_suffixes: Sequence[str] = (),
        cache_fetch_timeout: str | float | None = CACHE_TIMEOUT
    ) -> None

    def __call__(
        self,
        url: str,
        include_psl_private_domains: bool | None = None,
        session: requests.Session | None = None
    ) -> ExtractResult

    def extract_str(
        self,
        url: str,
        include_psl_private_domains: bool | None = None,
        session: requests.Session | None = None
    ) -> ExtractResult

    def extract_urllib(
        self,
        url: urllib.parse.ParseResult | urllib.parse.SplitResult,
        include_psl_private_domains: bool | None = None,
        session: requests.Session | None = None
    ) -> ExtractResult

    def update(
        self,
        fetch_now: bool = False,
        session: requests.Session | None = None
    ) -> None

    def tlds(self, session: requests.Session | None = None) -> list[str]
```

[Configurable Extraction](./configurable-extraction.md)

### Result Processing

Comprehensive result handling with properties for reconstructing domains, handling IP addresses, and accessing metadata about the extraction process.

```python { .api }
@dataclass
class ExtractResult:
    subdomain: str
    domain: str
    suffix: str
    is_private: bool
    registry_suffix: str

    @property
    def fqdn(self) -> str

    @property
    def ipv4(self) -> str

    @property
    def ipv6(self) -> str

    @property
    def registered_domain(self) -> str

    @property
    def reverse_domain_name(self) -> str

    @property
    def top_domain_under_public_suffix(self) -> str

    @property
    def top_domain_under_registry_suffix(self) -> str
```

[Result Processing](./result-processing.md)

### Command Line Interface

Command-line tool for URL parsing with options for output formatting, cache management, and PSL updates.

```bash { .api }
tldextract [options] <url1> [url2] ...
```
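
For instance, a typical session might look like this (output shape as printed by the current CLI; treat exact flags as version-dependent):

```console
$ tldextract http://forums.bbc.co.uk
forums bbc co.uk

$ tldextract --update   # re-fetch the Public Suffix List into the cache
```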

[Command Line Interface](./cli.md)

### PSL Data Management

Functions for updating and managing Public Suffix List data globally.

```python { .api }
def update(fetch_now: bool = False, session: requests.Session | None = None) -> None
```

[URL Extraction](./url-extraction.md)

## Types

```python { .api }
from typing import Sequence
from dataclasses import dataclass
import requests
import urllib.parse

# Module attributes
__version__: str

# Constants
PUBLIC_SUFFIX_LIST_URLS: tuple[str, ...]
CACHE_TIMEOUT: str | None

# Functions - detailed in their respective sections

# Classes - detailed in their respective sections
class TLDExtract: ...     # see Configurable Extraction
@dataclass
class ExtractResult: ...  # see Result Processing
```