# URL Extraction

Core functionality for extracting URL components using the convenience `extract()` function. It applies sensible defaults and covers the most common URL parsing scenarios.

## Capabilities

### Basic Extraction

The primary extraction function separates any URL-like string into its subdomain, domain, and public suffix components.

```python { .api }
def extract(
    url: str,
    include_psl_private_domains: bool | None = False,
    session: requests.Session | None = None
) -> ExtractResult:
    """
    Extract subdomain, domain, and suffix from a URL string.

    Parameters:
    - url: URL string to parse (may include protocol, port, and path)
    - include_psl_private_domains: include PSL private domains like 'blogspot.com'
    - session: optional requests.Session for HTTP customization

    Returns:
    ExtractResult with the parsed components and metadata
    """
```

**Usage Examples:**

```python
import tldextract

# Standard domains
result = tldextract.extract('http://www.google.com')
print(result)
# ExtractResult(subdomain='www', domain='google', suffix='com', is_private=False)

# Complex country code TLDs
result = tldextract.extract('http://forums.bbc.co.uk/')
print(result)
# ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk', is_private=False)

# Subdomains with multiple levels
result = tldextract.extract('http://forums.news.cnn.com/')
print(result)
# ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)

# International domains
result = tldextract.extract('http://www.worldbank.org.kg/')
print(result)
# ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg', is_private=False)
```

### Private Domain Handling

Control how PSL private domains are handled during extraction. Private domains are organizational domains, such as 'blogspot.com', under which third parties can register subdomains.

```python
# Default behavior: treat private domains as regular registrable domains
result = tldextract.extract('waiterrant.blogspot.com')
print(result)
# ExtractResult(subdomain='waiterrant', domain='blogspot', suffix='com', is_private=False)

# Include private domains in the suffix
result = tldextract.extract('waiterrant.blogspot.com', include_psl_private_domains=True)
print(result)
# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
```

### Edge Case Handling

The library gracefully handles various edge cases, including IP addresses, invalid suffixes, and malformed URLs.

```python
# IPv4 addresses
result = tldextract.extract('http://127.0.0.1:8080/deployed/')
print(result)
# ExtractResult(subdomain='', domain='127.0.0.1', suffix='', is_private=False)

# IPv6 addresses
result = tldextract.extract('http://[2001:db8::1]/path')
print(result.domain)  # '[2001:db8::1]'

# No subdomain
result = tldextract.extract('google.com')
print(result)
# ExtractResult(subdomain='', domain='google', suffix='com', is_private=False)

# Invalid suffixes
result = tldextract.extract('google.notavalidsuffix')
print(result)
# ExtractResult(subdomain='google', domain='notavalidsuffix', suffix='', is_private=False)
```

### Session Customization

Provide a custom HTTP session for PSL fetching to support proxies, authentication, or other HTTP customizations.

```python
import requests
import tldextract

# Create a custom session that routes requests through a proxy
session = requests.Session()
session.proxies = {'http': 'http://proxy.example.com:8080'}

# Use the custom session when fetching the PSL
result = tldextract.extract('http://example.com', session=session)
```

### Update Functionality

Force an update of the cached Public Suffix List data to pick up the latest TLD definitions.

```python { .api }
def update(fetch_now: bool = False, session: requests.Session | None = None) -> None:
    """
    Force update of cached PSL data.

    Parameters:
    - fetch_now: whether to fetch immediately rather than on the next extraction
    - session: optional requests.Session for HTTP customization
    """
```

**Usage Example:**

```python
import tldextract

# Force an immediate update of the PSL data
tldextract.update(fetch_now=True)

# Subsequent extractions use the refreshed list
result = tldextract.extract('http://example.new-tld')
```


## Return Value

All extraction functions return an `ExtractResult` object with the following structure:

```python { .api }
@dataclass
class ExtractResult:
    subdomain: str        # All subdomains, empty string if none
    domain: str           # Main domain name
    suffix: str           # Public suffix (TLD), empty string if none/invalid
    is_private: bool      # Whether the suffix is from PSL private domains
    registry_suffix: str  # Registry suffix (internal)
```

The `ExtractResult` provides additional properties and methods for working with the parsed components; see [Result Processing](./result-processing.md) for complete details.

## Error Handling

The extraction functions are designed never to raise exceptions on malformed input. Invalid or unparseable URLs return sensible fallback values:

- Invalid URLs return the entire input as the `domain`, with empty `subdomain` and `suffix`
- IP addresses are detected and returned as the `domain`, with an empty `suffix`
- Network errors during PSL fetching fall back to the bundled snapshot
- Malformed PSL data is handled gracefully, with warnings logged