# Configurable Extraction

Advanced extraction functionality through the `TLDExtract` class, providing fine-grained control over caching, suffix list sources, private domain handling, and network behavior. Use this when you need custom configuration beyond the default `extract()` function.

## Capabilities

### TLDExtract Class

Main configurable extractor class that allows custom PSL sources, cache management, and extraction behavior.

```python { .api }
class TLDExtract:
    def __init__(
        self,
        cache_dir: str | None = None,
        suffix_list_urls: Sequence[str] = PUBLIC_SUFFIX_LIST_URLS,
        fallback_to_snapshot: bool = True,
        include_psl_private_domains: bool = False,
        extra_suffixes: Sequence[str] = (),
        cache_fetch_timeout: str | float | None = CACHE_TIMEOUT,
    ) -> None:
        """
        Create a configurable TLD extractor.

        Parameters:
        - cache_dir: Directory for caching PSL data (None disables caching)
        - suffix_list_urls: URLs to fetch PSL data from, tried in order
        - fallback_to_snapshot: Fall back to the bundled PSL snapshot if fetching fails
        - include_psl_private_domains: Include PSL private domains by default
        - extra_suffixes: Additional custom suffixes to recognize
        - cache_fetch_timeout: HTTP timeout for PSL fetching, in seconds
        """
```

### Basic Extraction Methods

Core extraction methods that parse URL strings into components.

```python { .api }
def __call__(
    self,
    url: str,
    include_psl_private_domains: bool | None = None,
    session: requests.Session | None = None,
) -> ExtractResult:
    """
    Extract components from a URL string (alias for extract_str).

    Parameters:
    - url: URL string to parse
    - include_psl_private_domains: Override the instance default for private domains
    - session: Optional requests.Session for HTTP customization

    Returns:
    ExtractResult with parsed components
    """

def extract_str(
    self,
    url: str,
    include_psl_private_domains: bool | None = None,
    session: requests.Session | None = None,
) -> ExtractResult:
    """
    Extract components from a URL string.

    Parameters:
    - url: URL string to parse
    - include_psl_private_domains: Override the instance default for private domains
    - session: Optional requests.Session for HTTP customization

    Returns:
    ExtractResult with parsed components
    """
```

### Optimized urllib Extraction

Extract from pre-parsed urllib objects for better performance when you already have parsed URL components.

```python { .api }
def extract_urllib(
    self,
    url: urllib.parse.ParseResult | urllib.parse.SplitResult,
    include_psl_private_domains: bool | None = None,
    session: requests.Session | None = None,
) -> ExtractResult:
    """
    Extract from a urllib.parse result for better performance.

    Parameters:
    - url: Result from urllib.parse.urlparse() or urlsplit()
    - include_psl_private_domains: Override the instance default for private domains
    - session: Optional requests.Session for HTTP customization

    Returns:
    ExtractResult with parsed components
    """
```

### Cache and Data Management

Methods for managing PSL data and caching behavior.

```python { .api }
def update(
    self,
    fetch_now: bool = False,
    session: requests.Session | None = None,
) -> None:
    """
    Force a refresh of PSL data.

    Parameters:
    - fetch_now: Fetch immediately rather than on the next extraction
    - session: Optional requests.Session for HTTP customization
    """

def tlds(self, session: requests.Session | None = None) -> list[str]:
    """
    Get the list of TLDs currently used by this extractor.

    Parameters:
    - session: Optional requests.Session for HTTP customization

    Returns:
    List of TLD strings; varies based on include_psl_private_domains and extra_suffixes
    """
```

## Configuration Examples

### Disable Caching

Create an extractor that doesn't use disk caching, for environments where disk access is restricted.

```python
import tldextract

# Disable caching entirely
no_cache_extractor = tldextract.TLDExtract(cache_dir=None)
result = no_cache_extractor('http://example.com')
```

### Custom Cache Directory

Specify a custom location for PSL data caching.

```python
import tldextract

# Use a custom cache directory
custom_cache_extractor = tldextract.TLDExtract(cache_dir='/path/to/custom/cache/')
result = custom_cache_extractor('http://example.com')
```

### Offline Operation

Create an extractor that works entirely offline using the bundled PSL snapshot.

```python
import tldextract

# Offline-only extractor
offline_extractor = tldextract.TLDExtract(
    suffix_list_urls=(),  # No remote URLs
    fallback_to_snapshot=True,
)
result = offline_extractor('http://example.com')
```

### Custom PSL Sources

Use alternative or local PSL data sources.

```python
import tldextract

# Use custom PSL sources
custom_psl_extractor = tldextract.TLDExtract(
    suffix_list_urls=[
        'file:///path/to/local/suffix_list.dat',
        'http://custom.psl.mirror.com/list.dat',
    ],
    fallback_to_snapshot=False,
)
result = custom_psl_extractor('http://example.com')
```

### Private Domains by Default

Configure an extractor to always include PSL private domains.

```python
import tldextract

# Always include private domains
private_extractor = tldextract.TLDExtract(include_psl_private_domains=True)

# This will treat blogspot.com as a public suffix
result = private_extractor('waiterrant.blogspot.com')
print(result)
# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
```

### Extra Custom Suffixes

Add custom suffixes that aren't in the PSL.

```python
import tldextract

# Add custom internal suffixes
internal_extractor = tldextract.TLDExtract(
    extra_suffixes=['internal', 'corp.example.com'],
)

result = internal_extractor('subdomain.example.internal')
print(result)
# ExtractResult(subdomain='subdomain', domain='example', suffix='internal', is_private=False)
```

### HTTP Timeout Configuration

Configure the timeout for PSL fetching operations.

```python
import os

# TLDEXTRACT_CACHE_TIMEOUT is read when tldextract is imported,
# so set the environment variable before the first import
os.environ['TLDEXTRACT_CACHE_TIMEOUT'] = '5.0'

import tldextract

# Or set a custom timeout (in seconds) per instance
timeout_extractor = tldextract.TLDExtract(cache_fetch_timeout=10.0)
result = timeout_extractor('http://example.com')

# Instances created without an explicit timeout use the environment value
env_extractor = tldextract.TLDExtract()
```

### urllib Integration

Optimize performance when working with pre-parsed URLs.

```python
import urllib.parse

import tldextract

extractor = tldextract.TLDExtract()

# Parse once, extract efficiently
parsed_url = urllib.parse.urlparse('http://forums.news.cnn.com/path?query=value')
result = extractor.extract_urllib(parsed_url)
print(result)
# ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)
```

### Session Customization

Use a custom HTTP session for PSL fetching, with proxies, authentication, or other customizations.

```python
import requests

import tldextract

# Create a session with custom configuration
session = requests.Session()
session.proxies = {'http': 'http://proxy.example.com:8080'}
session.headers.update({'User-Agent': 'MyApp/1.0'})

extractor = tldextract.TLDExtract()

# Use the custom session for PSL fetching
result = extractor('http://example.com', session=session)

# Force an update with the custom session
extractor.update(fetch_now=True, session=session)
```

## Error Handling

The `TLDExtract` class handles various error conditions gracefully:

- **Network errors**: Falls back to cached data or the bundled snapshot
- **Invalid PSL data**: Logs warnings and continues with available data
- **Permission errors**: Logs cache access issues and operates without caching
- **Invalid configuration**: Raises `ValueError` for impossible configurations (e.g., no data sources)

```python
import tldextract

# This configuration leaves no way to obtain PSL data, so a ValueError
# is raised once the suffix list is needed
try:
    bad_extractor = tldextract.TLDExtract(
        suffix_list_urls=(),
        cache_dir=None,
        fallback_to_snapshot=False,
    )
    bad_extractor('http://example.com')  # suffix data is loaded lazily
except ValueError as e:
    print("Configuration error:", e)
```

## Performance Considerations

- **Caching**: Enabled by default, provides a significant performance improvement
- **Instance reuse**: Create once, use many times for best performance
- **urllib integration**: Use `extract_urllib()` when you already have parsed URLs
- **Session reuse**: Pass the same session object across extractions that need custom HTTP configuration