# Configurable Extraction

Advanced extraction functionality through the `TLDExtract` class, providing fine-grained control over caching, suffix list sources, private domain handling, and network behavior. Use this when you need custom configuration beyond the default `extract()` function.

## Capabilities

### TLDExtract Class

Main configurable extractor class that allows custom PSL sources, cache management, and extraction behavior.

```python { .api }
class TLDExtract:
    def __init__(
        self,
        cache_dir: str | None = None,
        suffix_list_urls: Sequence[str] = PUBLIC_SUFFIX_LIST_URLS,
        fallback_to_snapshot: bool = True,
        include_psl_private_domains: bool = False,
        extra_suffixes: Sequence[str] = (),
        cache_fetch_timeout: str | float | None = CACHE_TIMEOUT,
    ) -> None:
        """
        Create a configurable TLD extractor.

        Parameters:
        - cache_dir: Directory for caching PSL data (None disables caching)
        - suffix_list_urls: URLs to fetch PSL data from, tried in order
        - fallback_to_snapshot: Fall back to the bundled PSL snapshot if fetching fails
        - include_psl_private_domains: Include PSL private domains by default
        - extra_suffixes: Additional custom suffixes to recognize
        - cache_fetch_timeout: HTTP timeout for PSL fetching, in seconds
        """
```

### Basic Extraction Methods

Core extraction methods that parse URL strings into components.

```python { .api }
def __call__(
    self,
    url: str,
    include_psl_private_domains: bool | None = None,
    session: requests.Session | None = None,
) -> ExtractResult:
    """
    Extract components from a URL string (alias for extract_str).

    Parameters:
    - url: URL string to parse
    - include_psl_private_domains: Override the instance default for private domains
    - session: Optional requests.Session for HTTP customization

    Returns:
    ExtractResult with parsed components
    """

def extract_str(
    self,
    url: str,
    include_psl_private_domains: bool | None = None,
    session: requests.Session | None = None,
) -> ExtractResult:
    """
    Extract components from a URL string.

    Parameters:
    - url: URL string to parse
    - include_psl_private_domains: Override the instance default for private domains
    - session: Optional requests.Session for HTTP customization

    Returns:
    ExtractResult with parsed components
    """
```

### Optimized urllib Extraction

Extract from pre-parsed urllib objects for better performance when you already have parsed URL components.

```python { .api }
def extract_urllib(
    self,
    url: urllib.parse.ParseResult | urllib.parse.SplitResult,
    include_psl_private_domains: bool | None = None,
    session: requests.Session | None = None,
) -> ExtractResult:
    """
    Extract from a urllib.parse result for better performance.

    Parameters:
    - url: Result from urllib.parse.urlparse() or urlsplit()
    - include_psl_private_domains: Override the instance default for private domains
    - session: Optional requests.Session for HTTP customization

    Returns:
    ExtractResult with parsed components
    """
```

### Cache and Data Management

Methods for managing PSL data and caching behavior.

```python { .api }
def update(
    self,
    fetch_now: bool = False,
    session: requests.Session | None = None,
) -> None:
    """
    Force a refresh of PSL data.

    Parameters:
    - fetch_now: Fetch immediately rather than on the next extraction
    - session: Optional requests.Session for HTTP customization
    """

def tlds(self, session: requests.Session | None = None) -> list[str]:
    """
    Get the list of TLDs currently used by this extractor.

    Parameters:
    - session: Optional requests.Session for HTTP customization

    Returns:
    List of TLD strings; varies based on include_psl_private_domains and extra_suffixes
    """
```

## Configuration Examples

### Disable Caching

Create an extractor that doesn't use disk caching, for environments where disk access is restricted.

```python
import tldextract

# Disable caching entirely
no_cache_extractor = tldextract.TLDExtract(cache_dir=None)
result = no_cache_extractor('http://example.com')
```

### Custom Cache Directory

Specify a custom location for PSL data caching.

```python
import tldextract

# Use a custom cache directory
custom_cache_extractor = tldextract.TLDExtract(cache_dir='/path/to/custom/cache/')
result = custom_cache_extractor('http://example.com')
```

### Offline Operation

Create an extractor that works entirely offline using the bundled PSL snapshot.

```python
import tldextract

# Offline-only extractor
offline_extractor = tldextract.TLDExtract(
    suffix_list_urls=(),  # No remote URLs
    fallback_to_snapshot=True,
)
result = offline_extractor('http://example.com')
```

### Custom PSL Sources

Use alternative or local PSL data sources.

```python
import tldextract

# Use custom PSL sources
custom_psl_extractor = tldextract.TLDExtract(
    suffix_list_urls=[
        'file:///path/to/local/suffix_list.dat',
        'http://custom.psl.mirror.com/list.dat',
    ],
    fallback_to_snapshot=False,
)
result = custom_psl_extractor('http://example.com')
```

### Private Domains by Default

Configure an extractor to always include PSL private domains.

```python
import tldextract

# Always include private domains
private_extractor = tldextract.TLDExtract(include_psl_private_domains=True)

# This will treat blogspot.com as a public suffix
result = private_extractor('waiterrant.blogspot.com')
print(result)
# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
```

### Extra Custom Suffixes

Add custom suffixes that aren't in the PSL.

```python
import tldextract

# Add custom internal suffixes
internal_extractor = tldextract.TLDExtract(
    extra_suffixes=['internal', 'corp.example.com'],
)

result = internal_extractor('subdomain.example.internal')
print(result)
# ExtractResult(subdomain='subdomain', domain='example', suffix='internal', is_private=False)
```

### HTTP Timeout Configuration

Configure the timeout for PSL fetching operations.

```python
import os

# TLDEXTRACT_CACHE_TIMEOUT is read when tldextract is imported,
# so set the environment variable before the first import
os.environ['TLDEXTRACT_CACHE_TIMEOUT'] = '5.0'

import tldextract

# Or set a custom timeout (in seconds) per instance
timeout_extractor = tldextract.TLDExtract(cache_fetch_timeout=10.0)
result = timeout_extractor('http://example.com')

# Instances created without an explicit timeout use the environment value
env_extractor = tldextract.TLDExtract()
```

### urllib Integration

Optimize performance when working with pre-parsed URLs.

```python
import urllib.parse

import tldextract

extractor = tldextract.TLDExtract()

# Parse once, extract efficiently
parsed_url = urllib.parse.urlparse('http://forums.news.cnn.com/path?query=value')
result = extractor.extract_urllib(parsed_url)
print(result)
# ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)
```

### Session Customization

Use a custom HTTP session for PSL fetching, with proxies, authentication, or other customizations.

```python
import requests

import tldextract

# Create a session with custom configuration
session = requests.Session()
session.proxies = {'http': 'http://proxy.example.com:8080'}
session.headers.update({'User-Agent': 'MyApp/1.0'})

extractor = tldextract.TLDExtract()

# Use the custom session for PSL fetching
result = extractor('http://example.com', session=session)

# Force an update with the custom session
extractor.update(fetch_now=True, session=session)
```

## Error Handling

The `TLDExtract` class handles various error conditions gracefully:

- **Network errors**: Falls back to cached data or the bundled snapshot
- **Invalid PSL data**: Logs warnings and continues with available data
- **Permission errors**: Logs cache access issues and operates without caching
- **Invalid configuration**: Raises `ValueError` for impossible configurations (e.g., no data sources)

```python
import tldextract

# This configuration leaves no way to obtain PSL data, so a ValueError
# is raised once the suffix list is needed
try:
    bad_extractor = tldextract.TLDExtract(
        suffix_list_urls=(),
        cache_dir=None,
        fallback_to_snapshot=False,
    )
    bad_extractor('http://example.com')  # suffix data is loaded lazily
except ValueError as e:
    print("Configuration error:", e)
```

## Performance Considerations

- **Caching**: Enabled by default, provides a significant performance improvement
- **Instance reuse**: Create once, use many times for best performance
- **urllib integration**: Use `extract_urllib()` when you already have parsed URLs
- **Session reuse**: Pass the same session object across extractions that need custom HTTP configuration