
# Configurable Extraction

Advanced extraction functionality through the `TLDExtract` class, providing fine-grained control over caching, suffix list sources, private domain handling, and network behavior. Use this when you need custom configuration beyond the default `extract()` function.

## Capabilities

### TLDExtract Class

Main configurable extractor class that allows custom PSL sources, cache management, and extraction behavior.

```python { .api }
class TLDExtract:
    def __init__(
        self,
        cache_dir: str | None = None,
        suffix_list_urls: Sequence[str] = PUBLIC_SUFFIX_LIST_URLS,
        fallback_to_snapshot: bool = True,
        include_psl_private_domains: bool = False,
        extra_suffixes: Sequence[str] = (),
        cache_fetch_timeout: str | float | None = CACHE_TIMEOUT,
    ) -> None:
        """
        Create a configurable TLD extractor.

        Parameters:
        - cache_dir: Directory for caching PSL data (None disables caching)
        - suffix_list_urls: URLs to fetch PSL data from, tried in order
        - fallback_to_snapshot: Fall back to the bundled PSL snapshot if fetching fails
        - include_psl_private_domains: Include PSL private domains by default
        - extra_suffixes: Additional custom suffixes to recognize
        - cache_fetch_timeout: HTTP timeout for PSL fetching, in seconds
        """
```

### Basic Extraction Methods

Core extraction methods that parse URL strings into components.

```python { .api }
def __call__(
    self,
    url: str,
    include_psl_private_domains: bool | None = None,
    session: requests.Session | None = None,
) -> ExtractResult:
    """
    Extract components from a URL string (alias for extract_str).

    Parameters:
    - url: URL string to parse
    - include_psl_private_domains: Override the instance default for private domains
    - session: Optional requests.Session for HTTP customization

    Returns:
    ExtractResult with parsed components
    """

def extract_str(
    self,
    url: str,
    include_psl_private_domains: bool | None = None,
    session: requests.Session | None = None,
) -> ExtractResult:
    """
    Extract components from a URL string.

    Parameters:
    - url: URL string to parse
    - include_psl_private_domains: Override the instance default for private domains
    - session: Optional requests.Session for HTTP customization

    Returns:
    ExtractResult with parsed components
    """
```

### Optimized urllib Extraction

Extract from pre-parsed urllib objects for better performance when you already have parsed URL components.

```python { .api }
def extract_urllib(
    self,
    url: urllib.parse.ParseResult | urllib.parse.SplitResult,
    include_psl_private_domains: bool | None = None,
    session: requests.Session | None = None,
) -> ExtractResult:
    """
    Extract from a urllib.parse result for better performance.

    Parameters:
    - url: Result from urllib.parse.urlparse() or urlsplit()
    - include_psl_private_domains: Override the instance default for private domains
    - session: Optional requests.Session for HTTP customization

    Returns:
    ExtractResult with parsed components
    """
```

### Cache and Data Management

Methods for managing PSL data and caching behavior.

```python { .api }
def update(
    self,
    fetch_now: bool = False,
    session: requests.Session | None = None,
) -> None:
    """
    Force a refresh of PSL data.

    Parameters:
    - fetch_now: Fetch immediately rather than on the next extraction
    - session: Optional requests.Session for HTTP customization
    """

def tlds(self, session: requests.Session | None = None) -> list[str]:
    """
    Get the list of TLDs currently used by this extractor.

    Parameters:
    - session: Optional requests.Session for HTTP customization

    Returns:
    List of TLD strings; varies with include_psl_private_domains and extra_suffixes
    """
```

## Configuration Examples

### Disable Caching

Create an extractor that doesn't use disk caching, for environments where disk access is restricted.

```python
import tldextract

# Disable caching entirely
no_cache_extractor = tldextract.TLDExtract(cache_dir=None)
result = no_cache_extractor('http://example.com')
```

### Custom Cache Directory

Specify a custom location for PSL data caching.

```python
import tldextract

# Use a custom cache directory
custom_cache_extractor = tldextract.TLDExtract(cache_dir='/path/to/custom/cache/')
result = custom_cache_extractor('http://example.com')
```

### Offline Operation

Create an extractor that works entirely offline using the bundled PSL snapshot.

```python
import tldextract

# Offline-only extractor
offline_extractor = tldextract.TLDExtract(
    suffix_list_urls=(),  # no remote URLs
    fallback_to_snapshot=True,
)
result = offline_extractor('http://example.com')
```

### Custom PSL Sources

Use alternative or local PSL data sources.

```python
import tldextract

# Use custom PSL sources
custom_psl_extractor = tldextract.TLDExtract(
    suffix_list_urls=[
        'file:///path/to/local/suffix_list.dat',
        'http://custom.psl.mirror.com/list.dat',
    ],
    fallback_to_snapshot=False,
)
result = custom_psl_extractor('http://example.com')
```

### Private Domains by Default

Configure an extractor to always include PSL private domains.

```python
import tldextract

# Always include private domains
private_extractor = tldextract.TLDExtract(include_psl_private_domains=True)

# This treats blogspot.com as a public suffix
result = private_extractor('waiterrant.blogspot.com')
print(result)
# ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com', is_private=True)
```

### Extra Custom Suffixes

Add custom suffixes that aren't in the PSL.

```python
import tldextract

# Add custom internal suffixes
internal_extractor = tldextract.TLDExtract(
    extra_suffixes=['internal', 'corp.example.com']
)

result = internal_extractor('subdomain.example.internal')
print(result)
# ExtractResult(subdomain='subdomain', domain='example', suffix='internal', is_private=False)
```

### HTTP Timeout Configuration

Configure the timeout for PSL fetching operations.

```python
import tldextract

# Set a custom timeout
timeout_extractor = tldextract.TLDExtract(cache_fetch_timeout=10.0)
result = timeout_extractor('http://example.com')

# Can also be set via environment variable
import os
os.environ['TLDEXTRACT_CACHE_TIMEOUT'] = '5.0'
env_extractor = tldextract.TLDExtract()
```

### urllib Integration

Optimize performance when working with pre-parsed URLs.

```python
import urllib.parse
import tldextract

extractor = tldextract.TLDExtract()

# Parse once, extract efficiently
parsed_url = urllib.parse.urlparse('http://forums.news.cnn.com/path?query=value')
result = extractor.extract_urllib(parsed_url)
print(result)
# ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)
```

### Session Customization

Use a custom HTTP session for PSL fetching, with proxies, authentication, or other customizations.

```python
import requests
import tldextract

# Create a session with custom configuration
session = requests.Session()
session.proxies = {'http': 'http://proxy.example.com:8080'}
session.headers.update({'User-Agent': 'MyApp/1.0'})

extractor = tldextract.TLDExtract()

# Use the custom session for PSL fetching
result = extractor('http://example.com', session=session)

# Force an update with the custom session
extractor.update(fetch_now=True, session=session)
```

## Error Handling

The `TLDExtract` class handles various error conditions gracefully:

- **Network errors**: Falls back to cached data or the bundled snapshot
- **Invalid PSL data**: Logs warnings and continues with available data
- **Permission errors**: Logs cache access issues and operates without caching
- **Invalid configuration**: Raises `ValueError` for impossible configurations (e.g., no data sources)

```python
import tldextract

# This raises ValueError - there is no way left to obtain PSL data
try:
    bad_extractor = tldextract.TLDExtract(
        suffix_list_urls=(),
        cache_dir=None,
        fallback_to_snapshot=False,
    )
except ValueError as e:
    print("Configuration error:", e)
```

## Performance Considerations

- **Caching**: Enabled by default, provides a significant performance improvement
- **Instance reuse**: Create once, use many times for best performance
- **urllib integration**: Use `extract_urllib()` when you already have parsed URLs
- **Session reuse**: Pass the same session object for multiple extractions with custom HTTP configuration