# Command Line Interface

Command-line tool for URL parsing, with options for output formatting, cache management, PSL updates, and batch processing. The CLI makes tldextract functionality available to shell scripts and command-line workflows.

## Capabilities

### Basic Command Structure

The CLI accepts URLs as positional arguments and provides options for customizing behavior and output.

```bash { .api }
tldextract [options] <url1> [url2] ...

Options:
  --version                  Show version information
  -j, --json                 Output in JSON format
  -u, --update               Force fetch the latest TLD definitions
  --suffix_list_url URL      Use an alternate PSL URL/file (may be given multiple times)
  -c DIR, --cache_dir DIR    Use an alternate cache directory
  -p, --include_psl_private_domains, --private_domains
                             Include PSL private domains
  --no_fallback_to_snapshot  Don't fall back to the bundled PSL snapshot
```

### Basic Usage

Extract URL components with the default space-separated output:

```bash
# Single URL
tldextract 'http://forums.bbc.co.uk'
# Output: forums bbc co.uk

# Multiple URLs
tldextract 'google.com' 'http://forums.news.cnn.com/' 'https://www.example.co.uk'
# Output:
# google com
# forums.news cnn com
# www example co.uk

# Complex domains
tldextract 'http://www.worldbank.org.kg/'
# Output: www worldbank org.kg
```

### JSON Output

Get structured JSON output for programmatic processing:

```bash
# Single URL with JSON output
tldextract --json 'http://forums.bbc.co.uk'
# Output: {"subdomain": "forums", "domain": "bbc", "suffix": "co.uk", "is_private": false, "registry_suffix": "co.uk", "fqdn": "forums.bbc.co.uk", "ipv4": "", "ipv6": "", "registered_domain": "bbc.co.uk", "reverse_domain_name": "co.uk.bbc.forums", "top_domain_under_public_suffix": "bbc.co.uk", "top_domain_under_registry_suffix": "bbc.co.uk"}

# Multiple URLs with JSON output
tldextract --json 'google.com' 'http://127.0.0.1:8080'
# Output:
# {"subdomain": "", "domain": "google", "suffix": "com", "is_private": false, "registry_suffix": "com", "fqdn": "google.com", "ipv4": "", "ipv6": "", "registered_domain": "google.com", "reverse_domain_name": "com.google", "top_domain_under_public_suffix": "google.com", "top_domain_under_registry_suffix": "google.com"}
# {"subdomain": "", "domain": "127.0.0.1", "suffix": "", "is_private": false, "registry_suffix": "", "fqdn": "", "ipv4": "127.0.0.1", "ipv6": "", "registered_domain": "", "reverse_domain_name": "127.0.0.1", "top_domain_under_public_suffix": "", "top_domain_under_registry_suffix": ""}
```
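
Each URL's result is printed as one JSON object per line, so downstream tools can process results one object at a time. A minimal sketch of pulling a single field with `jq`, using a hard-coded, abridged sample object in place of a live `tldextract --json` call:

```bash
# A sample line in the shape produced by `tldextract --json` (abridged)
json='{"subdomain": "forums", "domain": "bbc", "suffix": "co.uk", "registered_domain": "bbc.co.uk"}'

# -r emits the raw string without surrounding JSON quotes
printf '%s\n' "$json" | jq -r '.registered_domain'
# Prints: bbc.co.uk
```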

### Private Domain Handling

Control how PSL private domains are processed:

```bash
# Default behavior: private domains treated as regular domains
tldextract 'waiterrant.blogspot.com'
# Output: waiterrant blogspot com

# Include private domains in the suffix
tldextract --include_psl_private_domains 'waiterrant.blogspot.com'
# Output: waiterrant blogspot.com

# Short form of the option
tldextract -p 'waiterrant.blogspot.com'
# Output: waiterrant blogspot.com
```

### PSL Data Management

Update and manage Public Suffix List data:

```bash
# Force an update of PSL data from remote sources
tldextract --update

# Update and then process URLs
tldextract --update 'http://example.new-tld'

# Check the version after updating
tldextract --version
```

### Custom PSL Sources

Use alternative or local PSL data sources:

```bash
# Use a custom remote PSL source
tldextract --suffix_list_url 'http://custom.psl.mirror.com/list.dat' 'example.com'

# Use a local PSL file
tldextract --suffix_list_url 'file:///path/to/custom/suffix_list.dat' 'example.com'

# Use multiple PSL sources (tried in order)
tldextract --suffix_list_url 'http://primary.psl.com/list.dat' --suffix_list_url 'http://backup.psl.com/list.dat' 'example.com'

# Disable the fallback to the bundled snapshot
tldextract --suffix_list_url 'http://custom.psl.com/list.dat' --no_fallback_to_snapshot 'example.com'
```

### Cache Management

Control PSL data caching behavior:

```bash
# Use a custom cache directory
tldextract --cache_dir '/path/to/custom/cache' 'example.com'

# Use an environment variable for the cache location
export TLDEXTRACT_CACHE="/path/to/cache"
tldextract 'example.com'

# Use an environment variable for the cache timeout
export TLDEXTRACT_CACHE_TIMEOUT="10.0"
tldextract 'example.com'
```

## Integration Examples

### Shell Scripting

Extract specific components for use in shell scripts:

```bash
#!/bin/bash
# Extract just the domain name
URL="http://forums.news.cnn.com/"
DOMAIN=$(tldextract "$URL" | awk '{print $2}')
echo "Domain: $DOMAIN" # Output: Domain: cnn

# Extract all components (quote the substitution; note that an
# empty subdomain column shifts the remaining fields left)
read -r SUBDOMAIN DOMAIN SUFFIX <<< "$(tldextract "$URL")"
echo "Subdomain: $SUBDOMAIN"
echo "Domain: $DOMAIN"
echo "Suffix: $SUFFIX"
```
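
Because `read` and default `awk` field splitting collapse whitespace, an empty subdomain column shifts the remaining fields left. One way to preserve empty columns is to split on each literal space with `awk -F'[ ]'`; in this sketch, hard-coded sample lines stand in for live `tldextract` output:

```bash
# Sample CLI output lines; note the leading space marking the
# empty subdomain column for 'google.com'.
line1='forums bbc co.uk'
line2=' google com'

# FS='[ ]' splits on every single space, preserving empty fields,
# whereas the default FS collapses runs of whitespace.
printf '%s\n' "$line1" | awk -F'[ ]' '{print "domain=" $2}'
printf '%s\n' "$line2" | awk -F'[ ]' '{print "domain=" $2}'
# Prints:
# domain=bbc
# domain=google
```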

### Batch Processing

Process multiple URLs from files or pipes:

```bash
# Process URLs from a file
cat urls.txt | xargs tldextract

# Process with JSON output for further processing
cat urls.txt | xargs tldextract --json | jq -r '.domain' | sort | uniq

# Extract domains from access logs
grep "GET" access.log | awk '{print $7}' | xargs tldextract | awk '{print $2}' | sort | uniq -c
```
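
The `awk '{print $2}'` column trick above picks the domain only when a subdomain is present; for tallying suffixes, counting the last field (`$NF`) is insensitive to that shift. A self-contained sketch, with sample lines standing in for live `tldextract` output:

```bash
# Count suffixes from sample output lines; $NF is the last
# (suffix) column whether or not a subdomain is present.
printf '%s\n' 'forums bbc co.uk' ' google com' ' www example co.uk' |
  awk '{print $NF}' | sort | uniq -c
# Output (counts): 2 co.uk, 1 com
```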

### Combined with Other Tools

Use with standard Unix tools for data processing:

```bash
# Count domains by TLD
tldextract --json 'site1.com' 'site2.org' 'site3.com' | jq -r '.suffix' | sort | uniq -c

# Extract and validate domains
echo "http://example.com" | xargs tldextract --json | jq -r 'select(.suffix != "") | .top_domain_under_public_suffix'

# Check for private domains
tldextract --json --include_psl_private_domains 'waiterrant.blogspot.com' | jq '.is_private'
```

## Error Handling

The CLI handles various error conditions gracefully:

### Invalid URLs

```bash
# Inputs that are not valid URLs are still processed without errors
tldextract 'not-a-url' 'google.notavalidsuffix'
# Output:
# not-a-url
# google notavalidsuffix
```

### Network Errors

```bash
# Network errors during a PSL update are logged but don't prevent operation
tldextract --update --suffix_list_url 'http://nonexistent.example.com/list.dat' 'example.com'
# Falls back to cached data or the bundled snapshot
```

### Missing Arguments

```bash
# Running with no URLs shows usage
tldextract
# Output: usage: tldextract [-h] [--version] [-j] [-u] [--suffix_list_url SUFFIX_LIST_URL] [-c CACHE_DIR] [-p] [--no_fallback_to_snapshot] [fqdn|url ...]

# Help is available
tldextract --help
```

## Output Format Details

### Standard Output Format

Space-separated: `subdomain domain suffix`

- Empty fields are printed as empty strings, so a missing subdomain leaves a leading space
- IPv4/IPv6 addresses appear in the domain field with an empty suffix
- Unrecognized suffixes leave the suffix field empty

### JSON Output Format

Complete ExtractResult data, including all properties:

```json
{
  "subdomain": "forums",
  "domain": "bbc",
  "suffix": "co.uk",
  "is_private": false,
  "registry_suffix": "co.uk",
  "fqdn": "forums.bbc.co.uk",
  "ipv4": "",
  "ipv6": "",
  "registered_domain": "bbc.co.uk",
  "reverse_domain_name": "co.uk.bbc.forums",
  "top_domain_under_public_suffix": "bbc.co.uk",
  "top_domain_under_registry_suffix": "bbc.co.uk"
}
```

## Environment Variables

The CLI respects the following environment variables:

- `TLDEXTRACT_CACHE`: Cache directory path (overrides the default)
- `TLDEXTRACT_CACHE_TIMEOUT`: HTTP timeout for PSL fetching, in seconds

```bash
# Set the cache location
export TLDEXTRACT_CACHE="/tmp/tldextract_cache"

# Set the timeout
export TLDEXTRACT_CACHE_TIMEOUT="5.0"

# Run with these settings
tldextract 'example.com'
```