# tldextract


Accurately separates a URL's subdomain, domain, and public suffix using the Public Suffix List (PSL). This library provides robust URL parsing that handles complex domain structures including country code TLDs (ccTLDs), generic TLDs (gTLDs), and their exceptions that naive string splitting cannot parse correctly.
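
To make the problem concrete, here is a toy sketch in plain Python comparing naive splitting with a suffix-aware split. It is not how tldextract is implemented: `TOY_SUFFIXES`, `naive_split`, and `toy_psl_split` are hypothetical names, and the three-entry suffix set stands in for the real PSL, which also has wildcard and exception rules.

```python
# Toy illustration only: a tiny hand-picked subset of suffix rules.
# The real Public Suffix List has thousands of entries plus wildcard
# and exception rules, none of which are modelled here.
TOY_SUFFIXES = {"com", "uk", "co.uk"}


def naive_split(hostname):
    """Assume the last label is the suffix (needs at least two labels)."""
    labels = hostname.split(".")
    return ".".join(labels[:-2]), labels[-2], labels[-1]


def toy_psl_split(hostname, suffixes=TOY_SUFFIXES):
    """Find the longest known suffix; the label to its left is the domain."""
    labels = hostname.split(".")
    # Scanning left to right tries the longest candidate suffix first.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: max(i - 1, 0)])
            return subdomain, domain, candidate
    # No known suffix: treat the last label as the domain.
    return ".".join(labels[:-1]), labels[-1], ""


print(naive_split("forums.bbc.co.uk"))    # ('forums.bbc', 'co', 'uk')
print(toy_psl_split("forums.bbc.co.uk"))  # ('forums', 'bbc', 'co.uk')
```

The naive version reports `co` as the domain, which is the kind of error a suffix list avoids.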


## Package Information


- **Package Name**: tldextract
- **Language**: Python
- **Installation**: `pip install tldextract`


## Core Imports


```python
import tldextract
```


For basic usage, all functionality is available through the main module:


```python
from tldextract import extract, TLDExtract, ExtractResult, __version__
```


## Basic Usage


```python
import tldextract

# Basic URL extraction
result = tldextract.extract('http://forums.news.cnn.com/')
print(result)
# ExtractResult(subdomain='forums.news', domain='cnn', suffix='com', is_private=False)

# Access individual components
print(f"Subdomain: {result.subdomain}")  # 'forums.news'
print(f"Domain: {result.domain}")        # 'cnn'
print(f"Suffix: {result.suffix}")        # 'com'

# Reconstruct full domain name
print(result.fqdn)  # 'forums.news.cnn.com'

# Handle complex TLDs
uk_result = tldextract.extract('http://forums.bbc.co.uk/')
print(uk_result)
# ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk', is_private=False)

# Handle edge cases
ip_result = tldextract.extract('http://127.0.0.1:8080/path')
print(ip_result)
# ExtractResult(subdomain='', domain='127.0.0.1', suffix='', is_private=False)
```


## Architecture


The tldextract library uses the authoritative Public Suffix List (PSL) to make parsing decisions:


- **Public Suffix List (PSL)**: Maintained list of all known public suffixes under which domain registration is possible
- **Caching System**: Local caching of PSL data to avoid repeated HTTP requests
- **Fallback Mechanism**: Built-in snapshot for offline operation
- **Private Domains**: Optional support for PSL private domains (like blogspot.com)


On first use, the library fetches the latest PSL data and caches it locally. If network access is unavailable, it falls back to a bundled snapshot.


## Capabilities


### URL Extraction


The convenience `extract()` function covers the most common use case: it extracts URL components with sensible defaults.


```python { .api }
def extract(
    url: str,
    include_psl_private_domains: bool | None = False,
    session: requests.Session | None = None
) -> ExtractResult
```


[URL Extraction](./url-extraction.md)


### Configurable Extraction


Advanced extraction with custom configuration options including cache settings, custom suffix lists, and private domain handling through the `TLDExtract` class.


```python { .api }
class TLDExtract:
    def __init__(
        self,
        cache_dir: str | None = None,
        suffix_list_urls: Sequence[str] = PUBLIC_SUFFIX_LIST_URLS,
        fallback_to_snapshot: bool = True,
        include_psl_private_domains: bool = False,
        extra_suffixes: Sequence[str] = (),
        cache_fetch_timeout: str | float | None = CACHE_TIMEOUT
    ) -> None

    def __call__(
        self,
        url: str,
        include_psl_private_domains: bool | None = None,
        session: requests.Session | None = None
    ) -> ExtractResult

    def extract_str(
        self,
        url: str,
        include_psl_private_domains: bool | None = None,
        session: requests.Session | None = None
    ) -> ExtractResult

    def extract_urllib(
        self,
        url: urllib.parse.ParseResult | urllib.parse.SplitResult,
        include_psl_private_domains: bool | None = None,
        session: requests.Session | None = None
    ) -> ExtractResult

    def update(
        self,
        fetch_now: bool = False,
        session: requests.Session | None = None
    ) -> None

    def tlds(self, session: requests.Session | None = None) -> list[str]
```


[Configurable Extraction](./configurable-extraction.md)


### Result Processing


Comprehensive result handling with properties for reconstructing domains, handling IP addresses, and accessing metadata about the extraction process.


```python { .api }
@dataclass
class ExtractResult:
    subdomain: str
    domain: str
    suffix: str
    is_private: bool
    registry_suffix: str

    @property
    def fqdn(self) -> str

    @property
    def ipv4(self) -> str

    @property
    def ipv6(self) -> str

    @property
    def registered_domain(self) -> str

    @property
    def reverse_domain_name(self) -> str

    @property
    def top_domain_under_public_suffix(self) -> str

    @property
    def top_domain_under_registry_suffix(self) -> str
```


[Result Processing](./result-processing.md)


### Command Line Interface


Command-line tool for URL parsing with options for output formatting, cache management, and PSL updates.


```bash { .api }
tldextract [options] <url1> [url2] ...
```


[Command Line Interface](./cli.md)


### PSL Data Management


Functions for updating and managing Public Suffix List data globally.


```python { .api }
def update(fetch_now: bool = False, session: requests.Session | None = None) -> None
```


[URL Extraction](./url-extraction.md)


## Types


```python { .api }
from typing import Sequence
from dataclasses import dataclass, field
import requests
import urllib.parse

# Module attributes
__version__: str

# Constants
PUBLIC_SUFFIX_LIST_URLS: tuple[str, ...]
CACHE_TIMEOUT: str | None

# Functions: extract() and update(), detailed in their respective sections

# Classes, detailed in their respective sections
# ExtractResult: dataclass (see Result Processing)
# TLDExtract: class (see Configurable Extraction)
```