# Information Extractors

Information extractors are site-specific modules that handle video metadata extraction from over 1000 supported sites. Each extractor understands the specific URL patterns, API interfaces, and data structures for its target site.

## Capabilities

### Extractor Management

Functions for discovering, listing, and managing available extractors.

```python { .api }
def gen_extractors():
    """
    Return a list of instances of every supported extractor.
    The order matters; the first extractor matched handles the URL.

    Returns:
    list: List of extractor instances
    """

def gen_extractor_classes():
    """
    Return a list of supported extractor classes.
    The order matters; the first extractor matched handles the URL.

    Returns:
    list: List of extractor classes
    """

def list_extractors(age_limit):
    """
    Return a list of extractors suitable for the given age limit,
    sorted by extractor ID.

    Parameters:
    - age_limit (int): Age limit for content filtering

    Returns:
    list: List of suitable extractor instances
    """

def get_info_extractor(ie_name):
    """
    Returns the info extractor class with the given name.

    Parameters:
    - ie_name (str): Extractor name (without 'IE' suffix)

    Returns:
    class: Extractor class
    """
```
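
Because the order of the returned list decides which extractor handles a URL, the lookup amounts to a first-match scan. The sketch below illustrates that idea only: `SpecificIE`, `CatchAllIE`, and `pick_extractor` are hypothetical stand-ins written for this example, not part of youtube-dl, and the real dispatch performs additional work.

```python
import re


class SpecificIE:
    """Hypothetical extractor that only accepts watch URLs on example.com."""
    IE_NAME = 'specific'
    _VALID_URL = r'https?://(?:www\.)?example\.com/watch/(?P<id>\d+)'

    def suitable(self, url):
        return re.match(self._VALID_URL, url) is not None


class CatchAllIE:
    """Hypothetical fallback that accepts any http(s) URL."""
    IE_NAME = 'catch-all'
    _VALID_URL = r'https?://.+'

    def suitable(self, url):
        return re.match(self._VALID_URL, url) is not None


def pick_extractor(extractors, url):
    """Return the first extractor whose suitable() accepts the URL, else None."""
    for ie in extractors:
        if ie.suitable(url):
            return ie
    return None


# Order matters: the specific extractor must come before the fallback.
ordered = [SpecificIE(), CatchAllIE()]
print(pick_extractor(ordered, 'https://example.com/watch/42').IE_NAME)
print(pick_extractor(ordered, 'https://other.example.org/clip').IE_NAME)
```

If the list were reversed, the fallback would claim every URL and the specific extractor would never run, which is why a catch-all entry belongs at the end.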

### Base InfoExtractor Class

Base class that all site-specific extractors inherit from, providing common functionality and interfaces.

```python { .api }
class InfoExtractor:
    def __init__(self, downloader=None):
        """
        Base class for information extractors.

        Parameters:
        - downloader: YoutubeDL instance
        """

    def suitable(self, url):
        """
        Check if the extractor is suitable for the given URL.

        Parameters:
        - url (str): URL to check

        Returns:
        bool: True if suitable, False otherwise
        """

    def extract(self, url):
        """
        Extract information from the given URL.

        Parameters:
        - url (str): URL to extract from

        Returns:
        dict: Extracted information dictionary
        """

    def _real_extract(self, url):
        """
        Actual extraction logic (implemented by subclasses).

        Parameters:
        - url (str): URL to extract from

        Returns:
        dict: Extracted information dictionary
        """
```
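
The split between the public `extract` method and the `_real_extract` hook can be illustrated with a toy base class. This is a minimal sketch, not youtube-dl's implementation: `SketchExtractor`, `ExampleSiteIE`, the `_VALID_URL` pattern, and all URLs are assumptions made up for the example, and no network access takes place.

```python
import re


class SketchExtractor:
    """Toy stand-in for the base class; NOT youtube-dl's InfoExtractor."""
    _VALID_URL = None  # subclasses set a regex with a named 'id' group

    def suitable(self, url):
        # An extractor is suitable when its URL pattern matches.
        return re.match(self._VALID_URL, url) is not None

    def extract(self, url):
        # Public entry point: delegate to the subclass hook, then tag the result.
        info = self._real_extract(url)
        info.setdefault('extractor', type(self).__name__)
        return info

    def _real_extract(self, url):
        raise NotImplementedError('subclasses implement the site-specific logic')


class ExampleSiteIE(SketchExtractor):
    """Hypothetical extractor for a made-up site."""
    _VALID_URL = r'https?://video\.example\.com/v/(?P<id>[0-9a-z]+)'

    def _real_extract(self, url):
        video_id = re.match(self._VALID_URL, url).group('id')
        # A real extractor would download and parse the page here.
        return {
            'id': video_id,
            'title': 'Example video %s' % video_id,
            'url': 'https://cdn.example.com/%s.mp4' % video_id,
            'ext': 'mp4',
        }


ie = ExampleSiteIE()
url = 'https://video.example.com/v/abc123'
if ie.suitable(url):
    info = ie.extract(url)
    print(info['id'], info['extractor'])
```

Keeping the site-specific work in `_real_extract` lets the base class wrap every extraction with shared behavior without each subclass repeating it.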

### Common Extractor Methods

Utility methods available to all extractors for common operations.

```python { .api }
def _download_webpage(self, url_or_request, video_id, note=None, errnote=None, fatal=True, tries=1, timeout=5, encoding=None, data=None, headers={}, query={}):
    """
    Download webpage content.

    Parameters:
    - url_or_request: URL string or Request object
    - video_id (str): Video identifier for error reporting
    - note (str): Progress note to display
    - errnote (str): Error note for failures
    - fatal (bool): Whether to raise error on failure
    - tries (int): Number of download attempts
    - timeout (float): Seconds to sleep between attempts
    - encoding (str): Character encoding
    - data: POST data
    - headers (dict): HTTP headers
    - query (dict): URL query parameters

    Returns:
    str: Webpage content
    """

def _download_json(self, url_or_request, video_id, note='Downloading JSON metadata', errnote='Unable to download JSON metadata', transform_source=None, fatal=True, encoding=None, data=None, headers={}, query={}):
    """
    Download and parse JSON data.

    Parameters:
    - url_or_request: URL string or Request object
    - video_id (str): Video identifier
    - note (str): Progress note
    - errnote (str): Error note
    - transform_source (callable): Function to transform JSON source
    - fatal (bool): Whether to raise error on failure
    - encoding (str): Character encoding
    - data: POST data
    - headers (dict): HTTP headers
    - query (dict): URL query parameters

    Returns:
    dict: Parsed JSON data
    """

def _html_search_regex(self, pattern, string, name, default=NO_DEFAULT, fatal=True, flags=0, group=None):
    """
    Search for regex pattern in HTML string.

    Parameters:
    - pattern (str): Regex pattern
    - string (str): HTML string to search
    - name (str): Description for error messages
    - default: Value returned if not found; when given, no error is raised
    - fatal (bool): Whether to raise error if not found and no default is given
    - flags (int): Regex flags
    - group (int/str): Capture group to return

    Returns:
    str: Matched text
    """
```
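
The interaction between `default` and `fatal` in a regex helper is easiest to see in code. The function below is a simplified, standalone stand-in named `html_search_regex`, not the library method: it assumes that a supplied `default` takes precedence over `fatal`, uses a private sentinel `_NO_DEFAULT` so that `default=None` can be passed on purpose, raises a plain `ValueError`, and cleans HTML with a crude tag-stripping regex.

```python
import html
import re

_NO_DEFAULT = object()  # sentinel: distinguishes "no default given" from default=None


def html_search_regex(pattern, string, name, default=_NO_DEFAULT,
                      fatal=True, flags=0, group=None):
    """Simplified stand-in: search, pick a group, then strip tags and entities."""
    match = re.search(pattern, string, flags)
    if match is None:
        if default is not _NO_DEFAULT:
            return default
        if fatal:
            raise ValueError('Unable to extract %s' % name)
        return None
    if group is None:
        # Use the first capture group that matched, or the whole match if none.
        text = next((g for g in match.groups() if g is not None), match.group(0))
    else:
        text = match.group(group)
    text = re.sub(r'<[^>]+>', '', text)  # crude tag removal
    return html.unescape(text).strip()


page = '<html><h1 class="title"> Tom &amp; <b>Jerry</b> </h1></html>'
title = html_search_regex(r'<h1[^>]*>(.+?)</h1>', page, 'title')
print(title)

# A missing field with an explicit default does not raise.
subtitle = html_search_regex(r'<h2>(.+?)</h2>', page, 'subtitle', default=None)
print(subtitle)
```

Passing `default` suits optional metadata, while leaving `fatal=True` without a default suits fields the extraction cannot proceed without.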

### Extractor Information Dictionary

Standard format for information returned by extractors.

```python { .api }
InfoDict = {
    'id': str,                    # Video identifier
    'title': str,                 # Video title
    'url': str,                   # Video URL (for single videos)
    'ext': str,                   # File extension
    'format': str,                # Format description
    'format_id': str,             # Format identifier
    'uploader': str,              # Video uploader name
    'uploader_id': str,           # Uploader identifier
    'uploader_url': str,          # Uploader profile URL
    'upload_date': str,           # Upload date (YYYYMMDD format)
    'timestamp': int,             # Upload timestamp (Unix)
    'duration': int,              # Duration in seconds
    'view_count': int,            # View count
    'like_count': int,            # Like count
    'dislike_count': int,         # Dislike count
    'description': str,           # Video description
    'tags': list,                 # List of tags
    'thumbnail': str,             # Thumbnail URL
    'thumbnails': list,           # List of thumbnail dictionaries
    'subtitles': dict,            # Subtitle tracks
    'automatic_captions': dict,   # Auto-generated captions
    'formats': list,              # List of available formats
    'playlist': str,              # Playlist title (for playlist entries)
    'playlist_id': str,           # Playlist identifier
    'playlist_index': int,        # Position in playlist
    'webpage_url': str,           # Original webpage URL
    'webpage_url_basename': str,  # Basename of webpage URL
    'extractor': str,             # Extractor name
    'extractor_key': str,         # Extractor key
}
```
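
As an illustration of how the fields relate, the sketch below builds a small information dictionary by hand and derives `upload_date` from `timestamp`. The helper `upload_date_from_timestamp` and every value are made up for this example; only the key names come from the listing above, and interpreting the timestamp in UTC is an assumption.

```python
from datetime import datetime, timezone


def upload_date_from_timestamp(timestamp):
    """Format a Unix timestamp as YYYYMMDD in UTC (hypothetical helper)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime('%Y%m%d')


# Made-up values; only the key names follow the documented structure.
info = {
    'id': 'abc123',
    'title': 'Example video',
    'url': 'https://cdn.example.com/abc123.mp4',
    'ext': 'mp4',
    'timestamp': 1700000000,
    'duration': 212,
    'tags': ['example', 'demo'],
}
info['upload_date'] = upload_date_from_timestamp(info['timestamp'])
print(info['upload_date'])  # 20231114
```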

### Format Dictionary

Structure for individual video/audio format information.

```python { .api }
FormatDict = {
    'format_id': str,      # Unique format identifier
    'url': str,            # Direct media URL
    'ext': str,            # File extension
    'width': int,          # Video width
    'height': int,         # Video height
    'resolution': str,     # Resolution string
    'fps': float,          # Frames per second
    'vcodec': str,         # Video codec
    'vbr': float,          # Video bitrate
    'acodec': str,         # Audio codec
    'abr': float,          # Audio bitrate
    'asr': int,            # Audio sample rate
    'filesize': int,       # File size in bytes
    'tbr': float,          # Total bitrate
    'protocol': str,       # Download protocol
    'preference': int,     # Format preference (-1 to 100)
    'quality': int,        # Quality metric
    'format_note': str,    # Additional format info
    'language': str,       # Language code
    'http_headers': dict,  # Required HTTP headers
}
```
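
A list of such format dictionaries can be filtered and ranked with ordinary Python. The following sketch picks the tallest video format and breaks ties by total bitrate. The helper `best_video_format` is hypothetical and much simpler than youtube-dl's own format selection; the sample data is invented, and treating `'vcodec': 'none'` as the marker for audio-only entries is an assumption of this example.

```python
def best_video_format(formats, max_height=None):
    """Pick the tallest video format, breaking ties by total bitrate.

    Hypothetical helper; returns None when no video format qualifies.
    """
    candidates = [
        f for f in formats
        if f.get('vcodec') != 'none' and f.get('height')
        and (max_height is None or f['height'] <= max_height)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda f: (f['height'], f.get('tbr') or 0))


# Invented sample data using keys from the format dictionary.
formats = [
    {'format_id': 'audio', 'ext': 'm4a', 'vcodec': 'none', 'abr': 128.0},
    {'format_id': '360p', 'ext': 'mp4', 'height': 360, 'tbr': 700.0, 'vcodec': 'avc1'},
    {'format_id': '720p-lo', 'ext': 'mp4', 'height': 720, 'tbr': 1800.0, 'vcodec': 'avc1'},
    {'format_id': '720p-hi', 'ext': 'mp4', 'height': 720, 'tbr': 2500.0, 'vcodec': 'avc1'},
]
print(best_video_format(formats)['format_id'])                  # 720p-hi
print(best_video_format(formats, max_height=480)['format_id'])  # 360p
```

Using `.get()` throughout matters because most keys are optional and may be absent or `None` for a given format.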

## Site-Specific Extractors

youtube-dl includes extractors for over 1000 sites. Some notable ones include:

### Video Platforms

- **YoutubeIE**: YouTube videos, playlists, channels, live streams
- **VimeoIE**: Vimeo videos and channels
- **DailymotionIE**: Dailymotion videos and playlists
- **TwitchIE**: Twitch streams and VODs
- **FacebookIE**: Facebook videos

### News and Media

- **BBCIE**: BBC iPlayer content
- **CNNIE**: CNN video content
- **NBCIE**: NBC video content
- **CBSIE**: CBS video content

### Social Media

- **TwitterIE**: Twitter videos
- **InstagramIE**: Instagram videos and stories
- **TikTokIE**: TikTok videos

### Educational

- **CourseraIE**: Coursera course videos
- **KhanAcademyIE**: Khan Academy content
- **TedIE**: TED Talks

## Usage Examples

### List Available Extractors

```python
from youtube_dl import list_extractors

# Get all extractors suitable for an age limit of 18
extractors = list_extractors(age_limit=18)
for extractor in extractors:
    # IE_DESC is optional, so fall back to an empty description
    description = getattr(extractor, 'IE_DESC', None) or ''
    print(f"{extractor.IE_NAME}: {description}")
```

### Get Specific Extractor

```python
from youtube_dl.extractor import get_info_extractor

# Get YouTube extractor class
YoutubeIE = get_info_extractor('Youtube')
extractor = YoutubeIE()
```

### Extract Information Only

```python
from youtube_dl import YoutubeDL

ydl_opts = {'quiet': True}
with YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info('https://www.youtube.com/watch?v=dQw4w9WgXcQ', download=False)
    print(f"Title: {info['title']}")
    print(f"Duration: {info['duration']} seconds")
    print(f"Uploader: {info['uploader']}")

    # List available formats
    for fmt in info['formats']:
        # 'height' may be absent or None for audio-only formats
        height = fmt.get('height')
        resolution = f"{height}p" if height else 'audio only'
        print(f"Format: {fmt['format_id']} - {fmt['ext']} - {resolution}")
```