# Information Extractors

Information extractors are site-specific modules that handle video metadata extraction from over 1000 supported sites. Each extractor understands the specific URL patterns, API interfaces, and data structures for its target site.

## Capabilities

### Extractor Management

Functions for discovering, listing, and managing available extractors.

```python { .api }
def gen_extractors():
    """
    Return a list of instances of every supported extractor.
    The order matters; the first extractor matched handles the URL.

    Returns:
    list: List of extractor instances
    """

def gen_extractor_classes():
    """
    Return a list of supported extractor classes.
    The order matters; the first extractor matched handles the URL.

    Returns:
    list: List of extractor classes
    """

def list_extractors(age_limit):
    """
    Return a list of extractors suitable for the given age limit,
    sorted by extractor ID.

    Parameters:
    - age_limit (int): Age limit for content filtering

    Returns:
    list: List of suitable extractor instances
    """

def get_info_extractor(ie_name):
    """
    Returns the info extractor class with the given name.

    Parameters:
    - ie_name (str): Extractor name (without 'IE' suffix)

    Returns:
    class: Extractor class
    """
```
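
The ordering rule stated in the docstrings above can be illustrated without the library installed. The sketch below is a simplified stand-in, not youtube-dl's code: `ExampleVideoIE`, `GenericIE`, `EXTRACTOR_CLASSES`, and `pick_extractor` are hypothetical names, and the URL patterns are invented. It assumes each extractor class declares a URL regex and answers `suitable(url)`.

```python
import re


class ExampleVideoIE:
    """Hypothetical site-specific extractor."""
    _VALID_URL = r'https?://video\.example/watch/(?P<id>\w+)'

    @classmethod
    def suitable(cls, url):
        return re.match(cls._VALID_URL, url) is not None


class GenericIE:
    """Hypothetical catch-all that accepts any HTTP(S) URL."""
    _VALID_URL = r'https?://.+'

    @classmethod
    def suitable(cls, url):
        return re.match(cls._VALID_URL, url) is not None


# Order matters: the catch-all must come last, or it would shadow
# every site-specific extractor listed after it.
EXTRACTOR_CLASSES = [ExampleVideoIE, GenericIE]


def pick_extractor(url):
    """Return the first class whose suitable() accepts the URL, else None."""
    for ie_class in EXTRACTOR_CLASSES:
        if ie_class.suitable(url):
            return ie_class
    return None


print(pick_extractor('https://video.example/watch/abc123').__name__)
print(pick_extractor('https://other.example/page').__name__)
```

Swapping the two entries in `EXTRACTOR_CLASSES` would route every URL to the catch-all, which is why a first-match scheme is sensitive to list order.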

### Base InfoExtractor Class

Base class that all site-specific extractors inherit from, providing common functionality and interfaces.

```python { .api }
class InfoExtractor:
    def __init__(self, downloader=None):
        """
        Base class for information extractors.

        Parameters:
        - downloader: YoutubeDL instance
        """

    @classmethod
    def suitable(cls, url):
        """
        Check if the extractor is suitable for the given URL.

        Parameters:
        - url (str): URL to check

        Returns:
        bool: True if suitable, False otherwise
        """

    def extract(self, url):
        """
        Extract information from the given URL.

        Parameters:
        - url (str): URL to extract from

        Returns:
        dict: Extracted information dictionary
        """

    def _real_extract(self, url):
        """
        Actual extraction logic (implemented by subclasses).

        Parameters:
        - url (str): URL to extract from

        Returns:
        dict: Extracted information dictionary
        """
```
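
How a subclass plugs into this interface can be sketched with a stand-in base class. `MiniExtractor` and `ExampleIE` are hypothetical, and the URLs are invented; the real `InfoExtractor.extract` does more than delegate, so treat this only as an illustration of the split between the public `extract` and the subclass hook `_real_extract`.

```python
import re


class MiniExtractor:
    """Hypothetical stand-in for the base class (not youtube-dl's code)."""
    _VALID_URL = None

    @classmethod
    def suitable(cls, url):
        return re.match(cls._VALID_URL, url) is not None

    def extract(self, url):
        # Public entry point: delegate to the subclass hook.
        return self._real_extract(url)

    def _real_extract(self, url):
        raise NotImplementedError('subclasses must implement _real_extract')


class ExampleIE(MiniExtractor):
    _VALID_URL = r'https?://video\.example/watch/(?P<id>\w+)'

    def _real_extract(self, url):
        video_id = re.match(self._VALID_URL, url).group('id')
        # A real extractor would download and parse the page here.
        return {
            'id': video_id,
            'title': 'Example video %s' % video_id,
            'url': 'https://cdn.example/%s.mp4' % video_id,
            'ext': 'mp4',
        }


info = ExampleIE().extract('https://video.example/watch/abc123')
print(info['id'], info['ext'])
```

Keeping the site-specific work in `_real_extract` leaves the base class free to wrap every extraction with shared behavior in one place.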

### Common Extractor Methods

Utility methods available to all extractors for common operations.

```python { .api }
def _download_webpage(self, url_or_request, video_id, note=None, errnote=None, fatal=True, tries=1, timeout=5, encoding=None, data=None, headers={}, query={}):
    """
    Download webpage content.

    Parameters:
    - url_or_request: URL string or Request object
    - video_id (str): Video identifier for error reporting
    - note (str): Progress note to display
    - errnote (str): Error note for failures
    - fatal (bool): Whether to raise error on failure
    - tries (int): Number of download attempts
    - timeout (float): Sleep interval in seconds between attempts
    - encoding (str): Character encoding
    - data: POST data
    - headers (dict): HTTP headers
    - query (dict): URL query parameters

    Returns:
    str: Webpage content
    """

def _download_json(self, url_or_request, video_id, note='Downloading JSON metadata', errnote='Unable to download JSON metadata', transform_source=None, fatal=True, encoding=None, data=None, headers={}, query={}):
    """
    Download and parse JSON data.

    Parameters:
    - url_or_request: URL string or Request object
    - video_id (str): Video identifier
    - note (str): Progress note
    - errnote (str): Error note
    - transform_source (callable): Function to transform JSON source
    - fatal (bool): Whether to raise error on failure
    - encoding (str): Character encoding
    - data: POST data
    - headers (dict): HTTP headers
    - query (dict): URL query parameters

    Returns:
    dict: Parsed JSON data
    """

def _html_search_regex(self, pattern, string, name, default=None, fatal=True, flags=0, group=None):
    """
    Search for regex pattern in HTML string.

    Parameters:
    - pattern (str): Regex pattern
    - string (str): HTML string to search
    - name (str): Description for error messages
    - default: Default value if not found
    - fatal (bool): Whether to raise error if not found
    - flags (int): Regex flags
    - group (int/str): Capture group to return

    Returns:
    str: Matched text
    """
```
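
The interplay of `default` and `fatal` in `_html_search_regex` can be approximated with the standard library. `search_html` below is a hypothetical helper, not the library's implementation. It assumes that a supplied `default` takes precedence over `fatal` and that the match is returned with HTML entities unescaped; for brevity it always returns capture group 1 unless told otherwise. The `_MISSING` sentinel is introduced here to tell "no default given" apart from `default=None`.

```python
import html
import re

_MISSING = object()  # sentinel: distinguishes "no default" from default=None


def search_html(pattern, string, name, default=_MISSING, fatal=True, flags=0, group=1):
    """Return the captured group with HTML entities unescaped."""
    match = re.search(pattern, string, flags)
    if match:
        return html.unescape(match.group(group)).strip()
    if default is not _MISSING:
        # A default silences the error even when fatal=True.
        return default
    if fatal:
        raise ValueError('Unable to extract %s' % name)
    return None


page = '<html><head><title>Cats &amp; Dogs</title></head></html>'
print(search_html(r'<title>([^<]+)</title>', page, 'title'))
print(search_html(r'<h1>([^<]+)</h1>', page, 'heading', default='(none)'))
```

Passing `fatal=False` without a default makes a failed search return `None`, which suits optional fields such as a description.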

### Extractor Information Dictionary

Standard format for information returned by extractors.

```python { .api }
InfoDict = {
    'id': str,                    # Video identifier
    'title': str,                 # Video title
    'url': str,                   # Video URL (for single videos)
    'ext': str,                   # File extension
    'format': str,                # Format description
    'format_id': str,             # Format identifier
    'uploader': str,              # Video uploader name
    'uploader_id': str,           # Uploader identifier
    'uploader_url': str,          # Uploader profile URL
    'upload_date': str,           # Upload date (YYYYMMDD format)
    'timestamp': int,             # Upload timestamp (Unix)
    'duration': int,              # Duration in seconds
    'view_count': int,            # View count
    'like_count': int,            # Like count
    'dislike_count': int,         # Dislike count
    'description': str,           # Video description
    'tags': list,                 # List of tags
    'thumbnail': str,             # Thumbnail URL
    'thumbnails': list,           # List of thumbnail dictionaries
    'subtitles': dict,            # Subtitle tracks
    'automatic_captions': dict,   # Auto-generated captions
    'formats': list,              # List of available formats
    'playlist': str,              # Playlist title (for playlist entries)
    'playlist_id': str,           # Playlist identifier
    'playlist_index': int,        # Position in playlist
    'webpage_url': str,           # Original webpage URL
    'webpage_url_basename': str,  # Basename of webpage URL
    'extractor': str,             # Extractor name
    'extractor_key': str,         # Extractor key
}
```
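
The relationship between the `timestamp` and `upload_date` fields listed above can be shown with a small example. The values are invented, and `upload_date_from_timestamp` is a hypothetical helper that assumes the date is taken in UTC.

```python
from datetime import datetime, timezone


def upload_date_from_timestamp(timestamp):
    """Format a Unix timestamp as YYYYMMDD in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime('%Y%m%d')


# A minimal info dictionary for a single video (invented values).
info = {
    'id': 'abc123',
    'title': 'Example video',
    'url': 'https://cdn.example/abc123.mp4',
    'ext': 'mp4',
    'timestamp': 1577836800,  # 2020-01-01 00:00:00 UTC
    'duration': 212,
}
info['upload_date'] = upload_date_from_timestamp(info['timestamp'])
print(info['upload_date'])
```

Using UTC keeps the derived date independent of the time zone of the machine running the code.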

### Format Dictionary

Structure for individual video/audio format information.

```python { .api }
FormatDict = {
    'format_id': str,      # Unique format identifier
    'url': str,            # Direct media URL
    'ext': str,            # File extension
    'width': int,          # Video width
    'height': int,         # Video height
    'resolution': str,     # Resolution string
    'fps': float,          # Frames per second
    'vcodec': str,         # Video codec
    'vbr': float,          # Video bitrate
    'acodec': str,         # Audio codec
    'abr': float,          # Audio bitrate
    'asr': int,            # Audio sample rate
    'filesize': int,       # File size in bytes
    'tbr': float,          # Total bitrate
    'protocol': str,       # Download protocol
    'preference': int,     # Format preference (-1 to 100)
    'quality': int,        # Quality metric
    'format_note': str,    # Additional format info
    'language': str,       # Language code
    'http_headers': dict,  # Required HTTP headers
}
```
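
A consumer of these dictionaries often wants to pick a single format. The sketch below shows one possible selection rule, not youtube-dl's own format sorting: `best_video_format` is a hypothetical helper that skips audio-only entries (assumed here to be marked with `vcodec == 'none'`) and prefers greater height, then higher total bitrate. The sample data is invented.

```python
def best_video_format(formats):
    """Return the video format with the greatest height, then highest tbr."""
    video_formats = [f for f in formats if f.get('vcodec') != 'none']
    if not video_formats:
        return None
    # Missing or None values count as 0 so incomplete entries rank last.
    return max(video_formats, key=lambda f: (f.get('height') or 0, f.get('tbr') or 0))


formats = [
    {'format_id': 'audio', 'ext': 'm4a', 'vcodec': 'none', 'acodec': 'mp4a.40.2', 'abr': 128.0},
    {'format_id': '360p', 'ext': 'mp4', 'height': 360, 'tbr': 700.0},
    {'format_id': '720p-lo', 'ext': 'mp4', 'height': 720, 'tbr': 1500.0},
    {'format_id': '720p-hi', 'ext': 'mp4', 'height': 720, 'tbr': 2500.0},
]
print(best_video_format(formats)['format_id'])
```

Reading optional keys with `.get()` matters because many fields in a format dictionary may be absent or `None`.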

## Site-Specific Extractors

youtube-dl includes extractors for over 1000 sites. Some notable ones include:

### Video Platforms
- **YoutubeIE**: YouTube videos, playlists, channels, live streams
- **VimeoIE**: Vimeo videos and channels
- **DailymotionIE**: Dailymotion videos and playlists
- **TwitchIE**: Twitch streams and VODs
- **FacebookIE**: Facebook videos

### News and Media
- **BBCIE**: BBC iPlayer content
- **CNNIE**: CNN video content
- **NBCIE**: NBC video content
- **CBSIE**: CBS video content

### Social Media
- **TwitterIE**: Twitter videos
- **InstagramIE**: Instagram videos and stories
- **TikTokIE**: TikTok videos

### Educational
- **CourseraIE**: Coursera course videos
- **KhanAcademyIE**: Khan Academy content
- **TedIE**: TED Talks

## Usage Examples

### List Available Extractors
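```python
from youtube_dl import list_extractors

# Get all extractors suitable for an age limit of 18
extractors = list_extractors(age_limit=18)
for extractor in extractors:
    # IE_DESC is optional, so fall back to the extractor name
    description = getattr(extractor, 'IE_DESC', None) or extractor.IE_NAME
    print(f"{extractor.IE_NAME}: {description}")
```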

### Get Specific Extractor
```python
from youtube_dl.extractor import get_info_extractor

# Get YouTube extractor class
YoutubeIE = get_info_extractor('Youtube')
extractor = YoutubeIE()
```

### Extract Information Only
```python
from youtube_dl import YoutubeDL

ydl_opts = {'quiet': True}
with YoutubeDL(ydl_opts) as ydl:
    # download=False returns the metadata without fetching the media
    info = ydl.extract_info('https://www.youtube.com/watch?v=dQw4w9WgXcQ', download=False)

print(f"Title: {info['title']}")
print(f"Duration: {info.get('duration')} seconds")
print(f"Uploader: {info.get('uploader')}")

# List available formats; audio-only formats have no height
for fmt in info.get('formats', []):
    height = fmt.get('height')
    resolution = f"{height}p" if height else 'audio only'
    print(f"Format: {fmt['format_id']} - {fmt['ext']} - {resolution}")
```