# HTTP Features

Feedparser provides comprehensive HTTP client capabilities for fetching feeds from URLs, including conditional requests, custom headers, authentication support, and redirect handling.

## Capabilities

### Global Configuration Constants

Configure default HTTP behavior for all parsing operations.

```python { .api }
USER_AGENT: str = "feedparser/{version} +https://github.com/kurtmckee/feedparser/"
# Default HTTP User-Agent header sent with requests

RESOLVE_RELATIVE_URIS: int = 1
# Global setting: resolve relative URIs to absolute (1=enabled, 0=disabled)

SANITIZE_HTML: int = 1
# Global setting: sanitize HTML content (1=enabled, 0=disabled)
```

### HTTP Response Information

When parsing from URLs, the result contains comprehensive HTTP response data:

```python { .api }
# HTTP response fields in result
result = {
    'status': int,     # HTTP status code (200, 304, 404, etc.)
    'headers': dict,   # All HTTP response headers
    'etag': str,       # HTTP ETag header for caching
    'modified': str,   # HTTP Last-Modified header
    'href': str,       # Final URL after redirects
}
```
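
As a minimal illustration (assuming the example URL is reachable), these fields can be read with `.get()` lookups, since keys such as `etag` are present only when the server sent the corresponding header:

```python
import feedparser

result = feedparser.parse('https://example.com/feed.xml')

# These keys exist only for URL-based parses, so use .get()
print(f"Status: {result.get('status')}")
print(f"Final URL: {result.get('href')}")
print(f"ETag: {result.get('etag')}")
print(f"Last-Modified: {result.get('modified')}")
```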

## HTTP Client Features

### User-Agent Configuration

Set a custom User-Agent string to identify your client:

```python
import feedparser

# Set global User-Agent for all requests
feedparser.USER_AGENT = 'MyFeedReader/1.0 (+https://example.com/bot.html)'

# Or specify per-request
result = feedparser.parse(
    url,
    agent='MyBot/2.0 (contact@example.com)'
)
```

### Custom Request Headers

Add custom HTTP headers to requests:

```python
# Add authorization
result = feedparser.parse(
    url,
    request_headers={
        'Authorization': 'Bearer your-token-here',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate',
    }
)

# Override default headers
result = feedparser.parse(
    url,
    request_headers={
        'User-Agent': 'CustomBot/1.0',     # Overrides agent parameter
        'Referer': 'https://example.com',  # Custom referer
    }
)
```

### Conditional Requests (Caching)

Use ETags and Last-Modified headers for efficient feed polling:

```python
# Initial request - save caching headers
result = feedparser.parse('https://example.com/feed.xml')

# Store caching information
etag = result.get('etag')
modified = result.get('modified')

# Subsequent conditional request
result = feedparser.parse(
    'https://example.com/feed.xml',
    etag=etag,
    modified=modified
)

# Check if content was modified
if result.status == 304:
    print("Feed not modified - use cached version")
else:
    print(f"Feed updated - {len(result.entries)} entries")
```
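
For repeated polling, the caching headers need to survive between runs. The sketch below persists them in a JSON file; the `CACHE_FILE` name and the `poll` helper are illustrative, not part of feedparser's API:

```python
import json
import os

import feedparser

CACHE_FILE = 'feed_cache.json'  # hypothetical cache location

def poll(url):
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)

    result = feedparser.parse(
        url,
        etag=cache.get('etag'),
        modified=cache.get('modified'),
    )

    if result.get('status') == 304:
        return []  # nothing new since the last poll

    # Save the new validators for the next run
    with open(CACHE_FILE, 'w') as f:
        json.dump(
            {'etag': result.get('etag'), 'modified': result.get('modified')},
            f,
        )
    return result.entries
```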

### HTTP Authentication

Feedparser supports various authentication methods through custom handlers:

```python
import urllib.request
import feedparser

# Basic authentication
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'https://example.com/', 'username', 'password')

auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

result = feedparser.parse(
    'https://example.com/protected-feed.xml',
    handlers=[auth_handler]
)

# Digest authentication
digest_handler = urllib.request.HTTPDigestAuthHandler(password_mgr)

result = feedparser.parse(
    url,
    handlers=[digest_handler]
)
```
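
For basic authentication, feedparser also accepts credentials embedded directly in the URL, which avoids the handler boilerplate. Bear in mind that credentials in URLs can end up in logs:

```python
import feedparser

# user:password embedded in the URL triggers basic auth
result = feedparser.parse('https://username:password@example.com/protected-feed.xml')
```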

### Proxy Support

Configure proxy settings using urllib handlers:

```python
import urllib.request
import feedparser

# HTTP proxy
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://proxy.example.com:8080',
    'https': 'https://proxy.example.com:8080'
})

result = feedparser.parse(
    url,
    handlers=[proxy_handler]
)

# Authenticated proxy
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'proxy.example.com', 'username', 'password')

result = feedparser.parse(
    url,
    handlers=[proxy_handler, proxy_auth_handler]
)
```
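
When constructed without arguments, `urllib.request.ProxyHandler` picks up proxy settings from the standard environment variables (`http_proxy`, `https_proxy`), which can be more convenient than hard-coding addresses:

```python
import urllib.request
import feedparser

# Uses http_proxy / https_proxy environment variables, if set
env_proxy_handler = urllib.request.ProxyHandler()

result = feedparser.parse(
    url,
    handlers=[env_proxy_handler]
)
```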

### Custom URL Handlers

Extend feedparser with custom protocol handlers:

```python
import urllib.request
import feedparser

class CustomHTTPHandler(urllib.request.HTTPHandler):
    def http_open(self, req):
        # Custom HTTP handling logic
        print(f"Fetching: {req.get_full_url()}")
        return super().http_open(req)

custom_handler = CustomHTTPHandler()

result = feedparser.parse(
    url,
    handlers=[custom_handler]
)
```

### SSL/TLS Configuration

Configure SSL settings for HTTPS requests. Disabling verification, as shown here, exposes you to man-in-the-middle attacks, so reserve it for testing:

```python
import ssl
import urllib.request
import feedparser

# Create SSL context with custom settings (testing only)
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False       # Disable hostname verification
ssl_context.verify_mode = ssl.CERT_NONE  # Disable certificate verification

# Create HTTPS handler with custom context
https_handler = urllib.request.HTTPSHandler(context=ssl_context)

result = feedparser.parse(
    'https://example.com/feed.xml',
    handlers=[https_handler]
)
```
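
If the server uses a certificate signed by a private CA, a safer alternative to disabling verification is loading that CA bundle into the context. A minimal sketch; the bundle path and internal hostname are hypothetical:

```python
import ssl
import urllib.request
import feedparser

# Trust a private CA instead of disabling verification
ssl_context = ssl.create_default_context(cafile='/etc/ssl/private-ca.pem')
https_handler = urllib.request.HTTPSHandler(context=ssl_context)

result = feedparser.parse(
    'https://internal.example.com/feed.xml',
    handlers=[https_handler]
)
```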

### Redirect Handling

Feedparser automatically follows redirects and reports the final URL:

```python
result = feedparser.parse('https://example.com/redirect-to-feed')

# Check if redirects occurred
original_url = 'https://example.com/redirect-to-feed'
final_url = result.get('href', '')

if final_url and final_url != original_url:
    print(f"Redirected from {original_url} to {final_url}")

# A 301 status marks a permanent redirect
if result.get('status') == 301:
    print(f"Feed moved permanently to {final_url}")
```
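
A feed reader should react to permanent redirects by updating its stored subscription URL so future polls skip the redirect hop. A minimal sketch, with a hypothetical `subscriptions` dict standing in for your storage layer:

```python
import feedparser

subscriptions = {'example': 'https://example.com/old-feed.xml'}  # hypothetical store

url = subscriptions['example']
result = feedparser.parse(url)

if result.get('status') == 301 and result.get('href'):
    # Permanent redirect: remember the new location
    subscriptions['example'] = result['href']
```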

## Response Header Handling

### Accessing Response Headers

```python
result = feedparser.parse(url)

# Access all headers (keys are lowercase)
headers = result.headers
print(f"Content-Type: {headers.get('content-type')}")
print(f"Content-Length: {headers.get('content-length')}")
print(f"Server: {headers.get('server')}")

# Check for specific caching headers
if 'etag' in headers:
    print(f"ETag: {headers['etag']}")

if 'last-modified' in headers:
    print(f"Last-Modified: {headers['last-modified']}")

# Check content encoding
if 'content-encoding' in headers:
    print(f"Compression: {headers['content-encoding']}")
```

### Overriding Response Headers

Useful for testing, or when parsing content that was not fetched over HTTP:

```python
# Override/supplement response headers
result = feedparser.parse(
    content_string,
    response_headers={
        'content-type': 'application/rss+xml; charset=utf-8',
        'content-location': 'https://example.com/feed.xml',
        'last-modified': 'Mon, 06 Sep 2021 12:00:00 GMT',
        'etag': '"abc123"'
    }
)

# Headers affect base URI resolution and caching behavior
print(f"Base URI: {result.get('href')}")
```

## Error Handling

### HTTP Status Codes

```python
result = feedparser.parse(url)

# Check HTTP status
status = result.get('status', 0)

if status == 200:
    print("Feed fetched successfully")
elif status == 304:
    print("Feed not modified (cached version is current)")
elif status == 404:
    print("Feed not found")
elif status == 403:
    print("Access forbidden")
elif status >= 500:
    print(f"Server error: {status}")
elif status >= 400:
    print(f"Client error: {status}")
else:
    print(f"Unexpected status: {status}")

# Process feed data regardless of minor HTTP issues
if result.entries:
    print(f"Found {len(result.entries)} entries despite HTTP status {status}")
```
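
One status deserves special handling in a polite feed reader: 410 Gone means the feed has been permanently removed, so the client should stop polling it. A sketch, with `unsubscribe` as a hypothetical callback into your own scheduling code:

```python
import feedparser

result = feedparser.parse(url)

if result.get('status') == 410:
    # 410 Gone: the publisher removed the feed permanently
    unsubscribe(url)  # hypothetical: drop it from the polling schedule
```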

### Network Error Handling

```python
import urllib.error
import feedparser

try:
    result = feedparser.parse(url)

    # Check for network-related bozo exceptions
    if result.bozo and isinstance(result.bozo_exception, urllib.error.URLError):
        print(f"Network error: {result.bozo_exception}")

        # Specific error types (HTTPError is a subclass of URLError)
        if isinstance(result.bozo_exception, urllib.error.HTTPError):
            print(f"HTTP Error {result.bozo_exception.code}: {result.bozo_exception.reason}")
        else:
            print(f"URL Error: {result.bozo_exception.reason}")

    # Process any data that was retrieved
    if result.entries:
        print("Some data was retrieved despite errors")

except Exception as e:
    print(f"Unexpected error: {e}")
```

### Timeout Configuration

`feedparser.parse()` does not accept a per-request timeout, so set a global socket timeout before fetching:

```python
import socket
import feedparser

# Set global socket timeout (applies to all subsequent network I/O)
socket.setdefaulttimeout(30)  # 30 seconds

result = feedparser.parse(url)
```

## Content-Type Handling

Feedparser handles various content types gracefully:

```python
result = feedparser.parse(url)

# Check detected content type
content_type = result.headers.get('content-type', '')

if 'xml' in content_type.lower():
    print("XML content detected")
elif 'html' in content_type.lower():
    print("HTML content - may use loose parser")

# Check for non-XML content type exception
if result.bozo and isinstance(result.bozo_exception, feedparser.NonXMLContentType):
    print(f"Non-XML content type: {content_type}")
    # Feedparser will still attempt to parse
```

## Compression Support

Feedparser automatically handles compressed responses:

```python
# Automatic gzip/deflate decompression
result = feedparser.parse(url)

# Check if content was compressed
content_encoding = result.headers.get('content-encoding', '')
if content_encoding:
    print(f"Content was compressed with: {content_encoding}")

# Request specific compression (feedparser decompresses gzip and deflate)
result = feedparser.parse(
    url,
    request_headers={
        'Accept-Encoding': 'gzip, deflate'
    }
)
```

## Global Configuration Examples

```python
import feedparser

# Configure global defaults
feedparser.USER_AGENT = 'MyFeedAggregator/1.0 (+https://example.com)'
feedparser.RESOLVE_RELATIVE_URIS = 1  # Enable URI resolution
feedparser.SANITIZE_HTML = 1          # Enable HTML sanitization

# All subsequent parse() calls use these defaults
result1 = feedparser.parse(url1)
result2 = feedparser.parse(url2)

# Override global settings per-request
result3 = feedparser.parse(
    url3,
    agent='SpecialBot/2.0',  # Override global USER_AGENT
    sanitize_html=False      # Override global SANITIZE_HTML
)
```
```