# Core Parsing

Feedparser's core parsing function accepts several kinds of input source, exposes per-call configuration options, and automatically detects the feed format across the RSS and Atom families.

## Capabilities

### Main Parse Function

`parse()` is the primary entry point. It handles URLs, file paths, file-like objects, and strings, and accepts keyword arguments that control HTTP behavior and content processing.

```python { .api }
def parse(url_file_stream_or_string, etag=None, modified=None, agent=None, referrer=None, handlers=None, request_headers=None, response_headers=None, resolve_relative_uris=None, sanitize_html=None):
    """
    Parse a feed from a URL, file, stream, or string.

    Args:
        url_file_stream_or_string: File-like object, URL, file path, or string.
            Both byte and text strings are accepted. If necessary, encoding will
            be derived from response headers or automatically detected.

            Note: Strings may trigger network I/O or filesystem access depending
            on the value. Wrap untrusted strings in io.StringIO or io.BytesIO
            to avoid this. Do not pass untrusted strings to this function.

        etag (str, optional): HTTP ETag request header for conditional requests.

        modified (str/time.struct_time/datetime, optional): HTTP Last-Modified
            request header for conditional requests. Can be a string, 9-tuple
            from gmtime(), or datetime object. Must be in GMT.

        agent (str, optional): HTTP User-Agent request header. Defaults to
            feedparser.USER_AGENT if not specified.

        referrer (str, optional): HTTP Referer request header.

        handlers (list, optional): List of urllib handlers to build custom opener.

        request_headers (dict, optional): Mapping of HTTP header names to values
            that will override internally generated request headers.

        response_headers (dict, optional): Mapping of HTTP header names to values.
            If an HTTP request was made, these override matching response headers.
            Otherwise, this specifies the entirety of response headers.

        resolve_relative_uris (bool, optional): Whether to resolve relative URIs
            to absolute ones within HTML content. Defaults to RESOLVE_RELATIVE_URIS.

        sanitize_html (bool, optional): Whether to sanitize HTML content.
            Only disable if you know what you're doing! Defaults to SANITIZE_HTML.

    Returns:
        FeedParserDict: Parsed feed data containing:
            - bozo: Boolean indicating parsing issues
            - bozo_exception: Exception if parsing errors occurred
            - encoding: Character encoding used
            - etag: HTTP ETag from response
            - headers: HTTP response headers dict
            - href: Final URL after redirects
            - modified: HTTP Last-Modified header
            - namespaces: XML namespaces used
            - status: HTTP status code
            - version: Feed format version
            - entries: List of entry/item dictionaries
            - feed: Feed-level metadata dictionary
    """
```

### Input Source Types

Feedparser accepts several input source types:

```python
import io

import feedparser

# Parse from a URL
result = feedparser.parse('https://example.com/feed.xml')

# Parse from a local file path
result = feedparser.parse('/path/to/feed.xml')

# Parse from a file-like object
with open('feed.xml', 'rb') as f:
    result = feedparser.parse(f)

# Parse from string content (XML/HTML)
xml_content = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>Test Item</title></item>
  </channel>
</rss>"""
result = feedparser.parse(xml_content)

# Parse from bytes
result = feedparser.parse(xml_content.encode('utf-8'))

# Wrap untrusted content in StringIO/BytesIO so it is never treated as a URL or path
untrusted_content = xml_content  # placeholder for data from an untrusted source
result = feedparser.parse(io.StringIO(untrusted_content))
```
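
The wrapping advice for untrusted input can be applied consistently with a small helper. The following is a minimal sketch using only the standard library; `as_stream` is a hypothetical name and is not part of feedparser. It turns text or bytes into an in-memory stream, so the value can only be read as feed content.

```python
import io


def as_stream(content):
    """Wrap str or bytes in an in-memory stream (hypothetical helper).

    A stream is read as feed data; it is not interpreted as a URL or a
    file path, which is the reason for wrapping untrusted input.
    """
    if isinstance(content, (bytes, bytearray)):
        return io.BytesIO(bytes(content))
    if isinstance(content, str):
        return io.StringIO(content)
    raise TypeError(f"expected str or bytes, got {type(content).__name__}")
```

With this helper, a call would look like `feedparser.parse(as_stream(untrusted_content))`. A string that happens to look like a URL is then parsed as (invalid) feed data instead of being fetched.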

### Conditional Requests

Use the `ETag` and `Last-Modified` validators for efficient feed polling:

```python
import feedparser

# Initial request
result = feedparser.parse('https://example.com/feed.xml')
# Servers are not required to send either header, so use .get()
# instead of attribute access, which raises AttributeError when absent
etag = result.get('etag')
modified = result.get('modified')

# Subsequent conditional request
result = feedparser.parse(
    'https://example.com/feed.xml',
    etag=etag,
    modified=modified
)

# Check if feed was modified (status is absent if no HTTP response arrived)
if result.get('status') == 304:
    print("Feed not modified")
else:
    print("Feed was updated")
```
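
When polling repeatedly, the validators have to be carried from one call to the next. The sketch below shows one way to do that with two small functions; `remember_validators` and `should_process` are hypothetical names, not feedparser APIs. They rely only on the result supporting `.get()`, so they also work on plain dictionaries.

```python
def remember_validators(result):
    """Extract the cache validators from a parse() result, skipping absent ones."""
    validators = {}
    for key in ("etag", "modified"):
        value = result.get(key)
        if value:
            validators[key] = value
    return validators


def should_process(result):
    """Return False for a 304 Not Modified response, True otherwise."""
    return result.get("status") != 304


# Usage sketch (requires feedparser and network access):
#     state = {}
#     result = feedparser.parse(url, **state)
#     if should_process(result):
#         ...  # handle result.entries
#     state.update(remember_validators(result))
```

Because the returned dictionary uses the keys `etag` and `modified`, it can be unpacked directly into the keyword arguments of `parse()`.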

### Custom HTTP Configuration

Configure HTTP behavior with a custom User-Agent and custom headers:

```python
import feedparser

url = 'https://example.com/feed.xml'

# Custom User-Agent
result = feedparser.parse(
    url,
    agent='MyApplication/1.0 (+https://example.com/bot.html)'
)

# Custom request headers
result = feedparser.parse(
    url,
    request_headers={
        'Authorization': 'Bearer token123',
        'Accept-Language': 'en-US,en;q=0.9'
    }
)

# Custom response headers (for testing or overrides)
content = '<rss version="2.0"><channel><title>Example Feed</title></channel></rss>'
result = feedparser.parse(
    content,
    response_headers={
        'Content-Type': 'application/rss+xml',
        'Content-Location': 'https://example.com/feed.xml'
    }
)
```

### Content Processing Options

Control URI resolution and HTML sanitization:

```python
import feedparser

url = 'https://example.com/feed.xml'

# Disable relative URI resolution
result = feedparser.parse(url, resolve_relative_uris=False)

# Disable HTML sanitization (use with caution!)
result = feedparser.parse(url, sanitize_html=False)

# Combine multiple options
result = feedparser.parse(
    url,
    agent='MyBot/1.0',
    resolve_relative_uris=True,
    sanitize_html=True,
    request_headers={'Accept': 'application/atom+xml,application/rss+xml'}
)
```

### Format Detection

Feedparser automatically detects the feed format and reports it in the `version` field:

```python
import feedparser

result = feedparser.parse('https://example.com/feed.xml')

# Possible values: 'rss090', 'rss091n', 'rss091u', 'rss092', 'rss093',
# 'rss094', 'rss20', 'rss10', 'rss', 'atom01', 'atom02', 'atom03',
# 'atom10', 'atom', 'cdf', or '' (unknown)
# Common values are 'rss20', 'atom10', and 'rss10'.
# .get() is used in case the field is missing, e.g. after a failed fetch
version = result.get('version', '')
if version:
    print(f"Detected feed format: {version}")
else:
    print("Unknown feed format")
```
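
Applications often only need to know the format family, not the exact version. A minimal sketch, assuming the version strings listed above keep their `rss`, `atom`, and `cdf` prefixes; `feed_family` is a hypothetical helper, not a feedparser function:

```python
def feed_family(version):
    """Map a version string such as 'rss20' to 'rss', 'atom', 'cdf', or 'unknown'."""
    for family in ("rss", "atom", "cdf"):
        if version and version.startswith(family):
            return family
    return "unknown"
```

It could be called as `feed_family(result.get('version', ''))`; an empty or missing version maps to `'unknown'`.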

### Global Configuration

Set module-level defaults that apply to all parsing operations:

```python
import feedparser

# Set global User-Agent
feedparser.USER_AGENT = 'MyApplication/2.0 (+https://example.com)'

# Disable URI resolution globally
feedparser.RESOLVE_RELATIVE_URIS = 0

# Disable HTML sanitization globally (use with caution!)
feedparser.SANITIZE_HTML = 0

# These settings affect all subsequent parse() calls unless overridden per call
url = 'https://example.com/feed.xml'
result = feedparser.parse(url)  # Uses global settings
```

### Error Handling During Parsing

Parsing problems are reported on the result through `bozo` and `bozo_exception`, and HTTP errors appear in `status`:

```python
import feedparser

url = 'https://example.com/feed.xml'

try:
    result = feedparser.parse(url)

    # Check for well-formedness issues
    if result.bozo:
        print(f"Feed had issues: {result.bozo_exception}")

        # Common exception types
        if isinstance(result.bozo_exception, feedparser.NonXMLContentType):
            print("Content was not XML")
        elif isinstance(result.bozo_exception, feedparser.CharacterEncodingUnknown):
            print("Could not determine character encoding")

    # Check HTTP status (only present when an HTTP response was received)
    if hasattr(result, 'status'):
        if result.status == 404:
            print("Feed not found")
        elif result.status >= 400:
            print(f"HTTP error: {result.status}")

    # Process feed data
    if result.entries:
        print(f"Found {len(result.entries)} entries")
    else:
        print("No entries found")

except Exception as e:
    print(f"Parsing failed: {e}")
```

## Parser Selection

Feedparser automatically selects between strict and lenient parsing modes based on the content:

- **Strict parsing**: Used for well-formed XML feeds; leverages `xml.sax` with namespace support
- **Lenient parsing**: Used for malformed content; provides HTML-style parsing with error recovery

Parser selection is automatic and internal, so users don't need to interact with parser classes directly.
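
To make the strict-then-lenient idea concrete, here is a conceptual sketch built only on the standard library. It is not feedparser's implementation: `TitleHandler`, `LooseTitleParser`, and `extract_titles` are hypothetical names, and the sketch extracts nothing but `<title>` text. It tries a SAX parse first and, if the document is not well-formed, falls back to a tolerant HTML-style parser and flags the result.

```python
import io
import xml.sax
from html.parser import HTMLParser


class TitleHandler(xml.sax.ContentHandler):
    """Strict path: collect the text of every <title> element via SAX."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None

    def startElement(self, name, attrs):
        if name == "title":
            self._buf = []

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title" and self._buf is not None:
            self.titles.append("".join(self._buf).strip())
            self._buf = None


class LooseTitleParser(HTMLParser):
    """Lenient path: collect <title> text while tolerating malformed markup."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._buf is not None:
            self.titles.append("".join(self._buf).strip())
            self._buf = None


def extract_titles(data: bytes):
    """Return (titles, bozo); bozo is True when the strict parse failed."""
    handler = TitleHandler()
    try:
        xml.sax.parse(io.BytesIO(data), handler)
        return handler.titles, False
    except xml.sax.SAXParseException:
        # Discard partial strict results and re-read the input leniently
        loose = LooseTitleParser()
        loose.feed(data.decode("utf-8", errors="replace"))
        loose.close()
        return loose.titles, True
```

The returned flag plays a role similar to `bozo` in feedparser results: the data is still usable, but the caller learns that the input was not well-formed XML.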

## Internal Implementation Notes

The following are internal implementation details not exposed in the public API:

- Parser classes (`StrictFeedParser`, `LooseFeedParser`) are created dynamically
- The `SUPPORTED_VERSIONS` mapping is available in the `feedparser.api` module but is not exported
- The `PREFERRED_XML_PARSERS` list controls SAX parser selection

For format detection, use the `result.version` field from `parse()` results.