# Core Parsing

Feedparser's core parsing functionality supports multiple input sources, extensive configuration options, and automatic format detection across RSS and Atom feed formats.

## Capabilities

### Main Parse Function

The primary parsing function that handles URLs, files, streams, and strings with comprehensive configuration options.

```python { .api }
def parse(url_file_stream_or_string, etag=None, modified=None, agent=None, referrer=None, handlers=None, request_headers=None, response_headers=None, resolve_relative_uris=None, sanitize_html=None):
    """
    Parse a feed from a URL, file, stream, or string.

    Args:
        url_file_stream_or_string: File-like object, URL, file path, or string.
            Both byte and text strings are accepted. If necessary, encoding will
            be derived from response headers or automatically detected.

            Note: Strings may trigger network I/O or filesystem access depending
            on the value. Wrap untrusted strings in io.StringIO or io.BytesIO
            to avoid this. Do not pass untrusted strings to this function.

        etag (str, optional): HTTP ETag request header for conditional requests.

        modified (str/time.struct_time/datetime, optional): HTTP Last-Modified
            request header for conditional requests. Can be a string, 9-tuple
            from gmtime(), or datetime object. Must be in GMT.

        agent (str, optional): HTTP User-Agent request header. Defaults to
            feedparser.USER_AGENT if not specified.

        referrer (str, optional): HTTP Referer request header.

        handlers (list, optional): List of urllib handlers used to build a custom opener.

        request_headers (dict, optional): Mapping of HTTP header names to values
            that will override internally generated request headers.

        response_headers (dict, optional): Mapping of HTTP header names to values.
            If an HTTP request was made, these override matching response headers.
            Otherwise, this specifies the entirety of response headers.

        resolve_relative_uris (bool, optional): Whether to resolve relative URIs
            to absolute ones within HTML content. Defaults to RESOLVE_RELATIVE_URIS.

        sanitize_html (bool, optional): Whether to sanitize HTML content.
            Only disable if you know what you're doing! Defaults to SANITIZE_HTML.

    Returns:
        FeedParserDict: Parsed feed data containing:
            - bozo: Boolean indicating parsing issues
            - bozo_exception: Exception if parsing errors occurred
            - encoding: Character encoding used
            - etag: HTTP ETag from response
            - headers: HTTP response headers dict
            - href: Final URL after redirects
            - modified: HTTP Last-Modified header
            - namespaces: XML namespaces used
            - status: HTTP status code
            - version: Feed format version
            - entries: List of entry/item dictionaries
            - feed: Feed-level metadata dictionary
    """
```
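
A minimal usage sketch of calling parse() and reading the returned FeedParserDict (the URL is a placeholder; FeedParserDict supports both key and attribute access, so `result.feed` and `result['feed']` are equivalent):

```python
import feedparser

result = feedparser.parse('https://example.com/feed.xml')  # placeholder URL

print(result.get('version'))       # e.g. 'rss20' or 'atom10'
print(result.feed.get('title'))    # feed-level metadata dictionary
for entry in result.entries:       # list of entry/item dictionaries
    print(entry.get('title'), entry.get('link'))
```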

### Input Source Types

Feedparser accepts multiple input source types:

```python
import feedparser

# Parse from URL
result = feedparser.parse('https://example.com/feed.xml')

# Parse from local file path
result = feedparser.parse('/path/to/feed.xml')

# Parse from file-like object
with open('feed.xml', 'rb') as f:
    result = feedparser.parse(f)

# Parse from string content (XML/HTML)
xml_content = """<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Example Feed</title>
<item><title>Test Item</title></item>
</channel>
</rss>"""
result = feedparser.parse(xml_content)

# Parse from bytes
result = feedparser.parse(xml_content.encode('utf-8'))

# Wrap untrusted content in StringIO/BytesIO so it is never
# interpreted as a URL or file path
import io

untrusted_content = xml_content  # stands in for text from an external source
result = feedparser.parse(io.StringIO(untrusted_content))
```

### Conditional Requests

Use ETags and Last-Modified headers for efficient feed polling:

```python
# Initial request
result = feedparser.parse('https://example.com/feed.xml')
etag = result.etag
modified = result.modified

# Subsequent conditional request
result = feedparser.parse(
    'https://example.com/feed.xml',
    etag=etag,
    modified=modified
)

# Check if feed was modified
if result.status == 304:
    print("Feed not modified")
else:
    print("Feed was updated")
```

### Custom HTTP Configuration

Configure HTTP behavior with custom headers and agents:

```python
# Custom User-Agent
result = feedparser.parse(
    url,
    agent='MyApplication/1.0 (+https://example.com/bot.html)'
)

# Custom request headers
result = feedparser.parse(
    url,
    request_headers={
        'Authorization': 'Bearer token123',
        'Accept-Language': 'en-US,en;q=0.9'
    }
)

# Custom response headers (for testing or overrides)
result = feedparser.parse(
    content,
    response_headers={
        'Content-Type': 'application/rss+xml',
        'Content-Location': 'https://example.com/feed.xml'
    }
)
```
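
The `handlers` parameter accepts standard `urllib.request` handler objects, which feedparser uses to build its opener. A minimal sketch for routing requests through an HTTP proxy; the proxy address is purely illustrative:

```python
import urllib.request

import feedparser

# Build a urllib handler and pass it to parse(); the address is a placeholder
proxy = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:3128'})
result = feedparser.parse('http://example.com/feed.xml', handlers=[proxy])
```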

### Content Processing Options

Control URI resolution and HTML sanitization:

```python
# Disable relative URI resolution
result = feedparser.parse(url, resolve_relative_uris=False)

# Disable HTML sanitization (use with caution!)
result = feedparser.parse(url, sanitize_html=False)

# Combine multiple options
result = feedparser.parse(
    url,
    agent='MyBot/1.0',
    resolve_relative_uris=True,
    sanitize_html=True,
    request_headers={'Accept': 'application/atom+xml,application/rss+xml'}
)
```

### Format Detection

Feedparser automatically detects and handles multiple feed formats:

```python
result = feedparser.parse(url)

# Check the detected format. result.version is one of: 'rss090', 'rss091n',
# 'rss091u', 'rss092', 'rss093', 'rss094', 'rss20', 'rss10', 'rss',
# 'atom01', 'atom02', 'atom03', 'atom10', 'atom', 'cdf', or '' (unknown)
if result.version:
    print(f"Detected feed format: {result.version}")
else:
    print("Unknown feed format")
```

### Global Configuration

Set global defaults for all parsing operations:

```python
import feedparser

# Set global User-Agent
feedparser.USER_AGENT = 'MyApplication/2.0 (+https://example.com)'

# Disable relative URI resolution globally
feedparser.RESOLVE_RELATIVE_URIS = 0

# Disable HTML sanitization globally
feedparser.SANITIZE_HTML = 0

# These settings affect all subsequent parse() calls unless overridden
result = feedparser.parse(url)  # Uses global settings
```

### Error Handling During Parsing

Handle various parsing scenarios:

```python
try:
    result = feedparser.parse(url)

    # Check for well-formedness issues
    if result.bozo:
        print(f"Feed had issues: {result.bozo_exception}")

        # Common exception types
        if isinstance(result.bozo_exception, feedparser.NonXMLContentType):
            print("Content was not XML")
        elif isinstance(result.bozo_exception, feedparser.CharacterEncodingUnknown):
            print("Could not determine character encoding")

    # Check HTTP status
    if hasattr(result, 'status'):
        if result.status == 404:
            print("Feed not found")
        elif result.status >= 400:
            print(f"HTTP error: {result.status}")

    # Process feed data
    if result.entries:
        print(f"Found {len(result.entries)} entries")
    else:
        print("No entries found")

except Exception as e:
    print(f"Parsing failed: {e}")
```

252

253

## Parser Selection

254

255

Feedparser automatically selects between strict and lenient parsing modes based on content:

256

257

- **Strict parsing**: Used for well-formed XML feeds, leverages xml.sax with namespace support

258

- **Lenient parsing**: Used for malformed content, provides HTML-style parsing with error recovery

259

260

Parser selection is automatic and internal - users don't need to interact with parser classes directly.
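
The practical consequence of the lenient fallback is visible through the `bozo` flag. A minimal sketch using a deliberately malformed feed string (the unescaped ampersand breaks strict XML parsing); how much content is recovered depends on how badly the document is broken:

```python
import io

import feedparser

# Not well-formed XML: the bare '&' forces a fallback to lenient parsing
broken_feed = """<rss version="2.0">
<channel>
<title>Oops & Co.</title>
<item><title>Still readable</title></item>
</channel>
</rss>"""

result = feedparser.parse(io.StringIO(broken_feed))
print(result.bozo)             # truthy: the document was not well-formed
print(result.bozo_exception)   # the underlying parser error
# Entries are usually still recovered by the lenient parser
print(result.entries[0].title if result.entries else "no entries recovered")
```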

## Internal Implementation Notes

The following are internal implementation details not exposed in the public API:

- Parser classes (StrictFeedParser, LooseFeedParser) are created dynamically
- The SUPPORTED_VERSIONS mapping is available in the feedparser.api module but not exported
- The PREFERRED_XML_PARSERS list controls SAX parser selection

For format detection, use the `result.version` field from parse() results.