# Core Parsing

Feedparser's core parsing functionality supports multiple input sources, extensive configuration options, and automatic format detection across RSS and Atom feed formats.

## Capabilities

### Main Parse Function

The primary parsing function that handles URLs, files, streams, and strings with comprehensive configuration options.

```python { .api }
def parse(url_file_stream_or_string, etag=None, modified=None, agent=None, referrer=None, handlers=None, request_headers=None, response_headers=None, resolve_relative_uris=None, sanitize_html=None):
    """
    Parse a feed from a URL, file, stream, or string.

    Args:
        url_file_stream_or_string: File-like object, URL, file path, or string.
            Both byte and text strings are accepted. If necessary, encoding will
            be derived from response headers or automatically detected.

            Note: Strings may trigger network I/O or filesystem access depending
            on the value. Wrap untrusted strings in io.StringIO or io.BytesIO
            to avoid this. Do not pass untrusted strings to this function.

        etag (str, optional): HTTP ETag request header for conditional requests.

        modified (str/time.struct_time/datetime, optional): HTTP Last-Modified
            request header for conditional requests. Can be a string, 9-tuple
            from gmtime(), or datetime object. Must be in GMT.

        agent (str, optional): HTTP User-Agent request header. Defaults to
            feedparser.USER_AGENT if not specified.

        referrer (str, optional): HTTP Referer request header.

        handlers (list, optional): List of urllib handlers used to build a custom opener.

        request_headers (dict, optional): Mapping of HTTP header names to values
            that will override internally generated request headers.

        response_headers (dict, optional): Mapping of HTTP header names to values.
            If an HTTP request was made, these override matching response headers.
            Otherwise, this specifies the entirety of response headers.

        resolve_relative_uris (bool, optional): Whether to resolve relative URIs
            to absolute ones within HTML content. Defaults to RESOLVE_RELATIVE_URIS.

        sanitize_html (bool, optional): Whether to sanitize HTML content.
            Only disable if you know what you're doing! Defaults to SANITIZE_HTML.

    Returns:
        FeedParserDict: Parsed feed data containing:
            - bozo: Boolean indicating parsing issues
            - bozo_exception: Exception if parsing errors occurred
            - encoding: Character encoding used
            - etag: HTTP ETag from response
            - headers: HTTP response headers dict
            - href: Final URL after redirects
            - modified: HTTP Last-Modified header
            - namespaces: XML namespaces used
            - status: HTTP status code
            - version: Feed format version
            - entries: List of entry/item dictionaries
            - feed: Feed-level metadata dictionary
    """
```
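
A minimal usage sketch of calling parse() and reading the returned FeedParserDict (the URL is a placeholder; FeedParserDict supports both key and attribute access, so `result.feed` and `result['feed']` are equivalent):

```python
import feedparser

result = feedparser.parse('https://example.com/feed.xml')  # placeholder URL

print(result.get('version'))       # e.g. 'rss20' or 'atom10'
print(result.feed.get('title'))    # feed-level metadata dictionary
for entry in result.entries:       # list of entry/item dictionaries
    print(entry.get('title'), entry.get('link'))
```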

### Input Source Types

Feedparser accepts multiple input source types:

```python
import feedparser

# Parse from URL
result = feedparser.parse('https://example.com/feed.xml')

# Parse from local file path
result = feedparser.parse('/path/to/feed.xml')

# Parse from file-like object
with open('feed.xml', 'rb') as f:
    result = feedparser.parse(f)

# Parse from string content (XML/HTML)
xml_content = """<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Example Feed</title>
<item><title>Test Item</title></item>
</channel>
</rss>"""
result = feedparser.parse(xml_content)

# Parse from bytes
result = feedparser.parse(xml_content.encode('utf-8'))

# Wrap untrusted content in StringIO/BytesIO so it is never
# interpreted as a URL or file path
import io

untrusted_content = xml_content  # stands in for text from an external source
result = feedparser.parse(io.StringIO(untrusted_content))
```

### Conditional Requests

Use ETags and Last-Modified headers for efficient feed polling:

```python
# Initial request
result = feedparser.parse('https://example.com/feed.xml')
etag = result.etag
modified = result.modified

# Subsequent conditional request
result = feedparser.parse(
    'https://example.com/feed.xml',
    etag=etag,
    modified=modified
)

# Check if feed was modified
if result.status == 304:
    print("Feed not modified")
else:
    print("Feed was updated")
```

### Custom HTTP Configuration

Configure HTTP behavior with custom headers and agents:

```python
# Custom User-Agent
result = feedparser.parse(
    url,
    agent='MyApplication/1.0 (+https://example.com/bot.html)'
)

# Custom request headers
result = feedparser.parse(
    url,
    request_headers={
        'Authorization': 'Bearer token123',
        'Accept-Language': 'en-US,en;q=0.9'
    }
)

# Custom response headers (for testing or overrides)
result = feedparser.parse(
    content,
    response_headers={
        'Content-Type': 'application/rss+xml',
        'Content-Location': 'https://example.com/feed.xml'
    }
)
```
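
The `handlers` parameter accepts standard `urllib.request` handler objects, which feedparser uses to build its opener. A minimal sketch for routing requests through an HTTP proxy; the proxy address is purely illustrative:

```python
import urllib.request

import feedparser

# Build a urllib handler and pass it to parse(); the address is a placeholder
proxy = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:3128'})
result = feedparser.parse('http://example.com/feed.xml', handlers=[proxy])
```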

### Content Processing Options

Control URI resolution and HTML sanitization:

```python
# Disable relative URI resolution
result = feedparser.parse(url, resolve_relative_uris=False)

# Disable HTML sanitization (use with caution!)
result = feedparser.parse(url, sanitize_html=False)

# Combine multiple options
result = feedparser.parse(
    url,
    agent='MyBot/1.0',
    resolve_relative_uris=True,
    sanitize_html=True,
    request_headers={'Accept': 'application/atom+xml,application/rss+xml'}
)
```

### Format Detection

Feedparser automatically detects and handles multiple feed formats:

```python
result = feedparser.parse(url)

# Check the detected format. result.version is one of: 'rss090', 'rss091n',
# 'rss091u', 'rss092', 'rss093', 'rss094', 'rss20', 'rss10', 'rss',
# 'atom01', 'atom02', 'atom03', 'atom10', 'atom', 'cdf', or '' (unknown)
if result.version:
    print(f"Detected feed format: {result.version}")
else:
    print("Unknown feed format")
```

### Global Configuration

Set global defaults for all parsing operations:

```python
import feedparser

# Set global User-Agent
feedparser.USER_AGENT = 'MyApplication/2.0 (+https://example.com)'

# Disable relative URI resolution globally
feedparser.RESOLVE_RELATIVE_URIS = 0

# Disable HTML sanitization globally
feedparser.SANITIZE_HTML = 0

# These settings affect all subsequent parse() calls unless overridden
result = feedparser.parse(url)  # Uses global settings
```

### Error Handling During Parsing

Handle various parsing scenarios:

```python
try:
    result = feedparser.parse(url)

    # Check for well-formedness issues
    if result.bozo:
        print(f"Feed had issues: {result.bozo_exception}")

        # Common exception types
        if isinstance(result.bozo_exception, feedparser.NonXMLContentType):
            print("Content was not XML")
        elif isinstance(result.bozo_exception, feedparser.CharacterEncodingUnknown):
            print("Could not determine character encoding")

    # Check HTTP status
    if hasattr(result, 'status'):
        if result.status == 404:
            print("Feed not found")
        elif result.status >= 400:
            print(f"HTTP error: {result.status}")

    # Process feed data
    if result.entries:
        print(f"Found {len(result.entries)} entries")
    else:
        print("No entries found")

except Exception as e:
    print(f"Parsing failed: {e}")
```

252

253

## Parser Selection

254

255

Feedparser automatically selects between strict and lenient parsing modes based on content:

256

257

- **Strict parsing**: Used for well-formed XML feeds, leverages xml.sax with namespace support

258

- **Lenient parsing**: Used for malformed content, provides HTML-style parsing with error recovery

259

260

Parser selection is automatic and internal - users don't need to interact with parser classes directly.
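
The practical consequence of the lenient fallback is visible through the `bozo` flag. A minimal sketch using a deliberately malformed feed string (the unescaped ampersand breaks strict XML parsing); how much content is recovered depends on how badly the document is broken:

```python
import io

import feedparser

# Not well-formed XML: the bare '&' forces a fallback to lenient parsing
broken_feed = """<rss version="2.0">
<channel>
<title>Oops & Co.</title>
<item><title>Still readable</title></item>
</channel>
</rss>"""

result = feedparser.parse(io.StringIO(broken_feed))
print(result.bozo)             # truthy: the document was not well-formed
print(result.bozo_exception)   # the underlying parser error
# Entries are usually still recovered by the lenient parser
print(result.entries[0].title if result.entries else "no entries recovered")
```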

## Internal Implementation Notes

The following are internal implementation details not exposed in the public API:

- Parser classes (StrictFeedParser, LooseFeedParser) are created dynamically
- The SUPPORTED_VERSIONS mapping is available in the feedparser.api module but not exported
- The PREFERRED_XML_PARSERS list controls SAX parser selection

For format detection, use the `result.version` field from parse() results.