tessl/pypi-feedparser

Universal feed parser for RSS, Atom, and CDF feeds with comprehensive format support and robust parsing capabilities

Core Parsing

Name: tessl/pypi-feedparser
Author: tessl

Feedparser's core parsing functionality supports multiple input sources, a range of configuration options, and automatic format detection across RSS, Atom, and CDF feeds.

Capabilities

Main Parse Function

parse() is the primary entry point. It accepts URLs, file paths, file-like objects, and strings, and takes optional arguments that control HTTP behavior and content processing.

def parse(url_file_stream_or_string, etag=None, modified=None, agent=None, referrer=None, handlers=None, request_headers=None, response_headers=None, resolve_relative_uris=None, sanitize_html=None):
    """
    Parse a feed from a URL, file, stream, or string.

    Args:
        url_file_stream_or_string: File-like object, URL, file path, or string.
            Both byte and text strings are accepted. If necessary, encoding will
            be derived from response headers or automatically detected.
            
            Note: Strings may trigger network I/O or filesystem access depending
            on the value. Wrap untrusted strings in io.StringIO or io.BytesIO
            to avoid this. Do not pass untrusted strings to this function.

        etag (str, optional): HTTP ETag request header for conditional requests.
        
        modified (str/time.struct_time/datetime, optional): HTTP Last-Modified
            request header for conditional requests. Can be a string, 9-tuple
            from gmtime(), or datetime object. Must be in GMT.
            
        agent (str, optional): HTTP User-Agent request header. Defaults to
            feedparser.USER_AGENT if not specified.
            
        referrer (str, optional): HTTP Referer request header.
        
        handlers (list, optional): List of urllib handlers to build custom opener.
        
        request_headers (dict, optional): Mapping of HTTP header names to values
            that will override internally generated request headers.
            
        response_headers (dict, optional): Mapping of HTTP header names to values.
            If an HTTP request was made, these override matching response headers.
            Otherwise, this specifies the entirety of response headers.
            
        resolve_relative_uris (bool, optional): Whether to resolve relative URIs
            to absolute ones within HTML content. Defaults to RESOLVE_RELATIVE_URIS.
            
        sanitize_html (bool, optional): Whether to sanitize HTML content.
            Only disable if you know what you're doing! Defaults to SANITIZE_HTML.

    Returns:
        FeedParserDict: Parsed feed data containing:
            - bozo: Boolean indicating parsing issues
            - bozo_exception: Exception if parsing errors occurred  
            - encoding: Character encoding used
            - etag: HTTP ETag from response
            - headers: HTTP response headers dict
            - href: Final URL after redirects
            - modified: HTTP Last-Modified header
            - namespaces: XML namespaces used
            - status: HTTP status code
            - version: Feed format version
            - entries: List of entry/item dictionaries
            - feed: Feed-level metadata dictionary
    """

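Some of the keys listed under Returns may be absent from a given result; for example, status or etag may be missing when no HTTP request was made. The following is a minimal sketch of reading a result defensively. The helper name summarize is hypothetical (not part of feedparser), and it assumes only that the result behaves like a mapping. A plain dict stands in for a real result so the sketch runs without network access.

```python
def summarize(result):
    """Collect a few commonly used fields from a parse() result.

    `result` is any mapping shaped like the documented return value.
    Optional keys are read with .get() so a missing key yields None
    instead of raising.
    """
    feed = result.get("feed", {})
    entries = result.get("entries", [])
    return {
        "title": feed.get("title"),
        "version": result.get("version") or None,  # '' means unknown format
        "status": result.get("status"),
        "well_formed": not result.get("bozo", False),
        "entry_titles": [entry.get("title") for entry in entries],
    }


# Stand-in for a real parse() result (assumption: same key layout).
sample = {
    "bozo": False,
    "version": "rss20",
    "feed": {"title": "Example Feed"},
    "entries": [{"title": "Test Item"}],
}
summary = summarize(sample)
```

With feedparser installed, summarize(feedparser.parse(source)) should work the same way, since the result supports dictionary-style access.
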
Input Source Types

Feedparser accepts multiple input source types:

import feedparser

# Parse from URL
result = feedparser.parse('https://example.com/feed.xml')

# Parse from local file path
result = feedparser.parse('/path/to/feed.xml')

# Parse from file-like object
with open('feed.xml', 'rb') as f:
    result = feedparser.parse(f)

# Parse from string content (XML/HTML)
xml_content = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>Test Item</title></item>
  </channel>
</rss>"""
result = feedparser.parse(xml_content)

# Parse from bytes
result = feedparser.parse(xml_content.encode('utf-8'))

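# Wrap untrusted content in StringIO/BytesIO so it is never treated as a URL or path
import io
untrusted_content = xml_content  # stand-in for data received from an untrusted source
result = feedparser.parse(io.StringIO(untrusted_content))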
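
Untrusted content may arrive as either text or bytes. The sketch below picks the matching in-memory wrapper for each case. The name as_stream is a hypothetical helper, not part of feedparser.

```python
import io


def as_stream(content):
    """Wrap raw feed content in an in-memory stream.

    Passing the returned object to feedparser.parse() means the content is
    read as data and is not interpreted as a URL or file path.
    """
    if isinstance(content, (bytes, bytearray)):
        return io.BytesIO(bytes(content))
    if isinstance(content, str):
        return io.StringIO(content)
    raise TypeError(f"expected str or bytes, got {type(content).__name__}")


# A string that looks like a path stays plain data once wrapped.
stream = as_stream("/etc/passwd")
```

A call would then look like feedparser.parse(as_stream(untrusted_content)).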

Conditional Requests

Use ETags and Last-Modified headers for efficient feed polling:

# Initial request
result = feedparser.parse('https://example.com/feed.xml')
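etag = result.get('etag')          # None if the server sent no ETag header
modified = result.get('modified')  # None if the server sent no Last-Modified header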

# Subsequent conditional request
result = feedparser.parse(
    'https://example.com/feed.xml',
    etag=etag,
    modified=modified
)

# A 304 status means the feed is unchanged since the last request
if result.get('status') == 304:
    print("Feed not modified")
else:
    print("Feed was updated")
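
When polling repeatedly, the validators have to be stored between runs. The sketch below shows one way to carry them forward. The helper names conditional_kwargs and updated_state are hypothetical, and plain dicts stand in for parse() results so the logic can run without a network.

```python
def conditional_kwargs(state):
    """Build the etag/modified keyword arguments from saved validators."""
    return {key: state[key] for key in ("etag", "modified") if state.get(key)}


def updated_state(state, result):
    """Return the validators to save after a poll.

    On a 304 the previous validators stay in force; otherwise take whatever
    the new response supplied, which may be nothing.
    """
    if result.get("status") == 304:
        return dict(state)
    return {key: result[key] for key in ("etag", "modified") if result.get(key)}


# Simulated poll cycle: a first full response, then a 304.
first = {"status": 200, "etag": '"abc"', "modified": "Sat, 01 Jan 2022 00:00:00 GMT"}
state = updated_state({}, first)
kwargs = conditional_kwargs(state)
state_after_304 = updated_state(state, {"status": 304})
```

In real use the call would be feedparser.parse(url, **conditional_kwargs(state)), with state persisted wherever the application keeps per-feed data.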

Custom HTTP Configuration

Configure HTTP behavior with custom headers and agents:

# Custom User-Agent
result = feedparser.parse(
    url,
    agent='MyApplication/1.0 (+https://example.com/bot.html)'
)

# Custom request headers
result = feedparser.parse(
    url,
    request_headers={
        'Authorization': 'Bearer token123',
        'Accept-Language': 'en-US,en;q=0.9'
    }
)

# Custom response headers (for testing or overrides)
result = feedparser.parse(
    content,
    response_headers={
        'Content-Type': 'application/rss+xml',
        'Content-Location': 'https://example.com/feed.xml'
    }
)

Content Processing Options

Control URI resolution and HTML sanitization:

# Disable relative URI resolution
result = feedparser.parse(url, resolve_relative_uris=False)

# Disable HTML sanitization (use with caution!)
result = feedparser.parse(url, sanitize_html=False)

# Combine multiple options
result = feedparser.parse(
    url,
    agent='MyBot/1.0',
    resolve_relative_uris=True,
    sanitize_html=True,
    request_headers={'Accept': 'application/atom+xml,application/rss+xml'}
)

Format Detection

Feedparser automatically detects and handles multiple feed formats:

result = feedparser.parse(url)

# Check detected format
print(f"Feed version: {result.version}")
# Possible values: 'rss090', 'rss091n', 'rss091u', 'rss092', 'rss093', 
# 'rss094', 'rss20', 'rss10', 'rss', 'atom01', 'atom02', 'atom03', 
# 'atom10', 'atom', 'cdf', or '' (unknown)

# An empty string means the format was not recognized
if result.version:
    print(f"Detected feed format: {result.version}")
else:
    print("Unknown feed format")
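
Application code often only needs to know the broad format family, not the exact version. A minimal sketch, assuming the version codes listed above; feed_family is a hypothetical helper, not a feedparser function.

```python
def feed_family(version):
    """Map a result.version code to a coarse format family."""
    if not version:
        return "unknown"  # '' (or a missing value) means no format detected
    for family in ("rss", "atom", "cdf"):
        if version.startswith(family):
            return family
    return "other"  # a code outside the families listed above


families = [feed_family(v) for v in ("rss20", "atom10", "cdf", "")]
```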

Global Configuration

Set global defaults for all parsing operations:

import feedparser

# Set global User-Agent
feedparser.USER_AGENT = 'MyApplication/2.0 (+https://example.com)'

# Disable relative URI resolution globally
feedparser.RESOLVE_RELATIVE_URIS = False

# Disable HTML sanitization globally (use with caution!)
feedparser.SANITIZE_HTML = False

# These settings affect all subsequent parse() calls unless overridden
result = feedparser.parse(url)  # Uses global settings
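
Because these are module-level settings, a change stays in effect until it is reverted. The sketch below is one way to scope an override to a block of code. temporary_attrs is a hypothetical helper, and a SimpleNamespace stands in for the feedparser module so the example runs on its own. It is not thread-safe; passing resolve_relative_uris or sanitize_html to an individual parse() call is usually the simpler option.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_attrs(obj, **overrides):
    """Set attributes on obj for the duration of a with-block, then restore."""
    saved = {name: getattr(obj, name) for name in overrides}
    for name, value in overrides.items():
        setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Restore the original values even if the block raised.
        for name, value in saved.items():
            setattr(obj, name, value)


# Stand-in for the feedparser module (assumption: same attribute names).
settings = SimpleNamespace(SANITIZE_HTML=1, RESOLVE_RELATIVE_URIS=1)
with temporary_attrs(settings, RESOLVE_RELATIVE_URIS=0):
    inside = settings.RESOLVE_RELATIVE_URIS
after = settings.RESOLVE_RELATIVE_URIS
```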

Error Handling During Parsing

Handle various parsing scenarios:

import feedparser

try:
    result = feedparser.parse(url)
    
    # Check for well-formedness issues
    if result.bozo:
        print(f"Feed had issues: {result.bozo_exception}")
        
        # Common exception types
        if isinstance(result.bozo_exception, feedparser.NonXMLContentType):
            print("Content was not XML")
        elif isinstance(result.bozo_exception, feedparser.CharacterEncodingUnknown):
            print("Could not determine character encoding")
    
    # Check HTTP status
    if hasattr(result, 'status'):
        if result.status == 404:
            print("Feed not found")
        elif result.status >= 400:
            print(f"HTTP error: {result.status}")
    
    # Process feed data
    if result.entries:
        print(f"Found {len(result.entries)} entries")
    else:
        print("No entries found")
        
except Exception as e:
    print(f"Parsing failed: {e}")
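
The checks above can be folded into a single decision that calling code can branch on. A minimal sketch; classify and its outcome labels are hypothetical, and plain dicts stand in for parse() results.

```python
def classify(result):
    """Reduce a parse() result to one of a few coarse outcomes."""
    status = result.get("status")  # absent when no HTTP request was made
    if status == 304:
        return "not_modified"
    if status is not None and status >= 400:
        return "http_error"
    if result.get("bozo"):
        # bozo only flags a problem; usable entries may still be present.
        return "recovered" if result.get("entries") else "failed"
    return "ok"


samples = [
    {"status": 200, "bozo": False, "entries": [{"title": "A"}]},
    {"status": 304, "bozo": False, "entries": []},
    {"status": 404, "bozo": True, "entries": []},
    {"bozo": True, "entries": [{"title": "A"}]},
    {"bozo": True, "entries": []},
]
outcomes = [classify(sample) for sample in samples]
```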

Parser Selection

Feedparser chooses between strict and lenient parsing automatically. It tries the strict parser first and falls back to the lenient parser when the content is not well-formed XML:

- Strict parsing: used for well-formed XML feeds; built on xml.sax with namespace support.
- Lenient parsing: used for malformed content; provides HTML-style parsing with error recovery.

Parser selection is automatic and internal, so users do not need to interact with the parser classes directly.
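
To illustrate the general strict-then-lenient idea, here is a standard-library sketch that extracts title text with xml.sax and falls back to html.parser when the XML is malformed. This is not feedparser's implementation; the class and function names are hypothetical, and the returned flag only loosely mirrors the role of bozo.

```python
import xml.sax
from html.parser import HTMLParser


class _StrictTitles(xml.sax.ContentHandler):
    """SAX handler that collects the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None

    def startElement(self, name, attrs):
        if name == "title":
            self._buf = []

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title" and self._buf is not None:
            self.titles.append("".join(self._buf).strip())
            self._buf = None


class _LooseTitles(HTMLParser):
    """Tolerant parser that collects <title> text without requiring valid XML."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._buf is not None:
            self.titles.append("".join(self._buf).strip())
            self._buf = None


def extract_titles(data):
    """Return (titles, fell_back) for a bytes document."""
    strict = _StrictTitles()
    try:
        xml.sax.parseString(data, strict)
        return strict.titles, False
    except xml.sax.SAXParseException:
        # Not well-formed: discard the partial strict result and retry leniently.
        loose = _LooseTitles()
        loose.feed(data.decode("utf-8", errors="replace"))
        loose.close()
        return loose.titles, True


good = extract_titles(b"<rss><channel><title>A</title></channel></rss>")
bad = extract_titles(b"<rss><channel><title>A</title><item><title>B</title></channel>")
```

The second document is missing closing tags, so the strict pass raises and the lenient pass still recovers both titles.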

Internal Implementation Notes

The following are internal implementation details not exposed in the public API:

- Parser classes (StrictFeedParser, LooseFeedParser) are created dynamically.
- The SUPPORTED_VERSIONS mapping is available in the feedparser.api module but is not exported.
- The PREFERRED_XML_PARSERS list controls SAX parser selection.

For format detection, use the result.version field from parse() results.
