A simple library to convert Rich Text Format (RTF) files to plain text
npx @tessl/cli install tessl/pypi-striprtf@0.0.0

A simple Python library to convert Rich Text Format (RTF) files to plain text. The library is specifically designed to handle medical documents and other RTF files that need to be parsed and processed, providing flexible encoding options and robust error handling for Unicode decoding issues.

pip install striprtf

from striprtf.striprtf import rtf_to_text

For advanced use cases:

from striprtf.striprtf import rtf_to_text, remove_pict_groups

Version information:

from striprtf import __version__

from striprtf.striprtf import rtf_to_text
# Convert RTF string to plain text
rtf = "some RTF encoded string"
text = rtf_to_text(rtf)
print(text)
# With custom encoding
rtf = "some RTF encoded string in latin1"
text = rtf_to_text(rtf, encoding="latin-1")
print(text)
# With error handling for problematic encodings
rtf = "some RTF encoded string"
text = rtf_to_text(rtf, errors="ignore")
print(text)

Converts Rich Text Format (RTF) text to plain text with full Unicode support, automatic encoding detection, and robust error handling.
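As a rough illustration of the encoding behaviour described here: the parameter notes state that `encoding` is ignored when the RTF contains an explicit codepage directive. The sketch below shows one way such a precedence rule could work. It assumes the directive is the standard RTF `\ansicpg` control word; `pick_encoding` is a hypothetical helper written for this example and is not part of striprtf's API or its actual implementation.

```python
import re

# Hypothetical helper for illustration only; not part of striprtf.
# Assumption: the codepage directive is the RTF control word \ansicpgN.
_ANSICPG = re.compile(r"\\ansicpg(\d+)")


def pick_encoding(rtf: str, default: str = "cp1252") -> str:
    """Prefer the document's own codepage over the caller-supplied default."""
    match = _ANSICPG.search(rtf)
    return f"cp{match.group(1)}" if match else default


# A document that declares codepage 1251 overrides the default.
print(pick_encoding(r"{\rtf1\ansi\ansicpg1251 text}"))

# Without a directive, the caller's choice is used.
print(pick_encoding(r"{\rtf1\ansi text}", default="latin-1"))
```

In practice this means passing `encoding=` mainly matters for RTF input that does not declare its own codepage.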
def rtf_to_text(text, encoding="cp1252", errors="strict"):
    """
    Converts RTF text to plain text.

    Parameters:
    - text (str): The RTF text to convert
    - encoding (str): Input encoding, defaults to "cp1252". Ignored if RTF file contains explicit codepage directive
    - errors (str): How to handle encoding errors. "strict" (default) raises errors, "ignore" skips problematic characters

    Returns:
    str: The converted RTF text as a Python unicode string

    Raises:
    UnicodeDecodeError: When encoding errors occur and errors="strict"
    """

Removes binary picture data from RTF text that can cause parsing issues. This function is automatically called by rtf_to_text but can be used independently for preprocessing.
def remove_pict_groups(rtf_text):
    """
    Remove all \\pict groups with binary data from the RTF text.

    Parameters:
    - rtf_text (str): The RTF text containing potentially problematic \\pict groups

    Returns:
    str: The RTF text with binary-encoded \\pict groups removed

    Note: Returns original text if no binary-encoded \\pict groups are found
    """

Command-line tool for converting RTF files to plain text. The CLI is implemented as a separate script that imports and uses the rtf_to_text function.
def main():
    """
    Command-line entry point for converting RTF files to text.
    Located in striprtf/striprtf script file.

    Usage: striprtf <rtf_file>

    Arguments:
    - rtf_file: Path to RTF file to convert (required, file opened with UTF-8 encoding)

    Options:
    - --version: Show version and exit

    Note: Installed as 'striprtf' command via package scripts configuration
    """

charset_map: dict
# Mapping of RTF charset numbers to Python encoding names
# Contains mappings for major character sets including cp1252, cp932, cp949, etc.
destinations: frozenset
# Set of RTF control words that specify "destinations" to ignore during parsing
# Contains RTF keywords like 'fonttbl', 'colortbl', 'stylesheet', etc.
specialchars: dict
# Translation mapping for special RTF characters to Unicode equivalents
# Maps RTF escape sequences to actual characters (e.g., 'emdash' -> '\\u2014')
sectionchars: dict
# Translation mapping for RTF section and paragraph control words
# Maps section-related RTF keywords to line break characters (e.g., 'par' -> '\\n')

PATTERN: re.Pattern
# Main regex pattern for parsing RTF tokens and control words
HYPERLINKS: re.Pattern
# Regex pattern for extracting hyperlinks from RTF HYPERLINK fields
FONTTABLE: re.Pattern
# Regex pattern for parsing font table information

from striprtf.striprtf import rtf_to_text
# Read RTF file and convert to text
with open('document.rtf', 'r', encoding='utf-8') as f:
    rtf_content = f.read()
plain_text = rtf_to_text(rtf_content)
print(plain_text)

from striprtf.striprtf import rtf_to_text
# For problematic RTF files with encoding issues
try:
    text = rtf_to_text(rtf_content, encoding="cp1252", errors="strict")
except UnicodeDecodeError:
    # Fallback to ignore encoding errors
    text = rtf_to_text(rtf_content, errors="ignore")

from striprtf.striprtf import rtf_to_text, remove_pict_groups
# For RTF files with known binary picture issues, preprocess first
rtf_content = "\\rtf1\\pict\\bin1024{binary data here}\\par text"
cleaned_rtf = remove_pict_groups(rtf_content)
text = rtf_to_text(cleaned_rtf)

# Convert RTF file to plain text
striprtf document.rtf
# Check version
striprtf --version

The library handles various RTF parsing challenges:
Common exceptions:
UnicodeDecodeError: Raised when character encoding fails with errors="strict"
LookupError: Raised internally when unknown encoding is encountered (falls back to UTF-8)
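To make these two exceptions concrete, the following standard-library sketch reproduces the same failure modes outside of striprtf. It shows how Python's own codecs behave under `errors="strict"` versus `errors="ignore"`, and how an unknown encoding name could be replaced by UTF-8. `resolve_codec` is a hypothetical helper for this example only; it mirrors the documented fallback but is not the library's code.

```python
import codecs


def resolve_codec(name: str, fallback: str = "utf-8") -> str:
    """Return `name` if Python knows the codec, otherwise `fallback`.

    Hypothetical helper: it imitates the documented "falls back to UTF-8"
    behaviour but is not taken from striprtf.
    """
    try:
        codecs.lookup(name)
        return name
    except LookupError:
        return fallback


# Byte 0x81 has no character assigned in Python's cp1252 codec.
raw = b"caf\xe9 \x81"

try:
    raw.decode("cp1252", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    # Same exception type that rtf_to_text is documented to raise.
    strict_failed = True

# errors="ignore" silently drops the undecodable byte instead.
lenient = raw.decode("cp1252", errors="ignore")

print(strict_failed)
print(repr(lenient))
print(resolve_codec("no-such-codec-xyz"))
```

Note that `errors="ignore"` trades correctness for robustness: the undecodable byte disappears without any warning, so it is best reserved for a fallback path like the try/except pattern shown earlier.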