tessl/pypi-striprtf

A simple library to convert Rich Text Format (RTF) files to plain text

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:pypi/striprtf@0.0.x

To install, run

npx @tessl/cli install tessl/pypi-striprtf@0.0.0

striprtf

A simple Python library that converts Rich Text Format (RTF) files to plain text. It is aimed at medical documents and other RTF files that need to be parsed and processed further, and it provides a configurable input encoding and control over how Unicode decoding errors are handled.

Package Information

Package Name: striprtf
Language: Python
Installation: pip install striprtf
Minimum Python Version: 3.8

Core Imports

from striprtf.striprtf import rtf_to_text

For advanced use cases:

from striprtf.striprtf import rtf_to_text, remove_pict_groups

Version information:

from striprtf import __version__

Basic Usage

from striprtf.striprtf import rtf_to_text

# Convert RTF string to plain text
rtf = "some RTF encoded string"
text = rtf_to_text(rtf)
print(text)

# With custom encoding
rtf = "some RTF encoded string in latin1"  
text = rtf_to_text(rtf, encoding="latin-1")
print(text)

# With error handling for problematic encodings
rtf = "some RTF encoded string"
text = rtf_to_text(rtf, errors="ignore")
print(text)

Capabilities

RTF to Text Conversion

Converts Rich Text Format (RTF) text to plain text. It decodes Unicode escapes, honors an explicit codepage directive in the RTF when one is present, and lets the caller choose how decoding errors are handled.

def rtf_to_text(text, encoding="cp1252", errors="strict"):
    """
    Converts RTF text to plain text.

    Parameters:
    - text (str): The RTF text to convert
    - encoding (str): Input encoding, defaults to "cp1252". Ignored if the RTF contains an explicit codepage directive
    - errors (str): How to handle decoding errors. "strict" (default) raises an error, "ignore" skips undecodable characters

    Returns:
    str: The extracted plain text as a Python str
    
    Raises:
    UnicodeDecodeError: When encoding errors occur and errors="strict"
    """

Binary Data Processing

Removes binary picture data from RTF text that can cause parsing issues. This function is automatically called by rtf_to_text but can be used independently for preprocessing.

def remove_pict_groups(rtf_text):
    """
    Remove all \\pict groups with binary data from the RTF text.
    
    Parameters:
    - rtf_text (str): The RTF text containing potentially problematic \\pict groups
    
    Returns:
    str: The RTF text with binary-encoded \\pict groups removed
    
    Note: Returns original text if no binary-encoded \\pict groups are found
    """

Command Line Interface

Command-line tool for converting RTF files to plain text. The CLI is implemented as a separate script that imports and uses the rtf_to_text function.

def main():
    """
    Command-line entry point for converting RTF files to text.
    Located in the striprtf/striprtf script file.
    
    Usage: striprtf <rtf_file>
    
    Arguments:
    - rtf_file: Path to RTF file to convert (required, file opened with UTF-8 encoding)
    
    Options:  
    - --version: Show version and exit
    
    Note: Installed as 'striprtf' command via package scripts configuration
    """

Constants and Data Structures

Character Set Mappings

charset_map: dict
# Mapping of RTF charset numbers to Python encoding names
# Contains mappings for major character sets including cp1252, cp932, cp949, etc.

destinations: frozenset  
# Set of RTF control words that specify "destinations" to ignore during parsing
# Contains RTF keywords like 'fonttbl', 'colortbl', 'stylesheet', etc.

specialchars: dict
# Translation mapping for special RTF characters to Unicode equivalents
# Maps RTF control words for special characters to the characters themselves (e.g., 'emdash' -> '\u2014')

sectionchars: dict
# Translation mapping for RTF section and paragraph control words
# Maps section-related RTF keywords to line break characters (e.g., 'par' -> '\n')
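
To make the role of a charset mapping concrete, the sketch below shows how a charset number can select the codec used to decode the byte in an RTF hex escape. The dictionary is a small illustrative subset written for this example, and the names EXAMPLE_CHARSETS and decode_hex_escape are hypothetical; consult charset_map itself for the library's actual entries.

```python
# Illustrative subset of RTF charset numbers and matching Python codecs.
# This is NOT the library's charset_map; it is an assumption for this sketch.
EXAMPLE_CHARSETS = {
    0: "cp1252",    # ANSI / Western European
    128: "cp932",   # Shift-JIS (Japanese)
    129: "cp949",   # Hangul (Korean)
    204: "cp1251",  # Cyrillic
}


def decode_hex_escape(hex_digits, charset, default="cp1252"):
    """Decode the two hex digits of an RTF hex escape using the charset's codec."""
    codec = EXAMPLE_CHARSETS.get(charset, default)
    return bytes.fromhex(hex_digits).decode(codec)


# The same byte yields different characters under different charsets.
western = decode_hex_escape("e9", 0)
cyrillic = decode_hex_escape("e9", 204)
print(western, cyrillic)
```

The point of the sketch is that a single byte has no meaning until a codec is chosen, which is why the encoding argument and the charset information in the file both matter.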

Regular Expression Patterns

PATTERN: re.Pattern
# Main regex pattern for parsing RTF tokens and control words

HYPERLINKS: re.Pattern  
# Regex pattern for extracting hyperlinks from RTF HYPERLINK fields

FONTTABLE: re.Pattern
# Regex pattern for parsing font table information

Usage Examples

Processing RTF Files

from striprtf.striprtf import rtf_to_text

# Read RTF file and convert to text
with open('document.rtf', 'r', encoding='utf-8') as f:
    rtf_content = f.read()

plain_text = rtf_to_text(rtf_content)
print(plain_text)
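
When several files need the same treatment, the single-file pattern above extends to a directory. This is a minimal sketch, not part of striprtf: convert_directory is a hypothetical helper, and the converter is passed in as an argument so the helper itself depends only on the standard library.

```python
from pathlib import Path


def convert_directory(src_dir, convert, suffix=".rtf"):
    """Convert every file with the given suffix in src_dir; return {filename: text}."""
    results = {}
    for path in sorted(Path(src_dir).glob(f"*{suffix}")):
        # Same steps as the single-file example: read as UTF-8 text, then convert.
        results[path.name] = convert(path.read_text(encoding="utf-8"))
    return results
```

With striprtf installed, a call would look like convert_directory("reports", rtf_to_text), where "reports" is a placeholder directory name.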

Handling Encoding Issues

from striprtf.striprtf import rtf_to_text

# rtf_content is the RTF source string, e.g. read from a file
# For problematic RTF files with encoding issues, try strict decoding first
try:
    text = rtf_to_text(rtf_content, encoding="cp1252", errors="strict")
except UnicodeDecodeError:
    # Fallback to ignore encoding errors
    text = rtf_to_text(rtf_content, errors="ignore")
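
The try/except pattern above can be wrapped in a small helper that also reports whether the fallback was needed. This is a sketch under assumptions: convert_with_fallback is a hypothetical name, and the converter is injected as a parameter (any callable accepting the encoding and errors keywords, such as rtf_to_text).

```python
def convert_with_fallback(rtf_content, convert, encoding="cp1252"):
    """Return (text, lossless): strict decoding first, then errors="ignore"."""
    try:
        return convert(rtf_content, encoding=encoding, errors="strict"), True
    except UnicodeDecodeError:
        # Undecodable characters are dropped on this path, so flag the result.
        return convert(rtf_content, encoding=encoding, errors="ignore"), False
```

Returning the flag lets a caller log or review documents where characters may have been dropped, rather than losing that information silently.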

Advanced Binary Data Processing

from striprtf.striprtf import rtf_to_text, remove_pict_groups

# For RTF files with known binary picture issues, preprocess first
rtf_content = "\\rtf1\\pict\\bin1024{binary data here}\\par text"
cleaned_rtf = remove_pict_groups(rtf_content)
text = rtf_to_text(cleaned_rtf)

Command Line Usage

# Convert RTF file to plain text
striprtf document.rtf

# Check version
striprtf --version
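
The command-line surface shown above (one positional file argument plus --version) could be sketched with argparse as follows. This is an illustration of the documented behavior, not the package's actual script: build_parser and run are hypothetical names, the version string is supplied by the caller, and the converter is passed in rather than imported.

```python
import argparse


def build_parser(version):
    """Build a parser mirroring the documented CLI surface (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="striprtf",
        description="Convert an RTF file to plain text.",
    )
    parser.add_argument("rtf_file", help="path to the RTF file to convert")
    parser.add_argument("--version", action="version", version=f"%(prog)s {version}")
    return parser


def run(argv, convert, version="unknown"):
    """Parse argv, read the file as UTF-8, and return the converted text."""
    args = build_parser(version).parse_args(argv)
    with open(args.rtf_file, "r", encoding="utf-8") as handle:
        return convert(handle.read())
```

With striprtf installed, print(run(sys.argv[1:], rtf_to_text)) would approximate the command's behavior as described here.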

Error Handling

The library handles various RTF parsing challenges:

Encoding Detection: Automatically detects codepage directives in RTF files
Unicode Decoding: Handles Unicode characters and escape sequences
Binary Data: Removes binary picture data that can cause parsing issues
Malformed RTF: Gracefully handles malformed or incomplete RTF structures
Font Tables: Processes font table information for proper character rendering

Common exceptions:

UnicodeDecodeError: Raised when character encoding fails with errors="strict"
LookupError: Raised and handled internally when an unknown encoding is encountered; the library falls back to UTF-8
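
The UTF-8 fallback for unknown encodings can be illustrated with the standard codecs module. This is a minimal sketch of the idea, assuming a codepage number is turned into a "cpN" codec name; resolve_codepage is a hypothetical helper and is not taken from the library's source.

```python
import codecs


def resolve_codepage(codepage, fallback="utf8"):
    """Return "cp<codepage>" if Python knows that codec, otherwise the fallback."""
    name = f"cp{codepage}"
    try:
        codecs.lookup(name)  # raises LookupError for unknown codec names
    except LookupError:
        return fallback
    return name


known = resolve_codepage(1252)
unknown = resolve_codepage(99999)
print(known, unknown)
```

Because the LookupError is caught at the point of lookup, a caller of such a helper only ever sees a usable codec name.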

Notes

No external dependencies - uses only Python standard library
Optimized for medical documents and other text-heavy RTF files
Handles hyperlinks by converting them to "text(url)" format
Preserves paragraph breaks and basic text structure
Supports all common RTF character encodings via charset_map
Table cells are converted using pipe (|) separators