A simple library to convert Rich Text Format (RTF) files to plain text
npx @tessl/cli install tessl/pypi-striprtf@0.0.00
# striprtf

A simple Python library that converts Rich Text Format (RTF) text to plain text. It was designed for medical documents and other RTF files that need to be parsed and processed, and it offers configurable input encodings and error handling for Unicode decoding problems.

## Package Information

- **Package Name**: striprtf
- **Language**: Python
- **Installation**: `pip install striprtf`
- **Minimum Python Version**: 3.8

## Core Imports

```python
from striprtf.striprtf import rtf_to_text
```

For advanced use cases:

```python
from striprtf.striprtf import rtf_to_text, remove_pict_groups
```

Version information:

```python
from striprtf import __version__
```

## Basic Usage

```python
from striprtf.striprtf import rtf_to_text

# Convert RTF string to plain text
rtf = "some RTF encoded string"
text = rtf_to_text(rtf)
print(text)

# With custom encoding
rtf = "some RTF encoded string in latin1"
text = rtf_to_text(rtf, encoding="latin-1")
print(text)

# With error handling for problematic encodings
rtf = "some RTF encoded string"
text = rtf_to_text(rtf, errors="ignore")
print(text)
```

## Capabilities

### RTF to Text Conversion

Converts RTF text to plain text. Unicode escapes are supported, an explicit codepage directive in the document takes precedence over the `encoding` argument, and decoding errors are handled according to the `errors` argument.

```python { .api }
def rtf_to_text(text, encoding="cp1252", errors="strict"):
    """
    Converts RTF text to plain text.

    Parameters:
    - text (str): The RTF text to convert
    - encoding (str): Input encoding, defaults to "cp1252". Ignored if the RTF text contains an explicit codepage directive
    - errors (str): How to handle decoding errors. "strict" (default) raises errors, "ignore" skips problematic characters

    Returns:
    str: The converted plain text as a Python string

    Raises:
    UnicodeDecodeError: When decoding errors occur and errors="strict"
    """
```

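The two `errors` values use the same names as the error handlers of Python's standard codecs. The following sketch uses only `bytes.decode` from the standard library, not striprtf itself, to show how the two modes differ. It assumes the library forwards `errors` to the codec unchanged; `decode_bytes` is a hypothetical helper introduced only for illustration.

```python
def decode_bytes(raw: bytes, encoding: str = "cp1252", errors: str = "strict") -> str:
    """Decode raw bytes with the given codec and error handler (illustrative helper)."""
    return raw.decode(encoding, errors)


# 0xE9 is "é" in cp1252; 0x81 has no mapping in Python's cp1252 codec.
sample = b"caf\xe9\x81"

# "strict" raises as soon as an undecodable byte is met.
try:
    decode_bytes(sample)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ignore" silently drops the undecodable byte and keeps the rest.
lenient = decode_bytes(sample, errors="ignore")
```

With `errors="ignore"` the output is shorter than the input whenever bytes are dropped, so it is best reserved for cases where losing a few characters is acceptable.
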
### Binary Data Processing

Removes binary picture data from RTF text that can cause parsing issues. This function is called automatically by `rtf_to_text` but can also be used independently for preprocessing.

```python { .api }
def remove_pict_groups(rtf_text):
    """
    Remove all \\pict groups with binary data from the RTF text.

    Parameters:
    - rtf_text (str): The RTF text containing potentially problematic \\pict groups

    Returns:
    str: The RTF text with binary-encoded \\pict groups removed

    Note: Returns original text if no binary-encoded \\pict groups are found
    """
```

### Command Line Interface

Command-line tool for converting RTF files to plain text. The CLI is implemented as a separate script that imports and uses the `rtf_to_text` function.

```python { .api }
def main():
    """
    Command-line entry point for converting RTF files to text.
    Located in striprtf/striprtf script file.

    Usage: striprtf <rtf_file>

    Arguments:
    - rtf_file: Path to RTF file to convert (required, file opened with UTF-8 encoding)

    Options:
    - --version: Show version and exit

    Note: Installed as 'striprtf' command via package scripts configuration
    """
```

## Constants and Data Structures

### Character Set Mappings

```python { .api }
charset_map: dict
# Mapping of RTF charset numbers to Python encoding names
# Contains mappings for major character sets including cp1252, cp932, cp949, etc.

destinations: frozenset
# Set of RTF control words that specify "destinations" to ignore during parsing
# Contains RTF keywords like 'fonttbl', 'colortbl', 'stylesheet', etc.

specialchars: dict
# Translation mapping for special RTF characters to Unicode equivalents
# Maps RTF escape sequences to actual characters (e.g., 'emdash' -> '\u2014')

sectionchars: dict
# Translation mapping for RTF section and paragraph control words
# Maps section-related RTF keywords to line break characters (e.g., 'par' -> '\n')
```

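To make the role of these tables concrete, here is a minimal sketch of how lookups against such mappings could work. The dictionaries below are small illustrative subsets written for this example (the charset numbers follow the RTF `\fcharset` convention) and are not copies of the library's tables. Integer keys are assumed, and `encoding_for_charset` and `translate_control_word` are hypothetical helpers.

```python
# Illustrative subsets only; the library's real tables may differ.
EXAMPLE_CHARSET_MAP = {
    0: "cp1252",    # ANSI (Western European)
    128: "cp932",   # Shift JIS (Japanese)
    129: "cp949",   # Hangul (Korean)
    204: "cp1251",  # Cyrillic
}
EXAMPLE_SPECIALCHARS = {"emdash": "\u2014", "endash": "\u2013", "tab": "\t"}


def encoding_for_charset(charset: int, default: str = "cp1252") -> str:
    """Return the Python codec name for an RTF charset number, or a default."""
    return EXAMPLE_CHARSET_MAP.get(charset, default)


def translate_control_word(word: str) -> str:
    """Return the replacement character for a control word, or '' if unmapped."""
    return EXAMPLE_SPECIALCHARS.get(word, "")
```

Falling back to a default encoding and an empty replacement keeps the lookup total, so unknown charset numbers or formatting-only control words never raise.
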
### Regular Expression Patterns

```python { .api }
PATTERN: re.Pattern
# Main regex pattern for parsing RTF tokens and control words

HYPERLINKS: re.Pattern
# Regex pattern for extracting hyperlinks from RTF HYPERLINK fields

FONTTABLE: re.Pattern
# Regex pattern for parsing font table information
```

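As a rough illustration of what tokenizing RTF control words with a regular expression involves, the sketch below matches a backslash, a lowercase keyword, an optional signed numeric parameter, and one optional delimiting space. It is a deliberately simplified pattern written for this example and is not the library's `PATTERN`; `CONTROL_WORD` and `control_words` are hypothetical names.

```python
import re

# Simplified control-word matcher: \keyword, optional numeric parameter, optional space.
CONTROL_WORD = re.compile(r"\\([a-z]+)(-?\d+)? ?")


def control_words(rtf: str):
    """Return (keyword, parameter) pairs; parameter is None when absent."""
    return [
        (m.group(1), int(m.group(2)) if m.group(2) else None)
        for m in CONTROL_WORD.finditer(rtf)
    ]


tokens = control_words(r"\par\fs24 Hello")
```

A full parser also has to handle groups (`{` and `}`), hex escapes such as `\'e9`, and control symbols, which this sketch intentionally leaves out.
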
## Usage Examples

### Processing RTF Files

```python
from striprtf.striprtf import rtf_to_text

# Read RTF file and convert to text
with open('document.rtf', 'r', encoding='utf-8') as f:
    rtf_content = f.read()

plain_text = rtf_to_text(rtf_content)
print(plain_text)
```

### Handling Encoding Issues

```python
from striprtf.striprtf import rtf_to_text

# rtf_content is assumed to hold RTF text loaded earlier (for example, read from a file)
try:
    text = rtf_to_text(rtf_content, encoding="cp1252", errors="strict")
except UnicodeDecodeError:
    # Fall back to ignoring undecodable characters
    text = rtf_to_text(rtf_content, errors="ignore")
```

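The same try-then-fall-back idea can be generalized to several candidate encodings. The sketch below is one possible arrangement, not part of the library: `convert_with_fallback` is a hypothetical helper that accepts any converter with the `(text, encoding=..., errors=...)` calling convention, and `fake_convert` is a stand-in used here so the example runs without striprtf installed. In real use you would pass `rtf_to_text` as `convert`.

```python
from typing import Callable, Sequence


def convert_with_fallback(
    rtf: str,
    convert: Callable[..., str],
    encodings: Sequence[str] = ("cp1252", "latin-1"),
) -> str:
    """Try each encoding strictly; as a last resort, ignore decoding errors."""
    for enc in encodings:
        try:
            return convert(rtf, encoding=enc, errors="strict")
        except UnicodeDecodeError:
            continue
    return convert(rtf, encoding=encodings[0], errors="ignore")


def fake_convert(text: str, encoding: str = "cp1252", errors: str = "strict") -> str:
    """Stand-in converter: reinterpret the characters as bytes and decode them."""
    return text.encode("latin-1").decode(encoding, errors)


# cp1252 cannot decode byte 0x81, so the helper moves on to latin-1.
result = convert_with_fallback("caf\xe9\x81", fake_convert)
```

Trying strict decoding first means characters are dropped only when every candidate encoding has failed.
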
### Advanced Binary Data Processing

```python
from striprtf.striprtf import rtf_to_text, remove_pict_groups

# For RTF files with known binary picture issues, preprocess first
rtf_content = "\\rtf1\\pict\\bin1024{binary data here}\\par text"
cleaned_rtf = remove_pict_groups(rtf_content)
text = rtf_to_text(cleaned_rtf)
```

### Command Line Usage

```bash
# Convert RTF file to plain text
striprtf document.rtf

# Check version
striprtf --version
```

## Error Handling

The library handles several RTF parsing challenges:

- **Encoding Detection**: Automatically detects codepage directives in RTF files
- **Unicode Decoding**: Handles Unicode characters and escape sequences
- **Binary Data**: Removes binary picture data that can cause parsing issues
- **Malformed RTF**: Gracefully handles malformed or incomplete RTF structures
- **Font Tables**: Processes font table information for proper character rendering

Common exceptions:

- `UnicodeDecodeError`: Raised when a character cannot be decoded and `errors="strict"`
- `LookupError`: Raised internally when an unknown encoding is encountered; the library then falls back to UTF-8

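The fallback described for unknown encodings can be pictured with the standard `codecs` module. This is a minimal sketch of the general technique, not the library's code; `resolve_encoding` is a hypothetical helper.

```python
import codecs


def resolve_encoding(name: str, fallback: str = "utf-8") -> str:
    """Return the normalized codec name, or the fallback if the codec is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return fallback
```

Resolving the codec name up front means a bad codepage value is caught once, before any text is decoded.
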
## Notes

- No external dependencies; uses only the Python standard library
- Optimized for medical documents and other text-heavy RTF files
- Handles hyperlinks by converting them to "text(url)" format
- Preserves paragraph breaks and basic text structure
- Supports all common RTF character encodings via `charset_map`
- Table cells are converted using pipe (|) separators
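
Because tables and hyperlinks end up as plain-text conventions, downstream code may want to recover some structure from the converted output. The sketch below post-processes a string of already converted text. It assumes each table row sits on its own line, tolerates one trailing separator, and only captures a link label made of a single word before `(url)`. `split_rows`, `find_links`, and `LINK` are hypothetical names, and the exact output layout should be checked against real documents.

```python
import re

# Matches "label(http://...)" or "label(https://...)" where label has no whitespace.
LINK = re.compile(r"(\S+?)\((https?://[^\s()]+)\)")


def split_rows(text: str):
    """Split pipe-separated lines into lists of cells, skipping non-table lines."""
    rows = []
    for line in text.splitlines():
        if "|" not in line:
            continue
        if line.endswith("|"):
            line = line[:-1]  # drop a single trailing separator
        rows.append([cell.strip() for cell in line.split("|")])
    return rows


def find_links(text: str):
    """Return (label, url) pairs found in the converted text."""
    return LINK.findall(text)
```

Text that legitimately contains a pipe character or parentheses will confuse these heuristics, so they are best treated as a starting point.
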