tessl/pypi-striprtf

A simple library to convert Rich Text Format (RTF) files to plain text

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: pypipkg:pypi/striprtf@0.0.x

To install, run

npx @tessl/cli install tessl/pypi-striprtf@0.0.0

# striprtf

A simple Python library that converts Rich Text Format (RTF) files to plain text. It was designed with medical documents in mind but works for other RTF files that need to be parsed and processed, offering a configurable input encoding and configurable handling of Unicode decoding errors.

## Package Information

- **Package Name**: striprtf
- **Language**: Python
- **Installation**: `pip install striprtf`
- **Minimum Python Version**: 3.8+

## Core Imports

```python
from striprtf.striprtf import rtf_to_text
```

For advanced use cases:

```python
from striprtf.striprtf import rtf_to_text, remove_pict_groups
```

Version information:

```python
from striprtf import __version__
```

## Basic Usage

```python
from striprtf.striprtf import rtf_to_text

# Convert RTF string to plain text
rtf = "some RTF encoded string"
text = rtf_to_text(rtf)
print(text)

# With custom encoding
rtf = "some RTF encoded string in latin1"
text = rtf_to_text(rtf, encoding="latin-1")
print(text)

# With error handling for problematic encodings
rtf = "some RTF encoded string"
text = rtf_to_text(rtf, errors="ignore")
print(text)
```

## Capabilities

### RTF to Text Conversion

Converts RTF text to plain text. The function decodes Unicode escapes, honors an explicit codepage directive in the RTF when one is present, and lets the caller choose how decoding errors are handled.

```python { .api }
def rtf_to_text(text, encoding="cp1252", errors="strict"):
    """
    Converts RTF text to plain text.

    Parameters:
    - text (str): The RTF text to convert
    - encoding (str): Input encoding, defaults to "cp1252". Ignored if the RTF file contains an explicit codepage directive
    - errors (str): How to handle encoding errors. "strict" (default) raises errors, "ignore" skips problematic characters

    Returns:
    str: The converted RTF text as a Python unicode string

    Raises:
    UnicodeDecodeError: When encoding errors occur and errors="strict"
    """
```
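To make the `errors` parameter concrete, the sketch below shows how Python's standard codecs treat `"strict"` and `"ignore"` for cp1252 bytes. It assumes that striprtf forwards `encoding` and `errors` to the standard decoder when it meets hex-escaped bytes, which the parameter descriptions suggest but do not show. `decode_escaped_byte` is a hypothetical helper for illustration, not part of the library.

```python
def decode_escaped_byte(hex_pair, encoding="cp1252", errors="strict"):
    """Decode one RTF-style hex escape (the 'e9' in \\'e9) to text.

    Hypothetical helper for illustration; not part of striprtf.
    """
    return bytes.fromhex(hex_pair).decode(encoding, errors)


# 0xE9 is a valid cp1252 byte (e with acute accent).
accented = decode_escaped_byte("e9")

# 0x81 is unassigned in cp1252, so "strict" raises and "ignore" drops it.
try:
    decode_escaped_byte("81", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = decode_escaped_byte("81", errors="ignore")
```

With `"ignore"` the undecodable byte silently disappears from the output, so `"strict"` is the safer default when data loss must be noticed.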

### Binary Data Processing

Removes binary picture data that can cause parsing issues from RTF text. `rtf_to_text` calls this function automatically, but it can also be used on its own as a preprocessing step.

```python { .api }
def remove_pict_groups(rtf_text):
    """
    Remove all \\pict groups with binary data from the RTF text.

    Parameters:
    - rtf_text (str): The RTF text containing potentially problematic \\pict groups

    Returns:
    str: The RTF text with binary-encoded \\pict groups removed

    Note: Returns original text if no binary-encoded \\pict groups are found
    """
```
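The general idea of dropping a picture group can be sketched with simple brace matching. This is not the library's implementation: `strip_pict_groups_sketch` is a hypothetical function that removes every group starting with `{\pict`, whether or not it holds binary data, and it assumes the input contains no escaped braces and no raw `\bin` payloads.

```python
def strip_pict_groups_sketch(rtf_text):
    """Drop every brace-delimited group that starts with {\\pict.

    Illustrative only: it ignores escaped braces and raw \\bin payloads,
    which a production parser has to account for.
    """
    marker = "{\\pict"
    out = []
    i = 0
    while i < len(rtf_text):
        if rtf_text.startswith(marker, i):
            depth = 0
            # Walk forward until the brace that closes this group.
            while i < len(rtf_text):
                if rtf_text[i] == "{":
                    depth += 1
                elif rtf_text[i] == "}":
                    depth -= 1
                i += 1
                if depth == 0:
                    break
        else:
            out.append(rtf_text[i])
            i += 1
    return "".join(out)


sample = r"{\rtf1 before {\pict\pngblip 89504e47{\*\nested}} after}"
cleaned = strip_pict_groups_sketch(sample)
```

Tracking brace depth, rather than searching for the next `}`, is what lets the sketch skip nested groups inside the picture group.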

### Command Line Interface

Command-line tool for converting RTF files to plain text. The CLI is a separate script that imports and calls the `rtf_to_text` function.

```python { .api }
def main():
    """
    Command-line entry point for converting RTF files to text.
    Located in striprtf/striprtf script file.

    Usage: striprtf <rtf_file>

    Arguments:
    - rtf_file: Path to RTF file to convert (required, file opened with UTF-8 encoding)

    Options:
    - --version: Show version and exit

    Note: Installed as 'striprtf' command via package scripts configuration
    """
```

## Constants and Data Structures

### Character Set Mappings

```python { .api }
charset_map: dict
# Mapping of RTF charset numbers to Python encoding names
# Contains mappings for major character sets including cp1252, cp932, cp949, etc.

destinations: frozenset
# Set of RTF control words that specify "destinations" to ignore during parsing
# Contains RTF keywords like 'fonttbl', 'colortbl', 'stylesheet', etc.

specialchars: dict
# Translation mapping for special RTF characters to Unicode equivalents
# Maps RTF escape sequences to actual characters (e.g., 'emdash' -> '\\u2014')

sectionchars: dict
# Translation mapping for RTF section and paragraph control words
# Maps section-related RTF keywords to line break characters (e.g., 'par' -> '\\n')
```
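As a rough picture of what a charset mapping looks like, the sketch below pairs a few RTF `\fcharsetN` numbers with the Python codecs that conventionally correspond to them. The dictionary is an illustrative subset and the names `CHARSET_SKETCH` and `codec_for_charset` are hypothetical; the library's actual `charset_map` may contain more or different entries.

```python
# Illustrative subset of \fcharsetN values and matching Python codecs.
# The library's real charset_map may contain more or different entries.
CHARSET_SKETCH = {
    0: "cp1252",    # ANSI
    128: "cp932",   # Shift-JIS (Japanese)
    129: "cp949",   # Hangul (Korean)
    134: "cp936",   # GB2312 (Simplified Chinese)
    136: "cp950",   # Big5 (Traditional Chinese)
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    204: "cp1251",  # Cyrillic
    238: "cp1250",  # Eastern European
}


def codec_for_charset(charset_number, default="cp1252"):
    """Return a codec name for an RTF charset number, or the default."""
    return CHARSET_SKETCH.get(charset_number, default)


# Byte 0xE1 under the Greek charset decodes to a lowercase alpha.
greek = bytes.fromhex("e1").decode(codec_for_charset(161))
unknown = codec_for_charset(9999)
```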

### Regular Expression Patterns

```python { .api }
PATTERN: re.Pattern
# Main regex pattern for parsing RTF tokens and control words

HYPERLINKS: re.Pattern
# Regex pattern for extracting hyperlinks from RTF HYPERLINK fields

FONTTABLE: re.Pattern
# Regex pattern for parsing font table information
```
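For readers unfamiliar with RTF tokens, the following sketch shows a much simplified control-word pattern: a backslash, a run of lowercase letters, an optional signed number, and an optional single space delimiter. `CONTROL_WORD` is a hypothetical name and this pattern is not the library's `PATTERN`, which also has to cover hex escapes, braces, and plain text.

```python
import re

# Simplified control-word pattern: backslash, letters, optional signed
# number, optional single space delimiter. Not the library's PATTERN.
CONTROL_WORD = re.compile(r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?")

# Each match yields (word, numeric_argument); the argument is an empty
# string when the control word carries no number.
tokens = CONTROL_WORD.findall(r"\par\fs24 Hello\li-720 world")
```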

## Usage Examples

### Processing RTF Files

```python
from striprtf.striprtf import rtf_to_text

# Read RTF file and convert to text
with open('document.rtf', 'r', encoding='utf-8') as f:
    rtf_content = f.read()

plain_text = rtf_to_text(rtf_content)
print(plain_text)
```

### Handling Encoding Issues

```python
from striprtf.striprtf import rtf_to_text

# Sample input for illustration; in practice this comes from a file
rtf_content = r"{\rtf1\ansi Caf\'e9}"

# For problematic RTF files with encoding issues
try:
    text = rtf_to_text(rtf_content, encoding="cp1252", errors="strict")
except UnicodeDecodeError:
    # Fallback to ignore encoding errors
    text = rtf_to_text(rtf_content, errors="ignore")
```

### Advanced Binary Data Processing

```python
from striprtf.striprtf import rtf_to_text, remove_pict_groups

# For RTF files with known binary picture issues, preprocess first
rtf_content = "\\rtf1\\pict\\bin1024{binary data here}\\par text"
cleaned_rtf = remove_pict_groups(rtf_content)
text = rtf_to_text(cleaned_rtf)
```

### Command Line Usage

```bash
# Convert RTF file to plain text
striprtf document.rtf

# Check version
striprtf --version
```

## Error Handling

The library handles various RTF parsing challenges:

- **Encoding Detection**: Automatically detects codepage directives in RTF files
- **Unicode Decoding**: Handles Unicode characters and escape sequences
- **Binary Data**: Removes binary picture data that can cause parsing issues
- **Malformed RTF**: Gracefully handles malformed or incomplete RTF structures
- **Font Tables**: Processes font table information for proper character rendering

Common exceptions:

- `UnicodeDecodeError`: Raised when character decoding fails with `errors="strict"`
- `LookupError`: Raised internally when an unknown encoding is encountered; the library falls back to UTF-8

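The fallback described for `LookupError` can be pictured with the standard `codecs` module. This is a minimal sketch under the assumption that an unknown encoding name is detected by a failed codec lookup; `resolve_codec` is a hypothetical helper, not a function exposed by striprtf.

```python
import codecs


def resolve_codec(name, fallback="utf-8"):
    """Return a usable codec name, falling back when the name is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return fallback


known = resolve_codec("cp1252")
unknown = resolve_codec("not-a-real-codec")
```

Because the fallback is silent, text decoded this way can differ from what the document's author intended, which is worth keeping in mind when output looks garbled.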

## Notes

- No external dependencies - uses only Python standard library
- Optimized for medical documents and other text-heavy RTF files
- Handles hyperlinks by converting them to "text(url)" format
- Preserves paragraph breaks and basic text structure
- Supports all common RTF character encodings via charset_map
- Table cells are converted using pipe (|) separators