npx @tessl/cli install tessl/pypi-textract@1.6.00
# Textract

A Python library for extracting text from documents through a single interface. Textract detects the file type from its extension and applies the matching extraction method for 25+ document formats, including PDFs, Word documents, images, and audio files.

## Package Information

- **Package Name**: textract
- **Language**: Python
- **Installation**: `pip install textract`

## Core Imports

```python
import textract
```

For accessing exceptions:

```python
from textract import exceptions
```

For accessing constants:

```python
from textract.parsers import DEFAULT_OUTPUT_ENCODING, EXTENSION_SYNONYMS
```

For accessing color utilities:

```python
from textract.colors import red, green, blue
```

## Basic Usage

```python
import textract

# Extract text from any supported file format
text = textract.process('/path/to/document.pdf')
print(text)

# Extract with specific encoding
text = textract.process('/path/to/document.docx', output_encoding='utf-8')

# Extract with parser-specific options
text = textract.process('/path/to/document.pdf', method='pdfminer')

# Extract with language specification for OCR
text = textract.process('/path/to/image.png', language='eng')

# Handle files without extensions
text = textract.process('/path/to/file', extension='.txt')
```

## Architecture

Textract is built on a modular parser architecture that provides:

- **Unified Interface**: A single `process()` function handles every supported file type
- **Format Detection**: The file type is detected from the file extension, and the `extension` argument overrides it
- **Parser Registry**: An extensible set of specialized parsers covers 25+ document formats
- **Method Selection**: Some formats (PDFs, audio, images) offer several extraction methods
- **Encoding Handling**: Input and output encodings are configurable, with `utf_8` as the default output encoding
- **External Tool Integration**: Parsers delegate to external tools such as tesseract, pdftotext, and antiword where needed

With this design, one function call extracts text from any supported format, and keyword options cover the specialized cases.

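The registry idea can be illustrated with a toy model. This is a sketch under stated assumptions, not textract's implementation: `PARSERS`, `register`, and `extract` are hypothetical names, and only two trivial parsers are registered.

```python
import json
import os

# Hypothetical registry mapping an extension to a parser callable.
PARSERS = {}


def register(extension):
    """Decorator that records a parser function under the given extension."""
    def decorator(func):
        PARSERS[extension] = func
        return func
    return decorator


@register(".txt")
def parse_txt(path):
    """Plain text needs no conversion, so just read the file."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


@register(".json")
def parse_json(path):
    """Collect every string value found anywhere in the JSON document."""
    def strings(node):
        if isinstance(node, str):
            yield node
        elif isinstance(node, dict):
            for value in node.values():
                yield from strings(value)
        elif isinstance(node, list):
            for item in node:
                yield from strings(item)

    with open(path, encoding="utf-8") as handle:
        return " ".join(strings(json.load(handle)))


def extract(path, extension=None):
    """Dispatch to the parser registered for the file's extension."""
    ext = (extension or os.path.splitext(path)[1]).lower()
    try:
        parser = PARSERS[ext]
    except KeyError:
        raise ValueError(f"unsupported extension: {ext!r}") from None
    return parser(path)
```

Adding a format in this model means registering one more function; the dispatch code in `extract` does not change.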
## Capabilities

### Text Extraction

The core functionality for extracting text from any supported document format with automatic format detection and method selection.

```python { .api }
def process(filename, input_encoding=None, output_encoding='utf_8', extension=None, **kwargs):
    """
    Extract text from any supported document format.

    Parameters:
    - filename (str): Path to the file to extract text from
    - input_encoding (str, optional): Input encoding specification
    - output_encoding (str): Output encoding (default: 'utf_8')
    - extension (str, optional): Manual extension override for format detection
    - **kwargs: Parser-specific options including:
        - method (str): Extraction method ('pdftotext', 'pdfminer', 'tesseract', 'google', 'sphinx')
        - language (str): Language code for OCR (e.g., 'eng', 'fra', 'deu')
        - layout (bool): Preserve layout in PDF extraction (pdftotext method)

    Returns:
    str: Extracted text content

    Raises:
    - ExtensionNotSupported: When file extension is not supported
    - MissingFileError: When specified file cannot be found
    - UnknownMethod: When specified extraction method is unknown
    - ShellError: When external command execution fails
    """
```

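Callers often want a text string no matter how the result is encoded. The wrapper below is a minimal sketch of that idea. `to_text` and `extract_text` are hypothetical helper names, and the bytes branch is an assumption that covers installations where `process()` hands back an encoded byte string on Python 3.

```python
def to_text(raw, encoding="utf-8"):
    """Return `raw` as str, decoding it first if it is a byte string."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


def extract_text(path, encoding="utf-8", **options):
    """Call textract.process and always hand back a str."""
    # Imported lazily so this helper module loads even without textract.
    import textract

    raw = textract.process(path, output_encoding=encoding, **options)
    return to_text(raw, encoding)
```

Passing the same `encoding` to both `process()` and `to_text()` keeps the encode and decode steps consistent.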
### Package Metadata

Package name and version identifiers for compatibility checking and debugging.

```python { .api }
__name__: str = "textract"
VERSION: str = "1.6.5"
```

### Error Handling

Comprehensive exception classes for robust error handling and user feedback.

```python { .api }
class CommandLineError(Exception):
    """Base exception class for CLI errors with suppressed tracebacks."""

    def render(self, msg: str) -> str:
        """
        Format error messages for display.

        Parameters:
        - msg (str): Message template with variable placeholders

        Returns:
        str: Formatted message string
        """

class ExtensionNotSupported(CommandLineError):
    """Raised when file extension is not supported."""

    def __init__(self, ext):
        """
        Parameters:
        - ext (str): The unsupported extension
        """

class MissingFileError(CommandLineError):
    """Raised when specified file cannot be found."""

    def __init__(self, filename):
        """
        Parameters:
        - filename (str): The missing file path
        """

class UnknownMethod(CommandLineError):
    """Raised when specified extraction method is unknown."""

    def __init__(self, method):
        """
        Parameters:
        - method (str): The unknown method name
        """

class ShellError(CommandLineError):
    """Raised when shell command execution fails."""

    def __init__(self, command, exit_code, stdout, stderr):
        """
        Parameters:
        - command (str): Command that failed
        - exit_code (int): Process exit code
        - stdout (str): Standard output
        - stderr (str): Standard error
        """

    def is_not_installed(self):
        """Check if error is due to missing executable."""

    def not_installed_message(self):
        """Get missing dependency message."""

    def failed_message(self):
        """Get command failure message."""
```

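One way to use these classes is to catch the specific exceptions and report each case separately. The function below is a sketch that relies only on the names listed above; `extract_or_report` is a hypothetical helper, and the wording of the messages is illustrative.

```python
import sys


def extract_or_report(path, **options):
    """Return the extracted text, or None after printing a short diagnosis."""
    # Imported lazily so this module can be loaded without textract installed.
    import textract
    from textract import exceptions

    try:
        return textract.process(path, **options)
    except exceptions.MissingFileError:
        print(f"{path}: file not found", file=sys.stderr)
    except exceptions.ExtensionNotSupported:
        print(f"{path}: no parser for this extension", file=sys.stderr)
    except exceptions.UnknownMethod:
        print(f"{path}: unknown method {options.get('method')!r}", file=sys.stderr)
    except exceptions.ShellError as error:
        # A missing executable and a failing executable call for different fixes.
        if error.is_not_installed():
            print(error.not_installed_message(), file=sys.stderr)
        else:
            print(error.failed_message(), file=sys.stderr)
    return None
```

Because every class derives from `CommandLineError`, a single `except exceptions.CommandLineError` clause would also work when the distinction between cases does not matter.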
### Parser Constants

Constants for encoding and extension handling used throughout the parsing system.

```python { .api }
EXTENSION_SYNONYMS: dict = {
    ".jpeg": ".jpg",
    ".tff": ".tiff",
    ".tif": ".tiff",
    ".htm": ".html",
    "": ".txt",
    ".log": ".txt",
    ".tab": ".tsv"
}

DEFAULT_OUTPUT_ENCODING: str = 'utf_8'

DEFAULT_ENCODING: str = 'utf_8'
```

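The synonym table suggests how an extension might be resolved before a parser is chosen. The sketch below copies the table so that it runs without textract installed. `canonical_extension` is a hypothetical helper, and the lowercasing and leading-dot handling are assumptions about the lookup, not documented behavior.

```python
import os

# Copied from the table above so the sketch is self-contained.
EXTENSION_SYNONYMS = {
    ".jpeg": ".jpg",
    ".tff": ".tiff",
    ".tif": ".tiff",
    ".htm": ".html",
    "": ".txt",
    ".log": ".txt",
    ".tab": ".tsv",
}


def canonical_extension(filename, extension=None):
    """Resolve the extension that would select a parser for `filename`."""
    # An explicit override wins over whatever the file name says.
    ext = extension if extension else os.path.splitext(filename)[1]
    if ext and not ext.startswith("."):
        ext = "." + ext
    ext = ext.lower()
    # Unknown extensions pass through unchanged.
    return EXTENSION_SYNONYMS.get(ext, ext)
```

Note the empty-string entry: under this table a file with no extension at all is treated as plain text.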
### Color Utilities

Terminal color formatting functions for enhanced CLI output and user interfaces.

```python { .api }
red: function
"""Apply red ANSI color codes to text string."""

green: function
"""Apply green ANSI color codes to text string."""

yellow: function
"""Apply yellow ANSI color codes to text string."""

blue: function
"""Apply blue ANSI color codes to text string."""

magenta: function
"""Apply magenta ANSI color codes to text string."""

cyan: function
"""Apply cyan ANSI color codes to text string."""

white: function
"""Apply white ANSI color codes to text string."""

bold_red: function
"""Apply bold red ANSI color codes to text string."""

bold_green: function
"""Apply bold green ANSI color codes to text string."""

bold_yellow: function
"""Apply bold yellow ANSI color codes to text string."""

bold_blue: function
"""Apply bold blue ANSI color codes to text string."""

bold_magenta: function
"""Apply bold magenta ANSI color codes to text string."""

bold_cyan: function
"""Apply bold cyan ANSI color codes to text string."""

bold_white: function
"""Apply bold white ANSI color codes to text string."""

def colorless(text: str) -> str:
    """
    Remove ANSI color codes from text.

    Parameters:
    - text (str): Text containing ANSI color codes

    Returns:
    str: Text with color codes removed
    """
```

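To show what applying and removing ANSI color codes involves, here is a standalone approximation. The names `red` and `colorless` mirror the list above, but the bodies are assumptions; the exact escape sequences that `textract.colors` emits may differ.

```python
import re

# Matches SGR sequences such as ESC[31m or ESC[1;31m.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def red(text):
    """Wrap text in the ANSI code for a red foreground, then reset."""
    return f"\x1b[31m{text}\x1b[0m"


def colorless(text):
    """Strip ANSI color sequences from text."""
    return ANSI_SGR.sub("", text)
```

Stripping colors this way is useful when CLI output is redirected to a file or compared in tests.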
## Supported File Formats

Textract supports 25 distinct file formats through specialized parsers:

### Document Formats
- **`.txt`** - Plain text files (direct reading)
- **`.doc`** - Microsoft Word documents (via antiword/catdoc)
- **`.docx`** - Microsoft Word XML documents (via docx2txt)
- **`.pdf`** - PDF documents (multiple methods: pdftotext, pdfminer, tesseract OCR)
- **`.rtf`** - Rich Text Format (via unrtf)
- **`.odt`** - OpenDocument Text (via odt2txt)
- **`.epub`** - Electronic publication format (via zipfile + BeautifulSoup)
- **`.html/.htm`** - HTML documents (via BeautifulSoup with table parsing)

### Spreadsheet Formats
- **`.xls`** - Excel 97-2003 format (via xlrd)
- **`.xlsx`** - Excel 2007+ format (via xlrd)
- **`.csv`** - Comma-separated values (via csv module)
- **`.tsv`** - Tab-separated values (via csv module)
- **`.psv`** - Pipe-separated values (via csv module)

### Presentation Formats
- **`.pptx`** - PowerPoint presentations (via pptx)

### Image Formats (OCR)
- **`.jpg/.jpeg`** - JPEG images (via tesseract OCR)
- **`.png`** - PNG images (via tesseract OCR)
- **`.gif`** - GIF images (via tesseract OCR)
- **`.tiff/.tif`** - TIFF images (via tesseract OCR)

### Audio Formats (Speech Recognition)
- **`.wav`** - WAV audio files (via SpeechRecognition)
- **`.mp3`** - MP3 audio files (converted to WAV then processed)
- **`.ogg`** - OGG audio files (converted to WAV then processed)

### Email Formats
- **`.eml`** - Email message files (via email.parser)
- **`.msg`** - Outlook message files (via msg-extractor)

### Other Formats
- **`.json`** - JSON files (extracts all string values recursively)
- **`.ps`** - PostScript files (via ps2ascii)

## Parser Method Options

Several file formats support multiple extraction methods via the `method` parameter:

### PDF Extraction Methods
```python
# Default method using pdftotext utility
text = textract.process('document.pdf', method='pdftotext')

# Use pdfminer library for extraction
text = textract.process('document.pdf', method='pdfminer')

# OCR-based extraction for scanned PDFs
text = textract.process('document.pdf', method='tesseract')

# Preserve layout with pdftotext
text = textract.process('document.pdf', method='pdftotext', layout=True)
```

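Since scanned PDFs often yield no text from `pdftotext`, a common pattern is to try the methods in turn. The helper below is a sketch of that pattern, not something textract provides: `extract_pdf` is a hypothetical name, and it takes the `process` callable as an argument so that the fallback logic can be exercised without the library.

```python
def extract_pdf(path, process, methods=("pdftotext", "pdfminer", "tesseract")):
    """Try each method in order and return (method, text) for the first hit.

    A method counts as a hit when it returns non-blank text. If no method
    produces text, the last error is re-raised when one occurred; otherwise
    (None, None) is returned.
    """
    last_error = None
    for method in methods:
        try:
            text = process(path, method=method)
        except Exception as error:  # e.g. a ShellError when a tool is missing
            last_error = error
            continue
        if text and text.strip():
            return method, text
    if last_error is not None:
        raise last_error
    return None, None
```

A typical call would be `method, text = extract_pdf('document.pdf', textract.process)`. OCR comes last in the default order because it is usually the slowest option.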
### Audio Recognition Methods
```python
# Google Speech Recognition (default)
text = textract.process('audio.wav', method='google')

# PocketSphinx offline recognition
text = textract.process('audio.wav', method='sphinx')
```

### Image OCR Options
```python
# Specify language for OCR recognition
text = textract.process('image.png', language='eng')  # English
text = textract.process('image.png', language='fra')  # French
text = textract.process('image.png', language='deu')  # German
```

## Command-Line Interface

Textract also provides a `textract` command that exposes the same capabilities as the Python API:

```bash
# Basic text extraction
textract document.pdf

# Specify output encoding
textract --encoding utf-8 document.docx

# Override file extension detection
textract --extension .txt unknown_file

# Use specific extraction method
textract --method pdfminer document.pdf

# Save output to file
textract --output extracted.txt document.pdf

# Use parser-specific options
textract --option layout=True document.pdf

# Show version information
textract --version
```

### CLI Options
- **`filename`** - Required input file path
- **`-e/--encoding`** - Output encoding specification
- **`--extension`** - Manual extension override for format detection
- **`-m/--method`** - Extraction method selection
- **`-o/--output`** - Output file specification
- **`-O/--option`** - Parser options in KEYWORD=VALUE format
- **`-v/--version`** - Display version information

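When the CLI has to be driven from another Python program, the options above map naturally onto an argument list. The sketch below uses only the flags listed here. `build_command` and `run_textract` are hypothetical helpers, and repeating `--option` once per KEYWORD=VALUE pair is an assumption about how several parser options would be passed.

```python
import subprocess


def build_command(filename, encoding=None, extension=None, method=None,
                  output=None, options=None):
    """Assemble the argument list for the textract CLI."""
    command = ["textract"]
    if encoding:
        command += ["--encoding", encoding]
    if extension:
        command += ["--extension", extension]
    if method:
        command += ["--method", method]
    if output:
        command += ["--output", output]
    # Parser-specific options use the KEYWORD=VALUE form.
    for key, value in (options or {}).items():
        command += ["--option", f"{key}={value}"]
    command.append(filename)
    return command


def run_textract(filename, **settings):
    """Run the CLI and return whatever it wrote to standard output."""
    result = subprocess.run(build_command(filename, **settings),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Building the command as a list and skipping the shell avoids quoting problems with file names that contain spaces. `check=True` turns a non-zero exit status into a `subprocess.CalledProcessError`.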