Tessl Tile for pypi/textract@1.6.0

tessl/pypi-textract

Extract text from any document format without worrying about underlying complexities.

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:pypi/textract@1.6.x

To install, run

npx @tessl/cli install tessl/pypi-textract@1.6.0

# Textract

Textract is a Python library for extracting text from documents through a single interface. It detects the file type from the file extension and applies the matching extraction method, covering 25+ document formats including PDFs, Word documents, images, and audio files.

## Package Information

- **Package Name**: textract
- **Language**: Python
- **Installation**: `pip install textract`

## Core Imports

```python
import textract
```

For accessing exceptions:

```python
from textract import exceptions
```

For accessing constants:

```python
from textract.parsers import DEFAULT_OUTPUT_ENCODING, EXTENSION_SYNONYMS
```

For accessing color utilities:

```python
from textract.colors import red, green, blue
```

## Basic Usage

```python
import textract

# Extract text from any supported file format.
# process() returns the text encoded as bytes (UTF-8 by default),
# so decode it before printing.
text = textract.process('/path/to/document.pdf')
print(text.decode('utf-8'))

# Extract with specific encoding
text = textract.process('/path/to/document.docx', output_encoding='utf-8')

# Extract with parser-specific options
text = textract.process('/path/to/document.pdf', method='pdfminer')

# Extract with language specification for OCR
text = textract.process('/path/to/image.png', language='eng')

# Handle files without extensions
text = textract.process('/path/to/file', extension='.txt')
```

## Architecture

Textract is built on a modular parser architecture that provides:

- **Unified Interface**: A single `process()` function handles every supported file type
- **Format Detection**: The file type is detected from the file extension, and the `extension` argument overrides it
- **Parser Registry**: An extensible set of specialized parsers covering 25+ document formats
- **Method Selection**: Certain formats (PDFs, audio, images) offer more than one extraction method
- **Encoding Handling**: Input and output encodings are configurable, and output defaults to `utf_8`
- **External Tool Integration**: Parsers call external tools such as tesseract, pdftotext, and antiword

With this design, one function call covers the common case for every supported format, and keyword options cover specialized cases.
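
The dispatch idea described above can be sketched with the standard library alone. This is a simplified illustration, not textract's implementation: the `PARSERS` registry, the `register` decorator, the two toy parsers, and `dispatch` are all hypothetical names introduced here.

```python
import os
from typing import Callable, Dict, Optional

# Hypothetical registry: maps a lowercase extension to a parser callable.
PARSERS: Dict[str, Callable[[bytes], str]] = {}


def register(extension: str):
    """Decorator that records a parser function for one extension."""
    def decorator(func):
        PARSERS[extension] = func
        return func
    return decorator


@register(".txt")
def parse_txt(data: bytes) -> str:
    # Plain text only needs decoding.
    return data.decode("utf-8")


@register(".csv")
def parse_csv(data: bytes) -> str:
    # Toy parser: turn comma-separated cells into space-separated text.
    lines = data.decode("utf-8").splitlines()
    return "\n".join(" ".join(line.split(",")) for line in lines)


def dispatch(filename: str, data: bytes, extension: Optional[str] = None) -> str:
    """Pick a parser from the extension (or the explicit override) and run it."""
    ext = (extension or os.path.splitext(filename)[1]).lower()
    try:
        parser = PARSERS[ext]
    except KeyError:
        raise ValueError(f"extension not supported: {ext!r}") from None
    return parser(data)
```

Keeping the lookup in one place is what lets a single entry point serve every format: adding a format means registering one more callable, and callers never change.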
68

## Capabilities

### Text Extraction

The core function extracts text from any supported document format, with automatic format detection and method selection.

```python { .api }
def process(filename, input_encoding=None, output_encoding='utf_8', extension=None, **kwargs):
    """
    Extract text from any supported document format.

    Parameters:
    - filename (str): Path to the file to extract text from
    - input_encoding (str, optional): Input encoding specification
    - output_encoding (str): Output encoding (default: 'utf_8')
    - extension (str, optional): Manual extension override for format detection
    - **kwargs: Parser-specific options including:
        - method (str): Extraction method ('pdftotext', 'pdfminer', 'tesseract', 'google', 'sphinx')
        - language (str): Language code for OCR (e.g., 'eng', 'fra', 'deu')
        - layout (bool): Preserve layout in PDF extraction (pdftotext method)

    Returns:
    bytes: Extracted text content, encoded with output_encoding

    Raises:
    - ExtensionNotSupported: When file extension is not supported
    - MissingFileError: When specified file cannot be found
    - UnknownMethod: When specified extraction method is unknown
    - ShellError: When external command execution fails
    """
```
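
When a `str` is needed, a thin wrapper can decode the result. The sketch below assumes the extractor returns `bytes` encoded with `output_encoding` and also tolerates a `str` result. `extract_as_str` is a hypothetical helper, and the extractor is passed in as an argument so the snippet runs without textract installed.

```python
from typing import Callable, Union


def extract_as_str(process: Callable[..., Union[bytes, str]],
                   filename: str,
                   encoding: str = "utf_8",
                   **kwargs) -> str:
    """Call an extractor and always hand back a str."""
    result = process(filename, output_encoding=encoding, **kwargs)
    if isinstance(result, bytes):
        # Decode with the same encoding that was requested for the output.
        return result.decode(encoding)
    return result
```

With textract installed, the call would look like `extract_as_str(textract.process, 'report.pdf', method='pdfminer')`.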

### Package Metadata

Package name and version identifiers for compatibility checking and debugging.

```python { .api }
__name__: str = "textract"
VERSION: str = "1.6.5"
```

### Error Handling

Exception classes raised by the library and the CLI.

```python { .api }
class CommandLineError(Exception):
    """Base exception class for CLI errors with suppressed tracebacks."""

    def render(self, msg: str) -> str:
        """
        Format error messages for display.

        Parameters:
        - msg (str): Message template with variable placeholders

        Returns:
        str: Formatted message string
        """

class ExtensionNotSupported(CommandLineError):
    """Raised when file extension is not supported."""

    def __init__(self, ext):
        """
        Parameters:
        - ext (str): The unsupported extension
        """

class MissingFileError(CommandLineError):
    """Raised when specified file cannot be found."""

    def __init__(self, filename):
        """
        Parameters:
        - filename (str): The missing file path
        """

class UnknownMethod(CommandLineError):
    """Raised when specified extraction method is unknown."""

    def __init__(self, method):
        """
        Parameters:
        - method (str): The unknown method name
        """

class ShellError(CommandLineError):
    """Raised when shell command execution fails."""

    def __init__(self, command, exit_code, stdout, stderr):
        """
        Parameters:
        - command (str): Command that failed
        - exit_code (int): Process exit code
        - stdout (str): Standard output
        - stderr (str): Standard error
        """

    def is_not_installed(self):
        """Check if error is due to missing executable."""

    def not_installed_message(self):
        """Get missing dependency message."""

    def failed_message(self):
        """Get command failure message."""
```
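
To make "message template with variable placeholders" concrete, here is one way such a `render` method could work. It is a stand-in written for illustration, not textract's source: the `Demo*` class names, the `%(name)s` placeholder style, and the message wording are assumptions.

```python
class DemoCommandLineError(Exception):
    """Stand-in for the documented base class (hypothetical)."""

    def render(self, msg: str) -> str:
        # Assumption: placeholders are %-style named fields that are
        # filled from the attributes stored on the exception instance.
        return msg % vars(self)


class DemoMissingFileError(DemoCommandLineError):
    """Stand-in for an error that carries the missing path."""

    def __init__(self, filename):
        super().__init__(filename)
        self.filename = filename

    def __str__(self):
        return self.render('The file "%(filename)s" cannot be found.')
```

Storing the details as attributes keeps them available to callers (`exc.filename`), while `__str__` produces the user-facing message.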

### Parser Constants

Constants for encoding and extension handling used throughout the parsing system.

```python { .api }
EXTENSION_SYNONYMS: dict = {
    ".jpeg": ".jpg",
    ".tff": ".tiff",
    ".tif": ".tiff",
    ".htm": ".html",
    "": ".txt",
    ".log": ".txt",
    ".tab": ".tsv"
}

DEFAULT_OUTPUT_ENCODING: str = 'utf_8'

DEFAULT_ENCODING: str = 'utf_8'
```
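
A synonym table like this is typically applied right after the extension is read from the filename. The sketch below copies the table so it runs without textract installed. `normalize_extension` is a hypothetical helper, and accepting an override without the leading dot is an assumption made for convenience.

```python
import os
from typing import Optional

# Copied from the table above so the sketch is self-contained.
EXTENSION_SYNONYMS = {
    ".jpeg": ".jpg",
    ".tff": ".tiff",
    ".tif": ".tiff",
    ".htm": ".html",
    "": ".txt",
    ".log": ".txt",
    ".tab": ".tsv",
}


def normalize_extension(filename: str, extension: Optional[str] = None) -> str:
    """Return the canonical extension used to choose a parser."""
    ext = extension if extension is not None else os.path.splitext(filename)[1]
    if ext and not ext.startswith("."):
        ext = "." + ext  # assumption: accept "tab" as well as ".tab"
    ext = ext.lower()
    # Unknown extensions pass through unchanged.
    return EXTENSION_SYNONYMS.get(ext, ext)
```

Note the `""` entry: a file with no extension at all is treated as plain text.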

### Color Utilities

Terminal color formatting functions for CLI output.

```python { .api }
red: function
"""Apply red ANSI color codes to text string."""

green: function
"""Apply green ANSI color codes to text string."""

yellow: function
"""Apply yellow ANSI color codes to text string."""

blue: function
"""Apply blue ANSI color codes to text string."""

magenta: function
"""Apply magenta ANSI color codes to text string."""

cyan: function
"""Apply cyan ANSI color codes to text string."""

white: function
"""Apply white ANSI color codes to text string."""

bold_red: function
"""Apply bold red ANSI color codes to text string."""

bold_green: function
"""Apply bold green ANSI color codes to text string."""

bold_yellow: function
"""Apply bold yellow ANSI color codes to text string."""

bold_blue: function
"""Apply bold blue ANSI color codes to text string."""

bold_magenta: function
"""Apply bold magenta ANSI color codes to text string."""

bold_cyan: function
"""Apply bold cyan ANSI color codes to text string."""

bold_white: function
"""Apply bold white ANSI color codes to text string."""

def colorless(text: str) -> str:
    """
    Remove ANSI color codes from text.

    Parameters:
    - text (str): Text containing ANSI color codes

    Returns:
    str: Text with color codes removed
    """
```
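
For readers unfamiliar with ANSI color codes, the following standard-library sketch shows the general technique behind helpers like these. The names `make_color`, `demo_red`, `demo_bold_green`, and `demo_colorless` are hypothetical, and the exact escape sequences textract emits may differ.

```python
import re

RESET = "\x1b[0m"  # ANSI sequence that restores the default style


def make_color(code: str):
    """Build a function that wraps text in one ANSI style code."""
    def colorize(text: str) -> str:
        return f"\x1b[{code}m{text}{RESET}"
    return colorize


demo_red = make_color("31")          # 31 = red foreground
demo_bold_green = make_color("1;32")  # 1 = bold, 32 = green foreground

# Matches style sequences such as "\x1b[31m" or "\x1b[1;32m".
_ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")


def demo_colorless(text: str) -> str:
    """Strip ANSI style sequences, leaving only the visible text."""
    return _ANSI_RE.sub("", text)
```

Stripping the codes is useful when the same message goes both to a terminal and to a log file.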

## Supported File Formats

Textract supports 25 distinct file formats through specialized parsers:

### Document Formats
- **`.txt`** - Plain text files (direct reading)
- **`.doc`** - Microsoft Word documents (via antiword/catdoc)
- **`.docx`** - Microsoft Word XML documents (via docx2txt)
- **`.pdf`** - PDF documents (multiple methods: pdftotext, pdfminer, tesseract OCR)
- **`.rtf`** - Rich Text Format (via unrtf)
- **`.odt`** - OpenDocument Text (via odt2txt)
- **`.epub`** - Electronic publication format (via zipfile + BeautifulSoup)
- **`.html/.htm`** - HTML documents (via BeautifulSoup with table parsing)

### Spreadsheet Formats
- **`.xls`** - Excel 97-2003 format (via xlrd)
- **`.xlsx`** - Excel 2007+ format (via xlrd)
- **`.csv`** - Comma-separated values (via csv module)
- **`.tsv`** - Tab-separated values (via csv module)
- **`.psv`** - Pipe-separated values (via csv module)

### Presentation Formats
- **`.pptx`** - PowerPoint presentations (via pptx)

### Image Formats (OCR)
- **`.jpg/.jpeg`** - JPEG images (via tesseract OCR)
- **`.png`** - PNG images (via tesseract OCR)
- **`.gif`** - GIF images (via tesseract OCR)
- **`.tiff/.tif`** - TIFF images (via tesseract OCR)

### Audio Formats (Speech Recognition)
- **`.wav`** - WAV audio files (via SpeechRecognition)
- **`.mp3`** - MP3 audio files (converted to WAV then processed)
- **`.ogg`** - OGG audio files (converted to WAV then processed)

### Email Formats
- **`.eml`** - Email message files (via email.parser)
- **`.msg`** - Outlook message files (via msg-extractor)

### Other Formats
- **`.json`** - JSON files (extracts all string values recursively)
- **`.ps`** - PostScript files (via ps2ascii)
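
The `.json` entry mentions recursive extraction of string values. A minimal sketch of that idea follows. It is an illustration under assumptions, not textract's parser: the names `iter_strings` and `json_to_text` are hypothetical, keys and non-string values are skipped, and values are joined with single spaces.

```python
import json
from typing import Any, Iterator


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value found anywhere in a decoded JSON tree."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():  # keys are ignored in this sketch
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    # Numbers, booleans, and null carry no text and are skipped.


def json_to_text(raw: str) -> str:
    """Decode a JSON document and join its string values with spaces."""
    return " ".join(iter_strings(json.loads(raw)))
```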

## Parser Method Options

Several file formats support multiple extraction methods via the `method` parameter:

### PDF Extraction Methods
```python
# Default method using pdftotext utility
text = textract.process('document.pdf', method='pdftotext')

# Use pdfminer library for extraction
text = textract.process('document.pdf', method='pdfminer')

# OCR-based extraction for scanned PDFs
text = textract.process('document.pdf', method='tesseract')

# Preserve layout with pdftotext
text = textract.process('document.pdf', method='pdftotext', layout=True)
```
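
Because scanned PDFs often yield no text with the non-OCR methods, callers sometimes try the methods in order and keep the first non-empty result. The helper below is one possible sketch of that pattern. `first_nonempty` is a hypothetical name, and the extractor and the exception types to tolerate are passed in so the snippet runs without textract installed.

```python
from typing import Callable, Iterable, Tuple, Type


def first_nonempty(process: Callable,
                   filename: str,
                   methods: Iterable[str] = ("pdftotext", "pdfminer", "tesseract"),
                   errors: Tuple[Type[BaseException], ...] = (Exception,),
                   **kwargs):
    """Try each method in order; return (method, result) for the first
    call that produces non-whitespace output."""
    last_error = None
    for method in methods:
        try:
            result = process(filename, method=method, **kwargs)
        except errors as exc:
            last_error = exc  # remember the failure and try the next method
            continue
        if result.strip():
            return method, result
    if last_error is not None:
        raise last_error
    raise ValueError("no method produced any text")
```

With textract installed, a call might look like `first_nonempty(textract.process, 'scan.pdf', errors=(textract.exceptions.ShellError,))`. Ordering the methods from cheapest to most expensive keeps OCR as a last resort.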

### Audio Recognition Methods
```python
# Google Speech Recognition (default)
text = textract.process('audio.wav', method='google')

# PocketSphinx offline recognition
text = textract.process('audio.wav', method='sphinx')
```

### Image OCR Options
```python
# Specify language for OCR recognition
text = textract.process('image.png', language='eng')  # English
text = textract.process('image.png', language='fra')  # French
text = textract.process('image.png', language='deu')  # German
```

## Command-Line Interface

Textract also installs a `textract` command that exposes the same options as the Python API:

```bash
# Basic text extraction
textract document.pdf

# Specify output encoding
textract --encoding utf-8 document.docx

# Override file extension detection
textract --extension .txt unknown_file

# Use specific extraction method
textract --method pdfminer document.pdf

# Save output to file
textract --output extracted.txt document.pdf

# Use parser-specific options
textract --option layout=True document.pdf

# Show version information
textract --version
```

### CLI Options
- **`filename`** - Required input file path
- **`-e/--encoding`** - Output encoding specification
- **`--extension`** - Manual extension override for format detection
- **`-m/--method`** - Extraction method selection
- **`-o/--output`** - Output file specification
- **`-O/--option`** - Parser options in KEYWORD=VALUE format
- **`-v/--version`** - Display version information
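
To show how KEYWORD=VALUE options can map onto keyword arguments, here is a small `argparse` sketch that mirrors the flags listed above. It is not textract's CLI code: `parse_option`, `build_parser`, `to_kwargs`, and the `textract-demo` program name are hypothetical, and option values are deliberately left as strings.

```python
import argparse


def parse_option(pair: str):
    """Split one KEYWORD=VALUE argument into a (keyword, value) tuple."""
    key, sep, value = pair.partition("=")
    if not sep or not key:
        raise argparse.ArgumentTypeError(f"expected KEYWORD=VALUE, got {pair!r}")
    return key, value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="textract-demo")
    parser.add_argument("filename")
    parser.add_argument("-e", "--encoding", default="utf_8")
    parser.add_argument("--extension")
    parser.add_argument("-m", "--method")
    parser.add_argument("-o", "--output")
    parser.add_argument("-O", "--option", type=parse_option,
                        action="append", default=[])
    return parser


def to_kwargs(argv):
    """Turn CLI arguments into (filename, keyword arguments)."""
    args = build_parser().parse_args(argv)
    kwargs = dict(args.option)  # values stay strings, e.g. {"layout": "True"}
    kwargs["output_encoding"] = args.encoding
    if args.extension:
        kwargs["extension"] = args.extension
    if args.method:
        kwargs["method"] = args.method
    return args.filename, kwargs
```

For example, `to_kwargs(["-m", "pdftotext", "-O", "layout=True", "doc.pdf"])` yields the filename plus a dictionary that could be splatted into an extraction call.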