
# Textract

Textract is a Python library for extracting text from documents through one interface, so callers do not have to handle each format's tooling themselves. It detects the file type from the extension and dispatches to a matching parser for 25+ document formats, including PDFs, Word documents, images, and audio files.

## Package Information

- **Package Name**: textract
- **Language**: Python
- **Installation**: `pip install textract`

## Core Imports

```python
import textract
```

For accessing exceptions:

```python
from textract import exceptions
```

For accessing constants:

```python
from textract.parsers import DEFAULT_OUTPUT_ENCODING, EXTENSION_SYNONYMS
```

For accessing color utilities:

```python
from textract.colors import red, green, blue
```

## Basic Usage

```python
import textract

# Extract text from any supported file format
text = textract.process('/path/to/document.pdf')
print(text)

# Extract with specific encoding
text = textract.process('/path/to/document.docx', output_encoding='utf-8')

# Extract with parser-specific options
text = textract.process('/path/to/document.pdf', method='pdfminer')

# Extract with language specification for OCR
text = textract.process('/path/to/image.png', language='eng')

# Handle files without extensions
text = textract.process('/path/to/file', extension='.txt')
```
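
Depending on the textract release and Python version, `process()` may hand back a byte string encoded with `output_encoding` rather than a `str`. That is my understanding of upstream behaviour on Python 3, so treat it as an assumption. A minimal sketch of a normalising helper follows; `to_text` is a hypothetical name and not part of textract.

```python
def to_text(result, encoding="utf-8"):
    """Return `result` as str, decoding it first if it is a byte string.

    Hypothetical helper for illustration only; it is not part of textract.
    `encoding` should match the `output_encoding` passed to `process()`.
    """
    if isinstance(result, (bytes, bytearray)):
        return bytes(result).decode(encoding)
    return result


# Usage sketch (needs textract and its system dependencies installed):
#   import textract
#   raw = textract.process('/path/to/document.pdf', output_encoding='utf-8')
#   print(to_text(raw, 'utf-8'))
```

Accepting both `bytes` and `str` keeps calling code the same whichever type a given release returns.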

## Architecture

Textract is built on a modular parser architecture that provides:

- **Unified Interface**: A single `process()` function handles all supported file types
- **Format Detection**: The file type is detected from the extension, which can be overridden
- **Parser Registry**: Extensible system supporting 25+ document formats via specialized parsers
- **Method Selection**: Multiple extraction methods for certain formats (PDFs, audio, images)
- **Encoding Handling**: Configurable input and output encodings, with UTF-8 output by default
- **External Tool Integration**: Delegates to command-line tools such as tesseract, pdftotext, and antiword

This design lets users extract text from any supported format with a single function call, while still exposing parser-specific options for specialized use cases.
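
The registry idea can be illustrated with a small stand-alone sketch. It is not textract's actual implementation: `PARSERS`, `register`, `parse_txt`, and this local `process` are hypothetical names, and raising `ValueError` for an unknown extension is a simplification.

```python
import os

# Hypothetical registry: file extension -> callable returning the file's text.
PARSERS = {}


def register(extension):
    """Decorator that adds a parser function to the registry."""
    def decorator(func):
        PARSERS[extension] = func
        return func
    return decorator


@register(".txt")
def parse_txt(filename, input_encoding=None, **kwargs):
    """Plain-text parser: read the file directly."""
    with open(filename, "r", encoding=input_encoding or "utf-8") as handle:
        return handle.read()


def process(filename, extension=None, **kwargs):
    """Pick a parser by extension (or by the explicit override) and run it."""
    ext = extension if extension is not None else os.path.splitext(filename)[1]
    ext = ext.lower()
    try:
        parser = PARSERS[ext]
    except KeyError:
        raise ValueError(f"extension not supported: {ext!r}") from None
    return parser(filename, **kwargs)
```

Supporting a new format in this sketch means registering one more function; the dispatch code stays unchanged.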

## Capabilities

### Text Extraction

The core function extracts text from any supported document format. It detects the format from the file extension and uses that format's default extraction method unless one is specified.

```python { .api }
def process(filename, input_encoding=None, output_encoding='utf_8', extension=None, **kwargs):
    """
    Extract text from any supported document format.

    Parameters:
    - filename (str): Path to the file to extract text from
    - input_encoding (str, optional): Input encoding specification
    - output_encoding (str): Output encoding (default: 'utf_8')
    - extension (str, optional): Manual extension override for format detection
    - **kwargs: Parser-specific options including:
        - method (str): Extraction method ('pdftotext', 'pdfminer', 'tesseract', 'google', 'sphinx')
        - language (str): Language code for OCR (e.g., 'eng', 'fra', 'deu')
        - layout (bool): Preserve layout in PDF extraction (pdftotext method)

    Returns:
    bytes: Extracted text, encoded with output_encoding (decode it to obtain str)

    Raises:
    - ExtensionNotSupported: When file extension is not supported
    - MissingFileError: When specified file cannot be found
    - UnknownMethod: When specified extraction method is unknown
    - ShellError: When external command execution fails
    """
```

### Package Metadata

Package name and version identifiers for compatibility checking and debugging.

```python { .api }
__name__: str = "textract"
VERSION: str = "1.6.5"
```

### Error Handling

Exception classes for error handling and user feedback. All of them derive from `CommandLineError`.

```python { .api }
class CommandLineError(Exception):
    """Base exception class for CLI errors with suppressed tracebacks."""

    def render(self, msg: str) -> str:
        """
        Format error messages for display.

        Parameters:
        - msg (str): Message template with variable placeholders

        Returns:
        str: Formatted message string
        """

class ExtensionNotSupported(CommandLineError):
    """Raised when file extension is not supported."""

    def __init__(self, ext):
        """
        Parameters:
        - ext (str): The unsupported extension
        """

class MissingFileError(CommandLineError):
    """Raised when specified file cannot be found."""

    def __init__(self, filename):
        """
        Parameters:
        - filename (str): The missing file path
        """

class UnknownMethod(CommandLineError):
    """Raised when specified extraction method is unknown."""

    def __init__(self, method):
        """
        Parameters:
        - method (str): The unknown method name
        """

class ShellError(CommandLineError):
    """Raised when shell command execution fails."""

    def __init__(self, command, exit_code, stdout, stderr):
        """
        Parameters:
        - command (str): Command that failed
        - exit_code (int): Process exit code
        - stdout (str): Standard output
        - stderr (str): Standard error
        """

    def is_not_installed(self):
        """Check if error is due to missing executable."""

    def not_installed_message(self):
        """Get missing dependency message."""

    def failed_message(self):
        """Get command failure message."""
```

### Parser Constants

Constants for encoding and extension handling used throughout the parsing system.

```python { .api }
EXTENSION_SYNONYMS: dict = {
    ".jpeg": ".jpg",
    ".tff": ".tiff",
    ".tif": ".tiff",
    ".htm": ".html",
    "": ".txt",
    ".log": ".txt",
    ".tab": ".tsv"
}

DEFAULT_OUTPUT_ENCODING: str = 'utf_8'

DEFAULT_ENCODING: str = 'utf_8'
```
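
To show how a synonym table like this can be applied, here is a sketch that resolves a filename to a canonical extension. The dictionary is copied from the listing above; `canonical_extension` is a hypothetical helper, and the exact lookup order inside textract is an assumption.

```python
import os

# Copied from the constants listed above.
EXTENSION_SYNONYMS = {
    ".jpeg": ".jpg",
    ".tff": ".tiff",
    ".tif": ".tiff",
    ".htm": ".html",
    "": ".txt",
    ".log": ".txt",
    ".tab": ".tsv",
}


def canonical_extension(filename, extension=None):
    """Return the lower-cased, synonym-resolved extension for `filename`.

    A non-empty `extension` overrides whatever the filename ends with,
    and may be given with or without the leading dot.
    """
    ext = extension if extension else os.path.splitext(filename)[1]
    if ext and not ext.startswith("."):
        ext = "." + ext
    ext = ext.lower()
    return EXTENSION_SYNONYMS.get(ext, ext)
```

Note the empty-string entry: a file with no extension resolves to `.txt` and is therefore treated as plain text.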

### Color Utilities

Terminal color formatting functions for enhanced CLI output and user interfaces.

```python { .api }
red: function
"""Apply red ANSI color codes to text string."""

green: function
"""Apply green ANSI color codes to text string."""

yellow: function
"""Apply yellow ANSI color codes to text string."""

blue: function
"""Apply blue ANSI color codes to text string."""

magenta: function
"""Apply magenta ANSI color codes to text string."""

cyan: function
"""Apply cyan ANSI color codes to text string."""

white: function
"""Apply white ANSI color codes to text string."""

bold_red: function
"""Apply bold red ANSI color codes to text string."""

bold_green: function
"""Apply bold green ANSI color codes to text string."""

bold_yellow: function
"""Apply bold yellow ANSI color codes to text string."""

bold_blue: function
"""Apply bold blue ANSI color codes to text string."""

bold_magenta: function
"""Apply bold magenta ANSI color codes to text string."""

bold_cyan: function
"""Apply bold cyan ANSI color codes to text string."""

bold_white: function
"""Apply bold white ANSI color codes to text string."""

def colorless(text: str) -> str:
    """
    Remove ANSI color codes from text.

    Parameters:
    - text (str): Text containing ANSI color codes

    Returns:
    str: Text with color codes removed
    """
```
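
For readers unfamiliar with ANSI colors, the following stand-in shows the general technique of wrapping text in escape sequences and stripping them again. It is not the code in `textract.colors`; `colorize` and `strip_colors` are hypothetical names, and the exact sequences textract emits may differ.

```python
import re

# Matches SGR (Select Graphic Rendition) sequences such as "\x1b[31m".
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def colorize(text, code, bold=False):
    """Wrap `text` in an SGR color sequence followed by a reset."""
    prefix = "1;" if bold else ""
    return f"\x1b[{prefix}{code}m{text}\x1b[0m"


def strip_colors(text):
    """Remove every SGR sequence, leaving only the visible characters."""
    return ANSI_SGR.sub("", text)


# 31 is the standard ANSI foreground code for red.
warning = colorize("missing dependency", 31, bold=True)
plain = strip_colors(warning)
```

Stripping is useful when colored CLI output is redirected to a log file or compared in tests.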

## Supported File Formats

Textract supports 25 distinct file formats through specialized parsers:

### Document Formats
- **`.txt`** - Plain text files (direct reading)
- **`.doc`** - Microsoft Word documents (via antiword/catdoc)
- **`.docx`** - Microsoft Word XML documents (via docx2txt)
- **`.pdf`** - PDF documents (multiple methods: pdftotext, pdfminer, tesseract OCR)
- **`.rtf`** - Rich Text Format (via unrtf)
- **`.odt`** - OpenDocument Text (via odt2txt)
- **`.epub`** - Electronic publication format (via zipfile + BeautifulSoup)
- **`.html/.htm`** - HTML documents (via BeautifulSoup with table parsing)

### Spreadsheet Formats
- **`.xls`** - Excel 97-2003 format (via xlrd)
- **`.xlsx`** - Excel 2007+ format (via xlrd)
- **`.csv`** - Comma-separated values (via csv module)
- **`.tsv`** - Tab-separated values (via csv module)
- **`.psv`** - Pipe-separated values (via csv module)

### Presentation Formats
- **`.pptx`** - PowerPoint presentations (via pptx)

### Image Formats (OCR)
- **`.jpg/.jpeg`** - JPEG images (via tesseract OCR)
- **`.png`** - PNG images (via tesseract OCR)
- **`.gif`** - GIF images (via tesseract OCR)
- **`.tiff/.tif`** - TIFF images (via tesseract OCR)

### Audio Formats (Speech Recognition)
- **`.wav`** - WAV audio files (via SpeechRecognition)
- **`.mp3`** - MP3 audio files (converted to WAV then processed)
- **`.ogg`** - OGG audio files (converted to WAV then processed)

### Email Formats
- **`.eml`** - Email message files (via email.parser)
- **`.msg`** - Outlook message files (via msg-extractor)

### Other Formats
- **`.json`** - JSON files (extracts all string values recursively)
- **`.ps`** - PostScript files (via ps2ascii)
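
The recursive string extraction described for `.json` can be sketched as below. This illustrates the technique only; `iter_strings` and `json_to_text` are hypothetical names, and skipping dictionary keys and joining values with spaces are assumptions rather than documented textract behaviour.

```python
import json


def iter_strings(node):
    """Yield every string value found anywhere inside a decoded JSON value."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    # Numbers, booleans and null carry no text and are skipped.


def json_to_text(raw):
    """Parse a JSON document and join its string values with spaces."""
    return " ".join(iter_strings(json.loads(raw)))
```

For example, `json_to_text('{"title": "Report", "tags": ["a", "b"]}')` yields the three string values in document order.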

## Parser Method Options

Several file formats support multiple extraction methods via the `method` parameter:

### PDF Extraction Methods
```python
# Default method using pdftotext utility
text = textract.process('document.pdf', method='pdftotext')

# Use pdfminer library for extraction
text = textract.process('document.pdf', method='pdfminer')

# OCR-based extraction for scanned PDFs
text = textract.process('document.pdf', method='tesseract')

# Preserve layout with pdftotext
text = textract.process('document.pdf', method='pdftotext', layout=True)
```
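
Because scanned PDFs often yield no text with `pdftotext`, one common pattern is to try the methods in turn. The sketch below assumes that an empty result means "try the next method"; `extract_with_fallback` is a hypothetical helper, and `process` is passed in so the function works with `textract.process` or any callable of the same shape.

```python
def extract_with_fallback(path, process, methods=("pdftotext", "pdfminer", "tesseract")):
    """Try each method in order and return (method, result) for the first
    one that produces non-blank output.

    `process` is any callable accepting (path, method=...), for example
    `textract.process`. Failures are collected so the final error can say
    which methods raised.
    """
    errors = {}
    for method in methods:
        try:
            result = process(path, method=method)
        except Exception as exc:  # e.g. a missing external tool
            errors[method] = exc
            continue
        if result and result.strip():
            return method, result
    raise RuntimeError(
        f"no method produced text for {path!r}; methods that raised: {sorted(errors)}"
    )


# Usage sketch (needs textract installed):
#   import textract
#   method, text = extract_with_fallback('document.pdf', textract.process)
```

Catching `Exception` broadly keeps the sketch short; real code would more likely catch the specific textract exception classes.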

### Audio Recognition Methods
```python
# Google Speech Recognition (default)
text = textract.process('audio.wav', method='google')

# PocketSphinx offline recognition
text = textract.process('audio.wav', method='sphinx')
```

### Image OCR Options
```python
# Specify language for OCR recognition
text = textract.process('image.png', language='eng')  # English
text = textract.process('image.png', language='fra')  # French
text = textract.process('image.png', language='deu')  # German
```

## Command-Line Interface

Textract also installs a `textract` command that exposes the same options as the Python API:

```bash
# Basic text extraction
textract document.pdf

# Specify output encoding
textract --encoding utf-8 document.docx

# Override file extension detection
textract --extension .txt unknown_file

# Use specific extraction method
textract --method pdfminer document.pdf

# Save output to file
textract --output extracted.txt document.pdf

# Use parser-specific options
textract --option layout=True document.pdf

# Show version information
textract --version
```

### CLI Options
- **`filename`** - Required input file path
- **`-e/--encoding`** - Output encoding specification
- **`--extension`** - Manual extension override for format detection
- **`-m/--method`** - Extraction method selection
- **`-o/--output`** - Output file specification
- **`-O/--option`** - Parser options in KEYWORD=VALUE format
- **`-v/--version`** - Display version information
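
When the CLI has to be driven from another Python program, the options above can be assembled into an argument list. This sketch uses only the flags listed here; `build_command` and `run_textract` are hypothetical helpers, and repeating `--option` once per key is an assumption.

```python
import subprocess


def build_command(filename, encoding=None, extension=None, method=None,
                  output=None, options=None):
    """Build the argv list for one `textract` CLI invocation."""
    cmd = ["textract"]
    if encoding:
        cmd += ["--encoding", encoding]
    if extension:
        cmd += ["--extension", extension]
    if method:
        cmd += ["--method", method]
    if output:
        cmd += ["--output", output]
    for key, value in (options or {}).items():
        cmd += ["--option", f"{key}={value}"]
    cmd.append(filename)
    return cmd


def run_textract(filename, **kwargs):
    """Run the CLI and return its standard output as bytes.

    Requires the `textract` executable on PATH; raises
    subprocess.CalledProcessError on a non-zero exit code.
    """
    completed = subprocess.run(
        build_command(filename, **kwargs), check=True, capture_output=True
    )
    return completed.stdout
```

Passing a list instead of a shell string avoids quoting problems with filenames that contain spaces.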