Extract text from any document format without worrying about underlying complexities.
npx @tessl/cli install tessl/pypi-textract@1.6.0

A comprehensive Python library for extracting text from any document format without worrying about underlying complexities. Textract provides a unified interface that automatically detects file types and applies appropriate extraction methods for 25+ document formats including PDFs, Word documents, images, audio files, and more.

pip install textract

import textract

For accessing exceptions:

from textract import exceptions

For accessing constants:

from textract.parsers import DEFAULT_OUTPUT_ENCODING, EXTENSION_SYNONYMS

For accessing color utilities:

from textract.colors import red, green, blue

import textract
# Extract text from any supported file format
text = textract.process('/path/to/document.pdf')
print(text)
# Extract with specific encoding
text = textract.process('/path/to/document.docx', output_encoding='utf-8')
# Extract with parser-specific options
text = textract.process('/path/to/document.pdf', method='pdfminer')
# Extract with language specification for OCR
text = textract.process('/path/to/image.png', language='eng')
# Handle files without extensions
text = textract.process('/path/to/file', extension='.txt')

Textract is built on a modular parser architecture that provides:

- A single process() function that handles all file types automatically

This design enables users to extract text from virtually any document format with a single function call while providing advanced options for specialized use cases.
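As a rough illustration of how a single entry point can route files to format-specific parsers, the sketch below resolves a filename (or an explicit extension override) to a normalized extension. The helper name resolve_extension and the local SYNONYMS table are assumptions made for this example: the table copies the EXTENSION_SYNONYMS values listed in this document, and the logic is not taken from textract's source.

```python
import os

# Illustrative copy of the EXTENSION_SYNONYMS mapping listed in this document.
SYNONYMS = {
    ".jpeg": ".jpg",
    ".tff": ".tiff",
    ".tif": ".tiff",
    ".htm": ".html",
    "": ".txt",
    ".log": ".txt",
    ".tab": ".tsv",
}


def resolve_extension(filename, extension=None):
    """Return the normalized extension used to pick a parser (hypothetical helper)."""
    if extension:
        # An explicit override wins; accept both "txt" and ".txt".
        ext = extension if extension.startswith(".") else "." + extension
    else:
        ext = os.path.splitext(filename)[1]
    ext = ext.lower()
    # Collapse synonyms so that, for example, .jpeg and .jpg share one parser.
    return SYNONYMS.get(ext, ext)


print(resolve_extension("scan.JPEG"))                  # .jpg
print(resolve_extension("notes"))                      # .txt
print(resolve_extension("data.bin", extension="csv"))  # .csv
```

Normalizing first keeps the parser lookup itself trivial: each canonical extension maps to exactly one parser, and files without an extension fall back to plain-text handling.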
The core functionality for extracting text from any supported document format with automatic format detection and method selection.
def process(filename, input_encoding=None, output_encoding='utf_8', extension=None, **kwargs):
    """
    Extract text from any supported document format.

    Parameters:
    - filename (str): Path to the file to extract text from
    - input_encoding (str, optional): Input encoding specification
    - output_encoding (str): Output encoding (default: 'utf_8')
    - extension (str, optional): Manual extension override for format detection
    - **kwargs: Parser-specific options including:
        - method (str): Extraction method ('pdftotext', 'pdfminer', 'tesseract', 'google', 'sphinx')
        - language (str): Language code for OCR (e.g., 'eng', 'fra', 'deu')
        - layout (bool): Preserve layout in PDF extraction (pdftotext method)

    Returns:
    str: Extracted text content

    Raises:
    - ExtensionNotSupported: When file extension is not supported
    - MissingFileError: When specified file cannot be found
    - UnknownMethod: When specified extraction method is unknown
    - ShellError: When external command execution fails
    """

Package name and version identifiers for compatibility checking and debugging.
__name__: str = "textract"
VERSION: str = "1.6.5"

Comprehensive exception classes for robust error handling and user feedback.
class CommandLineError(Exception):
    """Base exception class for CLI errors with suppressed tracebacks."""

    def render(self, msg: str) -> str:
        """
        Format error messages for display.

        Parameters:
        - msg (str): Message template with variable placeholders

        Returns:
        str: Formatted message string
        """

class ExtensionNotSupported(CommandLineError):
    """Raised when file extension is not supported."""

    def __init__(self, ext):
        """
        Parameters:
        - ext (str): The unsupported extension
        """

class MissingFileError(CommandLineError):
    """Raised when specified file cannot be found."""

    def __init__(self, filename):
        """
        Parameters:
        - filename (str): The missing file path
        """

class UnknownMethod(CommandLineError):
    """Raised when specified extraction method is unknown."""

    def __init__(self, method):
        """
        Parameters:
        - method (str): The unknown method name
        """

class ShellError(CommandLineError):
    """Raised when shell command execution fails."""

    def __init__(self, command, exit_code, stdout, stderr):
        """
        Parameters:
        - command (str): Command that failed
        - exit_code (int): Process exit code
        - stdout (str): Standard output
        - stderr (str): Standard error
        """

    def is_not_installed(self):
        """Check if error is due to missing executable."""

    def not_installed_message(self):
        """Get missing dependency message."""

    def failed_message(self):
        """Get command failure message."""

Constants for encoding and extension handling used throughout the parsing system.
EXTENSION_SYNONYMS: dict = {
".jpeg": ".jpg",
".tff": ".tiff",
".tif": ".tiff",
".htm": ".html",
"": ".txt",
".log": ".txt",
".tab": ".tsv"
}
DEFAULT_OUTPUT_ENCODING: str = 'utf_8'
DEFAULT_ENCODING: str = 'utf_8'

Terminal color formatting functions for enhanced CLI output and user interfaces.
red: function
"""Apply red ANSI color codes to text string."""
green: function
"""Apply green ANSI color codes to text string."""
yellow: function
"""Apply yellow ANSI color codes to text string."""
blue: function
"""Apply blue ANSI color codes to text string."""
magenta: function
"""Apply magenta ANSI color codes to text string."""
cyan: function
"""Apply cyan ANSI color codes to text string."""
white: function
"""Apply white ANSI color codes to text string."""
bold_red: function
"""Apply bold red ANSI color codes to text string."""
bold_green: function
"""Apply bold green ANSI color codes to text string."""
bold_yellow: function
"""Apply bold yellow ANSI color codes to text string."""
bold_blue: function
"""Apply bold blue ANSI color codes to text string."""
bold_magenta: function
"""Apply bold magenta ANSI color codes to text string."""
bold_cyan: function
"""Apply bold cyan ANSI color codes to text string."""
bold_white: function
"""Apply bold white ANSI color codes to text string."""
def colorless(text: str) -> str:
    """
    Remove ANSI color codes from text.

    Parameters:
    - text (str): Text containing ANSI color codes

    Returns:
    str: Text with color codes removed
    """

Textract supports 25 distinct file formats through specialized parsers:

- .txt - Plain text files (direct reading)
- .doc - Microsoft Word documents (via antiword/catdoc)
- .docx - Microsoft Word XML documents (via docx2txt)
- .pdf - PDF documents (multiple methods: pdftotext, pdfminer, tesseract OCR)
- .rtf - Rich Text Format (via unrtf)
- .odt - OpenDocument Text (via odt2txt)
- .epub - Electronic publication format (via zipfile + BeautifulSoup)
- .html/.htm - HTML documents (via BeautifulSoup with table parsing)
- .xls - Excel 97-2003 format (via xlrd)
- .xlsx - Excel 2007+ format (via xlrd)
- .csv - Comma-separated values (via csv module)
- .tsv - Tab-separated values (via csv module)
- .psv - Pipe-separated values (via csv module)
- .pptx - PowerPoint presentations (via pptx)
- .jpg/.jpeg - JPEG images (via tesseract OCR)
- .png - PNG images (via tesseract OCR)
- .gif - GIF images (via tesseract OCR)
- .tiff/.tif - TIFF images (via tesseract OCR)
- .wav - WAV audio files (via SpeechRecognition)
- .mp3 - MP3 audio files (converted to WAV then processed)
- .ogg - OGG audio files (converted to WAV then processed)
- .eml - Email message files (via email.parser)
- .msg - Outlook message files (via msg-extractor)
- .json - JSON files (extracts all string values recursively)
- .ps - PostScript files (via ps2ascii)

Several file formats support multiple extraction methods via the method parameter:
# Default method using pdftotext utility
text = textract.process('document.pdf', method='pdftotext')
# Use pdfminer library for extraction
text = textract.process('document.pdf', method='pdfminer')
# OCR-based extraction for scanned PDFs
text = textract.process('document.pdf', method='tesseract')
# Preserve layout with pdftotext
text = textract.process('document.pdf', method='pdftotext', layout=True)

# Google Speech Recognition (default)
text = textract.process('audio.wav', method='google')
# PocketSphinx offline recognition
text = textract.process('audio.wav', method='sphinx')

# Specify language for OCR recognition
text = textract.process('image.png', language='eng') # English
text = textract.process('image.png', language='fra') # French
text = textract.process('image.png', language='deu') # German

Textract provides a full-featured CLI with the same capabilities as the Python API:
# Basic text extraction
textract document.pdf
# Specify output encoding
textract --encoding utf-8 document.docx
# Override file extension detection
textract --extension .txt unknown_file
# Use specific extraction method
textract --method pdfminer document.pdf
# Save output to file
textract --output extracted.txt document.pdf
# Use parser-specific options
textract --option layout=True document.pdf
# Show version information
textract --version

- filename - Required input file path
- -e/--encoding - Output encoding specification
- --extension - Manual extension override for format detection
- -m/--method - Extraction method selection
- -o/--output - Output file specification
- -O/--option - Parser options in KEYWORD=VALUE format
- -v/--version - Display version information
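The option list above can be mirrored in a small argparse sketch, which also shows one way -O KEYWORD=VALUE pairs might be folded into keyword arguments for process(). The names build_parser and options_to_kwargs are hypothetical, the flags follow the list above (the version flag is left out for brevity), and the remaining details (the literal_eval conversion, allowing -O to repeat) are assumptions rather than textract's actual CLI code.

```python
import argparse
import ast


def build_parser():
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="textract-sketch")
    parser.add_argument("filename")
    parser.add_argument("-e", "--encoding", default="utf_8")
    parser.add_argument("--extension")
    parser.add_argument("-m", "--method")
    parser.add_argument("-o", "--output")
    # Assumption: -O may be given several times, one KEYWORD=VALUE per use.
    parser.add_argument("-O", "--option", action="append", default=[],
                        metavar="KEYWORD=VALUE")
    return parser


def options_to_kwargs(pairs):
    """Turn ["layout=True", "language=eng"] into {"layout": True, "language": "eng"}."""
    kwargs = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError("expected KEYWORD=VALUE, got %r" % pair)
        try:
            # Interpret Python literals such as True, 3 or 'text'.
            kwargs[key] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything else (for example a bare word like eng) stays a string.
            kwargs[key] = raw
    return kwargs


args = build_parser().parse_args(
    ["-m", "pdftotext", "-O", "layout=True", "document.pdf"]
)
print(args.method, options_to_kwargs(args.option))
```

The resulting dictionary could then be passed on as `**kwargs`, alongside `method`, `extension` and the encoding, which is how the CLI and the Python API can expose the same capabilities.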