tessl/pypi-textract

Extract text from any document format without worrying about underlying complexities.

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:pypi/textract@1.6.x

To install, run

npx @tessl/cli install tessl/pypi-textract@1.6.0

Textract

A Python library for extracting text from documents through a single interface. Textract detects the file type from the file extension and applies a matching extraction method, covering 25+ document formats including PDFs, Word documents, images, and audio files.

Package Information

Package Name: textract
Language: Python
Installation: pip install textract

Core Imports

import textract

For accessing exceptions:

from textract import exceptions

For accessing constants:

from textract.parsers import DEFAULT_OUTPUT_ENCODING, EXTENSION_SYNONYMS

For accessing color utilities:

from textract.colors import red, green, blue

Basic Usage

import textract

# Extract text from any supported file format
text = textract.process('/path/to/document.pdf')
# On Python 3 the result is a byte string, so decode it before printing
print(text.decode('utf-8'))

# Extract with specific encoding
text = textract.process('/path/to/document.docx', output_encoding='utf-8')

# Extract with parser-specific options
text = textract.process('/path/to/document.pdf', method='pdfminer')

# Extract with language specification for OCR
text = textract.process('/path/to/image.png', language='eng')

# Handle files without extensions
text = textract.process('/path/to/file', extension='.txt')
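
Depending on the Python version, the value returned by process() may be a byte string encoded with output_encoding rather than a str. The helper below is a minimal sketch for normalizing either case; to_text is a hypothetical name and is not part of textract.

```python
def to_text(raw, output_encoding="utf_8"):
    """Return a str whether the extraction result is bytes or already str.

    raw is assumed to be the value returned by textract.process();
    output_encoding should match the encoding passed to process().
    """
    if isinstance(raw, bytes):
        return raw.decode(output_encoding)
    return raw
```

A call would look like to_text(textract.process(path)), with the same output_encoding passed to both functions when the default is overridden.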

Architecture

Textract is built on a modular parser architecture that provides:

Unified Interface: Single process() function handles all file types automatically
Format Detection: Automatic file type detection based on extensions with override support
Parser Registry: Extensible system supporting 25+ document formats via specialized parsers
Method Selection: Multiple extraction methods for certain formats (PDFs, audio, images)
Encoding Handling: Optional input encoding and a configurable output encoding that defaults to utf_8
External Tool Integration: Delegates to command-line tools such as tesseract, pdftotext, and antiword where a format requires them

With this design, a single function call extracts text from any supported format, and keyword options cover specialized cases such as choosing a PDF extraction method or an OCR language.
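
The registry-and-dispatch idea described above can be illustrated with a small sketch. This is not textract's actual implementation: the names _PARSERS, register, parse_text, and dispatch are hypothetical, and only a plain-text parser is registered.

```python
import os

# Hypothetical registry: maps a lowercase extension to an extraction callable.
_PARSERS = {}


def register(*extensions):
    """Decorator that registers a parser function for one or more extensions."""
    def decorator(func):
        for ext in extensions:
            _PARSERS[ext] = func
        return func
    return decorator


@register(".txt")
def parse_text(path):
    # Plain text needs no external tool: read the bytes directly.
    with open(path, "rb") as handle:
        return handle.read()


def dispatch(path, extension=None):
    """Pick a parser from the file extension, or from an explicit override."""
    ext = (extension or os.path.splitext(path)[1]).lower()
    try:
        parser = _PARSERS[ext]
    except KeyError:
        raise ValueError("no parser registered for %r" % ext)
    return parser(path)
```

Supporting a new format in such a design only requires registering one more function; the caller-facing dispatch function stays unchanged.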

Capabilities

Text Extraction

The core functionality for extracting text from any supported document format with automatic format detection and method selection.

def process(filename, input_encoding=None, output_encoding='utf_8', extension=None, **kwargs):
    """
    Extract text from any supported document format.
    
    Parameters:
    - filename (str): Path to the file to extract text from
    - input_encoding (str, optional): Input encoding specification
    - output_encoding (str): Output encoding (default: 'utf_8')
    - extension (str, optional): Manual extension override for format detection
    - **kwargs: Parser-specific options including:
        - method (str): Extraction method; valid values depend on the format ('pdftotext', 'pdfminer', 'tesseract' for PDFs; 'google', 'sphinx' for audio)
        - language (str): Language code for OCR (e.g., 'eng', 'fra', 'deu')
        - layout (bool): Preserve layout in PDF extraction (pdftotext method)
    
    Returns:
    bytes: Extracted text, encoded with output_encoding (a byte string on Python 3)
    
    Raises:
    - ExtensionNotSupported: When file extension is not supported
    - MissingFileError: When specified file cannot be found
    - UnknownMethod: When specified extraction method is unknown
    - ShellError: When external command execution fails
    """

Package Metadata

Package name and version identifiers for compatibility checking and debugging.

__name__: str = "textract"
VERSION: str = "1.6.5"
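
A compatibility check against the version string could look like the sketch below. The helper names version_tuple and is_compatible are hypothetical, and the parsing assumes a purely numeric dotted version such as the one shown above; suffixes like 'rc1' are not handled.

```python
def version_tuple(version):
    """Turn a dotted version string such as '1.6.5' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_compatible(version, minimum=(1, 6)):
    """Return True when version is at least the given minimum."""
    # Tuples compare element by element, so (1, 6, 5) >= (1, 6) holds.
    return version_tuple(version) >= minimum
```

With textract installed, the check would be called as is_compatible(textract.VERSION).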

Error Handling

Exception classes raised by textract. All of them derive from CommandLineError, so catching that class handles every error listed here.
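
The exception classes documented in this section can be handled with ordinary try/except clauses. The sketch below shows one way to turn them into short messages. extract_or_report is a hypothetical helper; it receives the extraction function and the exceptions namespace as arguments so that the pattern can be read and tried without textract installed.

```python
def extract_or_report(path, process, errors):
    """Run process(path) and translate known failures into short messages.

    process: a callable such as textract.process
    errors:  a namespace exposing the exception classes, such as
             textract.exceptions
    Returns (result, None) on success or (None, message) on failure.
    """
    try:
        return process(path), None
    except errors.MissingFileError:
        return None, "file not found: %s" % path
    except errors.ExtensionNotSupported:
        return None, "unsupported file type: %s" % path
    except errors.UnknownMethod:
        return None, "unknown extraction method"
    except errors.ShellError as exc:
        # ShellError can distinguish a missing executable from a failed run.
        if exc.is_not_installed():
            return None, "a required external tool is not installed"
        return None, "an external command failed"
```

In real use the call would be extract_or_report(path, textract.process, textract.exceptions).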

class CommandLineError(Exception):
    """Base exception class for CLI errors with suppressed tracebacks."""
    
    def render(self, msg: str) -> str:
        """
        Format error messages for display.
        
        Parameters:
        - msg (str): Message template with variable placeholders
        
        Returns:
        str: Formatted message string
        """

class ExtensionNotSupported(CommandLineError):
    """Raised when file extension is not supported."""
    
    def __init__(self, ext):
        """
        Parameters:
        - ext (str): The unsupported extension
        """

class MissingFileError(CommandLineError):
    """Raised when specified file cannot be found."""
    
    def __init__(self, filename):
        """
        Parameters:
        - filename (str): The missing file path
        """

class UnknownMethod(CommandLineError):
    """Raised when specified extraction method is unknown."""
    
    def __init__(self, method):
        """
        Parameters:
        - method (str): The unknown method name
        """

class ShellError(CommandLineError):
    """Raised when shell command execution fails."""
    
    def __init__(self, command, exit_code, stdout, stderr):
        """
        Parameters:
        - command (str): Command that failed
        - exit_code (int): Process exit code  
        - stdout (str): Standard output
        - stderr (str): Standard error
        """
    
    def is_not_installed(self):
        """Check if error is due to missing executable."""
        
    def not_installed_message(self):
        """Get missing dependency message."""
        
    def failed_message(self):
        """Get command failure message."""

Parser Constants

Constants for encoding and extension handling used throughout the parsing system.

EXTENSION_SYNONYMS: dict = {
    ".jpeg": ".jpg", 
    ".tff": ".tiff", 
    ".tif": ".tiff", 
    ".htm": ".html", 
    "": ".txt", 
    ".log": ".txt", 
    ".tab": ".tsv"
}

DEFAULT_OUTPUT_ENCODING: str = 'utf_8'

DEFAULT_ENCODING: str = 'utf_8'
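
To illustrate how a synonym table like the one above can be applied, here is a minimal sketch of extension normalization. normalize_extension is a hypothetical helper, and the lowercasing and leading-dot handling are assumptions for the example rather than documented textract behavior.

```python
import os

# Copied from the constant documented above.
EXTENSION_SYNONYMS = {
    ".jpeg": ".jpg",
    ".tff": ".tiff",
    ".tif": ".tiff",
    ".htm": ".html",
    "": ".txt",
    ".log": ".txt",
    ".tab": ".tsv",
}


def normalize_extension(filename, extension=None):
    """Resolve the extension to use for parser lookup."""
    # An explicit override wins over the extension found in the filename.
    ext = extension if extension is not None else os.path.splitext(filename)[1]
    ext = ext.lower()
    # Assumption: accept overrides written without the leading dot.
    if ext and not ext.startswith("."):
        ext = "." + ext
    return EXTENSION_SYNONYMS.get(ext, ext)
```

Note the empty-string entry: under this sketch a file with no extension is treated as plain text.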

Color Utilities

Terminal color formatting functions for enhanced CLI output and user interfaces.
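
For readers unfamiliar with ANSI color codes, the following sketch shows how such helpers are commonly built. It is not the code in textract.colors; make_color, red_sketch, bold_red_sketch, and strip_ansi are hypothetical names, and the escape sequences are the standard SGR codes (31 for red, 1 for bold, 0 for reset).

```python
import re

# Matches SGR escape sequences such as "\x1b[31m" or "\x1b[1;31m".
_ANSI_PATTERN = re.compile(r"\x1b\[[0-9;]*m")


def make_color(code):
    """Build a function that wraps text in the given SGR code and a reset."""
    def paint(text):
        return "\x1b[%sm%s\x1b[0m" % (code, text)
    return paint


red_sketch = make_color("31")
bold_red_sketch = make_color("1;31")


def strip_ansi(text):
    """Remove SGR color sequences, leaving the plain text."""
    return _ANSI_PATTERN.sub("", text)
```

Stripping is useful when colored CLI output is redirected to a file or compared in tests.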

red: function
"""Apply red ANSI color codes to text string."""

green: function
"""Apply green ANSI color codes to text string."""

yellow: function
"""Apply yellow ANSI color codes to text string."""

blue: function
"""Apply blue ANSI color codes to text string."""

magenta: function
"""Apply magenta ANSI color codes to text string."""

cyan: function
"""Apply cyan ANSI color codes to text string."""

white: function
"""Apply white ANSI color codes to text string."""

bold_red: function
"""Apply bold red ANSI color codes to text string."""

bold_green: function
"""Apply bold green ANSI color codes to text string."""

bold_yellow: function
"""Apply bold yellow ANSI color codes to text string."""

bold_blue: function
"""Apply bold blue ANSI color codes to text string."""

bold_magenta: function
"""Apply bold magenta ANSI color codes to text string."""

bold_cyan: function
"""Apply bold cyan ANSI color codes to text string."""

bold_white: function
"""Apply bold white ANSI color codes to text string."""

def colorless(text: str) -> str:
    """
    Remove ANSI color codes from text.
    
    Parameters:
    - text (str): Text containing ANSI color codes
    
    Returns:
    str: Text with color codes removed
    """

Supported File Formats

Textract supports 25 distinct file formats through specialized parsers:

Document Formats

.txt - Plain text files (direct reading)
.doc - Microsoft Word documents (via antiword/catdoc)
.docx - Microsoft Word XML documents (via docx2txt)
.pdf - PDF documents (multiple methods: pdftotext, pdfminer, tesseract OCR)
.rtf - Rich Text Format (via unrtf)
.odt - OpenDocument Text (via odt2txt)
.epub - Electronic publication format (via zipfile + BeautifulSoup)
.html/.htm - HTML documents (via BeautifulSoup with table parsing)

Spreadsheet Formats

.xls - Excel 97-2003 format (via xlrd)
.xlsx - Excel 2007+ format (via xlrd)
.csv - Comma-separated values (via csv module)
.tsv - Tab-separated values (via csv module)
.psv - Pipe-separated values (via csv module)

Presentation Formats

.pptx - PowerPoint presentations (via pptx)

Image Formats (OCR)

.jpg/.jpeg - JPEG images (via tesseract OCR)
.png - PNG images (via tesseract OCR)
.gif - GIF images (via tesseract OCR)
.tiff/.tif - TIFF images (via tesseract OCR)

Audio Formats (Speech Recognition)

.wav - WAV audio files (via SpeechRecognition)
.mp3 - MP3 audio files (converted to WAV then processed)
.ogg - OGG audio files (converted to WAV then processed)

Email Formats

.eml - Email message files (via email.parser)
.msg - Outlook message files (via msg-extractor)

Other Formats

.json - JSON files (extracts all string values recursively)
.ps - PostScript files (via ps2ascii)

Parser Method Options

Several file formats support multiple extraction methods via the method parameter:

PDF Extraction Methods

# Default method using pdftotext utility
text = textract.process('document.pdf', method='pdftotext')

# Use pdfminer library for extraction
text = textract.process('document.pdf', method='pdfminer')

# OCR-based extraction for scanned PDFs
text = textract.process('document.pdf', method='tesseract')

# Preserve layout with pdftotext
text = textract.process('document.pdf', method='pdftotext', layout=True)
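
Because the PDF methods differ in what they can read (text-based versus scanned documents), a caller may want to try them in sequence. The sketch below is one possible approach, not a textract feature: extract_pdf is a hypothetical helper, and the extraction function is passed in as process so the logic can be exercised without the external tools.

```python
def extract_pdf(path, process, methods=("pdftotext", "pdfminer", "tesseract")):
    """Try each method in turn and return the first non-blank result.

    process is expected to behave like textract.process.
    Returns a (method, result) pair.
    """
    last_error = None
    for method in methods:
        try:
            result = process(path, method=method)
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
            continue
        # A scanned PDF often yields only whitespace from text-based methods.
        if result.strip():
            return method, result
    raise RuntimeError(
        "no method produced text for %r (last error: %r)" % (path, last_error)
    )
```

In real use the call would be extract_pdf('document.pdf', textract.process). OCR is placed last because it is typically the slowest option.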

Audio Recognition Methods

# Google Speech Recognition (default)
text = textract.process('audio.wav', method='google')

# PocketSphinx offline recognition
text = textract.process('audio.wav', method='sphinx')

Image OCR Options

# Specify language for OCR recognition
text = textract.process('image.png', language='eng')  # English
text = textract.process('image.png', language='fra')  # French
text = textract.process('image.png', language='deu')  # German

Command-Line Interface

Textract also installs a textract command that exposes the same options as the Python API:

# Basic text extraction
textract document.pdf

# Specify output encoding
textract --encoding utf-8 document.docx

# Override file extension detection
textract --extension .txt unknown_file

# Use specific extraction method
textract --method pdfminer document.pdf

# Save output to file
textract --output extracted.txt document.pdf

# Use parser-specific options
textract --option layout=True document.pdf

# Show version information
textract --version

CLI Options

filename - Required input file path
-e/--encoding - Output encoding specification
--extension - Manual extension override for format detection
-m/--method - Extraction method selection
-o/--output - Output file specification
-O/--option - Parser options in KEYWORD=VALUE format
-v/--version - Display version information
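
When the CLI is driven from another program, building the argument list explicitly avoids shell quoting problems. The following is a minimal sketch using only the flags listed above; build_textract_argv is a hypothetical helper name.

```python
def build_textract_argv(filename, encoding=None, extension=None,
                        method=None, output=None, options=None):
    """Assemble an argument list for the textract command."""
    argv = ["textract"]
    if encoding:
        argv += ["--encoding", encoding]
    if extension:
        argv += ["--extension", extension]
    if method:
        argv += ["--method", method]
    if output:
        argv += ["--output", output]
    # Parser options are passed as repeated KEYWORD=VALUE pairs.
    for key, value in sorted((options or {}).items()):
        argv += ["--option", "%s=%s" % (key, value)]
    argv.append(filename)
    return argv
```

The resulting list can be handed to subprocess.run(argv, check=True) from the standard library, which runs the command without going through a shell.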