Tessl Tile for pypi/pdftotext@3.0.0

# pdftotext

A simple Python library for extracting text from PDF documents using the Poppler backend. It exposes a minimal API through a single `PDF` class, which gives sequence-style access to pages and supports password-protected documents and three text extraction modes (default, raw, and physical).

## Package Information

- **Package Name**: pdftotext
- **Language**: Python (with C++ extension)
- **Installation**: `pip install pdftotext`
- **System Dependencies**: libpoppler-cpp, pkg-config, python3-dev

## Core Imports

```python
import pdftotext
```

## Basic Usage

```python
import pdftotext

# Load a PDF file
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Check page count
print(f"Document has {len(pdf)} pages")

# Read individual pages
print("First page:")
print(pdf[0])

print("Last page:")
print(pdf[-1])

# Iterate through all pages
for page_num, page_text in enumerate(pdf):
    print(f"--- Page {page_num + 1} ---")
    print(page_text)

# Read all text as single string
full_text = "\n\n".join(pdf)
print(full_text)
```

## Capabilities

### PDF Document Loading

Load PDF documents from file-like objects with optional password authentication and text extraction mode configuration.

```python { .api }
class PDF:
    def __init__(self, pdf_file, password="", raw=False, physical=False):
        """
        Initialize PDF object for text extraction.

        Args:
            pdf_file: A file-like object opened in binary mode containing PDF data
            password (str, optional): Password to unlock encrypted PDFs. Both owner and user passwords work. Defaults to "".
            raw (bool, optional): Extract text in content stream order (as stored in PDF). Defaults to False.
            physical (bool, optional): Extract text in physical layout order (spatial arrangement on page). Defaults to False.

        Raises:
            pdftotext.Error: If PDF is invalid, corrupted, or password-protected without correct password
            TypeError: If pdf_file is not a file-like object or opened in text mode
            ValueError: If both raw and physical are True, or if raw/physical values are invalid

        Note:
            The raw and physical parameters are mutually exclusive. The default mode gives the most
            readable output by respecting logical document structure and is usually preferable to
            raw or physical mode.
        """
```
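
Because the constructor expects a binary file-like object, a small pre-check can make a wrong argument fail with a clear message before the data reaches Poppler. The following is a minimal sketch under stated assumptions, not part of pdftotext: `as_binary_stream` is a hypothetical helper name, and it uses only the standard `io` module.

```python
import io


def as_binary_stream(source):
    """Return a binary file-like object for `source`.

    Hypothetical helper (not part of pdftotext). Accepts raw bytes or an
    already-open stream and rejects text-mode files up front.
    """
    if isinstance(source, (bytes, bytearray, memoryview)):
        # Wrap in-memory PDF data so it can be passed where a file is expected.
        return io.BytesIO(bytes(source))
    if not callable(getattr(source, "read", None)):
        raise TypeError("expected bytes or a file-like object with read()")
    if isinstance(source, io.TextIOBase):
        raise TypeError("file must be opened in binary mode ('rb')")
    return source
```

With this in place, a call such as `pdftotext.PDF(as_binary_stream(data))` would accept either downloaded bytes or a file opened with `"rb"`.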

### Page Access

Access individual pages as strings using sequence-like interface with support for indexing and iteration.

```python { .api }
def __len__(self) -> int:
    """
    Return the number of pages in the PDF document.

    Returns:
        int: Number of pages in the document
    """

def __getitem__(self, index: int) -> str:
    """
    Get text content of a specific page.

    Args:
        index (int): Page index (0-based). Supports negative indexing.

    Returns:
        str: Text content of the page as UTF-8 string

    Raises:
        IndexError: If index is out of range
        pdftotext.Error: If page cannot be read due to corruption
    """

def __iter__(self):
    """
    Enable iteration over pages, yielding page text.

    Yields:
        str: Text content of each page in sequence

    Example:
        for page in pdf:
            print(page)
    """
```
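
The interface above documents integer indices only, so extracting a page range is easiest to express with `len()` and a loop over indices. The sketch below shows one way to do that; `pages_in_range` is a hypothetical helper, and a plain list stands in for a `pdftotext.PDF` object so the example runs without a PDF file.

```python
def pages_in_range(pdf, first, last):
    """Return the text of pages first..last (1-based, inclusive).

    `pdf` may be a pdftotext.PDF or any object supporting len() and
    integer indexing. Hypothetical helper, not part of pdftotext.
    """
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    last = min(last, len(pdf))  # clamp to the end of the document
    # Use integer indexing only; slice syntax is not in the documented API.
    return [pdf[i] for i in range(first - 1, last)]


# Stand-in for a three-page document.
sample_pages = ["cover", "chapter one", "chapter two"]
print(pages_in_range(sample_pages, 2, 5))  # prints ['chapter one', 'chapter two']
```

Clamping `last` means an over-long range returns the pages that exist instead of raising `IndexError`; drop the `min()` call if a strict check is preferred.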

### Text Extraction Modes

Configure how text is extracted from PDF pages to optimize for different document layouts and reading requirements.

**Default Mode** (recommended): Most readable output that respects logical document structure. Handles multi-column layouts, reading order, and text flow intelligently.

**Raw Mode** (`raw=True`): Extracts text in the order it appears in the PDF content stream. Useful for debugging or when document structure is less important than preserving original ordering.

**Physical Mode** (`physical=True`): Extracts text in physical layout order based on spatial arrangement on the page. Can be useful for documents with complex layouts where spatial positioning matters.

Usage examples:

```python
import pdftotext

# Default mode - most readable
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
    text = pdf[0]  # Respects logical structure

# Raw mode - content stream order
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, raw=True)
    text = pdf[0]  # Order as stored in PDF

# Physical mode - spatial order
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, physical=True)
    text = pdf[0]  # Spatial arrangement on page
```
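
Since `raw` and `physical` are mutually exclusive, callers that take the mode from configuration or a command-line flag may find it convenient to work with a single mode name. The sketch below maps such a name to keyword arguments; `mode_kwargs` and the mode names are assumptions made for illustration, not part of pdftotext.

```python
def mode_kwargs(mode="default"):
    """Map a mode name to the raw/physical keyword arguments.

    Hypothetical helper: the names "default", "raw", and "physical" are
    chosen here for illustration and are not defined by pdftotext.
    """
    modes = {
        "default": {"raw": False, "physical": False},
        "raw": {"raw": True, "physical": False},
        "physical": {"raw": False, "physical": True},
    }
    if mode not in modes:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(modes)}")
    return dict(modes[mode])  # return a copy so callers can modify it


print(mode_kwargs("physical"))  # prints {'raw': False, 'physical': True}
```

The result can be unpacked into the constructor, for example `pdftotext.PDF(f, **mode_kwargs("raw"))`, which rules out passing both flags as `True` by construction.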

### Password-Protected PDFs

Handle encrypted PDF documents using owner or user passwords.

```python
import pdftotext

# Unlock with password
with open("secure_document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, password="secret123")
    text = pdf[0]

# Either the owner password or the user password unlocks the document
with open("encrypted.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, password="owner_password")

# Reopen the file (or rewind it with f.seek(0)) before a second attempt,
# because the first call has already read the stream to its end
with open("encrypted.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, password="user_password")
```
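
When the correct password is one of a few known candidates, the attempts can be wrapped in a loop that catches the library's error and moves on. This is a minimal sketch under stated assumptions: `open_with_passwords` is a hypothetical helper, and its `loader` and `error_type` parameters are injection points added here so the logic can be exercised with stand-ins; by default they resolve to `pdftotext.PDF` and `pdftotext.Error`.

```python
import io


def open_with_passwords(data, candidates, loader=None, error_type=None):
    """Try each candidate password; return (pdf, password) or (None, None).

    `data` is the PDF content as bytes. `loader` and `error_type` are
    hypothetical injection points defaulting to pdftotext.PDF and
    pdftotext.Error.
    """
    if loader is None or error_type is None:
        import pdftotext  # imported lazily so stand-ins work without Poppler

        loader = loader or pdftotext.PDF
        error_type = error_type or pdftotext.Error

    for candidate in candidates:
        try:
            # A fresh BytesIO per attempt, so each attempt reads from the start.
            return loader(io.BytesIO(data), password=candidate), candidate
        except error_type:
            continue  # wrong password, or another PDF error; try the next one
    return None, None
```

Because `pdftotext.Error` also covers invalid or corrupted files, a `(None, None)` result does not by itself prove that every password was wrong.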

### Error Handling

Handle PDF-related errors and edge cases gracefully.

```python { .api }
class Error(Exception):
    """
    Exception raised for PDF-related errors.

    Raised when:
    - PDF file is invalid or corrupted
    - PDF is password-protected and no/wrong password provided
    - Poppler library encounters errors during processing
    - Page cannot be read due to corruption
    """
```

Example error handling:

```python
import pdftotext

try:
    with open("document.pdf", "rb") as f:
        pdf = pdftotext.PDF(f)
        text = pdf[0]
except pdftotext.Error as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("PDF file not found")
except IndexError as e:
    print(f"Page index error: {e}")
```

## Types

```python { .api }
class PDF:
    """
    Main class for PDF text extraction with sequence-like interface.

    Provides:
    - Sequential access to pages via indexing (pdf[0], pdf[1], etc.)
    - Length operation (len(pdf))
    - Iteration support (for page in pdf)
    - Password authentication for encrypted PDFs
    - Multiple text extraction modes (default, raw, physical)
    """

class Error(Exception):
    """
    Custom exception class for PDF-related errors.

    Inherits from built-in Exception class and is raised for:
    - Invalid or corrupted PDF files
    - Authentication failures on password-protected PDFs
    - Poppler library processing errors
    - Page reading errors due to corruption
    """
```

## Common Usage Patterns

### Processing Multi-page Documents

```python
import pdftotext

with open("report.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

    # Process each page, printing a preview of at most 100 characters
    for i, page in enumerate(pdf):
        print(f"=== Page {i + 1} ===")
        print((page[:100] + "...") if len(page) > 100 else page)

    # Or get all text at once
    full_document = "\n\n".join(pdf)
```
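
Because a loaded document iterates as a sequence of page strings, ordinary string operations are enough for simple tasks such as locating the pages that mention a term. The sketch below illustrates this; `find_pages` is a hypothetical helper, and a plain list of strings stands in for a `pdftotext.PDF` object.

```python
def find_pages(pages, keyword):
    """Return 1-based numbers of pages whose text contains `keyword`.

    The match is case-insensitive. `pages` may be a pdftotext.PDF or any
    iterable of strings. Hypothetical helper, not part of pdftotext.
    """
    needle = keyword.casefold()
    return [
        number
        for number, text in enumerate(pages, start=1)
        if needle in text.casefold()
    ]


# Stand-in for a three-page document.
sample_pages = ["Annual Report", "Revenue grew in Q3", "Appendix: revenue tables"]
print(find_pages(sample_pages, "revenue"))  # prints [2, 3]
```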

### Handling Different Document Types

```python
import pdftotext

# Regular document
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Password-protected document
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, password="mypassword")

# Multi-column document (try physical mode)
with open("newspaper.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, physical=True)

# Document with complex layout (try raw mode)
with open("form.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, raw=True)
```

### Robust Error Handling

```python
import pdftotext

def extract_pdf_text(filepath, password=None):
    """Extract text from PDF with comprehensive error handling."""
    try:
        with open(filepath, "rb") as f:
            if password:
                pdf = pdftotext.PDF(f, password=password)
            else:
                pdf = pdftotext.PDF(f)

            # Return the pages as a list of strings
            return list(pdf)

    except FileNotFoundError:
        print(f"File not found: {filepath}")
        return None
    except pdftotext.Error as e:
        print(f"PDF processing error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None
```