# pdftotext

A simple Python library for extracting text from PDF documents using the Poppler backend. The API is minimal: a single `PDF` class provides sequence-style access to pages, supports password-protected documents, and offers several text extraction modes.

## Package Information

- **Package Name**: pdftotext
- **Language**: Python (with C++ extension)
- **Installation**: `pip install pdftotext`
- **System Dependencies**: libpoppler-cpp, pkg-config, python3-dev

## Core Imports

```python
import pdftotext
```

## Basic Usage

```python
import pdftotext

# Load a PDF file
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Check page count
print(f"Document has {len(pdf)} pages")

# Read individual pages
print("First page:")
print(pdf[0])

print("Last page:")
print(pdf[-1])

# Iterate through all pages
for page_num, page_text in enumerate(pdf):
    print(f"--- Page {page_num + 1} ---")
    print(page_text)

# Read all text as a single string
full_text = "\n\n".join(pdf)
print(full_text)
```
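
Because a loaded document behaves like a sequence of page strings, ordinary string handling applies to it directly. The following is a minimal sketch of a keyword search; `find_pages` and `sample_pages` are hypothetical names introduced for illustration, and a plain list stands in for a `pdftotext.PDF` object so the example runs without a PDF file.

```python
def find_pages(pages, needle, case_sensitive=False):
    """Return the 1-based numbers of pages whose text contains `needle`."""
    if not case_sensitive:
        needle = needle.casefold()
    hits = []
    for number, text in enumerate(pages, start=1):
        haystack = text if case_sensitive else text.casefold()
        if needle in haystack:
            hits.append(number)
    return hits


# Stand-in for a loaded document: any iterable of page strings works here,
# which should include a pdftotext.PDF instance.
sample_pages = ["Introduction\nScope of the report", "Methods", "Results and scope"]
print(find_pages(sample_pages, "scope"))  # [1, 3]
```

With a real document, the call would look like `find_pages(pdf, "scope")`.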

## Capabilities

### PDF Document Loading

Load PDF documents from file-like objects, with optional password authentication and a configurable text extraction mode.

```python { .api }
class PDF:
    def __init__(self, pdf_file, password="", raw=False, physical=False):
        """
        Initialize PDF object for text extraction.

        Args:
            pdf_file: A file-like object opened in binary mode containing PDF data
            password (str, optional): Password to unlock encrypted PDFs. Both owner and user passwords work. Defaults to "".
            raw (bool, optional): Extract text in content stream order (as stored in the PDF). Defaults to False.
            physical (bool, optional): Extract text in physical layout order (spatial arrangement on the page). Defaults to False.

        Raises:
            pdftotext.Error: If the PDF is invalid or corrupted, or is password-protected and the correct password was not given
            TypeError: If pdf_file is not a file-like object or was opened in text mode
            ValueError: If both raw and physical are True, or if raw/physical values are invalid

        Note:
            The raw and physical parameters are mutually exclusive. The default mode usually gives the
            most readable output because it follows the logical document structure, so prefer it
            unless raw or physical ordering is specifically needed.
        """
```
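
Since `pdf_file` is described as a binary file-like object, PDF data that is already in memory (for example, a downloaded response body) can presumably be wrapped in `io.BytesIO` instead of being written to disk first. The sketch below shows one way to normalize different inputs; `as_binary_file` is a hypothetical helper, and the assumption that `pdftotext.PDF` accepts a `BytesIO` object is inferred from the parameter description rather than confirmed here.

```python
import io
import os


def as_binary_file(source):
    """Return a binary file-like object for bytes, a filesystem path, or an open file."""
    if isinstance(source, (bytes, bytearray)):
        return io.BytesIO(bytes(source))
    if isinstance(source, (str, os.PathLike)):
        return open(source, "rb")
    if hasattr(source, "read"):
        return source
    raise TypeError(f"unsupported PDF source: {type(source).__name__}")


# Intended use with pdftotext (not executed here, and assumed to work):
#     with as_binary_file(pdf_bytes) as f:
#         pdf = pdftotext.PDF(f)

# Placeholder bytes only; this is not a valid PDF.
buffer = as_binary_file(b"%PDF-placeholder")
print(type(buffer).__name__)  # BytesIO
```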

### Page Access

Access individual pages as strings through a sequence-like interface that supports indexing and iteration.

```python { .api }
def __len__(self) -> int:
    """
    Return the number of pages in the PDF document.

    Returns:
        int: Number of pages in the document
    """

def __getitem__(self, index: int) -> str:
    """
    Get text content of a specific page.

    Args:
        index (int): Page index (0-based). Supports negative indexing.

    Returns:
        str: Text content of the page

    Raises:
        IndexError: If index is out of range
        pdftotext.Error: If page cannot be read due to corruption
    """

def __iter__(self):
    """
    Enable iteration over pages, yielding page text.

    Yields:
        str: Text content of each page in sequence

    Example:
        for page in pdf:
            print(page)
    """
```
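
The interface above documents integer indexes, `len()`, and iteration; slice support is not mentioned, so the sketch below reads a page range using only integer indexes. `iter_page_range` and `FakePDF` are hypothetical names: the stand-in class exposes just the documented methods so the example runs without a PDF file.

```python
def iter_page_range(pdf, first, last):
    """Yield (page_number, text) for 1-based pages first..last, inclusive."""
    total = len(pdf)
    if not 1 <= first <= last <= total:
        raise IndexError(f"page range {first}-{last} is outside 1-{total}")
    for number in range(first, last + 1):
        # Convert the 1-based page number to the 0-based index the API expects.
        yield number, pdf[number - 1]


class FakePDF:
    """Stand-in exposing only the documented __len__ and integer __getitem__."""

    def __init__(self, pages):
        self._pages = list(pages)

    def __len__(self):
        return len(self._pages)

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("only integer indexes are supported")
        return self._pages[index]


doc = FakePDF(["cover", "contents", "chapter 1", "chapter 2"])
for number, text in iter_page_range(doc, 2, 3):
    print(number, text)
```

Validating the range up front gives one clear error message instead of a failure partway through the loop.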

### Text Extraction Modes

Choose how text is extracted from each page to suit the document's layout and how the output will be read.

**Default Mode** (recommended): Usually the most readable output. Text follows the logical document structure, including reading order across multi-column layouts.

**Raw Mode** (`raw=True`): Extracts text in the order it appears in the PDF content stream. Useful for debugging, or when preserving the stored ordering matters more than document structure.

**Physical Mode** (`physical=True`): Extracts text in physical layout order, based on its spatial arrangement on the page. Can be useful for documents with complex layouts where spatial positioning matters.

Usage examples:

```python
import pdftotext

# Default mode - most readable
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
    text = pdf[0]  # Respects logical structure

# Raw mode - content stream order
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, raw=True)
    text = pdf[0]  # Order as stored in PDF

# Physical mode - spatial order
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, physical=True)
    text = pdf[0]  # Spatial arrangement on page
```
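
Because `raw` and `physical` are mutually exclusive, it can be convenient to select the mode by a single name so that an invalid combination cannot be passed. This is a sketch under that assumption; `mode_kwargs` and `load_pdf` are hypothetical helpers, not part of the library.

```python
# Each named mode maps to one valid combination of the two flags.
_MODES = {
    "default": {"raw": False, "physical": False},
    "raw": {"raw": True, "physical": False},
    "physical": {"raw": False, "physical": True},
}


def mode_kwargs(mode="default"):
    """Translate a mode name into keyword arguments for pdftotext.PDF."""
    try:
        return dict(_MODES[mode])
    except KeyError:
        raise ValueError(
            f"unknown extraction mode {mode!r}; expected one of {sorted(_MODES)}"
        ) from None


def load_pdf(path, mode="default", password=""):
    """Open `path` and load it with the named extraction mode."""
    import pdftotext  # imported lazily so mode_kwargs works without the library

    with open(path, "rb") as f:
        return pdftotext.PDF(f, password, **mode_kwargs(mode))


print(mode_kwargs("physical"))  # {'raw': False, 'physical': True}
```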

### Password-Protected PDFs

Handle encrypted PDF documents using either the owner or the user password.

```python
import pdftotext

# Unlock with password
with open("secure_document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, password="secret123")
    text = pdf[0]

# Both owner and user passwords work
with open("encrypted.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, password="owner_password")

# or, reopening the file so it is read from the start again
with open("encrypted.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, password="user_password")
```
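
When the password is not known in advance, a short list of candidates can be tried in turn. The sketch below is one possible approach: `open_with_any_password` is a hypothetical helper that takes the loader and its error type as parameters, so it can be demonstrated with a fake loader. With the real library the call would presumably be `open_with_any_password(data, candidates, pdftotext.PDF, pdftotext.Error)`.

```python
import io


def open_with_any_password(data, passwords, load, error_type):
    """Return (document, password) for the first password that `load` accepts."""
    last_error = None
    for password in passwords:
        try:
            # A fresh BytesIO per attempt, so every attempt reads from offset 0.
            return load(io.BytesIO(data), password), password
        except error_type as exc:
            last_error = exc
    raise ValueError("none of the candidate passwords worked") from last_error


class FakeError(Exception):
    """Stand-in for pdftotext.Error in this demonstration."""


def fake_load(pdf_file, password):
    """Stand-in loader that accepts only one password."""
    if password != "secret123":
        raise FakeError("wrong password")
    return pdf_file.read()


doc, used = open_with_any_password(
    b"placeholder", ["", "guess", "secret123"], fake_load, FakeError
)
print(used)  # secret123
```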

### Error Handling

Handle PDF-related errors and edge cases gracefully.

```python { .api }
class Error(Exception):
    """
    Exception raised for PDF-related errors.

    Raised when:
    - PDF file is invalid or corrupted
    - PDF is password-protected and no/wrong password provided
    - Poppler library encounters errors during processing
    - Page cannot be read due to corruption
    """
```

Example error handling:

```python
import pdftotext

try:
    with open("document.pdf", "rb") as f:
        pdf = pdftotext.PDF(f)
        text = pdf[0]
except pdftotext.Error as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("PDF file not found")
except IndexError as e:
    print(f"Page index error: {e}")
```

## Types

```python { .api }
class PDF:
    """
    Main class for PDF text extraction with sequence-like interface.

    Provides:
    - Sequence-style access to pages via indexing (pdf[0], pdf[1], etc.)
    - Length operation (len(pdf))
    - Iteration support (for page in pdf)
    - Password authentication for encrypted PDFs
    - Multiple text extraction modes (default, raw, physical)
    """

class Error(Exception):
    """
    Custom exception class for PDF-related errors.

    Inherits from built-in Exception class and is raised for:
    - Invalid or corrupted PDF files
    - Authentication failures on password-protected PDFs
    - Poppler library processing errors
    - Page reading errors due to corruption
    """
```

## Common Usage Patterns

### Processing Multi-page Documents

```python
import pdftotext

with open("report.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Process each page, printing a preview of at most 100 characters
for i, page in enumerate(pdf):
    print(f"=== Page {i + 1} ===")
    print((page[:100] + "...") if len(page) > 100 else page)

# Or get all text at once
full_document = "\n\n".join(pdf)
```
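
For longer documents it may be more practical to save each page to its own text file than to print it. A minimal sketch follows; `write_pages` and the `page_0001.txt` naming scheme are assumptions made for illustration, and a list of strings stands in for the loaded document.

```python
import tempfile
from pathlib import Path


def write_pages(pages, out_dir, stem="page"):
    """Write each page to its own UTF-8 text file and return the created paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(pages, start=1):
        # Zero-padded numbers keep the files in page order when sorted by name.
        target = out_dir / f"{stem}_{number:04d}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


with tempfile.TemporaryDirectory() as tmp:
    paths = write_pages(["first page", "second page"], tmp)
    print([p.name for p in paths])  # ['page_0001.txt', 'page_0002.txt']
```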

### Handling Different Document Types

```python
import pdftotext

# Regular document
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Password-protected document
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, password="mypassword")

# Layout where spatial positioning matters (try physical mode)
with open("newspaper.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, physical=True)

# Document where the stored content order is wanted (try raw mode)
with open("form.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, raw=True)
```

### Robust Error Handling

```python
import pdftotext

def extract_pdf_text(filepath, password=None):
    """Extract text from a PDF; return a list of page strings, or None on failure."""
    try:
        with open(filepath, "rb") as f:
            # Fall back to the library's default empty password when none is given
            pdf = pdftotext.PDF(f, password=password or "")

        return list(pdf)

    except FileNotFoundError:
        print(f"File not found: {filepath}")
        return None
    except pdftotext.Error as e:
        print(f"PDF processing error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None
```
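
A function that returns `None` on failure, as above, makes it straightforward to process a batch of files and report which ones could not be read. The sketch below assumes that convention; `extract_many` is a hypothetical helper, and a fake extractor stands in for `extract_pdf_text` so the example runs without any PDF files.

```python
def extract_many(paths, extract):
    """Run `extract` on each path; return (texts_by_path, failed_paths)."""
    texts, failed = {}, []
    for path in paths:
        pages = extract(path)
        if pages is None:
            # The extractor signals failure by returning None.
            failed.append(path)
        else:
            texts[path] = pages
    return texts, failed


def fake_extract(path):
    """Stand-in extractor: pretends that only 'good.pdf' can be read."""
    return ["page one", "page two"] if path == "good.pdf" else None


texts, failed = extract_many(["good.pdf", "broken.pdf"], fake_extract)
print(sorted(texts), failed)  # ['good.pdf'] ['broken.pdf']
```

With the function defined above, the call would be `extract_many(paths, extract_pdf_text)`.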