npx @tessl/cli install tessl/pypi-pypdfium2@4.30.00
# pypdfium2

Python bindings to PDFium for comprehensive PDF manipulation, rendering, and processing. Built on Google's powerful PDFium library, pypdfium2 provides both high-level helper classes for common PDF operations and low-level raw bindings for advanced functionality.

## Package Information

- **Package Name**: pypdfium2
- **Language**: Python
- **Installation**: `pip install pypdfium2`
- **Python Requirements**: Python 3.6+

## Core Imports

```python
import pypdfium2 as pdfium
```

For direct access to specific classes:

```python
from pypdfium2 import PdfDocument, PdfPage, PdfBitmap
```

For version information:

```python
from pypdfium2 import PYPDFIUM_INFO, PDFIUM_INFO
```

## Basic Usage

```python
import pypdfium2 as pdfium

# Open a PDF document
pdf = pdfium.PdfDocument("document.pdf")

# Get basic information
print(f"Pages: {len(pdf)}")
print(f"Version: {pdf.get_version()}")
print(f"Metadata: {pdf.get_metadata_dict()}")

# Render first page to image
page = pdf[0]
bitmap = page.render(scale=2.0)
pil_image = bitmap.to_pil()
pil_image.save("page1.png")

# Extract text from page
textpage = page.get_textpage()
text = textpage.get_text_range()
print(f"Page text: {text}")

# Clean up
pdf.close()
```

## Architecture

pypdfium2 follows a layered architecture:

- **Helper Classes**: High-level Python API (`PdfDocument`, `PdfPage`, `PdfBitmap`, etc.) providing intuitive interfaces for common operations
- **Raw Bindings**: Direct access to PDFium C API functions through the `pypdfium2.raw` module
- **Type System**: Named tuples and data classes for structured information (`PdfBitmapInfo`, `ImageInfo`, etc.)
- **Resource Management**: Automatic cleanup with context managers and explicit `close()` methods
- **Multi-format Support**: PDF reading/writing, image rendering (PIL, NumPy), text extraction

This design enables both simple high-level operations and advanced low-level manipulation while maintaining compatibility with the broader Python ecosystem.

## Capabilities

### Document Management

Core PDF document operations including loading, creating, saving, and metadata manipulation. Supports password-protected PDFs, form handling, and file attachments.

```python { .api }
class PdfDocument:
    def __init__(self, input_data, password=None, autoclose=False): ...
    @classmethod
    def new(cls): ...
    def __len__(self) -> int: ...
    def save(self, dest, version=None, flags=...): ...
    def get_metadata_dict(self, skip_empty=False) -> dict: ...
    def is_tagged(self) -> bool: ...
```

[Document Management](./document-management.md)

### Page Manipulation

Page-level operations including rendering, rotation, dimension management, and bounding box manipulation. Supports various rendering formats and customization options.

```python { .api }
class PdfPage:
    def get_size(self) -> tuple[float, float]: ...
    def render(self, rotation=0, scale=1, ...) -> PdfBitmap: ...
    def get_rotation(self) -> int: ...
    def set_rotation(self, rotation): ...
    def get_mediabox(self, fallback_ok=True) -> tuple | None: ...
```

[Page Manipulation](./page-manipulation.md)

### Text Processing

Comprehensive text extraction and search capabilities with support for bounded text extraction, character-level positioning, and full-text search.

```python { .api }
class PdfTextPage:
    def get_text_range(self, index=0, count=-1, errors="ignore", force_this=False) -> str: ...
    def get_text_bounded(self, left=None, bottom=None, right=None, top=None, errors="ignore") -> str: ...
    def search(self, text, index=0, match_case=False, match_whole_word=False, consecutive=False) -> PdfTextSearcher: ...
    def get_charbox(self, index, loose=False) -> tuple: ...
```

[Text Processing](./text-processing.md)

### Image and Bitmap Operations

Image rendering, manipulation, and extraction with support for multiple output formats including PIL Images, NumPy arrays, and raw bitmaps.

```python { .api }
class PdfBitmap:
    @classmethod
    def from_pil(cls, pil_image, recopy=False) -> PdfBitmap: ...
    def to_numpy(self) -> numpy.ndarray: ...
    def to_pil(self) -> PIL.Image: ...
    def fill_rect(self, left, top, width, height, color): ...
```

[Image and Bitmap Operations](./image-bitmap.md)

### Page Objects and Graphics

Manipulation of PDF page objects including images, text, and vector graphics. Supports object transformation, insertion, and removal.

```python { .api }
class PdfObject:
    def get_pos(self) -> tuple: ...
    def get_matrix(self) -> PdfMatrix: ...
    def transform(self, matrix): ...

class PdfImage(PdfObject):
    def get_metadata(self) -> ImageInfo: ...
    def extract(self, dest, *args, **kwargs): ...
```

[Page Objects and Graphics](./page-objects.md)

### File Attachments

Management of embedded file attachments with support for attachment metadata, data extraction, and modification.

```python { .api }
class PdfAttachment:
    def get_name(self) -> str: ...
    def get_data(self) -> ctypes.Array: ...
    def set_data(self, data): ...
    def get_str_value(self, key) -> str: ...
```

[File Attachments](./attachments.md)

### Transformation and Geometry

2D transformation matrices for coordinate system manipulation, rotation, scaling, and translation operations.

```python { .api }
class PdfMatrix:
    def __init__(self, a=1, b=0, c=0, d=1, e=0, f=0): ...
    def translate(self, x, y) -> PdfMatrix: ...
    def scale(self, x, y) -> PdfMatrix: ...
    def rotate(self, angle, ccw=False, rad=False) -> PdfMatrix: ...
    def on_point(self, x, y) -> tuple: ...
```

[Transformation and Geometry](./transformation.md)

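To make the six coefficients concrete, here is a standalone sketch of the arithmetic behind a PDF-style affine matrix. The `Affine` class is hypothetical and does not call pypdfium2; it only assumes the standard PDF convention in which a point maps to `(a*x + c*y + e, b*x + d*y + f)` and chained operations apply in call order.

```python
from typing import NamedTuple, Tuple


class Affine(NamedTuple):
    """Hypothetical stand-in for a PDF transformation matrix (a, b, c, d, e, f)."""

    a: float = 1.0
    b: float = 0.0
    c: float = 0.0
    d: float = 1.0
    e: float = 0.0
    f: float = 0.0

    def multiply(self, other: "Affine") -> "Affine":
        # The result applies self first, then other.
        return Affine(
            self.a * other.a + self.b * other.c,
            self.a * other.b + self.b * other.d,
            self.c * other.a + self.d * other.c,
            self.c * other.b + self.d * other.d,
            self.e * other.a + self.f * other.c + other.e,
            self.e * other.b + self.f * other.d + other.f,
        )

    def translate(self, x: float, y: float) -> "Affine":
        return self.multiply(Affine(1, 0, 0, 1, x, y))

    def scale(self, x: float, y: float) -> "Affine":
        return self.multiply(Affine(x, 0, 0, y, 0, 0))

    def on_point(self, x: float, y: float) -> Tuple[float, float]:
        return (self.a * x + self.c * y + self.e, self.b * x + self.d * y + self.f)


# Order matters: scaling then translating differs from translating then scaling.
scale_then_move = Affine().scale(2, 3).translate(10, 20)
move_then_scale = Affine().translate(10, 20).scale(2, 3)

print(scale_then_move.on_point(1, 1))
print(move_then_scale.on_point(1, 1))
```

Returning a new matrix from each operation, rather than mutating in place, is what allows the calls to be chained.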
### Version and Library Information

Access to pypdfium2 and PDFium version information, build details, and feature flags.

```python { .api }
PYPDFIUM_INFO: _version_pypdfium2
PDFIUM_INFO: _version_pdfium

# Version properties
version: str
api_tag: tuple[int]
major: int
minor: int
patch: int
build: int  # PDFIUM_INFO only
```

[Version and Library Information](./version-info.md)

### Command Line Interface

Access to pypdfium2's comprehensive command-line tools for batch processing, text extraction, image operations, and document manipulation.

```python { .api }
def cli_main(raw_args=None) -> int:
    """Main CLI entry point for pypdfium2 command-line tools."""

def api_main(raw_args=None) -> int:
    """Alternative API entry point with same functionality as cli_main."""
```

[Command Line Interface](./cli-tools.md)

## Exception Handling

```python { .api }
class PdfiumError(RuntimeError):
    """Main exception for PDFium library errors"""

class ImageNotExtractableError(Exception):
    """Raised when image cannot be extracted from PDF"""
```

Common error scenarios include invalid PDF files, unsupported operations, memory allocation failures, and file I/O errors. Always handle exceptions when working with external PDF files or performing complex operations.

## Raw Bindings Access

For advanced use cases requiring direct PDFium API access:

```python
from pypdfium2 import raw

# Access low-level PDFium functions.
# The C API expects the path as bytes; the password may be None.
file_path = b"document.pdf"
password = None
doc_handle = raw.FPDF_LoadDocument(file_path, password)
page_count = raw.FPDF_GetPageCount(doc_handle)
raw.FPDF_CloseDocument(doc_handle)  # raw handles must be released manually
```

The raw module provides complete access to PDFium's C API with all functions, constants, and structures available for advanced manipulation.