A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
npx @tessl/cli install tessl/pypi-pypdf@6.0.00
# pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files. pypdf can also add custom data, viewing options, and passwords to PDF files, while providing comprehensive text and metadata extraction capabilities.

## Package Information

- **Package Name**: pypdf
- **Language**: Python
- **Installation**: `pip install pypdf`
- **Optional Dependencies**: `pip install pypdf[crypto]` for AES encryption/decryption

## Core Imports

```python
from pypdf import PdfReader, PdfWriter
```

For page operations:

```python
from pypdf import PdfReader, PdfWriter, PageObject, Transformation
```

For working with metadata, page ranges, and paper sizes:

```python
from pypdf import DocumentInformation, PageRange, PaperSize
```

## Basic Usage

```python
from pypdf import PdfReader, PdfWriter

# Reading a PDF
reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()

# Writing a PDF
writer = PdfWriter()
writer.add_page(page)
with open("output.pdf", "wb") as output_file:
    writer.write(output_file)

# Merging PDFs
reader1 = PdfReader("document1.pdf")
reader2 = PdfReader("document2.pdf")

writer = PdfWriter()
for page in reader1.pages:
    writer.add_page(page)
for page in reader2.pages:
    writer.add_page(page)

with open("merged.pdf", "wb") as output_file:
    writer.write(output_file)
```

## Architecture

pypdf is built around two core classes, `PdfReader` and `PdfWriter`, plus a set of supporting components:

- **PdfReader**: Handles PDF file parsing, decryption, and provides access to pages, metadata, and document structure
- **PdfWriter**: Manages PDF creation, page manipulation, encryption, and output generation
- **PageObject**: Represents individual PDF pages with comprehensive transformation and content manipulation capabilities
- **Generic Objects**: Low-level PDF object types for advanced manipulation (DictionaryObject, ArrayObject, StreamObject, etc.)
- **Annotations**: Complete annotation system for interactive PDF elements
- **Metadata**: Document information handling and XMP metadata support

## Capabilities

### PDF Reading and Writing

Core functionality for opening, reading, creating, and saving PDF documents. Includes support for encrypted PDFs, incremental updates, and context manager usage patterns.

```python { .api }
class PdfReader:
    def __init__(self, stream, strict: bool = False, password: str | None = None): ...
    def decrypt(self, password: str) -> PasswordType: ...
    def close(self) -> None: ...

class PdfWriter:
    def __init__(self, clone_from=None, incremental: bool = False): ...
    def add_page(self, page: PageObject) -> PageObject: ...
    def write(self, stream) -> tuple[bool, IO[Any]]: ...
    def encrypt(self, user_password: str, owner_password: str | None = None, **kwargs) -> None: ...
```

[PDF Reading and Writing](./reading-writing.md)

### Page Operations

Page manipulation including transformations (scaling, rotation, translation), page merging, cropping, and geometric operations. Blank page creation and arbitrary transformation matrices are also supported.

```python { .api }
class PageObject:
    def extract_text(self, extraction_mode: str = "plain", **kwargs) -> str: ...
    def scale(self, sx: float, sy: float) -> None: ...
    def rotate(self, angle: int) -> PageObject: ...
    def merge_page(self, page2: PageObject) -> None: ...
    def merge_transformed_page(self, page2: PageObject, ctm, expand: bool = False) -> None: ...

class Transformation:
    def __init__(self, ctm=(1, 0, 0, 1, 0, 0)): ...
    def translate(self, tx: float = 0, ty: float = 0) -> Transformation: ...
    def scale(self, sx: float = 1, sy: float | None = None) -> Transformation: ...
    def rotate(self, rotation: float) -> Transformation: ...
```

[Page Operations](./page-operations.md)

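The six numbers in `ctm` are the entries `(a, b, c, d, e, f)` of a PDF transformation matrix, which maps a point `(x, y)` to `(a*x + c*y + e, b*x + d*y + f)`. The pure-Python sketch below illustrates that arithmetic and shows why the order of chained operations matters. `apply`, `compose`, `translate`, and `scale` are hypothetical helpers written for this example; they are not pypdf functions and do not reproduce pypdf's implementation.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]  # (a, b, c, d, e, f)


def apply(ctm: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Map a point using the PDF row-vector convention."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


def compose(first: Matrix, then: Matrix) -> Matrix:
    """Return the matrix that applies `first`, followed by `then`."""
    a1, b1, c1, d1, e1, f1 = first
    a2, b2, c2, d2, e2, f2 = then
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def translate(tx: float, ty: float) -> Matrix:
    return (1, 0, 0, 1, tx, ty)


def scale(sx: float, sy: float) -> Matrix:
    return (sx, 0, 0, sy, 0, 0)


# Moving first and scaling second is not the same as the reverse order.
move_then_double = compose(translate(10, 20), scale(2, 2))
double_then_move = compose(scale(2, 2), translate(10, 20))

point_a = apply(move_then_double, 1, 1)
point_b = apply(double_then_move, 1, 1)
```

Here `point_a` is `(22, 42)` and `point_b` is `(12, 22)`: the translation offset is doubled only when the scale is applied after it.
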
### Text Extraction

Text extraction with multiple extraction modes, layout preservation, and customizable text processing through visitor callbacks.

```python { .api }
# Method of PageObject
def extract_text(
    self,
    orientations: tuple | int = (0, 90, 180, 270),
    space_width: float = 200.0,
    visitor_operand_before=None,
    visitor_operand_after=None,
    visitor_text=None,
    extraction_mode: str = "plain"
) -> str: ...
```

[Text Extraction](./text-extraction.md)

### Metadata and Document Information

Access and manipulation of PDF metadata, document properties, XMP information, and custom document attributes.

```python { .api }
from datetime import datetime

class DocumentInformation:
    @property
    def title(self) -> str | None: ...
    @property
    def author(self) -> str | None: ...
    @property
    def subject(self) -> str | None: ...
    @property
    def creator(self) -> str | None: ...
    @property
    def producer(self) -> str | None: ...
    @property
    def creation_date(self) -> datetime | None: ...
    @property
    def modification_date(self) -> datetime | None: ...
```

[Metadata](./metadata.md)

### Annotations

Complete annotation system supporting markup annotations (highlights, text annotations, shapes) and interactive elements (links, popups) with full customization capabilities.

```python { .api }
class AnnotationDictionary: ...
class Highlight: ...
class Text: ...
class Link: ...
class FreeText: ...
```

[Annotations](./annotations.md)

### Utilities and Helpers

Supporting utilities including page ranges, standard paper sizes, constants, error handling, and type definitions.

```python { .api }
class PageRange:
    def __init__(self, arg): ...
    def indices(self, n: int) -> tuple[int, int, int]: ...

class PaperSize:
    A4: tuple[float, float]
    A3: tuple[float, float]
    # ... other standard sizes

def parse_filename_page_ranges(args: list[str | PageRange | None]) -> list[tuple[str, PageRange]]: ...
```

[Utilities](./utilities.md)

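`PageRange.indices` follows the convention of Python's built-in `slice.indices`: given a sequence length `n`, it yields a `(start, stop, step)` triple suitable for `range`. The sketch below uses plain `slice` objects to show the arithmetic without requiring pypdf. Treating a range string such as `"::2"` as equivalent to `slice(None, None, 2)` is an assumption made here for illustration.

```python
# Stand-ins for page-range expressions, written as built-in slices.
every_other = slice(None, None, 2)  # analogous to "::2"
last_three = slice(-3, None)        # analogous to "-3:"

n_pages = 7

# Resolve each slice against a document of n_pages pages.
even_indices = every_other.indices(n_pages)
tail_indices = last_three.indices(n_pages)

# range(*triple) expands to concrete, zero-based page indices.
even_pages = list(range(*even_indices))
tail_pages = list(range(*tail_indices))
```

For a 7-page document this gives `[0, 2, 4, 6]` and `[4, 5, 6]`, which could then be used to index `reader.pages`.
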
### Form Fields and Interactive Elements

Form field manipulation including reading field values, updating form data, setting field appearance properties, and managing interactive PDF forms.

```python { .api }
# Methods of PdfWriter
def update_page_form_field_values(
    self,
    page: PageObject | list[PageObject] | None,
    fields: dict[str, str | list[str] | tuple[str, str, float]],
    flags: int = 0,
    auto_regenerate: bool = True,
    flatten: bool = False
) -> None: ...

def set_need_appearances_writer(self, state: bool = True) -> None: ...

def reattach_fields(self, page: PageObject | None = None) -> list[DictionaryObject]: ...
```

[Form Fields](./form-fields.md)

## Types

```python { .api }
from enum import IntEnum, IntFlag

class PasswordType(IntEnum):
    NOT_DECRYPTED = 0
    USER_PASSWORD = 1
    OWNER_PASSWORD = 2

class ImageType(IntFlag):
    NONE = 0
    XOBJECT_IMAGES = 1
    INLINE_IMAGES = 2
    DRAWING_IMAGES = 4
    IMAGES = XOBJECT_IMAGES | INLINE_IMAGES
    ALL = XOBJECT_IMAGES | INLINE_IMAGES | DRAWING_IMAGES

class ObjectDeletionFlag(IntFlag):
    NONE = 0
    TEXT = 1
    LINKS = 2
    ATTACHMENTS = 4
    OBJECTS_3D = 8
    ALL_ANNOTATIONS = 16
    XOBJECT_IMAGES = 32
    INLINE_IMAGES = 64
    DRAWING_IMAGES = 128
    IMAGES = XOBJECT_IMAGES | INLINE_IMAGES | DRAWING_IMAGES
```
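
Because the flag types are `IntFlag` enums, individual values can be combined with `|` and tested with `in` or `&`. The sketch below redeclares `ObjectDeletionFlag` exactly as listed above so that it runs without pypdf installed; in real code the class would be imported from pypdf instead.

```python
from enum import IntFlag


# Redeclared from the listing above so the example runs without pypdf.
class ObjectDeletionFlag(IntFlag):
    NONE = 0
    TEXT = 1
    LINKS = 2
    ATTACHMENTS = 4
    OBJECTS_3D = 8
    ALL_ANNOTATIONS = 16
    XOBJECT_IMAGES = 32
    INLINE_IMAGES = 64
    DRAWING_IMAGES = 128
    IMAGES = XOBJECT_IMAGES | INLINE_IMAGES | DRAWING_IMAGES


# Combine flags with |, then test membership with `in` or overlap with &.
to_delete = ObjectDeletionFlag.LINKS | ObjectDeletionFlag.INLINE_IMAGES
removes_links = ObjectDeletionFlag.LINKS in to_delete
removes_text = ObjectDeletionFlag.TEXT in to_delete
overlaps_images = bool(to_delete & ObjectDeletionFlag.IMAGES)
combined_value = int(to_delete)
```

The combined value is simply the bitwise OR of the members (`2 | 64`), so flags can also be stored or passed around as plain integers.
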