# Document Management

Core PDF document operations, including loading, creating, saving, metadata handling, and document-level manipulation. The `PdfDocument` class is the primary entry point for all PDF operations.

## Capabilities

### Document Creation and Loading

Create new PDF documents or load existing ones from file paths, bytes, or file-like objects.

```python { .api }
class PdfDocument:
    def __init__(self, input, password=None, autoclose=False):
        """
        Create a PDF document from various input sources.

        Parameters:
        - input: str (file path), bytes, or file-like object
        - password: str, optional password for encrypted PDFs
        - autoclose: bool, whether a file-like input object is closed automatically when the document is finalized
        """

    @classmethod
    def new(cls) -> PdfDocument:
        """Create a new empty PDF document."""
```

Example usage:

```python
import pypdfium2 as pdfium

# Load from file path
pdf = pdfium.PdfDocument("document.pdf")

# Load with password
pdf = pdfium.PdfDocument("encrypted.pdf", password="secret")

# Load from bytes
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()
pdf = pdfium.PdfDocument(pdf_bytes)

# Create new document
new_pdf = pdfium.PdfDocument.new()
```

### Document Information

Query document-level properties: page count, PDF version, file identifier, tagging, page mode, and form type.

```python { .api }
def __len__(self) -> int:
    """Get the number of pages in the document."""

def get_version(self) -> int | None:
    """Get PDF version number (e.g., 14 for PDF 1.4)."""

def get_identifier(self, type=...) -> bytes:
    """Get document file identifier."""

def is_tagged(self) -> bool:
    """Check if document is a tagged PDF for accessibility."""

def get_pagemode(self) -> int:
    """Get page mode (how document should be displayed)."""

def get_formtype(self) -> int:
    """Get form type if document contains interactive forms."""
```
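
Because `get_version()` reports the version as an integer such as 14 (or `None`), display code usually has to convert it. The following is a minimal sketch: `format_pdf_version` is a hypothetical helper introduced here for illustration, not part of pypdfium2, and it assumes the integer encodes the major and minor version as tens and units.

```python
def format_pdf_version(version):
    """Turn a version integer such as 14 into the string "1.4".

    Hypothetical helper (not part of pypdfium2). `version` is the value
    returned by PdfDocument.get_version(), which may be None.
    """
    if version is None:
        return "unknown"
    # Assumption: tens digit is the major version, units digit the minor.
    major, minor = divmod(version, 10)
    return f"{major}.{minor}"


print(format_pdf_version(14))    # 1.4
print(format_pdf_version(None))  # unknown
```

With a loaded document this would be called as `format_pdf_version(pdf.get_version())`.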

### Metadata Management

Read PDF metadata, including title, author, subject, keywords, and creation information.

```python { .api }
def get_metadata_value(self, key: str) -> str:
    """
    Get specific metadata value.

    Parameters:
    - key: str, metadata key (Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate)

    Returns:
    str: Metadata value or empty string if not found
    """

def get_metadata_dict(self, skip_empty=False) -> dict:
    """
    Get all metadata as dictionary.

    Parameters:
    - skip_empty: bool, exclude empty metadata values

    Returns:
    dict: Metadata key-value pairs
    """

# Available metadata keys
METADATA_KEYS = ("Title", "Author", "Subject", "Keywords", "Creator", "Producer", "CreationDate", "ModDate")
```

Example:

```python
import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("document.pdf")

# Get specific metadata
title = pdf.get_metadata_value("Title")
author = pdf.get_metadata_value("Author")

# Get all metadata
metadata = pdf.get_metadata_dict()
print(f"Title: {metadata.get('Title', 'Unknown')}")
print(f"Pages: {len(pdf)}")
print(f"PDF Version: {pdf.get_version()}")
```
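
`CreationDate` and `ModDate` are typically stored as PDF date strings of the form `D:YYYYMMDDHHmmSS+HH'mm'`, where everything after the year is optional. The sketch below shows one way to turn such a string into a `datetime`. `parse_pdf_date` is a hypothetical helper, not a pypdfium2 function, and it returns `None` for anything it cannot parse.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z, +HH'mm' or -HH'mm'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(value):
    """Parse a PDF date string; return a datetime, or None if it is invalid.

    Hypothetical helper (not part of pypdfium2). The result is timezone-aware
    only when the string carries a UTC marker or an offset.
    """
    match = _PDF_DATE.match(value.strip())
    if not match:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    try:
        tzinfo = None
        if sign in ("Z", "z"):
            tzinfo = timezone.utc
        elif sign in ("+", "-"):
            offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
            tzinfo = timezone(offset if sign == "+" else -offset)
        # Missing fields fall back to the first month/day and to midnight.
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tzinfo,
        )
    except ValueError:  # out-of-range field, e.g. month 13
        return None


print(parse_pdf_date("D:20240131120000+01'00'"))  # 2024-01-31 12:00:00+01:00
```

A typical call would be `parse_pdf_date(pdf.get_metadata_value("CreationDate"))`. Since a missing value comes back as an empty string, the helper returns `None` in that case.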

### Page Management

Access, create, delete, and manipulate pages within the document.

```python { .api }
def __iter__(self) -> Iterator[PdfPage]:
    """Iterate over all pages in the document."""

def __getitem__(self, index: int) -> PdfPage:
    """Get page by index (0-based)."""

def __delitem__(self, index: int):
    """Delete page by index."""

def get_page(self, index: int) -> PdfPage:
    """Get page by index with explicit method."""

def new_page(self, width: float, height: float, index: int | None = None) -> PdfPage:
    """
    Create new page in document.

    Parameters:
    - width: float, page width in PDF units (1/72 inch)
    - height: float, page height in PDF units
    - index: int, optional insertion index (None = append)

    Returns:
    PdfPage: New page object
    """

def del_page(self, index: int):
    """Delete page by index."""

def import_pages(self, pdf: PdfDocument, pages=None, index=None):
    """
    Import pages from another PDF document.

    Parameters:
    - pdf: PdfDocument, source document
    - pages: list of int, page indices to import (None = all pages)
    - index: int, insertion point in this document (None = append)
    """

def get_page_size(self, index: int) -> tuple[float, float]:
    """Get page dimensions as (width, height) tuple."""

def get_page_label(self, index: int) -> str:
    """Get page label (may differ from index for custom numbering)."""

def page_as_xobject(self, index: int, dest_pdf: PdfDocument) -> PdfXObject:
    """Convert page to Form XObject for embedding in another document."""
```

Example usage:

```python
import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("document.pdf")

# Access pages
first_page = pdf[0]
last_page = pdf[-1]

# Iterate pages
for i, page in enumerate(pdf):
    print(f"Page {i+1}: {page.get_size()}")

# Create new page
new_page = pdf.new_page(612, 792)  # US Letter size

# Import pages from another PDF
source_pdf = pdfium.PdfDocument("source.pdf")
pdf.import_pages(source_pdf, pages=[0, 2, 4])  # Import pages 1, 3, 5

# Delete a page
del pdf[5]
```
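
Page dimensions are given in PDF units of 1/72 inch, so sizes specified in millimetres need converting before they are passed to `new_page()`. The helpers below are a small illustrative sketch. `mm_to_points` and `points_to_mm` are hypothetical names, not pypdfium2 functions.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72  # one PDF unit is 1/72 inch


def mm_to_points(mm):
    """Convert millimetres to PDF units (points)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def points_to_mm(points):
    """Convert PDF units (points) to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


# A4 is 210 mm x 297 mm
a4_width, a4_height = (round(mm_to_points(v), 2) for v in (210, 297))
print(a4_width, a4_height)  # 595.28 841.89
```

The resulting values could then be used as `pdf.new_page(a4_width, a4_height)`, in the same way the example above uses 612 x 792 for US Letter.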

### File Attachments

Manage embedded file attachments within the PDF document.

```python { .api }
def count_attachments(self) -> int:
    """Get number of file attachments."""

def get_attachment(self, index: int) -> PdfAttachment:
    """Get attachment by index."""

def new_attachment(self, name: str) -> PdfAttachment:
    """
    Create new file attachment.

    Parameters:
    - name: str, attachment filename

    Returns:
    PdfAttachment: New attachment object
    """

def del_attachment(self, index: int):
    """Delete attachment by index."""
```

### Document Outline and Bookmarks

Navigate and extract the document's table of contents structure, including nested bookmarks.

```python { .api }
def get_toc(self, max_depth=15, parent=None, level=0, seen=None) -> Iterator[PdfOutlineItem]:
    """
    Iterate through the bookmarks in the document's table of contents.

    Parameters:
    - max_depth: int, maximum recursion depth to consider (default: 15)
    - parent: internal parent bookmark (typically None for root level)
    - level: internal nesting level (typically 0 for root)
    - seen: internal set for circular reference detection

    Yields:
    PdfOutlineItem: Bookmark information objects

    Each bookmark contains title, page reference, view settings, and
    hierarchical information including nesting level and child counts.
    """
```

#### PdfOutlineItem Class

Bookmark information structure for PDF table of contents entries.

```python { .api }
class PdfOutlineItem:
    """
    Bookmark information namedtuple for PDF outline entries.

    Represents a single bookmark/outline item from a PDF's table of contents,
    containing hierarchical navigation information and target page details.

    Attributes:
    - level: int, number of parent items (nesting depth)
    - title: str, title string of the bookmark
    - is_closed: bool | None, True if children should be collapsed,
      False if expanded, None if no children
    - n_kids: int, absolute number of child items
    - page_index: int | None, zero-based target page index (None if no target)
    - view_mode: int, view mode constant defining coordinate interpretation
    - view_pos: list[float], target position coordinates on the page
    """

    level: int
    title: str
    is_closed: bool | None
    n_kids: int
    page_index: int | None
    view_mode: int
    view_pos: list[float]
```

Example usage:

```python
import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("document_with_bookmarks.pdf")

# Extract table of contents
for bookmark in pdf.get_toc():
    indent = "  " * bookmark.level  # Indent based on nesting
    print(f"{indent}{bookmark.title}")

    if bookmark.page_index is not None:
        print(f"{indent}  → Page {bookmark.page_index + 1}")
        print(f"{indent}  → Position: {bookmark.view_pos}")

    if bookmark.n_kids > 0:
        expanded = "📂" if not bookmark.is_closed else "📁"
        print(f"{indent}  {expanded} ({bookmark.n_kids} children)")

# Navigate to specific bookmark
for bookmark in pdf.get_toc():
    if "Chapter 1" in bookmark.title and bookmark.page_index is not None:
        # Load the target page
        target_page = pdf[bookmark.page_index]
        break
```
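
Since each item carries only its own `level`, reconstructing a bookmark's full path means tracking its ancestors while iterating. The sketch below does this with a stack, assuming the items arrive in depth-first order with parents before their children. `toc_paths` and the `Item` stand-in are hypothetical names used for illustration. Real items would come from `pdf.get_toc()`.

```python
from collections import namedtuple

# Minimal stand-in carrying only the two fields this sketch reads.
Item = namedtuple("Item", "level title")


def toc_paths(items, sep=" > "):
    """Yield the full path of each bookmark from a flat, depth-first list."""
    stack = []  # titles of the current ancestors, one entry per level
    for item in items:
        del stack[item.level:]  # drop entries at this level or deeper
        stack.append(item.title)
        yield sep.join(stack)


sample = [
    Item(0, "Chapter 1"),
    Item(1, "Section 1.1"),
    Item(2, "Details"),
    Item(1, "Section 1.2"),
    Item(0, "Chapter 2"),
]
for path in toc_paths(sample):
    print(path)
```

Because the helper reads only `level` and `title`, it should work equally on `PdfOutlineItem` objects, for example `list(toc_paths(pdf.get_toc()))`.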

### Interactive Forms

Initialize the interactive form environment for handling PDF forms and annotations.

```python { .api }
def init_forms(self, config=None):
    """
    Initialize interactive form environment.

    Parameters:
    - config: optional form configuration

    Sets up form environment for handling interactive elements,
    annotations, and form fields.
    """
```

#### PdfFormEnv Class

Form environment helper class for managing interactive PDF forms.

```python { .api }
class PdfFormEnv:
    """
    Form environment helper class for managing interactive PDF forms.

    This class provides the form environment context needed for rendering
    and interacting with PDF forms. Created automatically when init_forms()
    is called on a document that contains forms.

    Attributes:
    - raw: FPDF_FORMHANDLE, underlying PDFium form env handle
    - config: FPDF_FORMFILLINFO, form configuration interface
    - pdf: PdfDocument, parent document this form env belongs to
    """

    def __init__(self, raw, config, pdf):
        """
        Initialize form environment.

        Parameters:
        - raw: FPDF_FORMHANDLE, PDFium form handle
        - config: FPDF_FORMFILLINFO, form configuration
        - pdf: PdfDocument, parent document

        Note: This is typically created automatically by PdfDocument.init_forms()
        rather than being instantiated directly.
        """

    def close(self):
        """Close and clean up form environment resources."""
```

Example usage:

```python
import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("form.pdf")

# Initialize forms if document contains them
pdf.init_forms()

if pdf.formenv:
    print("Form environment is active")
    # Form environment will be used automatically during page rendering
    # to handle interactive form elements
```

### Document Saving

Save PDF documents to files or buffers with version control and optimization options.

```python { .api }
def save(self, dest, version=None, flags=...):
    """
    Save document to file or buffer.

    Parameters:
    - dest: str (file path) or file-like object for output
    - version: int, optional PDF version to save as
    - flags: various save options and optimization flags

    Saves the current state of the document including all modifications,
    new pages, and metadata changes.
    """
```

Example:

```python
import io

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("input.pdf")

# Modify document
pdf.new_page(612, 792)

# Save to new file
pdf.save("output.pdf")

# Save to buffer
buffer = io.BytesIO()
pdf.save(buffer)
pdf_bytes = buffer.getvalue()
```

### Resource Management

Proper cleanup and resource management for PDF documents.

```python { .api }
def close(self):
    """Close document and free resources."""

def __enter__(self) -> PdfDocument:
    """Context manager entry."""

def __exit__(self, exc_type, exc_val, exc_tb):
    """Context manager exit with cleanup."""
```

Always close documents when done, or use a context manager:

```python
import pypdfium2 as pdfium

# Manual cleanup
pdf = pdfium.PdfDocument("document.pdf")
# ... work with PDF
pdf.close()

# Context manager (recommended)
with pdfium.PdfDocument("document.pdf") as pdf:
    # ... work with PDF
    pass  # Automatically closed
```

## Properties

```python { .api }
@property
def raw(self) -> FPDF_DOCUMENT:
    """Raw PDFium document handle for low-level operations."""

@property
def formenv(self) -> PdfFormEnv | None:
    """Form environment if initialized, None otherwise."""
```

## Advanced Features

### Unsupported Feature Handling

Handle notifications about PDF features not supported by the PDFium library.

#### PdfUnspHandler Class

`PdfUnspHandler` manages the callbacks that receive these notifications.

```python { .api }
class PdfUnspHandler:
    """
    Unsupported feature handler helper class.

    Manages callbacks for handling notifications when PDFium encounters
    PDF features that are not supported by the current build. Useful for
    logging, debugging, and informing users about document limitations.

    Attributes:
    - handlers: dict[str, callable], dictionary of named handler functions
      called with unsupported feature codes (FPDF_UNSP_*)
    """

    def __init__(self):
        """Initialize unsupported feature handler."""

    def setup(self, add_default=True):
        """
        Attach the handler to PDFium and register exit function.

        Parameters:
        - add_default: bool, if True, add default warning callback

        Sets up the handler to receive notifications from PDFium when
        unsupported features are encountered during document processing.
        """

    def __call__(self, _, type: int):
        """
        Handle unsupported feature notification.

        Parameters:
        - _: unused parameter (PDFium context)
        - type: int, unsupported feature code (FPDF_UNSP_*)

        Called automatically by PDFium when unsupported features are found.
        Executes all registered handler functions with the feature code.
        """
```

Example usage:

```python
import pypdfium2 as pdfium

# Create and setup unsupported feature handler
unsp_handler = pdfium.PdfUnspHandler()

# Add custom handler for unsupported features
def my_handler(feature_code):
    # Codes follow PDFium's FPDF_UNSP_* constants; 9 and 10 are not assigned
    feature_name = {
        1: "Document XFA",
        2: "Portable Collection",
        3: "Attachment",
        4: "Security",
        5: "Shared Review",
        6: "Shared Form Acrobat",
        7: "Shared Form Filesystem",
        8: "Shared Form Email",
        11: "3D Annotation",
        12: "Movie Annotation",
        13: "Sound Annotation",
        14: "Screen Media",
        15: "Screen Rich Media",
        16: "Attachment Annotation",
        17: "Signature Annotation"
    }.get(feature_code, f"Unknown feature {feature_code}")

    print(f"Warning: Unsupported PDF feature detected: {feature_name}")

unsp_handler.handlers["custom"] = my_handler

# Setup handler (includes default warning logger)
unsp_handler.setup(add_default=True)

# Now when processing PDFs, unsupported features will be reported
pdf = pdfium.PdfDocument("document_with_unsupported_features.pdf")
# Any unsupported features will trigger the handlers
```