# Document Management

Core PDF document operations including loading, creating, saving, metadata handling, and document-level manipulation. The PdfDocument class serves as the primary entry point for all PDF operations.

## Capabilities

### Document Creation and Loading

Create new PDF documents or load existing ones from various sources including file paths, bytes, and file-like objects.

```python { .api }
class PdfDocument:
    def __init__(self, input, password=None, autoclose=False):
        """
        Create a PDF document from various input sources.

        Parameters:
        - input: str (file path), bytes, or file-like object
        - password: str, optional password for encrypted PDFs
        - autoclose: bool, automatically close document when object is deleted
        """

    @classmethod
    def new(cls) -> PdfDocument:
        """Create a new empty PDF document."""
```

Example usage:

```python
import pypdfium2 as pdfium

# Load from file path
pdf = pdfium.PdfDocument("document.pdf")

# Load with password
pdf = pdfium.PdfDocument("encrypted.pdf", password="secret")

# Load from bytes
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()
pdf = pdfium.PdfDocument(pdf_bytes)

# Create new document
new_pdf = pdfium.PdfDocument.new()
```

### Document Information

Query document-level information: page count, PDF version, file identifier, tagging, page mode, and form type.

```python { .api }
def __len__(self) -> int:
    """Get the number of pages in the document."""

def get_version(self) -> int | None:
    """Get PDF version number (e.g., 14 for PDF 1.4)."""

def get_identifier(self, type=...) -> bytes:
    """Get document file identifier."""

def is_tagged(self) -> bool:
    """Check if document is a tagged PDF for accessibility."""

def get_pagemode(self) -> int:
    """Get page mode (how document should be displayed)."""

def get_formtype(self) -> int:
    """Get form type if document contains interactive forms."""
```
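
Because `get_version()` reports the version as an integer (or `None`), displaying it in the familiar dotted form takes one small step. The following is a minimal sketch that assumes the two-digit convention described above (14 means 1.4); `format_pdf_version` is a hypothetical helper, not part of pypdfium2.

```python
def format_pdf_version(version):
    """Format an integer PDF version (e.g. 14) as a dotted string (e.g. "1.4")."""
    if version is None:
        # The signature above allows None, so handle it explicitly
        return "unknown"
    major, minor = divmod(version, 10)
    return f"{major}.{minor}"


print(format_pdf_version(14))    # 1.4
print(format_pdf_version(None))  # unknown
```

With a loaded document this would be called as `format_pdf_version(pdf.get_version())`.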

### Metadata Management

Read PDF metadata including title, author, subject, keywords, and creation information.

```python { .api }
def get_metadata_value(self, key: str) -> str:
    """
    Get specific metadata value.

    Parameters:
    - key: str, metadata key (Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate)

    Returns:
    str: Metadata value or empty string if not found
    """

def get_metadata_dict(self, skip_empty=False) -> dict:
    """
    Get all metadata as dictionary.

    Parameters:
    - skip_empty: bool, exclude empty metadata values

    Returns:
    dict: Metadata key-value pairs
    """

# Available metadata keys
METADATA_KEYS = ("Title", "Author", "Subject", "Keywords", "Creator", "Producer", "CreationDate", "ModDate")
```

Example:

```python
pdf = pdfium.PdfDocument("document.pdf")

# Get specific metadata
title = pdf.get_metadata_value("Title")
author = pdf.get_metadata_value("Author")

# Get all metadata
metadata = pdf.get_metadata_dict()
print(f"Title: {metadata.get('Title', 'Unknown')}")
print(f"Pages: {len(pdf)}")
print(f"PDF Version: {pdf.get_version()}")
```
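
The `CreationDate` and `ModDate` values come back as plain strings. PDF files commonly store dates as `D:YYYYMMDDHHmmSS` followed by a timezone offset such as `+01'00'`. The sketch below assumes that layout and that the value is returned unmodified; `parse_pdf_date` is a hypothetical helper, not a pypdfium2 function.

```python
import re
from datetime import datetime, timedelta, timezone

# Every field after the year is optional in this layout
_PDF_DATE = re.compile(
    r"D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[Zz+-])(?:(?P<tz_hour>\d{2})'?)?(?:(?P<tz_minute>\d{2})'?)?)?"
)


def parse_pdf_date(value):
    """Parse a PDF date string into a datetime, or return None if it does not fit."""
    match = _PDF_DATE.fullmatch(value.strip())
    if match is None:
        return None
    parts = match.groupdict()
    try:
        moment = datetime(
            int(parts["year"]), int(parts["month"] or 1), int(parts["day"] or 1),
            int(parts["hour"] or 0), int(parts["minute"] or 0), int(parts["second"] or 0),
        )
        if parts["sign"] is None:
            return moment  # no timezone information: naive datetime
        offset = timedelta(
            hours=int(parts["tz_hour"] or 0), minutes=int(parts["tz_minute"] or 0)
        )
        if parts["sign"] == "-":
            offset = -offset
        return moment.replace(tzinfo=timezone(offset))
    except ValueError:
        # Out-of-range fields such as month 13
        return None


print(parse_pdf_date("D:20240115103000+01'00'"))  # 2024-01-15 10:30:00+01:00
```

An empty string, which `get_metadata_value()` returns for a missing key, yields `None` here.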

### Page Management

Access, create, delete, and manipulate pages within the document.

```python { .api }
def __iter__(self) -> Iterator[PdfPage]:
    """Iterate over all pages in the document."""

def __getitem__(self, index: int) -> PdfPage:
    """Get page by index (0-based)."""

def __delitem__(self, index: int):
    """Delete page by index."""

def get_page(self, index: int) -> PdfPage:
    """Get page by index with explicit method."""

def new_page(self, width: float, height: float, index: int = None) -> PdfPage:
    """
    Create new page in document.

    Parameters:
    - width: float, page width in PDF units (1/72 inch)
    - height: float, page height in PDF units
    - index: int, optional insertion index (None = append)

    Returns:
    PdfPage: New page object
    """

def del_page(self, index: int):
    """Delete page by index."""

def import_pages(self, pdf: PdfDocument, pages=None, index=None):
    """
    Import pages from another PDF document.

    Parameters:
    - pdf: PdfDocument, source document
    - pages: list of int, page indices to import (None = all pages)
    - index: int, insertion point in this document (None = append)
    """

def get_page_size(self, index: int) -> tuple[float, float]:
    """Get page dimensions as (width, height) tuple."""

def get_page_label(self, index: int) -> str:
    """Get page label (may differ from index for custom numbering)."""

def page_as_xobject(self, index: int, dest_pdf: PdfDocument) -> PdfXObject:
    """Convert page to Form XObject for embedding in another document."""
```

Example usage:

```python
pdf = pdfium.PdfDocument("document.pdf")

# Access pages
first_page = pdf[0]
last_page = pdf[-1]

# Iterate pages
for i, page in enumerate(pdf):
    print(f"Page {i+1}: {page.get_size()}")

# Create new page
new_page = pdf.new_page(612, 792)  # US Letter size

# Import pages from another PDF
source_pdf = pdfium.PdfDocument("source.pdf")
pdf.import_pages(source_pdf, pages=[0, 2, 4])  # Import pages 1, 3, 5

# Delete a page
del pdf[5]
```
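
Page indices are 0-based, while page selections are often written 1-based, as in "1,3,5-7". The sketch below converts such a selection into the index list that `import_pages()` accepts for `pages`. `parse_page_ranges` is a hypothetical helper, and it returns indices sorted and de-duplicated, which discards any custom ordering in the input.

```python
def parse_page_ranges(spec, page_count):
    """Convert a 1-based spec like "1,3,5-7" into sorted 0-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"page range {part!r} is outside 1..{page_count}")
        # Shift to 0-based: page N lives at index N - 1
        indices.update(range(start - 1, end))
    return sorted(indices)


print(parse_page_ranges("1,3,5-7", page_count=10))  # [0, 2, 4, 5, 6]
```

With real documents the call would look like `pdf.import_pages(source_pdf, pages=parse_page_ranges("1,3,5-7", len(source_pdf)))`.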

### File Attachments

Manage embedded file attachments within the PDF document.

```python { .api }
def count_attachments(self) -> int:
    """Get number of file attachments."""

def get_attachment(self, index: int) -> PdfAttachment:
    """Get attachment by index."""

def new_attachment(self, name: str) -> PdfAttachment:
    """
    Create new file attachment.

    Parameters:
    - name: str, attachment filename

    Returns:
    PdfAttachment: New attachment object
    """

def del_attachment(self, index: int):
    """Delete attachment by index."""
```

### Document Outline and Bookmarks

Navigate and extract the document's table of contents structure, including nested bookmarks.

```python { .api }
def get_toc(self, max_depth=15, parent=None, level=0, seen=None) -> Iterator[PdfOutlineItem]:
    """
    Iterate through the bookmarks in the document's table of contents.

    Parameters:
    - max_depth: int, maximum recursion depth to consider (default: 15)
    - parent: internal parent bookmark (typically None for root level)
    - level: internal nesting level (typically 0 for root)
    - seen: internal set for circular reference detection

    Yields:
    PdfOutlineItem: Bookmark information objects

    Each bookmark contains title, page reference, view settings, and
    hierarchical information including nesting level and child counts.
    """
```

#### PdfOutlineItem Class

Bookmark information structure for PDF table of contents entries.

```python { .api }
class PdfOutlineItem:
    """
    Bookmark information namedtuple for PDF outline entries.

    Represents a single bookmark/outline item from a PDF's table of contents,
    containing hierarchical navigation information and target page details.

    Attributes:
    - level: int, number of parent items (nesting depth)
    - title: str, title string of the bookmark
    - is_closed: bool | None, True if children should be collapsed,
      False if expanded, None if no children
    - n_kids: int, absolute number of child items
    - page_index: int | None, zero-based target page index (None if no target)
    - view_mode: int, view mode constant defining coordinate interpretation
    - view_pos: list[float], target position coordinates on the page
    """

    level: int
    title: str
    is_closed: bool | None
    n_kids: int
    page_index: int | None
    view_mode: int
    view_pos: list[float]
```

Example usage:

```python
pdf = pdfium.PdfDocument("document_with_bookmarks.pdf")

# Extract table of contents
for bookmark in pdf.get_toc():
    indent = "  " * bookmark.level  # Indent based on nesting
    print(f"{indent}{bookmark.title}")

    if bookmark.page_index is not None:
        print(f"{indent} → Page {bookmark.page_index + 1}")
        print(f"{indent} → Position: {bookmark.view_pos}")

    if bookmark.n_kids > 0:
        expanded = "📂" if not bookmark.is_closed else "📁"
        print(f"{indent} {expanded} ({bookmark.n_kids} children)")

# Navigate to specific bookmark
for bookmark in pdf.get_toc():
    if "Chapter 1" in bookmark.title and bookmark.page_index is not None:
        # Load the target page
        target_page = pdf[bookmark.page_index]
        break
```
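
Since each item carries its own `level`, a flat iteration is enough to rebuild an indented outline as text. The sketch below shows this using a stand-in namedtuple that has only three of the documented `PdfOutlineItem` fields. `OutlineItem` and `format_toc` are hypothetical names chosen for the illustration.

```python
from collections import namedtuple

# Stand-in with a subset of the PdfOutlineItem fields documented above
OutlineItem = namedtuple("OutlineItem", "level title page_index")


def format_toc(items, indent="  "):
    """Render outline items as indented lines with 1-based page numbers."""
    lines = []
    for item in items:
        # page_index is zero-based and may be None when there is no target
        target = "-" if item.page_index is None else str(item.page_index + 1)
        lines.append(f"{indent * item.level}{item.title} (p. {target})")
    return "\n".join(lines)


sample = [
    OutlineItem(0, "Chapter 1", 0),
    OutlineItem(1, "Section 1.1", 2),
    OutlineItem(0, "Appendix", None),
]
print(format_toc(sample))
# Chapter 1 (p. 1)
#   Section 1.1 (p. 3)
# Appendix (p. -)
```

Because the function only reads `level`, `title`, and `page_index`, the items yielded by `pdf.get_toc()` should work with it unchanged.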

### Interactive Forms

Initialize interactive form environment for handling PDF forms and annotations.

```python { .api }
def init_forms(self, config=None):
    """
    Initialize interactive form environment.

    Parameters:
    - config: optional form configuration

    Sets up form environment for handling interactive elements,
    annotations, and form fields.
    """
```
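
Whether a document has forms at all can be read from `get_formtype()`, listed under Document Information. The sketch below maps its integer result to a readable name. The numeric values are an assumption taken from the `FORMTYPE_*` constants in PDFium's public headers, and `describe_formtype` and `has_forms` are hypothetical helpers.

```python
# Assumed values of PDFium's FORMTYPE_* constants
FORM_TYPE_NAMES = {
    0: "none",
    1: "AcroForm",
    2: "XFA (full)",
    3: "XFA (foreground)",
}


def describe_formtype(code):
    """Return a readable name for a form type code."""
    return FORM_TYPE_NAMES.get(code, f"unknown ({code})")


def has_forms(code):
    """True when the code indicates any kind of interactive form."""
    return code != 0


print(describe_formtype(1))  # AcroForm
print(has_forms(0))          # False
```

A caller could then guard form setup with `if has_forms(pdf.get_formtype()): pdf.init_forms()`.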

#### PdfFormEnv Class

Form environment helper class for managing interactive PDF forms.

```python { .api }
class PdfFormEnv:
    """
    Form environment helper class for managing interactive PDF forms.

    This class provides the form environment context needed for rendering
    and interacting with PDF forms. Created automatically when init_forms()
    is called on a document that contains forms.

    Attributes:
    - raw: FPDF_FORMHANDLE, underlying PDFium form env handle
    - config: FPDF_FORMFILLINFO, form configuration interface
    - pdf: PdfDocument, parent document this form env belongs to
    """

    def __init__(self, raw, config, pdf):
        """
        Initialize form environment.

        Parameters:
        - raw: FPDF_FORMHANDLE, PDFium form handle
        - config: FPDF_FORMFILLINFO, form configuration
        - pdf: PdfDocument, parent document

        Note: This is typically created automatically by PdfDocument.init_forms()
        rather than being instantiated directly.
        """

    def close(self):
        """Close and clean up form environment resources."""
```

Example usage:

```python
pdf = pdfium.PdfDocument("form.pdf")

# Initialize forms if document contains them
pdf.init_forms()

if pdf.formenv:
    print("Form environment is active")
    # Form environment will be used automatically during page rendering
    # to handle interactive form elements
```

### Document Saving

Save PDF documents to files or buffers with version control and optimization options.

```python { .api }
def save(self, dest, version=None, flags=...):
    """
    Save document to file or buffer.

    Parameters:
    - dest: str (file path) or file-like object for output
    - version: int, optional PDF version to save as
    - flags: various save options and optimization flags

    Saves the current state of the document including all modifications,
    new pages, and metadata changes.
    """
```

Example:

```python
import io

pdf = pdfium.PdfDocument("input.pdf")

# Modify document
pdf.new_page(612, 792)

# Save to new file
pdf.save("output.pdf")

# Save to buffer
buffer = io.BytesIO()
pdf.save(buffer)
pdf_bytes = buffer.getvalue()
```
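
After saving to a buffer, a quick sanity check is to look at the file header: a PDF file begins with `%PDF-` followed by the version, for example `%PDF-1.7`. The sketch below reads that header and returns the version in the same integer convention as `get_version()`. `header_version` is a hypothetical helper, and it assumes the header sits at the very first byte.

```python
import re

# "%PDF-" followed by a single-digit major and minor version
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(data):
    """Return the header version as an int (17 for "%PDF-1.7"), or None."""
    match = _HEADER.match(data)
    if match is None:
        return None
    return int(match.group(1)) * 10 + int(match.group(2))


print(header_version(b"%PDF-1.7\n..."))  # 17
print(header_version(b"not a pdf"))      # None
```

Applied to the example above, `header_version(pdf_bytes)` gives a cheap check that the buffer holds PDF data before it is written elsewhere.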

### Resource Management

Proper cleanup and resource management for PDF documents.

```python { .api }
def close(self):
    """Close document and free resources."""

def __enter__(self) -> PdfDocument:
    """Context manager entry."""

def __exit__(self, exc_type, exc_val, exc_tb):
    """Context manager exit with cleanup."""
```

Always close documents when done or use context managers:

```python
# Manual cleanup
pdf = pdfium.PdfDocument("document.pdf")
# ... work with PDF
pdf.close()

# Context manager (recommended)
with pdfium.PdfDocument("document.pdf") as pdf:
    # ... work with PDF
    pass  # Automatically closed
```

## Properties

```python { .api }
@property
def raw(self) -> FPDF_DOCUMENT:
    """Raw PDFium document handle for low-level operations."""

@property
def formenv(self) -> PdfFormEnv | None:
    """Form environment if initialized, None otherwise."""
```

## Advanced Features

### Unsupported Feature Handling

Handle notifications about PDF features not supported by the PDFium library.

#### PdfUnspHandler Class

Unsupported feature handler for managing notifications about PDF features not available in PDFium.

```python { .api }
class PdfUnspHandler:
    """
    Unsupported feature handler helper class.

    Manages callbacks for handling notifications when PDFium encounters
    PDF features that are not supported by the current build. Useful for
    logging, debugging, and informing users about document limitations.

    Attributes:
    - handlers: dict[str, callable], dictionary of named handler functions
      called with unsupported feature codes (FPDF_UNSP_*)
    """

    def __init__(self):
        """Initialize unsupported feature handler."""

    def setup(self, add_default=True):
        """
        Attach the handler to PDFium and register exit function.

        Parameters:
        - add_default: bool, if True, add default warning callback

        Sets up the handler to receive notifications from PDFium when
        unsupported features are encountered during document processing.
        """

    def __call__(self, _, type: int):
        """
        Handle unsupported feature notification.

        Parameters:
        - _: unused parameter (PDFium context)
        - type: int, unsupported feature code (FPDF_UNSP_*)

        Called automatically by PDFium when unsupported features are found.
        Executes all registered handler functions with the feature code.
        """
```

Example usage:

```python
import pypdfium2 as pdfium

# Create and setup unsupported feature handler
unsp_handler = pdfium.PdfUnspHandler()

# Add custom handler for unsupported features
def my_handler(feature_code):
    # Codes follow the FPDF_UNSP_* constants in PDFium's fpdf_ext.h
    # (document-level codes are 1-8, annotation codes start at 11)
    feature_name = {
        1: "Document XFA",
        2: "Portable Collection",
        3: "Attachment",
        4: "Security",
        5: "Shared Review",
        6: "Shared Form Acrobat",
        7: "Shared Form Filesystem",
        8: "Shared Form Email",
        11: "3D Annotation",
        12: "Movie Annotation",
        13: "Sound Annotation",
        14: "Screen Media",
        15: "Screen Rich Media",
        16: "Attachment Annotation",
        17: "Signature Annotation",
    }.get(feature_code, f"Unknown feature {feature_code}")

    print(f"Warning: Unsupported PDF feature detected: {feature_name}")

unsp_handler.handlers["custom"] = my_handler

# Setup handler (includes default warning logger)
unsp_handler.setup(add_default=True)

# Now when processing PDFs, unsupported features will be reported
pdf = pdfium.PdfDocument("document_with_unsupported_features.pdf")
# Any unsupported features will trigger the handlers
```
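
The dispatch described for `__call__` (every registered handler receives the feature code) can be pictured with a small stand-alone class. This is a simplified stand-in that mirrors the documented `handlers` dictionary and call signature. It is not pypdfium2's implementation and does not attach to PDFium, and `FeatureNotifier` is a hypothetical name.

```python
class FeatureNotifier:
    """Simplified stand-in: a callable that fans a code out to named handlers."""

    def __init__(self):
        self.handlers = {}

    def __call__(self, _, type):
        # The first argument is ignored, as in the documented signature
        for handler in self.handlers.values():
            handler(type)


seen = []
notifier = FeatureNotifier()
notifier.handlers["collect"] = seen.append
notifier.handlers["log"] = lambda code: print(f"unsupported feature code {code}")

notifier(None, 4)  # prints: unsupported feature code 4
print(seen)        # [4]
```

Keying handlers by name, as the `handlers` dictionary does, lets a caller replace or remove one callback without touching the others.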