# pypdfium2


Python bindings to PDFium for comprehensive PDF manipulation, rendering, and processing. Built on Google's powerful PDFium library, pypdfium2 provides both high-level helper classes for common PDF operations and low-level raw bindings for advanced functionality.

## Package Information

- **Package Name**: pypdfium2
- **Language**: Python
- **Installation**: `pip install pypdfium2`
- **Python Requirements**: Python 3.6+

## Core Imports

```python
import pypdfium2 as pdfium
```

For direct access to specific classes:

```python
from pypdfium2 import PdfDocument, PdfPage, PdfBitmap
```

For version information:

```python
from pypdfium2 import PYPDFIUM_INFO, PDFIUM_INFO
```

## Basic Usage

```python
import pypdfium2 as pdfium

# Open a PDF document
pdf = pdfium.PdfDocument("document.pdf")

# Get basic information
print(f"Pages: {len(pdf)}")
print(f"Version: {pdf.get_version()}")
print(f"Metadata: {pdf.get_metadata_dict()}")

# Render first page to image
page = pdf[0]
bitmap = page.render(scale=2.0)
pil_image = bitmap.to_pil()
pil_image.save("page1.png")

# Extract text from page
textpage = page.get_textpage()
text = textpage.get_text_range()
print(f"Page text: {text}")

# Clean up
pdf.close()
```

## Architecture

pypdfium2 follows a layered architecture design:

- **Helper Classes**: High-level Python API (PdfDocument, PdfPage, PdfBitmap, etc.) providing intuitive interfaces for common operations
- **Raw Bindings**: Direct access to PDFium C API functions through the `pypdfium2.raw` module
- **Type System**: Named tuples and data classes for structured information (PdfBitmapInfo, ImageInfo, etc.)
- **Resource Management**: Automatic cleanup with context managers and explicit `close()` methods
- **Multi-format Support**: PDF reading/writing, image rendering (PIL, NumPy), text extraction

This design enables both simple high-level operations and advanced low-level manipulation while maintaining compatibility with the broader Python ecosystem.

## Capabilities

### Document Management

Core PDF document operations including loading, creating, saving, and metadata manipulation. Supports password-protected PDFs, form handling, and file attachments.

```python { .api }
class PdfDocument:
    def __init__(self, input_data, password=None, autoclose=False): ...
    @classmethod
    def new(cls): ...
    def __len__(self) -> int: ...
    def save(self, dest, version=None, flags=...): ...
    def get_metadata_dict(self, skip_empty=False) -> dict: ...
    def is_tagged(self) -> bool: ...
```

[Document Management](./document-management.md)

### Page Manipulation

Page-level operations including rendering, rotation, dimension management, and bounding box manipulation. Supports various rendering formats and customization options.

```python { .api }
class PdfPage:
    def get_size(self) -> tuple[float, float]: ...
    def render(self, rotation=0, scale=1, ...) -> PdfBitmap: ...
    def get_rotation(self) -> int: ...
    def set_rotation(self, rotation): ...
    def get_mediabox(self, fallback_ok=True) -> tuple | None: ...
```

[Page Manipulation](./page-manipulation.md)
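
The `scale` argument of `render()` controls how many pixels are produced per PDF canvas unit. As a rough sketch, assuming one canvas unit is 1/72 inch (the usual PDF default) and that pixel dimensions are rounded up, the helpers below convert a DPI target to a scale factor and estimate the resulting bitmap size. `scale_for_dpi` and `rendered_size` are hypothetical names used for illustration, not part of pypdfium2.

```python
import math

# Assumption: one PDF canvas unit is 1/72 inch
PDF_UNITS_PER_INCH = 72


def scale_for_dpi(dpi: float) -> float:
    """Convert a target resolution in DPI to a render scale factor."""
    return dpi / PDF_UNITS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, scale: float) -> tuple:
    """Estimate the pixel size of a page rendered at the given scale."""
    # Assumption: partial pixels are rounded up
    return (math.ceil(width_pt * scale), math.ceil(height_pt * scale))


scale = scale_for_dpi(144)
print(scale)                           # 2.0
print(rendered_size(612, 792, scale))  # (1224, 1584) for a US Letter page
```

Under these assumptions, the `scale=2.0` used in Basic Usage corresponds to roughly 144 DPI.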

### Text Processing

Comprehensive text extraction and search capabilities with support for bounded text extraction, character-level positioning, and full-text search.

```python { .api }
class PdfTextPage:
    def get_text_range(self, index=0, count=-1, errors="ignore", force_this=False) -> str: ...
    def get_text_bounded(self, left=None, bottom=None, right=None, top=None, errors="ignore") -> str: ...
    def search(self, text, index=0, match_case=False, match_whole_word=False, consecutive=False) -> PdfTextSearcher: ...
    def get_charbox(self, index, loose=False) -> tuple: ...
```

[Text Processing](./text-processing.md)
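
Character boxes and text bounds are expressed in PDF canvas units, while a rendered image is addressed in pixels from the top-left corner. The sketch below shows one way to map a box onto a rendered image. It assumes the box is ordered `(left, bottom, right, top)` with the origin at the bottom-left of the page, that the page is not rotated, and that its visible area starts at `(0, 0)`. `pdf_box_to_pixels` is a hypothetical helper, not a pypdfium2 function.

```python
def pdf_box_to_pixels(box, page_height, scale):
    """Map a (left, bottom, right, top) box in PDF units to pixel coordinates.

    Returns (left, top, right, bottom) measured from the image's top-left
    corner, which is the order most imaging libraries expect.
    """
    left, bottom, right, top = box
    return (
        left * scale,
        (page_height - top) * scale,     # flip the y axis
        right * scale,
        (page_height - bottom) * scale,
    )


# Example values are made up: a box on a 792-unit-high page rendered at scale 2
print(pdf_box_to_pixels((72, 700, 144, 720), page_height=792, scale=2))
# (144, 144, 288, 184)
```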

### Image and Bitmap Operations

Image rendering, manipulation, and extraction with support for multiple output formats including PIL Images, NumPy arrays, and raw bitmaps.

```python { .api }
class PdfBitmap:
    @classmethod
    def from_pil(cls, pil_image, recopy=False) -> PdfBitmap: ...
    def to_numpy(self) -> numpy.ndarray: ...
    def to_pil(self) -> PIL.Image.Image: ...
    def fill_rect(self, left, top, width, height, color): ...
```

[Image and Bitmap Operations](./image-bitmap.md)

### Page Objects and Graphics

Manipulation of PDF page objects including images, text, and vector graphics. Supports object transformation, insertion, and removal.

```python { .api }
class PdfObject:
    def get_pos(self) -> tuple: ...
    def get_matrix(self) -> PdfMatrix: ...
    def transform(self, matrix): ...

class PdfImage(PdfObject):
    def get_metadata(self) -> ImageInfo: ...
    def extract(self, dest, *args, **kwargs): ...
```

[Page Objects and Graphics](./page-objects.md)

### File Attachments

Management of embedded file attachments with support for attachment metadata, data extraction, and modification.

```python { .api }
class PdfAttachment:
    def get_name(self) -> str: ...
    def get_data(self) -> ctypes.Array: ...
    def set_data(self, data): ...
    def get_str_value(self, key) -> str: ...
```

[File Attachments](./attachments.md)
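
Because `get_data()` is listed as returning a `ctypes.Array`, the result usually needs converting before it can be written to disk or passed to other libraries. A minimal sketch, assuming the array holds single-byte elements (for example `c_char`); the sample buffer is a stand-in built by hand, and `buffer_to_bytes` is a hypothetical helper.

```python
import ctypes


def buffer_to_bytes(buffer) -> bytes:
    """Copy a ctypes array into an immutable bytes object."""
    return bytes(buffer)


# Stand-in for the array an attachment might return (made-up data)
fake_data = (ctypes.c_char * 5).from_buffer_copy(b"hello")
payload = buffer_to_bytes(fake_data)
print(payload)  # b'hello'
```

The resulting `bytes` object can then be written with an ordinary `open(path, "wb")` call.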

### Transformation and Geometry

2D transformation matrices for coordinate system manipulation, rotation, scaling, and translation operations.

```python { .api }
class PdfMatrix:
    def __init__(self, a=1, b=0, c=0, d=1, e=0, f=0): ...
    def translate(self, x, y) -> PdfMatrix: ...
    def scale(self, x, y) -> PdfMatrix: ...
    def rotate(self, angle, ccw=False, rad=False) -> PdfMatrix: ...
    def on_point(self, x, y) -> tuple: ...
```

[Transformation and Geometry](./transformation.md)
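
The six coefficients `a` to `f` follow the usual PDF convention for affine transforms, where a point maps to `(a*x + c*y + e, b*x + d*y + f)`. The sketch below works through that arithmetic with a small stand-in class. `Affine` and its `then` method are hypothetical and written only to illustrate the math; this is not the library's `PdfMatrix` implementation.

```python
from typing import NamedTuple


class Affine(NamedTuple):
    """Hypothetical stand-in for a PDF-style matrix (a, b, c, d, e, f)."""
    a: float = 1
    b: float = 0
    c: float = 0
    d: float = 1
    e: float = 0
    f: float = 0

    def then(self, other: "Affine") -> "Affine":
        """Return the matrix that applies self first, then other."""
        return Affine(
            self.a * other.a + self.b * other.c,
            self.a * other.b + self.b * other.d,
            self.c * other.a + self.d * other.c,
            self.c * other.b + self.d * other.d,
            self.e * other.a + self.f * other.c + other.e,
            self.e * other.b + self.f * other.d + other.f,
        )

    def on_point(self, x: float, y: float) -> tuple:
        """Map the point (x, y) through the matrix."""
        return (self.a * x + self.c * y + self.e,
                self.b * x + self.d * y + self.f)


scale = Affine(a=2, d=3)        # scale x by 2 and y by 3
shift = Affine(e=10, f=20)      # translate by (10, 20)
combined = scale.then(shift)
print(combined.on_point(1, 1))  # (12, 23)
```

Order matters: scaling first and then translating gives `(12, 23)`, whereas translating first and then scaling would give `(22, 63)`.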

### Version and Library Information

Access to pypdfium2 and PDFium version information, build details, and feature flags.

```python { .api }
PYPDFIUM_INFO: _version_pypdfium2
PDFIUM_INFO: _version_pdfium

# Version properties
version: str
api_tag: tuple[int]
major: int
minor: int
patch: int
build: int  # PDFIUM_INFO only
```

[Version and Library Information](./version-info.md)

### Command Line Interface

Access to pypdfium2's comprehensive command-line tools for batch processing, text extraction, image operations, and document manipulation.

```python { .api }
def cli_main(raw_args=None) -> int:
    """Main CLI entry point for pypdfium2 command-line tools."""

def api_main(raw_args=None) -> int:
    """Alternative API entry point with same functionality as cli_main."""
```

[Command Line Interface](./cli-tools.md)

## Exception Handling

```python { .api }
class PdfiumError(RuntimeError):
    """Main exception for PDFium library errors"""

class ImageNotExtractableError(Exception):
    """Raised when image cannot be extracted from PDF"""
```

Common error scenarios include invalid PDF files, unsupported operations, memory allocation failures, and file I/O errors. Always handle exceptions when working with external PDF files or performing complex operations.
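
One way to structure such handling is sketched below. To keep the example runnable without the library, it defines a stand-in `PdfiumError` that mirrors the listing above and takes the loader as a parameter. `try_load` and `broken_loader` are hypothetical names, and the error message is made up.

```python
class PdfiumError(RuntimeError):
    """Stand-in mirroring the listing above; real code imports it from pypdfium2."""


def try_load(loader, source):
    """Call loader(source) and return (result, None) or (None, message)."""
    try:
        return loader(source), None
    except PdfiumError as err:
        # Library-level failure, such as an invalid or corrupt PDF
        return None, f"PDFium rejected the input: {err}"
    except OSError as err:
        # File-level failure, such as a missing or unreadable path
        return None, f"Could not read the file: {err}"


def broken_loader(source):
    # Hypothetical loader that simulates a corrupt PDF
    raise PdfiumError("Failed to load document")


result, problem = try_load(broken_loader, "corrupt.pdf")
print(problem)  # PDFium rejected the input: Failed to load document
```

With pypdfium2 installed, the same helper could be called as `try_load(pdfium.PdfDocument, "document.pdf")`, importing `PdfiumError` from the package in place of the stand-in.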

## Raw Bindings Access

For advanced use cases requiring direct PDFium API access:

```python
from pypdfium2 import raw

# Placeholder inputs: a UTF-8 encoded path, and no password
file_path = "document.pdf".encode("utf-8")
password = None

# Access low-level PDFium functions
doc_handle = raw.FPDF_LoadDocument(file_path, password)
page_count = raw.FPDF_GetPageCount(doc_handle)
raw.FPDF_CloseDocument(doc_handle)  # raw handles are not closed automatically
```

The raw module provides complete access to PDFium's C API with all functions, constants, and structures available for advanced manipulation.