tessl/pypi-pymupdf

High performance Python library for data extraction, analysis, conversion & manipulation of PDF and other documents.

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/pymupdf@1.26.x

To install, run

npx @tessl/cli install tessl/pypi-pymupdf@1.26.0

# PyMuPDF

A high-performance Python library for data extraction, analysis, conversion, and manipulation of PDF (and other) documents. PyMuPDF builds on the MuPDF C library and lets developers extract text, images, and metadata, manipulate document content, and render pages to various formats.

## Package Information

- **Package Name**: PyMuPDF
- **Language**: Python
- **Installation**: `pip install PyMuPDF`
- **Minimum Python Version**: 3.9+

## Core Imports

```python
import pymupdf
```

Legacy compatibility (still supported):

```python
import fitz  # Maps to pymupdf
```

## Basic Usage

```python
import pymupdf

# Open a document
doc = pymupdf.open("document.pdf")  # Same as pymupdf.Document("document.pdf")

# Extract text from all pages via the Page.get_text() method
text = ""
for page in doc:
    text += page.get_text()

# Get document metadata (a dict with keys such as "title" and "author")
metadata = doc.metadata

# Save and close
doc.save("output.pdf")
doc.close()
```

## Architecture

PyMuPDF follows a hierarchical document model:

- **Document**: Top-level container representing the entire document (PDF, XPS, EPUB, etc.)
- **Page**: Individual pages containing content, annotations, and links
- **Pixmap**: Raster image representation for rendering and image processing
- **TextPage**: Text extraction and analysis with layout information
- **Geometry Classes**: Matrix, Rect, Point, Quad for coordinate transformations and positioning

The library provides both high-level convenience methods and low-level access to document structures, enabling everything from simple text extraction to complex document manipulation and rendering.

## Capabilities

### Document Operations

Core document handling including opening, saving, and metadata management. Supports PDF, XPS, EPUB, MOBI, CBZ, SVG and other formats with comprehensive document manipulation capabilities.

```python { .api }
# Note: open() is an alias for Document constructor
open = Document

class Document:
    def __init__(self, filename: str = None, stream: bytes = None, filetype: str = None,
                 rect: Rect = None, width: int = 0, height: int = 0, fontsize: int = 11): ...
    def save(self, filename: str, **kwargs) -> None: ...
    def close(self) -> None: ...
    def load_page(self, page_num: int) -> Page: ...
    @property
    def page_count(self) -> int: ...
    @property
    def metadata(self) -> dict: ...
```

[Document Operations](./document-operations.md)

### Page Content Extraction

Text and image extraction from document pages with multiple output formats, search capabilities, and layout analysis. Includes support for structured text extraction with formatting information.

```python { .api }
# Standalone text extraction functions
# (the same operations are available as Page methods, e.g. page.get_text())
def get_text(page: Page, option: str = "text", **kwargs) -> str: ...
def get_text_blocks(page: Page, **kwargs) -> list: ...
def get_text_words(page: Page, **kwargs) -> list: ...
def get_textbox(page: Page, rect: Rect, **kwargs) -> str: ...

class Page:
    def get_textpage(self, **kwargs) -> TextPage: ...
    def search_for(self, needle: str, **kwargs) -> list: ...
    def get_images(self, **kwargs) -> list: ...
    def get_links(self) -> list: ...
```

[Page Content Extraction](./page-content-extraction.md)

### Document Rendering

High-performance rendering of document pages to raster images such as PNG and JPEG. Supports custom resolutions, color spaces, and rendering options.

```python { .api }
class Page:
    def get_pixmap(self, **kwargs) -> Pixmap: ...

class Pixmap:
    def save(self, filename: str, **kwargs) -> None: ...
    def tobytes(self, output: str = "png") -> bytes: ...
    @property
    def width(self) -> int: ...
    @property
    def height(self) -> int: ...
```

[Document Rendering](./document-rendering.md)

### Annotations and Forms

Comprehensive annotation handling including creation, modification, and deletion of various annotation types. Support for interactive forms and form field manipulation.

```python { .api }
class Annot:
    def set_info(self, content: str = None, **kwargs) -> None: ...
    def set_rect(self, rect: Rect) -> None: ...
    def update(self) -> None: ...
    @property
    def type(self) -> list: ...

class Page:
    # Annotations are deleted through the page that owns them
    def delete_annot(self, annot: Annot): ...
```

[Annotations and Forms](./annotations-forms.md)

### Geometry and Transformations

Coordinate system handling with matrices, rectangles, points, and quads for precise positioning and transformations. Essential for layout manipulation and coordinate calculations.

```python { .api }
class Matrix:
    def __init__(self, a: float = 1.0, b: float = 0.0, c: float = 0.0,
                 d: float = 1.0, e: float = 0.0, f: float = 0.0): ...
    def prerotate(self, deg: float) -> Matrix: ...
    def prescale(self, sx: float, sy: float) -> Matrix: ...

class Rect:
    def __init__(self, x0: float, y0: float, x1: float, y1: float): ...
    def transform(self, matrix: Matrix) -> Rect: ...
    @property
    def width(self) -> float: ...
    @property
    def height(self) -> float: ...
```

[Geometry and Transformations](./geometry-transformations.md)

### Table Extraction

Advanced table detection and extraction capabilities with support for table structure analysis, cell content extraction, and export to various formats including pandas DataFrames.

```python { .api }
class Table:
    def extract(self) -> list: ...
    def to_pandas(self) -> 'pandas.DataFrame': ...

class TableFinder:
    def __init__(self, page: Page): ...
    tables: list  # detected Table objects

class Page:
    # Entry point for table detection; returns a TableFinder
    def find_tables(self, **kwargs) -> TableFinder: ...
```

[Table Extraction](./table-extraction.md)

### Document Creation and Modification

Creating new documents and modifying existing ones including page insertion, deletion, and content manipulation. Support for adding text, images, and other content elements.

```python { .api }
class Document:
    def new_page(self, width: float = 595, height: float = 842, **kwargs) -> Page: ...
    def delete_page(self, pno: int) -> None: ...
    def insert_pdf(self, docsrc: Document, **kwargs) -> int: ...

class Page:
    def insert_text(self, point: Point, text: str, **kwargs) -> int: ...
    # Returns the xref of the embedded image
    def insert_image(self, rect: Rect, **kwargs) -> int: ...
```

[Document Creation and Modification](./document-creation-modification.md)

## Types

```python { .api }
class Document:
    """Main document class for PDF and other document formats."""

class Page:
    """Represents a single page in a document."""

class Pixmap:
    """Raster image representation with pixel data."""

class TextPage:
    """Text extraction with layout and formatting information."""

class Annot:
    """Document annotation (note, highlight, etc.)."""

class Matrix:
    """2D transformation matrix for coordinate transformations."""

class Rect:
    """Rectangle defined by four coordinates (x0, y0, x1, y1)."""

class Point:
    """2D point with x and y coordinates."""

class Quad:
    """Quadrilateral defined by four corner points."""

class Font:
    """Font representation for text operations."""

class Archive:
    """Archive file handling for compressed documents."""

class TextWriter:
    """Utility for writing text with advanced formatting."""

class Shape:
    """Drawing operations for vector graphics."""

# Exception types
class FileDataError(RuntimeError):
    """Raised when file data is corrupted or invalid."""

class FileNotFoundError(RuntimeError):
    """Raised when requested file cannot be found."""

class EmptyFileError(FileDataError):
    """Raised when file is empty or contains no data."""
```