# PyMuPDF

A high-performance Python library for data extraction, analysis, conversion, and manipulation of PDF (and other) documents. PyMuPDF is built on top of the MuPDF C library and lets developers extract text, images, and metadata, manipulate document content, and render pages to various formats.

## Package Information

- **Package Name**: PyMuPDF
- **Language**: Python
- **Installation**: `pip install PyMuPDF`
- **Minimum Python Version**: 3.9+

## Core Imports

```python
import pymupdf
```

Legacy compatibility (still supported):

```python
import fitz  # Maps to pymupdf
```

## Basic Usage

```python
import pymupdf

# Open a document
doc = pymupdf.open("document.pdf")  # Same as pymupdf.Document("document.pdf")

# Extract text from all pages using standalone function
text = ""
for page in doc:
    text += pymupdf.get_text(page)

# Get document metadata
metadata = doc.metadata

# Save and close
doc.save("output.pdf")
doc.close()
```

## Architecture

PyMuPDF follows a hierarchical document model:

- **Document**: Top-level container representing the entire document (PDF, XPS, EPUB, etc.)
- **Page**: Individual pages containing content, annotations, and links
- **Pixmap**: Raster image representation for rendering and image processing
- **TextPage**: Text extraction and analysis with layout information
- **Geometry Classes**: Matrix, Rect, Point, Quad for coordinate transformations and positioning

The library provides both high-level convenience methods and low-level access to document structures, supporting tasks that range from simple text extraction to complex document manipulation and rendering.

## Capabilities

### Document Operations

Core document handling including opening, saving, and metadata management. Supports PDF, XPS, EPUB, MOBI, CBZ, SVG and other formats with comprehensive document manipulation capabilities.

```python { .api }
# Note: open() is an alias for Document constructor
open = Document

class Document:
    def __init__(self, filename: str = None, stream: bytes = None, filetype: str = None,
                 rect: Rect = None, width: int = 0, height: int = 0, fontsize: int = 11): ...
    def save(self, filename: str, **kwargs) -> None: ...
    def close(self) -> None: ...
    def load_page(self, page_num: int) -> Page: ...
    @property
    def page_count(self) -> int: ...
    @property
    def metadata(self) -> dict: ...
```

[Document Operations](./document-operations.md)

### Page Content Extraction

Text and image extraction from document pages with multiple output formats, search capabilities, and layout analysis. Includes support for structured text extraction with formatting information.

```python { .api }
# Standalone text extraction functions
def get_text(page: Page, option: str = "text", **kwargs) -> str: ...
def get_text_blocks(page: Page, **kwargs) -> list: ...
def get_text_words(page: Page, **kwargs) -> list: ...
def get_textbox(page: Page, rect: Rect, **kwargs) -> str: ...

class Page:
    def get_textpage(self, **kwargs) -> TextPage: ...
    def search_for(self, needle: str, **kwargs) -> list: ...
    def get_images(self, **kwargs) -> list: ...
    def get_links(self) -> list: ...
```

[Page Content Extraction](./page-content-extraction.md)

### Document Rendering

High-performance rendering of document pages to various formats including PNG, JPEG, and other image formats. Supports custom resolutions, color spaces, and rendering options.

```python { .api }
class Page:
    def get_pixmap(self, **kwargs) -> Pixmap: ...

class Pixmap:
    def save(self, filename: str, **kwargs) -> None: ...
    def tobytes(self, output: str = "png") -> bytes: ...
    @property
    def width(self) -> int: ...
    @property
    def height(self) -> int: ...
```

[Document Rendering](./document-rendering.md)

### Annotations and Forms

Comprehensive annotation handling including creation, modification, and deletion of various annotation types. Support for interactive forms and form field manipulation.

```python { .api }
class Annot:
    def set_info(self, content: str = None, **kwargs) -> None: ...
    def set_rect(self, rect: Rect) -> None: ...
    def update(self) -> None: ...
    def delete(self) -> None: ...
    @property
    def type(self) -> list: ...
```

[Annotations and Forms](./annotations-forms.md)

### Geometry and Transformations

Coordinate system handling with matrices, rectangles, points, and quads for precise positioning and transformations. Essential for layout manipulation and coordinate calculations.

```python { .api }
class Matrix:
    def __init__(self, a: float = 1.0, b: float = 0.0, c: float = 0.0,
                 d: float = 1.0, e: float = 0.0, f: float = 0.0): ...
    def prerotate(self, deg: float) -> Matrix: ...
    def prescale(self, sx: float, sy: float) -> Matrix: ...

class Rect:
    def __init__(self, x0: float, y0: float, x1: float, y1: float): ...
    def transform(self, matrix: Matrix) -> Rect: ...
    @property
    def width(self) -> float: ...
    @property
    def height(self) -> float: ...
```

[Geometry and Transformations](./geometry-transformations.md)

### Table Extraction

Advanced table detection and extraction capabilities with support for table structure analysis, cell content extraction, and export to various formats including pandas DataFrames.

```python { .api }
class Table:
    def extract(self) -> list: ...
    def to_pandas(self) -> 'pandas.DataFrame': ...

class TableFinder:
    def __init__(self, page: Page): ...
    def find_tables(self, **kwargs) -> list: ...
```

[Table Extraction](./table-extraction.md)

### Document Creation and Modification

Creating new documents and modifying existing ones including page insertion, deletion, and content manipulation. Support for adding text, images, and other content elements.

```python { .api }
class Document:
    def new_page(self, width: float = 595, height: float = 842, **kwargs) -> Page: ...
    def delete_page(self, pno: int) -> None: ...
    def insert_pdf(self, docsrc: Document, **kwargs) -> int: ...

class Page:
    def insert_text(self, point: Point, text: str, **kwargs) -> int: ...
    def insert_image(self, rect: Rect, **kwargs) -> None: ...
```

[Document Creation and Modification](./document-creation-modification.md)

## Types

```python { .api }
class Document:
    """Main document class for PDF and other document formats."""

class Page:
    """Represents a single page in a document."""

class Pixmap:
    """Raster image representation with pixel data."""

class TextPage:
    """Text extraction with layout and formatting information."""

class Annot:
    """Document annotation (note, highlight, etc.)."""

class Matrix:
    """2D transformation matrix for coordinate transformations."""

class Rect:
    """Rectangle defined by four coordinates (x0, y0, x1, y1)."""

class Point:
    """2D point with x and y coordinates."""

class Quad:
    """Quadrilateral defined by four corner points."""

class Font:
    """Font representation for text operations."""

class Archive:
    """Archive file handling for compressed documents."""

class TextWriter:
    """Utility for writing text with advanced formatting."""

class Shape:
    """Drawing operations for vector graphics."""

# Exception types
class FileDataError(RuntimeError):
    """Raised when file data is corrupted or invalid."""

class FileNotFoundError(RuntimeError):
    """Raised when requested file cannot be found."""

class EmptyFileError(FileDataError):
    """Raised when file is empty or contains no data."""
```