Detailed PDF analysis and extraction library with comprehensive table detection and visual debugging capabilities.
npx @tessl/cli install tessl/pypi-pdfplumber@0.11.00
# PDFplumber

PDFplumber is a Python library for detailed PDF analysis and extraction. It exposes the objects on each page (text characters, rectangles, lines, curves, images, and annotations) together with their positions. It also provides table extraction with customizable detection strategies, visual debugging tools for inspecting page structure, and text extraction with optional layout preservation.

## Package Information

- **Package Name**: pdfplumber
- **Language**: Python
- **Installation**: `pip install pdfplumber`

## Core Imports

```python
import pdfplumber
```

Common usage patterns:

```python
from pdfplumber import open  # note: shadows the built-in open() in this module
from pdfplumber.utils import extract_text, bbox_to_rect
```

## Basic Usage

```python
import pdfplumber

# Open a PDF file
with pdfplumber.open("document.pdf") as pdf:
    # Access the first page
    first_page = pdf.pages[0]

    # Extract text from the page
    text = first_page.extract_text()
    print(text)

    # Extract tables
    tables = first_page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

    # Visual debugging - save page as image with overlays
    im = first_page.to_image()
    im.draw_rects(first_page.chars, fill=(255, 0, 0, 30))
    im.save("debug.png")

# Alternative - open without context manager
pdf = pdfplumber.open("document.pdf")
page = pdf.pages[0]
text = page.extract_text()
pdf.close()
```

## Architecture

PDFplumber's architecture centers on the following components:

- **PDF**: Top-level document container managing pages and metadata
- **Page**: Individual page objects containing all PDF elements (text, graphics, tables)
- **Container**: Base class providing object access and filtering capabilities
- **Utils**: Utility functions for geometry, text processing, and PDF internals
- **Table Extraction**: Specialized classes for detecting and extracting tabular data
- **Visual Debugging**: PageImage class for overlaying visual debugging information

This design supports PDF analysis tasks ranging from simple text extraction to document structure analysis and table detection.

## Capabilities

### PDF Document Operations

Core functionality for opening, accessing, and managing PDF documents including metadata extraction, page access, and document-level operations.

```python { .api }
def open(path_or_fp, pages=None, laparams=None, password=None,
         strict_metadata=False, unicode_norm=None, repair=False,
         gs_path=None, repair_setting="default", raise_unicode_errors=True):
    """Open PDF document from file path or stream."""
    ...

def repair(path_or_fp, outfile=None, password=None, gs_path=None,
           setting="default"):
    """Repair PDF using Ghostscript."""
    ...
```

[PDF Operations](./pdf-operations.md)
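
As a minimal sketch of document-level access, the helper below collects the metadata and the text of every page from an already opened PDF. `summarize_pdf` is a hypothetical name introduced here, not part of pdfplumber; it relies only on the `metadata` and `pages` attributes of the PDF object and on each page's `page_number` and `extract_text()`.

```python
def summarize_pdf(pdf):
    """Return metadata and per-page text for an opened pdfplumber PDF.

    Hypothetical helper for illustration; `pdf` is the object returned
    by pdfplumber.open().
    """
    page_texts = {}
    for page in pdf.pages:
        # page_number is 1-based; fall back to "" if no text was extracted
        page_texts[page.page_number] = page.extract_text() or ""
    return {"metadata": dict(pdf.metadata), "page_texts": page_texts}


# Typical use (requires a real file):
# import pdfplumber
# with pdfplumber.open("document.pdf") as pdf:
#     summary = summarize_pdf(pdf)
```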

### Text Extraction

Advanced text extraction with layout-aware algorithms, word detection, text search, and character-level analysis with position information.

```python { .api }
def extract_text(**kwargs):
    """Extract text using layout-aware algorithm."""
    ...

def extract_words(**kwargs):
    """Extract words as objects with position data."""
    ...

def search(pattern, regex=True, case=True, **kwargs):
    """Search for text patterns with regex support."""
    ...
```

[Text Extraction](./text-extraction.md)
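
`extract_words()` returns one dictionary per word, with keys that include `text`, `x0`, `x1`, `top`, and `bottom`. To illustrate working with that position data, the sketch below groups words into lines by their `top` coordinate. `group_words_into_lines` and its `tolerance` parameter are hypothetical names introduced here, and the sample words are made up.

```python
def group_words_into_lines(words, tolerance=3):
    """Group word dicts (shaped like extract_words() output) into text lines.

    A word starts a new line when its `top` differs from the previous
    word's `top` by more than `tolerance`. Hypothetical helper, not part
    of pdfplumber.
    """
    lines = []
    current = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if current and abs(word["top"] - current[-1]["top"]) > tolerance:
            lines.append(current)
            current = []
        current.append(word)
    if current:
        lines.append(current)
    # Within each line, read left to right
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


# Made-up sample shaped like extract_words() output (extra keys omitted)
sample_words = [
    {"text": "World", "x0": 60, "x1": 90, "top": 10.2, "bottom": 20},
    {"text": "Hello", "x0": 10, "x1": 50, "top": 10.0, "bottom": 20},
    {"text": "Bye", "x0": 10, "x1": 30, "top": 30.0, "bottom": 40},
]
lines = group_words_into_lines(sample_words)
```

With a real page, the input would come from `page.extract_words()`.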

### Table Extraction

Sophisticated table detection and extraction with customizable strategies, edge detection algorithms, and comprehensive configuration options.

```python { .api }
def find_tables(table_settings=None):
    """Find all tables using detection algorithms."""
    ...

def extract_tables(table_settings=None):
    """Extract tables as 2D arrays."""
    ...

class TableSettings:
    """Configuration for table detection parameters."""
    ...
```

[Table Extraction](./table-extraction.md)
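
Both methods accept a `table_settings` dictionary. The sketch below shows one plausible configuration for tables that lack ruling lines, using the `vertical_strategy`, `horizontal_strategy`, and `snap_tolerance` keys. The values are illustrative rather than recommendations, and `extract_clean_tables` is a hypothetical helper that also normalizes empty cells.

```python
# Illustrative settings: infer cell boundaries from text alignment
# instead of ruling lines.
TEXT_BASED_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 3,
}


def extract_clean_tables(page, table_settings=None):
    """Extract tables from a page and replace empty (None) cells with "".

    Hypothetical helper; `page` is a pdfplumber page object.
    """
    tables = page.extract_tables(table_settings=table_settings)
    return [
        [[(cell or "").strip() for cell in row] for row in table]
        for table in tables
    ]


# Typical use (requires a real file):
# import pdfplumber
# with pdfplumber.open("document.pdf") as pdf:
#     tables = extract_clean_tables(pdf.pages[0], TEXT_BASED_SETTINGS)
```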

### Page Manipulation

Page cropping, object filtering, bounding box operations, and coordinate transformations for precise PDF element analysis.

```python { .api }
def crop(bbox, relative=False, strict=True):
    """Crop page to bounding box."""
    ...

def within_bbox(bbox, relative=False, strict=True):
    """Filter objects within bounding box."""
    ...

def filter(test_function):
    """Filter objects using custom function."""
    ...
```

[Page Manipulation](./page-manipulation.md)
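
Cropping and filtering can be chained. As a sketch, the code below computes a bounding box for the top half of a page, crops to it, and keeps only characters at or above a minimum font size. `top_half_bbox`, `large_chars_in_top_half`, and `min_size` are hypothetical names introduced for illustration; the sketch assumes the page exposes `bbox` and that character objects carry `object_type` and `size` keys.

```python
def top_half_bbox(page_bbox):
    """Return the bounding box covering the top half of `page_bbox`.

    Boxes use the (x0, top, x1, bottom) order, with `top` measured
    downward from the top edge of the page.
    """
    x0, top, x1, bottom = page_bbox
    return (x0, top, x1, top + (bottom - top) / 2)


def large_chars_in_top_half(page, min_size=12):
    """Crop to the top half, then keep characters of at least `min_size`.

    Hypothetical helper; `page` is a pdfplumber page object.
    """
    cropped = page.crop(top_half_bbox(page.bbox))
    # filter() keeps the objects for which the test function returns True;
    # non-character objects are dropped by the object_type check.
    filtered = cropped.filter(
        lambda obj: obj.get("object_type") == "char"
        and obj.get("size", 0) >= min_size
    )
    return filtered.chars
```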

### Visual Debugging

Comprehensive visualization tools for overlaying debug information on PDF pages, including object highlighting, table structure visualization, and custom drawing operations.

```python { .api }
def to_image(resolution=None, width=None, height=None, antialias=False):
    """Convert page to image for debugging."""
    ...

class PageImage:
    """Image representation with drawing capabilities."""
    def draw_rects(self, list_of_rects, **kwargs): ...
    def debug_table(self, table, **kwargs): ...
```

[Visual Debugging](./visual-debugging.md)
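
A short sketch of how these calls might combine: render the page, outline each extracted word, overlay every detected table, and save the result. `save_debug_image` is a hypothetical helper, and the default resolution of 150 is an arbitrary choice for illustration.

```python
def save_debug_image(page, path, resolution=150):
    """Render a page, outline its words and detected tables, and save it.

    Hypothetical helper; `page` is a pdfplumber page object.
    """
    im = page.to_image(resolution=resolution)
    # Word dicts carry x0/top/x1/bottom, so they can be drawn as rectangles
    im.draw_rects(page.extract_words())
    for table in page.find_tables():
        im.debug_table(table)
    im.save(path)
    return im


# Typical use (requires a real file):
# import pdfplumber
# with pdfplumber.open("document.pdf") as pdf:
#     save_debug_image(pdf.pages[0], "debug.png")
```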

### Utility Functions

Extensive utility functions for geometry operations, text processing, clustering algorithms, and PDF internal structure manipulation.

```python { .api }
def bbox_to_rect(bbox):
    """Convert bounding box to rectangle dictionary."""
    ...

def merge_bboxes(bboxes):
    """Merge multiple bounding boxes."""
    ...

def cluster_objects(objs, key_fn, tolerance):
    """Cluster objects by key function."""
    ...
```

[Utilities](./utilities.md)

### Command Line Interface

Complete command-line interface for PDF processing with support for text extraction, object export, and structure analysis.

```python { .api }
def main(args_raw=None):
    """CLI entry point with full argument parsing."""
    ...
```

[Command Line Interface](./cli.md)

## Known Issues

**Note**: The `set_debug` function is listed in the package's `__all__` export list but is not actually implemented in version 0.11.7. Attempting to use `pdfplumber.set_debug()` will result in an `AttributeError`.

## Types and Exceptions

```python { .api }
from typing import Any, Dict, List, Tuple, Union

# Core type aliases
T_num = Union[int, float]
T_bbox = Tuple[T_num, T_num, T_num, T_num]  # (x0, top, x1, bottom)
T_obj = Dict[str, Any]  # PDF object representation
T_obj_list = List[T_obj]

# Custom exceptions
class MalformedPDFException(Exception):
    """Raised for malformed PDF files."""
    ...

class PdfminerException(Exception):
    """Wrapper for pdfminer exceptions."""
    ...
```
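
To make the `(x0, top, x1, bottom)` convention concrete, the sketch below redeclares the aliases locally and applies them to a made-up object dictionary. `obj_bbox` and `bbox_area` are hypothetical helpers written for this example, not functions from the library.

```python
from typing import Any, Dict, Tuple, Union

T_num = Union[int, float]
T_bbox = Tuple[T_num, T_num, T_num, T_num]  # (x0, top, x1, bottom)
T_obj = Dict[str, Any]


def obj_bbox(obj: T_obj) -> T_bbox:
    """Read the bounding box out of an object dictionary."""
    return (obj["x0"], obj["top"], obj["x1"], obj["bottom"])


def bbox_area(bbox: T_bbox) -> T_num:
    """Area of a bounding box, in squared page units."""
    x0, top, x1, bottom = bbox
    return (x1 - x0) * (bottom - top)


# Made-up object with the positional keys used by page objects
rect = {"x0": 10, "top": 20, "x1": 110, "bottom": 70}
area = bbox_area(obj_bbox(rect))
```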