Tessl Tile for pypi/pdfplumber@0.11.0

# PDFplumber

pdfplumber is a Python library for detailed PDF analysis and extraction. It gives granular access to PDF structure, including text characters, rectangles, lines, curves, images, and annotations. It also offers table extraction with customizable detection strategies, visual debugging tools for understanding PDF structure, and text extraction with layout preservation options.

## Package Information

- **Package Name**: pdfplumber
- **Language**: Python
- **Installation**: `pip install pdfplumber`

## Core Imports

```python
import pdfplumber
```

Common usage patterns:

```python
# Note: importing `open` this way shadows Python's built-in open() in this module
from pdfplumber import open
from pdfplumber.utils import extract_text, bbox_to_rect
```

## Basic Usage

```python
import pdfplumber

# Open a PDF file
with pdfplumber.open("document.pdf") as pdf:
    # Access the first page
    first_page = pdf.pages[0]

    # Extract text from the page
    text = first_page.extract_text()
    print(text)

    # Extract tables
    tables = first_page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

    # Visual debugging - save page as image with overlays
    im = first_page.to_image()
    im.draw_rects(first_page.chars, fill=(255, 0, 0, 30))
    im.save("debug.png")

# Alternative - open without context manager
pdf = pdfplumber.open("document.pdf")
page = pdf.pages[0]
text = page.extract_text()
pdf.close()
```

## Architecture

PDFplumber's architecture centers on six components:

- **PDF**: Top-level document container that manages pages and metadata
- **Page**: Individual page object containing that page's PDF elements (text, graphics, tables)
- **Container**: Base class providing object access and filtering
- **Utils**: Utility functions for geometry, text processing, and PDF internals
- **Table Extraction**: Specialized classes for detecting and extracting tabular data
- **Visual Debugging**: The `PageImage` class for overlaying debugging information on a rendered page

This design supports tasks ranging from simple text extraction to complex document structure analysis and table detection.

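To make the PDF, Page, and Container relationship concrete, here is a minimal sketch that counts the objects a page exposes. It assumes each page offers an `objects` mapping from object type to a list of object dicts; `count_objects` and `count_first_page` are hypothetical helper names, not part of pdfplumber.

```python
from typing import Any, Dict


def count_objects(page: Any) -> Dict[str, int]:
    """Count the objects on a page, grouped by object type."""
    # Assumption: page.objects maps a type name to a list of object dicts.
    return {kind: len(objs) for kind, objs in page.objects.items()}


def count_first_page(path: str) -> Dict[str, int]:
    """Open a PDF and count the objects on its first page."""
    import pdfplumber  # imported lazily so count_objects works without it

    with pdfplumber.open(path) as pdf:
        return count_objects(pdf.pages[0])
```

Calling `count_first_page("document.pdf")` is a quick way to see which kinds of objects a given file actually contains before choosing an extraction approach.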
## Capabilities

### PDF Document Operations

Core functionality for opening PDF documents, accessing pages, reading metadata, and performing document-level operations.

```python { .api }
def open(path_or_fp, pages=None, laparams=None, password=None,
         strict_metadata=False, unicode_norm=None, repair=False,
         gs_path=None, repair_setting="default", raise_unicode_errors=True):
    """Open PDF document from file path or stream."""
    ...

def repair(path_or_fp, outfile=None, password=None, gs_path=None,
           setting="default"):
    """Repair PDF using Ghostscript."""
    ...
```

[PDF Operations](./pdf-operations.md)

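As a hedged illustration of document-level access, the sketch below summarizes an open document using `pdf.pages`, `pdf.metadata`, and each page's `page_number`, `width`, and `height`. The helper names `summarize_pdf` and `summarize_file` are hypothetical, and the `pages=[1, 2]` argument assumes page numbers are 1-based.

```python
from typing import Any, Dict, Optional


def summarize_pdf(pdf: Any) -> Dict[str, Any]:
    """Collect page count, metadata, and page sizes from an open PDF object."""
    return {
        "page_count": len(pdf.pages),
        "metadata": dict(pdf.metadata),
        "sizes": {p.page_number: (p.width, p.height) for p in pdf.pages},
    }


def summarize_file(path: str, password: Optional[str] = None) -> Dict[str, Any]:
    """Open only the first two pages of a file and summarize them."""
    import pdfplumber  # imported lazily so summarize_pdf stays usable on its own

    # Assumption: `pages` takes 1-based page numbers to load.
    with pdfplumber.open(path, pages=[1, 2], password=password) as pdf:
        return summarize_pdf(pdf)
```

Restricting `pages` keeps the summary cheap on large documents, since only the listed pages are parsed.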
### Text Extraction

Layout-aware text extraction, word detection, text search, and character-level analysis with position information.

```python { .api }
# Page methods (self omitted)
def extract_text(**kwargs):
    """Extract text using layout-aware algorithm."""
    ...

def extract_words(**kwargs):
    """Extract words as objects with position data."""
    ...

def search(pattern, regex=True, case=True, **kwargs):
    """Search for text patterns with regex support."""
    ...
```

[Text Extraction](./text-extraction.md)

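The position data returned by `extract_words()` can drive custom grouping. The sketch below assumes each word is a dict with at least `text`, `x0`, and `top` keys and groups words into lines by vertical position; `words_to_lines` and its `y_tolerance` parameter are hypothetical, and `extract_text()` already covers the common case.

```python
from typing import Any, Dict, List


def words_to_lines(words: List[Dict[str, Any]], y_tolerance: float = 3.0) -> List[str]:
    """Group word dicts into text lines by their vertical position."""
    lines: List[List[Dict[str, Any]]] = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Compare against the first word of the current line; start a new
        # line when the vertical distance exceeds the tolerance.
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left to right before joining.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]
```

Typical use would be `words_to_lines(page.extract_words())`, adjusting `y_tolerance` to the font size of the document.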
### Table Extraction

Table detection and extraction with customizable strategies, edge detection algorithms, and configurable detection parameters.

```python { .api }
# Page methods (self omitted)
def find_tables(table_settings=None):
    """Find all tables using detection algorithms."""
    ...

def extract_tables(table_settings=None):
    """Extract tables as 2D arrays."""
    ...

class TableSettings:
    """Configuration for table detection parameters."""
    ...
```

[Table Extraction](./table-extraction.md)

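Extracted tables arrive as lists of rows, so a small post-processing step is often useful. The sketch below pairs an illustrative `table_settings` dict with a hypothetical `table_to_records` helper that treats the first row as a header. It assumes empty cells may be `None`; the setting values shown are examples, not recommendations.

```python
from typing import Dict, List, Optional

# Illustrative settings; tune the strategies and tolerance per document.
TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 3,
}

Row = List[Optional[str]]


def table_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Treat the first row as a header and map the remaining rows to dicts."""
    if not table:
        return []
    # Fall back to a positional name when a header cell is empty.
    header = [(cell or f"column_{i}").strip() for i, cell in enumerate(table[0])]
    return [
        {name: (cell or "").strip() for name, cell in zip(header, row)}
        for row in table[1:]
    ]
```

A typical call would be `[table_to_records(t) for t in page.extract_tables(table_settings=TABLE_SETTINGS)]`.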
### Page Manipulation

Page cropping, object filtering, bounding box operations, and coordinate transformations for precise PDF element analysis.

```python { .api }
# Page methods (self omitted)
def crop(bbox, relative=False, strict=True):
    """Crop page to bounding box."""
    ...

def within_bbox(bbox, relative=False, strict=True):
    """Filter objects within bounding box."""
    ...

def filter(test_function):
    """Filter objects using custom function."""
    ...
```

[Page Manipulation](./page-manipulation.md)

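Cropping and filtering both come down to building a bounding box or a predicate. The sketch below shows one of each. It assumes the page's bounding box starts at `(0, 0)` and that character objects carry `object_type` and `size` keys; `top_fraction_bbox` and `min_size_chars` are hypothetical helpers.

```python
from typing import Any, Callable, Dict, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def top_fraction_bbox(width: float, height: float, fraction: float = 0.5) -> BBox:
    """Return a bbox covering the top `fraction` of a page of the given size."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return (0, 0, width, height * fraction)


def min_size_chars(min_size: float) -> Callable[[Dict[str, Any]], bool]:
    """Build a predicate that drops characters smaller than `min_size`."""

    def keep(obj: Dict[str, Any]) -> bool:
        # Non-character objects pass through unchanged.
        return obj.get("object_type") != "char" or obj.get("size", 0) >= min_size

    return keep
```

These could be combined as `page.crop(top_fraction_bbox(page.width, page.height)).filter(min_size_chars(10))` to keep only larger text in the top half of a page.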
### Visual Debugging

Visualization tools for overlaying debug information on PDF pages, including object highlighting, table structure visualization, and custom drawing operations.

```python { .api }
# Page method (self omitted)
def to_image(resolution=None, width=None, height=None, antialias=False):
    """Convert page to image for debugging."""
    ...

class PageImage:
    """Image representation with drawing capabilities."""
    def draw_rects(self, list_of_rects, **kwargs): ...
    def debug_table(self, table, **kwargs): ...
```

[Visual Debugging](./visual-debugging.md)

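A common debugging loop is to render a page, outline whatever tables were detected, and inspect the result. The sketch below does this using `to_image`, `find_tables`, `debug_table`, and `save`; the helper name `save_table_debug_image` and the default resolution of 150 are assumptions for illustration.

```python
from typing import Any


def save_table_debug_image(page: Any, out_path: str, resolution: int = 150) -> int:
    """Render a page, outline every detected table, and save the image.

    Returns the number of tables that were outlined.
    """
    image = page.to_image(resolution=resolution)
    tables = page.find_tables()
    for table in tables:
        image.debug_table(table)
    image.save(out_path)
    return len(tables)
```

If the saved image shows no outlines, or outlines in the wrong place, that is usually a sign the table settings need adjusting before extraction.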
166
### Utility Functions
167

168
Extensive utility functions for geometry operations, text processing, clustering algorithms, and PDF internal structure manipulation.
169

170
```python { .api }
171
def bbox_to_rect(bbox):
172
    """Convert bounding box to rectangle dictionary."""
173
    ...
174

175
def merge_bboxes(bboxes):
176
    """Merge multiple bounding boxes."""
177
    ...
178

179
def cluster_objects(objs, key_fn, tolerance):
180
    """Cluster objects by key function."""
181
    ...
182
```
183

184
[Utilities](./utilities.md)
185

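To show what these geometry helpers compute, here is a set of pure-Python sketches. They are illustrative re-implementations written from the descriptions above, not the library's source, and the `_sketch` names are hypothetical. The real functions may differ in details such as result ordering or accepted argument types.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def bbox_to_rect_sketch(bbox: BBox) -> Dict[str, float]:
    """Name the four bbox coordinates."""
    return {"x0": bbox[0], "top": bbox[1], "x1": bbox[2], "bottom": bbox[3]}


def merge_bboxes_sketch(bboxes: Sequence[BBox]) -> BBox:
    """Return the smallest bbox that contains every input bbox."""
    x0s, tops, x1s, bottoms = zip(*bboxes)
    return (min(x0s), min(tops), max(x1s), max(bottoms))


def cluster_objects_sketch(
    objs: Sequence[Any], key_fn: Callable[[Any], float], tolerance: float
) -> List[List[Any]]:
    """Group objects whose sorted key values are within `tolerance` of a neighbor."""
    clusters: List[List[Any]] = []
    last_key = 0.0
    for obj in sorted(objs, key=key_fn):
        key = key_fn(obj)
        # Extend the current cluster while the gap to the previous key is small.
        if clusters and key - last_key <= tolerance:
            clusters[-1].append(obj)
        else:
            clusters.append([obj])
        last_key = key
    return clusters
```

Clustering by `top` with a small tolerance is one way to recover rows of text or table cells from individually positioned objects.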
### Command Line Interface

Command-line interface for PDF processing, with support for text extraction, object export, and structure analysis.

```python { .api }
def main(args_raw=None):
    """CLI entry point with full argument parsing."""
    ...
```

[Command Line Interface](./cli.md)

## Known Issues

**Note**: The `set_debug` function is listed in the package's `__all__` export list but is not implemented in version 0.11.7. Calling `pdfplumber.set_debug()` raises an `AttributeError`.

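One way to check for this kind of mismatch in an installed version is to compare a module's `__all__` against the attributes it actually defines. The helper below is a generic sketch; `missing_exports` is a hypothetical name, not a pdfplumber function.

```python
from types import ModuleType
from typing import List


def missing_exports(module: ModuleType) -> List[str]:
    """Return names listed in module.__all__ that the module does not define."""
    return [
        name
        for name in getattr(module, "__all__", [])
        if not hasattr(module, name)
    ]
```

Running `missing_exports(pdfplumber)` after `import pdfplumber` reports any names that are advertised but unavailable in the installed release.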
## Types and Exceptions

```python { .api }
from typing import Any, Dict, List, Tuple, Union

# Core type aliases
T_num = Union[int, float]
T_bbox = Tuple[T_num, T_num, T_num, T_num]  # (x0, top, x1, bottom)
T_obj = Dict[str, Any]  # PDF object representation
T_obj_list = List[T_obj]

# Custom exceptions
class MalformedPDFException(Exception):
    """Raised for malformed PDF files."""
    ...

class PdfminerException(Exception):
    """Wrapper for pdfminer exceptions."""
    ...
```
