# PDFplumber

A Python library for detailed PDF analysis and extraction. PDFplumber provides granular access to PDF structure, including text characters, rectangles, lines, curves, images, and annotations. It also offers table extraction with customizable detection strategies, visual debugging tools for inspecting PDF structure, and text extraction with layout-preservation options.

## Package Information

- **Package Name**: pdfplumber
- **Language**: Python
- **Installation**: `pip install pdfplumber`

## Core Imports

```python
import pdfplumber
```

Common usage patterns:

```python
from pdfplumber import open  # shadows the built-in open() in the importing module
from pdfplumber.utils import extract_text, bbox_to_rect
```

## Basic Usage

```python
import pdfplumber

# Open a PDF file
with pdfplumber.open("document.pdf") as pdf:
    # Access the first page
    first_page = pdf.pages[0]

    # Extract text from the page
    text = first_page.extract_text()
    print(text)

    # Extract tables
    tables = first_page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

    # Visual debugging - save page as image with overlays
    im = first_page.to_image()
    im.draw_rects(first_page.chars, fill=(255, 0, 0, 30))
    im.save("debug.png")

# Alternative - open without context manager
pdf = pdfplumber.open("document.pdf")
page = pdf.pages[0]
text = page.extract_text()
pdf.close()
```

## Architecture

PDFplumber's architecture centers on:

- **PDF**: Top-level document container managing pages and metadata
- **Page**: Individual page objects containing all PDF elements (text, graphics, tables)
- **Container**: Base class providing object access and filtering capabilities
- **Utils**: Utility functions for geometry, text processing, and PDF internals
- **Table Extraction**: Specialized classes for detecting and extracting tabular data
- **Visual Debugging**: PageImage class for overlaying visual debugging information

This design supports tasks ranging from simple text extraction to complex document structure analysis and table detection.
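
The container pattern listed above can be pictured with a small stand-in. This is a stdlib-only sketch, not pdfplumber's source: `SketchContainer` and `SketchPage` are hypothetical names, and the only idea carried over is that objects are plain dictionaries grouped by type.

```python
from typing import Any, Dict, List

T_obj = Dict[str, Any]


class SketchContainer:
    """Hypothetical base class: stores objects grouped by type name."""

    def __init__(self, objects: Dict[str, List[T_obj]]) -> None:
        self.objects = objects

    @property
    def chars(self) -> List[T_obj]:
        # Missing types yield an empty list rather than raising KeyError.
        return self.objects.get("char", [])

    @property
    def rects(self) -> List[T_obj]:
        return self.objects.get("rect", [])


class SketchPage(SketchContainer):
    """Hypothetical page: a container that also knows its page number."""

    def __init__(self, page_number: int, objects: Dict[str, List[T_obj]]) -> None:
        super().__init__(objects)
        self.page_number = page_number


page = SketchPage(1, {"char": [{"text": "A", "x0": 10, "top": 20}]})
print(len(page.chars), len(page.rects))
```

Putting the accessors on the base class means any object that holds a typed object dictionary exposes the same properties.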

## Capabilities

### PDF Document Operations

Core functionality for opening, accessing, and managing PDF documents, including metadata extraction, page access, and document-level operations.

```python { .api }
def open(path_or_fp, pages=None, laparams=None, password=None,
         strict_metadata=False, unicode_norm=None, repair=False,
         gs_path=None, repair_setting="default", raise_unicode_errors=True):
    """Open PDF document from file path or stream."""
    ...

def repair(path_or_fp, outfile=None, password=None, gs_path=None,
           setting="default"):
    """Repair PDF using Ghostscript."""
    ...
```

[PDF Operations](./pdf-operations.md)
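
The `pages` argument of `open` restricts which pages are parsed. The following is a minimal sketch of preparing such a list, assuming `pages` accepts 1-based page numbers; `parse_page_ranges` is a hypothetical helper and not part of pdfplumber.

```python
from typing import List


def parse_page_ranges(spec: str) -> List[int]:
    """Expand a spec such as "1-3,5" into sorted, de-duplicated page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)


print(parse_page_ranges("1-3,5"))
```

The result could then be passed along the lines of `pdfplumber.open("document.pdf", pages=parse_page_ranges("1-3,5"))`.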

### Text Extraction

Layout-aware text extraction, word detection, text search, and character-level analysis with position information.

```python { .api }
def extract_text(**kwargs):
    """Extract text using layout-aware algorithm."""
    ...

def extract_words(**kwargs):
    """Extract words as objects with position data."""
    ...

def search(pattern, regex=True, case=True, **kwargs):
    """Search for text patterns with regex support."""
    ...
```

[Text Extraction](./text-extraction.md)
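
Because word objects carry position data, they can be regrouped after extraction. The following is a minimal sketch, assuming each word is a dict with `text`, `x0`, and `top` keys, as in the output of `extract_words()`; `group_words_into_lines` and its `y_tolerance` default are hypothetical.

```python
from typing import Any, Dict, List


def group_words_into_lines(
    words: List[Dict[str, Any]], y_tolerance: float = 3.0
) -> List[str]:
    """Join words whose `top` lies within y_tolerance of the line's first word."""
    lines: List[List[Dict[str, Any]]] = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left to right before joining.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


sample = [
    {"text": "world", "x0": 50, "top": 10.5},
    {"text": "Hello", "x0": 10, "top": 10.0},
    {"text": "Next", "x0": 10, "top": 30.0},
]
print(group_words_into_lines(sample))
```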

### Table Extraction

Table detection and extraction with customizable strategies, edge detection algorithms, and configuration options.

```python { .api }
def find_tables(table_settings=None):
    """Find all tables using detection algorithms."""
    ...

def extract_tables(table_settings=None):
    """Extract tables as 2D arrays."""
    ...

class TableSettings:
    """Configuration for table detection parameters."""
    ...
```

[Table Extraction](./table-extraction.md)
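
Extracted tables are 2D arrays, so a common follow-up step is turning rows into records. The following is a minimal sketch, assuming each table is a list of rows whose cells are strings or `None` for empty cells; `table_to_records` is a hypothetical helper.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def table_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Use the first row as headers; map None cells to empty strings."""
    if not table:
        return []
    headers = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(headers, cells)))
    return records


sample_table = [["Name", "Qty"], ["Widget", "4"], ["Gadget", None]]
print(table_to_records(sample_table))
```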

### Page Manipulation

Page cropping, object filtering, bounding box operations, and coordinate transformations for precise PDF element analysis.

```python { .api }
def crop(bbox, relative=False, strict=True):
    """Crop page to bounding box."""
    ...

def within_bbox(bbox, relative=False, strict=True):
    """Filter objects within bounding box."""
    ...

def filter(test_function):
    """Filter objects using custom function."""
    ...
```

[Page Manipulation](./page-manipulation.md)
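
The filtering ideas above can be sketched on plain object dictionaries. This is an illustration rather than the library's implementation: `is_fully_inside` and `is_large_char` are hypothetical helpers, and the sketch assumes objects expose `x0`, `top`, `x1`, `bottom`, `object_type`, and (for characters) `size` keys.

```python
from typing import Any, Dict, Tuple

T_bbox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def is_fully_inside(obj: Dict[str, Any], bbox: T_bbox) -> bool:
    """True when the object's box lies entirely within bbox."""
    x0, top, x1, bottom = bbox
    return (
        obj["x0"] >= x0
        and obj["top"] >= top
        and obj["x1"] <= x1
        and obj["bottom"] <= bottom
    )


def is_large_char(obj: Dict[str, Any], min_size: float = 10.0) -> bool:
    """Predicate: keep only character objects at or above min_size."""
    return obj.get("object_type") == "char" and obj.get("size", 0) >= min_size


char = {"object_type": "char", "size": 12.0,
        "x0": 10, "top": 10, "x1": 18, "bottom": 22}
print(is_fully_inside(char, (0, 0, 100, 100)), is_large_char(char))
```

A predicate such as `is_large_char` has the shape expected by `filter(test_function)`: it takes one object and returns a boolean.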

### Visual Debugging

Visualization tools for overlaying debug information on PDF pages, including object highlighting, table structure visualization, and custom drawing operations.

```python { .api }
def to_image(resolution=None, width=None, height=None, antialias=False):
    """Convert page to image for debugging."""
    ...

class PageImage:
    """Image representation with drawing capabilities."""
    def draw_rects(self, list_of_rects, **kwargs): ...
    def debug_table(self, table, **kwargs): ...
```

[Visual Debugging](./visual-debugging.md)
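
When reading a debug image, it helps to know how page coordinates relate to pixels. The following is a rough sketch, assuming `resolution` is in pixels per inch and page coordinates are in PDF points (1/72 inch); `bbox_to_pixels` is a hypothetical helper, and it ignores any offset introduced by cropping.

```python
from typing import Tuple

T_bbox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)

POINTS_PER_INCH = 72  # default PDF user-space units per inch


def bbox_to_pixels(bbox: T_bbox, resolution: int) -> Tuple[int, int, int, int]:
    """Scale a bounding box in points to pixel coordinates at a resolution."""
    scale = resolution / POINTS_PER_INCH
    x0, top, x1, bottom = bbox
    return (round(x0 * scale), round(top * scale),
            round(x1 * scale), round(bottom * scale))


print(bbox_to_pixels((72, 36, 144, 108), resolution=144))
```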

### Utility Functions

Utility functions for geometry operations, text processing, clustering algorithms, and PDF internal structure manipulation.

```python { .api }
def bbox_to_rect(bbox):
    """Convert bounding box to rectangle dictionary."""
    ...

def merge_bboxes(bboxes):
    """Merge multiple bounding boxes."""
    ...

def cluster_objects(objs, key_fn, tolerance):
    """Cluster objects by key function."""
    ...
```

[Utilities](./utilities.md)
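
To make the geometry helpers concrete, here is a pure-Python sketch of what merging bounding boxes and converting a box to a rectangle dictionary amount to. The `_sketch` functions are hypothetical stand-ins written for illustration, not the library's code.

```python
from typing import Dict, Iterable, Tuple

T_bbox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def merge_bboxes_sketch(bboxes: Iterable[T_bbox]) -> T_bbox:
    """Return the smallest box that contains every input box."""
    x0s, tops, x1s, bottoms = zip(*bboxes)
    return (min(x0s), min(tops), max(x1s), max(bottoms))


def bbox_to_rect_sketch(bbox: T_bbox) -> Dict[str, float]:
    """Name the four coordinates of a bounding box."""
    x0, top, x1, bottom = bbox
    return {"x0": x0, "top": top, "x1": x1, "bottom": bottom}


merged = merge_bboxes_sketch([(10, 10, 50, 20), (40, 5, 80, 30)])
print(merged, bbox_to_rect_sketch(merged))
```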

### Command Line Interface

Command-line interface for PDF processing with support for text extraction, object export, and structure analysis.

```python { .api }
def main(args_raw=None):
    """CLI entry point with full argument parsing."""
    ...
```

[Command Line Interface](./cli.md)
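
An entry point that accepts `args_raw=None` can be called from code and tests as well as from the shell. The sketch below shows that general pattern with `argparse`; the parser, program name, and options are hypothetical and are not pdfplumber's actual flags.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical options for illustration only.
    parser = argparse.ArgumentParser(prog="sketch-cli")
    parser.add_argument("infile")
    parser.add_argument("--format", choices=["csv", "json"], default="csv")
    return parser


def main(args_raw: Optional[List[str]] = None) -> argparse.Namespace:
    # parse_args(None) falls back to sys.argv[1:]; an explicit list overrides it.
    return build_parser().parse_args(args_raw)


args = main(["document.pdf", "--format", "json"])
print(args.infile, args.format)
```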

## Known Issues

**Note**: The `set_debug` function is listed in the package's `__all__` export list but is not actually implemented in version 0.11.7. Attempting to use `pdfplumber.set_debug()` will result in an `AttributeError`.
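
One defensive way to handle a name that is exported but may be missing is to look it up before calling it. The following is a minimal sketch using a stand-in object in place of the real module; `call_if_available` is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Any


def call_if_available(module: Any, name: str, *args: Any, **kwargs: Any) -> bool:
    """Call module.<name>(...) if it exists and is callable; report whether it ran."""
    func = getattr(module, name, None)
    if not callable(func):
        return False
    func(*args, **kwargs)
    return True


# Stand-in for a module that advertises set_debug but does not define it.
fake_module = SimpleNamespace(open=lambda path: path)
print(call_if_available(fake_module, "set_debug"))
print(call_if_available(fake_module, "open", "document.pdf"))
```

With the real package, the same check would read `call_if_available(pdfplumber, "set_debug")`.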

## Types and Exceptions

```python { .api }
from typing import Any, Dict, List, Tuple, Union

# Core type aliases
T_num = Union[int, float]
T_bbox = Tuple[T_num, T_num, T_num, T_num]  # (x0, top, x1, bottom)
T_obj = Dict[str, Any]  # PDF object representation
T_obj_list = List[T_obj]

# Custom exceptions
class MalformedPDFException(Exception):
    """Raised for malformed PDF files."""
    ...

class PdfminerException(Exception):
    """Wrapper for pdfminer exceptions."""
    ...
```
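
As a small illustration of the `(x0, top, x1, bottom)` convention, the sketch below computes a box's width and height and rejects inverted boxes. `bbox_size` is a hypothetical helper, and the aliases are redefined locally so the block runs on its own.

```python
from typing import Tuple, Union

T_num = Union[int, float]
T_bbox = Tuple[T_num, T_num, T_num, T_num]  # (x0, top, x1, bottom)


def bbox_size(bbox: T_bbox) -> Tuple[T_num, T_num]:
    """Return (width, height), raising ValueError for an inverted box."""
    x0, top, x1, bottom = bbox
    if x1 < x0 or bottom < top:
        raise ValueError(f"inverted bounding box: {bbox!r}")
    return (x1 - x0, bottom - top)


print(bbox_size((10, 20, 110, 70)))
```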