# Text Extraction

Text extraction from PDF pages: layout-aware extraction, word and line detection, regex search, and character-level processing.

## Capabilities

### Layout-Aware Text Extraction

Primary text extraction method. Characters are grouped into words and lines using the `x_tolerance` and `y_tolerance` thresholds; with `layout=True`, the output also approximates the page's spatial layout by padding with whitespace.

```python { .api }
def extract_text(x_tolerance=3, y_tolerance=3, layout=False,
                 x_density=7.25, y_density=13, **kwargs):
    """
    Extract text using layout-aware algorithm.

    Parameters:
    - x_tolerance: int or float - Horizontal tolerance for grouping characters
    - y_tolerance: int or float - Vertical tolerance for grouping characters
    - layout: bool - Preserve layout with whitespace and positioning
    - x_density: float - Horizontal character density for layout (used when layout=True)
    - y_density: float - Vertical character density for layout (used when layout=True)
    - **kwargs: Additional text processing options

    Returns:
    str: Extracted text (whitespace-padded to mirror the page when layout=True)
    """
```

**Usage Examples:**

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Basic text extraction
    text = page.extract_text()
    print(text)

    # Layout-preserving extraction
    formatted_text = page.extract_text(layout=True)
    print(formatted_text)

    # Fine-tuned character grouping
    precise_text = page.extract_text(x_tolerance=1, y_tolerance=1)
    print(precise_text)

    # Custom density for layout reconstruction
    spaced_text = page.extract_text(layout=True, x_density=10, y_density=15)
    print(spaced_text)
```
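To give a rough sense of what `x_density` and `y_density` control when `layout=True`, the sketch below places words on a character grid by dividing their coordinates by the densities. It is a simplified illustration, not pdfplumber's implementation: `render_layout` is a hypothetical helper, and it assumes that each density means "PDF points per character cell" and that words are dictionaries with `text`, `x0`, and `top` keys.

```python
def render_layout(words, x_density=7.25, y_density=13):
    """Place words on a character grid (toy model of layout=True).

    Assumption: a density is the number of PDF points per character cell,
    so column = x0 / x_density and row = top / y_density.
    """
    rows = {}
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        row = round(word["top"] / y_density)
        col = round(word["x0"] / x_density)
        line = rows.get(row, "")
        # Pad with spaces up to the target column, keeping at least one
        # space between neighbouring words on the same row.
        pad = max(1, col - len(line)) if line else col
        rows[row] = line + " " * pad + word["text"]
    if not rows:
        return ""
    # Rows without any words become empty lines.
    return "\n".join(rows.get(r, "") for r in range(max(rows) + 1))


sample_layout_words = [
    {"text": "Item", "x0": 0.0, "top": 0.0},
    {"text": "Price", "x0": 72.5, "top": 0.0},
    {"text": "Tea", "x0": 0.0, "top": 26.0},
    {"text": "4.50", "x0": 72.5, "top": 26.0},
]
print(render_layout(sample_layout_words))
```

Under this model, larger density values map more points onto each cell, so the same page is rendered with fewer columns and rows.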

### Simple Text Extraction

Faster text extraction that skips layout analysis, for cases where speed matters more than layout fidelity.

```python { .api }
def extract_text_simple(**kwargs):
    """
    Extract text using simple algorithm.

    Parameters:
    - **kwargs: Text processing options

    Returns:
    str: Extracted text without layout preservation
    """
```
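As a usage sketch, the helper below calls `extract_text_simple` on each page and joins the results. `all_text_simple` and `FakePage` are hypothetical names introduced here: the helper relies only on the `extract_text_simple()` method documented above, and `FakePage` is a stand-in so the sketch runs without a PDF file.

```python
def all_text_simple(pages, separator="\n\n"):
    """Join the simple-extraction text of every page in `pages`.

    `pages` is any iterable of objects exposing extract_text_simple(),
    for example `pdf.pages` from an opened PDF.
    """
    parts = []
    for page in pages:
        text = page.extract_text_simple()
        # Skip pages that yield no text (for example, scanned images).
        if text:
            parts.append(text)
    return separator.join(parts)


class FakePage:
    """Stand-in for a page object, used only for this illustration."""

    def __init__(self, text):
        self._text = text

    def extract_text_simple(self, **kwargs):
        return self._text


print(all_text_simple([FakePage("Page one"), FakePage(""), FakePage("Page three")]))
```

With a real document, the same helper would be called as `all_text_simple(pdf.pages)` inside a `with pdfplumber.open(...) as pdf:` block.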

### Word Extraction

Extract words as dictionaries carrying the word text and its position on the page; `extra_attrs` adds character attributes such as font name and size.

```python { .api }
def extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False,
                  use_text_flow=False, horizontal_ltr=True, vertical_ttb=True,
                  extra_attrs=None, split_at_punctuation=False, **kwargs):
    """
    Extract words as objects with position data.

    Parameters:
    - x_tolerance: int or float - Horizontal tolerance for word boundaries
    - y_tolerance: int or float - Vertical tolerance for word boundaries
    - keep_blank_chars: bool - Include blank character objects
    - use_text_flow: bool - Use text flow direction for word detection
    - horizontal_ltr: bool - Left-to-right reading order for horizontal text
    - vertical_ttb: bool - Top-to-bottom reading order for vertical text
    - extra_attrs: List[str] - Additional attributes to include in word objects
    - split_at_punctuation: bool - Split words at punctuation marks
    - **kwargs: Additional word processing options

    Returns:
    List[Dict[str, Any]]: List of word objects with position and formatting
    """
```

**Usage Examples:**

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Extract words with position data
    words = page.extract_words()
    for word in words:
        print(f"'{word['text']}' at ({word['x0']}, {word['top']})")

    # Extract words with custom tolerances
    tight_words = page.extract_words(x_tolerance=1, y_tolerance=1)

    # Include font information
    detailed_words = page.extract_words(extra_attrs=['fontname', 'size'])
    for word in detailed_words:
        print(f"'{word['text']}' - Font: {word.get('fontname', 'Unknown')} Size: {word.get('size', 'Unknown')}")
```
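Word dictionaries are convenient input for further processing. The sketch below groups words into visual rows by their `top` coordinate, a common first step when reconstructing tabular data. `group_words_into_rows` and its `y_tolerance` parameter are hypothetical names; the function relies only on the `text`, `x0`, and `top` keys used in the example above.

```python
def group_words_into_rows(words, y_tolerance=3):
    """Group word dicts into rows of text, top to bottom, left to right."""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # A word joins the current row if its top is within the tolerance
        # of that row's first word; otherwise it starts a new row.
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order each row left to right and keep only the text.
    return [[w["text"] for w in sorted(row, key=lambda w: w["x0"])] for row in rows]


sample_words = [
    {"text": "Total", "x0": 50.0, "top": 120.4},
    {"text": "Qty", "x0": 50.0, "top": 100.0},
    {"text": "42.00", "x0": 200.0, "top": 119.8},
    {"text": "3", "x0": 200.0, "top": 100.9},
]
print(group_words_into_rows(sample_words))
```

Comparing against the first word of the row, rather than the previous word, keeps a slowly drifting baseline from chaining unrelated rows together.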

### Text Line Extraction

Extract text grouped into lines, each with its position and, optionally, the character objects that make it up.

```python { .api }
def extract_text_lines(strip=True, return_chars=True, **kwargs):
    """
    Extract text lines with character details.

    Parameters:
    - strip: bool - Strip whitespace from line text
    - return_chars: bool - Include character objects in line data
    - **kwargs: Additional line processing options

    Returns:
    List[Dict[str, Any]]: List of line objects with text and character data
    """
```

**Usage Examples:**

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Extract text lines
    lines = page.extract_text_lines()
    for line in lines:
        print(f"Line: '{line['text']}' at y={line['top']}")
        print(f"  Contains {len(line.get('chars', []))} characters")

    # Extract lines without character details
    simple_lines = page.extract_text_lines(return_chars=False)
    for line in simple_lines:
        print(line['text'])
```
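Line positions can be used to infer document structure. As an illustration, the sketch below splits lines into paragraphs wherever the vertical gap between consecutive lines exceeds a threshold. `split_paragraphs` and `max_gap` are hypothetical names, and the sketch assumes each line dictionary carries a `bottom` key in addition to the `text` and `top` keys shown above.

```python
def split_paragraphs(lines, max_gap=6.0):
    """Join consecutive lines into paragraphs, breaking on large gaps."""
    paragraphs = []
    previous_bottom = None
    for line in sorted(lines, key=lambda item: item["top"]):
        # Gap between the bottom of the previous line and the top of this one.
        gap = None if previous_bottom is None else line["top"] - previous_bottom
        if gap is None or gap > max_gap:
            paragraphs.append([line["text"]])
        else:
            paragraphs[-1].append(line["text"])
        previous_bottom = line["bottom"]
    return [" ".join(parts) for parts in paragraphs]


sample_lines = [
    {"text": "First paragraph,", "top": 72.0, "bottom": 84.0},
    {"text": "continued.", "top": 86.0, "bottom": 98.0},
    {"text": "Second paragraph.", "top": 112.0, "bottom": 124.0},
]
print(split_paragraphs(sample_lines))
```

A suitable `max_gap` depends on the document's font size and leading, so it is worth checking a few pages before fixing a value.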

### Text Search

Search page text for a literal string or regular expression, with optional case-insensitive matching. Each match reports its text and position and, optionally, its regex groups and character objects.

```python { .api }
def search(pattern, regex=True, case=True, main_group=0,
           return_chars=True, return_groups=True, **kwargs):
    """
    Search for text patterns with regex support.

    Parameters:
    - pattern: str - Search pattern (literal text or regex)
    - regex: bool - Treat pattern as regular expression
    - case: bool - Case-sensitive search
    - main_group: int - Primary regex group for match extraction
    - return_chars: bool - Include character objects in matches
    - return_groups: bool - Include regex group information
    - **kwargs: Additional search options

    Returns:
    List[Dict[str, Any]]: List of match objects with position and text data
    """
```
176
177
**Usage Examples:**
178
179
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Simple text search
    matches = page.search("invoice")
    for match in matches:
        print(f"Found '{match['text']}' at ({match['x0']}, {match['top']})")

    # Regex search with groups
    email_matches = page.search(r'(\w+)@(\w+\.\w+)', regex=True)
    for match in email_matches:
        print(f"Email: {match['text']}")
        print(f"Groups: {match.get('groups', [])}")

    # Case-insensitive search
    ci_matches = page.search("TOTAL", case=False)

    # Search with character details
    detailed_matches = page.search("amount", return_chars=True)
    for match in detailed_matches:
        chars = match.get('chars', [])
        print(f"Match uses {len(chars)} characters")
```
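Match positions can feed other page operations such as cropping or highlighting. The sketch below turns matches into padded `(x0, top, x1, bottom)` tuples. `match_bboxes` and `padding` are hypothetical names, and the sketch assumes each match dictionary has `x1` and `bottom` keys alongside the `x0` and `top` keys shown above.

```python
def match_bboxes(matches, padding=0.0):
    """Return one (x0, top, x1, bottom) tuple per match, grown by `padding`."""
    return [
        (
            match["x0"] - padding,
            match["top"] - padding,
            match["x1"] + padding,
            match["bottom"] + padding,
        )
        for match in matches
    ]


sample_matches = [
    {"text": "invoice", "x0": 72.0, "top": 100.0, "x1": 110.0, "bottom": 112.0},
]
print(match_bboxes(sample_matches, padding=2.0))
```

When padding near a page edge, the resulting box may need to be clamped to the page bounds before it is used for cropping.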

### Character Processing

Low-level character processing: remove duplicated characters, which occur in some PDFs when the same glyph is drawn more than once at nearly the same position.

```python { .api }
def dedupe_chars(tolerance=1, use_text_flow=False, **kwargs):
    """
    Remove duplicate characters.

    Parameters:
    - tolerance: int or float - Distance tolerance for duplicate detection
    - use_text_flow: bool - Consider text flow in deduplication
    - **kwargs: Additional deduplication options

    Returns:
    Page: New page object with deduplicated characters
    """
```
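The idea behind tolerance-based deduplication can be sketched in a few lines: a character is dropped when an already-kept character has the same text and lies within `tolerance` of it on both axes. This is a simplified, quadratic-time illustration on plain dictionaries, not pdfplumber's implementation; `dedupe_sketch` is a hypothetical name, and the library method may compare further attributes.

```python
def dedupe_sketch(chars, tolerance=1):
    """Keep the first of any group of same-text chars at nearly the same spot."""
    kept = []
    for char in chars:
        is_duplicate = any(
            other["text"] == char["text"]
            and abs(other["x0"] - char["x0"]) <= tolerance
            and abs(other["top"] - char["top"]) <= tolerance
            for other in kept
        )
        if not is_duplicate:
            kept.append(char)
    return kept


sample_dupe_chars = [
    {"text": "A", "x0": 10.0, "top": 20.0},
    {"text": "A", "x0": 10.4, "top": 20.0},  # overprinted copy of the first "A"
    {"text": "B", "x0": 18.0, "top": 20.0},
    {"text": "A", "x0": 40.0, "top": 20.0},  # genuine second "A", far away
]
print("".join(c["text"] for c in dedupe_sketch(sample_dupe_chars)))
```

Because the method returns a new page object, extraction calls can be chained on the result, for example `page.dedupe_chars().extract_text()`.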

## Utility Text Functions

Standalone text processing functions available in the `pdfplumber.utils` module. Each takes a list of character objects as its first argument.

```python { .api }
# From pdfplumber.utils
def extract_text(chars, **kwargs):
    """Extract text from character objects."""

def extract_text_simple(chars, **kwargs):
    """Simple text extraction from characters."""

def extract_words(chars, **kwargs):
    """Extract words from character objects."""

def dedupe_chars(chars, tolerance=1, **kwargs):
    """Remove duplicate characters from list."""

def chars_to_textmap(chars, **kwargs):
    """Convert characters to TextMap object."""

def collate_line(chars, **kwargs):
    """Collate characters into text line."""
```

**Text Processing Constants:**

```python { .api }
# Default tolerance values
DEFAULT_X_TOLERANCE = 3
DEFAULT_Y_TOLERANCE = 3
DEFAULT_X_DENSITY = 7.25
DEFAULT_Y_DENSITY = 13
```

## TextMap Class

Text mapping object that ties extracted text back to the character objects it came from, for character-level analysis.

```python { .api }
class TextMap:
    """Character-level text mapping with position data."""

    def __init__(self, chars, **kwargs):
        """Initialize TextMap from character objects."""

    def as_list(self):
        """Convert to list representation."""

    def as_string(self):
        """Convert to string representation."""
```

**Usage Examples:**

```python
import pdfplumber
from pdfplumber.utils import chars_to_textmap

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Create TextMap from page characters
    textmap = chars_to_textmap(page.chars)

    # Convert to different representations
    text_list = textmap.as_list()
    text_string = textmap.as_string()

    print(f"TextMap contains {len(text_list)} text elements")
    print(f"Combined text: {text_string}")
```
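The value of a text-to-character mapping is that a position in the extracted string can be traced back to a position on the page. The toy model below shows the idea for a single line of text, using a list of `(character, char_object)` pairs in which inserted whitespace maps to `None`. It is an illustration under that assumption, not the `TextMap` class itself; `build_pairs`, `x_span_of_substring`, and the `gap` parameter are hypothetical.

```python
def build_pairs(chars, gap=2.0):
    """Pair each output character with the char dict it came from.

    A space paired with None is inserted wherever the horizontal gap
    between neighbouring chars exceeds `gap`.
    """
    pairs = []
    previous = None
    for char in sorted(chars, key=lambda c: c["x0"]):
        if previous is not None and char["x0"] - previous["x1"] > gap:
            pairs.append((" ", None))
        pairs.append((char["text"], char))
        previous = char
    return pairs


def x_span_of_substring(pairs, needle):
    """Return (x0, x1) covered by the first occurrence of `needle`, or None."""
    text = "".join(ch for ch, _ in pairs)
    start = text.find(needle)
    if start < 0:
        return None
    # Ignore inserted whitespace, which has no source character.
    sources = [src for _, src in pairs[start:start + len(needle)] if src is not None]
    if not sources:
        return None
    return (min(s["x0"] for s in sources), max(s["x1"] for s in sources))


sample_map_chars = [
    {"text": "H", "x0": 0.0, "x1": 6.0},
    {"text": "i", "x0": 6.0, "x1": 8.0},
    {"text": "y", "x0": 14.0, "x1": 19.0},
    {"text": "o", "x0": 19.0, "x1": 24.0},
]
sample_pairs = build_pairs(sample_map_chars)
print("".join(ch for ch, _ in sample_pairs))
print(x_span_of_substring(sample_pairs, "yo"))
```

Keeping one pair per output character means string indices and pair indices line up, which is what makes the lookup in `x_span_of_substring` a simple slice.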