Tessl Tile for pypi/pdfplumber@0.11.0

# Text Extraction

Text extraction with optional layout preservation, word detection, text search, and character-level analysis.

## Capabilities

### Layout-Aware Text Extraction

Primary text extraction method. By default it returns the page text as a plain string; with `layout=True` it attempts to mimic the structural layout of the page by padding the output with whitespace.

```python { .api }
def extract_text(x_tolerance=3, y_tolerance=3, layout=False,
                 x_density=7.25, y_density=13, **kwargs):
    """
    Extract text from the page, optionally preserving layout.

    Parameters:
    - x_tolerance: int or float - Horizontal tolerance for grouping characters
    - y_tolerance: int or float - Vertical tolerance for grouping characters
    - layout: bool - Preserve layout with whitespace and positioning
    - x_density: float - Horizontal PDF points per output character (used when layout=True)
    - y_density: float - Vertical PDF points per output line (used when layout=True)
    - **kwargs: Additional text processing options

    Returns:
    str: Extracted text, with layout preserved when layout=True
    """
```

**Usage Examples:**

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Basic text extraction
    text = page.extract_text()
    print(text)

    # Layout-preserving extraction
    formatted_text = page.extract_text(layout=True)
    print(formatted_text)

    # Fine-tuned character grouping
    precise_text = page.extract_text(x_tolerance=1, y_tolerance=1)
    print(precise_text)

    # Custom density for layout reconstruction
    spaced_text = page.extract_text(layout=True, x_density=10, y_density=15)
    print(spaced_text)
```
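
To build intuition for what `x_density` and `y_density` control, the following is a minimal sketch of mapping page coordinates onto a character grid. It is not pdfplumber's implementation: `place_on_grid` is a hypothetical helper, and the word dictionaries are hand-written stand-ins that carry only the `text`, `x0`, and `top` keys.

```python
def place_on_grid(words, x_density=7.25, y_density=13):
    """Toy layout reconstruction: one output column per x_density points,
    one output row per y_density points. Illustrative only."""
    rows = {}
    for word in words:
        row = int(round(word["top"] / y_density))
        col = int(round(word["x0"] / x_density))
        rows.setdefault(row, []).append((col, word["text"]))

    lines = []
    for row in range(max(rows) + 1 if rows else 0):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # Target column already taken: fall back to a single space
                line += " "
            else:
                # Pad with spaces up to the target column
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


# Hypothetical word positions, in PDF points
words = [
    {"text": "Item", "x0": 0, "top": 0},
    {"text": "Price", "x0": 72.5, "top": 0},
    {"text": "Widget", "x0": 0, "top": 26},
    {"text": "9.99", "x0": 72.5, "top": 26},
]
grid = place_on_grid(words)
print(grid)
```

In this sketch, smaller density values spread the same page over more columns and rows, so the output contains more padding; larger values compress it.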

### Simple Text Extraction

A slightly faster but less flexible alternative to `extract_text` that uses simpler logic and skips layout analysis.

```python { .api }
def extract_text_simple(**kwargs):
    """
    Extract text using simple algorithm.

    Parameters:
    - **kwargs: Text processing options (such as x_tolerance and y_tolerance)

    Returns:
    str: Extracted text without layout preservation
    """
```
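
As a rough illustration of what a "simple" extraction strategy can look like, the sketch below clusters characters into lines by their `top` coordinate and inserts a space wherever the horizontal gap exceeds `x_tolerance`. This is an assumption-laden toy, not the library's code: `simple_text` is a hypothetical function and the character dictionaries are synthetic.

```python
def simple_text(chars, x_tolerance=3, y_tolerance=3):
    """Toy line-based extraction: cluster chars by `top`, join left to right."""
    lines = []  # each entry is a list of chars with roughly the same `top`
    for char in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(char["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(char)
        else:
            lines.append([char])

    out = []
    for line in lines:
        line.sort(key=lambda c: c["x0"])
        text = ""
        prev = None
        for char in line:
            # A gap wider than x_tolerance is treated as a word break
            if prev is not None and char["x0"] - prev["x1"] > x_tolerance:
                text += " "
            text += char["text"]
            prev = char
        out.append(text)
    return "\n".join(out)


# Synthetic characters: "Hi yo" on one line, "ok" on the next
chars = [
    {"text": "H", "x0": 0, "x1": 5, "top": 10},
    {"text": "i", "x0": 5, "x1": 8, "top": 10.5},
    {"text": "y", "x0": 14, "x1": 19, "top": 10},
    {"text": "o", "x0": 19, "x1": 24, "top": 10},
    {"text": "o", "x0": 0, "x1": 5, "top": 30},
    {"text": "k", "x0": 5, "x1": 10, "top": 30},
]
result = simple_text(chars)
print(result)
```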

### Word Extraction

Extract words as dictionaries with position data and, optionally, formatting attributes.

```python { .api }
def extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False,
                  use_text_flow=False, horizontal_ltr=True, vertical_ttb=True,
                  extra_attrs=None, split_at_punctuation=False, **kwargs):
    """
    Extract words as objects with position data.

    Parameters:
    - x_tolerance: int or float - Horizontal tolerance for word boundaries
    - y_tolerance: int or float - Vertical tolerance for word boundaries
    - keep_blank_chars: bool - Treat blank characters as part of a word rather than as a separator
    - use_text_flow: bool - Order and segment words by the PDF's underlying character flow instead of x/y position
    - horizontal_ltr: bool - Left-to-right reading order for horizontal text
    - vertical_ttb: bool - Top-to-bottom reading order for vertical text
    - extra_attrs: List[str] - Additional attributes to include in word objects
    - split_at_punctuation: bool or str - Split words at punctuation marks (True for string.punctuation, or a string of separator characters)
    - **kwargs: Additional word processing options

    Returns:
    List[Dict[str, Any]]: List of word objects with position and formatting
    """
```

**Usage Examples:**

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Extract words with position data
    words = page.extract_words()
    for word in words:
        print(f"'{word['text']}' at ({word['x0']}, {word['top']})")

    # Extract words with custom tolerances
    tight_words = page.extract_words(x_tolerance=1, y_tolerance=1)

    # Include font information
    detailed_words = page.extract_words(extra_attrs=['fontname', 'size'])
    for word in detailed_words:
        print(f"'{word['text']}' - Font: {word.get('fontname', 'Unknown')} Size: {word.get('size', 'Unknown')}")
```
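
Word dictionaries that carry `size` (requested through `extra_attrs`) lend themselves to simple post-processing. The sketch below flags words set in a larger font than the most common size, which can serve as a crude heading detector. `find_headings` is a hypothetical helper, and the sample list stands in for the output of `page.extract_words(extra_attrs=['size'])`.

```python
from collections import Counter


def find_headings(words):
    """Return the text of words whose font size exceeds the most common size."""
    sizes = Counter(round(w["size"], 1) for w in words if "size" in w)
    if not sizes:
        return []
    body_size = sizes.most_common(1)[0][0]
    return [w["text"] for w in words if w.get("size", 0) > body_size]


# Stand-in for extract_words(extra_attrs=['size']) output
sample_words = [
    {"text": "Annual", "x0": 72, "top": 50, "size": 18},
    {"text": "Report", "x0": 140, "top": 50, "size": 18},
    {"text": "Revenue", "x0": 72, "top": 90, "size": 10},
    {"text": "grew", "x0": 120, "top": 90, "size": 10},
    {"text": "steadily.", "x0": 150, "top": 90, "size": 10},
]
headings = find_headings(sample_words)
print(headings)
```

Font sizes are rounded before counting because real PDFs often report values such as 9.96 and 10.02 for what is visually the same size.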

### Text Line Extraction

Extract text organized by line, with each line's position and, optionally, its character objects.

```python { .api }
def extract_text_lines(strip=True, return_chars=True, **kwargs):
    """
    Extract text lines with character details.

    Parameters:
    - strip: bool - Strip whitespace from line text
    - return_chars: bool - Include character objects in line data
    - **kwargs: Additional line processing options

    Returns:
    List[Dict[str, Any]]: List of line objects with text and character data
    """
```

**Usage Examples:**

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Extract text lines
    lines = page.extract_text_lines()
    for line in lines:
        print(f"Line: '{line['text']}' at y={line['top']}")
        print(f"  Contains {len(line.get('chars', []))} characters")

    # Extract lines without character details
    simple_lines = page.extract_text_lines(return_chars=False)
    for line in simple_lines:
        print(line['text'])
```

### Text Search

Text search with regex support, case-sensitivity options, and detailed match information.

```python { .api }
def search(pattern, regex=True, case=True, main_group=0,
           return_chars=True, return_groups=True, **kwargs):
    """
    Search for text patterns with regex support.

    Parameters:
    - pattern: str - Search pattern (literal text or regex)
    - regex: bool - Treat pattern as regular expression
    - case: bool - Case-sensitive search
    - main_group: int - Primary regex group for match extraction
    - return_chars: bool - Include character objects in matches
    - return_groups: bool - Include regex group information
    - **kwargs: Additional search options

    Returns:
    List[Dict[str, Any]]: List of match objects with position and text data
    """
```

**Usage Examples:**

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Simple text search (the pattern is treated as a regex by default)
    matches = page.search("invoice")
    for match in matches:
        print(f"Found '{match['text']}' at ({match['x0']}, {match['top']})")

    # Regex search with groups
    email_matches = page.search(r'(\w+)@(\w+\.\w+)', regex=True)
    for match in email_matches:
        print(f"Email: {match['text']}")
        print(f"Groups: {match.get('groups', [])}")

    # Case-insensitive search
    ci_matches = page.search("TOTAL", case=False)

    # Search with character details
    detailed_matches = page.search("amount", return_chars=True)
    for match in detailed_matches:
        chars = match.get('chars', [])
        print(f"Match uses {len(chars)} characters")
```
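
One way to picture how a text match can come back with coordinates is to run the regex over the joined character text and then look up the characters covered by each match span. The sketch below does this under simplifying assumptions: `search_chars` and `make_chars` are hypothetical helpers, each character dictionary holds exactly one character of text, and no layout whitespace is involved.

```python
import re


def make_chars(text, width=5, top=100, height=10):
    """Build synthetic, evenly spaced character dicts for a single line."""
    return [
        {"text": ch, "x0": i * width, "x1": (i + 1) * width,
         "top": top, "bottom": top + height}
        for i, ch in enumerate(text)
    ]


def search_chars(chars, pattern, case=True):
    """Toy search: regex over joined text, bounding box from matched chars."""
    text = "".join(c["text"] for c in chars)
    flags = 0 if case else re.IGNORECASE
    results = []
    for m in re.finditer(pattern, text, flags):
        if m.start() == m.end():
            continue  # skip empty matches, which cover no characters
        matched = chars[m.start():m.end()]
        results.append({
            "text": m.group(0),
            "x0": min(c["x0"] for c in matched),
            "top": min(c["top"] for c in matched),
            "x1": max(c["x1"] for c in matched),
            "bottom": max(c["bottom"] for c in matched),
            "chars": matched,
        })
    return results


line_chars = make_chars("Total: 42")
number_hits = search_chars(line_chars, r"\d+")
label_hits = search_chars(line_chars, "total", case=False)
print(number_hits[0]["text"], number_hits[0]["x0"], number_hits[0]["x1"])
```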

### Character Processing

Low-level character deduplication. Duplicates are characters that share the same text, fontname, size, and position (within `tolerance`) as another character on the page.

```python { .api }
def dedupe_chars(tolerance=1, **kwargs):
    """
    Remove duplicate characters.

    Parameters:
    - tolerance: int or float - Distance tolerance for duplicate detection
    - **kwargs: Additional deduplication options

    Returns:
    Page: New page object with deduplicated characters
    """
```
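
The idea behind tolerance-based deduplication can be sketched in a few lines. The version below compares only text and position, whereas pdfplumber also takes fontname and size into account, so treat it as an illustration rather than a substitute. `dedupe` is a hypothetical function and the input is synthetic.

```python
def dedupe(chars, tolerance=1):
    """Toy deduplication: drop a char if an already-kept char has the same
    text and lies within `tolerance` points on both axes."""
    kept = []
    for char in chars:
        is_duplicate = any(
            char["text"] == k["text"]
            and abs(char["x0"] - k["x0"]) <= tolerance
            and abs(char["top"] - k["top"]) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(char)
    return kept


# Synthetic input: the second "A" overlaps the first (e.g. faux-bold text)
raw_chars = [
    {"text": "A", "x0": 10.0, "top": 10.0},
    {"text": "A", "x0": 10.4, "top": 10.2},
    {"text": "B", "x0": 10.0, "top": 10.0},
    {"text": "A", "x0": 30.0, "top": 10.0},
]
unique_chars = dedupe(raw_chars)
print([c["text"] for c in unique_chars])
```

This sketch compares every character against every kept character, which is quadratic in the number of characters; it is written for clarity, not for large pages.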

## Utility Text Functions

Standalone text processing functions available in the utils module.

```python { .api }
# From pdfplumber.utils
def extract_text(chars, **kwargs):
    """Extract text from character objects."""

def extract_text_simple(chars, **kwargs):
    """Simple text extraction from characters."""

def extract_words(chars, **kwargs):
    """Extract words from character objects."""

def dedupe_chars(chars, tolerance=1, **kwargs):
    """Remove duplicate characters from list."""

def chars_to_textmap(chars, **kwargs):
    """Convert characters to TextMap object."""

def collate_line(chars, **kwargs):
    """Collate characters into text line."""
```

**Text Processing Constants:**

```python { .api }
# Default tolerance values
DEFAULT_X_TOLERANCE = 3
DEFAULT_Y_TOLERANCE = 3
DEFAULT_X_DENSITY = 7.25
DEFAULT_Y_DENSITY = 13
```

## TextMap Class

Text mapping object that links each character of the extracted text to the character object it came from (or to `None` for whitespace implied by layout).

```python { .api }
class TextMap:
    """Character-level text mapping with position data."""

    def __init__(self, tuples):
        """Initialize TextMap from (text, char object or None) tuples."""

    # Attributes:
    # - tuples: list of (text, char object or None) pairs
    # - as_string: str - The combined text

    def to_string(self):
        """Return the combined text as a string."""
```

**Usage Examples:**

```python
import pdfplumber
from pdfplumber.utils import chars_to_textmap

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Create TextMap from page characters
    textmap = chars_to_textmap(page.chars)

    # Access the underlying tuples and the combined string
    text_tuples = textmap.tuples
    text_string = textmap.as_string

    print(f"TextMap contains {len(text_tuples)} text elements")
    print(f"Combined text: {text_string}")
```
