# Text Extraction

Text extraction capabilities for PDF pages: layout-aware extraction, word detection, text search, line- and character-level analysis, and supporting text processing options.

## Capabilities

### Layout-Aware Text Extraction

Primary text extraction method. Characters are grouped using `x_tolerance` and `y_tolerance`; with `layout=True`, the output approximates the page layout with whitespace, scaled by `x_density` and `y_density`.

```python { .api }
def extract_text(x_tolerance=3, y_tolerance=3, layout=False,
                 x_density=7.25, y_density=13, **kwargs):
    """
    Extract text using layout-aware algorithm.

    Parameters:
    - x_tolerance: int or float - Horizontal tolerance for grouping characters
    - y_tolerance: int or float - Vertical tolerance for grouping characters
    - layout: bool - Preserve layout with whitespace and positioning
    - x_density: float - Horizontal character density for layout
    - y_density: float - Vertical character density for layout
    - **kwargs: Additional text processing options

    Returns:
    str: Extracted text with layout preservation
    """
```

**Usage Examples:**

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Basic text extraction
    text = page.extract_text()
    print(text)

    # Layout-preserving extraction
    formatted_text = page.extract_text(layout=True)
    print(formatted_text)

    # Fine-tuned character grouping
    precise_text = page.extract_text(x_tolerance=1, y_tolerance=1)
    print(precise_text)

    # Custom density for layout reconstruction
    spaced_text = page.extract_text(layout=True, x_density=10, y_density=15)
    print(spaced_text)
```
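
To make the role of `x_density` and `y_density` more concrete, the sketch below places characters on a text grid by dividing their coordinates by the density values. It is a simplified illustration under stated assumptions, not pdfplumber's actual implementation: the helper name `render_on_grid` and the hand-written character dictionaries are hypothetical, and only the `text`, `x0`, and `top` keys are meant to mirror character objects.

```python
def render_on_grid(chars, x_density=7.25, y_density=13):
    """Place each character on a text grid (hypothetical helper, not part of pdfplumber)."""
    cells = {}
    for ch in chars:
        col = int(ch["x0"] / x_density)   # horizontal points -> column index
        row = int(ch["top"] / y_density)  # vertical points -> row index
        cells[(row, col)] = ch["text"]
    if not cells:
        return ""
    n_rows = max(r for r, _ in cells) + 1
    lines = []
    for r in range(n_rows):
        cols = [c for (rr, c) in cells if rr == r]
        width = max(cols) + 1 if cols else 0
        # Fill every cell without a character with a space
        lines.append("".join(cells.get((r, c), " ") for c in range(width)))
    return "\n".join(lines)


# Synthetic characters: "Hi" at the left margin, "X" further right on the next line
chars = [
    {"text": "H", "x0": 0.0, "top": 0.0},
    {"text": "i", "x0": 7.25, "top": 0.0},
    {"text": "X", "x0": 29.0, "top": 13.0},
]
grid = render_on_grid(chars)
print(grid)
```

In this sketch, larger density values map more points onto each grid cell, so the rendered text becomes more compact.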

### Simple Text Extraction

Streamlined text extraction that skips layout analysis, intended for performance-critical applications.

```python { .api }
def extract_text_simple(**kwargs):
    """
    Extract text using simple algorithm.

    Parameters:
    - **kwargs: Text processing options

    Returns:
    str: Extracted text without layout preservation
    """
```
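
As an illustration of what extraction without layout analysis can look like, the sketch below groups characters into lines by vertical position and inserts a space wherever the horizontal gap exceeds a tolerance. This is an assumption-laden approximation rather than pdfplumber's code: the function name `extract_text_simple_sketch`, its `x_tolerance` and `y_tolerance` parameters, and the synthetic character dictionaries are all hypothetical.

```python
def extract_text_simple_sketch(chars, x_tolerance=3, y_tolerance=3):
    """Join characters into lines of text (illustrative only, not pdfplumber's code)."""
    lines = []  # each entry is a list of chars that share roughly the same `top`
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(ch["top"] - lines[-1][-1]["top"]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])

    out = []
    for line in lines:
        line.sort(key=lambda c: c["x0"])
        text = ""
        prev_x1 = None
        for ch in line:
            # A horizontal gap wider than x_tolerance is treated as a space
            if prev_x1 is not None and ch["x0"] - prev_x1 > x_tolerance:
                text += " "
            text += ch["text"]
            prev_x1 = ch["x1"]
        out.append(text)
    return "\n".join(out)


chars = [
    {"text": "a", "x0": 0, "x1": 5, "top": 10},
    {"text": "b", "x0": 5, "x1": 10, "top": 10.5},
    {"text": "c", "x0": 20, "x1": 25, "top": 10},
    {"text": "d", "x0": 25, "x1": 30, "top": 10},
    {"text": "e", "x0": 0, "x1": 5, "top": 30},
]
simple_text = extract_text_simple_sketch(chars)
print(simple_text)
```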

### Word Extraction

Extract words as dictionaries carrying the word text, its position on the page, and any additional attributes requested through `extra_attrs`.

```python { .api }
def extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False,
                  use_text_flow=False, horizontal_ltr=True, vertical_ttb=True,
                  extra_attrs=None, split_at_punctuation=False, **kwargs):
    """
    Extract words as objects with position data.

    Parameters:
    - x_tolerance: int or float - Horizontal tolerance for word boundaries
    - y_tolerance: int or float - Vertical tolerance for word boundaries
    - keep_blank_chars: bool - Include blank character objects
    - use_text_flow: bool - Use text flow direction for word detection
    - horizontal_ltr: bool - Left-to-right reading order for horizontal text
    - vertical_ttb: bool - Top-to-bottom reading order for vertical text
    - extra_attrs: List[str] - Additional attributes to include in word objects
    - split_at_punctuation: bool - Split words at punctuation marks
    - **kwargs: Additional word processing options

    Returns:
    List[Dict[str, Any]]: List of word objects with position and formatting
    """
```

**Usage Examples:**

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Extract words with position data
    words = page.extract_words()
    for word in words:
        print(f"'{word['text']}' at ({word['x0']}, {word['top']})")

    # Extract words with custom tolerances
    tight_words = page.extract_words(x_tolerance=1, y_tolerance=1)

    # Include font information
    detailed_words = page.extract_words(extra_attrs=['fontname', 'size'])
    for word in detailed_words:
        print(f"'{word['text']}' - Font: {word.get('fontname', 'Unknown')} Size: {word.get('size', 'Unknown')}")
```
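
Because each word carries position data, the returned list can be filtered with plain Python. The sketch below selects the words that fall inside a rectangular region, for example a page header. The helper `words_in_bbox` is hypothetical, the word dictionaries are written by hand, and the `x1` and `bottom` keys are assumed to accompany the `x0` and `top` keys used above.

```python
def words_in_bbox(words, bbox):
    """Return words fully inside bbox = (x0, top, x1, bottom). Hypothetical helper."""
    bx0, btop, bx1, bbottom = bbox
    return [
        w for w in words
        if w["x0"] >= bx0 and w["x1"] <= bx1
        and w["top"] >= btop and w["bottom"] <= bbottom
    ]


# Hand-written stand-ins for the dictionaries a word extraction call would return
words = [
    {"text": "Invoice", "x0": 72, "x1": 120, "top": 40, "bottom": 52},
    {"text": "#1042", "x0": 124, "x1": 160, "top": 40, "bottom": 52},
    {"text": "Total", "x0": 72, "x1": 100, "top": 700, "bottom": 712},
]

# Keep only words in the top 100 points of a 612-point-wide page
header_words = words_in_bbox(words, (0, 0, 612, 100))
print(" ".join(w["text"] for w in header_words))
```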

### Text Line Extraction

Extract text grouped into lines. Each line is a dictionary with its text and position and, when `return_chars=True`, the character objects that make it up.

```python { .api }
def extract_text_lines(strip=True, return_chars=True, **kwargs):
    """
    Extract text lines with character details.

    Parameters:
    - strip: bool - Strip whitespace from line text
    - return_chars: bool - Include character objects in line data
    - **kwargs: Additional line processing options

    Returns:
    List[Dict[str, Any]]: List of line objects with text and character data
    """
```

**Usage Examples:**

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Extract text lines
    lines = page.extract_text_lines()
    for line in lines:
        print(f"Line: '{line['text']}' at y={line['top']}")
        print(f"  Contains {len(line.get('chars', []))} characters")

    # Extract lines without character details
    simple_lines = page.extract_text_lines(return_chars=False)
    for line in simple_lines:
        print(line['text'])
```
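
Line positions can also drive simple structural analysis. The sketch below groups lines into paragraphs whenever the vertical gap between consecutive lines exceeds a threshold. The helper `split_paragraphs`, its `gap_threshold` default, and the sample data are assumptions for illustration; the line dictionaries are assumed to expose `bottom` alongside the `text` and `top` keys used above.

```python
def split_paragraphs(lines, gap_threshold=6):
    """Group line dicts into paragraphs by vertical gap (hypothetical helper)."""
    paragraphs = []
    prev_bottom = None
    for line in sorted(lines, key=lambda item: item["top"]):
        # Start a new paragraph on the first line or after a large vertical gap
        if prev_bottom is None or line["top"] - prev_bottom > gap_threshold:
            paragraphs.append([])
        paragraphs[-1].append(line["text"])
        prev_bottom = line["bottom"]
    return [" ".join(p) for p in paragraphs]


# Hand-written stand-ins for line dictionaries
lines = [
    {"text": "First line", "top": 100, "bottom": 112},
    {"text": "continues here", "top": 114, "bottom": 126},
    {"text": "New paragraph", "top": 140, "bottom": 152},
]
paragraphs = split_paragraphs(lines)
print(paragraphs)
```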

### Text Search

Search page text for a literal string or a regular expression, with optional case-insensitive matching. Each match reports its text and position, plus regex groups and character objects when requested.

```python { .api }
def search(pattern, regex=True, case=True, main_group=0,
           return_chars=True, return_groups=True, **kwargs):
    """
    Search for text patterns with regex support.

    Parameters:
    - pattern: str - Search pattern (literal text or regex)
    - regex: bool - Treat pattern as regular expression
    - case: bool - Case-sensitive search
    - main_group: int - Primary regex group for match extraction
    - return_chars: bool - Include character objects in matches
    - return_groups: bool - Include regex group information
    - **kwargs: Additional search options

    Returns:
    List[Dict[str, Any]]: List of match objects with position and text data
    """
```

**Usage Examples:**

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Simple text search
    matches = page.search("invoice")
    for match in matches:
        print(f"Found '{match['text']}' at ({match['x0']}, {match['top']})")

    # Regex search with groups
    email_matches = page.search(r'(\w+)@(\w+\.\w+)', regex=True)
    for match in email_matches:
        print(f"Email: {match['text']}")
        print(f"Groups: {match.get('groups', [])}")

    # Case-insensitive search
    ci_matches = page.search("TOTAL", case=False)

    # Search with character details
    detailed_matches = page.search("amount", return_chars=True)
    for match in detailed_matches:
        chars = match.get('chars', [])
        print(f"Match uses {len(chars)} characters")
```
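
The `regex` and `case` options can be understood in terms of Python's standard `re` module. The sketch below shows one plausible way such options translate into a compiled pattern; it is an assumption about the semantics, not a copy of pdfplumber's internals, and `compile_search_pattern` is a hypothetical helper that runs on plain strings rather than on a page.

```python
import re


def compile_search_pattern(pattern, regex=True, case=True):
    """Build a compiled pattern from search-style options (illustrative helper)."""
    if not regex:
        pattern = re.escape(pattern)  # treat the pattern as literal text
    flags = 0 if case else re.IGNORECASE
    return re.compile(pattern, flags)


text = "Total: $1.50 (TOTAL due)"

# Literal search: "$" and "." lose their special regex meaning
literal = compile_search_pattern("$1.50", regex=False)
literal_hit = literal.search(text).group(0)

# Case-insensitive search finds both spellings
ci = compile_search_pattern("total", case=False)
ci_hits = [m.group(0) for m in ci.finditer(text)]

# Regex groups: group(0) is the whole match, groups() holds the captured parts
email = compile_search_pattern(r"(\w+)@(\w+\.\w+)")
m = email.search("contact: jane@example.com")
print(literal_hit, ci_hits, m.group(0), m.groups())
```

Under this reading, a `main_group` of 1 would correspond to taking `m.group(1)` as the primary match text instead of `m.group(0)`.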

### Character Processing

Low-level character processing. `dedupe_chars` removes duplicate characters that lie within `tolerance` of one another and returns a new page object.

```python { .api }
def dedupe_chars(tolerance=1, use_text_flow=False, **kwargs):
    """
    Remove duplicate characters.

    Parameters:
    - tolerance: int or float - Distance tolerance for duplicate detection
    - use_text_flow: bool - Consider text flow in deduplication
    - **kwargs: Additional deduplication options

    Returns:
    Page: New page object with deduplicated characters
    """
```
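
Duplicate characters typically appear when a PDF draws the same glyph twice at nearly the same position, for instance to simulate bold text. The sketch below shows the general idea of tolerance-based deduplication on plain dictionaries. It is a simplified stand-in, not pdfplumber's implementation: `dedupe_chars_sketch` is a hypothetical name, and comparing only `text`, `x0`, and `top` is an assumption made for brevity.

```python
def dedupe_chars_sketch(chars, tolerance=1):
    """Drop characters that repeat an earlier one within `tolerance` points."""
    kept = []
    for ch in chars:
        is_duplicate = any(
            ch["text"] == k["text"]
            and abs(ch["x0"] - k["x0"]) <= tolerance
            and abs(ch["top"] - k["top"]) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


chars = [
    {"text": "A", "x0": 10.0, "top": 20.0},
    {"text": "A", "x0": 10.3, "top": 20.2},  # near-identical copy of the first "A"
    {"text": "B", "x0": 10.0, "top": 20.0},  # same position, different text
    {"text": "A", "x0": 30.0, "top": 20.0},  # same text, different position
]
unique_chars = dedupe_chars_sketch(chars)
print("".join(c["text"] for c in unique_chars))
```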

## Utility Text Functions

Standalone text processing functions available in `pdfplumber.utils`. They operate on lists of character objects rather than on a page.

```python { .api }
# From pdfplumber.utils
def extract_text(chars, **kwargs):
    """Extract text from character objects."""

def extract_text_simple(chars, **kwargs):
    """Simple text extraction from characters."""

def extract_words(chars, **kwargs):
    """Extract words from character objects."""

def dedupe_chars(chars, tolerance=1, **kwargs):
    """Remove duplicate characters from list."""

def chars_to_textmap(chars, **kwargs):
    """Convert characters to TextMap object."""

def collate_line(chars, **kwargs):
    """Collate characters into text line."""
```

**Text Processing Constants:**

```python { .api }
# Default tolerance values
DEFAULT_X_TOLERANCE = 3
DEFAULT_Y_TOLERANCE = 3
DEFAULT_X_DENSITY = 7.25
DEFAULT_Y_DENSITY = 13
```

## TextMap Class

Text mapping object for character-level analysis; it keeps position data for the characters behind the extracted text.

```python { .api }
class TextMap:
    """Character-level text mapping with position data."""

    def __init__(self, chars, **kwargs):
        """Initialize TextMap from character objects."""

    def as_list(self):
        """Convert to list representation."""

    def as_string(self):
        """Convert to string representation."""
```

**Usage Examples:**

```python
import pdfplumber
from pdfplumber.utils import chars_to_textmap

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Create TextMap from page characters
    textmap = chars_to_textmap(page.chars)

    # Convert to different representations
    text_list = textmap.as_list()
    text_string = textmap.as_string()

    print(f"TextMap contains {len(text_list)} text elements")
    print(f"Combined text: {text_string}")
```
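
To illustrate why a character-level mapping is useful, the toy class below pairs every character of the output string with the character dictionary it came from, so that a substring can be traced back to a bounding box. `MiniTextMap` and its `bbox_of` method are hypothetical and written only for this example; they are not the `TextMap` class described above, and the sketch assumes one output character per tuple, with `None` standing in for inserted whitespace.

```python
class MiniTextMap:
    """Toy text map: pairs each output character with its source char dict (or None)."""

    def __init__(self, tuples):
        self.tuples = tuples  # list of (single_character, char_dict_or_None)

    def as_string(self):
        return "".join(text for text, _ in self.tuples)

    def bbox_of(self, substring):
        """Bounding box (x0, top, x1, bottom) of the first occurrence, or None."""
        start = self.as_string().find(substring)
        if start < 0:
            return None
        span = self.tuples[start:start + len(substring)]
        chars = [c for _, c in span if c is not None]
        if not chars:
            return None
        return (
            min(c["x0"] for c in chars),
            min(c["top"] for c in chars),
            max(c["x1"] for c in chars),
            max(c["bottom"] for c in chars),
        )


h = {"text": "H", "x0": 0, "x1": 6, "top": 0, "bottom": 10}
i = {"text": "i", "x0": 6, "x1": 12, "top": 0, "bottom": 10}
x = {"text": "X", "x0": 30, "x1": 36, "top": 0, "bottom": 10}

# The space has no source character, so it maps to None
mini = MiniTextMap([("H", h), ("i", i), (" ", None), ("X", x)])
print(mini.as_string(), mini.bbox_of("Hi"), mini.bbox_of("X"))
```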