
tessl/pypi-pdftotext

Simple PDF text extraction library using Poppler backend

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/pdftotext@3.0.x

To install, run

npx @tessl/cli install tessl/pypi-pdftotext@3.0.0

# pdftotext

Simple Python library for extracting text from PDF documents using the Poppler backend. The library provides a minimal but complete API through a single PDF class that supports sequential access to pages, password-protected documents, and multiple text extraction modes for optimal readability.

## Package Information

- **Package Name**: pdftotext
- **Language**: Python (with C++ extension)
- **Installation**: `pip install pdftotext`
- **System Dependencies**: libpoppler-cpp, pkg-config, python3-dev

## Core Imports

```python
import pdftotext
```

## Basic Usage

```python
import pdftotext

# Load a PDF file
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Check page count
print(f"Document has {len(pdf)} pages")

# Read individual pages
print("First page:")
print(pdf[0])

print("Last page:")
print(pdf[-1])

# Iterate through all pages
for page_num, page_text in enumerate(pdf):
    print(f"--- Page {page_num + 1} ---")
    print(page_text)

# Read all text as single string
full_text = "\n\n".join(pdf)
print(full_text)
```

## Capabilities

### PDF Document Loading

Load PDF documents from file-like objects with optional password authentication and text extraction mode configuration.

```python { .api }
class PDF:
    def __init__(self, pdf_file, password="", raw=False, physical=False):
        """
        Initialize PDF object for text extraction.

        Args:
            pdf_file: A file-like object opened in binary mode containing PDF data
            password (str, optional): Password to unlock encrypted PDFs. Both owner and user passwords work. Defaults to "".
            raw (bool, optional): Extract text in content stream order (as stored in PDF). Defaults to False.
            physical (bool, optional): Extract text in physical layout order (spatial arrangement on page). Defaults to False.

        Raises:
            pdftotext.Error: If PDF is invalid, corrupted, or password-protected without correct password
            TypeError: If pdf_file is not a file-like object or opened in text mode
            ValueError: If both raw and physical are True, or if raw/physical values are invalid

        Note:
            The raw and physical parameters are mutually exclusive. Default mode provides most readable output
            by respecting logical document structure. Usually this is preferred over raw or physical modes.
        """
```

### Page Access

Access individual pages as strings through a sequence-like interface that supports indexing and iteration.

```python { .api }
def __len__(self) -> int:
    """
    Return the number of pages in the PDF document.

    Returns:
        int: Number of pages in the document
    """

def __getitem__(self, index: int) -> str:
    """
    Get text content of a specific page.

    Args:
        index (int): Page index (0-based). Supports negative indexing.

    Returns:
        str: Text content of the page as UTF-8 string

    Raises:
        IndexError: If index is out of range
        pdftotext.Error: If page cannot be read due to corruption
    """

def __iter__(self):
    """
    Enable iteration over pages, yielding page text.

    Yields:
        str: Text content of each page in sequence

    Example:
        for page in pdf:
            print(page)
    """
```
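Because pages behave like a sequence of strings, ordinary Python tools apply to them. The following is a minimal sketch of a keyword search over pages. The helper name `pages_containing` is hypothetical (it is not part of pdftotext), and a plain list of strings stands in for a `pdftotext.PDF` object so the example runs without a PDF file.

```python
from typing import Iterable, List


def pages_containing(pages: Iterable[str], keyword: str) -> List[int]:
    """Return 1-based page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        number
        for number, text in enumerate(pages, start=1)
        if needle in text.lower()
    ]


# Stand-in for a pdftotext.PDF object: a plain list of page strings.
sample_pages = [
    "Introduction\nScope",
    "Methods and data",
    "Results\nscope of findings",
]
print(pages_containing(sample_pages, "scope"))  # prints [1, 3]
```

Given the iteration support described above, passing a `pdftotext.PDF` instance in place of `sample_pages` should work the same way.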

### Text Extraction Modes

Configure how text is extracted from PDF pages to optimize for different document layouts and reading requirements.

**Default Mode** (recommended): Most readable output that respects logical document structure. Handles multi-column layouts, reading order, and text flow intelligently.

**Raw Mode** (`raw=True`): Extracts text in the order it appears in the PDF content stream. Useful for debugging or when document structure is less important than preserving original ordering.

**Physical Mode** (`physical=True`): Extracts text in physical layout order based on spatial arrangement on the page. Can be useful for documents with complex layouts where spatial positioning matters.

Usage examples:

```python
import pdftotext

# Default mode - most readable
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
text = pdf[0]  # Respects logical structure

# Raw mode - content stream order
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, raw=True)
text = pdf[0]  # Order as stored in PDF

# Physical mode - spatial order
with open("document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, physical=True)
text = pdf[0]  # Spatial arrangement on page
```
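Since `raw` and `physical` are mutually exclusive, one way to avoid passing both by accident is to select the mode by name and translate it into keyword arguments. This is only a sketch: `extraction_kwargs` and the mode names `"default"`, `"raw"`, and `"physical"` are illustrative choices, not part of the pdftotext API.

```python
from typing import Dict


def extraction_kwargs(mode: str = "default") -> Dict[str, bool]:
    """Map a mode name to the raw/physical keyword arguments for PDF()."""
    modes = {
        "default": {"raw": False, "physical": False},
        "raw": {"raw": True, "physical": False},
        "physical": {"raw": False, "physical": True},
    }
    if mode not in modes:
        raise ValueError(f"unknown extraction mode: {mode!r}")
    # Return a copy so callers cannot mutate the lookup table.
    return dict(modes[mode])


print(extraction_kwargs("raw"))  # prints {'raw': True, 'physical': False}
```

The result can then be unpacked into the constructor, for example `pdftotext.PDF(f, **extraction_kwargs("physical"))`.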

### Password-Protected PDFs


Handle encrypted PDF documents using owner or user passwords.

```python
import pdftotext

# Unlock with password
with open("secure_document.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, password="secret123")
text = pdf[0]

# Both owner and user passwords work
with open("encrypted.pdf", "rb") as f:
    # This works with either password type
    pdf = pdftotext.PDF(f, password="owner_password")
    # or, as an alternative to the line above:
    # pdf = pdftotext.PDF(f, password="user_password")
```
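When the correct password is one of a few known candidates, each can be tried in turn. The sketch below is an assumption-laden illustration, not library functionality: `open_with_first_valid_password` is a hypothetical helper, and the constructor and exception type are passed in as parameters so that the demonstration can run with stand-ins (`fake_factory`, `FakeError`) instead of a real encrypted PDF. A fresh `io.BytesIO` is created per attempt so every attempt reads the data from the start.

```python
import io
from typing import Callable, Iterable, Optional, Tuple, Type


def open_with_first_valid_password(
    data: bytes,
    candidates: Iterable[str],
    pdf_factory: Callable[..., object],
    error_type: Type[Exception],
) -> Tuple[Optional[object], Optional[str]]:
    """Try each candidate password; return (pdf, password) or (None, None)."""
    for candidate in candidates:
        try:
            # New stream per attempt, so each read starts at offset 0.
            return pdf_factory(io.BytesIO(data), password=candidate), candidate
        except error_type:
            continue
    return None, None


class FakeError(Exception):
    """Stand-in for pdftotext.Error in this demonstration."""


def fake_factory(stream, password=""):
    """Stand-in for pdftotext.PDF: accepts only the password 'secret123'."""
    if password != "secret123":
        raise FakeError("wrong password")
    return ["page text"]


pdf, used = open_with_first_valid_password(
    b"%PDF-", ["", "guess", "secret123"], fake_factory, FakeError
)
print(used)  # prints secret123
```

With the real library, the call would presumably take `pdftotext.PDF` and `pdftotext.Error` as the last two arguments. Note that `pdftotext.Error` also covers invalid or corrupted files, so a `(None, None)` result does not by itself prove that every password was wrong.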

### Error Handling


Handle PDF-related errors and edge cases gracefully.

```python { .api }
class Error(Exception):
    """
    Exception raised for PDF-related errors.

    Raised when:
    - PDF file is invalid or corrupted
    - PDF is password-protected and no/wrong password provided
    - Poppler library encounters errors during processing
    - Page cannot be read due to corruption
    """
```

Example error handling:

181

182

```python

183

import pdftotext

184

185

try:

186

with open("document.pdf", "rb") as f:

187

pdf = pdftotext.PDF(f)

188

text = pdf[0]

189

except pdftotext.Error as e:

190

print(f"PDF error: {e}")

191

except FileNotFoundError:

192

print("PDF file not found")

193

except IndexError as e:

194

print(f"Page index error: {e}")

195

```

196

197

## Types

```python { .api }
class PDF:
    """
    Main class for PDF text extraction with sequence-like interface.

    Provides:
    - Sequential access to pages via indexing (pdf[0], pdf[1], etc.)
    - Length operation (len(pdf))
    - Iteration support (for page in pdf)
    - Password authentication for encrypted PDFs
    - Multiple text extraction modes (default, raw, physical)
    """

class Error(Exception):
    """
    Custom exception class for PDF-related errors.

    Inherits from built-in Exception class and is raised for:
    - Invalid or corrupted PDF files
    - Authentication failures on password-protected PDFs
    - Poppler library processing errors
    - Page reading errors due to corruption
    """
```

## Common Usage Patterns

225

226

### Processing Multi-page Documents

227

228

```python

229

import pdftotext

230

231

with open("report.pdf", "rb") as f:

232

pdf = pdftotext.PDF(f)

233

234

# Process each page

235

for i, page in enumerate(pdf):

236

print(f"=== Page {i + 1} ===")

237

print(page[:100] + "..." if len(page) > 100 else page)

238

239

# Or get all text at once

240

full_document = "\n\n".join(pdf)

241

```

242

243

### Handling Different Document Types

244

245

```python

246

# Regular document

247

with open("document.pdf", "rb") as f:

248

pdf = pdftotext.PDF(f)

249

250

# Password-protected document

251

with open("secure.pdf", "rb") as f:

252

pdf = pdftotext.PDF(f, password="mypassword")

253

254

# Multi-column document (try physical mode)

255

with open("newspaper.pdf", "rb") as f:

256

pdf = pdftotext.PDF(f, physical=True)

257

258

# Document with complex layout (try raw mode)

259

with open("form.pdf", "rb") as f:

260

pdf = pdftotext.PDF(f, raw=True)

261

```

262

263

### Robust Error Handling

264

265

```python

266

import pdftotext

267

268

def extract_pdf_text(filepath, password=None):

269

"""Extract text from PDF with comprehensive error handling."""

270

try:

271

with open(filepath, "rb") as f:

272

if password:

273

pdf = pdftotext.PDF(f, password=password)

274

else:

275

pdf = pdftotext.PDF(f)

276

277

return [page for page in pdf]

278

279

except FileNotFoundError:

280

print(f"File not found: {filepath}")

281

return None

282

except pdftotext.Error as e:

283

print(f"PDF processing error: {e}")

284

return None

285

except Exception as e:

286

print(f"Unexpected error: {e}")

287

return None

288

```