# Command Line Interface

Complete command-line interface for PDF processing with support for text extraction, object export, structure analysis, and various output formats.

## Capabilities

### Main CLI Function

Entry point for the pdfplumber command-line interface with comprehensive argument parsing.

```python { .api }
def main(args_raw=None):
    """
    CLI entry point with full argument parsing.

    Parameters:
    - args_raw: List[str], optional - Command line arguments (defaults to sys.argv[1:])

    Returns:
    None: Writes results to standard output
    """
```

### Command Line Usage

The pdfplumber CLI can be invoked in several ways:

```bash
# As installed command
pdfplumber document.pdf

# As Python module
python -m pdfplumber.cli document.pdf

# From Python code (wrapped in python -c so it runs from a shell)
python -c "import pdfplumber.cli; pdfplumber.cli.main(['document.pdf', '--format', 'json'])"
```

### Basic Arguments

Core arguments for specifying input and output behavior.

```bash
# Input file (optional; the CLI reads from stdin when it is omitted)
pdfplumber document.pdf

# Output format (csv, json, or text; csv is the default)
pdfplumber document.pdf --format json

# Save the output by redirecting stdout to a file
pdfplumber document.pdf --format json > output.json
```

### Object Type Selection

Control which PDF objects to include in the output.

```bash
# Include specific object types (space-separated, singular names)
pdfplumber document.pdf --types char rect line

# Common object types:
# - char: character objects
# - rect: rectangle objects
# - line: line objects
# - curve: curve objects
# - image: image objects
# - annot: annotations
# - edge: computed edges
```

### Attribute Filtering

Control which object attributes to include or exclude from output.

```bash
# Include only specific attributes (space-separated)
pdfplumber document.pdf --include-attrs text x0 top x1 bottom

# Exclude specific attributes
pdfplumber document.pdf --exclude-attrs object_type stream

# Common character attributes:
# - text: character text content
# - x0, top, x1, bottom: positioning
# - fontname: font family name
# - size: font size
# - adv: character advance width
```
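To make the include/exclude semantics concrete, here is a minimal Python sketch of what the two options do to a single object record. `filter_attrs` is a hypothetical helper written for illustration, not part of pdfplumber, and applying the exclusion on top of the inclusion is an assumption of this sketch.

```python
from typing import Any, Dict, Iterable, Optional


def filter_attrs(
    obj: Dict[str, Any],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Keep only `include` keys (if given), then drop any `exclude` keys."""
    keep = set(include) if include is not None else None
    drop = set(exclude) if exclude is not None else set()
    return {
        key: value
        for key, value in obj.items()
        if (keep is None or key in keep) and key not in drop
    }


# Hypothetical character record, shaped like the examples in this document
char = {"object_type": "char", "text": "H", "x0": 72.0, "top": 100.0, "fontname": "Arial"}

print(filter_attrs(char, include=["text", "x0", "top"]))
print(filter_attrs(char, exclude=["object_type"]))
```

Keeping the filter as a pure function over dictionaries makes it easy to apply the same trimming to records that were already exported without filtering.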

### Page Selection

Process specific pages or page ranges.

```bash
# Single page (page numbers are 1-indexed)
pdfplumber document.pdf --pages 1

# Multiple pages (space-separated)
pdfplumber document.pdf --pages 1 3 5

# Page ranges (inclusive)
pdfplumber document.pdf --pages 1-6

# Mixed ranges and individual pages
pdfplumber document.pdf --pages 1 3-6 11
```
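The way a mixed page specification expands into individual page numbers can be sketched in a few lines of Python. This is an illustration only: `expand_page_specs` is a hypothetical helper, and it assumes each token is either a single page number or an inclusive `start-end` range.

```python
from typing import List


def expand_page_specs(specs: List[str]) -> List[int]:
    """Expand tokens such as "3" or "3-5" into a flat list of page numbers."""
    pages: List[int] = []
    for spec in specs:
        if "-" in spec:
            start, end = (int(part) for part in spec.split("-"))
            pages.extend(range(start, end + 1))  # ranges include both endpoints
        else:
            pages.append(int(spec))
    return pages


print(expand_page_specs(["1", "3-5", "10"]))  # [1, 3, 4, 5, 10]
```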

### Layout Analysis Parameters

Configure PDF layout analysis using LAParams settings.

```bash
# JSON-encoded LAParams
pdfplumber document.pdf --laparams '{"word_margin": 0.1, "char_margin": 2.0}'

# Common LAParams options:
# - word_margin: horizontal margin for word detection
# - char_margin: margin for character grouping
# - line_margin: margin for line detection
# - boxes_flow: flow threshold for text boxes
```
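Hand-writing JSON inside shell quotes is easy to get wrong. One option, sketched below, is to build the `--laparams` value from a Python dictionary with `json.dumps` and let `shlex.join` handle the quoting. The parameter values are the ones used in this section; the file name is a placeholder.

```python
import json
import shlex

# LAParams settings as a plain dictionary (values taken from the example above)
laparams = {"word_margin": 0.1, "char_margin": 2.0}

# Argument vector suitable for subprocess.run(argv); no shell quoting is needed here
argv = ["pdfplumber", "document.pdf", "--laparams", json.dumps(laparams)]

# Shell-safe string form, useful for logging or pasting into a terminal
command_line = shlex.join(argv)
print(command_line)
```

Passing `argv` directly to `subprocess.run` avoids the shell entirely, so the JSON string reaches the CLI unchanged.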

### Output Formatting

Control output precision and formatting.

```bash
# Set numeric precision (decimal places)
pdfplumber document.pdf --precision 2

# JSON indentation
pdfplumber document.pdf --format json --indent 2

# Pretty-printed JSON
pdfplumber document.pdf --format json --indent 4
```
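The effect of a precision setting can be pictured as rounding every floating-point value in a record while leaving other values alone. The sketch below illustrates that effect only; `round_floats` is a hypothetical helper and does not reproduce pdfplumber's own serialization code.

```python
from typing import Any


def round_floats(value: Any, ndigits: int) -> Any:
    """Recursively round floats inside dicts, lists, and tuples."""
    if isinstance(value, float):
        return round(value, ndigits)
    if isinstance(value, dict):
        return {key: round_floats(item, ndigits) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [round_floats(item, ndigits) for item in value]
    return value  # strings, ints, None, and so on pass through unchanged


# Hypothetical record with more decimal places than needed
record = {"text": "H", "x0": 72.123456, "top": 100.0, "matrix": [1.0, 0.33333]}
print(round_floats(record, 2))
```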

### Structure Tree Analysis

Extract and analyze PDF structure tree for accessibility information.

```bash
# Output structure tree as JSON
pdfplumber document.pdf --structure

# Include text content in structure tree
pdfplumber document.pdf --structure-text

# Save the structure tree to a file
# (with --structure, options other than --pages, --laparams, and --indent are ignored)
pdfplumber document.pdf --structure --indent 2 > structure.json
```

## Output Formats

### CSV Format

Default output format providing tabular data suitable for spreadsheet analysis.

```bash
pdfplumber document.pdf --format csv
# Outputs CSV with columns for each object attribute
```

**Example CSV Output:**
```csv
object_type,page_number,x0,top,x1,bottom,text,fontname,size
char,1,72.0,100.0,80.0,110.0,"H","Arial",12.0
char,1,80.0,100.0,88.0,110.0,"e","Arial",12.0
char,1,88.0,100.0,94.0,110.0,"l","Arial",12.0
```
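CSV output of this shape can be read back with Python's standard `csv` module. The sketch below parses the example rows from this section; note that `csv.DictReader` returns every field as a string, so coordinates need an explicit `float` conversion.

```python
import csv
import io

# The example rows from this section, inlined so the snippet is self-contained
sample = '''object_type,page_number,x0,top,x1,bottom,text,fontname,size
char,1,72.0,100.0,80.0,110.0,"H","Arial",12.0
char,1,80.0,100.0,88.0,110.0,"e","Arial",12.0
char,1,88.0,100.0,94.0,110.0,"l","Arial",12.0
'''

rows = list(csv.DictReader(io.StringIO(sample)))

# Join the characters back into a string
text = "".join(row["text"] for row in rows if row["object_type"] == "char")

# Compute each character's width from its horizontal bounds
widths = [float(row["x1"]) - float(row["x0"]) for row in rows]

print(text)    # Hel
print(widths)  # [8.0, 8.0, 6.0]
```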

### JSON Format

Structured output format ideal for programmatic processing.

```bash
pdfplumber document.pdf --format json
# Outputs JSON describing the document's objects
```

**Example JSON Output (simplified to a single character object):**
```json
[
  {
    "object_type": "char",
    "page_number": 1,
    "x0": 72.0,
    "top": 100.0,
    "x1": 80.0,
    "bottom": 110.0,
    "text": "H",
    "fontname": "Arial",
    "size": 12.0
  }
]
```

### Text Format

Simple text extraction output.

```bash
pdfplumber document.pdf --format text
# Outputs extracted text content
```

## Advanced Usage Examples

### Extract Character Data

```bash
# Get all character data with position information
pdfplumber document.pdf --types char --format json --indent 2

# Get character text and positions only
pdfplumber document.pdf --types char --include-attrs text x0 top x1 bottom

# High-precision character coordinates
pdfplumber document.pdf --types char --precision 4
```

### Analyze Document Structure

```bash
# Get comprehensive object data
pdfplumber document.pdf --types char rect line curve --format json

# Focus on text elements
pdfplumber document.pdf --types char --include-attrs text fontname size x0 top

# Extract accessibility structure (always JSON; --format is ignored here)
pdfplumber document.pdf --structure-text --indent 2
```

### Process Specific Pages

```bash
# Analyze first page only (pages are 1-indexed)
pdfplumber document.pdf --pages 1 --format json

# Compare multiple pages
pdfplumber document.pdf --pages 1 2 3 --types char --include-attrs text page_number

# Process large document selectively
pdfplumber document.pdf --pages 10-20 --format csv
```

### Custom Layout Analysis

```bash
# Tight character grouping
pdfplumber document.pdf --laparams '{"char_margin": 1.0, "word_margin": 0.05}'

# Loose text flow detection
pdfplumber document.pdf --laparams '{"boxes_flow": 0.7, "word_margin": 0.2}'

# Combine with specific output
pdfplumber document.pdf --laparams '{"word_margin": 0.1}' --types char --format json
```

### Data Pipeline Integration

```bash
# Extract to structured data file
pdfplumber document.pdf --format json --indent 2 > document_data.json

# Create CSV for analysis (csv is the default format)
pdfplumber document.pdf --types char --include-attrs text x0 top fontname size > analysis.csv

# Process multiple files
for file in *.pdf; do
  pdfplumber "$file" --format json > "${file%.pdf}.json"
done
```
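The batch loop above can also be driven from Python, which makes it easier to add logging or error handling per file. The sketch below is one possible arrangement: `plan_batch` and `run_batch` are hypothetical helper names, and `run_batch` assumes the `pdfplumber` command is available on `PATH`.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple


def plan_batch(directory: Path) -> List[Tuple[Path, Path]]:
    """Pair every PDF in `directory` with the JSON file its output will go to."""
    return [(pdf, pdf.with_suffix(".json")) for pdf in sorted(directory.glob("*.pdf"))]


def run_batch(directory: Path) -> None:
    """Run the CLI once per PDF, writing stdout to the paired JSON file."""
    for pdf, out in plan_batch(directory):
        with out.open("w", encoding="utf-8") as handle:
            # check=True raises CalledProcessError if the CLI exits with an error
            subprocess.run(
                ["pdfplumber", str(pdf), "--format", "json"],
                stdout=handle,
                check=True,
            )


# Example (not executed here): run_batch(Path("."))
```

Separating the planning step from the execution step lets the input/output pairing be inspected or tested without invoking the CLI.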

### Debugging and Analysis

```bash
# Get all available object attributes
pdfplumber document.pdf --types char --format json --indent 2 | head -20

# Analyze font usage
pdfplumber document.pdf --types char --include-attrs fontname size --format csv | sort | uniq -c

# Extract rectangle information (tables, forms)
pdfplumber document.pdf --types rect --include-attrs x0 top x1 bottom width height

# Comprehensive document analysis
pdfplumber document.pdf --types char rect line curve image --format json
```
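The font-usage pipeline counts identical lines, which also counts the CSV header row once. A Python version using `collections.Counter` skips the header and gives structured results. The sketch below assumes CSV text with `fontname` and `size` columns; the sample data is made up for illustration.

```python
import csv
import io
from collections import Counter
from typing import Tuple


def count_fonts(csv_text: str) -> "Counter[Tuple[str, str]]":
    """Count how many rows use each (fontname, size) combination."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["fontname"], row["size"]) for row in reader)


# Made-up sample standing in for the CLI's CSV output
sample = "fontname,size\nArial,12.0\nArial,12.0\nTimes,10.0\n"

for (fontname, size), count in count_fonts(sample).most_common():
    print(count, fontname, size)
```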

## Error Handling

The CLI reports common problems as shown below. Exact messages vary by pdfplumber version, so the comments describe the typical behavior rather than quoting output:

```bash
# Invalid file
pdfplumber nonexistent.pdf
# Argument parsing fails because the file cannot be opened

# Page number beyond the end of the document
pdfplumber document.pdf --pages 999
# No matching page exists, so no objects are returned for it

# Invalid JSON in laparams
pdfplumber document.pdf --laparams '{"invalid": json}'
# Argument parsing fails because the value is not valid JSON

# Malformed PDF
pdfplumber corrupted.pdf
# The underlying PDF parser raises an error
```
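When the CLI is driven from a script, some of these failures can be caught before the command is launched. The following is a minimal pre-flight check; `preflight` is a hypothetical helper, and the messages it returns are this sketch's own, not pdfplumber's.

```python
import json
from pathlib import Path
from typing import List, Optional


def preflight(pdf_path: str, laparams_json: Optional[str] = None) -> List[str]:
    """Return a list of problems found; an empty list means the inputs look usable."""
    problems: List[str] = []

    if not Path(pdf_path).is_file():
        problems.append(f"input file not found: {pdf_path}")

    if laparams_json is not None:
        try:
            parsed = json.loads(laparams_json)
        except json.JSONDecodeError as exc:
            problems.append(f"--laparams is not valid JSON: {exc.msg}")
        else:
            if not isinstance(parsed, dict):
                problems.append("--laparams must be a JSON object")

    return problems


for problem in preflight("nonexistent.pdf", '{"invalid": json}'):
    print(problem)
```

This check does not detect malformed PDFs; those still surface only when the file is parsed.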

## Integration with Python Scripts

The CLI can be integrated into Python workflows:

```python
import json
import subprocess


def extract_pdf_data(pdf_path, pages=None, object_types=None):
    """Extract PDF data by running the pdfplumber CLI and parsing its JSON output."""
    cmd = ['pdfplumber', pdf_path, '--format', 'json']

    if pages:
        # --pages takes space-separated values, so pass each page as its own argument
        cmd.extend(['--pages', *map(str, pages)])

    if object_types:
        # --types likewise takes one argument per object type
        cmd.extend(['--types', *object_types])

    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode == 0:
        return json.loads(result.stdout)
    raise RuntimeError(f"CLI error: {result.stderr}")


# Usage
data = extract_pdf_data("document.pdf", pages=[1, 2], object_types=['char'])
```

## Performance Considerations

For large documents or batch processing:

```bash
# Process specific pages to reduce memory usage
pdfplumber large_document.pdf --pages 1-10

# Limit object types to improve processing speed
pdfplumber document.pdf --types char

# Reduce output size with attribute filtering
pdfplumber document.pdf --include-attrs text x0 top x1 bottom

# Prefer CSV for large datasets; its flat rows are usually more compact than JSON
pdfplumber document.pdf --format csv
```