# Command Line Interface

Complete command-line interface for PDF processing with support for text extraction, object export, structure analysis, and various output formats.

## Capabilities

### Main CLI Function

Entry point for the pdfplumber command-line interface with comprehensive argument parsing.

```python { .api }
def main(args_raw=None):
    """
    CLI entry point with full argument parsing.

    Parameters:
    - args_raw: List[str], optional - Command line arguments (defaults to sys.argv[1:])

    Returns:
    None: Outputs results to specified destination
    """
```

### Command Line Usage

The pdfplumber CLI can be invoked in several ways:

```bash
# As installed command
pdfplumber document.pdf

# As Python module
python -m pdfplumber.cli document.pdf

# From Python code (written as a one-liner so this block stays valid shell)
python -c "import pdfplumber.cli; pdfplumber.cli.main(['document.pdf', '--format', 'json'])"
```

### Basic Arguments

Core arguments for specifying input and output behavior.

```bash
# Input file (optional; the CLI reads from stdin if no file is given)
pdfplumber document.pdf

# Output format (csv, json, text)
pdfplumber document.pdf --format json

# Write output to a file by redirecting stdout
pdfplumber document.pdf --format json > output.json
```

### Object Type Selection

Control which PDF objects to include in the output. Type names are singular and separated by spaces.

```bash
# Include specific object types
pdfplumber document.pdf --types char rect line

# Common object types:
# - char: character objects
# - rect: rectangle objects
# - line: line objects
# - curve: curve objects
# - image: image objects
# - annot: annotations
# - edge: computed edges
```

### Attribute Filtering

Control which object attributes to include in or exclude from the output. Attribute names are separated by spaces.

```bash
# Include only specific attributes
pdfplumber document.pdf --include-attrs text x0 top x1 bottom

# Exclude specific attributes
pdfplumber document.pdf --exclude-attrs object_type stream

# Common character attributes:
# - text: character text content
# - x0, top, x1, bottom: positioning
# - fontname: font family name
# - size: font size
# - adv: character advance width
```
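The same include/exclude idea can be applied to objects after they have been loaded in Python. The following is a minimal sketch, not pdfplumber's own implementation: `filter_attrs` is a hypothetical helper, and the rule that the two lists cannot be combined is an assumption of the sketch.

```python
from typing import Any, Dict, Iterable, Optional


def filter_attrs(
    obj: Dict[str, Any],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Return a copy of obj restricted by an include list or an exclude list."""
    if include is not None and exclude is not None:
        raise ValueError("pass either include or exclude, not both")
    if include is not None:
        wanted = set(include)
        return {k: v for k, v in obj.items() if k in wanted}
    if exclude is not None:
        unwanted = set(exclude)
        return {k: v for k, v in obj.items() if k not in unwanted}
    return dict(obj)


# Hypothetical character object with a handful of attributes
char = {"object_type": "char", "text": "H", "x0": 72.0, "top": 100.0, "fontname": "Arial"}

print(filter_attrs(char, include=["text", "x0", "top"]))
# {'text': 'H', 'x0': 72.0, 'top': 100.0}
print(filter_attrs(char, exclude=["object_type", "fontname"]))
# {'text': 'H', 'x0': 72.0, 'top': 100.0}
```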

### Page Selection

Process specific pages or page ranges. Page numbers are 1-indexed, and multiple values are separated by spaces.

```bash
# Single page (1-indexed)
pdfplumber document.pdf --pages 1

# Multiple pages
pdfplumber document.pdf --pages 1 3 5

# Page ranges (inclusive)
pdfplumber document.pdf --pages 1-5

# Mixed ranges and individual pages
pdfplumber document.pdf --pages 1 3-5 10
```
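To make the range syntax concrete, here is a minimal sketch of how hyphenated page specs can be expanded into explicit page numbers. It assumes each spec is either a single number or an inclusive `start-end` range; `expand_page_specs` is a hypothetical helper for illustration, not the CLI's own parser.

```python
from typing import Iterable, List


def expand_page_specs(specs: Iterable[str]) -> List[int]:
    """Expand specs such as "3" or "3-5" into a flat list of page numbers."""
    pages: List[int] = []
    for spec in specs:
        if "-" in spec:
            start, end = (int(part) for part in spec.split("-", 1))
            if start > end:
                raise ValueError(f"invalid page range: {spec!r}")
            pages.extend(range(start, end + 1))  # ranges are inclusive
        else:
            pages.append(int(spec))
    return pages


print(expand_page_specs(["1", "3-5", "10"]))  # [1, 3, 4, 5, 10]
```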

### Layout Analysis Parameters

Configure PDF layout analysis using LAParams settings.

```bash
# JSON-encoded LAParams
pdfplumber document.pdf --laparams '{"word_margin": 0.1, "char_margin": 2.0}'

# Common LAParams options:
# - word_margin: horizontal margin for word detection
# - char_margin: margin for character grouping
# - line_margin: margin for line detection
# - boxes_flow: flow threshold for text boxes
```
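Hand-writing JSON inside shell quotes is easy to get wrong, so when the command is assembled from Python it may be simpler to let `json.dumps` produce the value. The sketch below assumes only the `--laparams` flag shown above; `laparams_args` is a hypothetical helper name.

```python
import json
import shlex
from typing import Any, Dict, List


def laparams_args(params: Dict[str, Any]) -> List[str]:
    """Return the two argv entries for --laparams, with params JSON-encoded."""
    return ["--laparams", json.dumps(params)]


cmd = ["pdfplumber", "document.pdf", *laparams_args({"word_margin": 0.1, "char_margin": 2.0})]

# shlex.join quotes the JSON so the printed command can be pasted into a POSIX shell.
print(shlex.join(cmd))
```

Passing `cmd` as a list to `subprocess.run` avoids shell quoting altogether; `shlex.join` is only needed when the command is displayed or logged.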

### Output Formatting

Control output precision and formatting.

```bash
# Set numeric precision (decimal places)
pdfplumber document.pdf --precision 2

# JSON indented by 2 spaces
pdfplumber document.pdf --format json --indent 2

# JSON indented by 4 spaces
pdfplumber document.pdf --format json --indent 4
```
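The effect of a precision setting combined with indentation can be reproduced on already-loaded data. This is a minimal sketch under the assumption that precision means rounding every floating-point value to a fixed number of decimal places; `round_floats` is a hypothetical helper, not the library's serializer.

```python
import json
from typing import Any


def round_floats(value: Any, precision: int) -> Any:
    """Recursively round every float inside nested dicts and lists."""
    if isinstance(value, float):
        return round(value, precision)
    if isinstance(value, dict):
        return {k: round_floats(v, precision) for k, v in value.items()}
    if isinstance(value, list):
        return [round_floats(v, precision) for v in value]
    return value  # ints, strings, None, and booleans pass through unchanged


# Hypothetical character object with unrounded coordinates
char = {"text": "H", "x0": 72.12345, "top": 100.98765, "page_number": 1}
print(json.dumps(round_floats(char, 2), indent=2))
```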

### Structure Tree Analysis

Extract and analyze the PDF structure tree for accessibility information. The two structure options are alternatives to each other and replace the regular object output.

```bash
# Output structure tree as JSON
pdfplumber document.pdf --structure

# Include text content in structure tree
pdfplumber document.pdf --structure-text

# Save an indented structure tree to a file
pdfplumber document.pdf --structure --indent 2 > structure.json
```

## Output Formats

### CSV Format

Default output format providing tabular data suitable for spreadsheet analysis.

```bash
pdfplumber document.pdf --format csv
# Outputs CSV with columns for each object attribute
```

**Example CSV Output:**
```csv
object_type,page_number,x0,top,x1,bottom,text,fontname,size
char,1,72.0,100.0,80.0,110.0,"H","Arial",12.0
char,1,80.0,100.0,88.0,110.0,"e","Arial",12.0
char,1,88.0,100.0,94.0,110.0,"l","Arial",12.0
```
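CSV output of this shape can be read back with the standard library. The sketch below reuses the example rows above as inline sample data and assumes those column names; `load_chars` is a hypothetical helper, and real output may contain more columns.

```python
import csv
import io

# Sample rows copied from the example CSV output above.
SAMPLE = '''object_type,page_number,x0,top,x1,bottom,text,fontname,size
char,1,72.0,100.0,80.0,110.0,"H","Arial",12.0
char,1,80.0,100.0,88.0,110.0,"e","Arial",12.0
char,1,88.0,100.0,94.0,110.0,"l","Arial",12.0
'''

NUMERIC = ("x0", "top", "x1", "bottom", "size")


def load_chars(stream):
    """Read CSV rows and convert coordinate columns from str to float."""
    rows = []
    for row in csv.DictReader(stream):
        for key in NUMERIC:
            if row.get(key):  # skip columns that are absent or empty
                row[key] = float(row[key])
        row["page_number"] = int(row["page_number"])
        rows.append(row)
    return rows


chars = load_chars(io.StringIO(SAMPLE))
print("".join(c["text"] for c in chars))  # Hel
print(chars[0]["x1"] - chars[0]["x0"])    # 8.0
```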

### JSON Format

Structured output format ideal for programmatic processing.

```bash
pdfplumber document.pdf --format json
# Outputs JSON array of objects
```

**Example JSON Output:**
```json
[
  {
    "object_type": "char",
    "page_number": 1,
    "x0": 72.0,
    "top": 100.0,
    "x1": 80.0,
    "bottom": 110.0,
    "text": "H",
    "fontname": "Arial",
    "size": 12.0
  }
]
```

### Text Format

Simple text extraction output.

```bash
pdfplumber document.pdf --format text
# Outputs extracted text content
```

## Advanced Usage Examples

### Extract Character Data

```bash
# Get all character data with position information
pdfplumber document.pdf --types char --format json --indent 2

# Get character text and positions only
pdfplumber document.pdf --types char --include-attrs text x0 top x1 bottom

# High-precision character coordinates
pdfplumber document.pdf --types char --precision 4
```

### Analyze Document Structure

```bash
# Get comprehensive object data
pdfplumber document.pdf --types char rect line curve --format json

# Focus on text elements
pdfplumber document.pdf --types char --include-attrs text fontname size x0 top

# Extract accessibility structure
pdfplumber document.pdf --structure-text --format json
```

### Process Specific Pages

```bash
# Analyze first page only (pages are 1-indexed)
pdfplumber document.pdf --pages 1 --format json

# Compare multiple pages
pdfplumber document.pdf --pages 1 2 3 --types char --include-attrs text page_number

# Process large document selectively
pdfplumber document.pdf --pages 10-20 --format csv
```

### Custom Layout Analysis

```bash
# Tight character grouping
pdfplumber document.pdf --laparams '{"char_margin": 1.0, "word_margin": 0.05}'

# Loose text flow detection
pdfplumber document.pdf --laparams '{"boxes_flow": 0.7, "word_margin": 0.2}'

# Combine with specific output
pdfplumber document.pdf --laparams '{"word_margin": 0.1}' --types char --format json
```

### Data Pipeline Integration

```bash
# Extract to structured data file
pdfplumber document.pdf --format json --indent 2 > document_data.json

# Create CSV for analysis
pdfplumber document.pdf --types char --include-attrs text x0 top fontname size > analysis.csv

# Process multiple files
for file in *.pdf; do
    pdfplumber "$file" --format json > "${file%.pdf}.json"
done
```
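The batch loop above can also be driven from Python, which makes error handling explicit. This is a minimal sketch that assumes the `pdfplumber` command is on `PATH` and writes to stdout as shown above; `build_command` and `convert_all` are hypothetical helper names.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf_path: Path, fmt: str = "json") -> List[str]:
    """Build the argv list for one pdfplumber invocation."""
    return ["pdfplumber", str(pdf_path), "--format", fmt]


def convert_all(directory: Path, fmt: str = "json") -> List[Path]:
    """Run the CLI on every PDF in directory; return the files written."""
    written = []
    for pdf_path in sorted(directory.glob("*.pdf")):
        out_path = pdf_path.with_suffix(f".{fmt}")
        with out_path.open("w", encoding="utf-8") as out:
            # check=True raises CalledProcessError if the CLI exits non-zero
            subprocess.run(build_command(pdf_path, fmt), stdout=out, check=True)
        written.append(out_path)
    return written
```

Calling `convert_all(Path("."))` mirrors the shell loop; because of `check=True`, the first failing file stops the batch instead of silently leaving an empty output file behind.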

### Debugging and Analysis

```bash
# Get all available object attributes
pdfplumber document.pdf --types char --format json --indent 2 | head -20

# Analyze font usage
pdfplumber document.pdf --types char --include-attrs fontname size --format csv | sort | uniq -c

# Extract rectangle information (tables, forms)
pdfplumber document.pdf --types rect --include-attrs x0 top x1 bottom width height

# Comprehensive document analysis
pdfplumber document.pdf --types char rect line curve image --format json
```
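The font-usage pipeline above has a straightforward Python counterpart. The sketch below assumes CSV input with `fontname` and `size` columns and uses inline sample data rather than real CLI output; `count_fonts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_fonts(stream) -> Counter:
    """Count (fontname, size) pairs across CSV rows, like `sort | uniq -c`."""
    reader = csv.DictReader(stream)
    return Counter((row["fontname"], row["size"]) for row in reader)


# Hypothetical CSV input, shaped like output restricted to fontname and size.
sample = io.StringIO("fontname,size\nArial,12.0\nArial,12.0\nTimes,10.0\n")
for (fontname, size), n in count_fonts(sample).most_common():
    print(n, fontname, size)
```

Unlike the shell pipeline, `csv.DictReader` consumes the header row, so it is not counted as data.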

## Error Handling

The CLI provides informative error messages for common issues:

```bash
# Invalid file
pdfplumber nonexistent.pdf
# Error: Could not open file

# Invalid page range
pdfplumber document.pdf --pages 999
# Error: Page 999 not found

# Invalid JSON in laparams
pdfplumber document.pdf --laparams '{"invalid": json}'
# Error: Invalid JSON in laparams

# Malformed PDF
pdfplumber corrupted.pdf
# Error: Malformed PDF document
```

## Integration with Python Scripts

The CLI can be integrated into Python workflows:

```python
import json
import subprocess


def extract_pdf_data(pdf_path, pages=None, object_types=None):
    """Extract PDF data by running the CLI and parsing its JSON output."""
    cmd = ['pdfplumber', pdf_path, '--format', 'json']

    if pages:
        # --pages takes space-separated, 1-indexed values: one argv entry each
        cmd.extend(['--pages', *map(str, pages)])

    if object_types:
        # --types takes space-separated singular names such as "char"
        cmd.extend(['--types', *object_types])

    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode != 0:
        raise RuntimeError(f"CLI error: {result.stderr}")
    return json.loads(result.stdout)


# Usage
data = extract_pdf_data("document.pdf", pages=[1, 2], object_types=['char'])
```

## Performance Considerations

For large documents or batch processing:

```bash
# Process specific pages to reduce memory usage
pdfplumber large_document.pdf --pages 1-10

# Limit object types to improve processing speed
pdfplumber document.pdf --types char

# Reduce output size with attribute filtering
pdfplumber document.pdf --include-attrs text x0 top x1 bottom

# Use CSV format for better performance with large datasets
pdfplumber document.pdf --format csv
```