Tessl Tile for pypi/pdfplumber@0.11.0

# Command Line Interface

The pdfplumber command-line interface extracts text, exports page objects, and dumps the PDF structure tree, with output as CSV, JSON, or plain text.

## Capabilities

### Main CLI Function

Entry point for the pdfplumber command-line interface. It parses the arguments and writes the result to standard output.

```python { .api }
def main(args_raw=None):
    """
    CLI entry point with full argument parsing.

    Parameters:
    - args_raw: List[str], optional - Command line arguments (defaults to sys.argv[1:])

    Returns:
    None: Writes results to standard output
    """
```

### Command Line Usage

The pdfplumber CLI can be invoked in several ways:

```bash
# As installed command
pdfplumber document.pdf

# As Python module
python -m pdfplumber.cli document.pdf

# From Python code (shown here as a one-liner run from the shell)
python -c "import pdfplumber.cli; pdfplumber.cli.main(['document.pdf', '--format', 'json'])"
```

### Basic Arguments

Core arguments for specifying input and output behavior.

```bash
# Input file (optional; reads from stdin if omitted)
pdfplumber document.pdf

# Output format (csv, json, or text; csv is the default)
pdfplumber document.pdf --format json

# Save output to a file (results go to stdout, so use shell redirection)
pdfplumber document.pdf --format json > output.json
```

### Object Type Selection

Control which PDF objects to include in the output. Type names are singular and space-delimited; all available types are included by default.

```bash
# Include specific object types
pdfplumber document.pdf --types char rect line

# Common object types:
# - char: character objects
# - rect: rectangle objects
# - line: line objects
# - curve: curve objects
# - image: image objects
# - annot: annotations
# - edge: computed edges
```

### Attribute Filtering

Control which object attributes to include or exclude from output. Attribute names are space-delimited.

```bash
# Include only specific attributes
pdfplumber document.pdf --include-attrs text x0 top x1 bottom

# Exclude specific attributes
pdfplumber document.pdf --exclude-attrs object_type stream

# Common character attributes:
# - text: character text content
# - x0, top, x1, bottom: positioning
# - fontname: font family name
# - size: font size
# - adv: character advance width
```
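
The effect of attribute filtering on a single object can be sketched in plain Python. This is only an illustration: `filter_attrs` is a hypothetical helper, not part of pdfplumber, and treating "include and exclude given together" as an error is an assumption of this sketch.

```python
from typing import Any, Dict, Iterable, Optional


def filter_attrs(
    obj: Dict[str, Any],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Return a copy of obj restricted by an include or an exclude list."""
    if include is not None and exclude is not None:
        # Assumption of this sketch: the two filters are not combined.
        raise ValueError("use either include or exclude, not both")
    if include is not None:
        wanted = set(include)
        return {k: v for k, v in obj.items() if k in wanted}
    if exclude is not None:
        unwanted = set(exclude)
        return {k: v for k, v in obj.items() if k not in unwanted}
    return dict(obj)


# Illustrative character object (values are made up).
char = {"object_type": "char", "text": "H", "x0": 72.0, "top": 100.0, "fontname": "Arial"}

print(filter_attrs(char, include=["text", "x0", "top"]))
print(filter_attrs(char, exclude=["object_type", "fontname"]))
```

Both calls yield the same three keys here, which shows that the two options are two ways of expressing the same selection.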

### Page Selection

Process specific pages or page ranges. Page numbers are 1-indexed and space-delimited; hyphenated ranges are inclusive.

```bash
# Single page (1-indexed)
pdfplumber document.pdf --pages 1

# Multiple pages
pdfplumber document.pdf --pages 1 3 5

# Page ranges
pdfplumber document.pdf --pages 1-6

# Mixed ranges and individual pages
pdfplumber document.pdf --pages 1 3-6 11
```
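
How a list of page tokens expands into a flat page list can be sketched as follows. The sketch assumes each token is either a single page number or an inclusive `start-end` range; `expand_page_spec` and `expand_page_specs` are hypothetical names used for illustration.

```python
from itertools import chain
from typing import List


def expand_page_spec(spec: str) -> List[int]:
    """Expand one token: '7' -> [7], '3-6' -> [3, 4, 5, 6]."""
    if "-" in spec:
        start, end = map(int, spec.split("-"))
        return list(range(start, end + 1))  # inclusive upper bound
    return [int(spec)]


def expand_page_specs(specs: List[str]) -> List[int]:
    """Flatten several tokens into one list, preserving their order."""
    return list(chain.from_iterable(expand_page_spec(s) for s in specs))


print(expand_page_specs(["1", "3-6", "11"]))  # [1, 3, 4, 5, 6, 11]
```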

### Layout Analysis Parameters

Configure PDF layout analysis using LAParams settings.

```bash
# JSON-encoded LAParams
pdfplumber document.pdf --laparams '{"word_margin": 0.1, "char_margin": 2.0}'

# Common LAParams options:
# - word_margin: horizontal margin for word detection
# - char_margin: margin for character grouping
# - line_margin: margin for line detection
# - boxes_flow: flow threshold for text boxes
```

### Output Formatting

Control output precision and formatting.

```bash
# Set numeric precision (decimal places)
pdfplumber document.pdf --precision 2

# JSON indentation
pdfplumber document.pdf --format json --indent 2

# Pretty-printed JSON
pdfplumber document.pdf --format json --indent 4
```
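
To make the two formatting options concrete, here is a minimal sketch of rounding floats to a fixed number of decimal places and then pretty-printing with an indent. `round_floats` is a hypothetical helper and the sample values are made up; this illustrates the idea rather than pdfplumber's own serializer.

```python
import json
from typing import Any


def round_floats(value: Any, precision: int) -> Any:
    """Recursively round every float inside dicts and lists."""
    if isinstance(value, float):
        return round(value, precision)
    if isinstance(value, dict):
        return {k: round_floats(v, precision) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [round_floats(v, precision) for v in value]
    return value


# Illustrative character object with noisy coordinates.
char = {"text": "H", "x0": 72.123456, "top": 100.0, "size": 11.999999}

rounded = round_floats(char, 2)
print(json.dumps(rounded, indent=2))
```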

### Structure Tree Analysis

Extract and analyze the PDF structure tree for accessibility information. The `--structure` and `--structure-text` options are mutually exclusive and always produce JSON; all other arguments except `--pages`, `--laparams`, and `--indent` are ignored.

```bash
# Output structure tree as JSON
pdfplumber document.pdf --structure

# Include text content in structure tree
pdfplumber document.pdf --structure-text

# Limit the structure tree to selected pages and pretty-print it
pdfplumber document.pdf --structure --pages 1-3 --indent 2 > structure.json
```

## Output Formats

### CSV Format

Default output format providing tabular data suitable for spreadsheet analysis.

```bash
pdfplumber document.pdf --format csv
# Outputs CSV with columns for each object attribute
```

**Example CSV Output:**
```csv
object_type,page_number,x0,top,x1,bottom,text,fontname,size
char,1,72.0,100.0,80.0,110.0,"H","Arial",12.0
char,1,80.0,100.0,88.0,110.0,"e","Arial",12.0
char,1,88.0,100.0,94.0,110.0,"l","Arial",12.0
```
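
CSV output in this shape can be consumed with Python's standard `csv` module. The sketch below reuses the example rows above as an inline string and rebuilds the text in reading order; `read_chars` is a hypothetical helper, and real output will generally contain more columns.

```python
import csv
import io
from typing import Dict, List

# The example rows from this section, inlined so the sketch is self-contained.
SAMPLE = """object_type,page_number,x0,top,x1,bottom,text,fontname,size
char,1,72.0,100.0,80.0,110.0,"H","Arial",12.0
char,1,80.0,100.0,88.0,110.0,"e","Arial",12.0
char,1,88.0,100.0,94.0,110.0,"l","Arial",12.0
"""


def read_chars(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text and keep only the character rows."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row for row in rows if row["object_type"] == "char"]


chars = read_chars(SAMPLE)
# Sort top-to-bottom, then left-to-right, before joining the characters.
chars.sort(key=lambda row: (float(row["top"]), float(row["x0"])))
text = "".join(row["text"] for row in chars)
print(text)  # Hel
```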

### JSON Format

Structured output format ideal for programmatic processing.

```bash
pdfplumber document.pdf --format json
# Outputs the extracted objects as JSON
```

**Example JSON Output (abridged to character objects):**
```json
[
  {
    "object_type": "char",
    "page_number": 1,
    "x0": 72.0,
    "top": 100.0,
    "x1": 80.0,
    "bottom": 110.0,
    "text": "H",
    "fontname": "Arial",
    "size": 12.0
  }
]
```

### Text Format

Simple text extraction output.

```bash
pdfplumber document.pdf --format text
# Outputs extracted text content
```

## Advanced Usage Examples

### Extract Character Data

```bash
# Get all character data with position information
pdfplumber document.pdf --types char --format json --indent 2

# Get character text and positions only
pdfplumber document.pdf --types char --include-attrs text x0 top x1 bottom

# High-precision character coordinates
pdfplumber document.pdf --types char --precision 4
```

### Analyze Document Structure

```bash
# Get comprehensive object data
pdfplumber document.pdf --types char rect line curve --format json

# Focus on text elements
pdfplumber document.pdf --types char --include-attrs text fontname size x0 top

# Extract accessibility structure (always JSON; --format is ignored)
pdfplumber document.pdf --structure-text --indent 2
```

### Process Specific Pages

```bash
# Analyze first page only
pdfplumber document.pdf --pages 1 --format json

# Compare multiple pages
pdfplumber document.pdf --pages 1 2 3 --types char --include-attrs text page_number

# Process large document selectively
pdfplumber document.pdf --pages 10-20 --format csv
```

### Custom Layout Analysis

```bash
# Tight character grouping
pdfplumber document.pdf --laparams '{"char_margin": 1.0, "word_margin": 0.05}'

# Loose text flow detection
pdfplumber document.pdf --laparams '{"boxes_flow": 0.7, "word_margin": 0.2}'

# Combine with specific output
pdfplumber document.pdf --laparams '{"word_margin": 0.1}' --types char --format json
```

### Data Pipeline Integration

```bash
# Extract to structured data file
pdfplumber document.pdf --format json --indent 2 > document_data.json

# Create CSV for analysis
pdfplumber document.pdf --types char --include-attrs text x0 top fontname size > analysis.csv

# Process multiple files
for file in *.pdf; do
    pdfplumber "$file" --format json > "${file%.pdf}.json"
done
```
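
The batch loop can also be driven from Python. The sketch below separates planning (which command, which output file) from execution, so the plan can be inspected before anything runs. `plan_batch` and `run_plan` are hypothetical helpers, and `run_plan` assumes the `pdfplumber` command is available on `PATH`.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple

Plan = List[Tuple[List[str], Path]]


def plan_batch(pdf_paths: List[str]) -> Plan:
    """Pair each PDF with its CLI command and a sibling .json output path."""
    plan = []
    for pdf in map(Path, pdf_paths):
        cmd = ["pdfplumber", str(pdf), "--format", "json"]
        plan.append((cmd, pdf.with_suffix(".json")))
    return plan


def run_plan(plan: Plan) -> None:
    """Run each command, writing its stdout to the planned output file."""
    for cmd, out_path in plan:
        with open(out_path, "w", encoding="utf-8") as out:
            # check=True raises CalledProcessError on a non-zero exit status.
            subprocess.run(cmd, stdout=out, check=True)


plan = plan_batch(["report.pdf", "invoice.pdf"])
for cmd, out_path in plan:
    print(" ".join(cmd), ">", out_path)
```

Calling `run_plan(plan)` would then execute the commands; it is left out of the example so the sketch has no side effects.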

### Debugging and Analysis

```bash
# Get all available object attributes
pdfplumber document.pdf --types char --format json --indent 2 | head -20

# Analyze font usage
pdfplumber document.pdf --types char --include-attrs fontname size --format csv | sort | uniq -c

# Extract rectangle information (tables, forms)
pdfplumber document.pdf --types rect --include-attrs x0 top x1 bottom width height

# Comprehensive document analysis
pdfplumber document.pdf --types char rect line curve image --format json
```
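
The font-usage tally can also be done in Python, which avoids counting the CSV header as if it were data. This is a sketch over made-up sample rows; `count_fonts` is a hypothetical helper that expects CSV text with `fontname` and `size` columns.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for CSV produced with the fontname and size columns.
SAMPLE = """fontname,size
Arial,12.0
Arial,12.0
Arial-Bold,14.0
Arial,12.0
"""


def count_fonts(csv_text: str) -> Counter:
    """Count how many rows use each (fontname, size) combination."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["fontname"], float(row["size"])) for row in reader)


usage = count_fonts(SAMPLE)
for (fontname, size), count in usage.most_common():
    print(f"{count:>4}  {fontname} {size}")
```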

## Error Handling

The CLI reports common problems through argparse usage errors or Python exceptions. The comments below describe the expected behavior; the exact message text depends on the Python and pdfplumber versions:

```bash
# Missing input file: argparse cannot open the path and exits with a usage error
pdfplumber nonexistent.pdf

# Page number beyond the end of the document: no page matches, so the output typically contains no objects
pdfplumber document.pdf --pages 999

# Invalid JSON in --laparams: the value cannot be parsed and the CLI exits with a usage error
pdfplumber document.pdf --laparams '{"invalid": json}'

# Malformed PDF: the underlying PDF parser raises an exception
pdfplumber corrupted.pdf
```
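
How a JSON-valued option is rejected at parse time can be reproduced with the standard library alone. The sketch assumes the option is declared with `type=json.loads`; the parser name `demo` and the `try_parse` helper are hypothetical and exist only for this illustration.

```python
import argparse
import json
from typing import List, Union


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser("demo")
    # json.loads raises a ValueError subclass on bad input, which argparse
    # turns into a usage error.
    parser.add_argument("--laparams", type=json.loads)
    return parser


def try_parse(argv: List[str]) -> Union[argparse.Namespace, int]:
    """Return the parsed namespace, or the exit status if parsing fails."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit as exc:
        return exc.code


ok = try_parse(["--laparams", '{"word_margin": 0.1}'])
bad = try_parse(["--laparams", '{"invalid": json}'])

print(ok.laparams)  # {'word_margin': 0.1}
print(bad)          # 2 (argparse's exit status for usage errors)
```

The failing call also prints a usage message to standard error, which is where a calling script should look for diagnostics.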

## Integration with Python Scripts

The CLI can be integrated into Python workflows:

```python
import json
import subprocess

def extract_pdf_data(pdf_path, pages=None, object_types=None):
    """Extract PDF data by running the pdfplumber CLI as a subprocess."""
    cmd = ['pdfplumber', pdf_path, '--format', 'json']

    if pages:
        # --pages takes space-delimited, 1-indexed page numbers
        cmd.extend(['--pages', *map(str, pages)])

    if object_types:
        # --types takes space-delimited, singular type names (e.g. 'char')
        cmd.extend(['--types', *object_types])

    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode != 0:
        raise RuntimeError(f"CLI error: {result.stderr}")
    return json.loads(result.stdout)

# Usage
data = extract_pdf_data("document.pdf", pages=[1, 2], object_types=['char'])
```

## Performance Considerations

For large documents or batch processing:

```bash
# Process specific pages to reduce memory usage
pdfplumber large_document.pdf --pages 1-10

# Limit object types to improve processing speed
pdfplumber document.pdf --types char

# Reduce output size with attribute filtering
pdfplumber document.pdf --include-attrs text x0 top x1 bottom

# CSV output is usually more compact than JSON for large documents
pdfplumber document.pdf --format csv
```
