Detailed PDF analysis and extraction library with comprehensive table detection and visual debugging capabilities.
Complete command-line interface for PDF processing with support for text extraction, object export, structure analysis, and various output formats.
Entry point for the pdfplumber command-line interface with comprehensive argument parsing.
def main(args_raw=None):
    """
    CLI entry point with full argument parsing.

    Parameters:
    - args_raw: List[str], optional - Command line arguments (defaults to sys.argv[1:])

    Returns:
    None: Outputs results to specified destination
    """

The pdfplumber CLI can be invoked in several ways:
# As installed command
pdfplumber document.pdf
# As Python module
python -m pdfplumber.cli document.pdf
# From Python code
import pdfplumber.cli
pdfplumber.cli.main(['document.pdf', '--format', 'json'])

Core arguments for specifying input and output behavior.
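To illustrate how an optional input file and a `--format` choice fit together, here is a simplified `argparse` sketch. It is a stand-in written for this guide under those two assumptions, not pdfplumber's actual parser; `build_parser` and the program name `pdfplumber-sketch` are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical, simplified parser; not pdfplumber's real implementation."""
    parser = argparse.ArgumentParser("pdfplumber-sketch")
    # Optional positional input; None here stands in for "read from stdin".
    parser.add_argument("infile", nargs="?", default=None)
    # Output format limited to the three documented choices, CSV by default.
    parser.add_argument("--format", choices=["csv", "json", "text"], default="csv")
    return parser


args = build_parser().parse_args(["document.pdf", "--format", "json"])
print(args.infile, args.format)
```

An unsupported format value makes `argparse` exit with a usage error, which is one way a CLI can reject bad input before any PDF is opened.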
# Input file (read from stdin if omitted)
pdfplumber document.pdf
# Output format (csv, json, or text)
pdfplumber document.pdf --format json
# Write output to a file by redirecting stdout
pdfplumber document.pdf --format json > output.json

Control which PDF objects to include in the output.
# Include specific object types (space-delimited, singular names)
pdfplumber document.pdf --types char rect line
# Common object types:
# - char: character objects
# - rect: rectangle objects
# - line: line objects
# - curve: curve objects
# - image: image objects
# - annot: annotations
# - edge: computed edges

Control which object attributes to include or exclude from output.
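The include/exclude behavior can be pictured as a filter over each object's attribute dictionary. The following is a minimal sketch of that idea, not the library's implementation; `filter_attrs` is a hypothetical helper and the sample object is made up for illustration.

```python
def filter_attrs(obj, include=None, exclude=None):
    """Return a copy of obj keeping only `include` keys, minus `exclude` keys."""
    # With no include list, start from every attribute the object has.
    keys = obj.keys() if include is None else [k for k in include if k in obj]
    excluded = set(exclude or [])
    return {k: obj[k] for k in keys if k not in excluded}


# Made-up character object for illustration.
char = {"object_type": "char", "text": "H", "x0": 72.0, "top": 100.0, "fontname": "Arial"}
print(filter_attrs(char, include=["text", "x0", "top"]))
print(filter_attrs(char, exclude=["object_type"]))
```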
# Include only specific attributes (space-delimited)
pdfplumber document.pdf --include-attrs text x0 top x1 bottom
# Exclude specific attributes
pdfplumber document.pdf --exclude-attrs object_type stream
# Common character attributes:
# - text: character text content
# - x0, top, x1, bottom: positioning
# - fontname: font name
# - size: font size
# - adv: character advance width

Process specific pages or page ranges.
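A page selection made of single pages and hyphenated ranges can be expanded into an explicit list of page numbers. The sketch below shows one way to do that, assuming each spec is either a single integer or an inclusive `start-end` range; `expand_page_specs` is a hypothetical name, not a pdfplumber function.

```python
def expand_page_specs(specs):
    """Expand specs like ["1", "3-5"] into [1, 3, 4, 5] (ranges inclusive)."""
    pages = []
    for spec in specs:
        if "-" in spec:
            # "3-5" becomes 3, 4, 5
            start, end = (int(part) for part in spec.split("-", 1))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(spec))
    return pages


print(expand_page_specs(["1", "3-5", "10"]))
```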
# Single page (page numbers are 1-indexed)
pdfplumber document.pdf --pages 1
# Multiple pages (space-delimited)
pdfplumber document.pdf --pages 1 3 5
# Page ranges (inclusive)
pdfplumber document.pdf --pages 1-5
# Mixed ranges and individual pages
pdfplumber document.pdf --pages 1 3-5 10

Configure PDF layout analysis using LAParams settings.
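Because `--laparams` expects a JSON-encoded string, writing the JSON by hand invites quoting mistakes. One option is to generate the argument from Python. This is a sketch using only the standard library; `laparams_argument` is a hypothetical helper.

```python
import json
import shlex


def laparams_argument(**params):
    """Serialize layout parameters to a JSON string for --laparams."""
    return json.dumps(params, sort_keys=True)


value = laparams_argument(word_margin=0.1, char_margin=2.0)
# shlex.quote wraps the JSON in single quotes so a POSIX shell passes it intact.
command = f"pdfplumber document.pdf --laparams {shlex.quote(value)}"
print(command)
```

When the command is run through `subprocess.run` with a list of arguments, the JSON string can be passed as-is and the quoting step is unnecessary.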
# JSON-encoded LAParams
pdfplumber document.pdf --laparams '{"word_margin": 0.1, "char_margin": 2.0}'
# Common LAParams options:
# - word_margin: horizontal margin for word detection
# - char_margin: margin for character grouping
# - line_margin: margin for line detection
# - boxes_flow: flow threshold for text boxes

Control output precision and formatting.
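The effect of a precision setting combined with JSON indentation can be reproduced on already-extracted data. The sketch below rounds every float in a nested structure and pretty-prints the result; `round_floats` is a hypothetical helper and the sample values are made up, so this approximates the idea rather than the CLI's own code path.

```python
import json


def round_floats(value, precision):
    """Recursively round floats in nested dicts/lists to `precision` places."""
    if isinstance(value, float):
        return round(value, precision)
    if isinstance(value, dict):
        return {k: round_floats(v, precision) for k, v in value.items()}
    if isinstance(value, list):
        return [round_floats(v, precision) for v in value]
    # Strings, ints, None, and other types pass through unchanged.
    return value


# Made-up character object for illustration.
char = {"text": "H", "x0": 72.123456, "top": 100.987654}
print(json.dumps(round_floats(char, 2), indent=2))
```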
# Set numeric precision (decimal places)
pdfplumber document.pdf --precision 2
# JSON indentation
pdfplumber document.pdf --format json --indent 2
# Pretty-printed JSON
pdfplumber document.pdf --format json --indent 4

Extract and analyze the PDF structure tree for accessibility information.
# Output structure tree as JSON
pdfplumber document.pdf --structure
# Include text content in structure tree
pdfplumber document.pdf --structure-text
# Save the structure tree to a file (the tree is always written as JSON, so --format does not apply)
pdfplumber document.pdf --structure > structure.json

CSV is the default output format, providing tabular data suitable for spreadsheet analysis.
pdfplumber document.pdf --format csv
# Outputs CSV with columns for each object attribute

Example CSV Output:
object_type,page_number,x0,top,x1,bottom,text,fontname,size
char,1,72.0,100.0,80.0,110.0,"H","Arial",12.0
char,1,80.0,100.0,88.0,110.0,"e","Arial",12.0
char,1,88.0,100.0,94.0,110.0,"l","Arial",12.0

JSON is a structured output format, ideal for programmatic processing.
pdfplumber document.pdf --format json
# Outputs JSON array of objects

Example JSON Output:
[
  {
    "object_type": "char",
    "page_number": 1,
    "x0": 72.0,
    "top": 100.0,
    "x1": 80.0,
    "bottom": 110.0,
    "text": "H",
    "fontname": "Arial",
    "size": 12.0
  }
]

Simple text extraction output.
pdfplumber document.pdf --format text
# Outputs extracted text content

# Get all character data with position information
pdfplumber document.pdf --types char --format json --indent 2
# Get character text and positions only
pdfplumber document.pdf --types char --include-attrs text x0 top x1 bottom
# High-precision character coordinates
pdfplumber document.pdf --types char --precision 4

# Get comprehensive object data
pdfplumber document.pdf --types char rect line curve --format json
# Focus on text elements
pdfplumber document.pdf --types char --include-attrs text fontname size x0 top
# Extract accessibility structure (always written as JSON)
pdfplumber document.pdf --structure-text

# Analyze first page only
pdfplumber document.pdf --pages 1 --format json
# Compare multiple pages
pdfplumber document.pdf --pages 1 2 3 --types char --include-attrs text page_number
# Process large document selectively
pdfplumber document.pdf --pages 10-20 --format csv

# Tight character grouping
pdfplumber document.pdf --laparams '{"char_margin": 1.0, "word_margin": 0.05}'
# Loose text flow detection
pdfplumber document.pdf --laparams '{"boxes_flow": 0.7, "word_margin": 0.2}'
# Combine with specific output
pdfplumber document.pdf --laparams '{"word_margin": 0.1}' --types char --format json

# Extract to structured data file
pdfplumber document.pdf --format json --indent 2 > document_data.json
# Create CSV for analysis
pdfplumber document.pdf --types char --include-attrs text x0 top fontname size > analysis.csv
# Process multiple files
for file in *.pdf; do
    pdfplumber "$file" --format json > "${file%.pdf}.json"
done

# Get all available object attributes
pdfplumber document.pdf --types char --format json --indent 2 | head -20
# Analyze font usage
pdfplumber document.pdf --types char --include-attrs fontname size --format csv | sort | uniq -c
# Extract rectangle information (tables, forms)
pdfplumber document.pdf --types rect --include-attrs x0 top x1 bottom width height
# Comprehensive document analysis
pdfplumber document.pdf --types char rect line curve image --format json

The CLI reports errors for common issues (the messages shown here are illustrative, and exact wording may differ):
# Invalid file
pdfplumber nonexistent.pdf
# Error: Could not open file
# Invalid page range
pdfplumber document.pdf --pages 999
# Error: Page 999 not found
# Invalid JSON in laparams
pdfplumber document.pdf --laparams '{"invalid": json}'
# Error: Invalid JSON in laparams
# Malformed PDF
pdfplumber corrupted.pdf
# Error: Malformed PDF document

The CLI can be integrated into Python workflows:
import json
import subprocess


def extract_pdf_data(pdf_path, pages=None, object_types=None):
    """Extract PDF data using the CLI."""
    cmd = ['pdfplumber', pdf_path, '--format', 'json']
    if pages:
        # --pages takes space-delimited values, so pass each page as its own argument
        cmd.extend(['--pages', *map(str, pages)])
    if object_types:
        # --types is space-delimited as well
        cmd.extend(['--types', *object_types])
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"CLI error: {result.stderr}")
    return json.loads(result.stdout)


# Usage
data = extract_pdf_data("document.pdf", pages=[1, 2], object_types=['char'])

For large documents or batch processing:
# Process specific pages to reduce memory usage
pdfplumber large_document.pdf --pages 1-10
# Limit object types to improve processing speed
pdfplumber document.pdf --types char
# Reduce output size with attribute filtering
pdfplumber document.pdf --include-attrs text x0 top x1 bottom
# Use CSV format for better performance with large datasets
pdfplumber document.pdf --format csv
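For the tip about processing specific pages of a large document, one approach is to split the document into fixed-size page ranges and run the CLI once per range. The sketch below only builds the range specs and argument lists. `page_chunks` is a hypothetical helper, the first and last page numbers are supplied by the caller, and hyphenated ranges are assumed to be inclusive.

```python
def page_chunks(first, last, size):
    """Split the inclusive page span first..last into "start-end" specs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    chunks = []
    for start in range(first, last + 1, size):
        # The final chunk may be shorter than `size`.
        end = min(start + size - 1, last)
        chunks.append(f"{start}-{end}")
    return chunks


# Build one argument list per chunk; each could be passed to subprocess.run.
for spec in page_chunks(1, 25, 10):
    print(["pdfplumber", "large_document.pdf", "--pages", spec, "--format", "csv"])
```

Running one range at a time keeps each invocation's output small, at the cost of reopening the file for every chunk.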