tessl/pypi-pdfplumber

Detailed PDF analysis and extraction library with comprehensive table detection and visual debugging capabilities.

Command Line Interface

Name: tessl/pypi-pdfplumber
Author: tessl

Complete command-line interface for PDF processing with support for text extraction, object export, structure analysis, and various output formats.

Capabilities

Main CLI Function

Entry point for the pdfplumber command-line interface with comprehensive argument parsing.

def main(args_raw=None):
    """
    CLI entry point with full argument parsing.

    Parameters:
    - args_raw: List[str], optional - Command line arguments (defaults to sys.argv[1:])

    Returns:
    None. Results are written to standard output.
    """

Command Line Usage

The pdfplumber CLI can be invoked in several ways:

# As installed command
pdfplumber document.pdf

# As Python module
python -m pdfplumber.cli document.pdf

# From Python code
import pdfplumber.cli
pdfplumber.cli.main(['document.pdf', '--format', 'json'])
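
Because the entry point prints its results instead of returning them, calling it from Python usually means capturing standard output. The following is a minimal sketch, assuming the output goes through `sys.stdout`. `capture_cli_output` and `fake_main` are hypothetical names used only for illustration; the entry point is passed in as a parameter so the sketch can be tried without a PDF at hand.

```python
import contextlib
import io


def capture_cli_output(entry_point, argv):
    """Run a CLI entry point in-process and return the text it printed.

    `entry_point` is any callable that accepts a list of argument strings,
    for example pdfplumber.cli.main (assumed here to write via sys.stdout).
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        entry_point(argv)
    return buffer.getvalue()


def fake_main(argv):
    # Stand-in for pdfplumber.cli.main so the sketch runs without a PDF.
    print("received: " + " ".join(argv))


output = capture_cli_output(fake_main, ["document.pdf", "--format", "json"])
```

With pdfplumber installed, passing `pdfplumber.cli.main` in place of `fake_main` should yield the JSON text, which can then be handed to `json.loads`.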

Basic Arguments

Core arguments for specifying input and output behavior.

# Input file (read from stdin when omitted)
pdfplumber document.pdf

# Output format: csv (default), json, or text
pdfplumber document.pdf --format json

# Write output to a file with shell redirection
pdfplumber document.pdf --format json > output.json

Object Type Selection

Control which PDF objects to include in the output.

# Include specific object types (space-separated, singular names)
pdfplumber document.pdf --types char rect line

# Common object types:
# - char: character objects
# - rect: rectangle objects
# - line: line objects
# - curve: curve objects
# - image: image objects
# - annot: annotations
# - edge: computed edges

Attribute Filtering

Control which object attributes to include or exclude from output.

# Include only specific attributes (space-separated)
pdfplumber document.pdf --include-attrs text x0 top x1 bottom

# Exclude specific attributes
pdfplumber document.pdf --exclude-attrs object_type stream

# Common character attributes:
# - text: character text content
# - x0, top, x1, bottom: positioning
# - fontname: font family name
# - size: font size
# - adv: character advance width

Page Selection

Process specific pages or page ranges.

# Single page (page numbers are 1-indexed)
pdfplumber document.pdf --pages 1

# Multiple pages (space-separated)
pdfplumber document.pdf --pages 1 3 5

# Page ranges
pdfplumber document.pdf --pages 1-5

# Mixed ranges and individual pages
pdfplumber document.pdf --pages 1 3-5 10
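
How a mixed page specification expands into individual page numbers can be sketched as follows. This assumes a hyphenated range is inclusive at both ends and that each specification arrives as its own string. `expand_page_specs` is a hypothetical helper written for illustration.

```python
def expand_page_specs(specs):
    """Expand strings such as "2" or "4-6" into a flat list of page numbers."""
    pages = []
    for spec in specs:
        if "-" in spec:
            start, end = (int(part) for part in spec.split("-"))
            # Assumed inclusive: "4-6" covers pages 4, 5, and 6.
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(spec))
    return pages


expanded = expand_page_specs(["2", "4-6", "10"])
```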

Layout Analysis Parameters

Configure PDF layout analysis using LAParams settings.

# JSON-encoded LAParams
pdfplumber document.pdf --laparams '{"word_margin": 0.1, "char_margin": 2.0}'

# Common LAParams options:
# - word_margin: horizontal margin for word detection
# - char_margin: margin for character grouping
# - line_margin: margin for line detection
# - boxes_flow: flow threshold for text boxes
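
Hand-writing JSON inside shell quotes is easy to get wrong, so it can help to build the `--laparams` value programmatically. The sketch below uses only the standard library. `laparams_argument` is a hypothetical helper, and the parameter values are the ones from the example above.

```python
import json
import shlex


def laparams_argument(params):
    """Serialize layout parameters to the JSON text passed to --laparams."""
    return json.dumps(params, sort_keys=True)


params = {"word_margin": 0.1, "char_margin": 2.0}
arg = laparams_argument(params)

# Argument list suitable for subprocess.run (no shell quoting needed).
command = ["pdfplumber", "document.pdf", "--laparams", arg]

# Equivalent single line for pasting into a POSIX shell.
shell_line = shlex.join(command)
```

Passing `command` as a list sidesteps quoting entirely, while `shell_line` shows the same call with the JSON safely quoted.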

Output Formatting

Control output precision and formatting.

# Set numeric precision (decimal places)
pdfplumber document.pdf --precision 2

# JSON indentation
pdfplumber document.pdf --format json --indent 2

# Pretty-printed JSON
pdfplumber document.pdf --format json --indent 4

Structure Tree Analysis

Extract and analyze PDF structure tree for accessibility information.

# Output structure tree as JSON
pdfplumber document.pdf --structure

# Include text content in structure tree
pdfplumber document.pdf --structure-text

# Structure output replaces regular object extraction; only
# --pages, --laparams, and --indent still apply
pdfplumber document.pdf --structure --indent 2 > structure.json

Output Formats

CSV Format

Default output format providing tabular data suitable for spreadsheet analysis.

pdfplumber document.pdf --format csv
# Outputs CSV with columns for each object attribute

Example CSV Output:

object_type,page_number,x0,top,x1,bottom,text,fontname,size
char,1,72.0,100.0,80.0,110.0,"H","Arial",12.0
char,1,80.0,100.0,88.0,110.0,"e","Arial",12.0
char,1,88.0,100.0,94.0,110.0,"l","Arial",12.0
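
CSV output of this shape can be loaded with the standard `csv` module. The sketch below reuses the example rows above as its input and assumes the numeric columns are always filled for `char` rows. `load_chars` is a hypothetical helper.

```python
import csv
import io

SAMPLE_CSV = '''object_type,page_number,x0,top,x1,bottom,text,fontname,size
char,1,72.0,100.0,80.0,110.0,"H","Arial",12.0
char,1,80.0,100.0,88.0,110.0,"e","Arial",12.0
char,1,88.0,100.0,94.0,110.0,"l","Arial",12.0
'''

NUMERIC_COLUMNS = ("x0", "top", "x1", "bottom", "size")


def load_chars(csv_text):
    """Parse CSV text and return char rows with numeric fields converted."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["object_type"] != "char":
            continue
        for column in NUMERIC_COLUMNS:
            row[column] = float(row[column])
        row["page_number"] = int(row["page_number"])
        rows.append(row)
    return rows


chars = load_chars(SAMPLE_CSV)
text = "".join(c["text"] for c in chars)
```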

JSON Format

Structured output format ideal for programmatic processing.

pdfplumber document.pdf --format json
# Outputs JSON array of objects

Example JSON Output:

[
  {
    "object_type": "char",
    "page_number": 1,
    "x0": 72.0,
    "top": 100.0,
    "x1": 80.0,
    "bottom": 110.0,
    "text": "H",
    "fontname": "Arial",
    "size": 12.0
  }
]
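
Character records like the one above can be regrouped into lines of text. The sketch below assumes the flat array layout shown in the example; if the actual output nests objects under pages, the records would need to be collected first. The sample records are made up for illustration, and `chars_to_lines` is a hypothetical helper.

```python
import json
from itertools import groupby

# Made-up sample records in the flat layout of the example above.
SAMPLE_JSON = '''[
  {"object_type": "char", "page_number": 1, "x0": 80.0, "top": 100.0, "text": "e"},
  {"object_type": "char", "page_number": 1, "x0": 72.0, "top": 100.0, "text": "H"},
  {"object_type": "char", "page_number": 1, "x0": 72.0, "top": 120.0, "text": "!"}
]'''


def line_key(record):
    # Characters on the same page with the same `top` share a line.
    return (record["page_number"], record["top"])


def chars_to_lines(records):
    """Group char records by page and vertical position, left to right."""
    chars = [r for r in records if r.get("object_type") == "char"]
    chars.sort(key=lambda r: (r["page_number"], r["top"], r["x0"]))
    return ["".join(c["text"] for c in group)
            for _, group in groupby(chars, key=line_key)]


lines = chars_to_lines(json.loads(SAMPLE_JSON))
```

Grouping on exact `top` equality keeps the sketch short; real documents usually need a small tolerance when comparing vertical positions.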

Text Format

Simple text extraction output.

pdfplumber document.pdf --format text
# Outputs extracted text content

Advanced Usage Examples

Extract Character Data

# Get all character data with position information
pdfplumber document.pdf --types char --format json --indent 2

# Get character text and positions only
pdfplumber document.pdf --types char --include-attrs text x0 top x1 bottom

# High-precision character coordinates
pdfplumber document.pdf --types char --precision 4

Analyze Document Structure

# Get comprehensive object data
pdfplumber document.pdf --types char rect line curve --format json

# Focus on text elements
pdfplumber document.pdf --types char --include-attrs text fontname size x0 top

# Extract accessibility structure
pdfplumber document.pdf --structure-text

Process Specific Pages

# Analyze first page only
pdfplumber document.pdf --pages 1 --format json

# Compare multiple pages
pdfplumber document.pdf --pages 1 2 3 --types char --include-attrs text page_number

# Process large document selectively
pdfplumber document.pdf --pages 10-20 --format csv

Custom Layout Analysis

# Tight character grouping
pdfplumber document.pdf --laparams '{"char_margin": 1.0, "word_margin": 0.05}'

# Loose text flow detection
pdfplumber document.pdf --laparams '{"boxes_flow": 0.7, "word_margin": 0.2}'

# Combine with specific output
pdfplumber document.pdf --laparams '{"word_margin": 0.1}' --types char --format json

Data Pipeline Integration

# Extract to structured data file
pdfplumber document.pdf --format json --indent 2 > document_data.json

# Create CSV for analysis
pdfplumber document.pdf --types char --include-attrs text x0 top fontname size > analysis.csv

# Process multiple files
for file in *.pdf; do
    pdfplumber "$file" --format json > "${file%.pdf}.json"
done
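
The same batch loop can be written in Python when the surrounding pipeline is already a script. The sketch below separates planning the commands from running them, so the plan can be inspected first. `build_batch_commands` and `run_jobs` are hypothetical helpers, and the file names are placeholders.

```python
import subprocess
from pathlib import Path


def build_batch_commands(pdf_paths):
    """Pair each PDF with a CLI command and a .json output path."""
    jobs = []
    for pdf_path in map(Path, pdf_paths):
        output_path = pdf_path.with_suffix(".json")
        command = ["pdfplumber", str(pdf_path), "--format", "json"]
        jobs.append((command, output_path))
    return jobs


def run_jobs(jobs):
    """Run each command, redirecting its stdout into the output file."""
    for command, output_path in jobs:
        with open(output_path, "w", encoding="utf-8") as handle:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, stdout=handle, check=True)


jobs = build_batch_commands(["reports/a.pdf", "b.pdf"])
```

Calling `run_jobs(build_batch_commands(Path(".").glob("*.pdf")))` would mirror the shell loop above.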

Debugging and Analysis

# Get all available object attributes
pdfplumber document.pdf --types char --format json --indent 2 | head -20

# Analyze font usage
pdfplumber document.pdf --types char --include-attrs fontname size --format csv | sort | uniq -c

# Extract rectangle information (tables, forms)
pdfplumber document.pdf --types rect --include-attrs x0 top x1 bottom width height

# Comprehensive document analysis
pdfplumber document.pdf --types char rect line curve image --format json
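
The font usage pipeline above also counts the CSV header row as if it were data. A small Python tally avoids that, because `csv.DictReader` consumes the header. The sketch below assumes the CSV contains `fontname` and `size` columns; the sample rows are made up, and `count_fonts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_fonts(csv_text):
    """Count characters per (fontname, size) pair in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["fontname"], row["size"]) for row in reader)


# Made-up sample in the shape produced by restricting output to two columns.
sample = "fontname,size\nArial,12.0\nArial,12.0\nTimes,10.0\n"
font_counts = count_fonts(sample)
most_used = font_counts.most_common(1)[0]
```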

Error Handling

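The CLI reports errors for common problems. The messages shown here are illustrative; the exact wording depends on the pdfplumber version and its underlying libraries: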

# Invalid file
pdfplumber nonexistent.pdf
# Error: Could not open file

# Invalid page range
pdfplumber document.pdf --pages 999
# Error: Page 999 not found

# Invalid JSON in laparams
pdfplumber document.pdf --laparams '{"invalid": json}'
# Error: Invalid JSON in laparams

# Malformed PDF
pdfplumber corrupted.pdf
# Error: Malformed PDF document

Integration with Python Scripts

The CLI can be integrated into Python workflows:

import json
import subprocess

def extract_pdf_data(pdf_path, pages=None, object_types=None):
    """Extract PDF data by invoking the pdfplumber CLI."""
    cmd = ['pdfplumber', pdf_path, '--format', 'json']

    if pages:
        # Each page number is passed as a separate argument
        cmd.append('--pages')
        cmd.extend(str(page) for page in pages)

    if object_types:
        # Each object type is passed as a separate argument
        cmd.append('--types')
        cmd.extend(object_types)

    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode != 0:
        raise RuntimeError(f"CLI error: {result.stderr}")
    return json.loads(result.stdout)

# Usage (page numbers are 1-indexed)
data = extract_pdf_data("document.pdf", pages=[1, 2], object_types=['char'])

Performance Considerations

For large documents or batch processing:

# Process specific pages to reduce memory usage
pdfplumber large_document.pdf --pages 1-10

# Limit object types to improve processing speed
pdfplumber document.pdf --types char

# Reduce output size with attribute filtering
pdfplumber document.pdf --include-attrs text x0 top x1 bottom

# Use CSV format for more compact output on large datasets
pdfplumber document.pdf --format csv
