tessl/pypi-pypdfium2

Python bindings to PDFium for comprehensive PDF manipulation, rendering, and processing

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:pypi/pypdfium2@4.30.x

To install, run

npx @tessl/cli install tessl/pypi-pypdfium2@4.30.0

pypdfium2

pypdfium2 provides Python bindings to PDFium for PDF manipulation, rendering, and processing. Built on Google's PDFium library, it offers both high-level helper classes for common PDF operations and low-level raw bindings for advanced functionality.

Package Information

Package Name: pypdfium2
Language: Python
Installation: pip install pypdfium2
Python Requirements: Python 3.6+

Core Imports

import pypdfium2 as pdfium

For direct access to specific classes:

from pypdfium2 import PdfDocument, PdfPage, PdfBitmap

For version information:

from pypdfium2 import PYPDFIUM_INFO, PDFIUM_INFO

Basic Usage

import pypdfium2 as pdfium

# Open a PDF document
pdf = pdfium.PdfDocument("document.pdf")

# Get basic information
print(f"Pages: {len(pdf)}")
print(f"Version: {pdf.get_version()}")
print(f"Metadata: {pdf.get_metadata_dict()}")

# Render first page to image
page = pdf[0]
bitmap = page.render(scale=2.0)
pil_image = bitmap.to_pil()
pil_image.save("page1.png")

# Extract text from page
textpage = page.get_textpage()
text = textpage.get_text_range()
print(f"Page text: {text}")

# Clean up
pdf.close()

Architecture

pypdfium2 has a layered architecture:

Helper Classes: High-level Python API (PdfDocument, PdfPage, PdfBitmap, etc.) providing intuitive interfaces for common operations
Raw Bindings: Direct access to PDFium C API functions through pypdfium2.raw module
Type System: Named tuples and data classes for structured information (PdfBitmapInfo, ImageInfo, etc.)
Resource Management: Automatic cleanup with context managers and explicit close() methods
Multi-format Support: PDF reading/writing, image rendering (PIL, NumPy), text extraction

This design enables both simple high-level operations and advanced low-level manipulation while maintaining compatibility with the broader Python ecosystem.

Capabilities

Document Management

Core PDF document operations including loading, creating, saving, and metadata manipulation. Supports password-protected PDFs, form handling, and file attachments.

class PdfDocument:
    def __init__(self, input_data, password=None, autoclose=False): ...
    @classmethod
    def new(cls): ...
    def __len__(self) -> int: ...
    def save(self, dest, version=None, flags=...): ...
    def get_metadata_dict(self, skip_empty=False) -> dict: ...
    def is_tagged(self) -> bool: ...
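
As a rough illustration of the signatures above, the following sketch creates an empty document, adds blank pages, saves it to an in-memory buffer, and reloads it. `new_page(width, height)` is not listed above and is assumed from pypdfium2 4.x; the helper name `roundtrip_page_count` is hypothetical.

```python
import io


def roundtrip_page_count(n_pages, width=595, height=842):
    """Build a blank PDF in memory, reload it, and return its page count."""
    # Imported lazily so that defining this helper does not itself require pypdfium2.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument.new()
    for _ in range(n_pages):
        # new_page() is an assumption based on pypdfium2 4.x; sizes are in
        # PDF points (1/72 inch). The page handle is closed right away.
        pdf.new_page(width, height).close()

    buffer = io.BytesIO()
    pdf.save(buffer)  # dest may be a file path or a writable byte buffer
    pdf.close()

    # Reload from raw bytes to confirm that the saved data is a valid PDF.
    reloaded = pdfium.PdfDocument(buffer.getvalue())
    try:
        return len(reloaded)
    finally:
        reloaded.close()
```

Calling `roundtrip_page_count(3)` should give 3 if the installed version behaves as assumed.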

Page Manipulation

Page-level operations including rendering, rotation, dimension management, and bounding box manipulation. Supports various rendering formats and customization options.

class PdfPage:
    def get_size(self) -> tuple[float, float]: ...
    def render(self, rotation=0, scale=1, ...) -> PdfBitmap: ...
    def get_rotation(self) -> int: ...
    def set_rotation(self, rotation): ...
    def get_mediabox(self, fallback_ok=True) -> tuple | None: ...
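
The `scale` argument of `render()` multiplies the page size, which is measured in PDF points (1/72 inch), so a scale of 1 corresponds to 72 DPI. The helpers below are hypothetical plain-Python names showing how a target DPI might be converted to a scale and how the output size in pixels could be estimated. Rounding up with `math.ceil` is an assumption about how fractional sizes are handled.

```python
import math

POINTS_PER_INCH = 72  # PDF user space unit: 1 point = 1/72 inch


def scale_for_dpi(dpi):
    """Convert a target resolution in DPI to a render() scale factor."""
    return dpi / POINTS_PER_INCH


def estimated_pixel_size(page_size, scale):
    """Estimate bitmap dimensions for a page size given in points."""
    width, height = page_size
    # Assumption: fractional pixel sizes are rounded up.
    return math.ceil(width * scale), math.ceil(height * scale)


letter = (612, 792)  # US Letter in points, as get_size() would report it
scale = scale_for_dpi(144)
print(scale, estimated_pixel_size(letter, scale))
```

With pypdfium2 this might be used as `page.render(scale=scale_for_dpi(300))`.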

Text Processing

Comprehensive text extraction and search capabilities with support for bounded text extraction, character-level positioning, and full-text search.

class PdfTextPage:
    def get_text_range(self, index=0, count=-1, errors="ignore", force_this=False) -> str: ...
    def get_text_bounded(self, left=None, bottom=None, right=None, top=None, errors="ignore") -> str: ...
    def search(self, text, index=0, match_case=False, match_whole_word=False, consecutive=False) -> PdfTextSearcher: ...
    def get_charbox(self, index, loose=False) -> tuple: ...
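
A search is typically consumed by iterating over the returned searcher. The sketch below assumes that `PdfTextSearcher.get_next()` returns an `(index, count)` tuple for each occurrence and `None` once the matches are exhausted, as in pypdfium2 4.x. The helper name `find_all` is hypothetical.

```python
def find_all(textpage, needle, match_case=False):
    """Collect (char_index, char_count, matched_text) for each occurrence of needle."""
    searcher = textpage.search(needle, match_case=match_case)
    hits = []
    while True:
        # Assumption: get_next() yields (index, count), or None when done.
        occurrence = searcher.get_next()
        if occurrence is None:
            break
        index, count = occurrence
        # Read the matched characters back from the text page.
        hits.append((index, count, textpage.get_text_range(index, count)))
    return hits
```

Given a `textpage` obtained from `page.get_textpage()`, `find_all(textpage, "invoice")` would return one tuple per match, or an empty list if there are none.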

Image and Bitmap Operations

Image rendering, manipulation, and extraction with support for multiple output formats including PIL Images, NumPy arrays, and raw bitmaps.

class PdfBitmap:
    @classmethod
    def from_pil(cls, pil_image, recopy=False) -> PdfBitmap: ...
    def to_numpy(self) -> numpy.ndarray: ...
    def to_pil(self) -> PIL.Image: ...
    def fill_rect(self, left, top, width, height, color): ...

Page Objects and Graphics

Manipulation of PDF page objects including images, text, and vector graphics. Supports object transformation, insertion, and removal.

class PdfObject:
    def get_pos(self) -> tuple: ...
    def get_matrix(self) -> PdfMatrix: ...
    def transform(self, matrix): ...

class PdfImage(PdfObject):
    def get_metadata(self) -> ImageInfo: ...
    def extract(self, dest, *args, **kwargs): ...

File Attachments

Management of embedded file attachments with support for attachment metadata, data extraction, and modification.

class PdfAttachment:
    def get_name(self) -> str: ...
    def get_data(self) -> ctypes.Array: ...
    def set_data(self, data): ...
    def get_str_value(self, key) -> str: ...

Transformation and Geometry

2D transformation matrices for coordinate system manipulation, rotation, scaling, and translation operations.

class PdfMatrix:
    def __init__(self, a=1, b=0, c=0, d=1, e=0, f=0): ...
    def translate(self, x, y) -> PdfMatrix: ...
    def scale(self, x, y) -> PdfMatrix: ...
    def rotate(self, angle, ccw=False, rad=False) -> PdfMatrix: ...
    def on_point(self, x, y) -> tuple: ...
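
The six parameters follow the standard PDF affine-matrix convention, in which a point (x, y) maps to (a*x + c*y + e, b*x + d*y + f). The plain-Python sketch below illustrates that arithmetic. It does not use pypdfium2, and `apply` and `compose` are hypothetical names rather than library functions.

```python
IDENTITY = (1, 0, 0, 1, 0, 0)  # same defaults as PdfMatrix(a=1, b=0, c=0, d=1, e=0, f=0)


def apply(matrix, x, y):
    """Map the point (x, y) through the affine matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    return a * x + c * y + e, b * x + d * y + f


def compose(first, second):
    """Return the matrix equivalent to applying `first`, then `second`."""
    a1, b1, c1, d1, e1, f1 = first
    a2, b2, c2, d2, e2, f2 = second
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


scale_2x = (2, 0, 0, 2, 0, 0)
shift_right = (1, 0, 0, 1, 10, 0)

# Order matters: scaling first and then shifting differs from the reverse.
print(apply(compose(scale_2x, shift_right), 1, 1))
print(apply(compose(shift_right, scale_2x), 1, 1))
```

The two printed points differ because the second composition also scales the translation, which is why the order of chained `translate`, `scale`, and `rotate` calls is significant.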

Version and Library Information

Access to pypdfium2 and PDFium version information, build details, and feature flags.

PYPDFIUM_INFO: _version_pypdfium2
PDFIUM_INFO: _version_pdfium

# Version properties
version: str
api_tag: tuple[int]
major: int
minor: int
patch: int
build: int  # PDFIUM_INFO only

Command Line Interface

Access to pypdfium2's comprehensive command-line tools for batch processing, text extraction, image operations, and document manipulation.

def cli_main(raw_args=None) -> int:
    """Main CLI entry point for pypdfium2 command-line tools."""

def api_main(raw_args=None) -> int:
    """Alternative API entry point with same functionality as cli_main."""

Exception Handling

class PdfiumError(RuntimeError):
    """Main exception for PDFium library errors"""
    
class ImageNotExtractableError(Exception):
    """Raised when image cannot be extracted from PDF"""

Common error scenarios include invalid PDF files, unsupported operations, memory allocation failures, and file I/O errors. Always handle exceptions when working with external PDF files or performing complex operations.
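
A defensive loading pattern might look like the sketch below. It assumes that `PdfDocument` raises `PdfiumError` when the input is not a valid PDF and an `OSError` subclass when a file path cannot be read. The helper name `page_count_or_none` is hypothetical.

```python
def page_count_or_none(source):
    """Return the page count of `source` (path or bytes), or None if it cannot be opened."""
    # Imported lazily so that defining this helper does not itself require pypdfium2.
    import pypdfium2 as pdfium

    try:
        pdf = pdfium.PdfDocument(source)
    except (pdfium.PdfiumError, OSError) as exc:
        # PdfiumError: PDFium rejected the data; OSError: the file could not be read.
        print(f"Could not open PDF: {exc}")
        return None

    try:
        return len(pdf)
    finally:
        pdf.close()  # release the document even if len() fails
```

Under these assumptions, passing arbitrary non-PDF bytes such as `b"not a pdf"` would return `None` rather than propagating the exception.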

Raw Bindings Access

For advanced use cases requiring direct PDFium API access:

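from pypdfium2 import raw

# Placeholder inputs: the C API takes the path as bytes; None means no password
file_path = b"document.pdf"
password = None

# Access low-level PDFium functions
doc_handle = raw.FPDF_LoadDocument(file_path, password)
page_count = raw.FPDF_GetPageCount(doc_handle)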

The raw module provides complete access to PDFium's C API with all functions, constants, and structures available for advanced manipulation.
