or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

docs

attachments.mdcli-tools.mddocument-management.mdimage-bitmap.mdindex.mdpage-manipulation.mdpage-objects.mdtext-processing.mdtransformation.mdversion-info.md
tile.json

tessl/pypi-pypdfium2

Python bindings to PDFium for comprehensive PDF manipulation, rendering, and processing

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
pypipkg:pypi/pypdfium2@4.30.x

To install, run

npx @tessl/cli install tessl/pypi-pypdfium2@4.30.0

index.mddocs/

pypdfium2

Python bindings to PDFium for comprehensive PDF manipulation, rendering, and processing. Built on Google's powerful PDFium library, pypdfium2 provides both high-level helper classes for common PDF operations and low-level raw bindings for advanced functionality.

Package Information

  • Package Name: pypdfium2
  • Language: Python
  • Installation: pip install pypdfium2
  • Python Requirements: Python 3.6+

Core Imports

import pypdfium2 as pdfium

For direct access to specific classes:

from pypdfium2 import PdfDocument, PdfPage, PdfBitmap

For version information:

from pypdfium2 import PYPDFIUM_INFO, PDFIUM_INFO

Basic Usage

import pypdfium2 as pdfium

# Open a PDF document
pdf = pdfium.PdfDocument("document.pdf")

# Get basic information
print(f"Pages: {len(pdf)}")
print(f"Version: {pdf.get_version()}")
print(f"Metadata: {pdf.get_metadata_dict()}")

# Render first page to image
page = pdf[0]
bitmap = page.render(scale=2.0)
pil_image = bitmap.to_pil()
pil_image.save("page1.png")

# Extract text from page
textpage = page.get_textpage()
text = textpage.get_text_range()
print(f"Page text: {text}")

# Clean up
pdf.close()

Architecture

pypdfium2 follows a layered architecture design:

  • Helper Classes: High-level Python API (PdfDocument, PdfPage, PdfBitmap, etc.) providing intuitive interfaces for common operations
  • Raw Bindings: Direct access to PDFium C API functions through pypdfium2.raw module
  • Type System: Named tuples and data classes for structured information (PdfBitmapInfo, ImageInfo, etc.)
  • Resource Management: Automatic cleanup with context managers and explicit close() methods
  • Multi-format Support: PDF reading/writing, image rendering (PIL, NumPy), text extraction

This design enables both simple high-level operations and advanced low-level manipulation while maintaining compatibility with the broader Python ecosystem.

Capabilities

Document Management

Core PDF document operations including loading, creating, saving, and metadata manipulation. Supports password-protected PDFs, form handling, and file attachments.

class PdfDocument:
    def __init__(self, input_data, password=None, autoclose=False): ...
    @classmethod
    def new(cls): ...
    def __len__(self) -> int: ...
    def save(self, dest, version=None, flags=...): ...
    def get_metadata_dict(self, skip_empty=False) -> dict: ...
    def is_tagged(self) -> bool: ...

Document Management

Page Manipulation

Page-level operations including rendering, rotation, dimension management, and bounding box manipulation. Supports various rendering formats and customization options.

class PdfPage:
    def get_size(self) -> tuple[float, float]: ...
    def render(self, rotation=0, scale=1, ...) -> PdfBitmap: ...
    def get_rotation(self) -> int: ...
    def set_rotation(self, rotation): ...
    def get_mediabox(self, fallback_ok=True) -> tuple | None: ...

Page Manipulation

Text Processing

Comprehensive text extraction and search capabilities with support for bounded text extraction, character-level positioning, and full-text search.

class PdfTextPage:
    def get_text_range(self, index=0, count=-1, errors="ignore", force_this=False) -> str: ...
    def get_text_bounded(self, left=None, bottom=None, right=None, top=None, errors="ignore") -> str: ...
    def search(self, text, index=0, match_case=False, match_whole_word=False, consecutive=False) -> PdfTextSearcher: ...
    def get_charbox(self, index, loose=False) -> tuple: ...

Text Processing

Image and Bitmap Operations

Image rendering, manipulation, and extraction with support for multiple output formats including PIL Images, NumPy arrays, and raw bitmaps.

class PdfBitmap:
    @classmethod
    def from_pil(cls, pil_image, recopy=False) -> PdfBitmap: ...
    def to_numpy(self) -> numpy.ndarray: ...
    def to_pil(self) -> PIL.Image: ...
    def fill_rect(self, left, top, width, height, color): ...

Image and Bitmap Operations

Page Objects and Graphics

Manipulation of PDF page objects including images, text, and vector graphics. Supports object transformation, insertion, and removal.

class PdfObject:
    def get_pos(self) -> tuple: ...
    def get_matrix(self) -> PdfMatrix: ...
    def transform(self, matrix): ...

class PdfImage(PdfObject):
    def get_metadata(self) -> ImageInfo: ...
    def extract(self, dest, *args, **kwargs): ...

Page Objects and Graphics

File Attachments

Management of embedded file attachments with support for attachment metadata, data extraction, and modification.

class PdfAttachment:
    def get_name(self) -> str: ...
    def get_data(self) -> ctypes.Array: ...
    def set_data(self, data): ...
    def get_str_value(self, key) -> str: ...

File Attachments

Transformation and Geometry

2D transformation matrices for coordinate system manipulation, rotation, scaling, and translation operations.

class PdfMatrix:
    def __init__(self, a=1, b=0, c=0, d=1, e=0, f=0): ...
    def translate(self, x, y) -> PdfMatrix: ...
    def scale(self, x, y) -> PdfMatrix: ...
    def rotate(self, angle, ccw=False, rad=False) -> PdfMatrix: ...
    def on_point(self, x, y) -> tuple: ...

Transformation and Geometry

Version and Library Information

Access to pypdfium2 and PDFium version information, build details, and feature flags.

PYPDFIUM_INFO: _version_pypdfium2
PDFIUM_INFO: _version_pdfium

# Version properties
version: str
api_tag: tuple[int]
major: int
minor: int
patch: int
build: int  # PDFIUM_INFO only

Version and Library Information

Command Line Interface

Access to pypdfium2's comprehensive command-line tools for batch processing, text extraction, image operations, and document manipulation.

def cli_main(raw_args=None) -> int:
    """Main CLI entry point for pypdfium2 command-line tools."""

def api_main(raw_args=None) -> int:
    """Alternative API entry point with same functionality as cli_main."""

Command Line Interface

Exception Handling

class PdfiumError(RuntimeError):
    """Main exception for PDFium library errors"""
    
class ImageNotExtractableError(Exception):
    """Raised when image cannot be extracted from PDF"""

Common error scenarios include invalid PDF files, unsupported operations, memory allocation failures, and file I/O errors. Always handle exceptions when working with external PDF files or performing complex operations.

Raw Bindings Access

For advanced use cases requiring direct PDFium API access:

from pypdfium2 import raw

# Access low-level PDFium functions
doc_handle = raw.FPDF_LoadDocument(file_path, password)
page_count = raw.FPDF_GetPageCount(doc_handle)

The raw module provides complete access to PDFium's C API with all functions, constants, and structures available for advanced manipulation.