Tessl Tile for pypi/pypdf@6.0.0

# Utilities

Supporting utilities for pypdf: page ranges, standard paper sizes, constants and enums, and the exception hierarchy. They cover recurring chores in PDF work such as selecting pages, sizing blank pages, and handling read errors.

## Capabilities

### Page Ranges

The `PageRange` class selects pages for PDF operations and converts to and from Python `slice` objects.

```python { .api }
class PageRange:
    def __init__(self, arg):
        """
        Initialize a page range from one of several input formats.

        Args:
            arg: Range specification, one of:
                - slice object (e.g., slice(0, 10, 2))
                - PageRange object (copy constructor)
                - string in slice-like syntax with zero-based indices
                  (e.g., "0:5", "::2", "-1", ":"); a bare integer string
                  such as "3" selects only that one page

        Raises:
            ParseError: If arg is not a valid page range specification.
        """

    @staticmethod
    def valid(input) -> bool:
        """
        Check if input is a valid page range specification.

        Args:
            input: Input to validate

        Returns:
            True if input is valid for PageRange
        """

    def to_slice(self) -> slice:
        """
        Convert page range to a slice object.

        Returns:
            Equivalent slice object
        """

    def indices(self, n: int) -> tuple[int, int, int]:
        """
        Get slice indices for a given length, like slice.indices(n).

        Args:
            n: Total number of items

        Returns:
            Tuple of (start, stop, step) indices
        """

    def __str__(self) -> str:
        """String representation of the page range."""

    def __repr__(self) -> str:
        """Developer representation of the page range."""

    def __eq__(self, other) -> bool:
        """Check equality with another PageRange."""

    def __hash__(self) -> int:
        """Hash function for use in sets and dictionaries."""

    def __add__(self, other):
        """Add two page ranges together."""
```
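
Because `to_slice()` and `indices(n)` deal in ordinary slice data, the selection logic can be previewed with Python's built-in `slice` alone. The sketch below assumes `PageRange.indices(n)` behaves like `slice.indices(n)`; `selected_pages` is a hypothetical helper for illustration, not part of pypdf.

```python
def selected_pages(page_slice: slice, page_count: int) -> list[int]:
    """Return the zero-based page indices a slice selects from page_count pages."""
    # slice.indices() clamps start and stop to the available length
    start, stop, step = page_slice.indices(page_count)
    return list(range(start, stop, step))


first_five = selected_pages(slice(0, 5), 20)
every_other = selected_pages(slice(0, 10, 2), 20)
clamped = selected_pages(slice(0, 10), 3)       # asks for ten pages, only three exist
last_two = selected_pages(slice(-2, None), 20)  # negative indices count from the end
```

Unpacking the `(start, stop, step)` tuple into `range()` is what turns a range description into concrete page indices; testing membership in the tuple itself would only compare against those three numbers.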

### Page Range Parsing

Utility function for pairing filenames with page ranges.

```python { .api }
def parse_filename_page_ranges(args: list[str | PageRange | None]) -> list[tuple[str, PageRange]]:
    """
    Pair each filename with the page ranges that follow it.

    Args:
        args: List whose first element is a filename. Later elements are
              filenames or page range expressions (strings, slices, or
              PageRange objects) that apply to the preceding filename.
              Example: ["doc.pdf", "0:5", "other.pdf", "file.pdf", "1:8:2"]

    Returns:
        List of (filename, page_range) pairs. A filename that is not
        followed by a page range is paired with the range ":" (all pages).
    """
```

### Paper Sizes

Standard paper size definitions for creating properly sized documents.

```python { .api }
class PaperSize:
    """Standard paper size definitions in points (72 points = 1 inch)."""

    # ISO A series (most common internationally)
    A0: tuple[float, float] = (2384, 3370)  # 841 × 1189 mm
    A1: tuple[float, float] = (1684, 2384)  # 594 × 841 mm
    A2: tuple[float, float] = (1191, 1684)  # 420 × 594 mm
    A3: tuple[float, float] = (842, 1191)   # 297 × 420 mm
    A4: tuple[float, float] = (595, 842)    # 210 × 297 mm
    A5: tuple[float, float] = (420, 595)    # 148 × 210 mm
    A6: tuple[float, float] = (298, 420)    # 105 × 148 mm
    A7: tuple[float, float] = (210, 298)    # 74 × 105 mm
    A8: tuple[float, float] = (147, 210)    # 52 × 74 mm

    # Envelope sizes
    C4: tuple[float, float] = (649, 918)    # 229 × 324 mm envelope
```
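
The point values above follow from the millimetre dimensions in the comments, given 72 points per inch and 25.4 mm per inch. A minimal sketch of that conversion, where `mm_to_points` is a hypothetical helper and not a pypdf function:

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72


def mm_to_points(mm: float) -> int:
    """Convert millimetres to whole PDF points (1 point = 1/72 inch)."""
    return round(mm * POINTS_PER_INCH / MM_PER_INCH)


a4 = (mm_to_points(210), mm_to_points(297))
a3 = (mm_to_points(297), mm_to_points(420))
print(a4, a3)
```

The same helper can size pages that have no predefined constant, as long as the dimensions are known in millimetres.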

### Constants and Enums

PDF-specific constants, enums, and flags for various operations.

```python { .api }
from enum import IntEnum, IntFlag

class PasswordType(IntEnum):
    """Types of PDF passwords."""
    NOT_DECRYPTED = 0
    USER_PASSWORD = 1
    OWNER_PASSWORD = 2

class ImageType(IntFlag):
    """Types of images that can be extracted or processed."""
    NONE = 0
    XOBJECT_IMAGES = 1      # Image XObjects
    INLINE_IMAGES = 2       # Inline images in content streams
    DRAWING_IMAGES = 4      # Images created by drawing operations
    ALL = XOBJECT_IMAGES | INLINE_IMAGES | DRAWING_IMAGES  # All image types
    IMAGES = ALL            # Alias of ALL, for consistency with ObjectDeletionFlag

class ObjectDeletionFlag(IntFlag):
    """Flags for controlling object deletion in PDFs."""
    NONE = 0
    TEXT = 1                # Text objects
    LINKS = 2               # Link annotations
    ATTACHMENTS = 4         # File attachments
    OBJECTS_3D = 8          # 3D objects
    ALL_ANNOTATIONS = 16    # All annotation types
    XOBJECT_IMAGES = 32     # Image XObjects
    INLINE_IMAGES = 64      # Inline images
    DRAWING_IMAGES = 128    # Drawing-based images
    IMAGES = XOBJECT_IMAGES | INLINE_IMAGES | DRAWING_IMAGES  # All images
```
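
Since these are `IntFlag` enums, members combine with `|` and are tested with `&` or `in`. The sketch below uses a local stand-in class that copies a few of the `ObjectDeletionFlag` values listed above, so it runs without pypdf installed; the class name `DeletionFlag` is an assumption made for illustration.

```python
from enum import IntFlag


class DeletionFlag(IntFlag):
    """Local stand-in copying a few values from ObjectDeletionFlag."""
    NONE = 0
    TEXT = 1
    LINKS = 2
    XOBJECT_IMAGES = 32
    INLINE_IMAGES = 64
    DRAWING_IMAGES = 128
    IMAGES = XOBJECT_IMAGES | INLINE_IMAGES | DRAWING_IMAGES


# Combine flags with bitwise OR
to_delete = DeletionFlag.TEXT | DeletionFlag.IMAGES

# Test individual flags with bitwise AND or the `in` operator
removes_text = bool(to_delete & DeletionFlag.TEXT)
removes_links = bool(to_delete & DeletionFlag.LINKS)
removes_inline = DeletionFlag.INLINE_IMAGES in to_delete
combined_value = int(to_delete)
```

A composite member such as `IMAGES` is just the OR of its parts, so checking for any single image flag against it succeeds.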

### Error Handling

Exception hierarchy for error handling in PDF operations. All of these classes live in `pypdf.errors`.

```python { .api }
class PyPdfError(Exception):
    """Base exception for all pypdf errors."""

class DeprecationError(Exception):
    """Raised when deprecated functionality is used."""

class DependencyError(Exception):
    """Raised when required dependencies are missing."""

class PdfReadError(PyPdfError):
    """Raised when PDF reading fails."""

class PdfStreamError(PdfReadError):
    """Raised when PDF stream processing fails."""

class FileNotDecryptedError(PdfReadError):
    """Raised when trying to access encrypted content without decryption."""

class WrongPasswordError(FileNotDecryptedError):
    """Raised when incorrect password is provided for encrypted PDF."""

class EmptyFileError(PdfReadError):
    """Raised when PDF file is empty or has no content."""

class ParseError(PyPdfError):
    """Raised when PDF parsing fails."""

class PageSizeNotDefinedError(PyPdfError):
    """Raised when page size cannot be determined."""

class EmptyImageDataError(PyPdfError):
    """Raised when trying to process an image that has no data."""

class LimitReachedError(PyPdfError):
    """Raised when processing limits are exceeded."""

class PdfReadWarning(UserWarning):
    """Warning for non-fatal PDF reading issues."""
```

## Usage Examples

### Working with Page Ranges

```python
from pypdf import PdfReader, PdfWriter, PageRange

reader = PdfReader("document.pdf")
writer = PdfWriter()

# Create page ranges in different ways (page indices are zero-based)
range1 = PageRange("0:5")            # First five pages
range2 = PageRange("1:6:2")          # Second, fourth, and sixth pages
range3 = PageRange(slice(0, 10, 2))  # Every other page among the first ten

# indices() returns (start, stop, step) clamped to the page count,
# so it can be unpacked straight into range()
for page_num in range(*range1.indices(len(reader.pages))):
    writer.add_page(reader.pages[page_num])

with open("selected_pages.pdf", "wb") as output:
    writer.write(output)
```

### Page Range Validation and Conversion

```python
from pypdf import PageRange

# Validate page range inputs; "invalid" is rejected, the rest are accepted
inputs = ["0:10", "::2", "invalid", slice(0, 5)]

for inp in inputs:
    if PageRange.valid(inp):
        pr = PageRange(inp)
        print(f"Valid range: {inp} -> {pr}")
        print(f"  As slice: {pr.to_slice()}")
        print(f"  Indices for 20 pages: {pr.indices(20)}")
    else:
        print(f"Invalid range: {inp}")
```

### Parsing Filename and Page Ranges

```python
from pypdf import parse_filename_page_ranges

# Each filename may be followed by page range expressions (zero-based)
args = [
    "document.pdf", "0:10",  # First ten pages
    "report.pdf", "1:8:2",   # Pages at indices 1, 3, 5, and 7
    "book.pdf",              # No page range specified: all pages
    "chapter1.pdf", "4:",    # From the fifth page to the end
]

pairs = parse_filename_page_ranges(args)

for filename, page_range in pairs:
    print(f"File: {filename}")
    print(f"  Pages: {page_range}")
```

### Using Standard Paper Sizes

```python
from pypdf import PdfWriter, PageObject, PaperSize

writer = PdfWriter()

# Create pages with standard sizes
sizes_to_create = [
    ("Letter", (612, 792)),     # US Letter
    ("A4", PaperSize.A4),       # ISO A4
    ("A3", PaperSize.A3),       # ISO A3
    ("Legal", (612, 1008))      # US Legal
]

for name, (width, height) in sizes_to_create:
    # Pass the size by keyword: the first positional parameter is the PDF document
    page = PageObject.create_blank_page(width=width, height=height)
    writer.add_page(page)
    print(f"Created {name} page: {width} x {height} points")

with open("standard_sizes.pdf", "wb") as output:
    writer.write(output)
```

### Error Handling Best Practices

```python
from typing import Optional

from pypdf import PageObject, PdfReader, PdfWriter
from pypdf.errors import (
    PdfReadError, FileNotDecryptedError, WrongPasswordError,
    EmptyFileError, ParseError
)

def safe_pdf_operation(pdf_path: str, password: Optional[str] = None) -> bool:
    """Copy a PDF page by page, reporting errors instead of raising them."""

    try:
        reader = PdfReader(pdf_path, password=password)

        if reader.is_encrypted and not password:
            raise FileNotDecryptedError("PDF is encrypted but no password provided")

        writer = PdfWriter()

        # Process each page safely
        for page_num, page in enumerate(reader.pages):
            try:
                # Attempt to extract text to verify page is readable
                text = page.extract_text()
                writer.add_page(page)
                print(f"Processed page {page_num + 1}: {len(text)} characters")

            except ParseError as e:
                print(f"Warning: Could not process page {page_num + 1}: {e}")
                # Substitute a blank US Letter page for the unreadable one
                blank_page = PageObject.create_blank_page(width=612, height=792)
                writer.add_page(blank_page)

        # Save result
        output_path = pdf_path.replace('.pdf', '_processed.pdf')
        with open(output_path, "wb") as output:
            writer.write(output)

        print(f"Successfully processed {pdf_path}")
        return True

    # WrongPasswordError subclasses FileNotDecryptedError, so it must come first
    except WrongPasswordError:
        print(f"Error: Incorrect password for {pdf_path}")
        return False

    except FileNotDecryptedError:
        print(f"Error: {pdf_path} is encrypted. Please provide password.")
        return False

    except EmptyFileError:
        print(f"Error: {pdf_path} is empty or corrupted")
        return False

    except PdfReadError as e:
        print(f"Error reading {pdf_path}: {e}")
        return False

    except Exception as e:
        print(f"Unexpected error processing {pdf_path}: {e}")
        return False

# Use the safe operation
success = safe_pdf_operation("document.pdf")
if not success:
    success = safe_pdf_operation("document.pdf", password="secret")
```

### Working with Image Types

```python
from pypdf import PdfReader, ImageType

reader = PdfReader("document_with_images.pdf")

for page_num, page in enumerate(reader.pages):
    print(f"Page {page_num + 1}:")

    # Extract different types of images
    try:
        # All images
        all_images = page.images
        print(f"  Total images: {len(all_images)}")

        # You can specify image types when working with image extraction
        # (This would be used in specific image extraction methods)
        print(f"  Image types available: {list(ImageType)}")

    except Exception as e:
        print(f"  Error accessing images: {e}")
```

### Utility Functions for Common Operations

```python
from pypdf import PageObject, PageRange, PaperSize, PdfReader, PdfWriter
from pypdf.errors import PyPdfError

def extract_page_range(input_pdf: str, output_pdf: str, page_range_str: str):
    """Extract specific pages to new PDF."""
    try:
        reader = PdfReader(input_pdf)
        writer = PdfWriter()

        # Parse the slice-like page range (zero-based, e.g. "0:5")
        page_range = PageRange(page_range_str)
        start, stop, step = page_range.indices(len(reader.pages))

        # Extract pages; indices() has already clamped to the page count
        for i in range(start, stop, step):
            writer.add_page(reader.pages[i])

        with open(output_pdf, "wb") as output:
            writer.write(output)

        print(f"Extracted pages {page_range_str} to {output_pdf}")

    except PyPdfError as e:
        print(f"PDF Error: {e}")
    except Exception as e:
        print(f"Error: {e}")

def create_blank_document(output_pdf: str, page_count: int = 1, size: str = "A4"):
    """Create a blank PDF document."""
    writer = PdfWriter()

    # Get paper size
    if hasattr(PaperSize, size):
        width, height = getattr(PaperSize, size)
    else:
        # Default to A4 if size not found
        width, height = PaperSize.A4
        print(f"Unknown size '{size}', using A4")

    # Create blank pages (size passed by keyword; the first parameter is the PDF)
    for _ in range(page_count):
        page = PageObject.create_blank_page(width=width, height=height)
        writer.add_page(page)

    with open(output_pdf, "wb") as output:
        writer.write(output)

    print(f"Created {page_count} blank {size} pages in {output_pdf}")

def get_pdf_info(pdf_path: str) -> dict:
    """Get comprehensive PDF information."""
    try:
        reader = PdfReader(pdf_path)

        info = {
            "filename": pdf_path,
            "page_count": len(reader.pages),
            "is_encrypted": reader.is_encrypted,
            "pdf_version": reader.pdf_header,
            "metadata": {},
            "page_sizes": []
        }

        # Get metadata
        if reader.metadata:
            info["metadata"] = {
                "title": reader.metadata.title,
                "author": reader.metadata.author,
                "subject": reader.metadata.subject,
                "creator": reader.metadata.creator,
                "producer": reader.metadata.producer
            }

        # Get page sizes
        for i, page in enumerate(reader.pages):
            try:
                width = float(page.mediabox.width)
                height = float(page.mediabox.height)
                info["page_sizes"].append({
                    "page": i + 1,
                    "width": width,
                    "height": height,
                    "size_points": f"{width} x {height}"
                })
            except Exception:
                info["page_sizes"].append({
                    "page": i + 1,
                    "error": "Could not determine size"
                })

        return info

    except Exception as e:
        return {
            "filename": pdf_path,
            "error": str(e)
        }

# Use utility functions
extract_page_range("document.pdf", "pages_1_to_5.pdf", "0:5")
create_blank_document("blank.pdf", 10, "A4")
info = get_pdf_info("document.pdf")
print(f"PDF Info: {info}")
```

## Error Classes and Exception Handling

### Exception Hierarchy

pypdf provides a comprehensive exception hierarchy for different types of PDF processing errors.

```python { .api }
# Base exception classes
class PyPdfError(Exception):
    """Base class for all exceptions raised by pypdf."""

class PdfReadError(PyPdfError):
    """Raised when there is an issue reading a PDF file."""

class PdfStreamError(PdfReadError):
    """Raised when there is an issue reading the stream of data in a PDF file."""

class ParseError(PyPdfError):
    """Raised when there is an issue parsing a PDF file."""

# File access and decryption errors
class FileNotDecryptedError(PdfReadError):
    """Raised when an encrypted PDF has not been successfully decrypted."""

class WrongPasswordError(FileNotDecryptedError):
    """Raised when the wrong password is used to decrypt an encrypted PDF."""

class EmptyFileError(PdfReadError):
    """Raised when a PDF file is empty or has no content."""

# Specific operation errors
class PageSizeNotDefinedError(PyPdfError):
    """Raised when the page size of a PDF document is not defined."""

class EmptyImageDataError(PyPdfError):
    """Raised when trying to process an image that has no data."""

class LimitReachedError(PyPdfError):
    """Raised when a limit is reached."""

# Dependency and deprecation errors
class DependencyError(Exception):
    """Raised when a required dependency is not available."""

class DeprecationError(Exception):
    """Raised when a deprecated feature is used."""

# Warnings
class PdfReadWarning(UserWarning):
    """Issued when there is a potential issue reading a PDF file, but it can still be read."""
```
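
Because `WrongPasswordError` derives from `FileNotDecryptedError`, which in turn derives from `PdfReadError`, the order of `except` clauses decides which handler runs: the most specific class has to come first. The sketch below redefines a slice of the hierarchy locally so it runs without pypdf; `classify` is a hypothetical helper.

```python
# Local stand-ins mirroring part of the hierarchy listed above
class PyPdfError(Exception): ...
class PdfReadError(PyPdfError): ...
class FileNotDecryptedError(PdfReadError): ...
class WrongPasswordError(FileNotDecryptedError): ...


def classify(exc: Exception) -> str:
    """Raise exc and report which handler caught it."""
    try:
        raise exc
    except WrongPasswordError:       # most specific first
        return "wrong password"
    except FileNotDecryptedError:
        return "not decrypted"
    except PdfReadError:
        return "read error"
    except PyPdfError:               # catch-all for the library
        return "pypdf error"


results = [
    classify(WrongPasswordError()),
    classify(FileNotDecryptedError()),
    classify(PdfReadError()),
]
```

If the `FileNotDecryptedError` clause were listed before `WrongPasswordError`, the latter handler would never run, since the parent class matches first.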

### User Access Permission Constants

```python { .api }
from pypdf.constants import UserAccessPermissions

class UserAccessPermissions(IntFlag):
    """PDF user access permissions for encryption."""

    PRINT = 4                         # Allow printing
    MODIFY = 8                        # Allow document modification
    EXTRACT = 16                      # Allow text/graphics extraction
    ADD_OR_MODIFY = 32                # Allow annotations/form fields
    FILL_FORM_FIELDS = 256            # Allow form field filling
    EXTRACT_TEXT_AND_GRAPHICS = 512   # Allow accessibility extraction
    ASSEMBLE_DOC = 1024               # Allow document assembly
    PRINT_TO_REPRESENTATION = 2048    # Allow high-quality printing

    @classmethod
    def all(cls) -> "UserAccessPermissions":
        """Get all permissions enabled."""

    def to_dict(self) -> dict[str, bool]:
        """Convert permissions to dictionary format."""

    @classmethod
    def from_dict(cls, value: dict[str, bool]) -> "UserAccessPermissions":
        """Create permissions from dictionary format."""
```
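
Each permission is a single bit, so a permission set is the OR of its members, and a dictionary view is one boolean per bit. The sketch below illustrates that idea with a local stand-in class built from the values listed above. The `as_dict` helper and its lower-case keys are assumptions for illustration and are not claimed to match the output of pypdf's `to_dict()`.

```python
from enum import IntFlag


class Permissions(IntFlag):
    """Local stand-in using the bit values listed above."""
    PRINT = 4
    MODIFY = 8
    EXTRACT = 16
    ADD_OR_MODIFY = 32
    FILL_FORM_FIELDS = 256
    EXTRACT_TEXT_AND_GRAPHICS = 512
    ASSEMBLE_DOC = 1024
    PRINT_TO_REPRESENTATION = 2048


def as_dict(perms: Permissions) -> dict[str, bool]:
    """Map every named permission to whether its bit is set (hypothetical helper)."""
    return {
        name.lower(): bool(perms & flag)
        for name, flag in Permissions.__members__.items()
    }


# Allow printing and extraction, nothing else
read_only = Permissions.PRINT | Permissions.EXTRACT
flags = as_dict(read_only)
```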

### Stream and Parsing Constants

```python { .api }
# Stream processing constants
STREAM_TRUNCATED_PREMATURELY = "Stream has ended unexpectedly"

# Core PDF structure constants
class Core:
    OUTLINES = "/Outlines"
    THREADS = "/Threads"
    PAGE = "/Page"
    PAGES = "/Pages"
    CATALOG = "/Catalog"

class TrailerKeys:
    SIZE = "/Size"
    PREV = "/Prev"
    ROOT = "/Root"
    ENCRYPT = "/Encrypt"
    INFO = "/Info"
    ID = "/ID"
```
