# ZIP File Support

Comprehensive ZIP file extraction and streaming support for processing compressed S3 files. The S3 connector provides efficient handling of ZIP archives, including streaming decompression and individual file access without downloading entire archives.

## Capabilities

### ZIP File Handler

Main class for discovering and accessing files within ZIP archives stored in S3, supporting both standard and ZIP64 formats.
```python { .api }
class ZipFileHandler:
    """
    Handles ZIP file discovery and metadata extraction from S3.
    Supports both standard ZIP and ZIP64 formats with efficient partial file reading.
    """

    # ZIP format constants
    EOCD_SIGNATURE: bytes = b"\x50\x4b\x05\x06"
    """End of Central Directory signature"""

    ZIP64_LOCATOR_SIGNATURE: bytes = b"\x50\x4b\x06\x07"
    """ZIP64 End of Central Directory Locator signature"""

    EOCD_CENTRAL_DIR_START_OFFSET: int = 16
    """Offset to central directory start in EOCD record"""

    ZIP64_EOCD_OFFSET: int = 8
    """Offset to ZIP64 EOCD in locator record"""

    ZIP64_EOCD_SIZE: int = 56
    """Size of ZIP64 End of Central Directory record"""

    ZIP64_CENTRAL_DIR_START_OFFSET: int = 48
    """Offset to central directory start in ZIP64 EOCD"""

    def __init__(self, s3_client: BaseClient, config: Config):
        """
        Initialize ZIP file handler with S3 client and configuration.

        Args:
            s3_client: Configured S3 client for file access
            config: S3 connector configuration
        """

    def get_zip_files(self, filename: str) -> Tuple[List[zipfile.ZipInfo], int]:
        """
        Extracts ZIP file metadata and returns list of contained files.

        Args:
            filename: S3 key of the ZIP file

        Returns:
            Tuple of (list of ZipInfo objects, central directory start offset)

        Raises:
            ValueError: If file is not a valid ZIP archive
            ClientError: If S3 access fails
        """

    def _fetch_data_from_s3(self, filename: str, start: int, size: Optional[int] = None) -> bytes:
        """
        Fetches specific byte range from S3 object.

        Args:
            filename: S3 key of the file
            start: Starting byte position
            size: Number of bytes to fetch (None for rest of file)

        Returns:
            Bytes data from the specified range
        """

    def _find_signature(
        self,
        filename: str,
        signature: bytes,
        initial_buffer_size: int = BUFFER_SIZE_DEFAULT,
        max_buffer_size: int = MAX_BUFFER_SIZE_DEFAULT
    ) -> Optional[bytes]:
        """
        Locates ZIP signature by reading backwards from end of file.

        Args:
            filename: S3 key of the ZIP file
            signature: Byte signature to search for
            initial_buffer_size: Initial buffer size for searching
            max_buffer_size: Maximum buffer size to prevent excessive memory usage

        Returns:
            Buffer containing the signature, or None if not found
        """

    def _fetch_zip64_data(self, filename: str) -> bytes:
        """
        Fetches ZIP64 Extended Information Extra Field data.

        Args:
            filename: S3 key of the ZIP file

        Returns:
            ZIP64 extra field data
        """

    def _get_central_directory_start(self, filename: str) -> int:
        """
        Determines the start offset of the central directory.

        Args:
            filename: S3 key of the ZIP file

        Returns:
            Byte offset where central directory begins
        """
```
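The handler never downloads the whole archive: it locates the End of Central Directory (EOCD) record with byte-range reads from the tail of the object, then uses the offset stored there to fetch only the central directory. The sketch below shows that lookup with plain boto3 calls. It is illustrative only — it ignores ZIP64, which the class handles via `ZIP64_LOCATOR_SIGNATURE`, and `find_central_directory_start` is a hypothetical standalone function, not part of the module.

```python
import struct

def find_central_directory_start(s3_client, bucket: str, key: str) -> int:
    """Locate the central directory offset via the EOCD record (non-ZIP64)."""
    # The EOCD record is 22 bytes plus an optional comment of up to 65535
    # bytes, so it always lies within the last 65557 bytes of the file.
    head = s3_client.head_object(Bucket=bucket, Key=key)
    start = max(0, head["ContentLength"] - (22 + 65535))
    tail = s3_client.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-")["Body"].read()

    pos = tail.rfind(b"\x50\x4b\x05\x06")  # EOCD_SIGNATURE
    if pos == -1:
        raise ValueError("Not a valid ZIP archive: EOCD record not found")

    # The central directory offset is a little-endian uint32 located
    # EOCD_CENTRAL_DIR_START_OFFSET (16) bytes into the record.
    (central_dir_start,) = struct.unpack_from("<I", tail, pos + 16)
    return central_dir_start
```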
### Remote File Inside Archive

Extended RemoteFile class representing a file contained within a ZIP archive, including compression metadata.
```python { .api }
class RemoteFileInsideArchive(RemoteFile):
    """
    Represents a file inside a ZIP archive with compression metadata.
    Extends RemoteFile with ZIP-specific information.
    """

    start_offset: int
    """Byte offset where compressed data begins in the ZIP file"""

    compressed_size: int
    """Size of the compressed data in bytes"""

    uncompressed_size: int
    """Size of the uncompressed data in bytes"""

    compression_method: int
    """ZIP compression method (0=stored, 8=deflated, etc.)"""
```
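The usage examples below build these objects field by field from `zipfile.ZipInfo` entries. A small convenience helper (hypothetical, not part of the module) keeps that conversion in one place; the `datetime(*date_time)` conversion works because `ZipInfo.date_time` is a 6-tuple:

```python
import zipfile
from datetime import datetime

def archive_file_from_zipinfo(
    zip_info: zipfile.ZipInfo, bucket: str, zip_key: str
) -> RemoteFileInsideArchive:
    """Build a RemoteFileInsideArchive from a central-directory entry."""
    return RemoteFileInsideArchive(
        uri=f"s3://{bucket}/{zip_key}/{zip_info.filename}",
        start_offset=zip_info.header_offset,        # local file header offset
        compressed_size=zip_info.compress_size,
        uncompressed_size=zip_info.file_size,
        compression_method=zip_info.compress_type,  # 0=stored, 8=deflated
        last_modified=datetime(*zip_info.date_time),
    )
```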
### Decompressed Stream

Streaming decompression interface for reading compressed files from ZIP archives without loading entire files into memory.
```python { .api }
class DecompressedStream(io.IOBase):
    """
    Provides streaming decompression of files within ZIP archives.
    Supports seek operations and efficient memory usage for large compressed files.
    """

    LOCAL_FILE_HEADER_SIZE: int = 30
    """Size of ZIP local file header"""

    NAME_LENGTH_OFFSET: int = 26
    """Offset to filename length in local file header"""

    def __init__(
        self,
        file_obj: IO[bytes],
        file_info: RemoteFileInsideArchive,
        buffer_size: int = BUFFER_SIZE_DEFAULT
    ):
        """
        Initialize decompressed stream for a file inside ZIP archive.

        Args:
            file_obj: File-like object for the ZIP archive
            file_info: Metadata about the file inside the archive
            buffer_size: Buffer size for decompression operations
        """

    def read(self, size: int = -1) -> bytes:
        """
        Read decompressed data from the stream.

        Args:
            size: Number of bytes to read (-1 for all remaining data)

        Returns:
            Decompressed bytes data
        """

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        """
        Seek to a specific position in the decompressed stream.

        Args:
            offset: Byte offset to seek to
            whence: Seek mode (SEEK_SET, SEEK_CUR, SEEK_END)

        Returns:
            New absolute position in the stream
        """

    def tell(self) -> int:
        """
        Get current position in the decompressed stream.

        Returns:
            Current byte position
        """

    def readable(self) -> bool:
        """
        Check if stream is readable.

        Returns:
            True if stream can be read from
        """

    def seekable(self) -> bool:
        """
        Check if stream supports seeking.

        Returns:
            True if stream supports seek operations
        """

    def close(self):
        """Close the decompressed stream and release resources."""

    def _calculate_actual_start(self, file_start: int) -> int:
        """
        Calculate actual start position accounting for local file header.

        Args:
            file_start: Start position from central directory

        Returns:
            Actual start position of compressed data
        """

    def _reset_decompressor(self):
        """Reset the decompression state for seeking operations."""

    def _decompress_chunk(self, chunk: bytes) -> bytes:
        """
        Decompress a chunk of data using the appropriate algorithm.

        Args:
            chunk: Compressed data chunk

        Returns:
            Decompressed data chunk
        """
```
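Note that `start_offset` (taken from the central directory) points at the entry's local file header, not at the compressed bytes themselves. `_calculate_actual_start` therefore skips the fixed 30-byte header plus the variable-length filename and extra field whose lengths are stored at offset 26. A sketch of that arithmetic, assuming a seekable file object (illustrative, not the module's exact code):

```python
import struct

def calculate_actual_start(zip_file_obj, file_start: int) -> int:
    """Return the offset where an entry's compressed data begins."""
    # Filename and extra-field lengths are two little-endian uint16s at
    # NAME_LENGTH_OFFSET (26) bytes into the local file header.
    zip_file_obj.seek(file_start + 26)
    name_len, extra_len = struct.unpack("<HH", zip_file_obj.read(4))
    # LOCAL_FILE_HEADER_SIZE (30) + variable-length fields.
    return file_start + 30 + name_len + extra_len
```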
### ZIP Content Reader

High-level interface for reading content from files within ZIP archives, providing both text and binary reading capabilities.
```python { .api }
class ZipContentReader:
    """
    High-level interface for reading content from ZIP archive files.
    Provides text and binary reading modes with encoding support.
    """

    def __init__(
        self,
        decompressed_stream: DecompressedStream,
        encoding: Optional[str] = None,
        buffer_size: int = BUFFER_SIZE_DEFAULT
    ):
        """
        Initialize ZIP content reader.

        Args:
            decompressed_stream: DecompressedStream for the file
            encoding: Text encoding for string operations (None for binary mode)
            buffer_size: Buffer size for reading operations
        """

    def __iter__(self):
        """
        Iterator interface for reading lines from the file.

        Yields:
            Lines from the file (str if encoding specified, bytes otherwise)
        """

    def __next__(self) -> Union[str, bytes]:
        """
        Get next line from the file.

        Returns:
            Next line from file

        Raises:
            StopIteration: When end of file is reached
        """

    def __enter__(self) -> "ZipContentReader":
        """Context manager entry."""

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        """Context manager exit."""

    def readline(self, limit: int = -1) -> Union[str, bytes]:
        """
        Read a single line from the file.

        Args:
            limit: Maximum number of characters/bytes to read

        Returns:
            Single line from file
        """

    def read(self, size: int = -1) -> Union[str, bytes]:
        """
        Read data from the file.

        Args:
            size: Number of characters/bytes to read (-1 for all)

        Returns:
            File content as string or bytes
        """

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        """
        Seek to position in the file.

        Args:
            offset: Position to seek to
            whence: Seek mode

        Returns:
            New position in file
        """

    def close(self):
        """Close the content reader and release resources."""

    def tell(self) -> int:
        """
        Get current position in the file.

        Returns:
            Current position
        """

    @property
    def closed(self) -> bool:
        """
        Check if reader is closed.

        Returns:
            True if reader is closed
        """
```
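Because the reader is iterable, line-oriented formats can be consumed without materializing the whole file. A short sketch, assuming a `DecompressedStream` named `decompressed` constructed as in the examples below:

```python
# With an encoding set, iteration yields str lines; without one, bytes.
with ZipContentReader(decompressed, encoding="utf-8") as reader:
    for line_number, line in enumerate(reader, start=1):
        print(f"{line_number}: {line.rstrip()}")
        if line_number >= 10:  # stop after the first 10 lines
            break
```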
## Usage Examples

### Basic ZIP File Processing
```python
from source_s3.v4 import SourceS3StreamReader, Config
from source_s3.v4.zip_reader import ZipFileHandler, ZipContentReader, DecompressedStream

# Configure S3 connection
config = Config(
    bucket="my-data-bucket",
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    region_name="us-east-1"
)

# Create stream reader and get S3 client
reader = SourceS3StreamReader()
reader.config = config
s3_client = reader.s3_client

# Initialize ZIP handler
zip_handler = ZipFileHandler(s3_client, config)

# Discover files in ZIP archive
zip_files, central_dir_offset = zip_handler.get_zip_files("data/archive.zip")

print(f"Found {len(zip_files)} files in archive:")
for zip_info in zip_files:
    print(f"  - {zip_info.filename} ({zip_info.file_size} bytes)")
```
### Reading Individual Files from ZIP

```python
from datetime import datetime
from source_s3.v4.zip_reader import RemoteFileInsideArchive, DecompressedStream, ZipContentReader

# Select a specific file from the ZIP
target_file = zip_files[0]  # First file in archive

# Create RemoteFileInsideArchive object
archive_file = RemoteFileInsideArchive(
    uri=f"s3://my-data-bucket/data/archive.zip/{target_file.filename}",
    start_offset=target_file.header_offset,
    compressed_size=target_file.compress_size,
    uncompressed_size=target_file.file_size,
    compression_method=target_file.compress_type,
    last_modified=datetime(*target_file.date_time)  # ZipInfo.date_time is a 6-tuple
)

# Open S3 object
s3_response = s3_client.get_object(Bucket="my-data-bucket", Key="data/archive.zip")
s3_stream = s3_response['Body']

# Create decompressed stream
decompressed = DecompressedStream(s3_stream, archive_file)

# Read content with encoding (for text files)
with ZipContentReader(decompressed, encoding="utf-8") as reader:
    content = reader.read()
    print(f"File content ({len(content)} characters):")
    print(content[:500])  # First 500 characters
```
### Processing CSV Files from ZIP

```python
import csv
from io import StringIO

# Assuming we have a CSV file in the ZIP
csv_file = next(f for f in zip_files if f.filename.endswith('.csv'))

# Create archive file representation
csv_archive_file = RemoteFileInsideArchive(
    uri=f"s3://my-data-bucket/data/archive.zip/{csv_file.filename}",
    start_offset=csv_file.header_offset,
    compressed_size=csv_file.compress_size,
    uncompressed_size=csv_file.file_size,
    compression_method=csv_file.compress_type,
    last_modified=datetime(*csv_file.date_time)
)

# Process CSV data
s3_response = s3_client.get_object(Bucket="my-data-bucket", Key="data/archive.zip")
decompressed = DecompressedStream(s3_response['Body'], csv_archive_file)

with ZipContentReader(decompressed, encoding="utf-8") as reader:
    csv_content = reader.read()
    csv_reader = csv.DictReader(StringIO(csv_content))

    for row_num, row in enumerate(csv_reader):
        print(f"Row {row_num}: {row}")
        if row_num >= 4:  # Show first 5 rows
            break
```
### Streaming Large Files from ZIP

```python
# For large files, use a streaming approach
large_file = max(zip_files, key=lambda f: f.file_size)

large_archive_file = RemoteFileInsideArchive(
    uri=f"s3://my-data-bucket/data/archive.zip/{large_file.filename}",
    start_offset=large_file.header_offset,
    compressed_size=large_file.compress_size,
    uncompressed_size=large_file.file_size,
    compression_method=large_file.compress_type,
    last_modified=datetime(*large_file.date_time)
)

# Stream content in chunks
s3_response = s3_client.get_object(Bucket="my-data-bucket", Key="data/archive.zip")
decompressed = DecompressedStream(s3_response['Body'], large_archive_file, buffer_size=64 * 1024)

with ZipContentReader(decompressed, encoding="utf-8") as reader:
    chunk_size = 1024 * 1024  # 1MB chunks
    total_size = 0

    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break

        total_size += len(chunk)
        print(f"Processed {total_size} characters...")

        # Process chunk here
        # process_data_chunk(chunk)
```
### Error Handling with ZIP Files

```python
from botocore.exceptions import ClientError

try:
    # Attempt to process ZIP file
    zip_files, _ = zip_handler.get_zip_files("data/potentially-corrupt.zip")

    for zip_info in zip_files:
        try:
            archive_file = RemoteFileInsideArchive(
                uri=f"s3://my-data-bucket/data/potentially-corrupt.zip/{zip_info.filename}",
                start_offset=zip_info.header_offset,
                compressed_size=zip_info.compress_size,
                uncompressed_size=zip_info.file_size,
                compression_method=zip_info.compress_type,
                last_modified=datetime(*zip_info.date_time)
            )

            s3_response = s3_client.get_object(
                Bucket="my-data-bucket",
                Key="data/potentially-corrupt.zip"
            )
            decompressed = DecompressedStream(s3_response['Body'], archive_file)

            with ZipContentReader(decompressed, encoding="utf-8") as reader:
                content = reader.read()
                print(f"Successfully processed {zip_info.filename}")

        except Exception as e:
            print(f"Failed to process {zip_info.filename}: {e}")
            continue

except ValueError as e:
    print(f"Invalid ZIP file: {e}")
except ClientError as e:
    print(f"S3 access error: {e}")
```
## Performance Considerations

### Buffer Size Optimization
- **Small files**: Use the default buffer size (1MB) for optimal memory usage
- **Large files**: Increase the buffer size (4-16MB) for better I/O performance
- **Memory constraints**: Decrease the buffer size if memory is limited

### Streaming vs. Full Read
- **Large files**: Always use a streaming approach to avoid memory issues
- **Small files**: A full read can be more efficient for files under 10MB
- **Mixed sizes**: Select the strategy per file based on size, as sketched below
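A sketch of that size-based selection, assuming a hypothetical `handle()` callback for downstream processing:

```python
FULL_READ_THRESHOLD = 10 * 1024 * 1024  # 10MB, per the guideline above

def process_entry(reader: ZipContentReader, uncompressed_size: int) -> None:
    """Read small entries in one call; stream large ones in 1MB chunks."""
    if uncompressed_size < FULL_READ_THRESHOLD:
        handle(reader.read())  # small file: single full read
        return
    while True:  # large file: bounded-memory streaming
        chunk = reader.read(1024 * 1024)
        if not chunk:
            break
        handle(chunk)  # handle() is a placeholder for your processing
```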
### S3 Transfer Optimization
- **Range requests**: The ZIP handler uses efficient byte-range requests
- **Connection reuse**: S3 client connection pooling improves performance
- **Regional proximity**: Use S3 buckets in the same region as the processing

### Compression Method Support
- **Stored (method 0)**: No compression, fastest extraction
- **Deflated (method 8)**: Standard compression, good balance of speed and size
- **Other methods**: Limited support; may require additional libraries (see the dispatch sketch below)
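A minimal sketch of how a per-entry decompressor can be chosen from the compression method (illustrative; the module's `_decompress_chunk` is the authoritative implementation). Raw deflate streams inside ZIP entries need a negative window size so zlib skips the header check:

```python
import zlib

STORED, DEFLATED = 0, 8  # ZIP compression method codes

def make_decompressor(compression_method: int):
    """Return a callable mapping compressed chunks to decompressed bytes."""
    if compression_method == STORED:
        return lambda chunk: chunk  # stored entries are already raw bytes
    if compression_method == DEFLATED:
        return zlib.decompressobj(-zlib.MAX_WBITS).decompress  # raw deflate
    raise ValueError(f"Unsupported ZIP compression method: {compression_method}")
```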