# Process Pool Downloads

High-performance, multiprocessing-based download functionality that bypasses Python's Global Interpreter Lock (GIL) to improve throughput on multi-core systems. This module provides an alternative to the thread-based TransferManager for download-only scenarios that require maximum performance.

## Capabilities

### ProcessPoolDownloader

The main downloader class. It uses multiple processes for concurrent S3 downloads, providing true parallelism and better CPU utilization than thread-based approaches.

```python { .api }
class ProcessPoolDownloader:
    """
    Multiprocessing-based S3 downloader for high-performance downloads.

    Args:
        client_kwargs (dict, optional): Arguments for creating S3 clients
            in each process
        config (ProcessTransferConfig, optional): Configuration for
            download behavior
    """
    def __init__(self, client_kwargs=None, config=None): ...

    def download_file(self, bucket, key, filename, extra_args=None, expected_size=None):
        """
        Download an S3 object to a local file using multiple processes.

        Args:
            bucket (str): S3 bucket name
            key (str): S3 object key/name
            filename (str): Local file path to download to
            extra_args (dict, optional): Additional S3 operation arguments
            expected_size (int, optional): Expected size of the object
                (avoids a HEAD request)

        Returns:
            ProcessPoolTransferFuture: Future object for tracking download
            progress
        """

    def shutdown(self):
        """
        Shut down the downloader and wait for all downloads to complete.
        """

    def __enter__(self):
        """Context manager entry."""

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit with automatic shutdown."""
```

### ProcessTransferConfig

Configuration class for controlling ProcessPoolDownloader behavior, including the multipart threshold, chunk size, and process concurrency.

```python { .api }
class ProcessTransferConfig:
    """
    Configuration for ProcessPoolDownloader with multiprocessing-specific options.

    Args:
        multipart_threshold (int): Size threshold in bytes for switching to
            ranged downloads (default: 8 MB)
        multipart_chunksize (int): Size in bytes of each download chunk
            (default: 8 MB)
        max_request_processes (int): Maximum number of download processes
            (default: 10)
    """
    def __init__(
        self,
        multipart_threshold=8 * 1024 * 1024,
        multipart_chunksize=8 * 1024 * 1024,
        max_request_processes=10,
    ): ...

    multipart_threshold: int
    multipart_chunksize: int
    max_request_processes: int
```

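As a rough illustration of how the threshold and chunk size interact, the sketch below estimates how many GET requests a single download would issue under a given configuration. The `estimated_requests` helper is purely illustrative and not part of the s3transfer API; the library's actual request planning may differ in detail.

```python
import math

MB = 1024 * 1024

def estimated_requests(object_size, multipart_threshold=8 * MB,
                       multipart_chunksize=8 * MB):
    """Estimate the number of GET requests one download would use."""
    if object_size < multipart_threshold:
        return 1  # small objects are fetched in a single request
    # larger objects are split into chunk-sized ranged requests
    return math.ceil(object_size / multipart_chunksize)

print(estimated_requests(5 * MB))    # 1
print(estimated_requests(100 * MB))  # 13
```

Lowering `multipart_threshold` or `multipart_chunksize` increases the request count, which raises parallelism but also per-request overhead.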
### ProcessPoolTransferFuture

Future object representing a ProcessPool download operation, with methods for monitoring progress and retrieving results.

```python { .api }
class ProcessPoolTransferFuture:
    """
    Future representing a ProcessPool download request.
    """
    def done(self) -> bool:
        """
        Check if the download is complete.

        Returns:
            bool: True if download is complete (success or failure),
            False otherwise
        """

    def result(self):
        """
        Get the download result, blocking until complete.

        Returns:
            None: Returns None on successful completion

        Raises:
            Exception: Any exception that occurred during download
        """

    def cancel(self):
        """
        Cancel the download if possible.

        Returns:
            bool: True if cancellation was successful, False otherwise
        """

    @property
    def meta(self) -> 'ProcessPoolTransferMeta':
        """
        Transfer metadata object containing call arguments and status
        information.

        Returns:
            ProcessPoolTransferMeta: Metadata object for this download
        """
```

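The `done()`/`result()` contract follows the same pattern as `concurrent.futures`. The self-contained sketch below demonstrates that pattern with a stand-in function instead of a real S3 transfer, so it runs without credentials or network access; `fake_download` is a hypothetical placeholder, not part of s3transfer.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download():
    """Stand-in for a transfer; the real work would be an S3 download."""
    time.sleep(0.2)
    return None  # download futures resolve to None on success

with ThreadPoolExecutor() as executor:
    future = executor.submit(fake_download)

    # Poll done() so other work can be interleaved while the transfer runs
    while not future.done():
        time.sleep(0.05)  # placeholder for other work

    result = future.result()  # would raise if the download had failed
    print("transfer finished:", result)
```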
### ProcessPoolTransferMeta

Metadata container providing information about a ProcessPool download, including the call arguments and transfer ID.

```python { .api }
class ProcessPoolTransferMeta:
    """
    Metadata about a ProcessPoolTransferFuture containing call arguments
    and transfer information.
    """
    @property
    def call_args(self):
        """
        The original call arguments used for the download.

        Returns:
            CallArgs: Object containing method arguments (bucket, key,
            filename, etc.)
        """

    @property
    def transfer_id(self):
        """
        Unique identifier for this transfer.

        Returns:
            str: Transfer ID string
        """
```

## Usage Examples

### Basic ProcessPool Download

```python
from s3transfer.processpool import ProcessPoolDownloader, ProcessTransferConfig

# Create downloader with custom configuration
config = ProcessTransferConfig(
    multipart_threshold=16 * 1024 * 1024,  # 16 MB
    multipart_chunksize=8 * 1024 * 1024,   # 8 MB chunks
    max_request_processes=15,              # 15 concurrent processes
)

downloader = ProcessPoolDownloader(
    client_kwargs={'region_name': 'us-west-2'},
    config=config,
)

try:
    # Download a file
    future = downloader.download_file(
        'my-bucket', 'large-file.zip', '/tmp/downloaded-file.zip'
    )

    # Wait for completion
    future.result()  # Blocks until complete
    print("Download completed successfully")
finally:
    downloader.shutdown()
```

### Context Manager Usage

```python
from s3transfer.processpool import ProcessPoolDownloader

# Using the context manager for automatic cleanup
with ProcessPoolDownloader() as downloader:
    future = downloader.download_file('my-bucket', 'data.csv', '/tmp/data.csv')
    future.result()
    print("Download completed")
# Downloader automatically shut down on exit
```

### Multiple Concurrent Downloads

```python
from s3transfer.processpool import ProcessPoolDownloader

files_to_download = [
    ('my-bucket', 'file1.txt', '/tmp/file1.txt'),
    ('my-bucket', 'file2.txt', '/tmp/file2.txt'),
    ('my-bucket', 'file3.txt', '/tmp/file3.txt'),
]

with ProcessPoolDownloader() as downloader:
    futures = []

    # Start all downloads
    for bucket, key, filename in files_to_download:
        future = downloader.download_file(bucket, key, filename)
        futures.append(future)

    # Wait for all to complete
    for future in futures:
        future.result()

    print("All downloads completed")
```

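In the example above, one failed download would raise out of the loop and mask the remaining results. A common refinement is to collect per-file errors so every future is drained. The sketch below shows that pattern with a stand-in transfer function (`fake_transfer` is hypothetical, used so the snippet runs without S3); the same `try`/`except` loop applies unchanged to real ProcessPool futures.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_transfer(key):
    """Stand-in for a real download; fails for one specific key."""
    if key == 'missing.txt':
        raise FileNotFoundError(key)

keys = ['file1.txt', 'missing.txt', 'file3.txt']
errors = {}

with ThreadPoolExecutor() as executor:
    futures = {key: executor.submit(fake_transfer, key) for key in keys}
    for key, future in futures.items():
        try:
            future.result()  # same pattern applies to ProcessPool futures
        except Exception as exc:
            errors[key] = exc

print(sorted(errors))  # ['missing.txt']
```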
## Performance Considerations

### When to Use ProcessPool vs TransferManager

**Use ProcessPoolDownloader when:**
- Downloading many files concurrently
- Downloading very large files that require maximum throughput
- CPU resources are available (multi-core systems)
- Only downloads are needed (no uploads or copies)
- Python's GIL is a bottleneck in your application

**Use TransferManager when:**
- Mixed operations (uploads, downloads, copies) are needed
- Lower memory overhead is important
- A simpler threading model is preferred
- Working with smaller files or fewer concurrent operations

### Memory and Resource Usage

- ProcessPool uses more memory because each separate process carries its own overhead
- Each process maintains its own S3 client and connection pool
- Better CPU utilization on multi-core systems
- Higher throughput for I/O-intensive workloads
- Process isolation provides better fault tolerance

### Configuration Tuning

- **multipart_threshold**: Lower values make more downloads eligible for multi-process ranged transfers, raising throughput at the cost of extra overhead
- **multipart_chunksize**: Smaller chunks provide better parallelism; larger chunks reduce per-request overhead
- **max_request_processes**: Should typically match or slightly exceed the CPU core count
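
The core-count rule of thumb can be expressed directly in your own setup code. This is an illustrative sizing heuristic, not something s3transfer computes for you (ProcessTransferConfig defaults to a fixed 10 processes regardless of hardware):

```python
import os

# Rule of thumb: match or slightly exceed the available core count.
cores = os.cpu_count() or 1  # cpu_count() can return None
max_request_processes = cores + 2

# These kwargs could then be passed to ProcessTransferConfig(...)
config_kwargs = {'max_request_processes': max_request_processes}
print(config_kwargs)
```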