# Process Pool Downloads

High-performance multiprocessing-based download functionality that bypasses Python's Global Interpreter Lock (GIL) limitations for improved throughput on multi-core systems. This module provides an alternative to the thread-based TransferManager for download-only scenarios requiring maximum performance.

## Capabilities

### ProcessPoolDownloader

The main downloader class that uses multiple processes for concurrent S3 downloads, providing true parallelism and better CPU utilization compared to thread-based approaches.

```python { .api }
class ProcessPoolDownloader:
    """
    Multiprocessing-based S3 downloader for high-performance downloads.

    Args:
        client_kwargs (dict, optional): Arguments for creating S3 clients in each process
        config (ProcessTransferConfig, optional): Configuration for download behavior
    """
    def __init__(self, client_kwargs=None, config=None): ...

    def download_file(self, bucket, key, filename, extra_args=None, expected_size=None):
        """
        Download an S3 object to a local file using multiple processes.

        Args:
            bucket (str): S3 bucket name
            key (str): S3 object key/name
            filename (str): Local file path to download to
            extra_args (dict, optional): Additional S3 operation arguments
            expected_size (int, optional): Expected size of the object (avoids HEAD request)

        Returns:
            ProcessPoolTransferFuture: Future object for tracking download progress
        """

    def shutdown(self):
        """
        Shut down the downloader and wait for all downloads to complete.
        """

    def __enter__(self):
        """Context manager entry."""

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit with automatic shutdown."""
```

### ProcessTransferConfig

Configuration class for controlling ProcessPoolDownloader behavior, including multipart thresholds and process concurrency.

```python { .api }
class ProcessTransferConfig:
    """
    Configuration for ProcessPoolDownloader with multiprocessing-specific options.

    Args:
        multipart_threshold (int): Size threshold for ranged downloads (default: 8MB)
        multipart_chunksize (int): Size of each download chunk (default: 8MB)
        max_request_processes (int): Maximum number of download processes (default: 10)
    """
    def __init__(
        self,
        multipart_threshold=8 * 1024 * 1024,
        multipart_chunksize=8 * 1024 * 1024,
        max_request_processes=10,
    ): ...

    multipart_threshold: int
    multipart_chunksize: int
    max_request_processes: int
```
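To see how these settings interact, the arithmetic below estimates how many GET requests a single download fans out into. It assumes, per the descriptions above, that an object at or above `multipart_threshold` is fetched as `multipart_chunksize`-sized ranged requests:

```python
import math

# Defaults from ProcessTransferConfig
multipart_threshold = 8 * 1024 * 1024  # 8MB
multipart_chunksize = 8 * 1024 * 1024  # 8MB

def num_ranged_requests(object_size):
    """Estimate how many GET requests a download of object_size bytes
    fans out into under the settings above."""
    if object_size < multipart_threshold:
        return 1  # small objects are fetched with a single GET
    return math.ceil(object_size / multipart_chunksize)

print(num_ranged_requests(5 * 1024 * 1024))    # 5MB object -> 1
print(num_ranged_requests(100 * 1024 * 1024))  # 100MB object -> 13
```

Each ranged request can be scheduled on a separate process, so these counts bound how much parallelism one object's download can use.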

### ProcessPoolTransferFuture

Future object representing a ProcessPool download operation with methods for monitoring progress and retrieving results.

```python { .api }
class ProcessPoolTransferFuture:
    """
    Future representing a ProcessPool download request.
    """
    def done(self) -> bool:
        """
        Check if the download is complete.

        Returns:
            bool: True if download is complete (success or failure), False otherwise
        """

    def result(self):
        """
        Get the download result, blocking until complete.

        Returns:
            None: Returns None on successful completion

        Raises:
            Exception: Any exception that occurred during download
        """

    def cancel(self):
        """
        Cancel the download if possible.

        Returns:
            bool: True if cancellation was successful, False otherwise
        """

    @property
    def meta(self) -> 'ProcessPoolTransferMeta':
        """
        Transfer metadata object containing call arguments and status information.

        Returns:
            ProcessPoolTransferMeta: Metadata object for this download
        """
```

### ProcessPoolTransferMeta

Metadata container providing information about a ProcessPool download including call arguments and transfer ID.

```python { .api }
class ProcessPoolTransferMeta:
    """
    Metadata about a ProcessPoolTransferFuture containing call arguments and transfer information.
    """
    @property
    def call_args(self):
        """
        The original call arguments used for the download.

        Returns:
            CallArgs: Object containing method arguments (bucket, key, filename, etc.)
        """

    @property
    def transfer_id(self):
        """
        Unique identifier for this transfer.

        Returns:
            str: Transfer ID string
        """
```

## Usage Examples

### Basic ProcessPool Download

```python
from s3transfer.processpool import ProcessPoolDownloader, ProcessTransferConfig

# Create downloader with custom configuration
config = ProcessTransferConfig(
    multipart_threshold=16 * 1024 * 1024,  # 16MB
    multipart_chunksize=8 * 1024 * 1024,   # 8MB chunks
    max_request_processes=15,              # 15 concurrent processes
)

downloader = ProcessPoolDownloader(
    client_kwargs={'region_name': 'us-west-2'},
    config=config,
)

try:
    # Download a file
    future = downloader.download_file(
        'my-bucket', 'large-file.zip', '/tmp/downloaded-file.zip'
    )

    # Wait for completion
    future.result()  # Blocks until complete
    print("Download completed successfully")
finally:
    downloader.shutdown()
```

### Context Manager Usage

```python
from s3transfer.processpool import ProcessPoolDownloader

# Using context manager for automatic cleanup
with ProcessPoolDownloader() as downloader:
    future = downloader.download_file('my-bucket', 'data.csv', '/tmp/data.csv')
    future.result()
    print("Download completed")
# Downloader automatically shut down
```

### Multiple Concurrent Downloads

```python
from s3transfer.processpool import ProcessPoolDownloader

files_to_download = [
    ('my-bucket', 'file1.txt', '/tmp/file1.txt'),
    ('my-bucket', 'file2.txt', '/tmp/file2.txt'),
    ('my-bucket', 'file3.txt', '/tmp/file3.txt'),
]

with ProcessPoolDownloader() as downloader:
    futures = []

    # Start all downloads
    for bucket, key, filename in files_to_download:
        future = downloader.download_file(bucket, key, filename)
        futures.append(future)

    # Wait for all to complete
    for future in futures:
        future.result()

print("All downloads completed")
```
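Calling `result()` in a bare loop stops at the first failure, leaving the remaining futures unchecked. A sketch of a more forgiving pattern, collecting failures instead (`collect_results` is a hypothetical helper, demonstrated here with stub futures rather than real downloads):

```python
def collect_results(futures):
    """Wait on every future; return (index, exception) pairs for failures
    instead of stopping at the first one."""
    errors = []
    for i, future in enumerate(futures):
        try:
            future.result()
        except Exception as exc:
            errors.append((i, exc))
    return errors

# Demonstration with stub futures (one of which fails):
class _OkFuture:
    def result(self):
        return None

class _FailedFuture:
    def result(self):
        raise OSError("disk full")

errors = collect_results([_OkFuture(), _FailedFuture(), _OkFuture()])
print([(i, type(exc).__name__) for i, exc in errors])  # [(1, 'OSError')]
```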

## Performance Considerations

### When to Use ProcessPool vs TransferManager

**Use ProcessPoolDownloader when:**

- Downloading many files concurrently
- Downloading very large files requiring maximum throughput
- CPU resources are available (multi-core systems)
- Only downloads are needed (no uploads or copies)
- Python's GIL is a bottleneck in your application

**Use TransferManager when:**

- Mixed operations (uploads, downloads, copies) are needed
- Lower memory overhead is important
- A simpler threading model is preferred
- Working with smaller files or fewer concurrent operations

### Memory and Resource Usage

- ProcessPool uses more memory due to per-process overhead
- Each process maintains its own S3 client and connection pool
- Better CPU utilization on multi-core systems
- Higher throughput for I/O-intensive workloads
- Process isolation provides better fault tolerance
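A rough upper bound on in-flight download buffers follows from the configuration, under the simplifying assumption that each worker process holds at most one chunk in memory at a time; per-process interpreter and client overhead comes on top of this:

```python
# Defaults from ProcessTransferConfig
max_request_processes = 10
multipart_chunksize = 8 * 1024 * 1024  # 8MB

# If each worker buffers at most one chunk at a time, in-flight
# download buffers are bounded by roughly:
peak_in_flight = max_request_processes * multipart_chunksize
print(peak_in_flight // (1024 * 1024), "MB")  # 80 MB
```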

### Configuration Tuning

- **multipart_threshold**: Lower values trigger ranged, multi-process downloads for more objects, raising throughput at the cost of extra request overhead
- **multipart_chunksize**: Smaller chunks provide better parallelism; larger chunks reduce per-request overhead
- **max_request_processes**: Should typically match or slightly exceed the CPU core count
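One way to apply the last guideline is to derive the pool size from the machine's core count. A sketch (`suggested_max_request_processes` is a hypothetical helper, with a little headroom since downloads are I/O-bound):

```python
import os

def suggested_max_request_processes(headroom=2, cap=20):
    """Hypothetical sizing helper: core count plus a little headroom,
    capped to avoid oversubscribing the network."""
    cores = os.cpu_count() or 1
    return min(cores + headroom, cap)

print(suggested_max_request_processes())  # machine-dependent
```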