A library to parallelize pandas operations on all available CPUs with minimal code changes
npx @tessl/cli install tessl/pypi-pandarallel@1.6.00
# Pandarallel

An easy-to-use library that parallelizes pandas operations across all available CPUs with minimal code changes. Pandarallel turns standard pandas methods into parallelized versions: change `df.apply()` to `df.parallel_apply()` and you get parallel execution, automatic progress bars, and seamless integration into existing pandas workflows.

## Package Information

- **Package Name**: pandarallel
- **Language**: Python
- **Installation**: `pip install pandarallel`

## Core Imports

```python
from pandarallel import pandarallel
```

## Basic Usage

```python
from pandarallel import pandarallel
import pandas as pd
import math

# Initialize pandarallel to enable parallel processing
pandarallel.initialize(progress_bar=True)

# Create sample DataFrame
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [0.1, 0.2, 0.3, 0.4, 0.5]
})

# Define a function to apply
def compute_function(row):
    return math.sin(row.a**2) + math.sin(row.b**2)

# Use the parallel version instead of the regular apply
result = df.parallel_apply(compute_function, axis=1)

# Works with Series too
series_result = df.a.parallel_apply(lambda x: math.sqrt(x**2))

# And with groupby operations
grouped_result = df.groupby('a').parallel_apply(lambda group: group.b.sum())
```

## Capabilities

### Initialization

Configure pandarallel to enable parallel processing and add parallel methods to pandas objects.

```python { .api }
from typing import Optional

@classmethod
def initialize(
    cls,
    shm_size_mb=None,
    nb_workers=None,
    progress_bar=False,
    verbose=2,
    use_memory_fs: Optional[bool] = None
) -> None:
    """
    Initialize pandarallel and add parallel methods to pandas objects.

    Args:
        shm_size_mb (int, optional): Shared memory size in MB (deprecated parameter)
        nb_workers (int, optional): Number of worker processes. Defaults to the number of physical CPU cores (detected automatically)
        progress_bar (bool): Enable progress bars during parallel operations. Default: False
        verbose (int): Verbosity level (0=silent, 1=warnings, 2=info). Default: 2
        use_memory_fs (bool, optional): Use the memory file system for data transfer. Auto-detected if None

    Returns:
        None
    """
```

### DataFrame Parallel Methods

Parallelized versions of DataFrame operations that maintain the same API as their pandas counterparts.

```python { .api }
def parallel_apply(self, func, axis=0, raw=False, result_type=None, args=(), **kwargs):
    """
    Parallel version of DataFrame.apply().

    Args:
        func (function): Function to apply to each column or row
        axis (int or str): Axis along which func is applied (0/'index': apply to each column; 1/'columns': apply to each row)
        raw (bool): Pass a raw ndarray instead of a Series to the function
        result_type (str): Control the return type ('expand', 'reduce', 'broadcast')
        args (tuple): Positional arguments to pass to func
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func
    """

def parallel_applymap(self, func, na_action=None, **kwargs):
    """
    Parallel version of DataFrame.applymap().

    Args:
        func (function): Function to apply to each element
        na_action (str): Action to take for NaN values ('ignore' or None)
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        DataFrame: Result of applying func to each element
    """
```

### Series Parallel Methods

Parallelized versions of Series operations.

```python { .api }
def parallel_apply(self, func, convert_dtype=True, args=(), *, by_row='compat', **kwargs):
    """
    Parallel version of Series.apply().

    Args:
        func (function): Function to apply to each element
        convert_dtype (bool): Try to infer a better dtype for elementwise function results
        args (tuple): Positional arguments to pass to func
        by_row (str): Apply the function row-wise ('compat' for compatibility mode)
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func
    """

def parallel_map(self, arg, na_action=None, *args, **kwargs):
    """
    Parallel version of Series.map().

    Args:
        arg (function, dict, or Series): Mapping function or correspondence
        na_action (str): Action to take for NaN values ('ignore' or None)
        *args: Additional positional arguments to pass to the mapping function
        **kwargs: Additional keyword arguments to pass to the mapping function

    Returns:
        Series: Result of mapping values
    """
```

### GroupBy Parallel Methods

Parallelized versions of GroupBy operations.

```python { .api }
def parallel_apply(self, func, *args, **kwargs):
    """
    Parallel version of GroupBy.apply() for DataFrameGroupBy.

    Args:
        func (function): Function to apply to each group
        *args: Positional arguments to pass to func
        **kwargs: Keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func to each group
    """
```

### Rolling Window Parallel Methods

Parallelized versions of rolling window operations.

```python { .api }
def parallel_apply(self, func, raw=False, engine=None, engine_kwargs=None, args=(), **kwargs):
    """
    Parallel version of Rolling.apply().

    Args:
        func (function): Function to apply to each rolling window
        raw (bool): Pass a raw ndarray instead of a Series to the function
        engine (str): Execution engine ('cython' or 'numba')
        engine_kwargs (dict): Engine-specific kwargs
        args (tuple): Positional arguments to pass to func
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func to rolling windows
    """
```

### Rolling GroupBy Parallel Methods

Parallelized versions of rolling operations on grouped data.

```python { .api }
def parallel_apply(self, func, raw=False, engine=None, engine_kwargs=None, args=(), **kwargs):
    """
    Parallel version of RollingGroupby.apply().

    Args:
        func (function): Function to apply to each rolling group window
        raw (bool): Pass a raw ndarray instead of a Series to the function
        engine (str): Execution engine ('cython' or 'numba')
        engine_kwargs (dict): Engine-specific kwargs
        args (tuple): Positional arguments to pass to func
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func to rolling group windows
    """
```
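
There is no usage example for rolling-groupby below, so here is a minimal sketch using the serial pandas equivalent (the sample data is hypothetical). After `pandarallel.initialize()`, swapping `.apply` for `.parallel_apply` is the only change needed:

```python
import pandas as pd

# Hypothetical sample data: a few values per group
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# Rolling mean over 2-element windows, computed separately per group.
# With pandarallel initialized, .apply here becomes .parallel_apply.
result = (
    df.groupby('group')['value']
      .rolling(2)
      .apply(lambda window: window.mean(), raw=False)
)

print(result)
```

The result is indexed by `(group, original row index)`; the first window in each group is NaN because it is incomplete.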

### Expanding GroupBy Parallel Methods

Parallelized versions of expanding operations on grouped data.

```python { .api }
def parallel_apply(self, func, raw=False, engine=None, engine_kwargs=None, args=(), **kwargs):
    """
    Parallel version of ExpandingGroupby.apply().

    Args:
        func (function): Function to apply to each expanding group window
        raw (bool): Pass a raw ndarray instead of a Series to the function
        engine (str): Execution engine ('cython' or 'numba')
        engine_kwargs (dict): Engine-specific kwargs
        args (tuple): Positional arguments to pass to func
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func to expanding group windows
    """
```
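
As with rolling-groupby, a minimal serial sketch of the expanding-groupby pattern (hypothetical sample data; with pandarallel initialized, `.apply` becomes `.parallel_apply`):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Expanding (cumulative) sum per group: each window contains all rows
# of the group seen so far.
result = (
    df.groupby('group')['value']
      .expanding()
      .apply(lambda window: window.sum(), raw=False)
)

print(result)
```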

## Usage Examples

### DataFrame Operations

```python
import pandas as pd
import numpy as np
import math
from pandarallel import pandarallel

# Initialize with progress bars and four workers
pandarallel.initialize(progress_bar=True, nb_workers=4)

# Create sample data
df = pd.DataFrame({
    'a': np.random.randint(1, 8, 1000000),
    'b': np.random.rand(1000000)
})

# Parallel apply on rows (axis=1)
def row_function(row):
    return math.sin(row.a**2) + math.sin(row.b**2)

result = df.parallel_apply(row_function, axis=1)

# Parallel applymap on each element
def element_function(x):
    return math.sin(x**2) - math.cos(x**2)

result = df.parallel_applymap(element_function)
```

### Series Operations

```python
# Parallel apply on a Series
series = pd.Series(np.random.rand(1000000) + 1)

def series_function(x, power=2, bias=0):
    return math.log10(math.sqrt(math.exp(x**power))) + bias

result = series.parallel_apply(series_function, args=(2,), bias=3)

# Parallel map with a dictionary. The dictionary keys must match the
# Series values exactly (unmatched values become NaN), so use an
# integer Series with an integer-keyed mapping.
int_series = pd.Series(np.random.randint(1, 100, 1000000))
mapping = {i: i**2 for i in range(1, 100)}
result = int_series.parallel_map(mapping)
```

### GroupBy Operations

```python
# Create grouped data
df_grouped = pd.DataFrame({
    'group': np.random.randint(1, 100, 1000000),
    'value': np.random.rand(1000000)
})

def group_function(group_df):
    total = 0
    for item in group_df.value:
        total += math.log10(math.sqrt(math.exp(item**2)))
    return total / len(group_df.value)

result = df_grouped.groupby('group').parallel_apply(group_function)
```

### Rolling Window Operations

```python
# Rolling window with parallel apply
df_rolling = pd.DataFrame({
    'values': range(100000)
})

def rolling_function(window):
    return window.iloc[0] + window.iloc[1]**2 + window.iloc[2]**3

# Use bracket indexing here: df_rolling.values would return the
# underlying NumPy array, not the 'values' column
result = df_rolling['values'].rolling(4).parallel_apply(rolling_function, raw=False)
```

## Configuration Options

### Worker Count

```python
# Use a specific number of workers
pandarallel.initialize(nb_workers=8)

# Use all available physical CPU cores (default)
pandarallel.initialize()
```
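
When choosing `nb_workers` by hand, the machine's core count is the natural starting point. A minimal standard-library sketch (pandarallel itself detects physical cores automatically; the "leave one core free" heuristic is an assumption, not a pandarallel requirement):

```python
import os

# Logical core count as reported by the OS; physical cores are
# typically half of this when hyper-threading is enabled
n_cores = os.cpu_count() or 1

# Common heuristic: leave one core free for the main process and the OS
nb_workers = max(1, n_cores - 1)
print(nb_workers)
```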

### Progress Bars

```python
# Enable progress bars
pandarallel.initialize(progress_bar=True)

# Disable progress bars (default)
pandarallel.initialize(progress_bar=False)
```

### Memory File System

```python
# Force use of the memory file system (faster for large data)
pandarallel.initialize(use_memory_fs=True)

# Force use of pipes (more compatible)
pandarallel.initialize(use_memory_fs=False)

# Auto-detect (default): uses the memory file system if /dev/shm is available
pandarallel.initialize()
```

### Verbosity Control

```python
# Silent mode
pandarallel.initialize(verbose=0)

# Show warnings only
pandarallel.initialize(verbose=1)

# Show info messages (default)
pandarallel.initialize(verbose=2)
```

## Error Handling

All parallel methods keep the same error-handling behavior as their pandas counterparts: if an exception occurs in any worker process, the entire operation fails and the exception is raised.

Common considerations:
- Ensure functions passed to parallel methods are serializable (avoid closures over local variables)
- Functions should not rely on global state that might not be available in worker processes
- On Windows, the multiprocessing context uses 'spawn', which requires functions to be importable
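
The serializability point can be checked up front with the standard library. Note that pandarallel depends on `dill`, which serializes more cases than plain `pickle` (which is why the lambda examples above work), but importable module-level functions remain the safest choice under 'spawn':

```python
import pickle
import math

# Importable, module-level callables pickle cleanly (stored by
# reference), so spawned worker processes can look them up
data = pickle.dumps(math.sqrt)
restored = pickle.loads(data)
print(restored(9.0))

# A bare lambda is rejected by the standard pickle module
try:
    pickle.dumps(lambda x: x + 1)
    print("lambda pickled")
except (pickle.PicklingError, AttributeError, TypeError):
    print("lambda is not picklable with standard pickle")
```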

## Performance Considerations

- Parallel processing adds overhead, so it pays off mainly for computationally intensive operations
- The memory file system (`use_memory_fs=True`) provides better performance for large datasets
- Progress bars add slight overhead but provide useful feedback for long-running operations
- The worker count should typically match the number of physical CPU cores for optimal performance
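
The overhead point can be made concrete with a back-of-the-envelope break-even model: parallelism wins roughly when the serial compute time exceeds the fixed cost of starting workers plus shipping the data. A hypothetical sketch (the constants are illustrative assumptions, not measured pandarallel figures):

```python
def parallel_worthwhile(n_rows, secs_per_row, n_workers,
                        fixed_overhead_s=0.5, per_row_transfer_s=1e-6):
    """Rough break-even model: parallel time = fixed startup cost
    + data-transfer cost + compute split across workers.
    All constants are illustrative assumptions."""
    serial = n_rows * secs_per_row
    parallel = (fixed_overhead_s
                + n_rows * per_row_transfer_s
                + serial / n_workers)
    return parallel < serial

# Cheap function on few rows: fixed overhead dominates
print(parallel_worthwhile(1_000, 1e-7, 4))      # → False
# Expensive function on many rows: compute dominates
print(parallel_worthwhile(1_000_000, 1e-4, 4))  # → True
```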