A library to parallelize pandas operations on all available CPUs with minimal code changes
npx @tessl/cli install tessl/pypi-pandarallel@1.6.00
# Pandarallel

An easy-to-use library that parallelizes pandas operations across all available CPUs with minimal code changes. Pandarallel turns standard pandas methods into parallelized versions: change `df.apply()` to `df.parallel_apply()` and you get parallel execution, automatic progress bars, and seamless integration into existing pandas workflows.

## Package Information

- **Package Name**: pandarallel
- **Language**: Python
- **Installation**: `pip install pandarallel`

## Core Imports

```python
from pandarallel import pandarallel
```

## Basic Usage

```python
from pandarallel import pandarallel
import pandas as pd
import math

# Initialize pandarallel to enable parallel processing
pandarallel.initialize(progress_bar=True)

# Create sample DataFrame
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [0.1, 0.2, 0.3, 0.4, 0.5]
})

# Define a function to apply
def compute_function(row):
    return math.sin(row.a**2) + math.sin(row.b**2)

# Use the parallel version instead of the regular apply
result = df.parallel_apply(compute_function, axis=1)

# Works with Series too
series_result = df.a.parallel_apply(lambda x: math.sqrt(x**2))

# And with groupby operations
grouped_result = df.groupby('a').parallel_apply(lambda group: group.b.sum())
```

## Capabilities

### Initialization

Configure pandarallel to enable parallel processing and add parallel methods to pandas objects.

```python { .api }
from typing import Optional

@classmethod
def initialize(
    cls,
    shm_size_mb=None,
    nb_workers=None,
    progress_bar=False,
    verbose=2,
    use_memory_fs: Optional[bool] = None
) -> None:
    """
    Initialize pandarallel and add parallel methods to pandas objects.

    Args:
        shm_size_mb (int, optional): Shared memory size in MB (deprecated parameter)
        nb_workers (int, optional): Number of worker processes. Defaults to the number of physical CPU cores (detected automatically)
        progress_bar (bool): Enable progress bars during parallel operations. Default: False
        verbose (int): Verbosity level (0=silent, 1=warnings, 2=info). Default: 2
        use_memory_fs (bool, optional): Use the memory file system for data transfer. Auto-detected if None

    Returns:
        None
    """
```

### DataFrame Parallel Methods

Parallelized versions of DataFrame operations that maintain the same API as their pandas counterparts.

```python { .api }
def parallel_apply(self, func, axis=0, raw=False, result_type=None, args=(), **kwargs):
    """
    Parallel version of DataFrame.apply().

    Args:
        func (function): Function to apply to each column or row
        axis (int or str): Axis along which func is applied (0/'index': apply to each column; 1/'columns': apply to each row)
        raw (bool): Pass a raw ndarray instead of a Series to the function
        result_type (str): Control the return type ('expand', 'reduce', 'broadcast')
        args (tuple): Positional arguments to pass to func
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func
    """

def parallel_applymap(self, func, na_action=None, **kwargs):
    """
    Parallel version of DataFrame.applymap().

    Args:
        func (function): Function to apply to each element
        na_action (str): Action to take for NaN values ('ignore' or None)
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        DataFrame: Result of applying func to each element
    """
```

### Series Parallel Methods

Parallelized versions of Series operations.

```python { .api }
def parallel_apply(self, func, convert_dtype=True, args=(), *, by_row='compat', **kwargs):
    """
    Parallel version of Series.apply().

    Args:
        func (function): Function to apply to each element
        convert_dtype (bool): Try to infer a better dtype for elementwise function results
        args (tuple): Positional arguments to pass to func
        by_row (str): Apply the function row-wise ('compat' for compatibility mode)
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func
    """

def parallel_map(self, arg, na_action=None, *args, **kwargs):
    """
    Parallel version of Series.map().

    Args:
        arg (function, dict, or Series): Mapping function or correspondence
        na_action (str): Action to take for NaN values ('ignore' or None)
        *args: Additional positional arguments to pass to the mapping function
        **kwargs: Additional keyword arguments to pass to the mapping function

    Returns:
        Series: Result of mapping values
    """
```

### GroupBy Parallel Methods

Parallelized versions of GroupBy operations.

```python { .api }
def parallel_apply(self, func, *args, **kwargs):
    """
    Parallel version of GroupBy.apply() for DataFrameGroupBy.

    Args:
        func (function): Function to apply to each group
        *args: Positional arguments to pass to func
        **kwargs: Keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func to each group
    """
```

### Rolling Window Parallel Methods

Parallelized versions of rolling window operations.

```python { .api }
def parallel_apply(self, func, raw=False, engine=None, engine_kwargs=None, args=(), **kwargs):
    """
    Parallel version of Rolling.apply().

    Args:
        func (function): Function to apply to each rolling window
        raw (bool): Pass a raw ndarray instead of a Series to the function
        engine (str): Execution engine ('cython' or 'numba')
        engine_kwargs (dict): Engine-specific kwargs
        args (tuple): Positional arguments to pass to func
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func to rolling windows
    """
```

### Rolling GroupBy Parallel Methods

Parallelized versions of rolling operations on grouped data.

```python { .api }
def parallel_apply(self, func, raw=False, engine=None, engine_kwargs=None, args=(), **kwargs):
    """
    Parallel version of RollingGroupby.apply().

    Args:
        func (function): Function to apply to each rolling group window
        raw (bool): Pass a raw ndarray instead of a Series to the function
        engine (str): Execution engine ('cython' or 'numba')
        engine_kwargs (dict): Engine-specific kwargs
        args (tuple): Positional arguments to pass to func
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func to rolling group windows
    """
```
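
There is no usage example for rolling-groupby below, so here is a minimal sketch using the serial pandas equivalent (the sample data is hypothetical). After `pandarallel.initialize()`, swapping `.apply` for `.parallel_apply` is the only change needed:

```python
import pandas as pd

# Hypothetical sample data: a few values per group
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# Rolling mean over 2-element windows, computed separately per group.
# With pandarallel initialized, .apply here becomes .parallel_apply.
result = (
    df.groupby('group')['value']
      .rolling(2)
      .apply(lambda window: window.mean(), raw=False)
)

print(result)
```

The result is indexed by `(group, original row index)`; the first window in each group is NaN because it is incomplete.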

### Expanding GroupBy Parallel Methods

Parallelized versions of expanding operations on grouped data.

```python { .api }
def parallel_apply(self, func, raw=False, engine=None, engine_kwargs=None, args=(), **kwargs):
    """
    Parallel version of ExpandingGroupby.apply().

    Args:
        func (function): Function to apply to each expanding group window
        raw (bool): Pass a raw ndarray instead of a Series to the function
        engine (str): Execution engine ('cython' or 'numba')
        engine_kwargs (dict): Engine-specific kwargs
        args (tuple): Positional arguments to pass to func
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func to expanding group windows
    """
```
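
As with rolling-groupby, a minimal serial sketch of the expanding-groupby pattern (hypothetical sample data; with pandarallel initialized, `.apply` becomes `.parallel_apply`):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Expanding (cumulative) sum per group: each window contains all rows
# of the group seen so far.
result = (
    df.groupby('group')['value']
      .expanding()
      .apply(lambda window: window.sum(), raw=False)
)

print(result)
```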

## Usage Examples

### DataFrame Operations

```python
import pandas as pd
import numpy as np
import math
from pandarallel import pandarallel

# Initialize with progress bars and four workers
pandarallel.initialize(progress_bar=True, nb_workers=4)

# Create sample data
df = pd.DataFrame({
    'a': np.random.randint(1, 8, 1000000),
    'b': np.random.rand(1000000)
})

# Parallel apply on rows (axis=1)
def row_function(row):
    return math.sin(row.a**2) + math.sin(row.b**2)

result = df.parallel_apply(row_function, axis=1)

# Parallel applymap on each element
def element_function(x):
    return math.sin(x**2) - math.cos(x**2)

result = df.parallel_applymap(element_function)
```

### Series Operations

```python
# Parallel apply on a Series
series = pd.Series(np.random.rand(1000000) + 1)

def series_function(x, power=2, bias=0):
    return math.log10(math.sqrt(math.exp(x**power))) + bias

result = series.parallel_apply(series_function, args=(2,), bias=3)

# Parallel map with a dictionary. The dictionary keys must match the
# Series values exactly (unmatched values become NaN), so use an
# integer Series with an integer-keyed mapping.
int_series = pd.Series(np.random.randint(1, 100, 1000000))
mapping = {i: i**2 for i in range(1, 100)}
result = int_series.parallel_map(mapping)
```

### GroupBy Operations

```python
# Create grouped data
df_grouped = pd.DataFrame({
    'group': np.random.randint(1, 100, 1000000),
    'value': np.random.rand(1000000)
})

def group_function(group_df):
    total = 0
    for item in group_df.value:
        total += math.log10(math.sqrt(math.exp(item**2)))
    return total / len(group_df.value)

result = df_grouped.groupby('group').parallel_apply(group_function)
```

### Rolling Window Operations

```python
# Rolling window with parallel apply
df_rolling = pd.DataFrame({
    'values': range(100000)
})

def rolling_function(window):
    return window.iloc[0] + window.iloc[1]**2 + window.iloc[2]**3

# Use bracket indexing here: df_rolling.values would return the
# underlying NumPy array, not the 'values' column
result = df_rolling['values'].rolling(4).parallel_apply(rolling_function, raw=False)
```

## Configuration Options

### Worker Count

```python
# Use a specific number of workers
pandarallel.initialize(nb_workers=8)

# Use all available physical CPU cores (default)
pandarallel.initialize()
```
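
When choosing `nb_workers` by hand, the machine's core count is the natural starting point. A minimal standard-library sketch (pandarallel itself detects physical cores automatically; the "leave one core free" heuristic is an assumption, not a pandarallel requirement):

```python
import os

# Logical core count as reported by the OS; physical cores are
# typically half of this when hyper-threading is enabled
n_cores = os.cpu_count() or 1

# Common heuristic: leave one core free for the main process and the OS
nb_workers = max(1, n_cores - 1)
print(nb_workers)
```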

### Progress Bars

```python
# Enable progress bars
pandarallel.initialize(progress_bar=True)

# Disable progress bars (default)
pandarallel.initialize(progress_bar=False)
```

### Memory File System

```python
# Force use of the memory file system (faster for large data)
pandarallel.initialize(use_memory_fs=True)

# Force use of pipes (more compatible)
pandarallel.initialize(use_memory_fs=False)

# Auto-detect (default): uses the memory file system if /dev/shm is available
pandarallel.initialize()
```

### Verbosity Control

```python
# Silent mode
pandarallel.initialize(verbose=0)

# Show warnings only
pandarallel.initialize(verbose=1)

# Show info messages (default)
pandarallel.initialize(verbose=2)
```

## Error Handling

All parallel methods keep the same error-handling behavior as their pandas counterparts: if an exception occurs in any worker process, the entire operation fails and the exception is raised.

Common considerations:
- Ensure functions passed to parallel methods are serializable (avoid closures over local variables)
- Functions should not rely on global state that might not be available in worker processes
- On Windows, the multiprocessing context uses 'spawn', which requires functions to be importable
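
The serializability point can be checked up front with the standard library. Note that pandarallel depends on `dill`, which serializes more cases than plain `pickle` (which is why the lambda examples above work), but importable module-level functions remain the safest choice under 'spawn':

```python
import pickle
import math

# Importable, module-level callables pickle cleanly (stored by
# reference), so spawned worker processes can look them up
data = pickle.dumps(math.sqrt)
restored = pickle.loads(data)
print(restored(9.0))

# A bare lambda is rejected by the standard pickle module
try:
    pickle.dumps(lambda x: x + 1)
    print("lambda pickled")
except (pickle.PicklingError, AttributeError, TypeError):
    print("lambda is not picklable with standard pickle")
```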

## Performance Considerations

- Parallel processing adds overhead, so it pays off mainly for computationally intensive operations
- The memory file system (`use_memory_fs=True`) provides better performance for large datasets
- Progress bars add slight overhead but provide useful feedback for long-running operations
- The worker count should typically match the number of physical CPU cores for optimal performance
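
The overhead point can be made concrete with a back-of-the-envelope break-even model: parallelism wins roughly when the serial compute time exceeds the fixed cost of starting workers plus shipping the data. A hypothetical sketch (the constants are illustrative assumptions, not measured pandarallel figures):

```python
def parallel_worthwhile(n_rows, secs_per_row, n_workers,
                        fixed_overhead_s=0.5, per_row_transfer_s=1e-6):
    """Rough break-even model: parallel time = fixed startup cost
    + data-transfer cost + compute split across workers.
    All constants are illustrative assumptions."""
    serial = n_rows * secs_per_row
    parallel = (fixed_overhead_s
                + n_rows * per_row_transfer_s
                + serial / n_workers)
    return parallel < serial

# Cheap function on few rows: fixed overhead dominates
print(parallel_worthwhile(1_000, 1e-7, 4))      # → False
# Expensive function on many rows: compute dominates
print(parallel_worthwhile(1_000_000, 1e-4, 4))  # → True
```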