
# Pandarallel

An easy-to-use library that parallelizes pandas operations across all available CPUs with minimal code changes. Pandarallel transforms standard pandas methods into parallelized versions by simply changing method calls from `df.apply()` to `df.parallel_apply()`, providing automatic progress bars and seamless integration into existing pandas workflows.

## Package Information

- **Package Name**: pandarallel
- **Language**: Python
- **Installation**: `pip install pandarallel`

## Core Imports

```python
from pandarallel import pandarallel
```

## Basic Usage

```python
from pandarallel import pandarallel
import pandas as pd
import math

# Initialize pandarallel to enable parallel processing
pandarallel.initialize(progress_bar=True)

# Create sample DataFrame
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [0.1, 0.2, 0.3, 0.4, 0.5]
})

# Define a function to apply
def compute_function(row):
    return math.sin(row.a**2) + math.sin(row.b**2)

# Use the parallel version instead of the regular apply
result = df.parallel_apply(compute_function, axis=1)

# Works with Series too
series_result = df.a.parallel_apply(lambda x: math.sqrt(x**2))

# And with groupby operations
grouped_result = df.groupby('a').parallel_apply(lambda group: group.b.sum())
```

## Capabilities

### Initialization

Configure pandarallel to enable parallel processing and add parallel methods to pandas objects.

```python { .api }
from typing import Optional

@classmethod
def initialize(
    cls,
    shm_size_mb=None,
    nb_workers=None,
    progress_bar=False,
    verbose=2,
    use_memory_fs: Optional[bool] = None
) -> None:
    """
    Initialize pandarallel and add parallel methods to pandas objects.

    Args:
        shm_size_mb (int, optional): Shared memory size in MB (deprecated parameter)
        nb_workers (int, optional): Number of worker processes. Defaults to the number of physical CPU cores (detected automatically)
        progress_bar (bool): Enable progress bars during parallel operations. Default: False
        verbose (int): Verbosity level (0=silent, 1=warnings, 2=info). Default: 2
        use_memory_fs (bool, optional): Use a memory file system for data transfer. Auto-detected if None

    Returns:
        None
    """
```

### DataFrame Parallel Methods

Parallelized versions of DataFrame operations that maintain the same API as their pandas counterparts.

```python { .api }
def parallel_apply(self, func, axis=0, raw=False, result_type=None, args=(), **kwargs):
    """
    Parallel version of DataFrame.apply().

    Args:
        func (function): Function to apply to each column or row
        axis (int or str): Axis along which func is applied (0/'index': apply to each column, 1/'columns': apply to each row)
        raw (bool): Pass a raw ndarray instead of a Series to func
        result_type (str): Control the return type ('expand', 'reduce', or 'broadcast')
        args (tuple): Positional arguments to pass to func
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func
    """

def parallel_applymap(self, func, na_action=None, **kwargs):
    """
    Parallel version of DataFrame.applymap().

    Args:
        func (function): Function to apply to each element
        na_action (str): Action to take for NaN values ('ignore' or None)
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        DataFrame: Result of applying func to each element
    """
```

### Series Parallel Methods

Parallelized versions of Series operations.

```python { .api }
def parallel_apply(self, func, convert_dtype=True, args=(), *, by_row='compat', **kwargs):
    """
    Parallel version of Series.apply().

    Args:
        func (function): Function to apply to each element
        convert_dtype (bool): Try to infer a better dtype for elementwise function results
        args (tuple): Positional arguments to pass to func
        by_row (str): Apply the function row-wise ('compat' for compatibility mode)
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func
    """

def parallel_map(self, arg, na_action=None, *args, **kwargs):
    """
    Parallel version of Series.map().

    Args:
        arg (function, dict, or Series): Mapping function or correspondence
        na_action (str): Action to take for NaN values ('ignore' or None)
        *args: Additional positional arguments to pass to the mapping function
        **kwargs: Additional keyword arguments to pass to the mapping function

    Returns:
        Series: Result of mapping values
    """
```

### GroupBy Parallel Methods

Parallelized versions of GroupBy operations.

```python { .api }
def parallel_apply(self, func, *args, **kwargs):
    """
    Parallel version of GroupBy.apply() for DataFrameGroupBy.

    Args:
        func (function): Function to apply to each group
        *args: Positional arguments to pass to func
        **kwargs: Keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func to each group
    """
```

### Rolling Window Parallel Methods

Parallelized versions of rolling window operations.

```python { .api }
def parallel_apply(self, func, raw=False, engine=None, engine_kwargs=None, args=(), **kwargs):
    """
    Parallel version of Rolling.apply().

    Args:
        func (function): Function to apply to each rolling window
        raw (bool): Pass a raw ndarray instead of a Series to func
        engine (str): Execution engine ('cython' or 'numba')
        engine_kwargs (dict): Engine-specific keyword arguments
        args (tuple): Positional arguments to pass to func
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func to rolling windows
    """
```

### Rolling GroupBy Parallel Methods

Parallelized versions of rolling operations on grouped data.

```python { .api }
def parallel_apply(self, func, raw=False, engine=None, engine_kwargs=None, args=(), **kwargs):
    """
    Parallel version of RollingGroupby.apply().

    Args:
        func (function): Function to apply to each rolling window within each group
        raw (bool): Pass a raw ndarray instead of a Series to func
        engine (str): Execution engine ('cython' or 'numba')
        engine_kwargs (dict): Engine-specific keyword arguments
        args (tuple): Positional arguments to pass to func
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func to rolling group windows
    """
```

### Expanding GroupBy Parallel Methods

Parallelized versions of expanding operations on grouped data.

```python { .api }
def parallel_apply(self, func, raw=False, engine=None, engine_kwargs=None, args=(), **kwargs):
    """
    Parallel version of ExpandingGroupby.apply().

    Args:
        func (function): Function to apply to each expanding window within each group
        raw (bool): Pass a raw ndarray instead of a Series to func
        engine (str): Execution engine ('cython' or 'numba')
        engine_kwargs (dict): Engine-specific keyword arguments
        args (tuple): Positional arguments to pass to func
        **kwargs: Additional keyword arguments to pass to func

    Returns:
        Series or DataFrame: Result of applying func to expanding group windows
    """
```

## Usage Examples

### DataFrame Operations

```python
import pandas as pd
import numpy as np
import math
from pandarallel import pandarallel

# Initialize with progress bars
pandarallel.initialize(progress_bar=True, nb_workers=4)

# Create sample data
df = pd.DataFrame({
    'a': np.random.randint(1, 8, 1000000),
    'b': np.random.rand(1000000)
})

# Parallel apply on rows (axis=1)
def row_function(row):
    return math.sin(row.a**2) + math.sin(row.b**2)

result = df.parallel_apply(row_function, axis=1)

# Parallel applymap on each element
def element_function(x):
    return math.sin(x**2) - math.cos(x**2)

result = df.parallel_applymap(element_function)
```

### Series Operations

```python
# Parallel apply on a Series
series = pd.Series(np.random.rand(1000000) + 1)

def series_function(x, power=2, bias=0):
    return math.log10(math.sqrt(math.exp(x**power))) + bias

result = series.parallel_apply(series_function, args=(2,), bias=3)

# Parallel map with a dictionary (dict keys must match the Series values,
# so use an integer Series here rather than the float one above)
int_series = pd.Series(np.random.randint(1, 100, 1000000))
mapping = {i: i**2 for i in range(1, 100)}
result = int_series.parallel_map(mapping)
```

### GroupBy Operations

```python
# Create grouped data
df_grouped = pd.DataFrame({
    'group': np.random.randint(1, 100, 1000000),
    'value': np.random.rand(1000000)
})

def group_function(group_df):
    total = 0
    for item in group_df.value:
        total += math.log10(math.sqrt(math.exp(item**2)))
    return total / len(group_df.value)

result = df_grouped.groupby('group').parallel_apply(group_function)
```

### Rolling Window Operations

```python
# Rolling window with parallel apply
df_rolling = pd.DataFrame({
    'values': range(100000)
})

def rolling_function(window):
    return window.iloc[0] + window.iloc[1]**2 + window.iloc[2]**3

# Note: use bracket access for the column; `df_rolling.values` would
# return the underlying NumPy array, not the 'values' column
result = df_rolling['values'].rolling(4).parallel_apply(rolling_function, raw=False)
```

## Configuration Options

### Worker Count

```python
# Use a specific number of workers
pandarallel.initialize(nb_workers=8)

# Use all available physical CPU cores (default)
pandarallel.initialize()
```

### Progress Bars

```python
# Enable progress bars
pandarallel.initialize(progress_bar=True)

# Disable progress bars (default)
pandarallel.initialize(progress_bar=False)
```

### Memory File System

```python
# Force use of the memory file system (faster for large data)
pandarallel.initialize(use_memory_fs=True)

# Force use of pipes (more compatible)
pandarallel.initialize(use_memory_fs=False)

# Auto-detect (default): uses the memory file system if /dev/shm is available
pandarallel.initialize()
```

### Verbosity Control

```python
# Silent mode
pandarallel.initialize(verbose=0)

# Show warnings only
pandarallel.initialize(verbose=1)

# Show info messages (default)
pandarallel.initialize(verbose=2)
```

## Error Handling

All parallel methods maintain the same error handling behavior as their pandas counterparts. If an exception occurs in any worker process, the entire operation fails and the exception is raised.

Common considerations:

- Ensure functions passed to parallel methods are serializable (avoid closures over local variables)
- Functions should not rely on global state that might not be available in worker processes
- On Windows, the multiprocessing context uses 'spawn', which requires functions to be importable

## Performance Considerations

- Parallel processing adds overhead, so it is best suited to computationally intensive operations
- The memory file system (`use_memory_fs=True`) provides better performance for large datasets
- Progress bars add slight overhead but provide useful feedback for long-running operations
- Worker count should typically match the number of physical CPU cores for optimal performance