
# Pandas Compatibility

cuDF provides pandas compatibility through `cudf.pandas`, which automatically accelerates existing pandas code on the GPU and falls back transparently to CPU pandas for unsupported operations.

## Import Statements

```python
# Pandas acceleration mode
import cudf.pandas
cudf.pandas.install()  # Enable automatic acceleration

# Profiling utilities
from cudf.pandas import Profiler

# IPython integration
%load_ext cudf.pandas  # In Jupyter/IPython

# Proxy utilities
from cudf.pandas import (
    as_proxy_object, is_proxy_object, is_proxy_instance
)
```
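Because `cudf.pandas` is importable only in environments with a CUDA-capable GPU and cuDF installed, portable scripts often guard the import. A minimal sketch (the helper name `try_enable_cudf_pandas` is ours, not part of the library):

```python
# Hypothetical helper (not part of cuDF): turn on acceleration when the
# GPU stack is present, and keep plain CPU pandas otherwise.
def try_enable_cudf_pandas() -> bool:
    try:
        import cudf.pandas  # importable only in GPU-enabled environments
    except ImportError:
        return False  # stock CPU pandas stays in effect
    cudf.pandas.install()
    return True

# Call this before importing pandas so subsequent imports are proxied.
enabled = try_enable_cudf_pandas()
print("cudf.pandas acceleration:", "on" if enabled else "off")
```

The same script then runs unchanged on both GPU and CPU-only machines.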

## Acceleration Mode

Drop-in replacement system that automatically accelerates pandas operations on the GPU when beneficial.

```{ .api }
def install() -> None:
    """
    Enable cuDF pandas accelerator mode for automatic GPU acceleration.

    Installs cuDF as a pandas accelerator that intercepts pandas operations
    and routes them to the GPU when possible. Provides transparent fallback
    to CPU pandas for unsupported operations.

    After installation, existing pandas code automatically benefits from
    GPU acceleration without modification. Operations that cannot be
    accelerated fall back to pandas seamlessly.

    Features:
    - Automatic GPU acceleration for supported operations
    - Transparent fallback to CPU pandas for unsupported operations
    - Zero code changes required for existing pandas workflows
    - Maintains pandas API compatibility and behavior
    - Intelligent routing based on data size and operation type

    Examples:
        # Enable acceleration globally
        import cudf.pandas
        cudf.pandas.install()

        # Now pandas operations automatically use the GPU when beneficial
        import pandas as pd
        df = pd.DataFrame({'x': range(1000000), 'y': range(1000000)})
        result = df.groupby('x').sum()  # Automatically uses GPU

        # Unsupported operations fall back to CPU pandas
        result = df.some_unsupported_operation()  # Uses CPU pandas

        # Works with existing pandas code unchanged
        df.to_csv('output.csv')  # GPU-accelerated I/O when possible
    """
```

## Performance Profiling

Tools for analyzing pandas code to identify GPU acceleration opportunities.

```{ .api }
class Profiler:
    """
    Performance profiler for pandas acceleration opportunities.

    Analyzes pandas operations to identify performance bottlenecks and
    acceleration potential. Provides insight into which operations
    benefit from GPU acceleration and the performance improvements achieved.

    Attributes:
        results: dict containing profiling results and statistics

    Methods:
        start(): Begin profiling pandas operations
        stop(): End profiling and collect results
        print_stats(): Display profiling statistics
        get_results(): Return detailed profiling data

    Examples:
        # Basic profiling workflow
        import cudf.pandas
        cudf.pandas.install()

        profiler = cudf.pandas.Profiler()
        profiler.start()

        # Run pandas operations to profile
        import pandas as pd
        df = pd.DataFrame({'A': range(10000), 'B': range(10000)})
        result1 = df.groupby('A').sum()
        result2 = df.merge(df, on='A')
        result3 = df.sort_values('B')

        profiler.stop()
        profiler.print_stats()

        # Get detailed results
        stats = profiler.get_results()
        print(f"GPU accelerated operations: {stats['gpu_ops']}")
        print(f"CPU fallback operations: {stats['cpu_ops']}")
        print(f"Total speedup: {stats['speedup']:.2f}x")
    """

    def start(self) -> None:
        """
        Begin profiling pandas operations.

        Starts collecting performance metrics for pandas operations,
        including execution time, memory usage, and routing decisions.
        """

    def stop(self) -> None:
        """
        End profiling and collect results.

        Stops profiling and computes final statistics, including
        performance improvements and operation categorization.
        """

    def print_stats(self) -> None:
        """
        Display profiling statistics in a readable format.

        Prints a summary of profiled operations including:
        - Total operations analyzed
        - GPU vs CPU operation breakdown
        - Performance improvements achieved
        - Memory usage patterns
        - Recommendations for optimization
        """

    def get_results(self) -> dict:
        """
        Return detailed profiling data as a dictionary.

        Returns:
            dict: Comprehensive profiling results containing:
                - operation_times: Execution times for each operation
                - routing_decisions: GPU vs CPU routing for operations
                - memory_usage: Memory consumption patterns
                - speedups: Performance improvements achieved
                - recommendations: Optimization suggestions
        """
```

## IPython Integration

Magic commands and extensions for Jupyter notebook integration.

```{ .api }
def load_ipython_extension(ipython) -> None:
    """
    Load the cuDF pandas IPython extension for notebook integration.

    Provides magic commands and enhanced display formatting for
    cuDF pandas operations in Jupyter notebooks and IPython.

    Magic Commands Available:
        %%cudf_pandas_profile: Profile cell operations for acceleration opportunities
        %cudf_pandas_status: Show current acceleration status and statistics
        %cudf_pandas_fallback: Display recent fallback operations and reasons

    Parameters:
        ipython: IPython.InteractiveShell
            IPython shell instance to extend

    Examples:
        # In Jupyter notebook
        %load_ext cudf.pandas

        # Profile a cell's operations
        %%cudf_pandas_profile
        import pandas as pd
        df = pd.DataFrame({'A': range(10000)})
        result = df.groupby('A').count()

        # Check acceleration status
        %cudf_pandas_status

        # See fallback operations
        %cudf_pandas_fallback
    """
```

## Proxy Object System

Utilities for working with the proxy object system that enables transparent acceleration.

```{ .api }
def as_proxy_object(obj, typ=None) -> object:
    """
    Wrap an object as a proxy for pandas acceleration.

    Creates a proxy object that intercepts method calls and routes them
    to the appropriate backend (GPU cuDF or CPU pandas). Used internally
    by the acceleration system.

    Parameters:
        obj: Any
            Object to wrap as a proxy (typically a cuDF object)
        typ: type, optional
            Target proxy type (typically a pandas type)

    Returns:
        object: Proxy object that behaves like pandas but uses the cuDF backend

    Examples:
        # Typically used internally, but can be used explicitly
        import cudf
        cudf_df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

        # Create a proxy that behaves like a pandas DataFrame
        proxy_df = cudf.pandas.as_proxy_object(cudf_df)

        # Proxy behaves like pandas but uses the cuDF backend
        result = proxy_df.sum()  # Uses cuDF implementation
        type(result).__name__  # Shows 'Series' (pandas-like interface)
    """

def is_proxy_object(obj) -> bool:
    """
    Check if an object is a proxy object for pandas acceleration.

    Determines whether an object is part of the cuDF pandas proxy system,
    meaning it routes operations between the cuDF and pandas backends.

    Parameters:
        obj: Any
            Object to check for proxy status

    Returns:
        bool: True if the object is a proxy object, False otherwise

    Examples:
        import cudf.pandas
        cudf.pandas.install()
        import pandas as pd

        # Create a DataFrame (automatically proxied after install)
        df = pd.DataFrame({'A': [1, 2, 3]})

        # Check if it's a proxy
        is_proxy = cudf.pandas.is_proxy_object(df)  # True

        # Regular Python objects are not proxies
        regular_list = [1, 2, 3]
        is_proxy = cudf.pandas.is_proxy_object(regular_list)  # False

        # Native cuDF objects are not proxies
        import cudf
        cudf_df = cudf.DataFrame({'A': [1, 2, 3]})
        is_proxy = cudf.pandas.is_proxy_object(cudf_df)  # False
    """

def is_proxy_instance(obj, typ) -> bool:
    """
    Check if an object is an instance of the proxy class for a given type.

    More specific check that verifies an object is a proxy instance
    of a particular pandas type (DataFrame, Series, etc.).

    Parameters:
        obj: Any
            Object to check
        typ: type
            Type to check the proxy instance against (e.g., pd.DataFrame)

    Returns:
        bool: True if the object is a proxy instance of the specified type

    Examples:
        import cudf.pandas
        cudf.pandas.install()
        import pandas as pd

        # Create proxied objects
        df = pd.DataFrame({'A': [1, 2, 3]})
        series = pd.Series([1, 2, 3])

        # Check specific proxy types
        is_df_proxy = cudf.pandas.is_proxy_instance(df, pd.DataFrame)  # True
        is_series_proxy = cudf.pandas.is_proxy_instance(series, pd.Series)  # True

        # Cross-type checks return False
        is_df_as_series = cudf.pandas.is_proxy_instance(df, pd.Series)  # False

        # Non-proxy objects return False
        regular_dict = {'A': [1, 2, 3]}
        is_dict_proxy = cudf.pandas.is_proxy_instance(regular_dict, pd.DataFrame)  # False
    """
```
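The proxy mechanism can be pictured as a "fast path first, slow path on failure" wrapper. A deliberately simplified, stdlib-only sketch of the idea (this is not cuDF's actual proxy implementation; the class and attribute names are illustrative):

```python
class FastSlowProxy:
    """Toy proxy: try the fast backend, fall back to the slow one."""

    def __init__(self, fast, slow):
        self._fast, self._slow = fast, slow

    def __getattr__(self, name):
        def call(*args, **kwargs):
            try:
                # Fast path: attempt the operation on the fast backend.
                return getattr(self._fast, name)(*args, **kwargs)
            except (AttributeError, NotImplementedError, TypeError):
                # Fallback: replay the same call on the slow backend.
                return getattr(self._slow, name)(*args, **kwargs)
        return call

# bytes supports .upper() but not .casefold(); the proxy falls back to str.
proxy = FastSlowProxy(fast=b"hello", slow="hello")
print(proxy.upper())     # b'HELLO'  (fast path)
print(proxy.casefold())  # 'hello'   (slow path)
```

In the real system the "fast" backend is cuDF on the GPU and the "slow" one is CPU pandas, with the added step of synchronizing data between the two.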

## Acceleration Behavior

### Automatic Routing

The cuDF pandas system intelligently routes operations based on several factors:

```python
# Operations automatically routed to GPU when beneficial
import cudf.pandas
cudf.pandas.install()
import pandas as pd

# Large dataset operations -> GPU acceleration
large_df = pd.DataFrame({'x': range(1000000), 'y': range(1000000)})
result = large_df.groupby('x').sum()  # Uses cuDF GPU acceleration

# Small dataset operations -> CPU pandas (lower overhead)
small_df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
result = small_df.sum()  # Uses CPU pandas

# Supported operations -> GPU when data size warrants it
gpu_result = large_df.merge(large_df, on='x')  # GPU acceleration

# Unsupported operations -> automatic fallback to pandas
fallback_result = large_df.some_pandas_only_method()  # CPU fallback
```

### Performance Thresholds

```python
# The system considers multiple factors for routing decisions:

# 1. Data size thresholds
small_data = pd.Series(range(100))     # -> CPU pandas
large_data = pd.Series(range(100000))  # -> GPU cuDF

# 2. Operation complexity
simple_op = df['col'].sum()             # -> GPU for large data
complex_op = df.apply(custom_function)  # -> CPU fallback

# 3. Memory availability
# GPU operations require sufficient GPU memory;
# automatic fallback if GPU memory is insufficient

# 4. Operation support
supported_ops = ['groupby', 'merge', 'concat', 'sort_values']  # -> GPU
unsupported_ops = ['some_pandas_specific_method']              # -> CPU
```
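These routing factors can be condensed into a small dispatch function. A stdlib-only sketch (the threshold value and supported-operation set are illustrative assumptions, not cuDF's real internals):

```python
# Illustrative only: not cuDF's actual routing logic or thresholds.
GPU_SUPPORTED_OPS = {"groupby", "merge", "concat", "sort_values"}
MIN_GPU_ROWS = 10_000  # hypothetical size threshold

def route(op: str, n_rows: int, gpu_mem_ok: bool = True) -> str:
    """Decide which backend an (operation, size) pair would run on."""
    if op not in GPU_SUPPORTED_OPS:
        return "cpu"  # unsupported operation -> pandas fallback
    if n_rows < MIN_GPU_ROWS:
        return "cpu"  # too small -> GPU launch overhead dominates
    if not gpu_mem_ok:
        return "cpu"  # insufficient GPU memory -> fallback
    return "gpu"

print(route("groupby", 1_000_000))  # gpu
print(route("groupby", 100))        # cpu
print(route("apply", 1_000_000))    # cpu
```

The real decision is made per call inside the proxy layer, but the shape of the logic is the same: support check, then size and memory checks.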

### Configuration Options

```python
# Configure acceleration behavior (conceptual - the actual API may differ)
import cudf.pandas

# Install with custom thresholds
cudf.pandas.install(
    min_data_size=10000,     # Minimum rows for GPU acceleration
    memory_fraction=0.8,     # Max GPU memory fraction to use
    fallback_warnings=True   # Warn on fallback operations
)

# Disable acceleration for specific operations
cudf.pandas.configure(
    disable_operations=['apply', 'applymap'],  # Force CPU for these
    enable_profiling=True,   # Enable automatic profiling
    cache_conversions=True   # Cache pandas<->cuDF conversions
)
```

## Common Usage Patterns

### Drop-in Acceleration

```python
# Existing pandas code - no changes needed
import cudf.pandas
cudf.pandas.install()

# Now all pandas imports automatically use acceleration
import pandas as pd
import numpy as np

# Large-scale data processing (automatically accelerated)
df = pd.read_csv('large_dataset.csv')  # GPU-accelerated I/O
df_grouped = df.groupby('category').agg({
    'sales': 'sum',
    'quantity': 'mean'
})  # GPU-accelerated groupby

# Join operations
df_merged = df.merge(df_grouped, on='category')  # GPU-accelerated merge

# Output operations
df_merged.to_parquet('output.parquet')  # GPU-accelerated I/O
```

### Performance Analysis

```python
# Profile existing pandas workflows
import cudf.pandas
cudf.pandas.install()

profiler = cudf.pandas.Profiler()
profiler.start()

# Run existing pandas pipeline
import pandas as pd
df = pd.read_csv('data.csv')
processed = (df
    .fillna(0)
    .groupby('category')
    .agg({'value': ['sum', 'mean', 'std']})
    .reset_index()
)
processed.to_csv('results.csv')

profiler.stop()
stats = profiler.get_results()

print(f"Operations accelerated: {stats['accelerated_ops']}")
print(f"Fallback operations: {stats['fallback_ops']}")
print(f"Overall speedup: {stats['total_speedup']:.2f}x")
print(f"Memory savings: {stats['memory_reduction']:.1f}%")
```

### Gradual Migration

```python
# Hybrid approach - mix cuDF and pandas as needed
import cudf
import pandas as pd
import cudf.pandas

# Explicit cuDF for known GPU-beneficial operations
cudf_df = cudf.read_parquet('large_data.parquet')  # Explicit GPU
processed_cudf = cudf_df.groupby('key').sum()

# Convert to pandas for unsupported operations
pandas_df = processed_cudf.to_pandas()
result = pandas_df.some_pandas_only_operation()

# Convert back for further GPU processing
final_cudf = cudf.from_pandas(result)
final_result = final_cudf.sort_values('column')
```

## Compatibility Matrix

### Fully Supported Operations
- **I/O**: `read_csv`, `read_parquet`, `to_csv`, `to_parquet`
- **Groupby**: Standard aggregations (`sum`, `mean`, `count`, `min`, `max`)
- **Joins**: `merge`, `concat`, `join`
- **Sorting**: `sort_values`, `sort_index`
- **Filtering**: Boolean indexing, `query`
- **Reshaping**: `pivot_table`, `melt`, `stack`, `unstack`

### Partial Support (Selective Acceleration)
- **String Operations**: Common string methods with GPU acceleration
- **DateTime Operations**: Basic datetime arithmetic and formatting
- **Statistical Operations**: Standard statistical functions
- **Window Operations**: Rolling and expanding windows

### Fallback Operations (CPU Only)
- **Custom Functions**: User-defined functions in `apply`, `map`
- **Advanced String Operations**: Complex regex and advanced text processing
- **Specialized Statistical Methods**: Advanced statistical functions
- **Plot Operations**: Matplotlib integration (uses CPU data)
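Operations in the fallback category always execute on the CPU, so pipelines that depend on GPU throughput may want to verify up front that the GPU stack is installed at all. A stdlib-only sketch (the helper name is ours):

```python
from importlib.util import find_spec

def cudf_available() -> bool:
    # cuDF (and with it cudf.pandas) is present only in GPU installs.
    return find_spec("cudf") is not None

backend = "gpu-capable" if cudf_available() else "cpu-only"
print(f"environment is {backend}")
```

Using `find_spec` avoids paying the import cost (or triggering GPU initialization) just to probe availability.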

## Performance Benefits

### Typical Speedups
- **Large Groupby Operations**: 10-100x faster than pandas
- **I/O Operations**: 2-20x faster for Parquet, CSV reading/writing
- **Join Operations**: 5-50x faster for large table joins
- **Sorting**: 3-30x faster for large datasets
- **Aggregations**: 10-100x faster for numerical aggregations

### Memory Efficiency
- **Columnar Storage**: More memory-efficient data representation
- **GPU Memory Management**: Automatic memory optimization
- **Reduced Copying**: Fewer data copies between operations
- **Memory Pools**: Efficient memory allocation and reuse

### Best Practices
- **Let the System Decide**: Trust automatic routing for most operations
- **Profile Regularly**: Use `Profiler` to identify optimization opportunities
- **Monitor Fallbacks**: Check for unexpected CPU fallbacks that might indicate issues
- **Batch Operations**: Combine operations to maximize GPU efficiency
- **Memory Awareness**: Consider GPU memory limits for very large datasets
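For quick checks outside of `Profiler`, a plain timing context manager is often enough to spot the hot operations worth accelerating. A stdlib-only sketch (names are ours, not part of cuDF):

```python
import time
from contextlib import contextmanager

timings = {}  # operation name -> elapsed seconds

@contextmanager
def timed(op):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[op] = time.perf_counter() - start

# Time a candidate operation; large, slow entries in `timings`
# are the first places to try GPU acceleration.
with timed("aggregate"):
    total = sum(x * x for x in range(100_000))

print(f"aggregate: {timings['aggregate'] * 1e3:.2f} ms")
```

Timing the same block before and after `cudf.pandas.install()` gives a rough per-operation speedup without any profiler machinery.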