# Pandas Compatibility

cuDF provides seamless pandas compatibility through `cudf.pandas`, which enables automatic GPU acceleration for existing pandas code. The system transparently falls back to CPU pandas for unsupported operations while using GPU acceleration when beneficial.

## Import Statements

```python
# Pandas acceleration mode
import cudf.pandas
cudf.pandas.install()  # Enable automatic acceleration

# Profiling utilities
from cudf.pandas import Profiler

# IPython integration (a magic command; run in Jupyter/IPython, not plain Python)
# %load_ext cudf.pandas

# Proxy utilities
from cudf.pandas import (
    as_proxy_object, is_proxy_object, is_proxy_instance
)
```
## Acceleration Mode

A drop-in replacement system that automatically accelerates pandas operations on the GPU when beneficial.

```{ .api }
def install() -> None:
    """
    Enable cuDF pandas accelerator mode for automatic GPU acceleration.

    Installs cuDF as a pandas accelerator that intercepts pandas operations
    and routes them to the GPU when possible, with transparent fallback to
    CPU pandas for unsupported operations.

    After installation, existing pandas code automatically benefits from
    GPU acceleration without modification. Operations that cannot be
    accelerated fall back to pandas seamlessly.

    Features:
    - Automatic GPU acceleration for supported operations
    - Transparent fallback to CPU pandas for unsupported operations
    - Zero code changes required for existing pandas workflows
    - Maintains pandas API compatibility and behavior
    - Intelligent routing based on data size and operation type

    Examples:
        # Enable acceleration globally
        import cudf.pandas
        cudf.pandas.install()

        # Now pandas operations automatically use the GPU when beneficial
        import pandas as pd
        df = pd.DataFrame({'x': range(1000000), 'y': range(1000000)})
        result = df.groupby('x').sum()  # Automatically uses GPU

        # Fallback for unsupported operations (illustrative placeholder name)
        result = df.some_unsupported_operation()  # Uses CPU pandas

        # Works with existing pandas code unchanged
        df.to_csv('output.csv')  # GPU-accelerated I/O when possible
    """
```
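A quick sanity check that accelerator mode is active: after `install()`, objects created through the pandas API are proxies. This is a minimal sketch assuming the `install()` and `is_proxy_object` helpers documented in this section behave as described.

```python
import cudf.pandas
cudf.pandas.install()  # Run before importing pandas

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
# Proxy status confirms the accelerator is intercepting pandas objects
print(cudf.pandas.is_proxy_object(df))  # Expected: True
```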
## Performance Profiling

Tools for analyzing pandas code to identify GPU acceleration opportunities.

```{ .api }
class Profiler:
    """
    Performance profiler for pandas acceleration opportunities.

    Analyzes pandas operations to identify performance bottlenecks and
    acceleration potential. Provides insight into which operations
    benefit from GPU acceleration and the performance improvements achieved.

    Attributes:
        results: dict containing profiling results and statistics

    Methods:
        start(): Begin profiling pandas operations
        stop(): End profiling and collect results
        print_stats(): Display profiling statistics
        get_results(): Return detailed profiling data

    Examples:
        # Basic profiling workflow
        import cudf.pandas
        cudf.pandas.install()

        profiler = cudf.pandas.Profiler()
        profiler.start()

        # Run pandas operations to profile
        import pandas as pd
        df = pd.DataFrame({'A': range(10000), 'B': range(10000)})
        result1 = df.groupby('A').sum()
        result2 = df.merge(df, on='A')
        result3 = df.sort_values('B')

        profiler.stop()
        profiler.print_stats()

        # Get detailed results
        stats = profiler.get_results()
        print(f"GPU accelerated operations: {stats['gpu_ops']}")
        print(f"CPU fallback operations: {stats['cpu_ops']}")
        print(f"Total speedup: {stats['speedup']:.2f}x")
    """

    def start(self) -> None:
        """
        Begin profiling pandas operations.

        Starts collecting performance metrics for pandas operations,
        including execution time, memory usage, and routing decisions.
        """

    def stop(self) -> None:
        """
        End profiling and collect results.

        Stops profiling and computes final statistics, including
        performance improvements and operation categorization.
        """

    def print_stats(self) -> None:
        """
        Display profiling statistics in a readable format.

        Prints a summary of profiled operations including:
        - Total operations analyzed
        - GPU vs CPU operation breakdown
        - Performance improvements achieved
        - Memory usage patterns
        - Recommendations for optimization
        """

    def get_results(self) -> dict:
        """
        Return detailed profiling data as a dictionary.

        Returns:
            dict: Comprehensive profiling results containing:
                - operation_times: Execution times for each operation
                - routing_decisions: GPU vs CPU routing for operations
                - memory_usage: Memory consumption patterns
                - speedups: Performance improvements achieved
                - recommendations: Optimization suggestions
        """
```
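In some cuDF releases the profiler is also usable as a context manager. The sketch below assumes that interface and a `print_per_function_stats` reporting method; both are assumptions that may differ across versions, so check the docs for your installed release.

```python
import cudf.pandas
cudf.pandas.install()

import pandas as pd
from cudf.pandas import Profiler

# Context-manager form: profiling starts on entry and stops on exit
with Profiler() as profiler:
    df = pd.DataFrame({"A": range(10_000), "B": range(10_000)})
    df.groupby("A").sum()

# Reporting method name is an assumption; see your version's documentation
profiler.print_per_function_stats()
```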
## IPython Integration

Magic commands and extensions for Jupyter notebook integration.

```{ .api }
def load_ipython_extension(ipython) -> None:
    """
    Load the cuDF pandas IPython extension for notebook integration.

    Provides magic commands and enhanced display formatting for
    cuDF pandas operations in Jupyter notebooks and IPython.

    Magic Commands Available:
        %%cudf_pandas_profile: Profile cell operations for acceleration opportunities
        %cudf_pandas_status: Show current acceleration status and statistics
        %cudf_pandas_fallback: Display recent fallback operations and reasons

    Parameters:
        ipython: IPython.InteractiveShell
            IPython shell instance to extend

    Examples:
        # In a Jupyter notebook
        %load_ext cudf.pandas

        # Profile a cell's operations
        %%cudf_pandas_profile
        import pandas as pd
        df = pd.DataFrame({'A': range(10000)})
        result = df.groupby('A').count()

        # Check acceleration status
        %cudf_pandas_status

        # See fallback operations
        %cudf_pandas_fallback
    """
```
## Proxy Object System

Utilities for working with the proxy object system that enables transparent acceleration.

```{ .api }
def as_proxy_object(obj, typ=None) -> object:
    """
    Wrap an object as a proxy for pandas acceleration.

    Creates a proxy object that intercepts method calls and routes them
    to the appropriate backend (GPU cuDF or CPU pandas). Used internally
    by the acceleration system.

    Parameters:
        obj: Any
            Object to wrap as a proxy (typically a cuDF object)
        typ: type, optional
            Target proxy type (typically a pandas type)

    Returns:
        object: Proxy object that behaves like pandas but uses the cuDF backend

    Examples:
        # Typically used internally, but can be used explicitly
        import cudf
        cudf_df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

        # Create a proxy that behaves like a pandas DataFrame
        proxy_df = cudf.pandas.as_proxy_object(cudf_df)

        # The proxy behaves like pandas but uses the cuDF backend
        result = proxy_df.sum()  # Uses the cuDF implementation
        type(result).__name__    # Shows 'Series' (pandas-like interface)
    """

def is_proxy_object(obj) -> bool:
    """
    Check whether an object is a proxy object for pandas acceleration.

    Determines whether an object is part of the cuDF pandas proxy system,
    meaning it routes operations between the cuDF and pandas backends.

    Parameters:
        obj: Any
            Object to check for proxy status

    Returns:
        bool: True if the object is a proxy object, False otherwise

    Examples:
        import cudf.pandas
        cudf.pandas.install()
        import pandas as pd

        # Create a DataFrame (automatically proxied after install)
        df = pd.DataFrame({'A': [1, 2, 3]})

        # Check if it's a proxy
        is_proxy = cudf.pandas.is_proxy_object(df)  # True

        # Regular Python objects are not proxies
        regular_list = [1, 2, 3]
        is_proxy = cudf.pandas.is_proxy_object(regular_list)  # False

        # Native cuDF objects are not proxies
        import cudf
        cudf_df = cudf.DataFrame({'A': [1, 2, 3]})
        is_proxy = cudf.pandas.is_proxy_object(cudf_df)  # False
    """

def is_proxy_instance(obj, typ) -> bool:
    """
    Check whether an object is a proxy instance of a given type.

    A more specific check that verifies an object is a proxy instance
    of a particular pandas type (DataFrame, Series, etc.).

    Parameters:
        obj: Any
            Object to check
        typ: type
            Type to check the proxy instance against (e.g., pd.DataFrame)

    Returns:
        bool: True if the object is a proxy instance of the specified type

    Examples:
        import cudf.pandas
        cudf.pandas.install()
        import pandas as pd

        # Create proxied objects
        df = pd.DataFrame({'A': [1, 2, 3]})
        series = pd.Series([1, 2, 3])

        # Check specific proxy types
        is_df_proxy = cudf.pandas.is_proxy_instance(df, pd.DataFrame)       # True
        is_series_proxy = cudf.pandas.is_proxy_instance(series, pd.Series)  # True

        # Cross-type checks return False
        is_df_as_series = cudf.pandas.is_proxy_instance(df, pd.Series)  # False

        # Non-proxy objects return False
        regular_dict = {'A': [1, 2, 3]}
        is_dict_proxy = cudf.pandas.is_proxy_instance(regular_dict, pd.DataFrame)  # False
    """
```
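A short round-trip sketch tying the three helpers together, assuming they behave as documented above:

```python
import cudf
import cudf.pandas
import pandas as pd

gdf = cudf.DataFrame({"a": [1, 2, 3]})

# Wrap the native cuDF object so it presents a pandas-like interface
proxy = cudf.pandas.as_proxy_object(gdf)

print(cudf.pandas.is_proxy_object(gdf))                    # False: native cuDF object
print(cudf.pandas.is_proxy_object(proxy))                  # True: wrapped proxy
print(cudf.pandas.is_proxy_instance(proxy, pd.DataFrame))  # True
print(proxy.sum())  # Dispatches to the cuDF backend
```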
## Acceleration Behavior

### Automatic Routing

The cuDF pandas system intelligently routes operations based on several factors:

```python
# Operations are automatically routed to the GPU when beneficial
import cudf.pandas
cudf.pandas.install()
import pandas as pd

# Large dataset operations -> GPU acceleration
large_df = pd.DataFrame({'x': range(1000000), 'y': range(1000000)})
result = large_df.groupby('x').sum()  # Uses cuDF GPU acceleration

# Small dataset operations -> CPU pandas (lower overhead)
small_df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
result = small_df.sum()  # Uses CPU pandas

# Supported operations -> GPU when data size warrants it
gpu_result = large_df.merge(large_df, on='x')  # GPU acceleration

# Unsupported operations -> automatic fallback to pandas
# (illustrative placeholder method name, shown commented out)
# fallback_result = large_df.some_pandas_only_method()  # CPU fallback
```
### Performance Thresholds

```python
# The system considers multiple factors when making routing decisions:

# 1. Data size thresholds
small_data = pd.Series(range(100))      # -> CPU pandas
large_data = pd.Series(range(100000))   # -> GPU cuDF

# 2. Operation complexity
simple_op = df['col'].sum()             # -> GPU for large data
complex_op = df.apply(custom_function)  # -> CPU fallback

# 3. Memory availability
# GPU operations require sufficient GPU memory;
# automatic fallback occurs if GPU memory is insufficient

# 4. Operation support
supported_ops = ['groupby', 'merge', 'concat', 'sort_values']  # -> GPU
unsupported_ops = ['some_pandas_specific_method']              # -> CPU
```
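The thresholds above are conceptual, and the crossover point depends on your hardware. One way to see where acceleration pays off is to time the same operation at different scales; this is a minimal sketch, and the numbers will vary by GPU, driver, and cuDF version.

```python
import time

import cudf.pandas
cudf.pandas.install()

import pandas as pd

def time_groupby(n: int) -> float:
    # Build a frame with n rows and time a simple groupby-sum
    df = pd.DataFrame({"key": range(n), "val": range(n)})
    start = time.perf_counter()
    df.groupby("key").sum()
    return time.perf_counter() - start

for n in (1_000, 1_000_000):
    print(f"{n:>9,} rows: {time_groupby(n):.4f}s")
```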
### Configuration Options

```python
# Configure acceleration behavior (conceptual only -- the parameters below
# illustrate the idea and are not part of the published cudf.pandas API)
import cudf.pandas

# Install with custom thresholds (hypothetical parameters)
cudf.pandas.install(
    min_data_size=10000,     # Minimum rows for GPU acceleration
    memory_fraction=0.8,     # Max GPU memory fraction to use
    fallback_warnings=True   # Warn on fallback operations
)

# Disable acceleration for specific operations (hypothetical API)
cudf.pandas.configure(
    disable_operations=['apply', 'applymap'],  # Force CPU for these
    enable_profiling=True,                     # Enable automatic profiling
    cache_conversions=True                     # Cache pandas<->cuDF conversions
)
```
## Common Usage Patterns

### Drop-in Acceleration

```python
# Existing pandas code -- no changes needed
import cudf.pandas
cudf.pandas.install()

# All subsequent pandas imports automatically use acceleration
import pandas as pd
import numpy as np

# Large-scale data processing (automatically accelerated)
df = pd.read_csv('large_dataset.csv')  # GPU-accelerated I/O
df_grouped = df.groupby('category').agg({
    'sales': 'sum',
    'quantity': 'mean'
}).reset_index()  # GPU-accelerated groupby; reset_index keeps 'category' as a column

# Join operations
df_merged = df.merge(df_grouped, on='category')  # GPU-accelerated merge

# Output operations
df_merged.to_parquet('output.parquet')  # GPU-accelerated I/O
```
### Performance Analysis

```python
# Profile existing pandas workflows
import cudf.pandas
cudf.pandas.install()

profiler = cudf.pandas.Profiler()
profiler.start()

# Run an existing pandas pipeline
import pandas as pd
df = pd.read_csv('data.csv')
processed = (df
    .fillna(0)
    .groupby('category')
    .agg({'value': ['sum', 'mean', 'std']})
    .reset_index()
)
processed.to_csv('results.csv')

profiler.stop()
stats = profiler.get_results()

print(f"Operations accelerated: {stats['accelerated_ops']}")
print(f"Fallback operations: {stats['fallback_ops']}")
print(f"Overall speedup: {stats['total_speedup']:.2f}x")
print(f"Memory savings: {stats['memory_reduction']:.1f}%")
```
### Gradual Migration

```python
# Hybrid approach -- mix cuDF and pandas as needed
import cudf
import pandas as pd

# Explicit cuDF for known GPU-beneficial operations
cudf_df = cudf.read_parquet('large_data.parquet')  # Explicit GPU
processed_cudf = cudf_df.groupby('key').sum()

# Convert to pandas for unsupported operations
pandas_df = processed_cudf.to_pandas()
result = pandas_df.some_pandas_only_operation()  # illustrative placeholder

# Convert back for further GPU processing
final_cudf = cudf.from_pandas(result)
final_result = final_cudf.sort_values('column')
```
## Compatibility Matrix

### Fully Supported Operations
- **I/O**: `read_csv`, `read_parquet`, `to_csv`, `to_parquet`
- **Groupby**: Standard aggregations (`sum`, `mean`, `count`, `min`, `max`)
- **Joins**: `merge`, `concat`, `join`
- **Sorting**: `sort_values`, `sort_index`
- **Filtering**: Boolean indexing, `query`
- **Reshaping**: `pivot_table`, `melt`, `stack`, `unstack`

### Partial Support (Selective Acceleration)
- **String Operations**: Common string methods with GPU acceleration
- **DateTime Operations**: Basic datetime arithmetic and formatting
- **Statistical Operations**: Standard statistical functions
- **Window Operations**: Rolling and expanding windows

### Fallback Operations (CPU Only)
- **Custom Functions**: User-defined functions in `apply` and `map` (see the sketch after this matrix)
- **Advanced String Operations**: Complex regex and advanced text processing
- **Specialized Statistical Methods**: Advanced statistical functions
- **Plot Operations**: Matplotlib integration (uses CPU data)
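An illustrative sketch of both sides of the matrix: a standard groupby aggregation the accelerator can run on the GPU, and a Python-level UDF that, per the matrix above, typically falls back to CPU pandas.

```python
import cudf.pandas
cudf.pandas.install()

import pandas as pd

df = pd.DataFrame({"category": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})

# Fully supported: standard groupby aggregation runs on the GPU when possible
totals = df.groupby("category")["value"].sum()

# Fallback: an arbitrary Python UDF generally cannot run on the GPU, so the
# accelerator transparently executes it with CPU pandas
shifted = df["value"].apply(lambda v: v + 0.5)

print(totals)
print(shifted)
```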
## Performance Benefits

### Typical Speedups
- **Large Groupby Operations**: 10-100x faster than pandas
- **I/O Operations**: 2-20x faster for Parquet and CSV reading/writing
- **Join Operations**: 5-50x faster for large table joins
- **Sorting**: 3-30x faster for large datasets
- **Aggregations**: 10-100x faster for numerical aggregations

### Memory Efficiency
- **Columnar Storage**: More memory-efficient data representation
- **GPU Memory Management**: Automatic memory optimization
- **Reduced Copying**: Fewer data copies between operations
- **Memory Pools**: Efficient memory allocation and reuse

### Best Practices
- **Let the System Decide**: Trust automatic routing for most operations
- **Profile Regularly**: Use `Profiler` to identify optimization opportunities
- **Monitor Fallbacks**: Check for unexpected CPU fallbacks that might indicate issues
- **Batch Operations**: Combine operations to maximize GPU efficiency (see the sketch after this list)
- **Memory Awareness**: Consider GPU memory limits for very large datasets
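As a concrete instance of the batching advice, here is plain pandas code (which benefits unchanged under `cudf.pandas`) that folds several aggregations into a single groupby pass:

```python
import pandas as pd

df = pd.DataFrame({"g": [1, 1, 2, 2], "x": [1.0, 2.0, 3.0, 4.0]})

# Instead of several separate passes over the data...
total = df.groupby("g")["x"].sum()
average = df.groupby("g")["x"].mean()

# ...batch the aggregations so the backend can do a single pass
stats = df.groupby("g")["x"].agg(["sum", "mean", "std"])
print(stats)
```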