# Pandas Compatibility

cuDF provides seamless pandas compatibility through `cudf.pandas`, which enables automatic GPU acceleration for existing pandas code. The system transparently falls back to CPU pandas for unsupported operations while using GPU acceleration when beneficial.

## Import Statements

```python
# Pandas acceleration mode
import cudf.pandas
cudf.pandas.install()  # Enable automatic acceleration

# Profiling utilities
from cudf.pandas import Profiler

# IPython integration (a magic command; run in Jupyter/IPython, not plain Python)
# %load_ext cudf.pandas

# Proxy utilities
from cudf.pandas import (
    as_proxy_object, is_proxy_object, is_proxy_instance
)
```
## Acceleration Mode

A drop-in replacement system that automatically accelerates pandas operations on the GPU when beneficial.

```{ .api }
def install() -> None:
    """
    Enable cuDF pandas accelerator mode for automatic GPU acceleration.

    Installs cuDF as a pandas accelerator that intercepts pandas operations
    and routes them to the GPU when possible, with transparent fallback to
    CPU pandas for unsupported operations.

    After installation, existing pandas code automatically benefits from
    GPU acceleration without modification. Operations that cannot be
    accelerated fall back to pandas seamlessly.

    Features:
    - Automatic GPU acceleration for supported operations
    - Transparent fallback to CPU pandas for unsupported operations
    - Zero code changes required for existing pandas workflows
    - Maintains pandas API compatibility and behavior
    - Intelligent routing based on data size and operation type

    Examples:
        # Enable acceleration globally
        import cudf.pandas
        cudf.pandas.install()

        # Now pandas operations automatically use the GPU when beneficial
        import pandas as pd
        df = pd.DataFrame({'x': range(1000000), 'y': range(1000000)})
        result = df.groupby('x').sum()  # Automatically uses GPU

        # Fallback for unsupported operations (illustrative placeholder name)
        result = df.some_unsupported_operation()  # Uses CPU pandas

        # Works with existing pandas code unchanged
        df.to_csv('output.csv')  # GPU-accelerated I/O when possible
    """
```
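A quick sanity check that accelerator mode is active: after `install()`, objects created through the pandas API are proxies. This is a minimal sketch assuming the `install()` and `is_proxy_object` helpers documented in this section behave as described.

```python
import cudf.pandas
cudf.pandas.install()  # Run before importing pandas

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
# Proxy status confirms the accelerator is intercepting pandas objects
print(cudf.pandas.is_proxy_object(df))  # Expected: True
```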
## Performance Profiling

Tools for analyzing pandas code to identify GPU acceleration opportunities.

```{ .api }
class Profiler:
    """
    Performance profiler for pandas acceleration opportunities.

    Analyzes pandas operations to identify performance bottlenecks and
    acceleration potential. Provides insight into which operations
    benefit from GPU acceleration and the performance improvements achieved.

    Attributes:
        results: dict containing profiling results and statistics

    Methods:
        start(): Begin profiling pandas operations
        stop(): End profiling and collect results
        print_stats(): Display profiling statistics
        get_results(): Return detailed profiling data

    Examples:
        # Basic profiling workflow
        import cudf.pandas
        cudf.pandas.install()

        profiler = cudf.pandas.Profiler()
        profiler.start()

        # Run pandas operations to profile
        import pandas as pd
        df = pd.DataFrame({'A': range(10000), 'B': range(10000)})
        result1 = df.groupby('A').sum()
        result2 = df.merge(df, on='A')
        result3 = df.sort_values('B')

        profiler.stop()
        profiler.print_stats()

        # Get detailed results
        stats = profiler.get_results()
        print(f"GPU accelerated operations: {stats['gpu_ops']}")
        print(f"CPU fallback operations: {stats['cpu_ops']}")
        print(f"Total speedup: {stats['speedup']:.2f}x")
    """

    def start(self) -> None:
        """
        Begin profiling pandas operations.

        Starts collecting performance metrics for pandas operations,
        including execution time, memory usage, and routing decisions.
        """

    def stop(self) -> None:
        """
        End profiling and collect results.

        Stops profiling and computes final statistics, including
        performance improvements and operation categorization.
        """

    def print_stats(self) -> None:
        """
        Display profiling statistics in a readable format.

        Prints a summary of profiled operations including:
        - Total operations analyzed
        - GPU vs CPU operation breakdown
        - Performance improvements achieved
        - Memory usage patterns
        - Recommendations for optimization
        """

    def get_results(self) -> dict:
        """
        Return detailed profiling data as a dictionary.

        Returns:
            dict: Comprehensive profiling results containing:
                - operation_times: Execution times for each operation
                - routing_decisions: GPU vs CPU routing for operations
                - memory_usage: Memory consumption patterns
                - speedups: Performance improvements achieved
                - recommendations: Optimization suggestions
        """
```
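In some cuDF releases the profiler is also usable as a context manager. The sketch below assumes that interface and a `print_per_function_stats` reporting method; both are assumptions that may differ across versions, so check the docs for your installed release.

```python
import cudf.pandas
cudf.pandas.install()

import pandas as pd
from cudf.pandas import Profiler

# Context-manager form: profiling starts on entry and stops on exit
with Profiler() as profiler:
    df = pd.DataFrame({"A": range(10_000), "B": range(10_000)})
    df.groupby("A").sum()

# Reporting method name is an assumption; see your version's documentation
profiler.print_per_function_stats()
```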
## IPython Integration

Magic commands and extensions for Jupyter notebook integration.

```{ .api }
def load_ipython_extension(ipython) -> None:
    """
    Load the cuDF pandas IPython extension for notebook integration.

    Provides magic commands and enhanced display formatting for
    cuDF pandas operations in Jupyter notebooks and IPython.

    Magic Commands Available:
        %%cudf_pandas_profile: Profile cell operations for acceleration opportunities
        %cudf_pandas_status: Show current acceleration status and statistics
        %cudf_pandas_fallback: Display recent fallback operations and reasons

    Parameters:
        ipython: IPython.InteractiveShell
            IPython shell instance to extend

    Examples:
        # In a Jupyter notebook
        %load_ext cudf.pandas

        # Profile a cell's operations
        %%cudf_pandas_profile
        import pandas as pd
        df = pd.DataFrame({'A': range(10000)})
        result = df.groupby('A').count()

        # Check acceleration status
        %cudf_pandas_status

        # See fallback operations
        %cudf_pandas_fallback
    """
```
## Proxy Object System

Utilities for working with the proxy object system that enables transparent acceleration.

```{ .api }
def as_proxy_object(obj, typ=None) -> object:
    """
    Wrap an object as a proxy for pandas acceleration.

    Creates a proxy object that intercepts method calls and routes them
    to the appropriate backend (GPU cuDF or CPU pandas). Used internally
    by the acceleration system.

    Parameters:
        obj: Any
            Object to wrap as a proxy (typically a cuDF object)
        typ: type, optional
            Target proxy type (typically a pandas type)

    Returns:
        object: Proxy object that behaves like pandas but uses the cuDF backend

    Examples:
        # Typically used internally, but can be used explicitly
        import cudf
        cudf_df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

        # Create a proxy that behaves like a pandas DataFrame
        proxy_df = cudf.pandas.as_proxy_object(cudf_df)

        # The proxy behaves like pandas but uses the cuDF backend
        result = proxy_df.sum()  # Uses the cuDF implementation
        type(result).__name__    # Shows 'Series' (pandas-like interface)
    """

def is_proxy_object(obj) -> bool:
    """
    Check whether an object is a proxy object for pandas acceleration.

    Determines whether an object is part of the cuDF pandas proxy system,
    meaning it routes operations between the cuDF and pandas backends.

    Parameters:
        obj: Any
            Object to check for proxy status

    Returns:
        bool: True if the object is a proxy object, False otherwise

    Examples:
        import cudf.pandas
        cudf.pandas.install()
        import pandas as pd

        # Create a DataFrame (automatically proxied after install)
        df = pd.DataFrame({'A': [1, 2, 3]})

        # Check if it's a proxy
        is_proxy = cudf.pandas.is_proxy_object(df)  # True

        # Regular Python objects are not proxies
        regular_list = [1, 2, 3]
        is_proxy = cudf.pandas.is_proxy_object(regular_list)  # False

        # Native cuDF objects are not proxies
        import cudf
        cudf_df = cudf.DataFrame({'A': [1, 2, 3]})
        is_proxy = cudf.pandas.is_proxy_object(cudf_df)  # False
    """

def is_proxy_instance(obj, typ) -> bool:
    """
    Check whether an object is a proxy instance of a given type.

    A more specific check that verifies an object is a proxy instance
    of a particular pandas type (DataFrame, Series, etc.).

    Parameters:
        obj: Any
            Object to check
        typ: type
            Type to check the proxy instance against (e.g., pd.DataFrame)

    Returns:
        bool: True if the object is a proxy instance of the specified type

    Examples:
        import cudf.pandas
        cudf.pandas.install()
        import pandas as pd

        # Create proxied objects
        df = pd.DataFrame({'A': [1, 2, 3]})
        series = pd.Series([1, 2, 3])

        # Check specific proxy types
        is_df_proxy = cudf.pandas.is_proxy_instance(df, pd.DataFrame)       # True
        is_series_proxy = cudf.pandas.is_proxy_instance(series, pd.Series)  # True

        # Cross-type checks return False
        is_df_as_series = cudf.pandas.is_proxy_instance(df, pd.Series)  # False

        # Non-proxy objects return False
        regular_dict = {'A': [1, 2, 3]}
        is_dict_proxy = cudf.pandas.is_proxy_instance(regular_dict, pd.DataFrame)  # False
    """
```
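A short round-trip sketch tying the three helpers together, assuming they behave as documented above:

```python
import cudf
import cudf.pandas
import pandas as pd

gdf = cudf.DataFrame({"a": [1, 2, 3]})

# Wrap the native cuDF object so it presents a pandas-like interface
proxy = cudf.pandas.as_proxy_object(gdf)

print(cudf.pandas.is_proxy_object(gdf))                    # False: native cuDF object
print(cudf.pandas.is_proxy_object(proxy))                  # True: wrapped proxy
print(cudf.pandas.is_proxy_instance(proxy, pd.DataFrame))  # True
print(proxy.sum())  # Dispatches to the cuDF backend
```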
## Acceleration Behavior

### Automatic Routing

The cuDF pandas system intelligently routes operations based on several factors:

```python
# Operations are automatically routed to the GPU when beneficial
import cudf.pandas
cudf.pandas.install()
import pandas as pd

# Large dataset operations -> GPU acceleration
large_df = pd.DataFrame({'x': range(1000000), 'y': range(1000000)})
result = large_df.groupby('x').sum()  # Uses cuDF GPU acceleration

# Small dataset operations -> CPU pandas (lower overhead)
small_df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
result = small_df.sum()  # Uses CPU pandas

# Supported operations -> GPU when data size warrants it
gpu_result = large_df.merge(large_df, on='x')  # GPU acceleration

# Unsupported operations -> automatic fallback to pandas
# (illustrative placeholder method name, shown commented out)
# fallback_result = large_df.some_pandas_only_method()  # CPU fallback
```
### Performance Thresholds

```python
# The system considers multiple factors when making routing decisions:

# 1. Data size thresholds
small_data = pd.Series(range(100))      # -> CPU pandas
large_data = pd.Series(range(100000))   # -> GPU cuDF

# 2. Operation complexity
simple_op = df['col'].sum()             # -> GPU for large data
complex_op = df.apply(custom_function)  # -> CPU fallback

# 3. Memory availability
# GPU operations require sufficient GPU memory;
# automatic fallback occurs if GPU memory is insufficient

# 4. Operation support
supported_ops = ['groupby', 'merge', 'concat', 'sort_values']  # -> GPU
unsupported_ops = ['some_pandas_specific_method']              # -> CPU
```
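The thresholds above are conceptual, and the crossover point depends on your hardware. One way to see where acceleration pays off is to time the same operation at different scales; this is a minimal sketch, and the numbers will vary by GPU, driver, and cuDF version.

```python
import time

import cudf.pandas
cudf.pandas.install()

import pandas as pd

def time_groupby(n: int) -> float:
    # Build a frame with n rows and time a simple groupby-sum
    df = pd.DataFrame({"key": range(n), "val": range(n)})
    start = time.perf_counter()
    df.groupby("key").sum()
    return time.perf_counter() - start

for n in (1_000, 1_000_000):
    print(f"{n:>9,} rows: {time_groupby(n):.4f}s")
```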
### Configuration Options

```python
# Configure acceleration behavior (conceptual only -- the parameters below
# illustrate the idea and are not part of the published cudf.pandas API)
import cudf.pandas

# Install with custom thresholds (hypothetical parameters)
cudf.pandas.install(
    min_data_size=10000,     # Minimum rows for GPU acceleration
    memory_fraction=0.8,     # Max GPU memory fraction to use
    fallback_warnings=True   # Warn on fallback operations
)

# Disable acceleration for specific operations (hypothetical API)
cudf.pandas.configure(
    disable_operations=['apply', 'applymap'],  # Force CPU for these
    enable_profiling=True,                     # Enable automatic profiling
    cache_conversions=True                     # Cache pandas<->cuDF conversions
)
```
## Common Usage Patterns

### Drop-in Acceleration

```python
# Existing pandas code -- no changes needed
import cudf.pandas
cudf.pandas.install()

# All subsequent pandas imports automatically use acceleration
import pandas as pd
import numpy as np

# Large-scale data processing (automatically accelerated)
df = pd.read_csv('large_dataset.csv')  # GPU-accelerated I/O
df_grouped = df.groupby('category').agg({
    'sales': 'sum',
    'quantity': 'mean'
}).reset_index()  # GPU-accelerated groupby; reset_index keeps 'category' as a column

# Join operations
df_merged = df.merge(df_grouped, on='category')  # GPU-accelerated merge

# Output operations
df_merged.to_parquet('output.parquet')  # GPU-accelerated I/O
```
### Performance Analysis

```python
# Profile existing pandas workflows
import cudf.pandas
cudf.pandas.install()

profiler = cudf.pandas.Profiler()
profiler.start()

# Run an existing pandas pipeline
import pandas as pd
df = pd.read_csv('data.csv')
processed = (df
    .fillna(0)
    .groupby('category')
    .agg({'value': ['sum', 'mean', 'std']})
    .reset_index()
)
processed.to_csv('results.csv')

profiler.stop()
stats = profiler.get_results()

print(f"Operations accelerated: {stats['accelerated_ops']}")
print(f"Fallback operations: {stats['fallback_ops']}")
print(f"Overall speedup: {stats['total_speedup']:.2f}x")
print(f"Memory savings: {stats['memory_reduction']:.1f}%")
```
### Gradual Migration

```python
# Hybrid approach -- mix cuDF and pandas as needed
import cudf
import pandas as pd

# Explicit cuDF for known GPU-beneficial operations
cudf_df = cudf.read_parquet('large_data.parquet')  # Explicit GPU
processed_cudf = cudf_df.groupby('key').sum()

# Convert to pandas for unsupported operations
pandas_df = processed_cudf.to_pandas()
result = pandas_df.some_pandas_only_operation()  # illustrative placeholder

# Convert back for further GPU processing
final_cudf = cudf.from_pandas(result)
final_result = final_cudf.sort_values('column')
```
## Compatibility Matrix

### Fully Supported Operations
- **I/O**: `read_csv`, `read_parquet`, `to_csv`, `to_parquet`
- **Groupby**: Standard aggregations (`sum`, `mean`, `count`, `min`, `max`)
- **Joins**: `merge`, `concat`, `join`
- **Sorting**: `sort_values`, `sort_index`
- **Filtering**: Boolean indexing, `query`
- **Reshaping**: `pivot_table`, `melt`, `stack`, `unstack`

### Partial Support (Selective Acceleration)
- **String Operations**: Common string methods with GPU acceleration
- **DateTime Operations**: Basic datetime arithmetic and formatting
- **Statistical Operations**: Standard statistical functions
- **Window Operations**: Rolling and expanding windows

### Fallback Operations (CPU Only)
- **Custom Functions**: User-defined functions in `apply` and `map` (see the sketch after this matrix)
- **Advanced String Operations**: Complex regex and advanced text processing
- **Specialized Statistical Methods**: Advanced statistical functions
- **Plot Operations**: Matplotlib integration (uses CPU data)
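An illustrative sketch of both sides of the matrix: a standard groupby aggregation the accelerator can run on the GPU, and a Python-level UDF that, per the matrix above, typically falls back to CPU pandas.

```python
import cudf.pandas
cudf.pandas.install()

import pandas as pd

df = pd.DataFrame({"category": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})

# Fully supported: standard groupby aggregation runs on the GPU when possible
totals = df.groupby("category")["value"].sum()

# Fallback: an arbitrary Python UDF generally cannot run on the GPU, so the
# accelerator transparently executes it with CPU pandas
shifted = df["value"].apply(lambda v: v + 0.5)

print(totals)
print(shifted)
```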
## Performance Benefits

### Typical Speedups
- **Large Groupby Operations**: 10-100x faster than pandas
- **I/O Operations**: 2-20x faster for Parquet and CSV reading/writing
- **Join Operations**: 5-50x faster for large table joins
- **Sorting**: 3-30x faster for large datasets
- **Aggregations**: 10-100x faster for numerical aggregations

### Memory Efficiency
- **Columnar Storage**: More memory-efficient data representation
- **GPU Memory Management**: Automatic memory optimization
- **Reduced Copying**: Fewer data copies between operations
- **Memory Pools**: Efficient memory allocation and reuse

### Best Practices
- **Let the System Decide**: Trust automatic routing for most operations
- **Profile Regularly**: Use `Profiler` to identify optimization opportunities
- **Monitor Fallbacks**: Check for unexpected CPU fallbacks that might indicate issues
- **Batch Operations**: Combine operations to maximize GPU efficiency (see the sketch after this list)
- **Memory Awareness**: Consider GPU memory limits for very large datasets
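As a concrete instance of the batching advice, here is plain pandas code (which benefits unchanged under `cudf.pandas`) that folds several aggregations into a single groupby pass:

```python
import pandas as pd

df = pd.DataFrame({"g": [1, 1, 2, 2], "x": [1.0, 2.0, 3.0, 4.0]})

# Instead of several separate passes over the data...
total = df.groupby("g")["x"].sum()
average = df.groupby("g")["x"].mean()

# ...batch the aggregations so the backend can do a single pass
stats = df.groupby("g")["x"].agg(["sum", "mean", "std"])
print(stats)
```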