GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data
npx @tessl/cli install tessl/pypi-cudf-cu12@25.8.00
# cuDF: GPU-Accelerated DataFrames

cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming.

## Package Information

- **Package**: `cudf-cu12`
- **Import**: `cudf`
- **Version**: 25.8.0+
- **Installation**: `pip install cudf-cu12` or `conda install cudf`
- **Requirements**: NVIDIA GPU with CUDA support

## Core Imports

```python
# Main data structures
import cudf
from cudf import DataFrame, Series, Index

# I/O operations
from cudf import read_csv, read_parquet, read_json
from cudf.io import read_orc, read_avro, read_feather

# Data manipulation
from cudf import concat, merge, pivot_table
from cudf import cut, factorize, unique

# Type checking
from cudf.api.types import is_numeric_dtype, is_categorical_dtype
from cudf.api.types import dtype

# Configuration
from cudf.options import get_option, set_option

# Dataset generation
from cudf.datasets import timeseries, randomdata

# Version information
import cudf
print(cudf.__version__)  # Package version
```

## Basic Usage

```{ .api }
# Create DataFrame from dictionary
df = cudf.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [1.0, 2.5, 3.2, 4.1, 5.8],
    'z': ['red', 'green', 'blue', 'red', 'green']
})

# GPU-accelerated operations
result = df.groupby('z').agg({'x': 'sum', 'y': 'mean'})

# I/O operations leverage GPU memory
df_from_file = cudf.read_parquet('data.parquet')
df_from_file.to_csv('output.csv')

# Seamless pandas compatibility
df_pandas = df.to_pandas()  # Move to CPU
df_cudf = cudf.from_pandas(df_pandas)  # Move to GPU
```

## Architecture

cuDF leverages the RAPIDS ecosystem to provide GPU-accelerated data processing:

- **GPU Memory Management**: Built on RAPIDS Memory Manager (RMM) for efficient GPU memory allocation
- **Columnar Storage**: Uses Apache Arrow format for optimal GPU performance
- **libcudf Backend**: C++/CUDA library provides the computational engine
- **Pandas API**: Maintains familiar pandas interface while delivering GPU performance
- **Zero-Copy Interop**: Shares device memory with GPU libraries such as Numba without copying, and converts to and from PyArrow tables

## Core Data Structures

cuDF provides GPU-accelerated versions of pandas' core data structures with enhanced capabilities.

```{ .api }
class DataFrame:
    """GPU-accelerated DataFrame with pandas-like API"""

class Series:
    """One-dimensional GPU array with axis labels"""

class Index:
    """Immutable sequence used for axis labels and selection"""

class RangeIndex(Index):
    """Memory-efficient index for integer ranges"""

class CategoricalIndex(Index):
    """Index for categorical data with GPU acceleration"""
```

**Key Features**: GPU memory efficiency, nested data types (lists, structs), decimal precision support.

[**→ Learn more about Core Data Structures**](./core-data-structures.md)
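
The classes listed above can be exercised with a few constructor calls. The following is a minimal sketch, not a reference implementation: the alias `xdf` and the fallback to pandas when cuDF cannot be imported are assumptions added here so the snippet also runs on a machine without a GPU.

```python
try:
    import cudf as xdf  # GPU path
except Exception:  # cuDF not installed, or no usable GPU
    import pandas as xdf  # CPU stand-in with the same constructors (assumption)

# A Series is a labelled one-dimensional column.
prices = xdf.Series([9.5, 12.0, 7.25], index=["a", "b", "c"], name="price")

# A DataFrame built without an explicit index gets a RangeIndex.
df = xdf.DataFrame({"x": [1, 2, 3], "color": ["red", "green", "red"]})

# A CategoricalIndex keeps the distinct labels separately from the values.
cat_index = xdf.CategoricalIndex(["red", "green", "red"])

print(type(df.index).__name__)
print(len(cat_index.categories))
print(float(prices.sum()))
```

Because the constructors mirror pandas, the same lines work against either library, which is the main point of the pandas-like API.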

## I/O Operations

High-performance GPU I/O for popular data formats with automatic memory management.

```{ .api }
def read_parquet(filepath_or_buffer, columns=None, **kwargs) -> DataFrame:
    """
    Read Apache Parquet file directly into GPU memory

    Parameters:
        filepath_or_buffer: File path, URL, or buffer-like object
        columns: List[str], optional column subset to read
        **kwargs: Additional parquet reading options

    Returns:
        DataFrame: GPU-accelerated DataFrame
    """

def read_csv(filepath_or_buffer, **kwargs) -> DataFrame:
    """
    Read CSV file with GPU acceleration

    Parameters:
        filepath_or_buffer: File path or buffer
        **kwargs: CSV parsing options (delimiter, header, etc.)

    Returns:
        DataFrame: GPU DataFrame with parsed CSV data
    """
```

**Supported Formats**: Parquet, ORC, CSV, JSON, Avro, Feather, HDF5, raw text files.

[**→ Learn more about I/O Operations**](./io-operations.md)
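
A write-then-read round trip shows how the readers are typically called. This sketch uses CSV and a temporary directory; the file name `scores.csv`, the alias `xdf`, and the pandas fallback are illustrative assumptions. The `usecols` keyword plays the same column-pruning role for CSV that `columns` plays for `read_parquet`.

```python
import os
import tempfile

try:
    import cudf as xdf  # GPU path
except Exception:  # cuDF not installed, or no usable GPU
    import pandas as xdf  # CPU stand-in (assumption)

df = xdf.DataFrame({"id": [1, 2, 3], "score": [0.5, 1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scores.csv")  # hypothetical file name
    df.to_csv(path, index=False)  # write without the index column
    loaded = xdf.read_csv(path)  # parse every column
    subset = xdf.read_csv(path, usecols=["score"])  # parse one column only

print(list(loaded.columns))
print(list(subset.columns))
```

Reading only the columns you need reduces both parse time and the amount of device memory the result occupies.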

## Data Manipulation

GPU-accelerated operations for reshaping, joining, and transforming data.

```{ .api }
def concat(objs, axis=0, ignore_index=False, **kwargs) -> Union[DataFrame, Series]:
    """
    Concatenate cuDF objects along a particular axis

    Parameters:
        objs: Sequence of DataFrame/Series objects
        axis: int, axis to concatenate along (0='index', 1='columns')
        ignore_index: bool, reset index if True

    Returns:
        Union[DataFrame, Series]: Concatenated result
    """

def merge(left, right, how='inner', on=None, **kwargs) -> DataFrame:
    """
    Merge DataFrame objects with database-style join operations

    Parameters:
        left: DataFrame, left object to merge
        right: DataFrame, right object to merge
        how: str, type of merge ('inner', 'outer', 'left', 'right')
        on: label or list, column names to join on

    Returns:
        DataFrame: Merged DataFrame
    """
```

**Operations**: Concatenation, merging, pivoting, melting, groupby, aggregation, sorting.

[**→ Learn more about Data Manipulation**](./data-manipulation.md)
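
The two functions above are often used together: stack several frames, then join the result against a lookup table. The sketch below assumes small made-up `orders` and `customers` frames, and falls back to pandas if cuDF is unavailable. The result is sorted explicitly because a GPU join is not guaranteed to preserve row order.

```python
try:
    import cudf as xdf  # GPU path
except Exception:  # cuDF not installed, or no usable GPU
    import pandas as xdf  # CPU stand-in (assumption)

# Hypothetical sample data, for illustration only.
orders_q1 = xdf.DataFrame({"customer": [1, 2], "amount": [10.0, 20.0]})
orders_q2 = xdf.DataFrame({"customer": [2, 3], "amount": [5.0, 7.5]})
customers = xdf.DataFrame({"customer": [1, 2], "name": ["Ann", "Bob"]})

# Stack the quarterly frames row-wise and rebuild a clean 0..n-1 index.
orders = xdf.concat([orders_q1, orders_q2], ignore_index=True)

# A left join keeps every order; customer 3 has no match, so its name is null.
joined = xdf.merge(orders, customers, how="left", on="customer")
joined = joined.sort_values(["customer", "amount"]).reset_index(drop=True)

print(joined)
```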

## Type Checking & Validation

Comprehensive type checking system for GPU data types including nested types.

```{ .api }
def is_numeric_dtype(arr_or_dtype) -> bool:
    """
    Check whether the provided array or dtype is numeric

    Parameters:
        arr_or_dtype: Array-like or data type to check

    Returns:
        bool: True if numeric dtype
    """

def is_categorical_dtype(arr_or_dtype) -> bool:
    """
    Check whether the array or dtype is categorical

    Parameters:
        arr_or_dtype: Array-like or data type to check

    Returns:
        bool: True if categorical dtype
    """
```

**Type Support**: Standard dtypes, categorical, decimal, list, struct, interval, datetime types.

[**→ Learn more about Type Checking**](./type-checking.md)
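
A common use of these predicates is selecting columns by kind before an aggregation. The following sketch passes each column's `dtype` to `is_numeric_dtype`; the sample frame and the pandas fallback (`pandas.api.types`) are assumptions for illustration.

```python
try:
    import cudf as xdf  # GPU path
    from cudf.api.types import is_numeric_dtype
except Exception:  # cuDF not installed, or no usable GPU
    import pandas as xdf  # CPU stand-in (assumption)
    from pandas.api.types import is_numeric_dtype

df = xdf.DataFrame({
    "x": [1, 2, 3],
    "y": [0.5, 1.5, 2.5],
    "label": ["a", "b", "c"],
})

# Keep only the columns whose dtype is numeric, preserving column order.
numeric_cols = [name for name in df.columns if is_numeric_dtype(df[name].dtype)]

# Aggregating just those columns avoids errors from the string column.
totals = df[numeric_cols].sum()
print(numeric_cols)
```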

## Pandas Compatibility Layer

Drop-in acceleration for existing pandas code with `cudf.pandas`.

```{ .api }
def install() -> None:
    """
    Enable cuDF pandas accelerator mode

    Automatically accelerates pandas operations with GPU when beneficial,
    falls back to CPU pandas for unsupported operations.
    """

class Profiler:
    """
    Performance profiler for pandas acceleration opportunities

    Analyzes pandas code execution to identify GPU acceleration potential
    """
```

**Features**: Automatic fallback, transparent acceleration, performance profiling, IPython magic commands.

[**→ Learn more about Pandas Compatibility**](./pandas-compatibility.md)
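
In a script, accelerator mode is switched on by calling `install()` before pandas is imported. The sketch below guards that call so it degrades to plain pandas when cuDF is missing; the `accelerated` flag is a name introduced here for illustration.

```python
try:
    import cudf.pandas
    cudf.pandas.install()  # must run before pandas is imported
    accelerated = True
except Exception:  # cuDF not installed, or no usable GPU
    accelerated = False

import pandas as pd  # proxied by cuDF when install() succeeded

df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
totals = df.groupby("key")["value"].sum()

print("accelerated:", accelerated)
print(totals)
```

The same mode can be enabled without editing code by running `python -m cudf.pandas script.py`, or with `%load_ext cudf.pandas` in an IPython or Jupyter session.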

## Testing Utilities

GPU-aware testing framework with specialized assertions for cuDF objects.

```{ .api }
def assert_frame_equal(left, right, check_dtype=True, **kwargs) -> None:
    """
    Assert DataFrame equality with GPU-aware comparison

    Parameters:
        left: DataFrame, expected result
        right: DataFrame, actual result
        check_dtype: bool, whether to check dtype compatibility
        **kwargs: Additional comparison options
    """
```

**Capabilities**: DataFrame/Series/Index comparison, GPU memory validation, performance assertions.

[**→ Learn more about Testing Utilities**](./testing-utilities.md)
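
The effect of `check_dtype` is easiest to see with two frames that hold the same values under different dtypes. This is a sketch under the assumption that `assert_frame_equal` is importable from `cudf.testing`, with `pandas.testing` as a CPU fallback; the flag `dtype_mismatch_detected` is a name introduced here.

```python
try:
    import cudf as xdf  # GPU path
    from cudf.testing import assert_frame_equal
except Exception:  # cuDF not installed, or no usable GPU
    import pandas as xdf  # CPU stand-in (assumption)
    from pandas.testing import assert_frame_equal

expected = xdf.DataFrame({"x": [1, 2, 3]})

# Normalise row order and index before comparing.
actual = xdf.DataFrame({"x": [3, 1, 2]}).sort_values("x").reset_index(drop=True)
assert_frame_equal(expected, actual)  # same values, same dtype: no error

# Same values stored as floats: the default check_dtype=True rejects this.
as_float = xdf.DataFrame({"x": [1.0, 2.0, 3.0]})
try:
    assert_frame_equal(expected, as_float)
    dtype_mismatch_detected = False
except AssertionError:
    dtype_mismatch_detected = True

print(dtype_mismatch_detected)
```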

## Configuration Management

Global configuration system for controlling GPU memory usage and behavior.

```{ .api }
def get_option(key: str) -> Any:
    """
    Get the value of a configuration option

    Parameters:
        key: str, configuration option key

    Returns:
        Any: Current option value
    """

def set_option(key: str, value: Any) -> None:
    """
    Set a configuration option value

    Parameters:
        key: str, configuration option key
        value: Any, new option value
    """
```

**Options**: Memory management, display formatting, computation behavior, I/O settings.
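
Because options are global, a change made for one computation is usually reverted afterwards. The sketch below shows that save, set, restore pattern. The option key `mode.pandas_compatible` is assumed to be available in the installed cuDF version, and `display.max_rows` is used only for the pandas fallback.

```python
try:
    import cudf as xdf  # GPU path
    KEY = "mode.pandas_compatible"  # assumed boolean cuDF option
    NEW_VALUE = True
except Exception:  # cuDF not installed, or no usable GPU
    import pandas as xdf  # CPU stand-in (assumption)
    KEY = "display.max_rows"
    NEW_VALUE = 20

original = xdf.get_option(KEY)  # remember the current setting
try:
    xdf.set_option(KEY, NEW_VALUE)
    inside = xdf.get_option(KEY)  # value seen while the override is active
finally:
    xdf.set_option(KEY, original)  # always restore, even after an error

restored = xdf.get_option(KEY)
print(inside, restored)
```

The `try`/`finally` block ensures the global state is put back even if the code between the two `set_option` calls raises.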

## Error Handling

Specialized error types for GPU-specific issues and mixed-type operations.

```{ .api }
class UnsupportedCUDAError(Exception):
    """Raised when CUDA functionality is not supported"""

class MixedTypeError(Exception):
    """Raised when mixing incompatible GPU and CPU types"""
```

## Dataset Generation

Utilities for generating test data and benchmarking datasets directly in GPU memory.

```{ .api }
def timeseries(
    start='2000-01-01',
    end='2000-01-31',
    freq='1s',
    dtypes=None,
    nulls_frequency=0,
    seed=None
) -> DataFrame:
    """
    Generate random timeseries data for testing and benchmarking

    Parameters:
        start: str or datetime-like, start date
        end: str or datetime-like, end date
        freq: str, date frequency string (e.g., '1s', '1H', '1D')
        dtypes: dict, mapping of column names to types
        nulls_frequency: float, proportion of nulls to include (0-1)
        seed: int, random state seed for reproducibility

    Returns:
        DataFrame: GPU DataFrame with random timeseries data
    """

def randomdata(nrows=10, dtypes=None, seed=None) -> DataFrame:
    """
    Generate random data for testing and benchmarking

    Parameters:
        nrows: int, number of rows to generate
        dtypes: dict, mapping of column names to types
        seed: int, random state seed for reproducibility

    Returns:
        DataFrame: GPU DataFrame with random data
    """
```
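
A short call to `timeseries` with a coarse frequency keeps the generated frame small. The sketch below uses the parameters documented above; the local `timeseries` defined in the `except` branch is a hypothetical CPU stand-in written for this example, not part of cuDF, and it mirrors only the arguments used here.

```python
try:
    from cudf.datasets import timeseries  # GPU path
except Exception:  # cuDF not installed, or no usable GPU
    import numpy as np
    import pandas as pd

    def timeseries(start, end, freq, dtypes, seed=None):
        """Hypothetical CPU stand-in covering only the arguments used below."""
        index = pd.date_range(start, end, freq=freq, name="timestamp")
        rng = np.random.RandomState(seed)
        columns = {}
        for name, kind in dtypes.items():
            if kind is int:
                columns[name] = rng.poisson(1000, size=len(index))
            else:
                columns[name] = rng.rand(len(index)) * 2 - 1
        return pd.DataFrame(columns, index=index)

# One row per day for a week, with an integer and a float column.
df = timeseries(
    start="2000-01-01",
    end="2000-01-07",
    freq="1D",
    dtypes={"id": int, "x": float},
    seed=42,  # fixed seed so repeated runs produce the same data
)

daily_mean = float(df["x"].mean())
print(len(df), list(df.columns))
```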

## Performance Benefits

- **Memory Bandwidth**: 10-50x improvement over pandas for large datasets; actual gains depend on the workload and hardware
- **Parallel Processing**: Leverages thousands of GPU cores for operations
- **Memory Efficiency**: Columnar storage reduces memory footprint
- **Zero-Copy**: Minimal data movement between GPU operations
- **Automatic Optimization**: Query optimization and kernel fusion

## GPU Requirements

- NVIDIA GPU with Compute Capability 7.0+ (Volta architecture or newer)
- CUDA 12.0+ (the `cudf-cu12` package targets CUDA 12)
- Sufficient GPU memory for dataset size
- Compatible NVIDIA drivers
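
The compute capability requirement can be checked before installing anything. The sketch below shells out to `nvidia-smi` using only the standard library. It assumes a driver recent enough to support the `compute_cap` query field, and the helper names `parse_compute_capability` and `visible_gpus` are introduced here for illustration.

```python
import shutil
import subprocess

MIN_COMPUTE_CAPABILITY = (7, 0)  # Volta, per the requirements list


def parse_compute_capability(text):
    """Turn a string such as '8.6' into the tuple (8, 6)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def visible_gpus():
    """Return [(name, (major, minor)), ...], or [] if nothing can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.splitlines():
        if not line.strip():
            continue
        name, cap = line.rsplit(",", 1)
        try:
            gpus.append((name.strip(), parse_compute_capability(cap)))
        except ValueError:
            continue  # skip entries the driver reports as unavailable
    return gpus


supported = [name for name, cap in visible_gpus() if cap >= MIN_COMPUTE_CAPABILITY]
print(supported or "no supported GPU detected")
```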

## Version Information

Access package version and build information programmatically.

```{ .api }
import cudf

# Package version string
__version__ = cudf.__version__  # e.g., "25.8.0"

# Git commit hash (if available)
__git_commit__ = cudf.__git_commit__  # e.g., "6cea3743b6"
```
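
The version string can gate code that depends on a minimum release. This sketch compares the leading year and month components against the 25.8 minimum listed under Package Information; `release_tuple` is a helper name introduced here, and the placeholder version is used only when cuDF cannot be imported.

```python
import re


def release_tuple(version):
    """Extract (year, month) from strings such as '25.8.0' or '25.08.00'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


try:
    import cudf
    installed = cudf.__version__
except Exception:  # cuDF not installed, or no usable GPU
    installed = "25.8.0"  # placeholder so the sketch runs without cuDF

meets_minimum = release_tuple(installed) >= (25, 8)
print(installed, meets_minimum)
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as `"25.10"` sorting before `"25.8"`.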