Dask cuDF - A GPU Backend for Dask DataFrame providing GPU-accelerated parallel and larger-than-memory DataFrame computing
npx @tessl/cli install tessl/pypi-dask-cudf@24.12.00
# Dask-cuDF

Dask-cuDF is a GPU-accelerated backend for Dask DataFrame that provides a pandas-like API for parallel and larger-than-memory DataFrame computing on GPUs. When installed, it automatically registers as the "cudf" dataframe backend for Dask DataFrame, enabling seamless integration with the broader RAPIDS ecosystem for high-performance data analytics workloads.

## Package Information

- **Package Name**: dask-cudf
- **Language**: Python
- **Installation**: `pip install dask-cudf` or via conda from the RAPIDS channel
- **Dependencies**: Requires an NVIDIA GPU with CUDA support, cuDF, Dask, and other RAPIDS components

## Core Imports

```python
import dask_cudf
```

For DataFrame operations:

```python
from dask_cudf import DataFrame, Series, Index, from_delayed
```

To configure the cuDF backend automatically for new collections:

```python
import dask
dask.config.set({"dataframe.backend": "cudf"})
```

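A quick sanity check that the configuration took effect. This is plain Dask configuration, so it needs no GPU; the scoped form is useful when only part of a workload should run on the cuDF backend:

```python
import dask

# Setting the backend is ordinary Dask configuration; a GPU is not
# needed until a collection is actually created.
dask.config.set({"dataframe.backend": "cudf"})
assert dask.config.get("dataframe.backend") == "cudf"

# The setting can also be scoped to a block with a context manager:
with dask.config.set({"dataframe.backend": "pandas"}):
    assert dask.config.get("dataframe.backend") == "pandas"

# Outside the block, the global setting is restored.
assert dask.config.get("dataframe.backend") == "cudf"
```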
## Basic Usage

```python
import dask
import dask.dataframe as dd
import dask_cudf
from dask_cuda import LocalCUDACluster
from distributed import Client

# Set up a GPU cluster (optional, but recommended for multi-GPU work)
client = Client(LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1"))

# Configure Dask to use the cuDF backend
dask.config.set({"dataframe.backend": "cudf"})

# Read data using the standard Dask interface
df = dd.read_parquet("/path/to/data.parquet")

# Perform GPU-accelerated operations
result = df.groupby('category')['value'].mean()

# Convert a cuDF DataFrame to a Dask-cuDF collection
import cudf
cudf_df = cudf.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})
dask_cudf_df = dask_cudf.from_cudf(cudf_df, npartitions=2)

# Compute results
print(result.compute())
```

## Architecture

Dask-cuDF provides two API implementations:

- **Expression-based API** (modern, default): uses dask-expr for query planning and optimization
- **Legacy API**: the traditional Dask DataFrame implementation, kept for backward compatibility

The appropriate implementation is selected automatically based on the `DASK_DATAFRAME__QUERY_PLANNING` environment variable.

**Backend Registration**: Dask-cuDF registers itself as a dataframe backend through entry points defined in pyproject.toml:

- `dask.dataframe.backends`: `cudf = "dask_cudf.backends:CudfBackendEntrypoint"`
- `dask_expr.dataframe.backends`: `cudf = "dask_cudf.backends:CudfDXBackendEntrypoint"`

This enables automatic activation when `{"dataframe.backend": "cudf"}` is configured.

**GPU Memory Management**: Integrates with RMM (RAPIDS Memory Manager) for efficient GPU memory allocation, and supports spilling to host memory for improved stability.

## Capabilities

### Core DataFrame Operations

Create, manipulate, and convert GPU-accelerated DataFrames with full compatibility with the Dask DataFrame API, including automatic partitioning and distributed computing support.

```python { .api }
class DataFrame:
    def __init__(self, dsk, name, meta, divisions): ...

class Series:
    def __init__(self, dsk, name, meta, divisions): ...

class Index:
    def __init__(self, dsk, name, meta, divisions): ...

def from_cudf(data, npartitions=None, chunksize=None, sort=True, name=None):
    """
    Create a Dask-cuDF collection from a cuDF object.

    Parameters:
    - data: cudf.DataFrame, cudf.Series, or cudf.Index
    - npartitions: int, number of partitions
    - chunksize: int, rows per partition
    - sort: bool, sort by index
    - name: str, collection name

    Returns:
    Dask-cuDF DataFrame, Series, or Index
    """

def concat(*args, **kwargs):
    """Concatenate DataFrames along an axis (alias of dask.dataframe.concat)"""
```

[Core Operations](./core-operations.md)

### Data I/O Operations

Read and write data in various formats (CSV, JSON, Parquet, ORC, text) with GPU acceleration and automatic cuDF backend integration.

```python { .api }
def read_csv(*args, **kwargs):
    """Read CSV files with the cuDF backend"""

def read_json(*args, **kwargs):
    """Read JSON files with the cuDF backend"""

def read_parquet(*args, **kwargs):
    """Read Parquet files with the cuDF backend"""

def read_orc(*args, **kwargs):
    """Read ORC files with the cuDF backend"""

def read_text(path, chunksize="256 MiB", **kwargs):
    """Read text files with the cuDF backend (conditional availability)"""
```

[Data I/O](./data-io.md)

### Groupby and Aggregation Operations

Optimized groupby operations leverage GPU parallelism for high-performance aggregations, with support for custom aggregation functions.

```python { .api }
def groupby_agg(df, by, agg_dict, **kwargs):
    """
    Optimized groupby aggregation for GPU acceleration.

    Parameters:
    - df: DataFrame to group
    - by: str or list, grouping columns
    - agg_dict: dict, aggregation specification

    Returns:
    Aggregated DataFrame
    """
```

[Groupby Operations](./groupby-operations.md)

### Specialized Data Type Accessors

Accessor methods for complex cuDF data types, including list and struct columns, provide GPU-accelerated operations on nested data structures.

```python { .api }
class ListMethods:
    def len(self): ...
    def contains(self, search_key): ...
    def get(self, index): ...
    def unique(self): ...
    def sort_values(self, ascending=True, **kwargs): ...

class StructMethods:
    def field(self, key): ...
    def explode(self): ...
```

[Data Type Accessors](./data-type-accessors.md)