Dask cuDF - A GPU Backend for Dask DataFrame providing GPU-accelerated parallel and larger-than-memory DataFrame computing
npx @tessl/cli install tessl/pypi-dask-cudf@24.12.00
# Dask-cuDF

Dask-cuDF is a GPU-accelerated backend for Dask DataFrame that provides a pandas-like API for parallel and larger-than-memory DataFrame computing on GPUs. When installed, it automatically registers as the "cudf" dataframe backend for Dask DataFrame, enabling seamless integration with the broader RAPIDS ecosystem for high-performance data analytics workloads.

## Package Information

- **Package Name**: dask-cudf
- **Language**: Python
- **Installation**: `pip install dask-cudf` or via conda from the RAPIDS channel
- **Dependencies**: Requires an NVIDIA GPU with CUDA support, cuDF, Dask, and other RAPIDS components

## Core Imports

```python
import dask_cudf
```

For DataFrame operations:

```python
from dask_cudf import DataFrame, Series, Index, from_delayed
```

To configure the cuDF backend automatically for new collections:

```python
import dask
dask.config.set({"dataframe.backend": "cudf"})
```

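A quick sanity check that the configuration took effect. This is plain Dask configuration, so it needs no GPU; the scoped form is useful when only part of a workload should run on the cuDF backend:

```python
import dask

# Setting the backend is ordinary Dask configuration; a GPU is not
# needed until a collection is actually created.
dask.config.set({"dataframe.backend": "cudf"})
assert dask.config.get("dataframe.backend") == "cudf"

# The setting can also be scoped to a block with a context manager:
with dask.config.set({"dataframe.backend": "pandas"}):
    assert dask.config.get("dataframe.backend") == "pandas"

# Outside the block, the global setting is restored.
assert dask.config.get("dataframe.backend") == "cudf"
```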
## Basic Usage

```python
import dask
import dask.dataframe as dd
import dask_cudf
from dask_cuda import LocalCUDACluster
from distributed import Client

# Set up a GPU cluster (optional, but recommended for multi-GPU work)
client = Client(LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1"))

# Configure Dask to use the cuDF backend
dask.config.set({"dataframe.backend": "cudf"})

# Read data using the standard Dask interface
df = dd.read_parquet("/path/to/data.parquet")

# Perform GPU-accelerated operations
result = df.groupby('category')['value'].mean()

# Convert a cuDF DataFrame to a Dask-cuDF collection
import cudf
cudf_df = cudf.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})
dask_cudf_df = dask_cudf.from_cudf(cudf_df, npartitions=2)

# Compute results
print(result.compute())
```

## Architecture

Dask-cuDF provides two API implementations:

- **Expression-based API** (modern, default): uses dask-expr for query planning and optimization
- **Legacy API**: the traditional Dask DataFrame implementation, kept for backward compatibility

The appropriate implementation is selected automatically based on the `DASK_DATAFRAME__QUERY_PLANNING` environment variable.

**Backend Registration**: Dask-cuDF registers itself as a dataframe backend through entry points defined in pyproject.toml:

- `dask.dataframe.backends`: `cudf = "dask_cudf.backends:CudfBackendEntrypoint"`
- `dask_expr.dataframe.backends`: `cudf = "dask_cudf.backends:CudfDXBackendEntrypoint"`

This enables automatic activation when `{"dataframe.backend": "cudf"}` is configured.

**GPU Memory Management**: Integrates with RMM (RAPIDS Memory Manager) for efficient GPU memory allocation, and supports spilling to host memory for improved stability.

## Capabilities

### Core DataFrame Operations

Create, manipulate, and convert GPU-accelerated DataFrames with full compatibility with the Dask DataFrame API, including automatic partitioning and distributed computing support.

```python { .api }
class DataFrame:
    def __init__(self, dsk, name, meta, divisions): ...

class Series:
    def __init__(self, dsk, name, meta, divisions): ...

class Index:
    def __init__(self, dsk, name, meta, divisions): ...

def from_cudf(data, npartitions=None, chunksize=None, sort=True, name=None):
    """
    Create a Dask-cuDF collection from a cuDF object.

    Parameters:
    - data: cudf.DataFrame, cudf.Series, or cudf.Index
    - npartitions: int, number of partitions
    - chunksize: int, rows per partition
    - sort: bool, sort by index
    - name: str, collection name

    Returns:
    Dask-cuDF DataFrame, Series, or Index
    """

def concat(*args, **kwargs):
    """Concatenate DataFrames along an axis (alias of dask.dataframe.concat)"""
```

[Core Operations](./core-operations.md)

### Data I/O Operations

Read and write data in various formats (CSV, JSON, Parquet, ORC, text) with GPU acceleration and automatic cuDF backend integration.

```python { .api }
def read_csv(*args, **kwargs):
    """Read CSV files with the cuDF backend"""

def read_json(*args, **kwargs):
    """Read JSON files with the cuDF backend"""

def read_parquet(*args, **kwargs):
    """Read Parquet files with the cuDF backend"""

def read_orc(*args, **kwargs):
    """Read ORC files with the cuDF backend"""

def read_text(path, chunksize="256 MiB", **kwargs):
    """Read text files with the cuDF backend (conditional availability)"""
```

[Data I/O](./data-io.md)

### Groupby and Aggregation Operations

Optimized groupby operations leverage GPU parallelism for high-performance aggregations, with support for custom aggregation functions.

```python { .api }
def groupby_agg(df, by, agg_dict, **kwargs):
    """
    Optimized groupby aggregation for GPU acceleration.

    Parameters:
    - df: DataFrame to group
    - by: str or list, grouping columns
    - agg_dict: dict, aggregation specification

    Returns:
    Aggregated DataFrame
    """
```

[Groupby Operations](./groupby-operations.md)

### Specialized Data Type Accessors

Accessor methods for complex cuDF data types, including list and struct columns, provide GPU-accelerated operations on nested data structures.

```python { .api }
class ListMethods:
    def len(self): ...
    def contains(self, search_key): ...
    def get(self, index): ...
    def unique(self): ...
    def sort_values(self, ascending=True, **kwargs): ...

class StructMethods:
    def field(self, key): ...
    def explode(self): ...
```

[Data Type Accessors](./data-type-accessors.md)