tessl/pypi-dask-cudf

Dask cuDF - A GPU backend for Dask DataFrame providing GPU-accelerated parallel and larger-than-memory DataFrame computing

Workspace: tessl · Visibility: Public · Describes: pypi/dask-cudf@24.12.x

To install, run:

```
npx @tessl/cli install tessl/pypi-dask-cudf@24.12.0
```

# Dask-cuDF

Dask-cuDF is a GPU-accelerated backend for Dask DataFrame that provides a pandas-like API for parallel and larger-than-memory DataFrame computing on GPUs. When installed, it automatically registers as the "cudf" dataframe backend for Dask DataFrame, enabling seamless integration with the broader RAPIDS ecosystem for high-performance data analytics workloads.

## Package Information

- **Package Name**: dask-cudf
- **Language**: Python
- **Installation**: `pip install dask-cudf` or via conda from the RAPIDS channel
- **Dependencies**: Requires an NVIDIA GPU with CUDA support, cuDF, Dask, and other RAPIDS components

## Core Imports

```python
import dask_cudf
```

For DataFrame operations:

```python
from dask_cudf import DataFrame, Series, Index, from_delayed
```

Configuration for automatic cuDF backend:

```python
import dask

dask.config.set({"dataframe.backend": "cudf"})
```

## Basic Usage

```python
import dask
import dask.dataframe as dd
import dask_cudf
from dask_cuda import LocalCUDACluster
from distributed import Client

# Set up GPU cluster (optional but recommended for multi-GPU)
client = Client(LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1"))

# Configure to use cuDF backend
dask.config.set({"dataframe.backend": "cudf"})

# Read data using standard Dask interface
df = dd.read_parquet("/path/to/data.parquet")

# Perform GPU-accelerated operations
result = df.groupby('category')['value'].mean()

# Convert cuDF DataFrame to Dask-cuDF
import cudf
cudf_df = cudf.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})
dask_cudf_df = dask_cudf.from_cudf(cudf_df, npartitions=2)

# Compute results
print(result.compute())
```

## Architecture

Dask-cuDF provides two API implementations:

- **Expression-based API** (modern, default): uses dask-expr for query optimization and planning
- **Legacy API**: the traditional Dask DataFrame implementation, kept for backward compatibility

The appropriate implementation is selected automatically based on the `DASK_DATAFRAME__QUERY_PLANNING` environment variable.
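
The variable's name follows Dask's convention of mapping `DASK_`-prefixed environment variables onto nested config keys (a double underscore becomes a dot, a remaining underscore a hyphen). A stdlib-only sketch of that mapping — `env_to_config` is a hypothetical helper, simplified from what `dask.config` actually does (for instance, real Dask also parses values like `"False"` into Python objects, while this sketch keeps them as strings):

```python
def env_to_config(env, prefix="DASK_"):
    """Sketch: map DASK_-style environment variables to dotted config keys.

    DASK_DATAFRAME__QUERY_PLANNING -> dataframe.query-planning
    """
    config = {}
    for key, value in env.items():
        if not key.startswith(prefix):
            continue
        # "__" separates nesting levels; remaining "_" become "-"
        dotted = key[len(prefix):].lower().replace("__", ".").replace("_", "-")
        config[dotted] = value
    return config

print(env_to_config({"DASK_DATAFRAME__QUERY_PLANNING": "False"}))
# {'dataframe.query-planning': 'False'}
```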

**Backend Registration**: Dask-cuDF registers itself as a dataframe backend through entry points defined in pyproject.toml:

- `dask.dataframe.backends`: `cudf = "dask_cudf.backends:CudfBackendEntrypoint"`
- `dask_expr.dataframe.backends`: `cudf = "dask_cudf.backends:CudfDXBackendEntrypoint"`

This enables automatic activation when `{"dataframe.backend": "cudf"}` is configured.

**GPU Memory Management**: Integrates with RMM (RAPIDS Memory Manager) for efficient GPU memory allocation and supports spilling to host memory for improved stability.

## Capabilities

### Core DataFrame Operations

Create, manipulate, and convert GPU-accelerated DataFrames with full compatibility with the Dask DataFrame API, including automatic partitioning and distributed computing support.

```python { .api }
class DataFrame:
    def __init__(self, dsk, name, meta, divisions): ...

class Series:
    def __init__(self, dsk, name, meta, divisions): ...

class Index:
    def __init__(self, dsk, name, meta, divisions): ...

def from_cudf(data, npartitions=None, chunksize=None, sort=True, name=None):
    """
    Create Dask-cuDF collection from cuDF object.

    Parameters:
    - data: cudf.DataFrame, cudf.Series, or cudf.Index
    - npartitions: int, number of partitions
    - chunksize: int, rows per partition
    - sort: bool, sort by index
    - name: str, collection name

    Returns:
    Dask-cuDF DataFrame, Series, or Index
    """

def concat(*args, **kwargs):
    """Concatenate DataFrames along axis (alias to dask.dataframe.concat)"""
```
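
How `npartitions` carves up the rows can be illustrated with a small sketch of even, ceiling-division partition boundaries (`partition_bounds` is a hypothetical helper, not part of dask-cudf; the real `from_cudf` additionally derives index divisions so that partitions stay sorted when `sort=True`):

```python
def partition_bounds(n_rows, npartitions):
    # Row offsets at which each partition starts/ends, using ceiling
    # division so the first partitions absorb any remainder.
    step = -(-n_rows // npartitions)
    return [min(i * step, n_rows) for i in range(npartitions + 1)]

# 10 rows over 3 partitions -> row ranges [0:4], [4:8], [8:10]
print(partition_bounds(10, 3))  # [0, 4, 8, 10]
```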

[Core Operations](./core-operations.md)

### Data I/O Operations

Read and write data in various formats (CSV, JSON, Parquet, ORC, text) with GPU acceleration and automatic cuDF backend integration.

```python { .api }
def read_csv(*args, **kwargs):
    """Read CSV files with cuDF backend"""

def read_json(*args, **kwargs):
    """Read JSON files with cuDF backend"""

def read_parquet(*args, **kwargs):
    """Read Parquet files with cuDF backend"""

def read_orc(*args, **kwargs):
    """Read ORC files with cuDF backend"""

def read_text(path, chunksize="256 MiB", **kwargs):
    """Read text files with cuDF backend (conditional availability)"""
```
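
The `chunksize="256 MiB"` default of `read_text` is a human-readable byte size. A minimal sketch of how such a string resolves to a byte count — `parse_size` is a hypothetical stand-in for Dask's actual parser, `dask.utils.parse_bytes`, which accepts many more spellings:

```python
def parse_size(text):
    # Minimal "<number> <unit>" parser for binary byte units.
    units = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}
    value, unit = text.split()
    return int(float(value) * units[unit])

print(parse_size("256 MiB"))  # 268435456
```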

[Data I/O](./data-io.md)

### Groupby and Aggregation Operations

Optimized groupby operations leveraging GPU parallelism for high-performance aggregations with support for custom aggregation functions.

```python { .api }
def groupby_agg(df, by, agg_dict, **kwargs):
    """
    Optimized groupby aggregation for GPU acceleration.

    Parameters:
    - df: DataFrame to group
    - by: str or list, grouping columns
    - agg_dict: dict, aggregation specification

    Returns:
    Aggregated DataFrame
    """
```
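
The speedup follows Dask's usual chunk/combine/aggregate pattern: each partition computes partial results, the partials are merged, and one final step finishes the statistic. A plain-Python sketch for a grouped "mean" (illustrative only; the real `groupby_agg` performs these steps on cuDF columns on the GPU):

```python
from collections import defaultdict

# Two "partitions" of (key, value) records
partitions = [
    [("a", 1), ("b", 10)],
    [("a", 3), ("b", 20), ("a", 5)],
]

def chunk(part):
    # Per-partition partial result: (sum, count) per key
    acc = defaultdict(lambda: [0, 0])
    for key, val in part:
        acc[key][0] += val
        acc[key][1] += 1
    return acc

def combine(partials):
    # Merge the partial results from all partitions
    out = defaultdict(lambda: [0, 0])
    for acc in partials:
        for key, (s, c) in acc.items():
            out[key][0] += s
            out[key][1] += c
    return out

def aggregate(acc):
    # Finalize: mean = sum / count
    return {key: s / c for key, (s, c) in acc.items()}

means = aggregate(combine(chunk(p) for p in partitions))
print(means)  # {'a': 3.0, 'b': 15.0}
```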

[Groupby Operations](./groupby-operations.md)

### Specialized Data Type Accessors

Access methods for complex cuDF data types including list and struct columns, providing GPU-accelerated operations on nested data structures.

```python { .api }
class ListMethods:
    def len(self): ...
    def contains(self, search_key): ...
    def get(self, index): ...
    def unique(self): ...
    def sort_values(self, ascending=True, **kwargs): ...

class StructMethods:
    def field(self, key): ...
    def explode(self): ...
```
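
Row-wise, the list accessors behave like the following operations on plain Python lists (illustrative only; in Dask-cuDF they run on GPU list columns through the `.list` accessor):

```python
rows = [[1, 2, 3], [4], [5, 6]]

lens = [len(r) for r in rows]      # like .list.len()
firsts = [r[0] for r in rows]      # like .list.get(0)
has_two = [2 in r for r in rows]   # like .list.contains(2)

print(lens)     # [3, 1, 2]
print(firsts)   # [1, 4, 5]
print(has_two)  # [True, False, False]
```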

[Data Type Accessors](./data-type-accessors.md)