Dask cuDF - A GPU Backend for Dask DataFrame providing GPU-accelerated parallel and larger-than-memory DataFrame computing
```
npx @tessl/cli install tessl/pypi-dask-cudf@24.12.0
```

Dask-cuDF is a GPU-accelerated backend for Dask DataFrame that provides a pandas-like API for parallel and larger-than-memory DataFrame computing on GPUs. When installed, it automatically registers as the "cudf" dataframe backend for Dask DataFrame, enabling seamless integration with the broader RAPIDS ecosystem for high-performance data analytics workloads.
Install with `pip install dask-cudf`, or via conda from the RAPIDS channel.

```python
import dask_cudf
```

For DataFrame operations:

```python
from dask_cudf import DataFrame, Series, Index, from_delayed
```

Configuration for automatic cuDF backend:
```python
import dask
dask.config.set({"dataframe.backend": "cudf"})
```

```python
import dask
import dask.dataframe as dd
import dask_cudf
from dask_cuda import LocalCUDACluster
from distributed import Client

# Set up a GPU cluster (optional, but recommended for multi-GPU)
client = Client(LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1"))

# Configure Dask to use the cuDF backend
dask.config.set({"dataframe.backend": "cudf"})

# Read data using the standard Dask interface
df = dd.read_parquet("/path/to/data.parquet")

# Perform GPU-accelerated operations
result = df.groupby("category")["value"].mean()

# Convert a cuDF DataFrame to a Dask-cuDF collection
import cudf

cudf_df = cudf.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})
dask_cudf_df = dask_cudf.from_cudf(cudf_df, npartitions=2)

# Compute results
print(result.compute())
```

Dask-cuDF provides dual API implementations: a legacy implementation built on the classic dask.dataframe API and an expression-based implementation built on dask-expr (query planning).
The architecture automatically selects the appropriate implementation based on the DASK_DATAFRAME__QUERY_PLANNING environment variable.
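For example, the legacy implementation can be selected by setting the variable before dask.dataframe (or dask_cudf) is first imported; a minimal sketch:

```python
import os

# Must be set before dask.dataframe is first imported; "False" selects the
# legacy implementation, "True" the dask-expr (query-planning) one.
os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"

import dask_cudf  # noqa: E402
```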
Backend Registration: Dask-cuDF registers itself as a dataframe backend through entry points defined in pyproject.toml:
```
dask.dataframe.backends: cudf = "dask_cudf.backends:CudfBackendEntrypoint"
dask_expr.dataframe.backends: cudf = "dask_cudf.backends:CudfDXBackendEntrypoint"
```

This enables automatic activation when `{"dataframe.backend": "cudf"}` is configured.
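As a sketch of what this activation looks like in practice (assuming a working cuDF installation), collection-creation functions in dask.dataframe then dispatch to cuDF-backed objects:

```python
import dask
import dask.dataframe as dd

dask.config.set({"dataframe.backend": "cudf"})

# from_dict goes through the registered backend entry point, so the
# resulting collection is backed by cudf rather than pandas.
ddf = dd.from_dict({"a": [0, 1, 2, 3]}, npartitions=2)
print(type(ddf._meta))  # expected: cudf.core.dataframe.DataFrame
```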
GPU Memory Management: Integrates with RMM (RAPIDS Memory Manager) for efficient GPU memory allocation and supports spilling to host memory for improved stability.
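A minimal sketch of one such configuration using dask_cuda's LocalCUDACluster (the pool size here is an illustrative value):

```python
from dask_cuda import LocalCUDACluster
from distributed import Client

# rmm_pool_size pre-allocates an RMM memory pool on each worker GPU;
# enable_cudf_spill turns on cuDF's device-to-host spilling.
cluster = LocalCUDACluster(rmm_pool_size="10GB", enable_cudf_spill=True)
client = Client(cluster)
```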
Create, manipulate, and convert GPU-accelerated DataFrames that are fully compatible with the Dask DataFrame API, including automatic partitioning and distributed computing support.
```python
class DataFrame:
    def __init__(self, dsk, name, meta, divisions): ...

class Series:
    def __init__(self, dsk, name, meta, divisions): ...

class Index:
    def __init__(self, dsk, name, meta, divisions): ...
def from_cudf(data, npartitions=None, chunksize=None, sort=True, name=None):
    """
    Create a Dask-cuDF collection from a cuDF object.

    Parameters:
    - data: cudf.DataFrame, cudf.Series, or cudf.Index
    - npartitions: int, number of partitions
    - chunksize: int, number of rows per partition
    - sort: bool, sort the input by index first
    - name: str, name for the collection

    Returns:
    Dask-cuDF DataFrame, Series, or Index
    """
def concat(*args, **kwargs):
    """Concatenate DataFrames along an axis (alias of dask.dataframe.concat)."""
```

Read and write data in various formats (CSV, JSON, Parquet, ORC, text) with GPU acceleration and automatic cuDF backend integration.
```python
def read_csv(*args, **kwargs):
    """Read CSV files with the cuDF backend."""

def read_json(*args, **kwargs):
    """Read JSON files with the cuDF backend."""

def read_parquet(*args, **kwargs):
    """Read Parquet files with the cuDF backend."""

def read_orc(*args, **kwargs):
    """Read ORC files with the cuDF backend."""

def read_text(path, chunksize="256 MiB", **kwargs):
    """Read text files with the cuDF backend (conditional availability)."""
```
Optimized groupby operations leveraging GPU parallelism for high-performance aggregations, with support for custom aggregation functions.

```python
def groupby_agg(df, by, agg_dict, **kwargs):
    """
    Optimized groupby aggregation for GPU acceleration.

    Parameters:
    - df: DataFrame to group
    - by: str or list, grouping columns
    - agg_dict: dict, aggregation specification

    Returns:
    Aggregated DataFrame
    """
```
Access methods for complex cuDF data types, including list and struct columns, providing GPU-accelerated operations on nested data structures.

```python
class ListMethods:
    def len(self): ...
    def contains(self, search_key): ...
    def get(self, index): ...
    def unique(self): ...
    def sort_values(self, ascending=True, **kwargs): ...

class StructMethods:
    def field(self, key): ...
    def explode(self): ...
```
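A usage sketch of these accessors on small illustrative data:

```python
import cudf
import dask_cudf

gdf = cudf.DataFrame({"tags": [[1, 2, 2], [3], [4, 5, 5]]})
ddf = dask_cudf.from_cudf(gdf, npartitions=2)

# List accessor operations run element-wise on the GPU
print(ddf["tags"].list.len().compute())        # length of each list
print(ddf["tags"].list.contains(2).compute())  # membership per row
print(ddf["tags"].list.unique().compute())     # de-duplicated lists

# Struct accessor: extract a single field from each struct value
sdf = cudf.DataFrame({"rec": [{"x": 1, "y": 2}, {"x": 3, "y": 4}]})
dsf = dask_cudf.from_cudf(sdf, npartitions=1)
print(dsf["rec"].struct.field("x").compute())
```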