tessl/pypi-dask-cudf

Dask cuDF - A GPU Backend for Dask DataFrame providing GPU-accelerated parallel and larger-than-memory DataFrame computing

Dask-cuDF

Dask-cuDF is a GPU-accelerated backend for Dask DataFrame that provides a pandas-like API for parallel and larger-than-memory DataFrame computing on GPUs. When installed, it automatically registers as the "cudf" dataframe backend for Dask DataFrame, enabling seamless integration with the broader RAPIDS ecosystem for high-performance data analytics workloads.

Package Information

  • Package Name: dask-cudf
  • Language: Python
  • Installation: pip install dask-cudf or via conda from the RAPIDS channel
  • Dependencies: Requires NVIDIA GPU with CUDA support, cuDF, Dask, and other RAPIDS components

Core Imports

import dask_cudf

For DataFrame operations:

from dask_cudf import DataFrame, Series, Index, from_delayed

Configuration for automatic cuDF backend:

import dask
dask.config.set({"dataframe.backend": "cudf"})

Basic Usage

import dask
import dask.dataframe as dd
import dask_cudf
from dask_cuda import LocalCUDACluster
from distributed import Client

# Set up GPU cluster (optional but recommended for multi-GPU)
client = Client(LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1"))

# Configure to use cuDF backend
dask.config.set({"dataframe.backend": "cudf"})

# Read data using standard Dask interface
df = dd.read_parquet("/path/to/data.parquet")

# Perform GPU-accelerated operations
result = df.groupby('category')['value'].mean()

# Compute results
print(result.compute())

# Convert an in-memory cuDF DataFrame to a Dask-cuDF collection
import cudf
cudf_df = cudf.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})
dask_cudf_df = dask_cudf.from_cudf(cudf_df, npartitions=2)

Architecture

Dask-cuDF provides dual API implementations:

  • Expression-based API (modern, default): Uses dask-expr for query optimization and planning
  • Legacy API: Traditional Dask DataFrame implementation for backward compatibility

The architecture automatically selects the appropriate implementation based on the DASK_DATAFRAME__QUERY_PLANNING environment variable.
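As a minimal sketch of how that switch works: the variable has to be set before `dask.dataframe` or `dask_cudf` is imported, since the implementation is selected at import time (the `"False"` value below follows Dask's usual string convention for boolean config and is an assumption, not taken from this page):

```python
import os

# Select the legacy (non-expression) implementation. This must run
# before dask.dataframe / dask_cudf are imported, because the
# implementation is chosen at import time.
os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"
```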

Backend Registration: Dask-cuDF registers itself as a dataframe backend through entry points defined in pyproject.toml:

  • dask.dataframe.backends: cudf = "dask_cudf.backends:CudfBackendEntrypoint"
  • dask_expr.dataframe.backends: cudf = "dask_cudf.backends:CudfDXBackendEntrypoint"

This enables automatic activation when {"dataframe.backend": "cudf"} is configured.
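A small sketch of the activation path, assuming only that `dask` is installed (actually building GPU-backed collections additionally requires cuDF and a CUDA device). Setting the backend is pure configuration; the entry point is resolved lazily when a DataFrame collection is first created:

```python
import dask

# Point Dask DataFrame at the cuDF backend. No GPU work happens here;
# dask_cudf's entry point is looked up when a collection is created.
dask.config.set({"dataframe.backend": "cudf"})
assert dask.config.get("dataframe.backend") == "cudf"
```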

GPU Memory Management: Integrates with RMM (RAPIDS Memory Manager) for efficient GPU memory allocation and supports spilling to host memory for improved stability.
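As a hedged configuration sketch (parameter names are taken from dask_cuda and may vary between releases; running this requires an NVIDIA GPU), RMM pooling and cuDF spilling are typically enabled when the cluster is created:

```python
# Sketch only: requires dask_cuda, RMM, and an NVIDIA GPU.
from dask_cuda import LocalCUDACluster
from distributed import Client

cluster = LocalCUDACluster(
    rmm_pool_size="8GB",      # pre-allocate an RMM memory pool per worker
    enable_cudf_spill=True,   # spill cuDF device memory to host under pressure
)
client = Client(cluster)
```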

Capabilities

Core DataFrame Operations

Create, manipulate, and convert GPU-accelerated DataFrames with full compatibility with the Dask DataFrame API, including automatic partitioning and distributed computing support.

class DataFrame:
    def __init__(self, dsk, name, meta, divisions): ...

class Series:
    def __init__(self, dsk, name, meta, divisions): ...

class Index:
    def __init__(self, dsk, name, meta, divisions): ...

def from_cudf(data, npartitions=None, chunksize=None, sort=True, name=None): 
    """
    Create Dask-cuDF collection from cuDF object.
    
    Parameters:
    - data: cudf.DataFrame, cudf.Series, or cudf.Index
    - npartitions: int, number of partitions
    - chunksize: int, rows per partition
    - sort: bool, sort by index
    - name: str, collection name
    
    Returns:
    Dask-cuDF DataFrame, Series, or Index
    """

def concat(*args, **kwargs):
    """Concatenate DataFrames along an axis (alias for dask.dataframe.concat)"""


Data I/O Operations

Read and write data in various formats (CSV, JSON, Parquet, ORC, text) with GPU acceleration and automatic cuDF backend integration.

def read_csv(*args, **kwargs):
    """Read CSV files with cuDF backend"""

def read_json(*args, **kwargs):
    """Read JSON files with cuDF backend"""

def read_parquet(*args, **kwargs):
    """Read Parquet files with cuDF backend"""

def read_orc(*args, **kwargs):
    """Read ORC files with cuDF backend"""

def read_text(path, chunksize="256 MiB", **kwargs):
    """Read text files with cuDF backend (conditional availability)"""


Groupby and Aggregation Operations

Optimized groupby operations leveraging GPU parallelism for high-performance aggregations with support for custom aggregation functions.

def groupby_agg(df, by, agg_dict, **kwargs):
    """
    Optimized groupby aggregation for GPU acceleration.
    
    Parameters:
    - df: DataFrame to group
    - by: str or list, grouping columns
    - agg_dict: dict, aggregation specification
    
    Returns:
    Aggregated DataFrame
    """


Specialized Data Type Accessors

Access methods for complex cuDF data types including list and struct columns, providing GPU-accelerated operations on nested data structures.

class ListMethods:
    def len(self): ...
    def contains(self, search_key): ...
    def get(self, index): ...
    def unique(self): ...
    def sort_values(self, ascending=True, **kwargs): ...

class StructMethods:
    def field(self, key): ...
    def explode(self): ...


Describes: pkg:pypi/dask-cudf@24.12.x