
cuDF: GPU-Accelerated DataFrames

cuDF is a Python GPU DataFrame library, built on the Apache Arrow columnar memory format, for loading, joining, aggregating, filtering, and otherwise manipulating data. It provides a pandas-like API that will be familiar to data engineers and data scientists, who can use it to accelerate their workflows without writing CUDA code directly.

Package Information

  • Package: cudf-cu12
  • Import: cudf
  • Version: 25.8.0+
  • Installation: pip install cudf-cu12 or conda install -c rapidsai -c conda-forge cudf
  • Requirements: NVIDIA GPU with CUDA support

Core Imports

# Main data structures
import cudf
from cudf import DataFrame, Series, Index

# I/O operations
from cudf import read_csv, read_parquet, read_json
from cudf.io import read_orc, read_avro, read_feather

# Data manipulation
from cudf import concat, merge, pivot_table
from cudf import cut, factorize, unique

# Type checking
from cudf.api.types import is_numeric_dtype, is_categorical_dtype
from cudf.api.types import dtype

# Configuration
from cudf.options import get_option, set_option

# Dataset generation
from cudf.datasets import timeseries, randomdata

# Version information
import cudf
print(cudf.__version__)  # Package version

Basic Usage

# Create DataFrame from dictionary
df = cudf.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [1.0, 2.5, 3.2, 4.1, 5.8],
    'z': ['red', 'green', 'blue', 'red', 'green']
})

# GPU-accelerated operations
result = df.groupby('z').agg({'x': 'sum', 'y': 'mean'})

# I/O operations leverage GPU memory
df_from_file = cudf.read_parquet('data.parquet')
df_from_file.to_csv('output.csv')

# Seamless pandas compatibility
df_pandas = df.to_pandas()  # Move to CPU
df_cudf = cudf.from_pandas(df_pandas)  # Move to GPU

Architecture

cuDF leverages the RAPIDS ecosystem to provide GPU-accelerated data processing:

  • GPU Memory Management: Built on RAPIDS Memory Manager (RMM) for efficient GPU memory allocation
  • Columnar Storage: Uses Apache Arrow format for optimal GPU performance
  • libcudf Backend: C++/CUDA library provides the computational engine
  • Pandas API: Maintains familiar pandas interface while delivering GPU performance
  • Zero-Copy Interop: Seamless integration with PyArrow, Numba, and other GPU libraries

Core Data Structures

cuDF provides GPU-accelerated versions of pandas' core data structures with enhanced capabilities.

class DataFrame:
    """GPU-accelerated DataFrame with pandas-like API"""
    
class Series:
    """One-dimensional GPU array with axis labels"""
    
class Index:
    """Immutable sequence used for axis labels and selection"""
    
class RangeIndex(Index):
    """Memory-efficient index for integer ranges"""
    
class CategoricalIndex(Index):
    """Index for categorical data with GPU acceleration"""

Key Features: GPU memory efficiency, nested data types (lists, structs), decimal precision support.

→ Learn more about Core Data Structures

I/O Operations

High-performance GPU I/O for popular data formats with automatic memory management.

def read_parquet(filepath_or_buffer, columns=None, **kwargs) -> DataFrame:
    """
    Read Apache Parquet file directly into GPU memory
    
    Parameters:
        filepath_or_buffer: File path, URL, or buffer-like object
        columns: List[str], optional column subset to read
        **kwargs: Additional parquet reading options
        
    Returns:
        DataFrame: GPU-accelerated DataFrame
    """

def read_csv(filepath_or_buffer, **kwargs) -> DataFrame:
    """
    Read CSV file with GPU acceleration
    
    Parameters:
        filepath_or_buffer: File path or buffer
        **kwargs: CSV parsing options (delimiter, header, etc.)
        
    Returns:
        DataFrame: GPU DataFrame with parsed CSV data
    """

Supported Formats: Parquet, ORC, CSV, JSON, Avro, Feather, HDF5, raw text files.

→ Learn more about I/O Operations

Data Manipulation

GPU-accelerated operations for reshaping, joining, and transforming data.

def concat(objs, axis=0, ignore_index=False, **kwargs) -> Union[DataFrame, Series]:
    """
    Concatenate cuDF objects along a particular axis
    
    Parameters:
        objs: Sequence of DataFrame/Series objects
        axis: int, axis to concatenate along (0='index', 1='columns')
        ignore_index: bool, reset index if True
        
    Returns:
        Union[DataFrame, Series]: Concatenated result
    """

def merge(left, right, how='inner', on=None, **kwargs) -> DataFrame:
    """
    Merge DataFrame objects with database-style join operations
    
    Parameters:
        left: DataFrame, left object to merge
        right: DataFrame, right object to merge  
        how: str, type of merge ('inner', 'outer', 'left', 'right')
        on: label or list, column names to join on
        
    Returns:
        DataFrame: Merged DataFrame
    """

Operations: Concatenation, merging, pivoting, melting, groupby, aggregation, sorting.

→ Learn more about Data Manipulation

Type Checking & Validation

Comprehensive type checking system for GPU data types including nested types.

def is_numeric_dtype(arr_or_dtype) -> bool:
    """
    Check whether the provided array or dtype is numeric
    
    Parameters:
        arr_or_dtype: Array-like or data type to check
        
    Returns:
        bool: True if numeric dtype
    """

def is_categorical_dtype(arr_or_dtype) -> bool:
    """
    Check whether the array or dtype is categorical
    
    Parameters:
        arr_or_dtype: Array-like or data type to check
        
    Returns:
        bool: True if categorical dtype  
    """

Type Support: Standard dtypes, categorical, decimal, list, struct, interval, datetime types.

→ Learn more about Type Checking

Pandas Compatibility Layer

Drop-in acceleration for existing pandas code with cudf.pandas.

def install() -> None:
    """
    Enable cuDF pandas accelerator mode
    
    Automatically accelerates pandas operations with GPU when beneficial,
    falls back to CPU pandas for unsupported operations.
    """

class Profiler:
    """
    Performance profiler for pandas acceleration opportunities
    
    Analyzes pandas code execution to identify GPU acceleration potential
    """

Features: Automatic fallback, transparent acceleration, performance profiling, IPython magic commands.

→ Learn more about Pandas Compatibility

Testing Utilities

GPU-aware testing framework with specialized assertions for cuDF objects.

def assert_frame_equal(left, right, check_dtype=True, **kwargs) -> None:
    """
    Assert DataFrame equality with GPU-aware comparison
    
    Parameters:
        left: DataFrame, expected result
        right: DataFrame, actual result
        check_dtype: bool, whether to check dtype compatibility
        **kwargs: Additional comparison options
    """

Capabilities: DataFrame/Series/Index comparison, GPU memory validation, performance assertions.

→ Learn more about Testing Utilities

Configuration Management

Global configuration system for controlling GPU memory usage and behavior.

def get_option(key: str) -> Any:
    """
    Get the value of a configuration option
    
    Parameters:
        key: str, configuration option key
        
    Returns:
        Any: Current option value
    """

def set_option(key: str, value: Any) -> None:
    """
    Set a configuration option value
    
    Parameters:  
        key: str, configuration option key
        value: Any, new option value
    """

Options: Memory management, display formatting, computation behavior, I/O settings.

Error Handling

Specialized error types for GPU-specific issues and mixed-type operations.

class UnsupportedCUDAError(Exception):
    """Raised when CUDA functionality is not supported"""

class MixedTypeError(Exception):
    """Raised when mixing incompatible GPU and CPU types"""

Dataset Generation

Utilities for generating test data and benchmarking datasets directly in GPU memory.

def timeseries(
    start='2000-01-01', 
    end='2000-01-31', 
    freq='1s', 
    dtypes=None, 
    nulls_frequency=0, 
    seed=None
) -> DataFrame:
    """
    Generate random timeseries data for testing and benchmarking
    
    Parameters:
        start: str or datetime-like, start date
        end: str or datetime-like, end date  
        freq: str, date frequency string (e.g., '1s', '1H', '1D')
        dtypes: dict, mapping of column names to types
        nulls_frequency: float, proportion of nulls to include (0-1)
        seed: int, random state seed for reproducibility
        
    Returns:
        DataFrame: GPU DataFrame with random timeseries data
    """

def randomdata(nrows=10, dtypes=None, seed=None) -> DataFrame:
    """
    Generate random data for testing and benchmarking
    
    Parameters:
        nrows: int, number of rows to generate
        dtypes: dict, mapping of column names to types
        seed: int, random state seed for reproducibility
        
    Returns:
        DataFrame: GPU DataFrame with random data
    """

Performance Benefits

  • Memory Bandwidth: typically 10-50x speedups over pandas on large datasets, depending on workload
  • Parallel Processing: Leverages thousands of GPU cores for operations
  • Memory Efficiency: Columnar storage reduces memory footprint
  • Zero-Copy: Minimal data movement between GPU operations
  • Automatic Optimization: Query optimization and kernel fusion

GPU Requirements

  • NVIDIA GPU with Compute Capability 7.0+ (Volta architecture or newer)
  • CUDA 11.2+ or CUDA 12.0+ (the cudf-cu12 package targets CUDA 12)
  • Sufficient GPU memory for dataset size
  • Compatible NVIDIA drivers

Version Information

Access package version and build information programmatically.

import cudf

# Package version string  
__version__ = cudf.__version__  # e.g., "25.8.0"

# Git commit hash (if available)
__git_commit__ = cudf.__git_commit__  # e.g., "6cea3743b6"
