tessl/pypi-cudf-cu12

GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data

Core Data Structures

Name: tessl/pypi-cudf-cu12
Author: tessl

cuDF provides GPU-accelerated counterparts to pandas' core data structures, along with additional data types (decimal, list, struct) for large datasets and nested data. All of these structures keep their data in GPU memory.

DataFrame

The primary data structure for two-dimensional, tabular data with labeled axes.

class DataFrame:
    """
    GPU-accelerated DataFrame with pandas-like API
    
    Two-dimensional, size-mutable, potentially heterogeneous tabular data structure
    with labeled axes (rows and columns). Stored in GPU memory with columnar layout
    for optimal performance.
    
    Parameters:
        data: dict, list, ndarray, Series, DataFrame, optional
            Data to initialize DataFrame from various sources
        index: Index or array-like, optional
            Index (row labels) for the DataFrame
        columns: Index or array-like, optional  
            Column labels for the DataFrame
        dtype: dtype, optional
            Data type to force, otherwise infer
        copy: bool, default False
            Copy data if True
    
    Attributes:
        index: Index representing row labels
        columns: Index representing column labels  
        dtypes: Series with column data types
        shape: tuple representing DataFrame dimensions
        size: int representing total number of elements
        ndim: int representing number of dimensions (always 2)
        empty: bool indicating if DataFrame is empty
        
    Examples:
        # Create from dictionary
        df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.1, 6.2]})
        
        # Create with custom index
        df = cudf.DataFrame(
            {'x': [1, 2], 'y': [3, 4]},
            index=['row1', 'row2']
        )
    """
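The columnar layout and the shape/size attributes described above can be sketched in plain Python. This is only an analogy that runs without a GPU: the `columns` dict and the derived names are illustrative and are not cuDF internals.

```python
# Columnar layout: one contiguous sequence per column, keyed by column label.
columns = {'A': [1, 2, 3], 'B': [4.0, 5.1, 6.2]}

# shape is (rows, columns); size is the total number of elements.
n_rows = len(next(iter(columns.values())))
shape = (n_rows, len(columns))
size = shape[0] * shape[1]
print(shape)  # (3, 2)
print(size)   # 6

# A column-wise aggregation reads a single column and ignores the others.
total_a = sum(columns['A'])
print(total_a)  # 6

# Reassembling one row means gathering one element from every column.
row_1 = {name: values[1] for name, values in columns.items()}
print(row_1)  # {'A': 2, 'B': 5.1}
```

Column-wise operations are the cheap direction in this layout, which is why aggregations and filters tend to be expressed per column.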

Series

One-dimensional labeled array holding values of a single cuDF-supported data type.

class Series:
    """
    GPU-accelerated one-dimensional array with axis labels
    
    One-dimensional ndarray-like object containing an array of data and
    associated array of labels, called its index. Optimized for GPU computation
    with automatic memory management.
    
    Parameters:
        data: array-like, dict, scalar value
            Contains data stored in Series
        index: array-like or Index, optional
            Values must be hashable and same length as data
        dtype: dtype, optional
            Data type for the output Series
        name: str, optional
            Name to give to the Series
        copy: bool, default False
            Copy input data if True
            
    Attributes:
        index: Index representing the axis labels
        dtype: numpy.dtype representing data type
        shape: tuple representing Series dimensions  
        size: int representing number of elements
        ndim: int representing number of dimensions (always 1)
        name: str or None representing Series name
        values: cupy.ndarray representing underlying data
        
    Examples:
        # Create from list
        s = cudf.Series([1, 2, 3, 4, 5])
        
        # Create with index and name
        s = cudf.Series([1.1, 2.2, 3.3], 
                       index=['a', 'b', 'c'], 
                       name='values')
    """

Index Classes

Immutable sequences used for axis labels and data selection.

Base Index

class Index:
    """
    Immutable sequence used for axis labels and selection
    
    Base class for all index types in cuDF. Provides common functionality
    for indexing, selection, and alignment operations. GPU-accelerated for
    large-scale operations.
    
    Parameters:
        data: array-like (1-D)
            Data to create index from
        dtype: numpy.dtype, optional
            Data type for index
        copy: bool, default False
            Copy input data if True
        name: str, optional
            Name for the index
            
    Attributes:
        dtype: numpy.dtype representing data type
        shape: tuple representing index dimensions
        size: int representing number of elements
        ndim: int representing number of dimensions (always 1)
        name: str or None representing index name
        values: cupy.ndarray representing underlying data
        is_unique: bool indicating if all values are unique
        
    Examples:
        # Create from list
        idx = cudf.Index([1, 2, 3, 4])
        
        # Create with name
        idx = cudf.Index(['a', 'b', 'c'], name='letters')
    """

RangeIndex

class RangeIndex(Index):
    """
    Memory-efficient index representing a range of integers
    
    Immutable index implementing a monotonic integer range. Optimized for
    memory efficiency by storing only start, stop, and step values rather
    than materializing the entire range.
    
    Parameters:
        start: int, optional (default 0)
            Start value of the range
        stop: int, optional
            Stop value of the range (exclusive)
        step: int, optional (default 1)
            Step size of the range
        name: str, optional
            Name for the index
            
    Attributes:
        start: int representing range start
        stop: int representing range stop  
        step: int representing range step
        
    Examples:
        # Create range index
        idx = cudf.RangeIndex(10)  # 0 to 9
        idx = cudf.RangeIndex(1, 11, 2)  # 1, 3, 5, 7, 9
    """
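The memory argument can be illustrated without a GPU, because Python's built-in `range` follows the same idea of storing only start, stop, and step. The sketch below is an analogy in plain Python, not cuDF code.

```python
import sys

# A range keeps only start, stop, and step, so its footprint
# does not depend on how many values it represents.
small = range(10)               # 0 to 9
large = range(1_000_000_000)    # one billion values, never materialized
same_footprint = sys.getsizeof(small) == sys.getsizeof(large)
print(same_footprint)

# Elements are computed on demand as start + i * step.
odd = range(1, 11, 2)
print(list(odd))  # [1, 3, 5, 7, 9]
print(odd[3])     # 7
print(len(odd))   # 5
```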

CategoricalIndex

class CategoricalIndex(Index):
    """
    Index for categorical data with GPU acceleration
    
    Immutable index for categorical data. Provides memory efficiency for
    repeated string or numeric values by storing categories and codes
    separately. GPU-accelerated for large categorical datasets.
    
    Parameters:
        data: array-like
            Categorical data for the index
        categories: array-like, optional
            Unique categories for the data
        ordered: bool, default False
            Whether categories have a meaningful order
        dtype: CategoricalDtype, optional
            Categorical data type
        name: str, optional
            Name for the index
            
    Attributes:
        categories: Index representing unique categories
        codes: cupy.ndarray representing category codes
        ordered: bool indicating if categories are ordered
        
    Examples:
        # Create categorical index
        idx = cudf.CategoricalIndex(['red', 'blue', 'red', 'green'])
        
        # With explicit categories  
        idx = cudf.CategoricalIndex(
            ['small', 'large', 'medium'],
            categories=['small', 'medium', 'large'],
            ordered=True
        )
    """
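The split into categories and codes can be sketched in plain Python. The `encode` helper below is hypothetical and only illustrates the idea. It assumes two pandas conventions: inferred categories are sorted, and a value outside the categories gets code -1.

```python
def encode(values, categories=None):
    """Split values into (categories, codes), the two parts a categorical stores."""
    if categories is None:
        # Assumption: inferred categories are the sorted unique values.
        categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    # Assumption: -1 marks a value that is not one of the categories.
    codes = [position.get(value, -1) for value in values]
    return list(categories), codes

categories, codes = encode(['red', 'blue', 'red', 'green'])
print(categories)  # ['blue', 'green', 'red']
print(codes)       # [2, 0, 2, 1]

# With explicit categories, the given order defines the codes.
size_categories, size_codes = encode(
    ['small', 'large', 'medium'],
    categories=['small', 'medium', 'large'],
)
print(size_codes)  # [0, 2, 1]
```

Each repeated string is stored once in the categories, and every row holds only a small integer. For an ordered categorical, comparing two values reduces to comparing their codes.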

DatetimeIndex

class DatetimeIndex(Index):
    """
    Index for datetime values with GPU acceleration
    
    Immutable index containing datetime64 values. Provides fast temporal
    operations and date-based selection. GPU-accelerated for time series
    operations on large datasets.
    
    Parameters:
        data: array-like
            Datetime-like data for the index
        freq: str or DateOffset, optional
            Frequency of the datetime data
        tz: str or timezone, optional
            Timezone for localized datetime index
        normalize: bool, default False
            Normalize start/end dates to midnight
        name: str, optional
            Name for the index
            
    Attributes:
        freq: str or None representing frequency
        tz: timezone or None representing timezone
        year: Index representing year values
        month: Index representing month values
        day: Index representing day values
        hour: Index representing hour values
        minute: Index representing minute values
        second: Index representing second values
        
    Examples:
        # Create from date strings
        idx = cudf.DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03'])
        
        # With timezone
        idx = cudf.DatetimeIndex(
            ['2023-01-01', '2023-01-02'], 
            tz='UTC'
        )
    """

TimedeltaIndex

class TimedeltaIndex(Index):
    """
    Index for timedelta values with GPU acceleration
    
    Immutable index containing timedelta64 values. Represents durations
    and time differences. GPU-accelerated for temporal arithmetic operations.
    
    Parameters:
        data: array-like
            Timedelta-like data for the index
        unit: str, optional
            Unit of the timedelta data ('D', 'h', 'm', 's', etc.)
        freq: str or DateOffset, optional  
            Frequency of the timedelta data
        name: str, optional
            Name for the index
            
    Attributes:
        freq: str or None representing frequency
        components: DataFrame with timedelta components
        days: Index representing days component
        seconds: Index representing seconds component
        microseconds: Index representing microseconds component
        nanoseconds: Index representing nanoseconds component
        
    Examples:
        # Create from timedelta strings
        idx = cudf.TimedeltaIndex(['1 day', '2 hours', '30 minutes'])
        
        # From numeric values with unit
        idx = cudf.TimedeltaIndex([1, 2, 3], unit='D')
    """
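The component attributes are easy to misread: `seconds` is the remainder within a day, not the total duration. The standard library's `datetime.timedelta` uses the same convention for days, seconds, and microseconds, so it serves as a GPU-free illustration. This is an analogy, not cuDF code, and the standard-library type has no nanoseconds component.

```python
from datetime import timedelta

durations = [
    timedelta(days=1),
    timedelta(hours=2),
    timedelta(minutes=30),
    timedelta(days=1, hours=2, minutes=30),
]

# Each duration splits into whole days plus a remainder of less than one day.
components = [(d.days, d.seconds, d.microseconds) for d in durations]
print(components)
# [(1, 0, 0), (0, 7200, 0), (0, 1800, 0), (1, 9000, 0)]

# The total duration is a separate quantity from the seconds component.
totals = [d.total_seconds() for d in durations]
print(totals)  # [86400.0, 7200.0, 1800.0, 95400.0]
```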

IntervalIndex

class IntervalIndex(Index):
    """
    Index for interval data with GPU acceleration
    
    Immutable index containing Interval objects. Represents closed, open,
    or half-open intervals. GPU-accelerated for interval-based operations
    and overlapping queries.
    
    Parameters:
        data: array-like
            Interval-like data for the index
        closed: str, default 'right'
            Whether intervals are closed ('left', 'right', 'both', 'neither')
        dtype: IntervalDtype, optional
            Interval data type
        name: str, optional
            Name for the index
            
    Attributes:
        closed: str representing interval closure type
        left: Index representing left bounds
        right: Index representing right bounds  
        mid: Index representing interval midpoints
        length: Index representing interval lengths
        
    Examples:
        # Create from arrays
        left = [0, 1, 2]
        right = [1, 2, 3]
        idx = cudf.IntervalIndex.from_arrays(left, right)
        
        # From tuples
        intervals = [(0, 1), (1, 2), (2, 3)]
        idx = cudf.IntervalIndex.from_tuples(intervals)
    """
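The effect of the `closed` parameter on membership, along with the `mid` and `length` attributes, can be sketched in plain Python. The `Interval` class below is a hypothetical stand-in written for this illustration and is not cuDF's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    left: float
    right: float
    closed: str = "right"  # 'left', 'right', 'both', or 'neither'

    def __contains__(self, x):
        # A bound is inclusive only if that side is closed.
        lower_ok = x >= self.left if self.closed in ("left", "both") else x > self.left
        upper_ok = x <= self.right if self.closed in ("right", "both") else x < self.right
        return lower_ok and upper_ok

    @property
    def mid(self):
        return (self.left + self.right) / 2

    @property
    def length(self):
        return self.right - self.left

bounds = [(0, 1), (1, 2), (2, 3)]

# Default closed='right': (0, 1], (1, 2], (2, 3]
right_closed = [Interval(l, r) for l, r in bounds]
print([1 in iv for iv in right_closed])  # [True, False, False]

# closed='left': [0, 1), [1, 2), [2, 3)
left_closed = [Interval(l, r, closed="left") for l, r in bounds]
print([1 in iv for iv in left_closed])   # [False, True, False]

print([iv.mid for iv in right_closed])     # [0.5, 1.5, 2.5]
print([iv.length for iv in right_closed])  # [1, 1, 1]
```

With half-open intervals that share endpoints, every boundary value belongs to exactly one interval, which is why 'right' or 'left' closure is the usual choice for binning.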

MultiIndex

class MultiIndex(Index):
    """
    Multi-level/hierarchical index for GPU DataFrames
    
    Multi-level index object. Represents multiple levels of indexing
    on a single axis. GPU-accelerated for hierarchical data operations
    and multi-dimensional selections.
    
    Parameters:
        levels: sequence of arrays
            Unique labels for each level
        codes: sequence of arrays  
            Integers for each level indicating label positions
        names: sequence of str, optional
            Names for each level
        
    Attributes:
        levels: list of Index objects representing each level
        codes: list of arrays representing level codes
        names: list of str representing level names
        nlevels: int representing number of levels
        
    Examples:
        # Create from arrays
        arrays = [
            ['A', 'A', 'B', 'B'],
            [1, 2, 1, 2]
        ]
        idx = cudf.MultiIndex.from_arrays(arrays, names=['letter', 'number'])
        
        # From tuples
        tuples = [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
        idx = cudf.MultiIndex.from_tuples(tuples)
    """
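The relationship between `levels` and `codes` can be sketched in plain Python. The `factorize_levels` helper is hypothetical, and it assumes that each level holds the sorted unique labels, as pandas does when building from tuples.

```python
def factorize_levels(tuples):
    """Turn a list of label tuples into per-level (levels, codes)."""
    n_levels = len(tuples[0])
    levels, codes = [], []
    for i in range(n_levels):
        column = [t[i] for t in tuples]
        # Assumption: a level is the sorted set of labels seen at that position.
        level = sorted(set(column))
        lookup = {label: pos for pos, label in enumerate(level)}
        levels.append(level)
        # Each code is the position of the row's label within its level.
        codes.append([lookup[label] for label in column])
    return levels, codes

tuples = [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
levels, codes = factorize_levels(tuples)
print(levels)  # [['A', 'B'], [1, 2]]
print(codes)   # [[0, 0, 1, 1], [0, 1, 0, 1]]

# The original tuples are recovered by looking each code up in its level.
rebuilt = [
    tuple(levels[i][codes[i][row]] for i in range(len(levels)))
    for row in range(len(tuples))
]
print(rebuilt == tuples)  # True
```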

Data Types

Extended data type system supporting nested and specialized types.

CategoricalDtype

class CategoricalDtype:
    """
    Extension dtype for categorical data
    
    Data type for categorical data with optional ordering. Provides memory
    efficiency for repeated values and supports ordered categorical operations.
    
    Parameters:
        categories: Index-like, optional
            Unique categories for the data
        ordered: bool, default False
            Whether categories have meaningful order
            
    Attributes:
        categories: Index representing unique categories
        ordered: bool indicating if categories are ordered
        
    Examples:
        # Create categorical dtype
        dtype = cudf.CategoricalDtype(['red', 'blue', 'green'])
        
        # With ordering
        dtype = cudf.CategoricalDtype(
            ['small', 'medium', 'large'], 
            ordered=True
        )
    """

Decimal Data Types

class Decimal32Dtype:
    """
    32-bit fixed-point decimal data type
    
    Extension dtype for 32-bit decimal numbers with configurable precision
    and scale. Provides exact decimal arithmetic without floating-point errors.
    
    Parameters:
        precision: int (1-9)
            Total number of digits
        scale: int (0-precision)
            Number of digits after decimal point
            
    Examples:
        # Create decimal32 dtype
        dtype = cudf.Decimal32Dtype(precision=7, scale=2)  # 99999.99 max
    """

class Decimal64Dtype:
    """
    64-bit fixed-point decimal data type
    
    Extension dtype for 64-bit decimal numbers with configurable precision
    and scale. Provides exact decimal arithmetic for financial calculations.
    
    Parameters:
        precision: int (1-18)
            Total number of digits
        scale: int (0-precision) 
            Number of digits after decimal point
            
    Examples:
        # Create decimal64 dtype
        dtype = cudf.Decimal64Dtype(precision=10, scale=4)  # 999999.9999 max
    """

class Decimal128Dtype:
    """
    128-bit fixed-point decimal data type
    
    Extension dtype for 128-bit decimal numbers with configurable precision
    and scale. Provides highest precision decimal arithmetic.
    
    Parameters:
        precision: int (1-38)
            Total number of digits  
        scale: int (0-precision)
            Number of digits after decimal point
            
    Examples:
        # Create decimal128 dtype  
        dtype = cudf.Decimal128Dtype(precision=20, scale=6)
    """
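The precision limits of 9, 18, and 38 digits follow from the storage width if the decimal is held as an unscaled signed integer, which is the assumption in the sketch below. It uses only the standard library, and `max_decimal` is a hypothetical helper rather than a cuDF function.

```python
from decimal import Decimal

def max_decimal(precision, scale):
    """Largest value with `precision` total digits, `scale` of them after the point."""
    # Built from a (sign, digits, exponent) tuple so no context rounding applies.
    return Decimal((0, (9,) * precision, -scale))

print(max_decimal(7, 2))   # 99999.99
print(max_decimal(10, 4))  # 999999.9999

# Assumption: a value is stored as an unscaled signed integer of the given width.
# The maximum precision is then the largest digit count that always fits.
limits = {}
for bits, max_precision in [(32, 9), (64, 18), (128, 38)]:
    largest_signed = 2 ** (bits - 1) - 1
    fits = 10 ** max_precision - 1 <= largest_signed
    one_more_overflows = 10 ** (max_precision + 1) - 1 > largest_signed
    limits[bits] = (fits, one_more_overflows)
print(limits)
# {32: (True, True), 64: (True, True), 128: (True, True)}
```

Because the scale only moves the decimal point, widening the scale at a fixed precision reduces the number of digits available before the point.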

Nested Data Types

class ListDtype:
    """
    Extension dtype for nested list data
    
    Data type representing lists of elements where each row can contain
    a variable-length list. Supports nested operations and list processing
    on GPU.
    
    Parameters:
        element_type: dtype
            Data type of list elements
            
    Attributes:
        element_type: dtype representing element data type
        
    Examples:
        # Create list dtype
        dtype = cudf.ListDtype('int64')  # Lists of integers
        dtype = cudf.ListDtype('float32')  # Lists of floats
    """
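Variable-length rows are commonly stored in the Apache Arrow list layout: one flat array of elements plus an offsets array marking where each row starts and ends. The plain-Python sketch below illustrates that layout. The helper names are hypothetical, and treating this as cuDF's internal representation is an assumption based on the Arrow-format storage mentioned in this document.

```python
def to_list_column(rows):
    """Flatten variable-length rows into (offsets, values)."""
    offsets = [0]
    values = []
    for row in rows:
        values.extend(row)
        offsets.append(len(values))  # end of this row, start of the next
    return offsets, values

def get_row(offsets, values, i):
    """Slice row i back out of the flat values."""
    return values[offsets[i]:offsets[i + 1]]

rows = [[1, 2, 3], [], [4, 5]]
offsets, values = to_list_column(rows)
print(offsets)  # [0, 3, 3, 5]
print(values)   # [1, 2, 3, 4, 5]
print(get_row(offsets, values, 2))  # [4, 5]

# Row lengths come from adjacent offsets, with no need to touch the values.
lengths = [offsets[i + 1] - offsets[i] for i in range(len(rows))]
print(lengths)  # [3, 0, 2]
```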

class StructDtype:
    """
    Extension dtype for nested struct data
    
    Data type representing structured data where each row contains
    multiple named fields. Similar to database records or JSON objects.
    
    Parameters:
        fields: dict
            Mapping of field names to data types
            
    Attributes:
        fields: dict representing field name to dtype mapping
        
    Examples:
        # Create struct dtype
        fields = {'x': 'int64', 'y': 'float64', 'name': 'object'}
        dtype = cudf.StructDtype(fields)
    """
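A struct column can be pictured as one child column per field, following the Apache Arrow struct layout. The plain-Python sketch below shows that decomposition. The helper names and sample rows are hypothetical, and whether cuDF stores structs exactly this way is an assumption.

```python
def to_struct_column(rows, field_names):
    """Split record-like rows into one child column per field."""
    return {name: [row[name] for row in rows] for name in field_names}

def get_struct_row(children, i):
    """Reassemble row i from the child columns."""
    return {name: column[i] for name, column in children.items()}

rows = [
    {'x': 1, 'y': 0.5, 'name': 'a'},
    {'x': 2, 'y': 1.5, 'name': 'b'},
]
children = to_struct_column(rows, ['x', 'y', 'name'])
print(children['x'])     # [1, 2]
print(children['name'])  # ['a', 'b']
print(get_struct_row(children, 1))  # {'x': 2, 'y': 1.5, 'name': 'b'}
```

Reading a single field across all rows only touches that field's child column, which keeps field access consistent with the columnar model used elsewhere.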

IntervalDtype

class IntervalDtype:
    """
    Extension dtype for interval data
    
    Data type for interval objects with configurable closure behavior
    and subtype. Used for representing ranges and interval-based operations.
    
    Parameters:
        subtype: dtype, optional (default 'float64')
            Data type for interval bounds
        closed: str, optional (default 'right')
            Whether intervals are closed ('left', 'right', 'both', 'neither')
            
    Attributes:
        subtype: dtype representing bounds data type
        closed: str representing closure behavior
        
    Examples:
        # Create interval dtype
        dtype = cudf.IntervalDtype('int64', closed='both')
        dtype = cudf.IntervalDtype('float32', closed='left')
    """

Special Values

Constants for representing missing and special values.

NA = cudf.NA
"""
Scalar representation of missing value

cuDF's representation of a missing value that is compatible across
all data types including nested types. Distinct from None and np.nan.

Examples:
    # Create Series with missing values
    s = cudf.Series([1, cudf.NA, 3])
    
    # Check for missing values  
    mask = s.isna()  # Returns boolean mask
"""

NaT = cudf.NaT  
"""
Not-a-Time representation for datetime/timedelta

Pandas-compatible representation of missing datetime or timedelta values.
Used specifically for temporal data types.

Examples:
    # Create datetime series with NaT
    dates = cudf.Series(['2023-01-01', cudf.NaT, '2023-01-03'])
    dates = cudf.to_datetime(dates)
"""

Memory Management

cuDF data structures allocate GPU memory through the RAPIDS Memory Manager (RMM):

Columnar Storage: Apache Arrow format for cache efficiency
Memory Pools: Reduces allocation overhead for frequent operations
Zero-Copy: Minimal data movement between operations
Automatic Cleanup: Garbage collection integration for GPU memory
Memory Mapping: Support for memory-mapped files

Type Conversions

# GPU to CPU conversion
df_pandas = cudf_df.to_pandas()
series_pandas = cudf_series.to_pandas()

# CPU to GPU conversion  
cudf_df = cudf.from_pandas(pandas_df)
cudf_series = cudf.from_pandas(pandas_series)

# Arrow integration (from_arrow is a classmethod on DataFrame)
arrow_table = cudf_df.to_arrow()
cudf_df = cudf.DataFrame.from_arrow(arrow_table)

# NumPy/CuPy arrays
cupy_array = cudf_series.values  # Get underlying CuPy array
cudf_series = cudf.Series(cupy_array)  # Create from CuPy array