GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data
cuDF provides GPU-accelerated counterparts to pandas' core data structures, with additional support for large datasets and for nested and decimal data types. All structures store their data in GPU memory.
The primary data structure for two-dimensional, tabular data with labeled axes.
class DataFrame:
"""
GPU-accelerated DataFrame with pandas-like API
Two-dimensional, size-mutable, potentially heterogeneous tabular data structure
with labeled axes (rows and columns). Stored in GPU memory with columnar layout
for optimal performance.
Parameters:
data: dict, list, ndarray, Series, DataFrame, optional
Data to initialize DataFrame from various sources
index: Index or array-like, optional
Index (row labels) for the DataFrame
columns: Index or array-like, optional
Column labels for the DataFrame
dtype: dtype, optional
Data type to force, otherwise infer
copy: bool, default False
Copy data if True
Attributes:
index: Index representing row labels
columns: Index representing column labels
dtypes: Series with column data types
shape: tuple representing DataFrame dimensions
size: int representing total number of elements
ndim: int representing number of dimensions (always 2)
empty: bool indicating if DataFrame is empty
Examples:
# Create from dictionary
df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.1, 6.2]})
# Create with custom index
df = cudf.DataFrame(
{'x': [1, 2], 'y': [3, 4]},
index=['row1', 'row2']
)
"""
One-dimensional labeled array capable of holding any data type.
class Series:
"""
GPU-accelerated one-dimensional array with axis labels
One-dimensional ndarray-like object containing an array of data and
associated array of labels, called its index. Optimized for GPU computation
with automatic memory management.
Parameters:
data: array-like, dict, scalar value
Contains data stored in Series
index: array-like or Index, optional
Values must be hashable and same length as data
dtype: dtype, optional
Data type for the output Series
name: str, optional
Name to give to the Series
copy: bool, default False
Copy input data if True
Attributes:
index: Index representing the axis labels
dtype: numpy.dtype representing data type
shape: tuple representing Series dimensions
size: int representing number of elements
ndim: int representing number of dimensions (always 1)
name: str or None representing Series name
values: cupy.ndarray representing underlying data
Examples:
# Create from list
s = cudf.Series([1, 2, 3, 4, 5])
# Create with index and name
s = cudf.Series([1.1, 2.2, 3.3],
index=['a', 'b', 'c'],
name='values')
"""
Immutable sequences used for axis labels and data selection.
class Index:
"""
Immutable sequence used for axis labels and selection
Base class for all index types in cuDF. Provides common functionality
for indexing, selection, and alignment operations. GPU-accelerated for
large-scale operations.
Parameters:
data: array-like (1-D)
Data to create index from
dtype: numpy.dtype, optional
Data type for index
copy: bool, default False
Copy input data if True
name: str, optional
Name for the index
Attributes:
dtype: numpy.dtype representing data type
shape: tuple representing index dimensions
size: int representing number of elements
ndim: int representing number of dimensions (always 1)
name: str or None representing index name
values: cupy.ndarray representing underlying data
is_unique: bool indicating if all values are unique
Examples:
# Create from list
idx = cudf.Index([1, 2, 3, 4])
# Create with name
idx = cudf.Index(['a', 'b', 'c'], name='letters')
"""
class RangeIndex(Index):
"""
Memory-efficient index representing a range of integers
Immutable index implementing a monotonic integer range. Optimized for
memory efficiency by storing only start, stop, and step values rather
than materializing the entire range.
Parameters:
start: int, optional (default 0)
Start value of the range
stop: int, optional
Stop value of the range (exclusive)
step: int, optional (default 1)
Step size of the range
name: str, optional
Name for the index
Attributes:
start: int representing range start
stop: int representing range stop
step: int representing range step
Examples:
# Create range index
idx = cudf.RangeIndex(10) # 0 to 9
idx = cudf.RangeIndex(1, 11, 2) # 1, 3, 5, 7, 9
"""
class CategoricalIndex(Index):
"""
Index for categorical data with GPU acceleration
Immutable index for categorical data. Provides memory efficiency for
repeated string or numeric values by storing categories and codes
separately. GPU-accelerated for large categorical datasets.
Parameters:
data: array-like
Categorical data for the index
categories: array-like, optional
Unique categories for the data
ordered: bool, default False
Whether categories have a meaningful order
dtype: CategoricalDtype, optional
Categorical data type
name: str, optional
Name for the index
Attributes:
categories: Index representing unique categories
codes: cupy.ndarray representing category codes
ordered: bool indicating if categories are ordered
Examples:
# Create categorical index
idx = cudf.CategoricalIndex(['red', 'blue', 'red', 'green'])
# With explicit categories
idx = cudf.CategoricalIndex(
['small', 'large', 'medium'],
categories=['small', 'medium', 'large'],
ordered=True
)
"""
class DatetimeIndex(Index):
"""
Index for datetime values with GPU acceleration
Immutable index containing datetime64 values. Provides fast temporal
operations and date-based selection. GPU-accelerated for time series
operations on large datasets.
Parameters:
data: array-like
Datetime-like data for the index
freq: str or DateOffset, optional
Frequency of the datetime data
tz: str or timezone, optional
Timezone for localized datetime index
normalize: bool, default False
Normalize start/end dates to midnight
name: str, optional
Name for the index
Attributes:
freq: str or None representing frequency
tz: timezone or None representing timezone
year: Index representing year values
month: Index representing month values
day: Index representing day values
hour: Index representing hour values
minute: Index representing minute values
second: Index representing second values
Examples:
# Create from date strings
idx = cudf.DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03'])
# With timezone
idx = cudf.DatetimeIndex(
['2023-01-01', '2023-01-02'],
tz='UTC'
)
"""
class TimedeltaIndex(Index):
"""
Index for timedelta values with GPU acceleration
Immutable index containing timedelta64 values. Represents durations
and time differences. GPU-accelerated for temporal arithmetic operations.
Parameters:
data: array-like
Timedelta-like data for the index
unit: str, optional
Unit of the timedelta data ('D', 'h', 'm', 's', etc.)
freq: str or DateOffset, optional
Frequency of the timedelta data
name: str, optional
Name for the index
Attributes:
freq: str or None representing frequency
components: DataFrame with timedelta components
days: Index representing days component
seconds: Index representing seconds component
microseconds: Index representing microseconds component
nanoseconds: Index representing nanoseconds component
Examples:
# Create from timedelta strings
idx = cudf.TimedeltaIndex(['1 day', '2 hours', '30 minutes'])
# From numeric values with unit
idx = cudf.TimedeltaIndex([1, 2, 3], unit='D')
"""
class IntervalIndex(Index):
"""
Index for interval data with GPU acceleration
Immutable index containing Interval objects. Represents closed, open,
or half-open intervals. GPU-accelerated for interval-based operations
and overlapping queries.
Parameters:
data: array-like
Interval-like data for the index
closed: str, default 'right'
Whether intervals are closed ('left', 'right', 'both', 'neither')
dtype: IntervalDtype, optional
Interval data type
name: str, optional
Name for the index
Attributes:
closed: str representing interval closure type
left: Index representing left bounds
right: Index representing right bounds
mid: Index representing interval midpoints
length: Index representing interval lengths
Examples:
# Create from arrays
left = [0, 1, 2]
right = [1, 2, 3]
idx = cudf.IntervalIndex.from_arrays(left, right)
# From tuples
intervals = [(0, 1), (1, 2), (2, 3)]
idx = cudf.IntervalIndex.from_tuples(intervals)
"""
class MultiIndex(Index):
"""
Multi-level/hierarchical index for GPU DataFrames
Multi-level index object. Represents multiple levels of indexing
on a single axis. GPU-accelerated for hierarchical data operations
and multi-dimensional selections.
Parameters:
levels: sequence of arrays
Unique labels for each level
codes: sequence of arrays
Integers for each level indicating label positions
names: sequence of str, optional
Names for each level
Attributes:
levels: list of Index objects representing each level
codes: list of arrays representing level codes
names: list of str representing level names
nlevels: int representing number of levels
Examples:
# Create from arrays
arrays = [
['A', 'A', 'B', 'B'],
[1, 2, 1, 2]
]
idx = cudf.MultiIndex.from_arrays(arrays, names=['letter', 'number'])
# From tuples
tuples = [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
idx = cudf.MultiIndex.from_tuples(tuples)
"""
Extended data type system supporting nested and specialized types.
class CategoricalDtype:
"""
Extension dtype for categorical data
Data type for categorical data with optional ordering. Provides memory
efficiency for repeated values and supports ordered categorical operations.
Parameters:
categories: Index-like, optional
Unique categories for the data
ordered: bool, default False
Whether categories have meaningful order
Attributes:
categories: Index representing unique categories
ordered: bool indicating if categories are ordered
Examples:
# Create categorical dtype
dtype = cudf.CategoricalDtype(['red', 'blue', 'green'])
# With ordering
dtype = cudf.CategoricalDtype(
['small', 'medium', 'large'],
ordered=True
)
"""
class Decimal32Dtype:
"""
32-bit fixed-point decimal data type
Extension dtype for 32-bit decimal numbers with configurable precision
and scale. Provides exact decimal arithmetic without floating-point errors.
Parameters:
precision: int (1-9)
Total number of digits
scale: int (0-precision)
Number of digits after decimal point
Examples:
# Create decimal32 dtype
dtype = cudf.Decimal32Dtype(precision=7, scale=2) # 99999.99 max
"""
class Decimal64Dtype:
"""
64-bit fixed-point decimal data type
Extension dtype for 64-bit decimal numbers with configurable precision
and scale. Provides exact decimal arithmetic for financial calculations.
Parameters:
precision: int (1-18)
Total number of digits
scale: int (0-precision)
Number of digits after decimal point
Examples:
# Create decimal64 dtype
dtype = cudf.Decimal64Dtype(precision=10, scale=4) # 999999.9999 max
"""
class Decimal128Dtype:
"""
128-bit fixed-point decimal data type
Extension dtype for 128-bit decimal numbers with configurable precision
and scale. Provides highest precision decimal arithmetic.
Parameters:
precision: int (1-38)
Total number of digits
scale: int (0-precision)
Number of digits after decimal point
Examples:
# Create decimal128 dtype
dtype = cudf.Decimal128Dtype(precision=20, scale=6)
"""
class ListDtype:
"""
Extension dtype for nested list data
Data type representing lists of elements where each row can contain
a variable-length list. Supports nested operations and list processing
on GPU.
Parameters:
element_type: dtype
Data type of list elements
Attributes:
element_type: dtype representing element data type
Examples:
# Create list dtype
dtype = cudf.ListDtype('int64') # Lists of integers
dtype = cudf.ListDtype('float32') # Lists of floats
"""
class StructDtype:
"""
Extension dtype for nested struct data
Data type representing structured data where each row contains
multiple named fields. Similar to database records or JSON objects.
Parameters:
fields: dict
Mapping of field names to data types
Attributes:
fields: dict representing field name to dtype mapping
Examples:
# Create struct dtype
fields = {'x': 'int64', 'y': 'float64', 'name': 'object'}
dtype = cudf.StructDtype(fields)
"""
class IntervalDtype:
"""
Extension dtype for interval data
Data type for interval objects with configurable closure behavior
and subtype. Used for representing ranges and interval-based operations.
Parameters:
subtype: dtype, optional (default 'float64')
Data type for interval bounds
closed: str, optional (default 'right')
Whether intervals are closed ('left', 'right', 'both', 'neither')
Attributes:
subtype: dtype representing bounds data type
closed: str representing closure behavior
Examples:
# Create interval dtype
dtype = cudf.IntervalDtype('int64', closed='both')
dtype = cudf.IntervalDtype('float32', closed='left')
"""
Constants for representing missing and special values.
NA = cudf.NA
"""
Scalar representation of missing value
cuDF's representation of a missing value that is compatible across
all data types including nested types. Distinct from None and np.nan.
Examples:
# Create Series with missing values
s = cudf.Series([1, cudf.NA, 3])
# Check for missing values
mask = s.isna() # Returns boolean mask
"""
NaT = cudf.NaT
"""
Not-a-Time representation for datetime/timedelta
Pandas-compatible representation of missing datetime or timedelta values.
Used specifically for temporal data types.
Examples:
# Create datetime series with NaT
dates = cudf.Series(['2023-01-01', cudf.NaT, '2023-01-03'])
dates = cudf.to_datetime(dates)
"""
cuDF data structures use the RAPIDS Memory Manager (RMM) to allocate GPU memory, and they convert to and from pandas, Arrow, and CuPy objects:
# GPU to CPU conversion
df_pandas = cudf_df.to_pandas()
series_pandas = cudf_series.to_pandas()
# CPU to GPU conversion
cudf_df = cudf.from_pandas(pandas_df)
cudf_series = cudf.from_pandas(pandas_series)
# Arrow integration
arrow_table = cudf_df.to_arrow()
cudf_df = cudf.DataFrame.from_arrow(arrow_table)
# NumPy/CuPy arrays
cupy_array = cudf_series.values # Get underlying CuPy array
cudf_series = cudf.Series(cupy_array) # Create from CuPy array