# Core Data Structures

cuDF provides GPU-accelerated counterparts of pandas' core data structures, with added support for large datasets and nested data types such as lists and structs. All structures keep their data in GPU memory.

## DataFrame

The primary data structure for two-dimensional, tabular data with labeled axes.

```{ .api }
class DataFrame:
    """
    GPU-accelerated DataFrame with pandas-like API

    Two-dimensional, size-mutable, potentially heterogeneous tabular data structure
    with labeled axes (rows and columns). Stored in GPU memory with columnar layout
    for optimal performance.

    Parameters:
        data: dict, list, ndarray, Series, DataFrame, optional
            Data to initialize DataFrame from various sources
        index: Index or array-like, optional
            Index (row labels) for the DataFrame
        columns: Index or array-like, optional
            Column labels for the DataFrame
        dtype: dtype, optional
            Data type to force, otherwise infer
        copy: bool, default False
            Copy data if True

    Attributes:
        index: Index representing row labels
        columns: Index representing column labels
        dtypes: Series with column data types
        shape: tuple representing DataFrame dimensions
        size: int representing total number of elements
        ndim: int representing number of dimensions (always 2)
        empty: bool indicating if DataFrame is empty

    Examples:
        # Create from dictionary
        df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.1, 6.2]})

        # Create with custom index
        df = cudf.DataFrame(
            {'x': [1, 2], 'y': [3, 4]},
            index=['row1', 'row2']
        )
    """
```

## Series

One-dimensional labeled array capable of holding any supported data type.

```{ .api }
class Series:
    """
    GPU-accelerated one-dimensional array with axis labels

    One-dimensional ndarray-like object containing an array of data and
    associated array of labels, called its index. Optimized for GPU computation
    with automatic memory management.

    Parameters:
        data: array-like, dict, scalar value
            Contains data stored in Series
        index: array-like or Index, optional
            Values must be hashable and same length as data
        dtype: dtype, optional
            Data type for the output Series
        name: str, optional
            Name to give to the Series
        copy: bool, default False
            Copy input data if True

    Attributes:
        index: Index representing the axis labels
        dtype: numpy.dtype representing data type
        shape: tuple representing Series dimensions
        size: int representing number of elements
        ndim: int representing number of dimensions (always 1)
        name: str or None representing Series name
        values: cupy.ndarray representing underlying data

    Examples:
        # Create from list
        s = cudf.Series([1, 2, 3, 4, 5])

        # Create with index and name
        s = cudf.Series([1.1, 2.2, 3.3],
                        index=['a', 'b', 'c'],
                        name='values')
    """
```

## Index Classes

Immutable sequences used for axis labels and data selection.

### Base Index

```{ .api }
class Index:
    """
    Immutable sequence used for axis labels and selection

    Base class for all index types in cuDF. Provides common functionality
    for indexing, selection, and alignment operations. GPU-accelerated for
    large-scale operations.

    Parameters:
        data: array-like (1-D)
            Data to create index from
        dtype: numpy.dtype, optional
            Data type for index
        copy: bool, default False
            Copy input data if True
        name: str, optional
            Name for the index

    Attributes:
        dtype: numpy.dtype representing data type
        shape: tuple representing index dimensions
        size: int representing number of elements
        ndim: int representing number of dimensions (always 1)
        name: str or None representing index name
        values: cupy.ndarray representing underlying data
        is_unique: bool indicating if all values are unique

    Examples:
        # Create from list
        idx = cudf.Index([1, 2, 3, 4])

        # Create with name
        idx = cudf.Index(['a', 'b', 'c'], name='letters')
    """
```

### RangeIndex

```{ .api }
class RangeIndex(Index):
    """
    Memory-efficient index representing a range of integers

    Immutable index implementing a monotonic integer range. Optimized for
    memory efficiency by storing only start, stop, and step values rather
    than materializing the entire range.

    Parameters:
        start: int, optional (default 0)
            Start value of the range
        stop: int, optional
            Stop value of the range (exclusive)
        step: int, optional (default 1)
            Step size of the range
        name: str, optional
            Name for the index

    Attributes:
        start: int representing range start
        stop: int representing range stop
        step: int representing range step

    Examples:
        # Create range index
        idx = cudf.RangeIndex(10)        # 0 to 9
        idx = cudf.RangeIndex(1, 11, 2)  # 1, 3, 5, 7, 9
    """
```
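
To make the "store only start, stop, and step" idea concrete, here is a minimal plain-Python sketch. It is not cuDF's implementation; `range_length` and `range_element` are hypothetical helpers introduced only for illustration. They show that both the length and any individual label can be derived from the three stored integers, without building the full sequence.

```python
def range_length(start: int, stop: int, step: int = 1) -> int:
    """Number of labels in the range, computed from three integers only."""
    if step == 0:
        raise ValueError("step must not be zero")
    if step > 0:
        return max(0, (stop - start + step - 1) // step)
    # Negative step: count downwards from start towards stop (exclusive)
    return max(0, (start - stop - step - 1) // (-step))


def range_element(start: int, step: int, i: int) -> int:
    """The i-th label, computed on demand instead of being stored."""
    return start + i * step


# Mirrors the RangeIndex(1, 11, 2) example: five labels, none materialized up front
n = range_length(1, 11, 2)
labels = [range_element(1, 2, i) for i in range(n)]
```

Python's built-in `range` object works the same way, which is why `len(range(10**12))` is instant.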

### CategoricalIndex

```{ .api }
class CategoricalIndex(Index):
    """
    Index for categorical data with GPU acceleration

    Immutable index for categorical data. Provides memory efficiency for
    repeated string or numeric values by storing categories and codes
    separately. GPU-accelerated for large categorical datasets.

    Parameters:
        data: array-like
            Categorical data for the index
        categories: array-like, optional
            Unique categories for the data
        ordered: bool, default False
            Whether categories have a meaningful order
        dtype: CategoricalDtype, optional
            Categorical data type
        name: str, optional
            Name for the index

    Attributes:
        categories: Index representing unique categories
        codes: cupy.ndarray representing category codes
        ordered: bool indicating if categories are ordered

    Examples:
        # Create categorical index
        idx = cudf.CategoricalIndex(['red', 'blue', 'red', 'green'])

        # With explicit categories
        idx = cudf.CategoricalIndex(
            ['small', 'large', 'medium'],
            categories=['small', 'medium', 'large'],
            ordered=True
        )
    """
```
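
The split into categories and codes can be sketched in plain Python. This is an illustration of the encoding idea only, not cuDF's GPU implementation; `encode_categorical` is a hypothetical helper, and the choices of sorting inferred categories and using `-1` for values outside the category list are assumptions that follow the pandas convention.

```python
def encode_categorical(values, categories=None):
    """Split values into (categories, codes).

    Each code is the position of the value in `categories`.
    Values that are not in `categories` receive the code -1.
    """
    if categories is None:
        # Infer the categories as the sorted unique values
        categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    codes = [position.get(v, -1) for v in values]
    return list(categories), codes


# Repeated strings are stored once; each row only holds a small integer
cats, codes = encode_categorical(['red', 'blue', 'red', 'green'])

# Explicit categories fix the code assignment (and the order, if ordered)
size_cats, size_codes = encode_categorical(
    ['small', 'large', 'medium'],
    categories=['small', 'medium', 'large'],
)
```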

### DatetimeIndex

```{ .api }
class DatetimeIndex(Index):
    """
    Index for datetime values with GPU acceleration

    Immutable index containing datetime64 values. Provides fast temporal
    operations and date-based selection. GPU-accelerated for time series
    operations on large datasets.

    Parameters:
        data: array-like
            Datetime-like data for the index
        freq: str or DateOffset, optional
            Frequency of the datetime data
        tz: str or timezone, optional
            Timezone for localized datetime index
        normalize: bool, default False
            Normalize start/end dates to midnight
        name: str, optional
            Name for the index

    Attributes:
        freq: str or None representing frequency
        tz: timezone or None representing timezone
        year: Series representing year values
        month: Series representing month values
        day: Series representing day values
        hour: Series representing hour values
        minute: Series representing minute values
        second: Series representing second values

    Examples:
        # Create from date strings
        idx = cudf.DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03'])

        # With timezone
        idx = cudf.DatetimeIndex(
            ['2023-01-01', '2023-01-02'],
            tz='UTC'
        )
    """
```

### TimedeltaIndex

```{ .api }
class TimedeltaIndex(Index):
    """
    Index for timedelta values with GPU acceleration

    Immutable index containing timedelta64 values. Represents durations
    and time differences. GPU-accelerated for temporal arithmetic operations.

    Parameters:
        data: array-like
            Timedelta-like data for the index
        unit: str, optional
            Unit of the timedelta data ('D', 'h', 'm', 's', etc.)
        freq: str or DateOffset, optional
            Frequency of the timedelta data
        name: str, optional
            Name for the index

    Attributes:
        freq: str or None representing frequency
        components: DataFrame with timedelta components
        days: Series representing days component
        seconds: Series representing seconds component
        microseconds: Series representing microseconds component
        nanoseconds: Series representing nanoseconds component

    Examples:
        # Create from timedelta strings
        idx = cudf.TimedeltaIndex(['1 day', '2 hours', '30 minutes'])

        # From numeric values with unit
        idx = cudf.TimedeltaIndex([1, 2, 3], unit='D')
    """
```

### IntervalIndex

```{ .api }
class IntervalIndex(Index):
    """
    Index for interval data with GPU acceleration

    Immutable index containing Interval objects. Represents closed, open,
    or half-open intervals. GPU-accelerated for interval-based operations
    and overlapping queries.

    Parameters:
        data: array-like
            Interval-like data for the index
        closed: str, default 'right'
            Whether intervals are closed ('left', 'right', 'both', 'neither')
        dtype: IntervalDtype, optional
            Interval data type
        name: str, optional
            Name for the index

    Attributes:
        closed: str representing interval closure type
        left: Index representing left bounds
        right: Index representing right bounds
        mid: Index representing interval midpoints
        length: Index representing interval lengths

    Examples:
        # Create from arrays
        left = [0, 1, 2]
        right = [1, 2, 3]
        idx = cudf.IntervalIndex.from_arrays(left, right)

        # From tuples
        intervals = [(0, 1), (1, 2), (2, 3)]
        idx = cudf.IntervalIndex.from_tuples(intervals)
    """
```
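
The four `closed` options differ only in whether each endpoint belongs to the interval. A small plain-Python sketch of that rule is given here, assuming the usual mathematical meaning of the options; `interval_contains` is a hypothetical helper and not part of the cuDF API.

```python
def interval_contains(left, right, x, closed="right"):
    """Return True if x lies in the interval under the given closure."""
    if closed not in {"left", "right", "both", "neither"}:
        raise ValueError(f"invalid closed value: {closed!r}")
    # The left endpoint is included for 'left' and 'both'
    lower_ok = left <= x if closed in ("left", "both") else left < x
    # The right endpoint is included for 'right' and 'both'
    upper_ok = x <= right if closed in ("right", "both") else x < right
    return lower_ok and upper_ok


# With the default closed='right', (0, 1] contains 1 but not 0
in_right = interval_contains(0, 1, 1)
in_left = interval_contains(0, 1, 0)
```

With the default `closed='right'`, adjacent intervals such as (0, 1] and (1, 2] do not overlap, so every point falls into exactly one of them.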

### MultiIndex

```{ .api }
class MultiIndex(Index):
    """
    Multi-level/hierarchical index for GPU DataFrames

    Multi-level index object. Represents multiple levels of indexing
    on a single axis. GPU-accelerated for hierarchical data operations
    and multi-dimensional selections.

    Parameters:
        levels: sequence of arrays
            Unique labels for each level
        codes: sequence of arrays
            Integers for each level indicating label positions
        names: sequence of str, optional
            Names for each level

    Attributes:
        levels: list of Index objects representing each level
        codes: list of arrays representing level codes
        names: list of str representing level names
        nlevels: int representing number of levels

    Examples:
        # Create from arrays
        arrays = [
            ['A', 'A', 'B', 'B'],
            [1, 2, 1, 2]
        ]
        idx = cudf.MultiIndex.from_arrays(arrays, names=['letter', 'number'])

        # From tuples
        tuples = [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
        idx = cudf.MultiIndex.from_tuples(tuples)
    """
```
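
The relationship between `levels` and `codes` can be illustrated with a short plain-Python sketch. It is a conceptual model only, not how cuDF builds a `MultiIndex` on the GPU; `build_levels_and_codes` is a hypothetical helper, and sorting the unique labels of each level is an assumption borrowed from pandas' behavior.

```python
def build_levels_and_codes(tuples):
    """Derive per-level unique labels and integer codes from row tuples."""
    levels, codes = [], []
    # zip(*tuples) regroups the data level by level
    for column in zip(*tuples):
        level = sorted(set(column))
        position = {label: i for i, label in enumerate(level)}
        levels.append(level)
        codes.append([position[label] for label in column])
    return levels, codes


tuples = [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
levels, codes = build_levels_and_codes(tuples)

# Row i is recovered by looking up each level's label through its code
row_2 = tuple(level[code[2]] for level, code in zip(levels, codes))
```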

## Data Types

Extended data type system supporting nested and specialized types.

### CategoricalDtype

```{ .api }
class CategoricalDtype:
    """
    Extension dtype for categorical data

    Data type for categorical data with optional ordering. Provides memory
    efficiency for repeated values and supports ordered categorical operations.

    Parameters:
        categories: Index-like, optional
            Unique categories for the data
        ordered: bool, default False
            Whether categories have meaningful order

    Attributes:
        categories: Index representing unique categories
        ordered: bool indicating if categories are ordered

    Examples:
        # Create categorical dtype
        dtype = cudf.CategoricalDtype(['red', 'blue', 'green'])

        # With ordering
        dtype = cudf.CategoricalDtype(
            ['small', 'medium', 'large'],
            ordered=True
        )
    """
```

### Decimal Data Types

```{ .api }
class Decimal32Dtype:
    """
    32-bit fixed-point decimal data type

    Extension dtype for 32-bit decimal numbers with configurable precision
    and scale. Provides exact decimal arithmetic without floating-point errors.

    Parameters:
        precision: int (1-9)
            Total number of digits
        scale: int (0-precision)
            Number of digits after decimal point

    Examples:
        # Create decimal32 dtype
        dtype = cudf.Decimal32Dtype(precision=7, scale=2)  # 99999.99 max
    """

class Decimal64Dtype:
    """
    64-bit fixed-point decimal data type

    Extension dtype for 64-bit decimal numbers with configurable precision
    and scale. Provides exact decimal arithmetic for financial calculations.

    Parameters:
        precision: int (1-18)
            Total number of digits
        scale: int (0-precision)
            Number of digits after decimal point

    Examples:
        # Create decimal64 dtype
        dtype = cudf.Decimal64Dtype(precision=10, scale=4)  # 999999.9999 max
    """

class Decimal128Dtype:
    """
    128-bit fixed-point decimal data type

    Extension dtype for 128-bit decimal numbers with configurable precision
    and scale. Provides highest precision decimal arithmetic.

    Parameters:
        precision: int (1-38)
            Total number of digits
        scale: int (0-precision)
            Number of digits after decimal point

    Examples:
        # Create decimal128 dtype
        dtype = cudf.Decimal128Dtype(precision=20, scale=6)
    """
```
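
The "max" comments in the examples above follow directly from precision and scale: the largest value has `precision` nines, of which `scale` sit after the decimal point. The sketch below reproduces that arithmetic with Python's standard `decimal` module. It does not call cuDF; `max_decimal` and `smallest_decimal_width` are hypothetical helpers, and the width limits are taken from the precision ranges documented above.

```python
from decimal import Decimal


def max_decimal(precision: int, scale: int) -> Decimal:
    """Largest value representable with the given precision and scale."""
    if precision < 1 or not 0 <= scale <= precision:
        raise ValueError("require precision >= 1 and 0 <= scale <= precision")
    # Build the number exactly from its digits: `precision` nines,
    # shifted right by `scale` decimal places
    return Decimal((0, (9,) * precision, -scale))


def smallest_decimal_width(precision: int) -> int:
    """Narrowest bit width whose documented precision range fits."""
    for bits, max_precision in ((32, 9), (64, 18), (128, 38)):
        if 1 <= precision <= max_precision:
            return bits
    raise ValueError("precision must be between 1 and 38")


largest_32 = max_decimal(7, 2)    # matches the Decimal32Dtype example
largest_64 = max_decimal(10, 4)   # matches the Decimal64Dtype example
```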

### Nested Data Types

```{ .api }
class ListDtype:
    """
    Extension dtype for nested list data

    Data type representing lists of elements where each row can contain
    a variable-length list. Supports nested operations and list processing
    on GPU.

    Parameters:
        element_type: dtype
            Data type of list elements

    Attributes:
        element_type: dtype representing element data type

    Examples:
        # Create list dtype
        dtype = cudf.ListDtype('int64')    # Lists of integers
        dtype = cudf.ListDtype('float32')  # Lists of floats
    """

class StructDtype:
    """
    Extension dtype for nested struct data

    Data type representing structured data where each row contains
    multiple named fields. Similar to database records or JSON objects.

    Parameters:
        fields: dict
            Mapping of field names to data types

    Attributes:
        fields: dict representing field name to dtype mapping

    Examples:
        # Create struct dtype
        fields = {'x': 'int64', 'y': 'float64', 'name': 'object'}
        dtype = cudf.StructDtype(fields)
    """
```
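
In the Apache Arrow columnar format, a column of variable-length lists is typically laid out as one flat array of child values plus an array of offsets marking where each row starts and ends. The plain-Python sketch below illustrates that layout only; whether cuDF's internal buffers match it exactly is an assumption here, and `to_offsets` and `get_row` are hypothetical helpers.

```python
def to_offsets(rows):
    """Flatten a list of lists into (values, offsets).

    Row i occupies values[offsets[i]:offsets[i + 1]], so the offsets
    array always has one more entry than there are rows.
    """
    values, offsets = [], [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return values, offsets


def get_row(values, offsets, i):
    """Recover row i as a slice of the flat values array."""
    return values[offsets[i]:offsets[i + 1]]


# Three rows of different lengths, including an empty one
values, offsets = to_offsets([[1, 2], [], [3, 4, 5]])
last_row = get_row(values, offsets, 2)
```

Because every row is a contiguous slice, rows of different lengths need no padding.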

### IntervalDtype

```{ .api }
class IntervalDtype:
    """
    Extension dtype for interval data

    Data type for interval objects with configurable closure behavior
    and subtype. Used for representing ranges and interval-based operations.

    Parameters:
        subtype: dtype, optional (default 'float64')
            Data type for interval bounds
        closed: str, optional (default 'right')
            Whether intervals are closed ('left', 'right', 'both', 'neither')

    Attributes:
        subtype: dtype representing bounds data type
        closed: str representing closure behavior

    Examples:
        # Create interval dtype
        dtype = cudf.IntervalDtype('int64', closed='both')
        dtype = cudf.IntervalDtype('float32', closed='left')
    """
```

## Special Values

Constants for representing missing and special values.

```{ .api }
NA = cudf.NA
"""
Scalar representation of missing value

cuDF's representation of a missing value that is compatible across
all data types including nested types. Distinct from None and np.nan.

Examples:
    # Create Series with missing values
    s = cudf.Series([1, cudf.NA, 3])

    # Check for missing values
    mask = s.isna()  # Returns boolean mask
"""

NaT = cudf.NaT
"""
Not-a-Time representation for datetime/timedelta

Pandas-compatible representation of missing datetime or timedelta values.
Used specifically for temporal data types.

Examples:
    # Create datetime series with NaT
    dates = cudf.Series(['2023-01-01', cudf.NaT, '2023-01-03'])
    dates = cudf.to_datetime(dates)
"""
```

## Memory Management

cuDF data structures leverage RAPIDS Memory Manager (RMM) for optimal GPU memory usage:

- **Columnar Storage**: Apache Arrow format for cache efficiency
- **Memory Pools**: Reduces allocation overhead for frequent operations
- **Zero-Copy**: Minimal data movement between operations
- **Automatic Cleanup**: Garbage collection integration for GPU memory
- **Memory Mapping**: Support for memory-mapped files

## Type Conversions

```python
import cudf
import pandas as pd

# Sample CPU objects to convert
pandas_df = pd.DataFrame({'a': [1, 2, 3]})
pandas_series = pd.Series([1.0, 2.0, 3.0])

# CPU to GPU conversion
cudf_df = cudf.from_pandas(pandas_df)
cudf_series = cudf.from_pandas(pandas_series)

# GPU to CPU conversion
df_pandas = cudf_df.to_pandas()
series_pandas = cudf_series.to_pandas()

# Arrow integration (from_arrow is a DataFrame classmethod)
arrow_table = cudf_df.to_arrow()
cudf_df = cudf.DataFrame.from_arrow(arrow_table)

# NumPy/CuPy arrays
cupy_array = cudf_series.values  # Get underlying CuPy array
cudf_series = cudf.Series(cupy_array)  # Create from CuPy array
```

## Performance Characteristics

- **Memory Bandwidth**: 10-100x improvement over pandas for large datasets
- **Parallel Operations**: Leverages thousands of GPU cores
- **Cache Efficiency**: Columnar layout optimizes memory access patterns
- **Kernel Fusion**: Combines multiple operations into single GPU kernels
- **Lazy Evaluation**: Defers computation until results are needed