
# Core Data Structures

cuDF provides GPU-accelerated versions of pandas' core data structures with enhanced capabilities for handling large datasets and complex data types. All structures leverage GPU memory for optimal performance.

## DataFrame

The primary data structure for two-dimensional, tabular data with labeled axes.

```{ .api }
class DataFrame:
    """
    GPU-accelerated DataFrame with pandas-like API

    Two-dimensional, size-mutable, potentially heterogeneous tabular data structure
    with labeled axes (rows and columns). Stored in GPU memory with columnar layout
    for optimal performance.

    Parameters:
        data: dict, list, ndarray, Series, DataFrame, optional
            Data to initialize DataFrame from various sources
        index: Index or array-like, optional
            Index (row labels) for the DataFrame
        columns: Index or array-like, optional
            Column labels for the DataFrame
        dtype: dtype, optional
            Data type to force, otherwise infer
        copy: bool, default False
            Copy data if True

    Attributes:
        index: Index representing row labels
        columns: Index representing column labels
        dtypes: Series with column data types
        shape: tuple representing DataFrame dimensions
        size: int representing total number of elements
        ndim: int representing number of dimensions (always 2)
        empty: bool indicating if DataFrame is empty

    Examples:
        # Create from dictionary
        df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.1, 6.2]})

        # Create with custom index
        df = cudf.DataFrame(
            {'x': [1, 2], 'y': [3, 4]},
            index=['row1', 'row2']
        )
    """
```
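
The docstring above says a DataFrame is stored in a columnar layout. As a rough, CPU-only illustration of what that means, the following plain-Python sketch (it does not use cuDF, and the helper name `rows_to_columns` is hypothetical) regroups row records into one list per column, which is the same shape as the dictionary passed to the constructor in the example above. It assumes every row carries the same keys.

```python
def rows_to_columns(rows):
    """Convert a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            # Values of one column end up next to each other in one list.
            columns.setdefault(name, []).append(value)
    return columns


rows = [{"A": 1, "B": 4.0}, {"A": 2, "B": 5.1}, {"A": 3, "B": 6.2}]
columns = rows_to_columns(rows)
print(columns)
```

Keeping each column contiguous is what lets a whole column be processed in one pass, which is the property the columnar layout is chosen for.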

## Series

One-dimensional labeled array capable of holding any data type.

```{ .api }
class Series:
    """
    GPU-accelerated one-dimensional array with axis labels

    One-dimensional ndarray-like object containing an array of data and
    associated array of labels, called its index. Optimized for GPU computation
    with automatic memory management.

    Parameters:
        data: array-like, dict, scalar value
            Contains data stored in Series
        index: array-like or Index, optional
            Values must be hashable and same length as data
        dtype: dtype, optional
            Data type for the output Series
        name: str, optional
            Name to give to the Series
        copy: bool, default False
            Copy input data if True

    Attributes:
        index: Index representing the axis labels
        dtype: numpy.dtype representing data type
        shape: tuple representing Series dimensions
        size: int representing number of elements
        ndim: int representing number of dimensions (always 1)
        name: str or None representing Series name
        values: cupy.ndarray representing underlying data

    Examples:
        # Create from list
        s = cudf.Series([1, 2, 3, 4, 5])

        # Create with index and name
        s = cudf.Series([1.1, 2.2, 3.3],
                        index=['a', 'b', 'c'],
                        name='values')
    """
```

## Index Classes

Immutable sequences used for axis labels and data selection.

### Base Index

```{ .api }
class Index:
    """
    Immutable sequence used for axis labels and selection

    Base class for all index types in cuDF. Provides common functionality
    for indexing, selection, and alignment operations. GPU-accelerated for
    large-scale operations.

    Parameters:
        data: array-like (1-D)
            Data to create index from
        dtype: numpy.dtype, optional
            Data type for index
        copy: bool, default False
            Copy input data if True
        name: str, optional
            Name for the index

    Attributes:
        dtype: numpy.dtype representing data type
        shape: tuple representing index dimensions
        size: int representing number of elements
        ndim: int representing number of dimensions (always 1)
        name: str or None representing index name
        values: cupy.ndarray representing underlying data
        is_unique: bool indicating if all values are unique

    Examples:
        # Create from list
        idx = cudf.Index([1, 2, 3, 4])

        # Create with name
        idx = cudf.Index(['a', 'b', 'c'], name='letters')
    """
```

### RangeIndex

```{ .api }
class RangeIndex(Index):
    """
    Memory-efficient index representing a range of integers

    Immutable index implementing a monotonic integer range. Optimized for
    memory efficiency by storing only start, stop, and step values rather
    than materializing the entire range.

    Parameters:
        start: int, optional (default 0)
            Start value of the range
        stop: int, optional
            Stop value of the range (exclusive)
        step: int, optional (default 1)
            Step size of the range
        name: str, optional
            Name for the index

    Attributes:
        start: int representing range start
        stop: int representing range stop
        step: int representing range step

    Examples:
        # Create range index
        idx = cudf.RangeIndex(10)  # 0 to 9
        idx = cudf.RangeIndex(1, 11, 2)  # 1, 3, 5, 7, 9
    """
```
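
The idea of storing only `start`, `stop`, and `step` can be seen with Python's built-in `range`, which works the same way. This is a plain-Python analogy rather than cuDF code, and the fixed object size it shows is a property of CPython's `range`, not a measurement of `RangeIndex`.

```python
import sys

# A built-in range keeps only start, stop, and step, whatever its length.
small = range(10)
large = range(0, 10_000_000, 1)
same_size = sys.getsizeof(small) == sys.getsizeof(large)

# Elements are computed on demand as start + i * step.
odd = range(1, 11, 2)
values = [odd.start + i * odd.step for i in range(len(odd))]

print(same_size, values)
```

The second part reproduces the `1, 3, 5, 7, 9` sequence from the `RangeIndex(1, 11, 2)` example without ever storing the values up front.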

### CategoricalIndex

```{ .api }
class CategoricalIndex(Index):
    """
    Index for categorical data with GPU acceleration

    Immutable index for categorical data. Provides memory efficiency for
    repeated string or numeric values by storing categories and codes
    separately. GPU-accelerated for large categorical datasets.

    Parameters:
        data: array-like
            Categorical data for the index
        categories: array-like, optional
            Unique categories for the data
        ordered: bool, default False
            Whether categories have a meaningful order
        dtype: CategoricalDtype, optional
            Categorical data type
        name: str, optional
            Name for the index

    Attributes:
        categories: Index representing unique categories
        codes: cupy.ndarray representing category codes
        ordered: bool indicating if categories are ordered

    Examples:
        # Create categorical index
        idx = cudf.CategoricalIndex(['red', 'blue', 'red', 'green'])

        # With explicit categories
        idx = cudf.CategoricalIndex(
            ['small', 'large', 'medium'],
            categories=['small', 'medium', 'large'],
            ordered=True
        )
    """
```
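
To make the "categories and codes stored separately" idea concrete, here is a minimal plain-Python sketch of that encoding. It is not cuDF's implementation: the function name `encode_categories` is hypothetical, and two behaviours are assumptions made for the illustration, namely sorting the unique values when no categories are given and using `-1` for values outside the category list.

```python
def encode_categories(data, categories=None):
    """Split values into (categories, codes)."""
    if categories is None:
        # Assumption: sort the unique values for a deterministic order.
        categories = sorted(set(data))
    position = {category: i for i, category in enumerate(categories)}
    # Assumption: -1 marks a value that is not one of the categories.
    codes = [position.get(value, -1) for value in data]
    return list(categories), codes


inferred_categories, inferred_codes = encode_categories(
    ["red", "blue", "red", "green"]
)
explicit_categories, explicit_codes = encode_categories(
    ["small", "large", "medium"],
    categories=["small", "medium", "large"],
)
print(inferred_categories, inferred_codes)
print(explicit_categories, explicit_codes)
```

Each repeated string is stored once in the category list, and the data itself shrinks to small integers. With an explicit category order, comparing codes also compares the values in that order, which is what `ordered=True` relies on.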

### DatetimeIndex

```{ .api }
class DatetimeIndex(Index):
    """
    Index for datetime values with GPU acceleration

    Immutable index containing datetime64 values. Provides fast temporal
    operations and date-based selection. GPU-accelerated for time series
    operations on large datasets.

    Parameters:
        data: array-like
            Datetime-like data for the index
        freq: str or DateOffset, optional
            Frequency of the datetime data
        tz: str or timezone, optional
            Timezone for localized datetime index
        normalize: bool, default False
            Normalize start/end dates to midnight
        name: str, optional
            Name for the index

    Attributes:
        freq: str or None representing frequency
        tz: timezone or None representing timezone
        year: Series representing year values
        month: Series representing month values
        day: Series representing day values
        hour: Series representing hour values
        minute: Series representing minute values
        second: Series representing second values

    Examples:
        # Create from date strings
        idx = cudf.DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03'])

        # With timezone
        idx = cudf.DatetimeIndex(
            ['2023-01-01', '2023-01-02'],
            tz='UTC'
        )
    """
```

### TimedeltaIndex

```{ .api }
class TimedeltaIndex(Index):
    """
    Index for timedelta values with GPU acceleration

    Immutable index containing timedelta64 values. Represents durations
    and time differences. GPU-accelerated for temporal arithmetic operations.

    Parameters:
        data: array-like
            Timedelta-like data for the index
        unit: str, optional
            Unit of the timedelta data ('D', 'h', 'm', 's', etc.)
        freq: str or DateOffset, optional
            Frequency of the timedelta data
        name: str, optional
            Name for the index

    Attributes:
        freq: str or None representing frequency
        components: DataFrame with timedelta components
        days: Series representing days component
        seconds: Series representing seconds component
        microseconds: Series representing microseconds component
        nanoseconds: Series representing nanoseconds component

    Examples:
        # Create from timedelta strings
        idx = cudf.TimedeltaIndex(['1 day', '2 hours', '30 minutes'])

        # From numeric values with unit
        idx = cudf.TimedeltaIndex([1, 2, 3], unit='D')
    """
```

### IntervalIndex

```{ .api }
class IntervalIndex(Index):
    """
    Index for interval data with GPU acceleration

    Immutable index containing Interval objects. Represents closed, open,
    or half-open intervals. GPU-accelerated for interval-based operations
    and overlapping queries.

    Parameters:
        data: array-like
            Interval-like data for the index
        closed: str, default 'right'
            Whether intervals are closed ('left', 'right', 'both', 'neither')
        dtype: IntervalDtype, optional
            Interval data type
        name: str, optional
            Name for the index

    Attributes:
        closed: str representing interval closure type
        left: Index representing left bounds
        right: Index representing right bounds
        mid: Index representing interval midpoints
        length: Index representing interval lengths

    Examples:
        # Create from arrays
        left = [0, 1, 2]
        right = [1, 2, 3]
        idx = cudf.IntervalIndex.from_arrays(left, right)

        # From tuples
        intervals = [(0, 1), (1, 2), (2, 3)]
        idx = cudf.IntervalIndex.from_tuples(intervals)
    """
```
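
The four `closed` options decide which endpoints belong to an interval. The plain-Python sketch below spells out that rule and derives the `mid` and `length` values for the bounds used in the example above. It does not call cuDF, and the helper name `interval_contains` is hypothetical.

```python
def interval_contains(left, right, value, closed="right"):
    """Return True if value lies in the interval under the given closure."""
    if closed not in ("left", "right", "both", "neither"):
        raise ValueError(f"invalid closed value: {closed!r}")
    # The left endpoint is included only for 'left' and 'both'.
    lower_ok = left <= value if closed in ("left", "both") else left < value
    # The right endpoint is included only for 'right' and 'both'.
    upper_ok = value <= right if closed in ("right", "both") else value < right
    return lower_ok and upper_ok


lefts = [0, 1, 2]
rights = [1, 2, 3]
mids = [(lo + hi) / 2 for lo, hi in zip(lefts, rights)]
lengths = [hi - lo for lo, hi in zip(lefts, rights)]

print(interval_contains(0, 1, 1))                 # default closed='right'
print(interval_contains(0, 1, 1, closed="left"))
print(mids, lengths)
```

With the default `closed='right'`, the value `1` falls in `(0, 1]` but not in `(1, 2]`, so adjacent intervals such as those in the example never both claim a shared boundary.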

### MultiIndex

```{ .api }
class MultiIndex(Index):
    """
    Multi-level/hierarchical index for GPU DataFrames

    Multi-level index object. Represents multiple levels of indexing
    on a single axis. GPU-accelerated for hierarchical data operations
    and multi-dimensional selections.

    Parameters:
        levels: sequence of arrays
            Unique labels for each level
        codes: sequence of arrays
            Integers for each level indicating label positions
        names: sequence of str, optional
            Names for each level

    Attributes:
        levels: list of Index objects representing each level
        codes: list of arrays representing level codes
        names: list of str representing level names
        nlevels: int representing number of levels

    Examples:
        # Create from arrays
        arrays = [
            ['A', 'A', 'B', 'B'],
            [1, 2, 1, 2]
        ]
        idx = cudf.MultiIndex.from_arrays(arrays, names=['letter', 'number'])

        # From tuples
        tuples = [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
        idx = cudf.MultiIndex.from_tuples(tuples)
    """
```
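
The relationship between `levels` and `codes` is easiest to see by deriving both from the tuples in the example above. The sketch below is plain Python rather than cuDF, the helper name `factorize_levels` is hypothetical, and sorting the unique labels of each level is an assumption made for the illustration.

```python
def factorize_levels(tuples):
    """Build (levels, codes) from a list of equal-length tuples."""
    columns = list(zip(*tuples))  # one sequence of labels per level
    levels, codes = [], []
    for column in columns:
        level = sorted(set(column))  # unique labels for this level
        position = {label: i for i, label in enumerate(level)}
        levels.append(level)
        # Each code is the position of the row's label within its level.
        codes.append([position[label] for label in column])
    return levels, codes


tuples = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]
levels, codes = factorize_levels(tuples)

# Looking the codes back up in the levels recovers the original tuples.
rebuilt = [
    tuple(levels[k][codes[k][i]] for k in range(len(levels)))
    for i in range(len(tuples))
]
print(levels, codes)
```

Each level stores its labels once, and every row is described by one integer per level, so the full set of tuples can always be rebuilt from the two structures.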

## Data Types

Extended data type system supporting nested and specialized types.

### CategoricalDtype

```{ .api }
class CategoricalDtype:
    """
    Extension dtype for categorical data

    Data type for categorical data with optional ordering. Provides memory
    efficiency for repeated values and supports ordered categorical operations.

    Parameters:
        categories: Index-like, optional
            Unique categories for the data
        ordered: bool, default False
            Whether categories have meaningful order

    Attributes:
        categories: Index representing unique categories
        ordered: bool indicating if categories are ordered

    Examples:
        # Create categorical dtype
        dtype = cudf.CategoricalDtype(['red', 'blue', 'green'])

        # With ordering
        dtype = cudf.CategoricalDtype(
            ['small', 'medium', 'large'],
            ordered=True
        )
    """
```

### Decimal Data Types

```{ .api }
class Decimal32Dtype:
    """
    32-bit fixed-point decimal data type

    Extension dtype for 32-bit decimal numbers with configurable precision
    and scale. Provides exact decimal arithmetic without floating-point errors.

    Parameters:
        precision: int (1-9)
            Total number of digits
        scale: int (0-precision)
            Number of digits after decimal point

    Examples:
        # Create decimal32 dtype
        dtype = cudf.Decimal32Dtype(precision=7, scale=2)  # 99999.99 max
    """

class Decimal64Dtype:
    """
    64-bit fixed-point decimal data type

    Extension dtype for 64-bit decimal numbers with configurable precision
    and scale. Provides exact decimal arithmetic for financial calculations.

    Parameters:
        precision: int (1-18)
            Total number of digits
        scale: int (0-precision)
            Number of digits after decimal point

    Examples:
        # Create decimal64 dtype
        dtype = cudf.Decimal64Dtype(precision=10, scale=4)  # 999999.9999 max
    """

class Decimal128Dtype:
    """
    128-bit fixed-point decimal data type

    Extension dtype for 128-bit decimal numbers with configurable precision
    and scale. Provides highest precision decimal arithmetic.

    Parameters:
        precision: int (1-38)
            Total number of digits
        scale: int (0-precision)
            Number of digits after decimal point

    Examples:
        # Create decimal128 dtype
        dtype = cudf.Decimal128Dtype(precision=20, scale=6)
    """
```
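
The "max" comments in the examples above follow directly from the definitions of `precision` and `scale`: the largest value has `precision` nines with the decimal point placed `scale` digits from the right. The sketch below checks that arithmetic with Python's standard `decimal` module. It does not use cuDF, and the helper name `max_decimal_value` is hypothetical.

```python
from decimal import Decimal


def max_decimal_value(precision, scale):
    """Largest value with `precision` total digits, `scale` of them fractional."""
    if precision < 1 or not 0 <= scale <= precision:
        raise ValueError("need precision >= 1 and 0 <= scale <= precision")
    # Build the value digit by digit so no rounding context is involved.
    return Decimal((0, (9,) * precision, -scale))


max_7_2 = max_decimal_value(7, 2)
max_10_4 = max_decimal_value(10, 4)
print(max_7_2, max_10_4)

# Binary floats cannot represent 0.1 exactly, whereas decimals can.
float_exact = (0.1 + 0.2) == 0.3
decimal_exact = (Decimal("0.1") + Decimal("0.2")) == Decimal("0.3")
print(float_exact, decimal_exact)
```

The last two lines illustrate the "exact decimal arithmetic" claim: the float comparison fails because of binary rounding, while the decimal comparison holds.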

### Nested Data Types

```{ .api }
class ListDtype:
    """
    Extension dtype for nested list data

    Data type representing lists of elements where each row can contain
    a variable-length list. Supports nested operations and list processing
    on GPU.

    Parameters:
        element_type: dtype
            Data type of list elements

    Attributes:
        element_type: dtype representing element data type

    Examples:
        # Create list dtype
        dtype = cudf.ListDtype('int64')  # Lists of integers
        dtype = cudf.ListDtype('float32')  # Lists of floats
    """

class StructDtype:
    """
    Extension dtype for nested struct data

    Data type representing structured data where each row contains
    multiple named fields. Similar to database records or JSON objects.

    Parameters:
        fields: dict
            Mapping of field names to data types

    Attributes:
        fields: dict representing field name to dtype mapping

    Examples:
        # Create struct dtype
        fields = {'x': 'int64', 'y': 'float64', 'name': 'object'}
        dtype = cudf.StructDtype(fields)
    """
```
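
One way to picture how variable-length lists fit into a columnar format is the Arrow-style layout of a flat values array plus an offsets array. The plain-Python sketch below assumes that layout for illustration, based on the Apache Arrow format this document mentions under Memory Management. It does not inspect cuDF's actual buffers, and both helper names are hypothetical.

```python
def to_offsets_and_values(rows):
    """Flatten a list of lists into (offsets, values)."""
    offsets = [0]
    values = []
    for row in rows:
        values.extend(row)
        offsets.append(len(values))  # end position of this row
    return offsets, values


def get_row(offsets, values, i):
    """Slice row i back out of the flat values array."""
    return values[offsets[i]:offsets[i + 1]]


rows = [[1, 2, 3], [], [4, 5]]
offsets, values = to_offsets_and_values(rows)
print(offsets, values)
print(get_row(offsets, values, 2))
```

An empty row costs one repeated offset rather than any padding, so rows of very different lengths can share a single contiguous values array.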

### IntervalDtype

```{ .api }
class IntervalDtype:
    """
    Extension dtype for interval data

    Data type for interval objects with configurable closure behavior
    and subtype. Used for representing ranges and interval-based operations.

    Parameters:
        subtype: dtype, optional (default 'float64')
            Data type for interval bounds
        closed: str, optional (default 'right')
            Whether intervals are closed ('left', 'right', 'both', 'neither')

    Attributes:
        subtype: dtype representing bounds data type
        closed: str representing closure behavior

    Examples:
        # Create interval dtype
        dtype = cudf.IntervalDtype('int64', closed='both')
        dtype = cudf.IntervalDtype('float32', closed='left')
    """
```

## Special Values

Constants for representing missing and special values.

```{ .api }
NA = cudf.NA
"""
Scalar representation of missing value

cuDF's representation of a missing value that is compatible across
all data types including nested types. Distinct from None and np.nan.

Examples:
    # Create Series with missing values
    s = cudf.Series([1, cudf.NA, 3])

    # Check for missing values
    mask = s.isna()  # Returns boolean mask
"""

NaT = cudf.NaT
"""
Not-a-Time representation for datetime/timedelta

Pandas-compatible representation of missing datetime or timedelta values.
Used specifically for temporal data types.

Examples:
    # Create datetime series with NaT
    dates = cudf.Series(['2023-01-01', cudf.NaT, '2023-01-03'])
    dates = cudf.to_datetime(dates)
"""
```
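
The difference between a missing value and `NaN` can be sketched without cuDF. In the plain-Python illustration below, `NaN` is an ordinary floating-point value, while "missing" is tracked separately in a validity mask. The mask representation is an assumption borrowed from Arrow-style columnar formats and is not a description of cuDF internals.

```python
import math

nan = float("nan")
# NaN is a regular float value that compares unequal to itself.
nan_equals_itself = nan == nan
nan_detected = math.isnan(nan)

# Assumed representation: a data buffer plus a validity mask.
data = [1, 0, 3]             # the 0 is a placeholder in the missing slot
valid = [True, False, True]  # False marks a missing (NA) entry

isna = [not flag for flag in valid]
total = sum(x for x, flag in zip(data, valid) if flag)  # skip missing entries

print(nan_equals_itself, nan_detected, isna, total)
```

Because the mask is independent of the stored values, the same mechanism works for integers, strings, and nested types, none of which have a `NaN` of their own.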

## Memory Management

cuDF data structures leverage RAPIDS Memory Manager (RMM) for optimal GPU memory usage:

- **Columnar Storage**: Apache Arrow format for cache efficiency
- **Memory Pools**: Reduces allocation overhead for frequent operations
- **Zero-Copy**: Minimal data movement between operations
- **Automatic Cleanup**: Garbage collection integration for GPU memory
- **Memory Mapping**: Support for memory-mapped files

## Type Conversions

```python
import cudf
import pandas as pd

# CPU to GPU conversion
pandas_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
pandas_series = pd.Series([1, 2, 3])
cudf_df = cudf.from_pandas(pandas_df)
cudf_series = cudf.from_pandas(pandas_series)

# GPU to CPU conversion
df_pandas = cudf_df.to_pandas()
series_pandas = cudf_series.to_pandas()

# Arrow integration
arrow_table = cudf_df.to_arrow()
cudf_df = cudf.DataFrame.from_arrow(arrow_table)

# NumPy/CuPy arrays
cupy_array = cudf_series.values  # Get underlying CuPy array
cudf_series = cudf.Series(cupy_array)  # Create from CuPy array
```

## Performance Characteristics

- **Memory Bandwidth**: Speedups of 10-100x over pandas are possible on large datasets; actual gains depend on the workload and the GPU
- **Parallel Operations**: Leverages thousands of GPU cores
- **Cache Efficiency**: Columnar layout optimizes memory access patterns
- **Kernel Fusion**: Combines multiple operations into single GPU kernels
- **Evaluation Model**: The pandas-like API executes operations eagerly; deferring computation until results are needed requires a separate layer such as Dask-cuDF