
# cuDF: GPU-Accelerated DataFrames

cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. It provides a pandas-like API that will be familiar to data engineers and data scientists, so they can accelerate their workflows without diving into the details of CUDA programming.

## Package Information

- **Package**: `cudf-cu12`
- **Import**: `cudf`
- **Version**: 25.8.0+
- **Installation**: `pip install cudf-cu12` or `conda install cudf`
- **Requirements**: NVIDIA GPU with CUDA support

## Core Imports

```python
# Main data structures
import cudf
from cudf import DataFrame, Series, Index

# I/O operations
from cudf import read_csv, read_parquet, read_json
from cudf.io import read_orc, read_avro, read_feather

# Data manipulation
from cudf import concat, merge, pivot_table
from cudf import cut, factorize, unique

# Type checking
from cudf.api.types import is_numeric_dtype, is_categorical_dtype
from cudf.api.types import dtype

# Configuration
from cudf.options import get_option, set_option

# Dataset generation
from cudf.datasets import timeseries, randomdata

# Version information
print(cudf.__version__)  # Package version
```

## Basic Usage

```{ .api }
# Create DataFrame from dictionary
df = cudf.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [1.0, 2.5, 3.2, 4.1, 5.8],
    'z': ['red', 'green', 'blue', 'red', 'green']
})

# GPU-accelerated operations
result = df.groupby('z').agg({'x': 'sum', 'y': 'mean'})

# I/O operations leverage GPU memory
df_from_file = cudf.read_parquet('data.parquet')
df_from_file.to_csv('output.csv')

# Seamless pandas compatibility
df_pandas = df.to_pandas()             # Move to CPU
df_cudf = cudf.from_pandas(df_pandas)  # Move to GPU
```

## Architecture

cuDF leverages the RAPIDS ecosystem to provide GPU-accelerated data processing:

- **GPU Memory Management**: Built on RAPIDS Memory Manager (RMM) for efficient GPU memory allocation
- **Columnar Storage**: Uses Apache Arrow format for optimal GPU performance
- **libcudf Backend**: C++/CUDA library provides the computational engine
- **Pandas API**: Maintains familiar pandas interface while delivering GPU performance
- **Zero-Copy Interop**: Seamless integration with PyArrow, Numba, and other GPU libraries (see the sketch after this list)

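As an illustration of the Arrow and GPU-array interop noted above, here is a minimal sketch; the column names and values are invented for the example, and it assumes PyArrow and CuPy are installed alongside cuDF:

```python
import cudf

df = cudf.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})

# DataFrame <-> Arrow Table conversion shares the columnar layout
table = df.to_arrow()                    # pyarrow.Table
df_back = cudf.DataFrame.from_arrow(table)

# Numeric columns can be exposed to CuPy while staying in GPU memory
gpu_array = df['b'].to_cupy()
print(type(table), type(gpu_array))
```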
## Core Data Structures

cuDF provides GPU-accelerated versions of pandas' core data structures with enhanced capabilities.

```{ .api }
class DataFrame:
    """GPU-accelerated DataFrame with pandas-like API"""

class Series:
    """One-dimensional GPU array with axis labels"""

class Index:
    """Immutable sequence used for axis labels and selection"""

class RangeIndex(Index):
    """Memory-efficient index for integer ranges"""

class CategoricalIndex(Index):
    """Index for categorical data with GPU acceleration"""
```

**Key Features**: GPU memory efficiency, nested data types (lists, structs), decimal precision support.

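A brief sketch of constructing these structures, including a Series with a nested list dtype; the values are illustrative only:

```python
import cudf

s = cudf.Series([10, 20, 30], name='reading')              # one-dimensional GPU array
idx = cudf.Index(['r1', 'r2', 'r3'])                        # immutable axis labels
df = cudf.DataFrame({'reading': [10, 20, 30]}, index=idx)   # GPU DataFrame

# Nested list dtype: each row holds a variable-length list, stored on the GPU
nested = cudf.Series([[1, 2], [3], []])
print(df)
print(nested.dtype)  # list dtype
```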
[**→ Learn more about Core Data Structures**](./core-data-structures.md)

## I/O Operations

High-performance GPU I/O for popular data formats with automatic memory management.

```{ .api }
def read_parquet(filepath_or_buffer, columns=None, **kwargs) -> DataFrame:
    """
    Read Apache Parquet file directly into GPU memory

    Parameters:
        filepath_or_buffer: File path, URL, or buffer-like object
        columns: List[str], optional column subset to read
        **kwargs: Additional parquet reading options

    Returns:
        DataFrame: GPU-accelerated DataFrame
    """

def read_csv(filepath_or_buffer, **kwargs) -> DataFrame:
    """
    Read CSV file with GPU acceleration

    Parameters:
        filepath_or_buffer: File path or buffer
        **kwargs: CSV parsing options (delimiter, header, etc.)

    Returns:
        DataFrame: GPU DataFrame with parsed CSV data
    """
```

**Supported Formats**: Parquet, ORC, CSV, JSON, Avro, Feather, HDF5, raw text files.

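A short usage sketch of the readers above; the file names and column names are placeholders, and the `to_parquet`/`to_csv` writers are assumed to mirror the pandas-style API:

```python
import cudf

# Read only the columns we need, straight into GPU memory
df = cudf.read_parquet('events.parquet', columns=['user_id', 'amount'])

# Standard CSV reading with pandas-style options
logs = cudf.read_csv('logs.csv', delimiter=',', header=0)

# Writers mirror the pandas API
df.to_parquet('events_subset.parquet')
logs.to_csv('logs_copy.csv', index=False)
```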
[**→ Learn more about I/O Operations**](./io-operations.md)

## Data Manipulation

GPU-accelerated operations for reshaping, joining, and transforming data.

```{ .api }
def concat(objs, axis=0, ignore_index=False, **kwargs) -> Union[DataFrame, Series]:
    """
    Concatenate cuDF objects along a particular axis

    Parameters:
        objs: Sequence of DataFrame/Series objects
        axis: int, axis to concatenate along (0='index', 1='columns')
        ignore_index: bool, reset index if True

    Returns:
        Union[DataFrame, Series]: Concatenated result
    """

def merge(left, right, how='inner', on=None, **kwargs) -> DataFrame:
    """
    Merge DataFrame objects with database-style join operations

    Parameters:
        left: DataFrame, left object to merge
        right: DataFrame, right object to merge
        how: str, type of merge ('inner', 'outer', 'left', 'right')
        on: label or list, column names to join on

    Returns:
        DataFrame: Merged DataFrame
    """
```

**Operations**: Concatenation, merging, pivoting, melting, groupby, aggregation, sorting.

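A minimal sketch of `concat` and `merge` in use; the column names and values are invented for illustration:

```python
import cudf

orders = cudf.DataFrame({'key': [1, 2, 3], 'qty': [5, 1, 2]})
extra = cudf.DataFrame({'key': [4], 'qty': [7]})
prices = cudf.DataFrame({'key': [1, 2, 3, 4], 'price': [9.5, 3.0, 4.25, 1.0]})

# Stack rows and reset the index
all_orders = cudf.concat([orders, extra], ignore_index=True)

# Database-style join on a shared key column
joined = cudf.merge(all_orders, prices, on='key', how='left')
print(joined.sort_values('key'))
```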
[**→ Learn more about Data Manipulation**](./data-manipulation.md)

## Type Checking & Validation

Comprehensive type checking system for GPU data types including nested types.

```{ .api }
def is_numeric_dtype(arr_or_dtype) -> bool:
    """
    Check whether the provided array or dtype is numeric

    Parameters:
        arr_or_dtype: Array-like or data type to check

    Returns:
        bool: True if numeric dtype
    """

def is_categorical_dtype(arr_or_dtype) -> bool:
    """
    Check whether the array or dtype is categorical

    Parameters:
        arr_or_dtype: Array-like or data type to check

    Returns:
        bool: True if categorical dtype
    """
```

**Type Support**: Standard dtypes, categorical, decimal, list, struct, interval, datetime types.

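A small sketch showing these predicates applied to cuDF columns; the example data is made up:

```python
import cudf
from cudf.api.types import is_numeric_dtype, is_categorical_dtype

df = cudf.DataFrame({
    'amount': [1.5, 2.0, 3.25],
    'status': cudf.Series(['new', 'done', 'new'], dtype='category'),
})

print(is_numeric_dtype(df['amount']))      # True
print(is_categorical_dtype(df['status']))  # True
print(is_numeric_dtype(df['status']))      # False
```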
[**→ Learn more about Type Checking**](./type-checking.md)

## Pandas Compatibility Layer

Drop-in acceleration for existing pandas code with cudf.pandas.

```{ .api }
def install() -> None:
    """
    Enable cuDF pandas accelerator mode

    Automatically accelerates pandas operations with GPU when beneficial,
    falls back to CPU pandas for unsupported operations.
    """

class Profiler:
    """
    Performance profiler for pandas acceleration opportunities

    Analyzes pandas code execution to identify GPU acceleration potential
    """
```

**Features**: Automatic fallback, transparent acceleration, performance profiling, IPython magic commands.

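A sketch of enabling the accelerator in a plain Python script; in notebooks the equivalent is the `%load_ext cudf.pandas` magic referenced by the features above:

```python
# Call install() before pandas is imported so imports are transparently redirected
import cudf.pandas
cudf.pandas.install()

import pandas as pd  # now backed by cuDF where possible

df = pd.DataFrame({'x': range(1000), 'g': ['a', 'b'] * 500})
print(df.groupby('g')['x'].mean())  # runs on the GPU when supported, else CPU pandas
```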
[**→ Learn more about Pandas Compatibility**](./pandas-compatibility.md)

## Testing Utilities

GPU-aware testing framework with specialized assertions for cuDF objects.

```{ .api }
def assert_frame_equal(left, right, check_dtype=True, **kwargs) -> None:
    """
    Assert DataFrame equality with GPU-aware comparison

    Parameters:
        left: DataFrame, expected result
        right: DataFrame, actual result
        check_dtype: bool, whether to check dtype compatibility
        **kwargs: Additional comparison options
    """
```

**Capabilities**: DataFrame/Series/Index comparison, GPU memory validation, performance assertions.

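A minimal test sketch using this assertion; it assumes `assert_frame_equal` is importable from `cudf.testing`, as in recent releases:

```python
import cudf
from cudf.testing import assert_frame_equal

def test_double_column():
    df = cudf.DataFrame({'x': [1, 2, 3]})
    result = df.assign(x=df['x'] * 2)
    expected = cudf.DataFrame({'x': [2, 4, 6]})
    assert_frame_equal(result, expected)  # raises AssertionError on mismatch
```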
[**→ Learn more about Testing Utilities**](./testing-utilities.md)

## Configuration Management

Global configuration system for controlling GPU memory usage and behavior.

```{ .api }
def get_option(key: str) -> Any:
    """
    Get the value of a configuration option

    Parameters:
        key: str, configuration option key

    Returns:
        Any: Current option value
    """

def set_option(key: str, value: Any) -> None:
    """
    Set a configuration option value

    Parameters:
        key: str, configuration option key
        value: Any, new option value
    """
```

**Options**: Memory management, display formatting, computation behavior, I/O settings.

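An illustrative sketch of reading and changing an option; `copy_on_write` is used as an example key and may differ across versions, so check `cudf.describe_option()` for the options available in your release:

```python
from cudf.options import get_option, set_option

# Read the current value, then change it for the rest of the session
print(get_option('copy_on_write'))
set_option('copy_on_write', True)
print(get_option('copy_on_write'))  # True
```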
## Error Handling

Specialized error types for GPU-specific issues and mixed-type operations.

```{ .api }
class UnsupportedCUDAError(Exception):
    """Raised when CUDA functionality is not supported"""

class MixedTypeError(Exception):
    """Raised when mixing incompatible GPU and CPU types"""
```

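A hedged sketch of catching these errors; it assumes the classes live in `cudf.errors` and that building a column from incompatible Python objects raises `MixedTypeError` (exact behavior can vary by version, so `TypeError` is caught as a fallback):

```python
import cudf
from cudf.errors import MixedTypeError

try:
    # Mixing integers and strings in one column has no single GPU dtype
    cudf.Series([1, 'two', 3])
except (MixedTypeError, TypeError) as exc:
    print(f'Could not build a GPU column: {exc}')
```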
## Dataset Generation

Utilities for generating test data and benchmarking datasets directly in GPU memory.

```{ .api }
def timeseries(
    start='2000-01-01',
    end='2000-01-31',
    freq='1s',
    dtypes=None,
    nulls_frequency=0,
    seed=None
) -> DataFrame:
    """
    Generate random timeseries data for testing and benchmarking

    Parameters:
        start: str or datetime-like, start date
        end: str or datetime-like, end date
        freq: str, date frequency string (e.g., '1s', '1H', '1D')
        dtypes: dict, mapping of column names to types
        nulls_frequency: float, proportion of nulls to include (0-1)
        seed: int, random state seed for reproducibility

    Returns:
        DataFrame: GPU DataFrame with random timeseries data
    """

def randomdata(nrows=10, dtypes=None, seed=None) -> DataFrame:
    """
    Generate random data for testing and benchmarking

    Parameters:
        nrows: int, number of rows to generate
        dtypes: dict, mapping of column names to types
        seed: int, random state seed for reproducibility

    Returns:
        DataFrame: GPU DataFrame with random data
    """
```

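A brief usage sketch of the generators documented above; the date range and `dtypes` mapping are illustrative and follow the signatures shown:

```python
from cudf.datasets import timeseries, randomdata

# One day of per-second data, with reproducible values
ts = timeseries(start='2000-01-01', end='2000-01-02', freq='1s', seed=1)

# Small random frame with explicitly typed columns
rnd = randomdata(nrows=10, dtypes={'a': int, 'b': float}, seed=12)

print(ts.head())
print(rnd.dtypes)
```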
## Performance Benefits

- **Memory Bandwidth**: 10-50x improvement over pandas for large datasets
- **Parallel Processing**: Leverages thousands of GPU cores for operations
- **Memory Efficiency**: Columnar storage reduces memory footprint
- **Zero-Copy**: Minimal data movement between GPU operations
- **Automatic Optimization**: Query optimization and kernel fusion

## GPU Requirements

- NVIDIA GPU with Compute Capability 7.0+ (Volta architecture or newer)
- CUDA 11.2+ or CUDA 12.0+
- Sufficient GPU memory for dataset size
- Compatible NVIDIA drivers

345

346

## Version Information

347

348

Access package version and build information programmatically.

349

350

```{ .api }

351

import cudf

352

353

# Package version string

354

__version__ = cudf.__version__ # e.g., "25.8.0"

355

356

# Git commit hash (if available)

357

__git_commit__ = cudf.__git_commit__ # e.g., "6cea3743b6"

358

```