# I/O Operations

cuDF provides high-performance GPU I/O for popular data formats with automatic memory management and optimized readers/writers. All I/O operations leverage GPU memory directly, minimizing CPU-GPU data transfers.

## Import Statements

```python
# Core I/O functions
from cudf import read_csv, read_parquet, read_json
from cudf.io import read_orc, read_avro, read_feather, read_hdf, read_text
from cudf.io.csv import to_csv
from cudf.io.orc import to_orc

# Parquet utilities
from cudf.io.parquet import (
    read_parquet_metadata, merge_parquet_filemetadata,
    ParquetDatasetWriter, write_to_dataset
)

# ORC utilities
from cudf.io.orc import read_orc_metadata

# Interoperability
from cudf.io.dlpack import from_dlpack
```
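
Because every reader returns a DataFrame that already lives in GPU memory, a pipeline can read, transform, and re-serialize data without a round trip through host memory. A minimal end-to-end sketch using only the options documented below (the file name `events.csv` and its columns are placeholders):

```python
import cudf

# Read a CSV straight into GPU memory (hypothetical file and columns)
df = cudf.read_csv('events.csv', dtype={'value': 'float64'}, parse_dates=['ts'])

# Transform on the GPU, then write a compressed Parquet file
positive = df[df['value'] > 0]
positive.to_parquet('events.parquet', compression='snappy')
```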

## CSV I/O

High-performance CSV reading with extensive parsing options.

```{ .api }
def read_csv(
    filepath_or_buffer,
    sep=',',
    delimiter=None,
    header='infer',
    names=None,
    index_col=None,
    usecols=None,
    dtype=None,
    skiprows=None,
    skipfooter=0,
    nrows=None,
    na_values=None,
    keep_default_na=True,
    na_filter=True,
    skip_blank_lines=True,
    parse_dates=False,
    date_parser=None,
    dayfirst=False,
    compression='infer',
    thousands=None,
    decimal='.',
    lineterminator=None,
    quotechar='"',
    quoting=0,
    doublequote=True,
    escapechar=None,
    comment=None,
    encoding='utf-8',
    storage_options=None,
    **kwargs
) -> DataFrame:
    """
    Read CSV file directly into GPU memory with optimized parsing

    Provides GPU-accelerated CSV parsing with extensive configuration options.
    Automatically detects and handles various CSV formats and encodings.

    Parameters:
        filepath_or_buffer: str, PathLike, or file-like object
            File path, URL, or buffer containing CSV data
        sep: str, default ','
            Field delimiter character
        delimiter: str, optional
            Alternative name for sep parameter
        header: int, list of int, or 'infer', default 'infer'
            Row number(s) to use as column names
        names: list, optional
            List of column names to use instead of header
        index_col: int, str, or list, optional
            Column(s) to use as row labels
        usecols: list or callable, optional
            Subset of columns to read
        dtype: dict or str, optional
            Data type specification for columns
        skiprows: int, list, or callable, optional
            Rows to skip at beginning of file
        skipfooter: int, default 0
            Number of rows to skip at end of file
        nrows: int, optional
            Maximum number of rows to read
        na_values: scalar, str, list, or dict, optional
            Additional strings to recognize as NA/NaN
        keep_default_na: bool, default True
            Whether to include default NaN values
        na_filter: bool, default True
            Whether to check for missing values
        skip_blank_lines: bool, default True
            Whether to skip blank lines
        parse_dates: bool, list, or dict, default False
            Columns to parse as dates
        compression: str or dict, default 'infer'
            Type of compression ('gzip', 'bz2', 'xz', 'zip', None)
        encoding: str, default 'utf-8'
            Character encoding to use
        storage_options: dict, optional
            Options for cloud storage access
        **kwargs: additional keyword arguments
            Other CSV parsing options

    Returns:
        DataFrame: GPU DataFrame containing parsed CSV data

    Examples:
        # Basic CSV reading
        df = cudf.read_csv('data.csv')

        # With custom options
        df = cudf.read_csv(
            'data.csv',
            sep=';',
            header=0,
            dtype={'col1': 'int64', 'col2': 'float32'},
            parse_dates=['date_column']
        )

        # From URL with compression
        df = cudf.read_csv(
            'https://example.com/data.csv.gz',
            compression='gzip'
        )
    """
```

### CSV Writing

```{ .api }
def to_csv(
    path_or_buf=None,
    sep=',',
    na_rep='',
    float_format=None,
    columns=None,
    header=True,
    index=True,
    index_label=None,
    mode='w',
    encoding=None,
    compression='infer',
    quoting=None,
    quotechar='"',
    line_terminator=None,
    chunksize=None,
    date_format=None,
    doublequote=True,
    escapechar=None,
    decimal='.',
    **kwargs
):
    """
    Write GPU DataFrame to CSV format

    High-performance CSV writing with customizable formatting options.
    Writes directly from GPU memory with minimal data transfers.

    Parameters:
        path_or_buf: str, path object, or file-like object
            File path or object to write to
        sep: str, default ','
            Field delimiter character
        na_rep: str, default ''
            String representation of NaN values
        float_format: str, optional
            Format string for floating point numbers
        columns: sequence, optional
            Columns to write
        header: bool or list of str, default True
            Write column names as header
        index: bool, default True
            Write row names (index)
        mode: str, default 'w'
            File mode ('w' for write, 'a' for append)
        compression: str or dict, default 'infer'
            Compression type ('gzip', 'bz2', 'xz', 'zstd', etc.)
        **kwargs: additional keyword arguments
            Other CSV writing options

    Examples:
        # Basic CSV writing
        df.to_csv('output.csv')

        # Custom formatting
        df.to_csv('output.csv', sep=';', index=False, float_format='%.2f')

        # Compressed output
        df.to_csv('output.csv.gz', compression='gzip')
    """
```

## Parquet I/O

Optimized Apache Parquet support with metadata handling and dataset operations.

```{ .api }
def read_parquet(
    path,
    engine='cudf',
    columns=None,
    filters=None,
    row_groups=None,
    use_pandas_metadata=True,
    storage_options=None,
    bytes_per_thread=None,
    **kwargs
) -> DataFrame:
    """
    Read Apache Parquet file(s) directly into GPU memory

    High-performance Parquet reader with predicate pushdown, column pruning,
    and automatic schema detection. Supports single files, directories, and
    cloud storage locations.

    Parameters:
        path: str, PathLike, or list
            File path, directory, or list of files to read
        engine: str, default 'cudf'
            Parquet engine to use ('cudf' for GPU acceleration)
        columns: list, optional
            Specific columns to read (column pruning)
        filters: list of tuples, optional
            Row filter predicates for predicate pushdown
        row_groups: list, optional
            Specific row groups to read
        use_pandas_metadata: bool, default True
            Whether to use pandas metadata for schema information
        storage_options: dict, optional
            Options for cloud storage (S3, GCS, Azure)
        bytes_per_thread: int, optional
            Bytes to read per thread for parallel I/O
        **kwargs: additional arguments
            Engine-specific options

    Returns:
        DataFrame: GPU DataFrame with Parquet data

    Examples:
        # Basic Parquet reading
        df = cudf.read_parquet('data.parquet')

        # Column pruning and filtering
        df = cudf.read_parquet(
            'data.parquet',
            columns=['col1', 'col2', 'col3'],
            filters=[('col1', '>', 100), ('col2', '==', 'value')]
        )

        # Multiple files
        df = cudf.read_parquet(['file1.parquet', 'file2.parquet'])

        # From cloud storage
        df = cudf.read_parquet(
            's3://bucket/path/data.parquet',
            storage_options={'key': 'access_key', 'secret': 'secret_key'}
        )
    """

def read_parquet_metadata(path, **kwargs) -> object:
    """
    Read metadata from Parquet file without loading data

    Extracts schema information, row group statistics, and file metadata
    for query planning and data exploration without full data loading.

    Parameters:
        path: str or PathLike
            Path to Parquet file
        **kwargs: additional arguments
            Storage and engine options

    Returns:
        object: Parquet metadata object with schema and statistics

    Examples:
        # Read metadata only
        metadata = cudf.io.parquet.read_parquet_metadata('data.parquet')
        print(f"Rows: {metadata.num_rows}")
        print(f"Columns: {len(metadata.schema)}")
    """

def merge_parquet_filemetadata(metadata_list) -> object:
    """
    Merge multiple Parquet file metadata objects

    Combines metadata from multiple Parquet files for unified schema
    and statistics. Useful for dataset-level operations.

    Parameters:
        metadata_list: list
            List of Parquet metadata objects to merge

    Returns:
        object: Merged Parquet metadata object

    Examples:
        # Merge metadata from multiple files
        meta1 = cudf.io.parquet.read_parquet_metadata('file1.parquet')
        meta2 = cudf.io.parquet.read_parquet_metadata('file2.parquet')
        merged = cudf.io.parquet.merge_parquet_filemetadata([meta1, meta2])
    """
```

### Parquet Dataset Operations

```{ .api }
class ParquetDatasetWriter:
    """
    Writer for partitioned Parquet datasets

    Manages writing DataFrames to partitioned Parquet datasets with
    automatic directory structure creation and metadata management.

    Parameters:
        path: str or PathLike
            Root directory for the dataset
        partition_cols: list, optional
            Columns to use for dataset partitioning
        **kwargs: additional arguments
            Writer configuration options

    Methods:
        write_table(table, **kwargs): Write table to dataset
        close(): Finalize dataset and write metadata

    Examples:
        # Create partitioned dataset writer
        writer = cudf.io.parquet.ParquetDatasetWriter(
            '/path/to/dataset',
            partition_cols=['year', 'month']
        )

        # Write data in chunks
        for chunk in data_chunks:
            writer.write_table(chunk)
        writer.close()
    """

def write_to_dataset(
    df,
    root_path,
    partition_cols=None,
    preserve_index=False,
    storage_options=None,
    **kwargs
) -> None:
    """
    Write DataFrame to partitioned Parquet dataset

    Creates partitioned Parquet dataset with automatic directory structure
    based on partition columns. Supports cloud storage destinations.

    Parameters:
        df: DataFrame
            cuDF DataFrame to write
        root_path: str or PathLike
            Root directory for dataset
        partition_cols: list, optional
            Columns to use for partitioning
        preserve_index: bool, default False
            Whether to write index as column
        storage_options: dict, optional
            Cloud storage configuration
        **kwargs: additional arguments
            Writer options (compression, etc.)

    Examples:
        # Write partitioned dataset
        cudf.io.parquet.write_to_dataset(
            df,
            '/path/to/dataset',
            partition_cols=['year', 'category'],
            compression='snappy'
        )
    """
```
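
A dataset written this way can typically be read back with `read_parquet`, where filters on the partition columns prune whole partitions before any data is decoded. A sketch continuing the example above (the directory layout, partition values, and the `value` column are assumptions, not guaranteed by the API text):

```python
import cudf

# Assumed layout produced above: /path/to/dataset/year=2024/category=a/part-*.parquet
df_2024 = cudf.read_parquet(
    '/path/to/dataset',
    columns=['category', 'value'],   # column pruning; 'value' is a placeholder name
    filters=[('year', '==', 2024)]   # partition pruning via predicate pushdown
)
```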

## JSON I/O

Flexible JSON reading with support for various JSON formats.

```{ .api }
def read_json(
    path_or_buf,
    orient='records',
    typ='frame',
    dtype=None,
    lines=False,
    compression='infer',
    storage_options=None,
    **kwargs
) -> DataFrame:
    """
    Read JSON data directly into GPU memory

    Supports various JSON formats including line-delimited JSON (JSONL),
    nested JSON structures, and automatic schema inference.

    Parameters:
        path_or_buf: str, PathLike, or file-like object
            JSON data source (file, URL, or buffer)
        orient: str, default 'records'
            JSON structure format ('records', 'index', 'values', 'split')
        typ: str, default 'frame'
            Type of object to return ('frame' for DataFrame)
        dtype: dict or str, optional
            Data type specification for columns
        lines: bool, default False
            Whether to read line-delimited JSON
        compression: str, default 'infer'
            Compression type ('gzip', 'bz2', 'xz', None)
        storage_options: dict, optional
            Cloud storage configuration
        **kwargs: additional arguments
            JSON parsing options

    Returns:
        DataFrame: GPU DataFrame containing JSON data

    Examples:
        # Read JSON file
        df = cudf.read_json('data.json')

        # Line-delimited JSON
        df = cudf.read_json('data.jsonl', lines=True)

        # With compression
        df = cudf.read_json('data.json.gz', compression='gzip')

        # From URL
        df = cudf.read_json('https://api.example.com/data.json')
    """
```

## ORC I/O

Apache ORC format support with metadata utilities.

```{ .api }
def read_orc(
    path,
    columns=None,
    filters=None,
    stripes=None,
    skiprows=None,
    num_rows=None,
    use_index=True,
    storage_options=None,
    **kwargs
) -> DataFrame:
    """
    Read Apache ORC file directly into GPU memory

    High-performance ORC reader with predicate pushdown and column pruning.
    Supports compressed ORC files and cloud storage.

    Parameters:
        path: str or PathLike
            Path to ORC file
        columns: list, optional
            Specific columns to read
        filters: list of tuples, optional
            Row filter predicates
        stripes: list, optional
            Specific ORC stripes to read
        skiprows: int, optional
            Number of rows to skip
        num_rows: int, optional
            Maximum rows to read
        use_index: bool, default True
            Whether to use ORC file index
        storage_options: dict, optional
            Cloud storage options
        **kwargs: additional arguments
            Reader configuration

    Returns:
        DataFrame: GPU DataFrame with ORC data

    Examples:
        # Basic ORC reading
        df = cudf.read_orc('data.orc')

        # With column pruning and filtering
        df = cudf.read_orc(
            'data.orc',
            columns=['col1', 'col2'],
            filters=[('col1', '>', 0)]
        )
    """

def read_orc_metadata(path, **kwargs) -> object:
    """
    Read metadata from ORC file without loading data

    Extracts schema, stripe information, and statistics for
    query planning and data exploration.

    Parameters:
        path: str or PathLike
            Path to ORC file
        **kwargs: additional arguments
            Reader options

    Returns:
        object: ORC metadata with schema and statistics

    Examples:
        # Read ORC metadata
        metadata = cudf.io.orc.read_orc_metadata('data.orc')
        print(f"Stripes: {len(metadata.stripes)}")
    """
```

522

523

### ORC Writing

524

525

```{ .api }

526

def to_orc(

527

path,

528

compression='snappy',

529

enable_statistics=True,

530

stripe_size_bytes=None,

531

stripe_size_rows=None,

532

row_index_stride=None,

533

**kwargs

534

):

535

"""

536

Write GPU DataFrame to Apache ORC format

537

538

High-performance ORC writing with compression and statistical metadata.

539

Writes directly from GPU memory with configurable stripe organization.

540

541

Parameters:

542

path: str or PathLike

543

Output path for ORC file

544

compression: str, default 'snappy'

545

Compression algorithm ('snappy', 'zlib', 'lz4', 'zstd', None)

546

enable_statistics: bool, default True

547

Whether to compute column statistics

548

stripe_size_bytes: int, optional

549

Target stripe size in bytes

550

stripe_size_rows: int, optional

551

Target stripe size in rows

552

row_index_stride: int, optional

553

Row group index stride

554

**kwargs: additional keyword arguments

555

Other ORC writing options

556

557

Examples:

558

# Basic ORC writing

559

df.to_orc('output.orc')

560

561

# With compression

562

df.to_orc('output.orc', compression='zlib')

563

564

# Custom stripe configuration

565

df.to_orc('output.orc', stripe_size_rows=50000)

566

"""

567

```

## Avro I/O

Apache Avro format support for schema evolution and serialization.

```{ .api }
def read_avro(
    filepath_or_buffer,
    columns=None,
    skiprows=None,
    num_rows=None,
    storage_options=None,
    **kwargs
) -> DataFrame:
    """
    Read Apache Avro file directly into GPU memory

    Reads Avro files with automatic schema detection and type conversion.
    Supports compressed Avro files and nested data structures.

    Parameters:
        filepath_or_buffer: str, PathLike, or file-like object
            Avro data source
        columns: list, optional
            Specific columns to read
        skiprows: int, optional
            Number of rows to skip at beginning
        num_rows: int, optional
            Maximum number of rows to read
        storage_options: dict, optional
            Cloud storage configuration
        **kwargs: additional arguments
            Avro reader options

    Returns:
        DataFrame: GPU DataFrame with Avro data

    Examples:
        # Read Avro file
        df = cudf.read_avro('data.avro')

        # With column selection
        df = cudf.read_avro('data.avro', columns=['col1', 'col2'])
    """
```

## Feather I/O

Apache Arrow Feather format for fast serialization.

```{ .api }
def read_feather(
    path,
    columns=None,
    use_threads=True,
    storage_options=None,
    **kwargs
) -> DataFrame:
    """
    Read Apache Feather format file into GPU memory

    Fast binary format based on Apache Arrow for efficient DataFrame
    serialization with preserved data types and metadata.

    Parameters:
        path: str or PathLike
            Path to Feather file
        columns: list, optional
            Subset of columns to read
        use_threads: bool, default True
            Whether to use threading for parallel I/O
        storage_options: dict, optional
            Cloud storage options
        **kwargs: additional arguments
            Reader configuration

    Returns:
        DataFrame: GPU DataFrame with Feather data

    Examples:
        # Read Feather file
        df = cudf.read_feather('data.feather')

        # Column selection
        df = cudf.read_feather('data.feather', columns=['A', 'B'])
    """
```

655

656

## HDF5 I/O

657

658

HDF5 format support for scientific and numerical data.

659

660

```{ .api }

661

def read_hdf(

662

path_or_buf,

663

key=None,

664

mode='r',

665

columns=None,

666

start=None,

667

stop=None,

668

**kwargs

669

) -> DataFrame:

670

"""

671

Read HDF5 file into GPU memory

672

673

Reads HDF5 datasets with support for hierarchical data organization

674

and partial reading of large datasets.

675

676

Parameters:

677

path_or_buf: str, PathLike, or file-like object

678

HDF5 file source

679

key: str, optional

680

HDF5 group/dataset key to read

681

mode: str, default 'r'

682

File access mode

683

columns: list, optional

684

Subset of columns to read

685

start: int, optional

686

Starting row position

687

stop: int, optional

688

Ending row position

689

**kwargs: additional arguments

690

HDF5 reader options

691

692

Returns:

693

DataFrame: GPU DataFrame with HDF5 data

694

695

Examples:

696

# Read HDF5 dataset

697

df = cudf.read_hdf('data.h5', key='dataset1')

698

699

# Partial reading

700

df = cudf.read_hdf('data.h5', key='dataset1', start=1000, stop=2000)

701

"""

702

```

703

704

## Text I/O

705

706

Raw text file reading for unstructured data processing.

707

708

```{ .api }

709

def read_text(

710

filepath_or_buffer,

711

delimiter=None,

712

dtype='str',

713

lineterminator='\n',

714

skiprows=0,

715

skipfooter=0,

716

nrows=None,

717

na_values=None,

718

keep_default_na=True,

719

na_filter=True,

720

storage_options=None,

721

**kwargs

722

) -> DataFrame:

723

"""

724

Read raw text file line by line into GPU memory

725

726

Reads unstructured text data with each line as a DataFrame row.

727

Useful for log files, natural language processing, and custom parsing.

728

729

Parameters:

730

filepath_or_buffer: str, PathLike, or file-like object

731

Text file source

732

delimiter: str, optional

733

Line delimiter (default: newline)

734

dtype: str, default 'str'

735

Data type for text data

736

lineterminator: str, default '\n'

737

Line termination character

738

skiprows: int, default 0

739

Number of rows to skip at beginning

740

skipfooter: int, default 0

741

Number of rows to skip at end

742

nrows: int, optional

743

Maximum number of lines to read

744

na_values: list, optional

745

Values to treat as missing

746

keep_default_na: bool, default True

747

Whether to include default NA values

748

na_filter: bool, default True

749

Whether to check for missing values

750

storage_options: dict, optional

751

Cloud storage configuration

752

**kwargs: additional arguments

753

Text reader options

754

755

Returns:

756

DataFrame: GPU DataFrame with one column containing text lines

757

758

Examples:

759

# Read text file

760

df = cudf.read_text('logfile.txt')

761

762

# With line limits

763

df = cudf.read_text('data.txt', nrows=1000)

764

"""

765

```

## Interoperability

### DLPack Integration

```{ .api }
def from_dlpack(dlpack_tensor) -> Union[DataFrame, Series]:
    """
    Create cuDF object from DLPack tensor

    Enables zero-copy data sharing between cuDF and other GPU libraries
    that support the DLPack standard (PyTorch, CuPy, JAX, etc.).

    Parameters:
        dlpack_tensor: DLPack tensor object
            GPU tensor in DLPack format

    Returns:
        Union[DataFrame, Series]: cuDF object sharing memory with tensor

    Examples:
        # From PyTorch tensor
        import torch
        tensor = torch.cuda.FloatTensor([1, 2, 3, 4])
        series = cudf.io.dlpack.from_dlpack(tensor.__dlpack__())

        # From CuPy array
        import cupy
        array = cupy.array([1.0, 2.0, 3.0])
        series = cudf.io.dlpack.from_dlpack(array.toDlpack())
    """
```
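
The exchange also works in the other direction: a cuDF column can export a DLPack capsule for other libraries to consume the same GPU buffer. A round-trip sketch with CuPy, assuming `Series.to_dlpack()` and CuPy's DLPack helpers are available in your installed versions:

```python
import cudf
import cupy as cp

# CuPy -> cuDF: hand over a DLPack capsule; data stays on the GPU
arr = cp.arange(10, dtype=cp.float64)
ser = cudf.from_dlpack(arr.toDlpack())

# cuDF -> CuPy: export the column back as a DLPack capsule
# (to_dlpack()/fromDlpack naming is an assumption about the installed versions)
back = cp.fromDlpack(ser.to_dlpack())
```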

## DataFrame Write Methods

All cuDF DataFrames include write methods for various formats:

```python
# CSV writing
df.to_csv('output.csv', index=False)

# Parquet writing
df.to_parquet('output.parquet', compression='snappy')

# JSON writing
df.to_json('output.json', orient='records', lines=True)

# ORC writing
df.to_orc('output.orc', compression='zlib')

# Feather writing
df.to_feather('output.feather')

# HDF5 writing
df.to_hdf('output.h5', key='dataset', mode='w')
```

## Performance Optimizations

### GPU Memory Management
- **Direct GPU Loading**: All readers load data directly to GPU memory
- **Memory Mapping**: Support for memory-mapped files to reduce memory usage
- **Streaming**: Chunked reading for datasets larger than GPU memory (see the sketch after this list)
- **Zero-Copy**: Minimal memory copying between operations
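
The streaming point is easiest to see with Parquet, where the `row_groups` argument of `read_parquet` lets you pull one batch at a time instead of the whole file. A sketch of that pattern; the file name, row-group count, and per-group loop are illustrative and assume the file was written with multiple row groups:

```python
import cudf

# Process a large Parquet file one row group at a time (illustrative batch loop)
num_row_groups = 8  # placeholder; in practice derive this from the file's metadata
for rg in range(num_row_groups):
    chunk = cudf.read_parquet('big.parquet', row_groups=[rg])
    # ... aggregate or filter the chunk here ...
    del chunk  # release the chunk so GPU memory can be reused for the next batch
```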

### Parallel Processing
- **Multi-threaded I/O**: Parallel file reading with configurable thread counts
- **Column Parallelism**: Independent processing of columns during parsing
- **Compressed Reading**: Hardware-accelerated decompression on GPU

835

836

### Query Optimization

837

- **Predicate Pushdown**: Filter rows during file reading

838

- **Column Pruning**: Read only required columns from files

839

- **Schema Inference**: Automatic data type detection and optimization

840

- **Metadata Caching**: Reuse file metadata for repeated operations
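
These optimizations are driven by the reader arguments documented earlier: `columns` triggers column pruning and `filters` triggers predicate pushdown for both `read_parquet` and `read_orc`. A small combined sketch; the file and column names are placeholders:

```python
import cudf

# Only 'col1' and 'col2' are decoded, and only rows with col1 > 100 are materialized
df = cudf.read_parquet(
    'data.parquet',
    columns=['col1', 'col2'],      # column pruning
    filters=[('col1', '>', 100)],  # predicate pushdown
)

# The same arguments apply to ORC files
orc_df = cudf.read_orc('data.orc', columns=['col1'], filters=[('col1', '>', 100)])
```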

## Cloud Storage Support

All I/O functions support cloud storage through `storage_options`:

```python
# Amazon S3
s3_options = {
    'key': 'access_key_id',
    'secret': 'secret_access_key',
    'token': 'session_token'  # optional
}
df = cudf.read_parquet('s3://bucket/path/data.parquet',
                       storage_options=s3_options)

# Google Cloud Storage
gcs_options = {
    'token': 'path/to/service_account.json'
}
df = cudf.read_csv('gs://bucket/data.csv', storage_options=gcs_options)

# Azure Blob Storage
azure_options = {
    'account_name': 'storage_account',
    'account_key': 'account_key'
}
df = cudf.read_json('abfs://container/data.json',
                    storage_options=azure_options)
```
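
`storage_options` is passed through to the underlying remote-filesystem layer (an fsspec-compatible backend in cuDF's case), so backend-specific keys beyond the credentials shown above are also accepted; treat the exact keys as backend-dependent. For a public S3 bucket, for example, anonymous access commonly looks like this (the `anon` key is an s3fs option, not a cuDF parameter):

```python
import cudf

# Anonymous read from a public S3 bucket (bucket/path are placeholders)
df = cudf.read_parquet(
    's3://public-bucket/path/data.parquet',
    storage_options={'anon': True}
)
```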