# I/O Operations

cuDF provides high-performance GPU I/O for popular data formats, with automatic memory management and optimized readers and writers. All I/O operations read into and write from GPU memory directly, minimizing CPU-GPU data transfers.

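A minimal sketch of what this means in practice (the file path is a placeholder): readers return a `cudf.DataFrame` that already lives in GPU memory, and moving data to the host is a separate, explicit step.

```python
import cudf

# The reader parses the file on the GPU and returns a cudf.DataFrame that
# lives in device memory; no intermediate pandas object is created.
df = cudf.read_csv("data.csv")

# Copying to host (pandas) memory is a separate, explicit step.
pdf = df.to_pandas()
```
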
## Import Statements

```python
# Core I/O functions
from cudf import read_csv, read_parquet, read_json
from cudf.io import read_orc, read_avro, read_feather, read_hdf, read_text
from cudf.io.csv import to_csv
from cudf.io.orc import to_orc

# Parquet utilities
from cudf.io.parquet import (
    read_parquet_metadata, merge_parquet_filemetadata,
    ParquetDatasetWriter, write_to_dataset
)

# ORC utilities
from cudf.io.orc import read_orc_metadata

# Interoperability
from cudf.io.dlpack import from_dlpack
```

## CSV I/O

High-performance CSV reading with extensive parsing options.

```{ .api }
def read_csv(
    filepath_or_buffer,
    sep=',',
    delimiter=None,
    header='infer',
    names=None,
    index_col=None,
    usecols=None,
    dtype=None,
    skiprows=None,
    skipfooter=0,
    nrows=None,
    na_values=None,
    keep_default_na=True,
    na_filter=True,
    skip_blank_lines=True,
    parse_dates=False,
    date_parser=None,
    dayfirst=False,
    compression='infer',
    thousands=None,
    decimal='.',
    lineterminator=None,
    quotechar='"',
    quoting=0,
    doublequote=True,
    escapechar=None,
    comment=None,
    encoding='utf-8',
    storage_options=None,
    **kwargs
) -> DataFrame:
    """
    Read CSV file directly into GPU memory with optimized parsing

    Provides GPU-accelerated CSV parsing with extensive configuration options.
    Automatically detects and handles various CSV formats and encodings.

    Parameters:
        filepath_or_buffer: str, PathLike, or file-like object
            File path, URL, or buffer containing CSV data
        sep: str, default ','
            Field delimiter character
        delimiter: str, optional
            Alternative name for the sep parameter
        header: int, list of int, or 'infer', default 'infer'
            Row number(s) to use as column names
        names: list, optional
            List of column names to use instead of header
        index_col: int, str, or list, optional
            Column(s) to use as row labels
        usecols: list or callable, optional
            Subset of columns to read
        dtype: dict or str, optional
            Data type specification for columns
        skiprows: int, list, or callable, optional
            Rows to skip at beginning of file
        skipfooter: int, default 0
            Number of rows to skip at end of file
        nrows: int, optional
            Maximum number of rows to read
        na_values: scalar, str, list, or dict, optional
            Additional strings to recognize as NA/NaN
        keep_default_na: bool, default True
            Whether to include default NaN values
        na_filter: bool, default True
            Whether to check for missing values
        skip_blank_lines: bool, default True
            Whether to skip blank lines
        parse_dates: bool, list, or dict, default False
            Columns to parse as dates
        compression: str or dict, default 'infer'
            Type of compression ('gzip', 'bz2', 'xz', 'zip', None)
        encoding: str, default 'utf-8'
            Character encoding to use
        storage_options: dict, optional
            Options for cloud storage access
        **kwargs: additional keyword arguments
            Other CSV parsing options

    Returns:
        DataFrame: GPU DataFrame containing parsed CSV data

    Examples:
        # Basic CSV reading
        df = cudf.read_csv('data.csv')

        # With custom options
        df = cudf.read_csv(
            'data.csv',
            sep=';',
            header=0,
            dtype={'col1': 'int64', 'col2': 'float32'},
            parse_dates=['date_column']
        )

        # From URL with compression
        df = cudf.read_csv(
            'https://example.com/data.csv.gz',
            compression='gzip'
        )
    """
```

### CSV Writing

```{ .api }
def to_csv(
    path_or_buf=None,
    sep=',',
    na_rep='',
    float_format=None,
    columns=None,
    header=True,
    index=True,
    index_label=None,
    mode='w',
    encoding=None,
    compression='infer',
    quoting=None,
    quotechar='"',
    line_terminator=None,
    chunksize=None,
    date_format=None,
    doublequote=True,
    escapechar=None,
    decimal='.',
    **kwargs
):
    """
    Write GPU DataFrame to CSV format

    High-performance CSV writing with customizable formatting options.
    Writes directly from GPU memory with minimal data transfers.

    Parameters:
        path_or_buf: str, path object, or file-like object
            File path or object to write to
        sep: str, default ','
            Field delimiter character
        na_rep: str, default ''
            String representation of NaN values
        float_format: str, optional
            Format string for floating point numbers
        columns: sequence, optional
            Columns to write
        header: bool or list of str, default True
            Write column names as header
        index: bool, default True
            Write row names (index)
        mode: str, default 'w'
            File mode ('w' for write, 'a' for append)
        compression: str or dict, default 'infer'
            Compression type ('gzip', 'bz2', 'xz', 'zstd', etc.)
        **kwargs: additional keyword arguments
            Other CSV writing options

    Examples:
        # Basic CSV writing
        df.to_csv('output.csv')

        # Custom formatting
        df.to_csv('output.csv', sep=';', index=False, float_format='%.2f')

        # Compressed output
        df.to_csv('output.csv.gz', compression='gzip')
    """
```
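
Putting the reader and writer together, a minimal round-trip sketch using only parameters documented above; the file name, separator, and column names are placeholders.

```python
import cudf

# Read with explicit dtypes and date parsing, then write the result back
# out compressed and without the index.
df = cudf.read_csv(
    "sales.csv",
    sep=";",
    dtype={"store_id": "int32", "amount": "float64"},
    parse_dates=["sale_date"],
)
df.to_csv("sales_clean.csv.gz", index=False, compression="gzip")
```
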

## Parquet I/O

Optimized Apache Parquet support with metadata handling and dataset operations.

```{ .api }
def read_parquet(
    path,
    engine='cudf',
    columns=None,
    filters=None,
    row_groups=None,
    use_pandas_metadata=True,
    storage_options=None,
    bytes_per_thread=None,
    **kwargs
) -> DataFrame:
    """
    Read Apache Parquet file(s) directly into GPU memory

    High-performance Parquet reader with predicate pushdown, column pruning,
    and automatic schema detection. Supports single files, directories, and
    cloud storage locations.

    Parameters:
        path: str, PathLike, or list
            File path, directory, or list of files to read
        engine: str, default 'cudf'
            Parquet engine to use ('cudf' for GPU acceleration)
        columns: list, optional
            Specific columns to read (column pruning)
        filters: list of tuples, optional
            Row filter predicates for predicate pushdown
        row_groups: list, optional
            Specific row groups to read
        use_pandas_metadata: bool, default True
            Whether to use pandas metadata for schema information
        storage_options: dict, optional
            Options for cloud storage (S3, GCS, Azure)
        bytes_per_thread: int, optional
            Bytes to read per thread for parallel I/O
        **kwargs: additional arguments
            Engine-specific options

    Returns:
        DataFrame: GPU DataFrame with Parquet data

    Examples:
        # Basic Parquet reading
        df = cudf.read_parquet('data.parquet')

        # Column pruning and filtering
        df = cudf.read_parquet(
            'data.parquet',
            columns=['col1', 'col2', 'col3'],
            filters=[('col1', '>', 100), ('col2', '==', 'value')]
        )

        # Multiple files
        df = cudf.read_parquet(['file1.parquet', 'file2.parquet'])

        # From cloud storage
        df = cudf.read_parquet(
            's3://bucket/path/data.parquet',
            storage_options={'key': 'access_key', 'secret': 'secret_key'}
        )
    """

def read_parquet_metadata(path, **kwargs) -> object:
    """
    Read metadata from Parquet file without loading data

    Extracts schema information, row group statistics, and file metadata
    for query planning and data exploration without full data loading.

    Parameters:
        path: str or PathLike
            Path to Parquet file
        **kwargs: additional arguments
            Storage and engine options

    Returns:
        object: Parquet metadata object with schema and statistics

    Examples:
        # Read metadata only
        metadata = cudf.io.parquet.read_parquet_metadata('data.parquet')
        print(f"Rows: {metadata.num_rows}")
        print(f"Columns: {len(metadata.schema)}")
    """

def merge_parquet_filemetadata(metadata_list) -> object:
    """
    Merge multiple Parquet file metadata objects

    Combines metadata from multiple Parquet files for unified schema
    and statistics. Useful for dataset-level operations.

    Parameters:
        metadata_list: list
            List of Parquet metadata objects to merge

    Returns:
        object: Merged Parquet metadata object

    Examples:
        # Merge metadata from multiple files
        meta1 = cudf.io.parquet.read_parquet_metadata('file1.parquet')
        meta2 = cudf.io.parquet.read_parquet_metadata('file2.parquet')
        merged = cudf.io.parquet.merge_parquet_filemetadata([meta1, meta2])
    """
```
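
As a rough sketch of how the options above combine, the following reads only part of a (hypothetical) large file by limiting the row groups and columns that are decoded; the path and column names are placeholders.

```python
import cudf

# Decode only the first two row groups and two columns, keeping GPU
# memory usage low for exploratory work on a large file.
df = cudf.read_parquet(
    "large.parquet",
    columns=["user_id", "amount"],
    row_groups=[0, 1],
)
```
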

### Parquet Dataset Operations

```{ .api }
class ParquetDatasetWriter:
    """
    Writer for partitioned Parquet datasets

    Manages writing DataFrames to partitioned Parquet datasets with
    automatic directory structure creation and metadata management.

    Parameters:
        path: str or PathLike
            Root directory for the dataset
        partition_cols: list, optional
            Columns to use for dataset partitioning
        **kwargs: additional arguments
            Writer configuration options

    Methods:
        write_table(table, **kwargs): Write table to dataset
        close(): Finalize dataset and write metadata

    Examples:
        # Create partitioned dataset writer
        writer = cudf.io.parquet.ParquetDatasetWriter(
            '/path/to/dataset',
            partition_cols=['year', 'month']
        )

        # Write data in chunks
        for chunk in data_chunks:
            writer.write_table(chunk)
        writer.close()
    """

def write_to_dataset(
    df,
    root_path,
    partition_cols=None,
    preserve_index=False,
    storage_options=None,
    **kwargs
) -> None:
    """
    Write DataFrame to partitioned Parquet dataset

    Creates partitioned Parquet dataset with automatic directory structure
    based on partition columns. Supports cloud storage destinations.

    Parameters:
        df: DataFrame
            cuDF DataFrame to write
        root_path: str or PathLike
            Root directory for dataset
        partition_cols: list, optional
            Columns to use for partitioning
        preserve_index: bool, default False
            Whether to write index as column
        storage_options: dict, optional
            Cloud storage configuration
        **kwargs: additional arguments
            Writer options (compression, etc.)

    Examples:
        # Write partitioned dataset
        cudf.io.parquet.write_to_dataset(
            df,
            '/path/to/dataset',
            partition_cols=['year', 'category'],
            compression='snappy'
        )
    """
```
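
A minimal end-to-end sketch, assuming a local `/tmp/sales_dataset` directory is acceptable as the dataset root: write a small partitioned dataset with `write_to_dataset`, then read it back with `read_parquet`, using a filter to restrict the result to one partition value.

```python
import cudf

df = cudf.DataFrame({
    "year": [2023, 2023, 2024],
    "category": ["a", "b", "a"],
    "value": [1.0, 2.0, 3.0],
})

# Writes directories such as year=2023/category=a/ under the root path.
cudf.io.parquet.write_to_dataset(
    df, "/tmp/sales_dataset", partition_cols=["year", "category"]
)

# Read the dataset root back, restricting the result to a single year.
recent = cudf.read_parquet(
    "/tmp/sales_dataset",
    filters=[("year", "==", 2024)],
)
```
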

## JSON I/O

Flexible JSON reading with support for various JSON formats.

```{ .api }
def read_json(
    path_or_buf,
    orient='records',
    typ='frame',
    dtype=None,
    lines=False,
    compression='infer',
    storage_options=None,
    **kwargs
) -> DataFrame:
    """
    Read JSON data directly into GPU memory

    Supports various JSON formats including line-delimited JSON (JSONL),
    nested JSON structures, and automatic schema inference.

    Parameters:
        path_or_buf: str, PathLike, or file-like object
            JSON data source (file, URL, or buffer)
        orient: str, default 'records'
            JSON structure format ('records', 'index', 'values', 'split')
        typ: str, default 'frame'
            Type of object to return ('frame' for DataFrame)
        dtype: dict or str, optional
            Data type specification for columns
        lines: bool, default False
            Whether to read line-delimited JSON
        compression: str, default 'infer'
            Compression type ('gzip', 'bz2', 'xz', None)
        storage_options: dict, optional
            Cloud storage configuration
        **kwargs: additional arguments
            JSON parsing options

    Returns:
        DataFrame: GPU DataFrame containing JSON data

    Examples:
        # Read JSON file
        df = cudf.read_json('data.json')

        # Line-delimited JSON
        df = cudf.read_json('data.jsonl', lines=True)

        # With compression
        df = cudf.read_json('data.json.gz', compression='gzip')

        # From URL
        df = cudf.read_json('https://api.example.com/data.json')
    """
```
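
A short round-trip sketch for line-delimited JSON; the file name is a placeholder and `to_json` is the DataFrame writer shown later in this reference.

```python
import cudf

df = cudf.DataFrame({"id": [1, 2, 3], "msg": ["a", "b", "c"]})

# One JSON record per line (JSONL), then read it back the same way.
df.to_json("events.jsonl", orient="records", lines=True)
roundtrip = cudf.read_json("events.jsonl", lines=True)
```
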

## ORC I/O

Apache ORC format support with metadata utilities.

```{ .api }
def read_orc(
    path,
    columns=None,
    filters=None,
    stripes=None,
    skiprows=None,
    num_rows=None,
    use_index=True,
    storage_options=None,
    **kwargs
) -> DataFrame:
    """
    Read Apache ORC file directly into GPU memory

    High-performance ORC reader with predicate pushdown and column pruning.
    Supports compressed ORC files and cloud storage.

    Parameters:
        path: str or PathLike
            Path to ORC file
        columns: list, optional
            Specific columns to read
        filters: list of tuples, optional
            Row filter predicates
        stripes: list, optional
            Specific ORC stripes to read
        skiprows: int, optional
            Number of rows to skip
        num_rows: int, optional
            Maximum rows to read
        use_index: bool, default True
            Whether to use ORC file index
        storage_options: dict, optional
            Cloud storage options
        **kwargs: additional arguments
            Reader configuration

    Returns:
        DataFrame: GPU DataFrame with ORC data

    Examples:
        # Basic ORC reading
        df = cudf.read_orc('data.orc')

        # With column pruning and filtering
        df = cudf.read_orc(
            'data.orc',
            columns=['col1', 'col2'],
            filters=[('col1', '>', 0)]
        )
    """

def read_orc_metadata(path, **kwargs) -> object:
    """
    Read metadata from ORC file without loading data

    Extracts schema, stripe information, and statistics for
    query planning and data exploration.

    Parameters:
        path: str or PathLike
            Path to ORC file
        **kwargs: additional arguments
            Reader options

    Returns:
        object: ORC metadata with schema and statistics

    Examples:
        # Read ORC metadata
        metadata = cudf.io.orc.read_orc_metadata('data.orc')
        print(f"Stripes: {len(metadata.stripes)}")
    """
```

### ORC Writing

```{ .api }
def to_orc(
    path,
    compression='snappy',
    enable_statistics=True,
    stripe_size_bytes=None,
    stripe_size_rows=None,
    row_index_stride=None,
    **kwargs
):
    """
    Write GPU DataFrame to Apache ORC format

    High-performance ORC writing with compression and statistical metadata.
    Writes directly from GPU memory with configurable stripe organization.

    Parameters:
        path: str or PathLike
            Output path for ORC file
        compression: str, default 'snappy'
            Compression algorithm ('snappy', 'zlib', 'lz4', 'zstd', None)
        enable_statistics: bool, default True
            Whether to compute column statistics
        stripe_size_bytes: int, optional
            Target stripe size in bytes
        stripe_size_rows: int, optional
            Target stripe size in rows
        row_index_stride: int, optional
            Row group index stride
        **kwargs: additional keyword arguments
            Other ORC writing options

    Examples:
        # Basic ORC writing
        df.to_orc('output.orc')

        # With compression
        df.to_orc('output.orc', compression='zlib')

        # Custom stripe configuration
        df.to_orc('output.orc', stripe_size_rows=50000)
    """
```
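
A minimal ORC round-trip sketch using the compression values listed above; file and column names are placeholders.

```python
import cudf

df = cudf.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "value": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
})

# Write with zstd compression, then read back only the needed column.
df.to_orc("events.orc", compression="zstd")
subset = cudf.read_orc("events.orc", columns=["value"])
```
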

## Avro I/O

Apache Avro format support for schema evolution and serialization.

```{ .api }
def read_avro(
    filepath_or_buffer,
    columns=None,
    skiprows=None,
    num_rows=None,
    storage_options=None,
    **kwargs
) -> DataFrame:
    """
    Read Apache Avro file directly into GPU memory

    Reads Avro files with automatic schema detection and type conversion.
    Supports compressed Avro files and nested data structures.

    Parameters:
        filepath_or_buffer: str, PathLike, or file-like object
            Avro data source
        columns: list, optional
            Specific columns to read
        skiprows: int, optional
            Number of rows to skip at beginning
        num_rows: int, optional
            Maximum number of rows to read
        storage_options: dict, optional
            Cloud storage configuration
        **kwargs: additional arguments
            Avro reader options

    Returns:
        DataFrame: GPU DataFrame with Avro data

    Examples:
        # Read Avro file
        df = cudf.read_avro('data.avro')

        # With column selection
        df = cudf.read_avro('data.avro', columns=['col1', 'col2'])
    """
```

## Feather I/O

Apache Arrow Feather format for fast serialization.

```{ .api }
def read_feather(
    path,
    columns=None,
    use_threads=True,
    storage_options=None,
    **kwargs
) -> DataFrame:
    """
    Read Apache Feather format file into GPU memory

    Fast binary format based on Apache Arrow for efficient DataFrame
    serialization with preserved data types and metadata.

    Parameters:
        path: str or PathLike
            Path to Feather file
        columns: list, optional
            Subset of columns to read
        use_threads: bool, default True
            Whether to use threading for parallel I/O
        storage_options: dict, optional
            Cloud storage options
        **kwargs: additional arguments
            Reader configuration

    Returns:
        DataFrame: GPU DataFrame with Feather data

    Examples:
        # Read Feather file
        df = cudf.read_feather('data.feather')

        # Column selection
        df = cudf.read_feather('data.feather', columns=['A', 'B'])
    """
```

## HDF5 I/O

HDF5 format support for scientific and numerical data.

```{ .api }
def read_hdf(
    path_or_buf,
    key=None,
    mode='r',
    columns=None,
    start=None,
    stop=None,
    **kwargs
) -> DataFrame:
    """
    Read HDF5 file into GPU memory

    Reads HDF5 datasets with support for hierarchical data organization
    and partial reading of large datasets.

    Parameters:
        path_or_buf: str, PathLike, or file-like object
            HDF5 file source
        key: str, optional
            HDF5 group/dataset key to read
        mode: str, default 'r'
            File access mode
        columns: list, optional
            Subset of columns to read
        start: int, optional
            Starting row position
        stop: int, optional
            Ending row position
        **kwargs: additional arguments
            HDF5 reader options

    Returns:
        DataFrame: GPU DataFrame with HDF5 data

    Examples:
        # Read HDF5 dataset
        df = cudf.read_hdf('data.h5', key='dataset1')

        # Partial reading
        df = cudf.read_hdf('data.h5', key='dataset1', start=1000, stop=2000)
    """
```

## Text I/O

Raw text file reading for unstructured data processing.

```{ .api }
def read_text(
    filepath_or_buffer,
    delimiter=None,
    dtype='str',
    lineterminator='\n',
    skiprows=0,
    skipfooter=0,
    nrows=None,
    na_values=None,
    keep_default_na=True,
    na_filter=True,
    storage_options=None,
    **kwargs
) -> DataFrame:
    """
    Read raw text file line by line into GPU memory

    Reads unstructured text data with each line as a DataFrame row.
    Useful for log files, natural language processing, and custom parsing.

    Parameters:
        filepath_or_buffer: str, PathLike, or file-like object
            Text file source
        delimiter: str, optional
            Line delimiter (default: newline)
        dtype: str, default 'str'
            Data type for text data
        lineterminator: str, default '\n'
            Line termination character
        skiprows: int, default 0
            Number of rows to skip at beginning
        skipfooter: int, default 0
            Number of rows to skip at end
        nrows: int, optional
            Maximum number of lines to read
        na_values: list, optional
            Values to treat as missing
        keep_default_na: bool, default True
            Whether to include default NA values
        na_filter: bool, default True
            Whether to check for missing values
        storage_options: dict, optional
            Cloud storage configuration
        **kwargs: additional arguments
            Text reader options

    Returns:
        DataFrame: GPU DataFrame with one column containing text lines

    Examples:
        # Read text file
        df = cudf.read_text('logfile.txt')

        # With line limits
        df = cudf.read_text('data.txt', nrows=1000)
    """
```
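
A small sketch of a typical use, assuming the lines come back as a single cuDF string column (a Series in current releases) so GPU string methods apply; the log path and filter term are placeholders.

```python
import cudf

# Split the file on newlines: one string element per log line.
lines = cudf.read_text("app.log", delimiter="\n")

# Keep only the lines mentioning ERROR, entirely on the GPU.
errors = lines[lines.str.contains("ERROR")]
```
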

## Interoperability

### DLPack Integration

```{ .api }
def from_dlpack(dlpack_tensor) -> Union[DataFrame, Series]:
    """
    Create cuDF object from DLPack tensor

    Enables zero-copy data sharing between cuDF and other GPU libraries
    that support the DLPack standard (PyTorch, CuPy, JAX, etc.).

    Parameters:
        dlpack_tensor: DLPack tensor object
            GPU tensor in DLPack format

    Returns:
        Union[DataFrame, Series]: cuDF object sharing memory with tensor

    Examples:
        # From PyTorch tensor
        import torch
        tensor = torch.tensor([1.0, 2.0, 3.0, 4.0], device='cuda')
        series = cudf.io.dlpack.from_dlpack(tensor.__dlpack__())

        # From CuPy array
        import cupy
        array = cupy.array([1.0, 2.0, 3.0])
        series = cudf.io.dlpack.from_dlpack(array.toDlpack())
    """
```
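
A round-trip sketch between CuPy and cuDF; the array contents are arbitrary, and `Series.values` is used here as a convenient way to view the column as a CuPy array on the same device.

```python
import cupy
import cudf

# CuPy -> cuDF via DLPack; the data stays on the GPU.
arr = cupy.arange(5, dtype=cupy.float64)
ser = cudf.io.dlpack.from_dlpack(arr.toDlpack())

# cuDF -> CuPy: Series.values exposes the column as a CuPy array.
back = ser.values
```
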

## DataFrame Write Methods

All cuDF DataFrames include write methods for various formats:

```python
# CSV writing
df.to_csv('output.csv', index=False)

# Parquet writing
df.to_parquet('output.parquet', compression='snappy')

# JSON writing
df.to_json('output.json', orient='records', lines=True)

# ORC writing
df.to_orc('output.orc', compression='zlib')

# Feather writing
df.to_feather('output.feather')

# HDF5 writing
df.to_hdf('output.h5', key='dataset', mode='w')
```

## Performance Optimizations

### GPU Memory Management
- **Direct GPU Loading**: All readers load data directly to GPU memory
- **Memory Mapping**: Support for memory-mapped files to reduce memory usage
- **Streaming**: Chunked reading for datasets larger than GPU memory (see the sketch after this list)
- **Zero-Copy**: Minimal memory copying between operations
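
A rough sketch of chunked CSV reading built only from the `skiprows`/`nrows`/`names`/`header` parameters documented above; the path and chunk size are placeholders, and error handling is omitted.

```python
import cudf

# Process a large CSV in fixed-size row chunks so each piece fits in GPU
# memory. 'huge.csv' and the chunk size are placeholders.
chunk_rows = 1_000_000
offset = 0
names = None

while True:
    chunk = cudf.read_csv(
        "huge.csv",
        skiprows=offset + 1 if offset else None,  # +1 skips the header line
        nrows=chunk_rows,
        names=names,                              # reuse names after chunk 1
        header="infer" if offset == 0 else None,
    )
    if len(chunk) == 0:
        break
    if names is None:
        names = list(chunk.columns)
    # ... process the chunk on the GPU ...
    offset += len(chunk)
```
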

### Parallel Processing
- **Multi-threaded I/O**: Parallel file reading with configurable thread counts
- **Column Parallelism**: Independent processing of columns during parsing
- **Compressed Reading**: Hardware-accelerated decompression on the GPU

### Query Optimization
- **Predicate Pushdown**: Filter rows during file reading (see the sketch after this list)
- **Column Pruning**: Read only required columns from files
- **Schema Inference**: Automatic data type detection and optimization
- **Metadata Caching**: Reuse file metadata for repeated operations
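
A short sketch of the effect of predicate pushdown and column pruning, using the documented `columns` and `filters` arguments of `read_parquet`; the file and column names are placeholders.

```python
import cudf

# Full read vs. pruned/filtered read of the same (placeholder) file.
full = cudf.read_parquet("events.parquet")
small = cudf.read_parquet(
    "events.parquet",
    columns=["user_id", "amount"],    # column pruning
    filters=[("amount", ">", 100)],   # predicate pushdown
)

# The pruned read decodes far less data into GPU memory.
print(full.memory_usage().sum(), small.memory_usage().sum())
```
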

## Cloud Storage Support

All I/O functions support cloud storage through `storage_options`:

```python
# Amazon S3
s3_options = {
    'key': 'access_key_id',
    'secret': 'secret_access_key',
    'token': 'session_token'  # optional
}
df = cudf.read_parquet('s3://bucket/path/data.parquet',
                       storage_options=s3_options)

# Google Cloud Storage
gcs_options = {
    'token': 'path/to/service_account.json'
}
df = cudf.read_csv('gs://bucket/data.csv', storage_options=gcs_options)

# Azure Blob Storage
azure_options = {
    'account_name': 'storage_account',
    'account_key': 'account_key'
}
df = cudf.read_json('abfs://container/data.json',
                    storage_options=azure_options)
```