# Data I/O Operations

Read and write data in various formats with GPU acceleration and automatic cuDF backend integration. All I/O operations leverage the cuDF backend when configured, providing significant performance improvements for compatible data formats.

## Capabilities

### CSV Operations

Read CSV files with GPU acceleration using cuDF's high-performance CSV parser with automatic type inference and memory-efficient streaming.
```python { .api }
def read_csv(*args, **kwargs):
    """
    Read CSV file(s) using cuDF backend.

    Uses dask.dataframe.read_csv with the cudf backend configured.
    Supports all standard CSV reading options plus cuDF-specific optimizations.

    Parameters:
    - path: str or list - File path(s) to read
    - **kwargs: Additional arguments passed to cudf.read_csv via Dask

    Common Parameters:
    - sep: str, default ',' - Field delimiter
    - header: int, None, or 'infer', default 'infer' - Row to use as column names
    - names: list, optional - Column names to use
    - dtype: dict, optional - Data types for columns
    - usecols: list, optional - Columns to read
    - skiprows: int, optional - Rows to skip at start
    - nrows: int, optional - Number of rows to read
    - na_values: list, optional - Values to treat as NaN

    Returns:
    DataFrame - Dask-cuDF DataFrame with CSV data

    Notes:
    - Automatically uses cuDF backend when dataframe.backend="cudf"
    - Supports remote filesystems via fsspec
    - Optimized for large files with automatic partitioning
    """
```
### JSON Operations

Read JSON files and JSON Lines format with GPU acceleration and efficient nested data handling.
```python { .api }
def read_json(*args, **kwargs):
    """
    Read JSON file(s) using cuDF backend.

    Uses dask.dataframe.read_json with the cudf backend configured.
    Supports both standard JSON and JSON Lines formats.

    Parameters:
    - path: str or list - File path(s) to read
    - **kwargs: Additional arguments passed to cudf.read_json via Dask

    Common Parameters:
    - orient: str, default 'records' - JSON orientation
    - lines: bool, default False - Read as JSON Lines format
    - dtype: dict, optional - Data types for columns
    - compression: str, optional - Compression type ('gzip', 'bz2', etc.)

    Returns:
    DataFrame - Dask-cuDF DataFrame with JSON data

    Notes:
    - JSON Lines format recommended for large datasets
    - Supports nested JSON structures
    """
```
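
A minimal reading sketch based on the parameters above; the file path and column names are hypothetical:

```python
import dask_cudf

# Read a gzip-compressed JSON Lines file (hypothetical path and dtypes)
df = dask_cudf.read_json(
    'events.jsonl.gz',
    lines=True,
    compression='gzip',
    dtype={'user_id': 'int64', 'score': 'float64'}
)

# Materialize the result on the GPU
result = df.compute()
```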
### Parquet Operations

Read Parquet files with GPU acceleration using cuDF's optimized Parquet reader with column pruning and predicate pushdown.
```python { .api }
def read_parquet(*args, **kwargs):
    """
    Read Parquet file(s) using cuDF backend.

    Uses dask.dataframe.read_parquet with the cudf backend configured.
    Provides optimized reading with column selection and filtering.

    Parameters:
    - path: str or list - File path(s) or directory to read
    - **kwargs: Additional arguments passed to the Parquet engine via Dask

    Common Parameters:
    - columns: list, optional - Columns to read (column pruning)
    - filters: list, optional - Row filters for predicate pushdown
    - engine: str, default 'cudf' - Parquet engine to use
    - index: str or False, optional - Column to use as index
    - storage_options: dict, optional - Filesystem options

    Returns:
    DataFrame - Dask-cuDF DataFrame with Parquet data

    Notes:
    - Automatically partitions based on Parquet file structure
    - Supports nested column types and complex schemas
    - Optimized for large datasets with efficient memory usage
    """
```
### ORC Operations

Read ORC (Optimized Row Columnar) files with GPU acceleration for high-performance columnar data access.

```python { .api }
def read_orc(*args, **kwargs):
    """
    Read ORC file(s) using cuDF backend.

    Uses dask.dataframe.read_orc with the cudf backend configured.
    Optimized for ORC's columnar format with GPU acceleration.

    Parameters:
    - path: str or list - File path(s) to read
    - **kwargs: Additional arguments passed to cudf.read_orc via Dask

    Common Parameters:
    - columns: list, optional - Columns to read
    - stripes: list, optional - Stripe indices to read
    - skiprows: int, optional - Rows to skip
    - num_rows: int, optional - Number of rows to read

    Returns:
    DataFrame - Dask-cuDF DataFrame with ORC data

    Notes:
    - Leverages ORC's built-in compression and encoding
    - Supports complex nested data types
    - Optimized stripe-level reading for large files
    """
```
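
A small hedged sketch of ORC reading (the file path is hypothetical; see the parameter list above for stripe-level options):

```python
import dask_cudf

# Read only the needed columns from an ORC file (hypothetical path)
df = dask_cudf.read_orc('data.orc', columns=['id', 'value'])

# Trigger computation on the GPU-backed partitions
result = df.compute()
```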
### Text Operations

Read delimited text files with GPU acceleration using cuDF's text parsing capabilities.

```python { .api }
def read_text(path, chunksize="256 MiB", **kwargs):
    """
    Read text files using cuDF backend.

    Available in both expression and legacy modes. In expression mode,
    uses the DataFrame.read_text method; in legacy mode, uses a direct
    implementation.

    Parameters:
    - path: str or list - File path(s) to read
    - chunksize: str or int, default "256 MiB" - Size of each partition
    - **kwargs: Additional arguments passed to cudf.read_text

    Common Parameters:
    - delimiter: str - Text delimiter for parsing
    - byte_range: tuple, optional - (offset, size) byte range to read

    Returns:
    DataFrame - Dask-cuDF DataFrame with parsed text data

    Notes:
    - Conditional availability based on the DASK_DATAFRAME__QUERY_PLANNING setting
    - Supports large text files with automatic chunking
    - Uses cuDF's optimized text parsing capabilities
    """
```
### Deprecated I/O Interface

Legacy I/O functions are available through the `dask_cudf.io` module; they are deprecated in favor of the top-level functions.

```python { .api }
# Deprecated - use dask_cudf.read_csv instead
dask_cudf.io.read_csv(*args, **kwargs)

# Deprecated - use dask_cudf.read_json instead
dask_cudf.io.read_json(*args, **kwargs)

# Deprecated - use dask_cudf.read_parquet instead
dask_cudf.io.read_parquet(*args, **kwargs)

# Deprecated - use dask_cudf.read_orc instead
dask_cudf.io.read_orc(*args, **kwargs)

# Deprecated - use the DataFrame.to_parquet method instead
dask_cudf.io.to_parquet(df, path, **kwargs)

def to_orc(df, path, **kwargs):
    """
    Write DataFrame to ORC format.

    DEPRECATED: This function is deprecated and will be removed.
    Use the DataFrame.to_orc method instead.

    Parameters:
    - df: DataFrame - DataFrame to write
    - path: str - Output path
    - **kwargs: Additional arguments

    Raises:
    NotImplementedError - Function is no longer supported

    Notes:
    - Legacy implementation available via dask_cudf._legacy.io.to_orc
    - Recommended migration: df.to_orc(path, **kwargs)
    """
```
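
Following the migration notes above, a minimal sketch of the recommended write path (file paths are hypothetical):

```python
import dask_cudf

df = dask_cudf.read_csv('data.csv')

# Recommended replacements for the deprecated module-level writers
df.to_parquet('output_parquet/')  # instead of dask_cudf.io.to_parquet(df, ...)
df.to_orc('output_orc/')          # instead of dask_cudf.io.to_orc(df, ...)
```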
## Usage Examples

### Reading CSV Files

```python
import dask_cudf

# Read single CSV file
df = dask_cudf.read_csv('data.csv')

# Read multiple CSV files with pattern
df = dask_cudf.read_csv('data/*.csv')

# Read with specific options
df = dask_cudf.read_csv(
    'data.csv',
    dtype={'id': 'int64', 'value': 'float64'},
    usecols=['id', 'value', 'category'],
    skiprows=1
)

result = df.compute()
```

### Reading Parquet with Filters

```python
# Read Parquet with column selection and filtering
df = dask_cudf.read_parquet(
    'data.parquet',
    columns=['id', 'value', 'timestamp'],
    filters=[('timestamp', '>=', '2023-01-01')]
)

# Process filtered data
summary = df.groupby('id')['value'].mean()
result = summary.compute()
```

### Working with Remote Data

```python
# Read from S3 with storage options
df = dask_cudf.read_parquet(
    's3://bucket/data/',
    storage_options={
        'key': 'access_key',
        'secret': 'secret_key'
    }
)

# Read JSON Lines from remote location
df = dask_cudf.read_json(
    's3://bucket/jsonl_data/*.jsonl',
    lines=True,
    storage_options={'anon': True}
)
```

### Configuration for Automatic Backend

```python
import dask
import dask.dataframe as dd

# Configure cuDF backend globally
dask.config.set({"dataframe.backend": "cudf"})

# Now standard Dask functions use the cuDF backend
df = dd.read_csv('data.csv')  # Automatically uses cuDF
result = df.groupby('category').sum().compute()  # GPU-accelerated
```
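
Because `dask.config.set` also works as a context manager, the backend choice can be scoped to a block instead of set globally; a minimal sketch:

```python
# Scope the cuDF backend to a single block of work
with dask.config.set({"dataframe.backend": "cudf"}):
    gdf = dd.read_parquet('data.parquet')  # cuDF-backed within this block
    preview = gdf.head()
```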

### Reading Text Files

```python
# Read large text files with automatic chunking
df = dask_cudf.read_text(
    'large_text_file.txt',
    delimiter='\n',
    chunksize='128 MiB'
)

# Read a specific byte range
df_range = dask_cudf.read_text(
    'data.txt',
    delimiter='|',
    byte_range=(1000, 5000)  # Read 5000 bytes starting at offset 1000
)

result = df.compute()
```