# Data I/O Operations

Read and write data in various formats with GPU acceleration and automatic cuDF backend integration. All I/O operations leverage the cuDF backend when configured, providing significant performance improvements for compatible data formats.

## Capabilities

### CSV Operations

Read CSV files with GPU acceleration using cuDF's high-performance CSV parser with automatic type inference and memory-efficient streaming.
```python { .api }
def read_csv(*args, **kwargs):
    """
    Read CSV file(s) using cuDF backend.

    Uses dask.dataframe.read_csv with the cudf backend configured.
    Supports all standard CSV reading options plus cuDF-specific optimizations.

    Parameters:
    - path: str or list - File path(s) to read
    - **kwargs: Additional arguments passed to cudf.read_csv via Dask

    Common Parameters:
    - sep: str, default ',' - Field delimiter
    - header: int, None, or 'infer', default 'infer' - Row to use as column names
    - names: list, optional - Column names to use
    - dtype: dict, optional - Data types for columns
    - usecols: list, optional - Columns to read
    - skiprows: int, optional - Rows to skip at start
    - nrows: int, optional - Number of rows to read
    - na_values: list, optional - Values to treat as NaN

    Returns:
    DataFrame - Dask-cuDF DataFrame with CSV data

    Notes:
    - Automatically uses cuDF backend when dataframe.backend="cudf"
    - Supports remote filesystems via fsspec
    - Optimized for large files with automatic partitioning
    """
```
### JSON Operations

Read JSON files and JSON Lines format with GPU acceleration and efficient nested data handling.
```python { .api }
def read_json(*args, **kwargs):
    """
    Read JSON file(s) using cuDF backend.

    Uses dask.dataframe.read_json with the cudf backend configured.
    Supports both standard JSON and JSON Lines formats.

    Parameters:
    - path: str or list - File path(s) to read
    - **kwargs: Additional arguments passed to cudf.read_json via Dask

    Common Parameters:
    - orient: str, default 'records' - JSON orientation
    - lines: bool, default False - Read as JSON Lines format
    - dtype: dict, optional - Data types for columns
    - compression: str, optional - Compression type ('gzip', 'bz2', etc.)

    Returns:
    DataFrame - Dask-cuDF DataFrame with JSON data

    Notes:
    - JSON Lines format recommended for large datasets
    - Supports nested JSON structures
    """
```
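
A minimal reading sketch based on the parameters above; the file path and column names are hypothetical:

```python
import dask_cudf

# Read a gzip-compressed JSON Lines file (hypothetical path and dtypes)
df = dask_cudf.read_json(
    'events.jsonl.gz',
    lines=True,
    compression='gzip',
    dtype={'user_id': 'int64', 'score': 'float64'}
)

# Materialize the result on the GPU
result = df.compute()
```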
### Parquet Operations

Read Parquet files with GPU acceleration using cuDF's optimized Parquet reader with column pruning and predicate pushdown.
```python { .api }
def read_parquet(*args, **kwargs):
    """
    Read Parquet file(s) using cuDF backend.

    Uses dask.dataframe.read_parquet with the cudf backend configured.
    Provides optimized reading with column selection and filtering.

    Parameters:
    - path: str or list - File path(s) or directory to read
    - **kwargs: Additional arguments passed to the Parquet engine via Dask

    Common Parameters:
    - columns: list, optional - Columns to read (column pruning)
    - filters: list, optional - Row filters for predicate pushdown
    - engine: str, default 'cudf' - Parquet engine to use
    - index: str or False, optional - Column to use as index
    - storage_options: dict, optional - Filesystem options

    Returns:
    DataFrame - Dask-cuDF DataFrame with Parquet data

    Notes:
    - Automatically partitions based on Parquet file structure
    - Supports nested column types and complex schemas
    - Optimized for large datasets with efficient memory usage
    """
```
### ORC Operations

Read ORC (Optimized Row Columnar) files with GPU acceleration for high-performance columnar data access.

```python { .api }
def read_orc(*args, **kwargs):
    """
    Read ORC file(s) using cuDF backend.

    Uses dask.dataframe.read_orc with the cudf backend configured.
    Optimized for ORC's columnar format with GPU acceleration.

    Parameters:
    - path: str or list - File path(s) to read
    - **kwargs: Additional arguments passed to cudf.read_orc via Dask

    Common Parameters:
    - columns: list, optional - Columns to read
    - stripes: list, optional - Stripe indices to read
    - skiprows: int, optional - Rows to skip
    - num_rows: int, optional - Number of rows to read

    Returns:
    DataFrame - Dask-cuDF DataFrame with ORC data

    Notes:
    - Leverages ORC's built-in compression and encoding
    - Supports complex nested data types
    - Optimized stripe-level reading for large files
    """
```
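
A small hedged sketch of ORC reading (the file path is hypothetical; see the parameter list above for stripe-level options):

```python
import dask_cudf

# Read only the needed columns from an ORC file (hypothetical path)
df = dask_cudf.read_orc('data.orc', columns=['id', 'value'])

# Trigger computation on the GPU-backed partitions
result = df.compute()
```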
### Text Operations

Read delimited text files with GPU acceleration using cuDF's text parsing capabilities.

```python { .api }
def read_text(path, chunksize="256 MiB", **kwargs):
    """
    Read text files using cuDF backend.

    Available in both expression and legacy modes. In expression mode,
    uses the DataFrame.read_text method; in legacy mode, uses a direct
    implementation.

    Parameters:
    - path: str or list - File path(s) to read
    - chunksize: str or int, default "256 MiB" - Size of each partition
    - **kwargs: Additional arguments passed to cudf.read_text

    Common Parameters:
    - delimiter: str - Text delimiter for parsing
    - byte_range: tuple, optional - (offset, size) byte range to read

    Returns:
    DataFrame - Dask-cuDF DataFrame with parsed text data

    Notes:
    - Conditional availability based on the DASK_DATAFRAME__QUERY_PLANNING setting
    - Supports large text files with automatic chunking
    - Uses cuDF's optimized text parsing capabilities
    """
```
### Deprecated I/O Interface

Legacy I/O functions are available through the `dask_cudf.io` module; they are deprecated in favor of the top-level functions.

```python { .api }
# Deprecated - use dask_cudf.read_csv instead
dask_cudf.io.read_csv(*args, **kwargs)

# Deprecated - use dask_cudf.read_json instead
dask_cudf.io.read_json(*args, **kwargs)

# Deprecated - use dask_cudf.read_parquet instead
dask_cudf.io.read_parquet(*args, **kwargs)

# Deprecated - use dask_cudf.read_orc instead
dask_cudf.io.read_orc(*args, **kwargs)

# Deprecated - use the DataFrame.to_parquet method instead
dask_cudf.io.to_parquet(df, path, **kwargs)

def to_orc(df, path, **kwargs):
    """
    Write DataFrame to ORC format.

    DEPRECATED: This function is deprecated and will be removed.
    Use the DataFrame.to_orc method instead.

    Parameters:
    - df: DataFrame - DataFrame to write
    - path: str - Output path
    - **kwargs: Additional arguments

    Raises:
    NotImplementedError - Function is no longer supported

    Notes:
    - Legacy implementation available via dask_cudf._legacy.io.to_orc
    - Recommended migration: df.to_orc(path, **kwargs)
    """
```
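
Following the migration notes above, a minimal sketch of the recommended write path (file paths are hypothetical):

```python
import dask_cudf

df = dask_cudf.read_csv('data.csv')

# Recommended replacements for the deprecated module-level writers
df.to_parquet('output_parquet/')  # instead of dask_cudf.io.to_parquet(df, ...)
df.to_orc('output_orc/')          # instead of dask_cudf.io.to_orc(df, ...)
```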
## Usage Examples

### Reading CSV Files

```python
import dask_cudf

# Read single CSV file
df = dask_cudf.read_csv('data.csv')

# Read multiple CSV files with pattern
df = dask_cudf.read_csv('data/*.csv')

# Read with specific options
df = dask_cudf.read_csv(
    'data.csv',
    dtype={'id': 'int64', 'value': 'float64'},
    usecols=['id', 'value', 'category'],
    skiprows=1
)

result = df.compute()
```

### Reading Parquet with Filters

```python
# Read Parquet with column selection and filtering
df = dask_cudf.read_parquet(
    'data.parquet',
    columns=['id', 'value', 'timestamp'],
    filters=[('timestamp', '>=', '2023-01-01')]
)

# Process filtered data
summary = df.groupby('id')['value'].mean()
result = summary.compute()
```

### Working with Remote Data

```python
# Read from S3 with storage options
df = dask_cudf.read_parquet(
    's3://bucket/data/',
    storage_options={
        'key': 'access_key',
        'secret': 'secret_key'
    }
)

# Read JSON Lines from remote location
df = dask_cudf.read_json(
    's3://bucket/jsonl_data/*.jsonl',
    lines=True,
    storage_options={'anon': True}
)
```

### Configuration for Automatic Backend

```python
import dask
import dask.dataframe as dd

# Configure cuDF backend globally
dask.config.set({"dataframe.backend": "cudf"})

# Now standard Dask functions use the cuDF backend
df = dd.read_csv('data.csv')  # Automatically uses cuDF
result = df.groupby('category').sum().compute()  # GPU-accelerated
```
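
Because `dask.config.set` also works as a context manager, the backend choice can be scoped to a block instead of set globally; a minimal sketch:

```python
# Scope the cuDF backend to a single block of work
with dask.config.set({"dataframe.backend": "cudf"}):
    gdf = dd.read_parquet('data.parquet')  # cuDF-backed within this block
    preview = gdf.head()
```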

### Reading Text Files

```python
# Read large text files with automatic chunking
df = dask_cudf.read_text(
    'large_text_file.txt',
    delimiter='\n',
    chunksize='128 MiB'
)

# Read a specific byte range
df_range = dask_cudf.read_text(
    'data.txt',
    delimiter='|',
    byte_range=(1000, 5000)  # Read 5000 bytes starting at offset 1000
)

result = df.compute()
```