# Data I/O Operations

Read and write data in various formats with GPU acceleration and automatic cuDF backend integration. All I/O operations leverage the cuDF backend when configured, providing significant performance improvements for compatible data formats.

## Capabilities

### CSV Operations

Read CSV files with GPU acceleration using cuDF's high-performance CSV parser with automatic type inference and memory-efficient streaming.

```python { .api }
def read_csv(*args, **kwargs):
    """
    Read CSV file(s) using cuDF backend.

    Uses dask.dataframe.read_csv with cudf backend configured.
    Supports all standard CSV reading options plus cuDF-specific optimizations.

    Parameters:
    - path: str or list - File path(s) to read
    - **kwargs: Additional arguments passed to cudf.read_csv via Dask

    Common Parameters:
    - sep: str, default ',' - Field delimiter
    - header: int, 'infer', or None, default 'infer' - Row to use as column names
    - names: list, optional - Column names to use
    - dtype: dict, optional - Data types for columns
    - usecols: list, optional - Columns to read
    - skiprows: int, optional - Rows to skip at start
    - nrows: int, optional - Number of rows to read
    - na_values: list, optional - Values to treat as NaN

    Returns:
    DataFrame - Dask-cuDF DataFrame with CSV data

    Notes:
    - Automatically uses cuDF backend when dataframe.backend="cudf"
    - Supports remote filesystems via fsspec
    - Optimized for large files with automatic partitioning
    """
```

### JSON Operations

Read JSON and JSON Lines files with GPU acceleration and efficient handling of nested data.

```python { .api }
def read_json(*args, **kwargs):
    """
    Read JSON file(s) using cuDF backend.

    Uses dask.dataframe.read_json with cudf backend configured.
    Supports both standard JSON and JSON Lines formats.

    Parameters:
    - path: str or list - File path(s) to read
    - **kwargs: Additional arguments passed to cudf.read_json via Dask

    Common Parameters:
    - orient: str, default 'records' - JSON orientation
    - lines: bool, default False - Read as JSON Lines format
    - dtype: dict, optional - Data types for columns
    - compression: str, optional - Compression type ('gzip', 'bz2', etc.)

    Returns:
    DataFrame - Dask-cuDF DataFrame with JSON data

    Notes:
    - JSON Lines format recommended for large datasets
    - Supports nested JSON structures with automatic flattening
    """
```
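
For a quick local check of the options above, the sketch below reads gzip-compressed JSON Lines input; the file name `events.jsonl.gz` is a placeholder, not part of this API.

```python
import dask_cudf

# Read gzip-compressed JSON Lines data (placeholder file name)
df = dask_cudf.read_json(
    'events.jsonl.gz',
    lines=True,
    compression='gzip'
)

result = df.compute()
```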

### Parquet Operations

Read Parquet files with GPU acceleration using cuDF's optimized Parquet reader, with column pruning and predicate pushdown.

```python { .api }
def read_parquet(*args, **kwargs):
    """
    Read Parquet file(s) using cuDF backend.

    Uses dask.dataframe.read_parquet with cudf backend configured.
    Provides optimized reading with column selection and filtering.

    Parameters:
    - path: str or list - File path(s) or directory to read
    - **kwargs: Additional arguments passed via Dask

    Common Parameters:
    - columns: list, optional - Columns to read (column pruning)
    - filters: list, optional - Row filters for predicate pushdown
    - engine: str, default 'cudf' - Parquet engine to use
    - index: str or False, optional - Column to use as index
    - storage_options: dict, optional - Filesystem options

    Returns:
    DataFrame - Dask-cuDF DataFrame with Parquet data

    Notes:
    - Automatically partitions based on Parquet file structure
    - Supports nested column types and complex schemas
    - Optimized for large datasets with efficient memory usage
    """
```

### ORC Operations

Read ORC (Optimized Row Columnar) files with GPU acceleration for high-performance columnar data access.

```python { .api }
def read_orc(*args, **kwargs):
    """
    Read ORC file(s) using cuDF backend.

    Uses dask.dataframe.read_orc with cudf backend configured.
    Optimized for ORC's columnar format with GPU acceleration.

    Parameters:
    - path: str or list - File path(s) to read
    - **kwargs: Additional arguments passed to cudf.read_orc via Dask

    Common Parameters:
    - columns: list, optional - Columns to read
    - stripes: list, optional - Stripe indices to read
    - skiprows: int, optional - Rows to skip
    - num_rows: int, optional - Number of rows to read

    Returns:
    DataFrame - Dask-cuDF DataFrame with ORC data

    Notes:
    - Leverages ORC's built-in compression and encoding
    - Supports complex nested data types
    - Optimized stripe-level reading for large files
    """

def read_text(path, chunksize="256 MiB", **kwargs):
    """
    Read text files using cuDF backend.

    Available in both expression and legacy modes. In expression mode,
    uses the DataFrame.read_text method. In legacy mode, uses the direct implementation.

    Parameters:
    - path: str or list - File path(s) to read
    - chunksize: str or int, default "256 MiB" - Size of each partition
    - **kwargs: Additional arguments passed to cudf.read_text

    Common Parameters:
    - delimiter: str - Text delimiter for parsing
    - byte_range: tuple, optional - (offset, size) for reading a specific byte range

    Returns:
    DataFrame - Dask-cuDF DataFrame with parsed text data

    Notes:
    - Conditional availability based on DASK_DATAFRAME__QUERY_PLANNING setting
    - Supports large text files with automatic chunking
    - Uses cuDF's optimized text parsing capabilities
    """
```
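
A minimal sketch of the ORC reader above with column selection; the file path and column names are placeholders.

```python
import dask_cudf

# Read only the needed columns from an ORC file (placeholder path and columns)
df = dask_cudf.read_orc(
    'data.orc',
    columns=['id', 'value']
)

result = df.compute()
```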

### Deprecated I/O Interface

Legacy I/O functions are available through the `dask_cudf.io` module; they are deprecated in favor of the top-level functions.

```python { .api }
# Deprecated - use dask_cudf.read_csv instead
dask_cudf.io.read_csv(*args, **kwargs)

# Deprecated - use dask_cudf.read_json instead
dask_cudf.io.read_json(*args, **kwargs)

# Deprecated - use dask_cudf.read_parquet instead
dask_cudf.io.read_parquet(*args, **kwargs)

# Deprecated - use dask_cudf.read_orc instead
dask_cudf.io.read_orc(*args, **kwargs)

# Deprecated - use DataFrame.to_parquet method instead
dask_cudf.io.to_parquet(df, path, **kwargs)

def to_orc(df, path, **kwargs):
    """
    Write DataFrame to ORC format.

    DEPRECATED: This function is deprecated and will be removed.
    Use DataFrame.to_orc method instead.

    Parameters:
    - df: DataFrame - DataFrame to write
    - path: str - Output path
    - **kwargs: Additional arguments

    Raises:
    NotImplementedError - Function is no longer supported

    Notes:
    - Legacy implementation available via dask_cudf._legacy.io.to_orc
    - Recommended migration: df.to_orc(path, **kwargs)
    """
```
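
The docstring above spells out the migration path for ORC writes; a minimal before/after sketch, with `output_orc/` as a placeholder path:

```python
import dask_cudf

df = dask_cudf.read_csv('data.csv')

# Deprecated module-level writer (raises NotImplementedError):
# dask_cudf.io.to_orc(df, 'output_orc/')

# Recommended replacement - the DataFrame method:
df.to_orc('output_orc/')
```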

## Usage Examples

### Reading CSV Files

```python
import dask_cudf

# Read single CSV file
df = dask_cudf.read_csv('data.csv')

# Read multiple CSV files with pattern
df = dask_cudf.read_csv('data/*.csv')

# Read with specific options
df = dask_cudf.read_csv(
    'data.csv',
    dtype={'id': 'int64', 'value': 'float64'},
    usecols=['id', 'value', 'category'],
    skiprows=1
)

result = df.compute()
```

### Reading Parquet with Filters

```python
# Read Parquet with column selection and filtering
df = dask_cudf.read_parquet(
    'data.parquet',
    columns=['id', 'value', 'timestamp'],
    filters=[('timestamp', '>=', '2023-01-01')]
)

# Process filtered data
summary = df.groupby('id')['value'].mean()
result = summary.compute()
```

### Working with Remote Data

```python
# Read from S3 with storage options
df = dask_cudf.read_parquet(
    's3://bucket/data/',
    storage_options={
        'key': 'access_key',
        'secret': 'secret_key'
    }
)

# Read JSON Lines from remote location
df = dask_cudf.read_json(
    's3://bucket/jsonl_data/*.jsonl',
    lines=True,
    storage_options={'anon': True}
)
```

### Configuration for Automatic Backend

```python
import dask
import dask.dataframe as dd

# Configure cuDF backend globally
dask.config.set({"dataframe.backend": "cudf"})

# Now standard Dask functions use cuDF backend
df = dd.read_csv('data.csv')  # Automatically uses cuDF
result = df.groupby('category').sum().compute()  # GPU-accelerated
```

### Reading Text Files

```python
# Read large text files with automatic chunking
df = dask_cudf.read_text(
    'large_text_file.txt',
    delimiter='\n',
    chunksize='128 MiB'
)

# Read with a specific byte range
df_range = dask_cudf.read_text(
    'data.txt',
    delimiter='|',
    byte_range=(1000, 5000)  # Read 5000 bytes starting at offset 1000
)

result = df.compute()
```
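
### Writing Data

The reading examples above have write-side counterparts on the DataFrame itself, as noted in the deprecated-interface section. A minimal sketch; the `compression` and `partition_on` keyword arguments are standard Dask `to_parquet` options assumed here rather than documented above, and the output path is a placeholder.

```python
import dask_cudf

df = dask_cudf.read_csv('data/*.csv')

# Write a directory of Parquet files (assumed standard Dask to_parquet options)
df.to_parquet(
    'output_parquet/',
    compression='snappy',
    partition_on=['category']
)
```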