# Data Loading

The primary interface for loading datasets from the Hugging Face Hub, local files, or custom data sources. This module provides functions for automatic format detection, streaming for large datasets, and flexible data splitting.

## Capabilities

### Loading Datasets from Hub and Files

The main entry point for loading datasets, supporting thousands of datasets from the Hugging Face Hub as well as local files in various formats (CSV, JSON, Parquet, etc.).

```python { .api }
def load_dataset(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    split: Optional[Union[str, Split, list[str], list[Split]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    keep_in_memory: Optional[bool] = None,
    save_infos: bool = False,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    streaming: bool = False,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    **config_kwargs,
) -> Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]:
    """
    Load a dataset from the Hugging Face Hub, or a local dataset.

    Parameters:
    - path (str): Path or name of the dataset
    - name (str, optional): Name of the dataset configuration
    - data_dir (str, optional): data_dir of the dataset configuration
    - data_files (str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], optional): Path(s) to source data file(s)
    - split (str, Split, list[str], list[Split], optional): Which split of the data to load
    - cache_dir (str, optional): Directory to read/write data
    - features (Features, optional): Set the features type to use for this dataset
    - download_config (DownloadConfig, optional): Specific download configuration parameters
    - download_mode (DownloadMode or str, optional): Select the download/generation mode
    - verification_mode (VerificationMode or str, optional): Select the verification mode
    - keep_in_memory (bool, optional): Whether to copy the dataset in memory
    - save_infos (bool): Save the dataset information (checksums/size/splits/...)
    - revision (str, Version, optional): Version of the dataset script to load
    - token (bool or str, optional): String or boolean to use as Bearer token for remote files
    - streaming (bool): If True, don't download the data files; instead, stream the data progressively while iterating
    - num_proc (int, optional): Number of processes when downloading and generating the dataset locally
    - storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend
    - **config_kwargs: Keyword arguments to be passed to the BuilderConfig and used in the DatasetBuilder

    Returns:
    - Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]: The loaded dataset; the exact type depends on the split and streaming parameters
    """
```

**Usage Examples:**

```python
from datasets import load_dataset

# Load all splits of a dataset from the Hub
dataset = load_dataset("squad")

# Load a specific split
train_dataset = load_dataset("squad", split="train")

# Stream a large dataset instead of downloading it
streaming_dataset = load_dataset("oscar", "unshuffled_deduplicated_en", streaming=True)

# Load local CSV files
dataset = load_dataset("csv", data_files="my_file.csv")

# Load multiple files into different splits
dataset = load_dataset("csv", data_files={
    "train": ["train1.csv", "train2.csv"],
    "test": "test.csv",
})
```
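
The `split` argument also accepts slice expressions, which select a subset of a split without loading everything. A brief sketch using the library's split-slicing syntax (dataset names are illustrative):

```python
from datasets import load_dataset

# First 100 examples of the train split
subset = load_dataset("squad", split="train[:100]")

# Percentage-based slicing
head = load_dataset("squad", split="train[:10%]")

# Concatenate splits into a single Dataset
combined = load_dataset("squad", split="train+validation")
```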

### Loading Dataset Builders

Load a dataset builder without building the dataset, which is useful for inspecting dataset information before downloading.

```python { .api }
def load_dataset_builder(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    storage_options: Optional[dict] = None,
    **config_kwargs,
) -> DatasetBuilder:
    """
    Load a dataset builder which can be used to inspect dataset information.

    Parameters:
    - path (str): Path or name of the dataset
    - name (str, optional): Name of the dataset configuration
    - data_dir (str, optional): data_dir of the dataset configuration
    - data_files (str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], optional): Path(s) to source data file(s)
    - cache_dir (str, optional): Directory to read/write data
    - features (Features, optional): Set the features type to use for this dataset
    - download_config (DownloadConfig, optional): Specific download configuration parameters
    - download_mode (DownloadMode or str, optional): Select the download/generation mode
    - revision (str, Version, optional): Version of the dataset script to load
    - token (bool or str, optional): String or boolean to use as Bearer token
    - storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend
    - **config_kwargs: Keyword arguments to be passed to the BuilderConfig

    Returns:
    - DatasetBuilder: A DatasetBuilder instance
    """
```

### Loading from Disk

Load datasets that were previously saved to disk using the `save_to_disk` method.

```python { .api }
def load_from_disk(
    dataset_path: PathLike,
    keep_in_memory: Optional[bool] = None,
    storage_options: Optional[dict] = None,
) -> Union[Dataset, DatasetDict]:
    """
    Load a dataset that was previously saved using save_to_disk, from a local path or remote URI.

    Parameters:
    - dataset_path (PathLike): Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train")
    - keep_in_memory (bool, optional): Whether to copy the dataset in memory
    - storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend

    Returns:
    - Union[Dataset, DatasetDict]: A Dataset if the saved dataset is a Dataset, a DatasetDict if it is a DatasetDict
    """
```

**Usage Examples:**

```python
from datasets import load_dataset_builder, load_from_disk

# Inspect a dataset without downloading it
builder = load_dataset_builder("squad")
print(builder.info.description)
print(builder.info.features)

# Load a previously saved dataset
dataset = load_from_disk("./my_saved_dataset")
```
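
Beyond inspection, a builder can also materialize the dataset it describes. A minimal sketch using the standard `DatasetBuilder` workflow of `download_and_prepare` followed by `as_dataset`:

```python
from datasets import load_dataset_builder

builder = load_dataset_builder("squad")

# Download the source files and prepare them in the local cache
builder.download_and_prepare()

# Construct a Dataset from the prepared files
train_dataset = builder.as_dataset(split="train")
```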

## Types

### Path Types

```python { .api }
from os import PathLike
```
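
In practice, both plain strings and `pathlib.Path` objects are accepted wherever a `PathLike` is expected. A small illustration, reusing the save directory from the examples above:

```python
from pathlib import Path
from datasets import load_from_disk

# A str path and a pathlib.Path are interchangeable here
dataset = load_from_disk("./my_saved_dataset")
dataset = load_from_disk(Path("./my_saved_dataset"))
```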

### Download and Verification Modes

```python { .api }
class DownloadMode:
    """Download behavior modes."""
    REUSE_DATASET_IF_EXISTS: str = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS: str = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD: str = "force_redownload"

class VerificationMode:
    """Dataset verification modes."""
    BASIC_CHECKS: str = "basic_checks"
    ALL_CHECKS: str = "all_checks"
    NO_CHECKS: str = "no_checks"

class DownloadConfig:
    """Configuration for download operations."""

    def __init__(
        self,
        cache_dir: Optional[Union[str, Path]] = None,
        force_download: bool = False,
        resume_download: bool = False,
        proxies: Optional[Dict[str, str]] = None,
        token: Optional[Union[str, bool]] = None,
        use_etag: bool = True,
        num_proc: Optional[int] = None,
        max_retries: int = 1,
        **kwargs,
    ): ...

class ReadInstruction:
    """Reading instruction for specifying dataset subsets and splits."""

    def __init__(
        self,
        split_name: str,
        from_: Optional[int] = None,
        to: Optional[int] = None,
        unit: str = "abs",
    ): ...

    @classmethod
    def from_spec(cls, spec: str) -> "ReadInstruction": ...
```
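
These types plug into `load_dataset` as shown in the brief sketch below; the enum-style attributes are interchangeable with their string forms, and a `ReadInstruction` can stand in for a split string (dataset names are illustrative):

```python
from datasets import (
    load_dataset,
    DownloadConfig,
    DownloadMode,
    VerificationMode,
    ReadInstruction,
)

# Force a fresh download and run the full verification suite
dataset = load_dataset(
    "squad",
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
    verification_mode=VerificationMode.ALL_CHECKS,
)

# String forms are accepted as well
dataset = load_dataset("squad", download_mode="force_redownload")

# Retry flaky downloads via a DownloadConfig
config = DownloadConfig(max_retries=5)
dataset = load_dataset("squad", download_config=config)

# A ReadInstruction is an alternative to split slice strings
instruction = ReadInstruction("train", from_=0, to=100, unit="abs")
dataset = load_dataset("squad", split=instruction)

# Equivalent spec-string form
instruction = ReadInstruction.from_spec("train[:100]")
```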