# Data Loading
The primary interface for loading datasets from the HuggingFace Hub, local files, or custom data sources. This module provides functions for automatic format detection, streaming for large datasets, and flexible data splitting.
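The extension-based part of that automatic format detection can be sketched as follows. This is an illustrative approximation only: the builder names mirror the packaged modules (`csv`, `json`, `parquet`, `text`), but the library's real resolution logic is more involved and also inspects Hub metadata.

```python
from pathlib import Path

# Illustrative mapping from file extension to the packaged builder name
# that load_dataset would typically select for local files (assumption:
# simplified from the library's actual resolution rules).
EXTENSION_TO_BUILDER = {
    ".csv": "csv",
    ".tsv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".txt": "text",
}

def infer_builder(filename: str) -> str:
    """Return the builder name for a data file, defaulting to 'text'."""
    return EXTENSION_TO_BUILDER.get(Path(filename).suffix.lower(), "text")
```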
## Capabilities
### Loading Datasets from Hub and Files
The main entry point for loading datasets, supporting thousands of datasets from the HuggingFace Hub as well as local files in various formats (CSV, JSON, Parquet, etc.).
```python { .api }
def load_dataset(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    split: Optional[Union[str, Split, list[str], list[Split]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    keep_in_memory: Optional[bool] = None,
    save_infos: bool = False,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    streaming: bool = False,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    **config_kwargs,
) -> Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]:
    """
    Load a dataset from the Hugging Face Hub, or a local dataset.

    Parameters:
    - path (str): Path or name of the dataset
    - name (str, optional): Name of the dataset configuration
    - data_dir (str, optional): data_dir of the dataset configuration
    - data_files (str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], optional): Path(s) to source data file(s)
    - split (str, Split, list[str], list[Split], optional): Which split of the data to load
    - cache_dir (str, optional): Directory to read/write data
    - features (Features, optional): Set the features type to use for this dataset
    - download_config (DownloadConfig, optional): Specific download configuration parameters
    - download_mode (DownloadMode or str, optional): Select the download/generation mode
    - verification_mode (VerificationMode or str, optional): Select the verification mode
    - keep_in_memory (bool, optional): Whether to copy the dataset in-memory
    - save_infos (bool): Save the dataset information (checksums/size/splits/...)
    - revision (str, Version, optional): Version of the dataset script to load
    - token (bool or str, optional): Optional string or boolean to use as Bearer token for remote files
    - streaming (bool): If True, the data files are not downloaded; the dataset is streamed progressively while iterating
    - num_proc (int, optional): Number of processes when downloading and generating the dataset locally
    - storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend
    - **config_kwargs: Keyword arguments to be passed to the BuilderConfig and used in the DatasetBuilder

    Returns:
    - Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]: Depending on the split and streaming parameters
    """
```
**Usage Examples:**
```python
# Load a dataset from the Hub
dataset = load_dataset("squad")

# Load a specific split
train_dataset = load_dataset("squad", split="train")

# Load with streaming for large datasets
streaming_dataset = load_dataset("oscar", "unshuffled_deduplicated_en", streaming=True)

# Load local CSV files
dataset = load_dataset("csv", data_files="my_file.csv")

# Load multiple files with different splits
dataset = load_dataset("csv", data_files={
    "train": ["train1.csv", "train2.csv"],
    "test": "test.csv",
})
```
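The `data_files` argument accepts three shapes: a single path, a list of paths, or a mapping from split names to either. How these shapes collapse to a common form can be sketched as below; this is an illustrative normalization (assumption: a bare string or list is assigned to a single `train` split, as in the examples above), not the library's internal code.

```python
from collections.abc import Mapping, Sequence
from typing import Union

DataFiles = Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]

def normalize_data_files(data_files: DataFiles) -> dict:
    """Normalize the accepted data_files shapes into {split_name: [paths]}.

    Illustrative sketch: a bare string or list of paths is treated as the
    'train' split; a mapping keeps its split names, with string values
    promoted to one-element lists.
    """
    if isinstance(data_files, str):
        return {"train": [data_files]}
    if isinstance(data_files, Mapping):
        return {
            split: [files] if isinstance(files, str) else list(files)
            for split, files in data_files.items()
        }
    return {"train": list(data_files)}
```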
### Loading Dataset Builders
Load a dataset builder without building the dataset, useful for inspecting dataset information before downloading.
```python { .api }
def load_dataset_builder(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    storage_options: Optional[dict] = None,
    **config_kwargs,
) -> DatasetBuilder:
    """
    Load a dataset builder which can be used to inspect dataset information.

    Parameters:
    - path (str): Path or name of the dataset
    - name (str, optional): Name of the dataset configuration
    - data_dir (str, optional): data_dir of the dataset configuration
    - data_files (str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], optional): Path(s) to source data file(s)
    - cache_dir (str, optional): Directory to read/write data
    - features (Features, optional): Set the features type to use for this dataset
    - download_config (DownloadConfig, optional): Specific download configuration parameters
    - download_mode (DownloadMode or str, optional): Select the download/generation mode
    - revision (str, Version, optional): Version of the dataset script to load
    - token (bool or str, optional): Optional string or boolean to use as Bearer token
    - storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend
    - **config_kwargs: Keyword arguments to be passed to the BuilderConfig

    Returns:
    - DatasetBuilder: A DatasetBuilder instance
    """
```
### Loading from Disk
Load datasets that were previously saved to disk using the save_to_disk method.
```python { .api }
def load_from_disk(
    dataset_path: PathLike,
    keep_in_memory: Optional[bool] = None,
    storage_options: Optional[dict] = None,
) -> Union[Dataset, DatasetDict]:
    """
    Load a dataset that was previously saved using save_to_disk, from a local path or remote URI.

    Parameters:
    - dataset_path (PathLike): Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train")
    - keep_in_memory (bool, optional): Whether to copy the dataset in-memory
    - storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend

    Returns:
    - Union[Dataset, DatasetDict]: If the saved dataset is a Dataset, returns Dataset; if it is a DatasetDict, returns DatasetDict
    """
```
**Usage Examples:**
```python
# Inspect dataset without downloading
builder = load_dataset_builder("squad")
print(builder.info.description)
print(builder.info.features)

# Load previously saved dataset
dataset = load_from_disk("./my_saved_dataset")
```
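How `load_from_disk` decides whether to return a `Dataset` or a `DatasetDict` can be sketched via the marker files that `save_to_disk` writes. Treat the file names below as an assumption about the on-disk layout (a `DatasetDict` directory carries a `dataset_dict.json` listing its splits, while a single `Dataset` directory carries `dataset_info.json`); this is an illustration, not the library's implementation.

```python
import json
import os
import tempfile

def saved_dataset_kind(dataset_path: str) -> str:
    """Illustrative sketch: inspect marker files to classify a saved dataset.

    Assumed layout: DatasetDict directories contain dataset_dict.json,
    single Dataset directories contain dataset_info.json.
    """
    if os.path.isfile(os.path.join(dataset_path, "dataset_dict.json")):
        return "DatasetDict"
    if os.path.isfile(os.path.join(dataset_path, "dataset_info.json")):
        return "Dataset"
    raise FileNotFoundError(f"No saved dataset found at {dataset_path}")

# Demo on a fake directory layout:
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "dataset_dict.json"), "w") as f:
        json.dump({"splits": ["train", "test"]}, f)
    kind = saved_dataset_kind(root)
    print(kind)  # DatasetDict
```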
## Types
### Path Types
```python { .api }
from os import PathLike
```
### Download and Verification Modes
```python { .api }
class DownloadMode:
    """Download behavior modes."""
    REUSE_DATASET_IF_EXISTS: str = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS: str = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD: str = "force_redownload"

class VerificationMode:
    """Dataset verification modes."""
    BASIC_CHECKS: str = "basic_checks"
    ALL_CHECKS: str = "all_checks"
    NO_CHECKS: str = "no_checks"

class DownloadConfig:
    """Configuration for download operations."""

    def __init__(
        self,
        cache_dir: Optional[Union[str, Path]] = None,
        force_download: bool = False,
        resume_download: bool = False,
        proxies: Optional[Dict[str, str]] = None,
        token: Optional[Union[str, bool]] = None,
        use_etag: bool = True,
        num_proc: Optional[int] = None,
        max_retries: int = 1,
        **kwargs,
    ): ...

class ReadInstruction:
    """Reading instruction for specifying dataset subsets and splits."""

    def __init__(
        self,
        split_name: str,
        from_: Optional[int] = None,
        to: Optional[int] = None,
        unit: str = "abs",
    ): ...

    @classmethod
    def from_spec(cls, spec: str) -> "ReadInstruction": ...
```
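The spec strings accepted by `ReadInstruction.from_spec` (and by the `split` argument of `load_dataset`) look like `"train"`, `"test[:100]"`, `"train[25%:75%]"`, with `"+"` joining multiple instructions. A minimal sketch of parsing that notation is shown below; it is an illustrative approximation written for this document, not the library's parser, and handles only the simple forms just listed.

```python
import re

# Matches one instruction: a split name with an optional [from:to] slice,
# where the bounds may be absolute row counts or percentages.
_SPEC_RE = re.compile(r"^(?P<split>\w+)(\[(?P<from>-?\d+%?)?:(?P<to>-?\d+%?)?\])?$")

def parse_spec(spec: str) -> list:
    """Parse a split spec string into a list of instruction dicts (sketch)."""
    parts = []
    for piece in spec.split("+"):  # "+" joins multiple instructions
        m = _SPEC_RE.match(piece.strip())
        if m is None:
            raise ValueError(f"bad spec: {piece!r}")
        bounds = (m.group("from") or "") + (m.group("to") or "")
        parts.append({
            "split_name": m.group("split"),
            "from_": m.group("from"),
            "to": m.group("to"),
            "unit": "%" if "%" in bounds else "abs",
        })
    return parts
```

For example, `parse_spec("train[25%:75%]")` yields one instruction with `unit` set to `"%"`, mirroring the `unit` parameter of `ReadInstruction.__init__` above.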