
**tessl/pypi-datasets**

HuggingFace community-driven open-source library of datasets for machine learning with one-line dataloaders, efficient preprocessing, and multi-framework support.

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: pkg:pypi/datasets@4.0.x

To install, run:

```
npx @tessl/cli install tessl/pypi-datasets@4.0.0
```

# HuggingFace Datasets

A comprehensive dataset management library for loading, processing, and working with machine learning datasets. It offers one-line dataloaders for thousands of public datasets from the HuggingFace Hub, efficient pre-processing backed by memory-mapped Apache Arrow storage so large datasets are not limited by available RAM, and built-in interoperability with major ML frameworks including NumPy, pandas, PyTorch, TensorFlow, and JAX.

## Package Information

- **Package Name**: datasets
- **Language**: Python
- **Installation**: `pip install datasets`

## Core Imports

```python
import datasets
```

Common patterns for loading and working with datasets:

```python
from datasets import load_dataset, Dataset, DatasetDict
from datasets import concatenate_datasets, interleave_datasets
```

## Basic Usage

```python
from datasets import load_dataset

# Load a dataset from the Hub
dataset = load_dataset("squad", split="train")

# Access dataset features and data
print(dataset.features)
print(len(dataset))
print(dataset[0])

# Apply transformations
def preprocess(example):
    example["question_length"] = len(example["question"])
    return example

dataset = dataset.map(preprocess)

# Convert to different formats
torch_dataset = dataset.with_format("torch")
pandas_df = dataset.to_pandas()

# Save to disk
dataset.save_to_disk("./my_dataset")
```

## Architecture

The datasets library is built around these key components:

- **Dataset Classes**: `Dataset` for map-style access and `IterableDataset` for streaming large datasets
- **Loading System**: `load_dataset()` function with automatic discovery of dataset builders
- **Features System**: Comprehensive type definitions for structured data (text, audio, images, etc.)
- **Arrow Backend**: Memory-mapped storage using Apache Arrow for efficient data handling
- **Caching System**: Fingerprint-based caching for reproducible data processing
- **Hub Integration**: Direct access to thousands of datasets from the HuggingFace Hub

This design enables efficient processing of datasets ranging from small research datasets to massive production corpora, with seamless integration into popular ML frameworks and automatic optimization through caching and memory mapping.
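
For example, the same `load_dataset()` call can return a streaming `IterableDataset` that reads records lazily instead of materializing the full dataset first (a minimal sketch; "squad" is just an illustrative dataset name):

```python
from datasets import load_dataset

# Stream records instead of downloading and caching the whole dataset
streamed = load_dataset("squad", split="train", streaming=True)

# Rows are produced lazily as the iterator advances
for example in streamed.take(3):
    print(example["question"])
```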

## Capabilities

### Data Loading and Discovery

The primary interface for loading datasets from the HuggingFace Hub, local files, or custom data sources. Supports automatic format detection, streaming for large datasets, and flexible data splitting.

```python { .api }
def load_dataset(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    split: Optional[Union[str, Split]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    keep_in_memory: Optional[bool] = None,
    save_infos: bool = False,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    streaming: bool = False,
    num_proc: Optional[int] = None,
    storage_options: Optional[Dict] = None,
    trust_remote_code: bool = None,
    **config_kwargs,
) -> Union[Dataset, DatasetDict, IterableDataset, IterableDatasetDict]:
    """Load a dataset from the HuggingFace Hub, local files, or custom sources."""

def load_dataset_builder(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    storage_options: Optional[Dict] = None,
    trust_remote_code: bool = None,
    **config_kwargs,
) -> DatasetBuilder:
    """Load a dataset builder without building the dataset."""

def load_from_disk(dataset_path: str, fs=None, keep_in_memory: Optional[bool] = None) -> Union[Dataset, DatasetDict]:
    """Load a dataset that was previously saved using save_to_disk."""
```
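
As a quick illustration, a builder can be used to inspect a dataset's metadata before downloading any data (a minimal sketch; the dataset name is illustrative):

```python
from datasets import load_dataset_builder

# Fetch only the metadata, not the data files
builder = load_dataset_builder("squad")
print(builder.info.description)
print(builder.info.features)
print(builder.info.splits)
```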

[Data Loading](./data-loading.md)

### Core Dataset Classes

The fundamental dataset classes that provide different access patterns and capabilities for working with dataset collections.

```python { .api }
class Dataset:
    """Map-style dataset backed by Apache Arrow for efficient random access."""

    def __getitem__(self, key): ...
    def __len__(self) -> int: ...
    def map(self, function, **kwargs) -> "Dataset": ...
    def filter(self, function, **kwargs) -> "Dataset": ...
    def select(self, indices) -> "Dataset": ...
    def with_format(self, type: Optional[str] = None, **kwargs) -> "Dataset": ...
    def to_pandas(self) -> "pandas.DataFrame": ...
    def save_to_disk(self, dataset_path: str) -> None: ...

class DatasetDict(dict):
    """Dictionary of Dataset objects, typically for train/validation/test splits."""

    def map(self, function, **kwargs) -> "DatasetDict": ...
    def filter(self, function, **kwargs) -> "DatasetDict": ...
    def with_format(self, type: Optional[str] = None, **kwargs) -> "DatasetDict": ...
    def save_to_disk(self, dataset_dict_path: str) -> None: ...

class IterableDataset:
    """Iterable-style dataset for streaming large datasets without loading into memory."""

    def __iter__(self): ...
    def map(self, function, **kwargs) -> "IterableDataset": ...
    def filter(self, function, **kwargs) -> "IterableDataset": ...
    def take(self, n: int) -> "IterableDataset": ...
    def skip(self, n: int) -> "IterableDataset": ...
```
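
A short sketch of how these classes compose in practice, reusing the illustrative "squad" dataset from above:

```python
from datasets import load_dataset

dataset = load_dataset("squad", split="train")

# Keep only short questions, then take the first few matching rows
short = dataset.filter(lambda ex: len(ex["question"]) < 80)
subset = short.select(range(min(100, len(short))))
print(len(subset), subset[0]["question"])
```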

[Core Dataset Classes](./core-dataset-classes.md)

### Dataset Operations

Functions for combining, transforming, and manipulating datasets, including concatenation, interleaving, and caching control.

```python { .api }
def concatenate_datasets(
    dsets: List[Dataset],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
) -> Dataset:
    """Concatenate multiple Dataset objects."""

def interleave_datasets(
    datasets: List[Union[Dataset, IterableDataset]],
    probabilities: Optional[List[float]] = None,
    seed: Optional[int] = None,
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    stopping_strategy: str = "first_exhausted",
) -> Union[Dataset, IterableDataset]:
    """Interleave multiple datasets."""

def enable_caching() -> None:
    """Enable caching of dataset operations."""

def disable_caching() -> None:
    """Disable caching of dataset operations."""

def is_caching_enabled() -> bool:
    """Check if caching is currently enabled."""
```
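
A brief sketch of combining two small in-memory datasets with these helpers (the data is illustrative):

```python
from datasets import Dataset, concatenate_datasets, interleave_datasets

a = Dataset.from_dict({"text": ["a1", "a2"], "label": [0, 0]})
b = Dataset.from_dict({"text": ["b1", "b2"], "label": [1, 1]})

# Stack rows from both datasets (schemas must match)
combined = concatenate_datasets([a, b])

# Alternate between sources, sampling with the given probabilities
mixed = interleave_datasets([a, b], probabilities=[0.5, 0.5], seed=42)
print(len(combined), mixed[0])
```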

[Dataset Operations](./dataset-operations.md)

### Features and Type System

Comprehensive type system for defining and validating dataset schemas, supporting primitive types, complex nested structures, and multimedia data.

```python { .api }
class Features(dict):
    """Dictionary-like container for dataset features with type validation."""

    def encode_example(self, example: dict) -> dict: ...
    def decode_example(self, example: dict) -> dict: ...

class Value:
    """Feature for primitive data types (int32, float64, string, bool, etc.)."""

    def __init__(self, dtype: str, id: Optional[str] = None): ...

class ClassLabel:
    """Feature for categorical/classification labels."""

    def __init__(
        self,
        num_classes: Optional[int] = None,
        names: Optional[List[str]] = None,
        names_file: Optional[str] = None,
        id: Optional[str] = None,
    ): ...

class Audio:
    """Feature for audio data with automatic format handling."""

    def __init__(self, sampling_rate: Optional[int] = None, mono: bool = True, decode: bool = True): ...

class Image:
    """Feature for image data with automatic format handling."""

    def __init__(self, decode: bool = True, id: Optional[str] = None): ...
```
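
For example, a schema for a small classification dataset might be declared like this (a minimal sketch with illustrative data):

```python
from datasets import Dataset, Features, Value, ClassLabel

features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["negative", "positive"]),
})

# The "label" column is validated and encoded against the schema
ds = Dataset.from_dict(
    {"text": ["great movie", "terrible plot"], "label": [1, 0]},
    features=features,
)
print(ds.features["label"].int2str(ds[0]["label"]))  # "positive"
```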

[Features and Types](./features-and-types.md)

### Dataset Building

Classes and utilities for creating custom dataset builders and configurations for new datasets.

```python { .api }
class DatasetBuilder(ABC):
    """Abstract base class for dataset builders."""

    def download_and_prepare(self, download_config: Optional[DownloadConfig] = None, **kwargs) -> None: ...
    def as_dataset(self, split: Optional[Split] = None, **kwargs) -> Union[Dataset, DatasetDict]: ...

class GeneratorBasedBuilder(DatasetBuilder):
    """Dataset builder for datasets generated from Python generators."""

    def _generate_examples(self, **kwargs): ...

class BuilderConfig:
    """Configuration class for dataset builders."""

    def __init__(
        self,
        name: str = "default",
        version: Optional[Union[str, Version]] = "0.0.0",
        data_dir: Optional[str] = None,
        data_files: Optional[DataFilesDict] = None,
        description: Optional[str] = None,
    ): ...
```
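
A minimal sketch of a custom generator-based builder; the class name and example data are illustrative, not part of the library:

```python
import datasets

class MyCorpus(datasets.GeneratorBasedBuilder):
    """Toy builder that yields a handful of hard-coded examples."""

    def _info(self):
        return datasets.DatasetInfo(
            description="Toy corpus",
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN)]

    def _generate_examples(self):
        for idx, text in enumerate(["hello", "world"]):
            yield idx, {"text": text}

builder = MyCorpus()
builder.download_and_prepare()
dataset = builder.as_dataset(split="train")
```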

[Dataset Building](./dataset-building.md)

### Dataset Information and Inspection

Functions and classes for inspecting dataset metadata, configurations, and available splits.

```python { .api }
class DatasetInfo:
    """Container for dataset metadata and information."""

    description: str
    features: Optional[Features]
    total_num_examples: Optional[int]
    splits: Optional[SplitDict]
    supervised_keys: Optional[SupervisedKeysData]
    version: Optional[Version]
    license: Optional[str]
    citation: Optional[str]

def get_dataset_config_names(path: str, **kwargs) -> List[str]:
    """Get available configuration names for a dataset."""

def get_dataset_split_names(path: str, config_name: Optional[str] = None, **kwargs) -> List[str]:
    """Get available split names for a dataset."""

def get_dataset_infos(path: str, **kwargs) -> Dict[str, DatasetInfo]:
    """Get information about all configurations of a dataset."""
```
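
For instance, available configurations and splits can be listed before loading anything (the "glue" dataset name is illustrative):

```python
from datasets import get_dataset_config_names, get_dataset_split_names

print(get_dataset_config_names("glue"))         # e.g. ["cola", "sst2", ...]
print(get_dataset_split_names("glue", "sst2"))  # e.g. ["train", "validation", "test"]
```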

[Dataset Information](./dataset-information.md)

## Types

```python { .api }
class Split:
    """Standard dataset splits."""
    TRAIN: str = "train"
    TEST: str = "test"
    VALIDATION: str = "validation"

class DownloadMode:
    """Download behavior modes."""
    REUSE_DATASET_IF_EXISTS: str = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS: str = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD: str = "force_redownload"

class VerificationMode:
    """Dataset verification modes."""
    BASIC_CHECKS: str = "basic_checks"
    ALL_CHECKS: str = "all_checks"
    NO_CHECKS: str = "no_checks"
```
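
These values are typically passed straight to `load_dataset()` (a minimal sketch; the dataset name is illustrative):

```python
from datasets import load_dataset, DownloadMode, VerificationMode

dataset = load_dataset(
    "squad",
    split="train",
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
    verification_mode=VerificationMode.BASIC_CHECKS,
)
```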