tessl/pypi-datasets

A community-driven, open-source HuggingFace library of datasets for machine learning, with one-line dataloaders, efficient preprocessing, and multi-framework support

data-loading.md

Data Loading

The primary interface for loading datasets from the HuggingFace Hub, local files, or custom data sources. This module provides functions for automatic format detection, streaming for large datasets, and flexible data splitting.

Capabilities

Loading Datasets from Hub and Files

The main entry point for loading datasets, supporting thousands of datasets from the HuggingFace Hub as well as local files in various formats (CSV, JSON, Parquet, etc.).

def load_dataset(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    split: Optional[Union[str, Split, list[str], list[Split]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    keep_in_memory: Optional[bool] = None,
    save_infos: bool = False,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    streaming: bool = False,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    **config_kwargs,
) -> Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]:
    """
    Load a dataset from the Hugging Face Hub, or a local dataset.

    Parameters:
    - path (str): Path or name of the dataset
    - name (str, optional): Name of the dataset configuration
    - data_dir (str, optional): The data_dir of the dataset configuration
    - data_files (str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], optional): Path(s) to source data file(s)
    - split (str, Split, list[str], list[Split], optional): Which split of the data to load
    - cache_dir (str, optional): Directory to read/write data
    - features (Features, optional): Set the features type to use for the dataset
    - download_config (DownloadConfig, optional): Specific download configuration parameters
    - download_mode (DownloadMode or str, optional): Select the download/generation mode
    - verification_mode (VerificationMode or str, optional): Select the verification mode
    - keep_in_memory (bool, optional): Whether to copy the dataset in-memory
    - save_infos (bool): Save the dataset information (checksums/size/splits/...)
    - revision (str, Version, optional): Version of the dataset script to load
    - token (bool or str, optional): Optional string or boolean to use as Bearer token for remote files
    - streaming (bool): If True, do not download the data files; instead, stream the data progressively while iterating over the dataset
    - num_proc (int, optional): Number of processes when downloading and generating the dataset locally
    - storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend
    - **config_kwargs: Keyword arguments to be passed to the BuilderConfig and used in the DatasetBuilder

    Returns:
    - Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]: Depending on split and streaming parameters
    """

Usage Examples:

# Load a dataset from the Hub
dataset = load_dataset("squad")

# Load specific split
train_dataset = load_dataset("squad", split="train")

# Load with streaming for large datasets
streaming_dataset = load_dataset("oscar", "unshuffled_deduplicated_en", streaming=True)

# Load local CSV files
dataset = load_dataset("csv", data_files="my_file.csv")

# Load multiple files with different splits
dataset = load_dataset("csv", data_files={
    "train": ["train1.csv", "train2.csv"],
    "test": "test.csv"
})

Loading Dataset Builders

Load a dataset builder without downloading or generating the dataset itself, which is useful for inspecting dataset information before downloading.

def load_dataset_builder(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    storage_options: Optional[dict] = None,
    **config_kwargs,
) -> DatasetBuilder:
    """
    Load a dataset builder which can be used to inspect dataset information.

    Parameters:
    - path (str): Path or name of the dataset
    - name (str, optional): Name of the dataset configuration
    - data_dir (str, optional): The data_dir of the dataset configuration
    - data_files (str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], optional): Path(s) to source data file(s)
    - cache_dir (str, optional): Directory to read/write data
    - features (Features, optional): Set the dataset features type
    - download_config (DownloadConfig, optional): Specific download configuration parameters
    - download_mode (DownloadMode or str, optional): Select the download/generation mode
    - revision (str, Version, optional): Version of the dataset script to load
    - token (bool or str, optional): Optional string or boolean to use as Bearer token
    - storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend
    - **config_kwargs: Keyword arguments to be passed to the BuilderConfig

    Returns:
    - DatasetBuilder: A DatasetBuilder instance
    """

Loading from Disk

Load datasets that were previously saved to disk using the save_to_disk method.

def load_from_disk(
    dataset_path: PathLike, 
    keep_in_memory: Optional[bool] = None, 
    storage_options: Optional[dict] = None
) -> Union[Dataset, DatasetDict]:
    """
    Load a dataset that was previously saved using save_to_disk from a filesystem using a path.

    Parameters:
    - dataset_path (PathLike): Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train")
    - keep_in_memory (bool, optional): Whether to copy the dataset in-memory
    - storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend

    Returns:
    - Union[Dataset, DatasetDict]: Returns a Dataset if a single dataset was saved, or a DatasetDict if a dictionary of splits was saved
    """

Usage Examples:

# Inspect dataset without downloading
builder = load_dataset_builder("squad")
print(builder.info.description)
print(builder.info.features)

# Load previously saved dataset
dataset = load_from_disk("./my_saved_dataset")

Types

Path Types

from os import PathLike

Download and Verification Modes

class DownloadMode:
    """Download behavior modes."""
    REUSE_DATASET_IF_EXISTS: str = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS: str = "reuse_cache_if_exists" 
    FORCE_REDOWNLOAD: str = "force_redownload"

class VerificationMode:
    """Dataset verification modes."""
    BASIC_CHECKS: str = "basic_checks"
    ALL_CHECKS: str = "all_checks"
    NO_CHECKS: str = "no_checks"

class DownloadConfig:
    """Configuration for download operations."""
    
    def __init__(
        self,
        cache_dir: Optional[Union[str, Path]] = None,
        force_download: bool = False,
        resume_download: bool = False,
        proxies: Optional[Dict[str, str]] = None,
        token: Optional[Union[str, bool]] = None,
        use_etag: bool = True,
        num_proc: Optional[int] = None,
        max_retries: int = 1,
        **kwargs
    ): ...

class ReadInstruction:
    """Reading instruction for specifying dataset subsets and splits."""
    
    def __init__(
        self,
        split_name: str,
        from_: Optional[int] = None,
        to: Optional[int] = None,
        unit: str = 'abs',
    ): ...
    
    @classmethod
    def from_spec(cls, spec: str) -> "ReadInstruction": ...

Install with Tessl CLI

npx tessl i tessl/pypi-datasets
