HuggingFace community-driven open-source library of datasets for machine learning with one-line dataloaders, efficient preprocessing, and multi-framework support
—
Functions for combining, transforming, and manipulating datasets, including concatenation, interleaving, and caching control. These operations enable composition of multiple datasets and fine-grained control over dataset processing behavior.
Functions for combining multiple datasets into unified collections, supporting both vertical (row-wise) and horizontal (column-wise) concatenation, as well as sophisticated interleaving patterns.
def concatenate_datasets(
    dsets: List[Union[Dataset, IterableDataset]],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
) -> Union[Dataset, IterableDataset]:
    """
    Convert a list of datasets with the same schema into a single dataset.

    Parameters:
    - dsets (List[Dataset] or List[IterableDataset]): List of datasets to concatenate.
    - info (DatasetInfo, optional): Dataset information (description, citation, etc.)
      attached to the resulting dataset.
    - split (NamedSplit, optional): Name of the dataset split assigned to the result.
    - axis (int): Axis to concatenate over: 0 appends rows (vertical concatenation),
      1 appends columns (horizontal concatenation). With axis=1 the datasets must
      have the same number of rows.

    Returns:
    - Union[Dataset, IterableDataset]: Concatenated dataset of the same type as the
      input datasets.

    Raises:
    - ValueError: If `dsets` is empty, or if the datasets' features are incompatible
      (schema mismatch).
    """
def interleave_datasets(
    datasets: List[Union[Dataset, IterableDataset]],
    probabilities: Optional[List[float]] = None,
    seed: Optional[int] = None,
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    stopping_strategy: str = "first_exhausted",
) -> Union[Dataset, IterableDataset]:
    """
    Interleave several datasets (sources) into a single dataset by alternating
    between sources.

    With no `probabilities`, examples are taken from each source in turn
    (round-robin). With `probabilities`, the source of each example is drawn at
    random according to those weights; `seed` makes the draws reproducible.

    Parameters:
    - datasets (List[Dataset] or List[IterableDataset]): List of datasets to interleave.
    - probabilities (List[float], optional): Sampling probabilities, one per dataset.
      Must have the same length as `datasets`.
    - seed (int, optional): Random seed used to choose a source for each example.
    - info (DatasetInfo, optional): Dataset information, like description, citation, etc.
    - split (NamedSplit, optional): Name of the dataset split assigned to the result.
    - stopping_strategy (str): "first_exhausted" stops as soon as one dataset runs
      out of examples; "all_exhausted" oversamples until every dataset has been
      fully consumed.

    Returns:
    - Union[Dataset, IterableDataset]: Interleaved dataset of the same type as the
      input datasets.

    Raises:
    - ValueError: If the length of `probabilities` does not match the number of datasets.
"""

Usage Examples:
from datasets import Dataset, concatenate_datasets, interleave_datasets
# Create sample datasets
ds1 = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})
ds2 = Dataset.from_dict({"text": ["foo", "bar"], "label": [1, 0]})
ds3 = Dataset.from_dict({"text": ["alice", "bob"], "label": [0, 1]})
# Concatenate datasets vertically (append rows)
combined = concatenate_datasets([ds1, ds2, ds3])
print(len(combined)) # 6
# Interleave datasets with equal probability
interleaved = interleave_datasets([ds1, ds2, ds3])
print(interleaved["text"]) # ['hello', 'foo', 'alice', 'world', 'bar', 'bob']
# Interleave with custom probabilities
weighted = interleave_datasets([ds1, ds2, ds3], probabilities=[0.7, 0.2, 0.1], seed=42)
# Different stopping strategies
all_exhausted = interleave_datasets([ds1, ds2, ds3], stopping_strategy="all_exhausted")

Global functions for controlling the caching behavior of dataset operations. By default, dataset transformations are cached for reproducibility and performance.
def enable_caching() -> None:
    """
    Enable caching of dataset operations (the default behavior).

    When caching is enabled, transformed data is stored in cache files named
    after the dataset's fingerprint, so a transformation that has already been
    computed can be reloaded from disk instead of recomputed, improving
    performance for repeated operations.
    """
def disable_caching() -> None:
    """
    Disable caching of dataset operations.

    When caching is disabled, existing cache files are ignored and cache files
    are always recreated, forcing every transformation to be recomputed and
    guaranteeing fresh processing of the data.
    """
def is_caching_enabled() -> bool:
    """
    Check whether caching of dataset operations is currently enabled.

    Returns:
    - bool: True if caching is enabled (the default), False otherwise.
"""

Usage Examples:
from datasets import disable_caching, enable_caching, is_caching_enabled, load_dataset
# Check current caching status
print(f"Caching enabled: {is_caching_enabled()}") # True by default
# Disable caching for fresh processing
disable_caching()
dataset = load_dataset("squad", split="train[:100]")
processed = dataset.map(lambda x: {"length": len(x["question"])}) # Always recomputed
# Re-enable caching
enable_caching()
cached_processed = dataset.map(lambda x: {"length": len(x["question"])})  # Uses cache if available

Functions for controlling the display of progress bars during dataset operations, particularly useful for long-running transformations.
def enable_progress_bar() -> None:
    """Enable the display of progress bars during dataset operations."""
def disable_progress_bar() -> None:
    """Disable the display of progress bars during dataset operations."""
def is_progress_bar_enabled() -> bool:
    """
    Check whether progress bars are currently enabled.

    Returns:
    - bool: True if progress bars are enabled, False otherwise.
    """
def enable_progress_bars() -> None:
    """Enable progress bars (plural alias provided for API consistency)."""
def disable_progress_bars() -> None:
    """Disable progress bars (plural alias provided for API consistency)."""
def are_progress_bars_disabled() -> bool:
    """
    Check whether progress bars are currently disabled.

    Returns:
    - bool: True if progress bars are disabled, False otherwise.
"""

Usage Examples:
from datasets import disable_progress_bar, enable_progress_bar, load_dataset
# Disable progress bars for cleaner output
disable_progress_bar()
dataset = load_dataset("squad", split="train")
processed = dataset.map(lambda x: {"length": len(x["question"])}) # No progress bar shown
# Re-enable progress bars
enable_progress_bar()
filtered = processed.filter(lambda x: x["length"] > 10)  # Progress bar displayed

Decorator for marking experimental functionality that may change in future versions.
def experimental(fn):
    """
    Decorator marking `fn` as an experimental feature.

    Features marked as experimental may have their API changed or removed in
    future versions without a deprecation cycle. Use with caution in
    production code.
"""

# Concatenate datasets horizontally (add columns)
# Note: datasets must have the same number of rows
ds1 = Dataset.from_dict({"text": ["hello", "world"]})
ds2 = Dataset.from_dict({"label": [0, 1]})
# Horizontal concatenation (axis=1)
combined = concatenate_datasets([ds1, ds2], axis=1)
print(combined.column_names)  # ['text', 'label']

# Create datasets of different sizes
small_ds = Dataset.from_dict({"text": ["a", "b"]})
medium_ds = Dataset.from_dict({"text": ["c", "d", "e"]})
large_ds = Dataset.from_dict({"text": ["f", "g", "h", "i"]})
# Use probabilities to control sampling
# Higher probability = more examples from that dataset
interleaved = interleave_datasets(
[small_ds, medium_ds, large_ds],
probabilities=[0.1, 0.3, 0.6], # Favor the large dataset
seed=42,
stopping_strategy="all_exhausted" # Ensure all data is used
)

concatenate_datasets creates a new dataset that references the original data.

Common error scenarios and their solutions:
# Schema mismatch in concatenation
try:
ds1 = Dataset.from_dict({"text": ["hello"]})
ds2 = Dataset.from_dict({"label": [0]}) # Different columns
concatenate_datasets([ds1, ds2]) # Will fail
except ValueError as e:
print("Schema mismatch - ensure datasets have compatible features")
# Empty dataset list
try:
concatenate_datasets([]) # Will fail
except ValueError as e:
print("Cannot concatenate empty list of datasets")
# Probability mismatch in interleaving
try:
interleave_datasets([ds1, ds2], probabilities=[0.5]) # Wrong length
except ValueError as e:
print("Probabilities list must match number of datasets")

Install with Tessl CLI:
npx tessl i tessl/pypi-datasets