tessl/pypi-pooch

A friend to fetch your data files

Utilities and Helpers

Helper functions for locating cache directories, handling version strings, hashing files, creating registries, and accessing Pooch's logger.

Capabilities

Cache Management

Functions for managing cache directories and storage locations across different operating systems.

def os_cache(project: str) -> pathlib.Path:
    """
    Get the default cache location for the operating system.

    Parameters:
    - project: The name of your project. The cache folder is placed under this name in the appropriate OS cache directory

    Returns:
    The default cache location for your OS, as a pathlib.Path
    """

Version Handling

Utilities for handling version strings and development versions in data management workflows.

def check_version(version: str, fallback: str = "master") -> str:
    """
    Return the version unchanged, or the fallback if it is a development version.

    Development versions are those with a local segment after a "+" (for example "1.2.3+dev").

    Parameters:
    - version: The version string to check
    - fallback: The name to return for development versions

    Returns:
    The version string, or the fallback if it is a development version
    """
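
To make the rule concrete, here is a simplified stand-in, not Pooch's implementation. It treats any version containing a "+" (a PEP 440 local segment, such as the suffix some versioning tools append for unreleased commits) as a development version. The name `resolve_version` is hypothetical, and the real function is understood to parse the string as a PEP 440 version, so it may reject inputs that this sketch accepts.

```python
def resolve_version(version: str, fallback: str = "master") -> str:
    """Simplified stand-in for the check described above (hypothetical helper)."""
    # A "+" introduces a PEP 440 local segment, e.g. "1.2.3+dev"
    # or "0.1+10.8dasdf"; treat such versions as development builds.
    if "+" in version:
        return fallback
    return version


release = resolve_version("1.2.3")
development = resolve_version("1.2.3+dev", fallback="main")
```

Release versions pass through untouched, so the same call can be used for both tagged releases and development checkouts.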

File Hashing

Functions for calculating and verifying file hashes to ensure data integrity.

def file_hash(fname: str, alg: str = "sha256") -> str:
    """
    Calculate the hash of a given file.

    Parameters:
    - fname: The path to the file
    - alg: The hashing algorithm to use. Supported algorithms include 'sha256', 'sha1', 'md5', and others available in hashlib

    Returns:
    The hash of the file as a hexadecimal string
    """
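
For readers curious what hashing a file involves, the following is a minimal sketch using only the standard library. It is an illustration of the general technique (reading in chunks and feeding `hashlib`), not Pooch's own code; `chunked_file_hash` and the `chunk_size` parameter are hypothetical names chosen for this example.

```python
import hashlib
import os
import tempfile


def chunked_file_hash(fname: str, alg: str = "sha256", chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    hasher = hashlib.new(alg)
    with open(fname, "rb") as stream:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demonstration on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as stream:
        stream.write(b"pooch" * 100_000)
    digest = chunked_file_hash(sample)
    md5_digest = chunked_file_hash(sample, alg="md5")
```

The result is a lowercase hexadecimal string, which is the form used in registry entries.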

Registry Management

Functions for creating and managing file registries with hash information.

def make_registry(directory: str, output: str, recursive: bool = True) -> None:
    """
    Create a registry file with the hashes of all files in a directory.

    Parameters:
    - directory: The directory for which to create the registry
    - output: The path to the output registry file
    - recursive: If True, will include files in subdirectories
    """
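
The sketch below approximates the kind of file such a function produces: one line per file, holding the path relative to the directory and its SHA256 hash. It uses only the standard library and is an assumption-laden illustration rather than Pooch's implementation; `write_registry` is a hypothetical name, and details such as file ordering may differ from the real output.

```python
import hashlib
import tempfile
from pathlib import Path


def write_registry(directory: str, output: str, recursive: bool = True) -> None:
    """Write 'relative/path sha256hex' lines for the files under directory."""
    root = Path(directory)
    pattern = "**/*" if recursive else "*"
    files = sorted(p for p in root.glob(pattern) if p.is_file())
    with open(output, "w") as registry:
        for path in files:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            # as_posix() keeps forward slashes on every platform
            registry.write(f"{path.relative_to(root).as_posix()} {digest}\n")


# Demonstration on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    (data / "subdir").mkdir(parents=True)
    (data / "file1.csv").write_bytes(b"a,b\n1,2\n")
    (data / "subdir" / "file2.txt").write_bytes(b"hello\n")
    registry_path = Path(tmp) / "registry.txt"
    write_registry(str(data), str(registry_path))
    lines = registry_path.read_text().splitlines()
```

Writing the registry outside the hashed directory, as done here, keeps the registry file from listing itself.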

Logging

Access to Pooch's internal logging system for debugging and monitoring.

def get_logger() -> logging.Logger:
    """
    Get the default Pooch logger.

    Returns:
    The logger object for Pooch
    """
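
Because the returned object is a standard `logging.Logger`, ordinary logging patterns apply to it. As one possible pattern, the sketch below changes a logger's level only for the duration of a `with` block, which can be handy for silencing download messages around a single call. The helper `temporary_level` is hypothetical, and the demonstration uses a stand-in logger so that it runs without Pooch installed.

```python
import logging
from contextlib import contextmanager


@contextmanager
def temporary_level(logger: logging.Logger, level: int):
    """Set logger to level inside the with block, then restore the old level."""
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        logger.setLevel(previous)


# Stand-in logger for the demonstration; with Pooch installed you would
# pass pooch.get_logger() instead.
demo_logger = logging.Logger("demo")
demo_logger.setLevel(logging.INFO)

with temporary_level(demo_logger, logging.WARNING):
    level_inside = demo_logger.level

level_after = demo_logger.level
```

The `try`/`finally` ensures the previous level is restored even if the wrapped code raises.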

Usage Examples

Setting Up Cache Directory

import pooch

# Get OS-appropriate cache directory for your project
cache_dir = pooch.os_cache("myproject")
print(f"Cache directory: {cache_dir}")

# Use in data manager creation
data_manager = pooch.create(
    path=cache_dir,
    base_url="https://example.com/data/",
)

Version Management

import pooch

# Handle version strings for development builds
version = "1.2.3+dev"
safe_version = pooch.check_version(version, fallback="main")
print(f"Using version: {safe_version}")  # Will use "main" for dev version

# Use in URL formatting
base_url = "https://github.com/myproject/data/raw/{version}/"
formatted_url = base_url.format(version=safe_version)

File Hash Calculation

import pooch

# Calculate SHA256 hash (default)
hash_value = pooch.file_hash("data.csv")
print(f"SHA256: {hash_value}")

# Calculate MD5 hash
md5_hash = pooch.file_hash("data.csv", alg="md5")
print(f"MD5: {md5_hash}")

# Use in registry
registry = {
    "data.csv": f"sha256:{hash_value}",
    "readme.txt": f"md5:{md5_hash}",
}

Registry Creation

import pooch

# Create registry for all files in a directory
pooch.make_registry("./data", "./registry.txt", recursive=True)

# Each line of registry.txt holds a file name (relative to ./data) and its SHA256 hash:
# file1.csv abc123...
# subdir/file2.txt def456...

# Load registry into Pooch manager
data_manager = pooch.create(
    path=pooch.os_cache("myproject"),
    base_url="https://example.com/data/",
)
data_manager.load_registry("./registry.txt")

Logging Configuration

import pooch
import logging

# Get Pooch logger
logger = pooch.get_logger()

# Configure logging level
logger.setLevel(logging.DEBUG)

# Add custom handler
handler = logging.FileHandler("pooch.log")
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)

# Now Pooch operations will be logged
fname = pooch.retrieve(
    "https://example.com/data.csv",
    known_hash="md5:abc123...",
)

Complete Workflow Example

import logging
import os

import pooch

# Set up project data management
project_name = "my-analysis-project"
cache_dir = pooch.os_cache(project_name)

# Create registry from existing data directory
if os.path.exists("./reference_data"):
    pooch.make_registry("./reference_data", "./data_registry.txt")

# Set up data manager
data_manager = pooch.create(
    path=cache_dir,
    base_url="https://github.com/myuser/my-analysis-project/raw/{version}/data/",
    # create() applies the development-version check itself:
    # versions with a "+" local segment resolve to version_dev
    version="1.0.0+dev",
    version_dev="main",
)

# Load registry
if os.path.exists("./data_registry.txt"):
    data_manager.load_registry("./data_registry.txt")

# Configure logging
logger = pooch.get_logger()
logger.setLevel(logging.INFO)

# Fetch data files
dataset1 = data_manager.fetch("dataset1.csv")
dataset2 = data_manager.fetch("dataset2.zip", processor=pooch.Unzip())

print(f"Dataset 1: {dataset1}")
print(f"Dataset 2 files: {dataset2}")

Hash Verification

import pooch

# Verify file integrity
def verify_file_integrity(filepath, expected_hash):
    """Verify a file's integrity against an expected hash."""
    # Accept both "alg:hash" and bare hashes (assumed to be SHA256)
    if ":" in expected_hash:
        alg, expected = expected_hash.split(":", 1)
    else:
        alg, expected = "sha256", expected_hash

    # Hash the file once, with the algorithm named in the expected hash
    actual_hash = pooch.file_hash(filepath, alg=alg)
    return actual_hash == expected.lower()

# Example usage
file_ok = verify_file_integrity("data.csv", "md5:abc123def456...")
if file_ok:
    print("File integrity verified!")
else:
    print("File may be corrupted!")

Install with Tessl CLI

npx tessl i tessl/pypi-pooch
