tessl/pypi-pooch

A friend to fetch your data files

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/pooch@1.8.x

To install, run

npx @tessl/cli install tessl/pypi-pooch@1.8.0


Pooch

Pooch is a Python library that manages data files by downloading them from servers (HTTP, FTP, and data repositories such as Zenodo and figshare) only when needed and storing them in a local data cache. It is pure Python with minimal dependencies, ships built-in post-processors for unzipping and decompressing data, and is designed to be extended with custom downloaders and processors.

Package Information

  • Package Name: pooch
  • Language: Python
  • Installation: pip install pooch

Core Imports

import pooch

For common usage patterns:

from pooch import retrieve, create, Pooch

Basic Usage

import pooch

# Download a single file with hash verification
fname = pooch.retrieve(
    url="https://github.com/fatiando/pooch/raw/v1.8.2/data/tiny-data.txt",
    known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
)

# For managing multiple files, create a Pooch instance
data_manager = pooch.create(
    path=pooch.os_cache("myproject"),
    base_url="https://github.com/myproject/data/raw/{version}/",
    version="v1.0.0",
    registry={
        "dataset1.csv": "md5:ab12cd34ef56...",
        "dataset2.zip": "sha256:12345abc...",
    }
)

# Fetch files from the registry
data_file = data_manager.fetch("dataset1.csv")

Architecture

Pooch is built around three main concepts:

  • Data Management: The central Pooch class manages a registry of files, each with its expected hash and download URL
  • Download Protocol: An extensible downloader system supports HTTP, FTP, SFTP, and DOI-based repositories
  • Post-Processing: A processor chain handles automatic decompression, unpacking, and custom transformations

This design enables scientific reproducibility by ensuring consistent data versions across different environments while supporting flexible data hosting and processing workflows.
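To make the three concepts concrete, here is a minimal standard-library sketch of how a registry, a downloader, and a processor might fit together. It is not Pooch's implementation: the `fetch` helper, the in-memory `SERVER`, and `fake_downloader` are hypothetical names invented for illustration, and only SHA256 hashes are handled.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(fname, registry, cache_dir, downloader, processor=None):
    """Return a local path for fname, downloading only when needed."""
    if fname not in registry:
        raise ValueError(f"'{fname}' is not in the registry")
    target = Path(cache_dir) / fname
    # Data management: choose an action from the cache state and the hash.
    if not target.exists():
        action = "download"
    elif sha256_of(target) != registry[fname]:
        action = "update"
    else:
        action = "fetch"
    # Download protocol: contact the "server" only when the cache is stale.
    if action != "fetch":
        target.parent.mkdir(parents=True, exist_ok=True)
        downloader(fname, target)
        if sha256_of(target) != registry[fname]:
            raise ValueError(f"hash mismatch for '{fname}'")
    # Post-processing: optional hook that may return a different path.
    if processor is not None:
        return processor(str(target), action)
    return str(target)


# Hypothetical in-memory "server" standing in for a real download.
SERVER = {"dataset1.csv": b"a,b\n1,2\n"}
REGISTRY = {name: hashlib.sha256(data).hexdigest() for name, data in SERVER.items()}
calls = []


def fake_downloader(fname, target):
    """Record the request and write the file contents to the cache."""
    calls.append(fname)
    Path(target).write_bytes(SERVER[fname])


cache = tempfile.mkdtemp()
first = fetch("dataset1.csv", REGISTRY, cache, fake_downloader)
second = fetch("dataset1.csv", REGISTRY, cache, fake_downloader)
```

The second call finds a cached file whose hash matches the registry, so the downloader runs only once.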

Capabilities

Core Data Management

Primary functionality for downloading and caching individual files, or for managing collections of data files with versioning and hash verification.

def retrieve(url, known_hash, fname=None, path=None, processor=None, downloader=None, progressbar=False): ...
def create(path, base_url, version=None, version_dev="master", env=None, registry=None, urls=None, retry_if_failed=0, allow_updates=True): ...
class Pooch: ...

Details: Core Data Management (core-data-management.md)

File Download Protocols

Specialized downloader classes for different protocols and authentication methods, including HTTP/HTTPS with custom headers, FTP with authentication, SFTP, and DOI-based repository downloads.

class HTTPDownloader: ...
class FTPDownloader: ...
class SFTPDownloader: ...
class DOIDownloader: ...
def choose_downloader(url, progressbar=False): ...
def doi_to_url(doi): ...
def doi_to_repository(doi): ...

Details: Download Protocols (download-protocols.md)
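Besides the built-in classes, a downloader can be any callable that accepts `(url, output_file, pooch)`. The sketch below follows that convention using only the standard library; `local_file_downloader` is a hypothetical name, and handling of `file://` URLs is an assumption made so the example runs without network access.

```python
import shutil
import tempfile
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import url2pathname


def local_file_downloader(url, output_file, pooch):
    """Copy the file behind a file:// URL to output_file.

    The `pooch` argument is part of the calling convention and is unused here.
    """
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError(f"unsupported protocol: '{parts.scheme}'")
    source = url2pathname(parts.path)
    if hasattr(output_file, "write"):
        # output_file is an open binary file object
        with open(source, "rb") as stream:
            shutil.copyfileobj(stream, output_file)
    else:
        # output_file is a path
        shutil.copyfile(source, output_file)


# Illustrative run against a temporary file instead of a remote server.
workdir = Path(tempfile.mkdtemp())
source_file = workdir / "remote.txt"
source_file.write_text("hello")
destination = workdir / "local.txt"
local_file_downloader(source_file.as_uri(), str(destination), None)
```

A callable like this would be passed through the `downloader` argument of `pooch.retrieve` or `Pooch.fetch`.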

File Processing

Post-download processors for automatic decompression, archive extraction, and custom file transformations that execute after successful downloads.

class Decompress: ...
class Unzip: ...
class Untar: ...

Details: File Processing (file-processing.md)
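A custom processor is a callable that receives `(fname, action, pooch)` and returns the path that should be handed back to the caller, where `action` is one of `"download"`, `"update"`, or `"fetch"`. The following is a minimal sketch of that contract, not the built-in `Decompress` class; `gunzip_processor` and its output naming are assumptions chosen for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_processor(fname, action, pooch):
    """Decompress a gzip file next to itself and return the new path.

    The `pooch` argument is part of the calling convention and is unused here.
    """
    output = fname[:-3] if fname.endswith(".gz") else fname + ".decomp"
    # Redo the work only after a fresh download/update or if the output is missing.
    if action in ("update", "download") or not Path(output).exists():
        with gzip.open(fname, "rb") as compressed, open(output, "wb") as plain:
            shutil.copyfileobj(compressed, plain)
    return output


# Illustrative run on a temporary gzip file.
workdir = Path(tempfile.mkdtemp())
archive = workdir / "table.csv.gz"
with gzip.open(archive, "wb") as stream:
    stream.write(b"a,b\n1,2\n")
result = gunzip_processor(str(archive), "download", None)
```

Checking `action` lets the processor skip repeated work when the file was already in the cache and the decompressed copy still exists.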

Utilities and Helpers

Helper functions for cache management, version handling, file hashing, and registry creation to support data management workflows.

def os_cache(project): ...
def check_version(version, fallback="master"): ...
def file_hash(fname, alg="sha256"): ...
def make_registry(directory, output, recursive=True): ...
def get_logger(): ...

Details: Utilities and Helpers (utilities-helpers.md)
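To illustrate what hashing and registry creation involve, here is a standard-library sketch that walks a directory and writes one `name hash` line per file. It is a conceptual stand-in rather than the library's `make_registry`; `file_sha256` and `build_registry` are hypothetical helpers, and the line layout is assumed to mirror the registry files that Pooch reads.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(fname, chunk_size=65536):
    """Return the SHA256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(fname, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_registry(directory, output, recursive=True):
    """Write 'relative/path hash' lines for every file in directory."""
    directory = Path(directory)
    pattern = "**/*" if recursive else "*"
    # Collect files before opening the output so it cannot list itself.
    files = sorted(p for p in directory.glob(pattern) if p.is_file())
    with open(output, "w", encoding="utf-8") as out:
        for path in files:
            relative = path.relative_to(directory).as_posix()
            out.write(f"{relative} {file_sha256(path)}\n")


# Illustrative run on a temporary directory tree.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "sub").mkdir()
(data_dir / "a.txt").write_bytes(b"alpha")
(data_dir / "sub" / "b.txt").write_bytes(b"beta")
registry_file = Path(tempfile.mkdtemp()) / "registry.txt"
build_registry(data_dir, registry_file)
lines = registry_file.read_text(encoding="utf-8").splitlines()
```

Paths are written with forward slashes so the same registry file can be used on any operating system.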

Version Information

__version__: str  # Package version string with 'v' prefix

Testing

def test(doctest=True, verbose=True, coverage=False): ...