A friend to fetch your data files
npx @tessl/cli install tessl/pypi-pooch@1.8.0
Pooch is a Python library that manages data by downloading files from servers (HTTP, FTP, and data repositories such as Zenodo and figshare) only when needed and storing them in a local data cache. It is implemented in pure Python with minimal dependencies, ships with built-in post-processors for unzipping and decompressing data, and is designed to be extended with custom downloaders and processors.
pip install pooch
import pooch
For common usage patterns:
from pooch import retrieve, create, Pooch
import pooch
# Download a single file with hash verification
fname = pooch.retrieve(
    url="https://github.com/fatiando/pooch/raw/v1.8.2/data/tiny-data.txt",
    known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
)
# For managing multiple files, create a Pooch instance
data_manager = pooch.create(
    path=pooch.os_cache("myproject"),
    base_url="https://github.com/myproject/data/raw/{version}/",
    version="v1.0.0",
    registry={
        "dataset1.csv": "md5:ab12cd34ef56...",
        "dataset2.zip": "sha256:12345abc...",
    }
)
# Fetch files from the registry
data_file = data_manager.fetch("dataset1.csv")
Pooch is built around three main concepts:
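- The Pooch class manages registries of files with their expected hashes and download URLs.
- Downloaders handle protocol-specific transfers (HTTP/HTTPS, FTP, SFTP, and DOI-based repositories).
- Processors run after a successful download to decompress files, extract archives, or apply custom transformations.
This design enables scientific reproducibility by ensuring consistent data versions across different environments while supporting flexible data hosting and processing workflows.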
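As a rough illustration of how a hash registry, a downloader, and a processor can fit together, the standard-library sketch below caches a file, verifies its SHA-256 digest, and calls the downloader only when the cached copy is missing or stale. It is not Pooch's implementation: `fetch_cached`, `fake_downloader`, `upper_processor`, and the `example.org` URL are hypothetical, and the downloader writes fixed bytes instead of touching the network. The callable shapes `downloader(url, output_file, pooch)` and `processor(fname, action, pooch)` follow the ones Pooch documents for custom extensions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fin:
        for chunk in iter(lambda: fin.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_cached(name, registry, cache_dir, base_url, downloader, processor=None):
    """Hypothetical helper: return a hash-verified local path for `name`."""
    expected = registry[name]
    target = Path(cache_dir) / name
    # Decide what to do based on the state of the cached copy.
    if not target.exists():
        action = "download"
    elif sha256_of(target) != expected:
        action = "update"
    else:
        action = "fetch"
    if action != "fetch":
        target.parent.mkdir(parents=True, exist_ok=True)
        downloader(base_url + name, str(target), None)
        if sha256_of(target) != expected:
            target.unlink()  # do not keep a file that fails verification
            raise ValueError(f"SHA-256 mismatch for {name}")
    if processor is not None:
        return processor(str(target), action, None)
    return str(target)


PAYLOAD = b"hello pooch\n"
download_log = []


def fake_downloader(url, output_file, pooch):
    """Stand-in downloader: records the URL and writes fixed bytes."""
    download_log.append(url)
    Path(output_file).write_bytes(PAYLOAD)


def upper_processor(fname, action, pooch):
    """Stand-in processor: writes an upper-cased copy next to the file."""
    out = Path(fname + ".upper")
    if action in ("download", "update") or not out.exists():
        out.write_bytes(Path(fname).read_bytes().upper())
    return str(out)


registry = {"hello.txt": hashlib.sha256(PAYLOAD).hexdigest()}

with tempfile.TemporaryDirectory() as cache:
    args = ("hello.txt", registry, cache, "https://example.org/data/", fake_downloader)
    first = fetch_cached(*args, processor=upper_processor)   # downloads, then processes
    second = fetch_cached(*args, processor=upper_processor)  # served from the cache
    processed = Path(second).read_bytes()
```

The second call finds a cached file whose digest matches the registry, so the downloader is not invoked again. With Pooch installed, callables of the same shape can be passed through the `downloader` and `processor` arguments of `retrieve` or `Pooch.fetch`.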
Primary functionality for downloading and caching individual files or managing collections of data files with version control and hash verification.
def retrieve(url, known_hash, fname=None, path=None, processor=None, downloader=None, progressbar=False): ...
def create(path, base_url, version=None, version_dev="master", env=None, registry=None, urls=None, retry_if_failed=0, allow_updates=True): ...
class Pooch: ...
Specialized downloader classes for different protocols and authentication methods, including HTTP/HTTPS with custom headers, FTP with authentication, SFTP, and DOI-based repository downloads.
class HTTPDownloader: ...
class FTPDownloader: ...
class SFTPDownloader: ...
class DOIDownloader: ...
def choose_downloader(url, progressbar=False): ...
def doi_to_url(doi): ...
def doi_to_repository(doi): ...
Post-download processors for automatic decompression, archive extraction, and custom file transformations that execute after successful downloads.
class Decompress: ...
class Unzip: ...
class Untar: ...
Helper functions for cache management, version handling, file hashing, and registry creation to support data management workflows.
def os_cache(project): ...
def check_version(version, fallback="master"): ...
def file_hash(fname, alg="sha256"): ...
def make_registry(directory, output, recursive=True): ...
def get_logger(): ...
__version__: str  # Package version string with 'v' prefix
def test(doctest=True, verbose=True, coverage=False): ...
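The hashing and registry helpers listed above can be approximated with the standard library, which may help when checking what a registry entry should contain. The sketch below computes SHA-256 digests and writes one `name hash` line per file, the plain-text layout that `make_registry` produces. `sha256_hex`, `write_registry`, and the sample file names are hypothetical; with Pooch installed you would normally call `pooch.file_hash` and `pooch.make_registry` instead.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_hex(path):
    """Return the SHA-256 hex digest of a (small) file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_registry(directory, output):
    """Hypothetical helper: write 'relative/path hash' lines for every file."""
    directory = Path(directory)
    files = sorted(p for p in directory.rglob("*") if p.is_file())
    # Use forward slashes so the registry is portable across platforms.
    lines = [f"{p.relative_to(directory).as_posix()} {sha256_hex(p)}" for p in files]
    Path(output).write_text("\n".join(lines) + "\n")
    return lines


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    (data_dir / "sub").mkdir(parents=True)
    (data_dir / "a.txt").write_bytes(b"abc")
    (data_dir / "sub" / "b.txt").write_bytes(b"")
    registry_file = Path(tmp) / "registry.txt"  # kept outside data_dir
    entries = write_registry(data_dir, registry_file)
    registry_text = registry_file.read_text()

print(registry_text)
```

A file in this layout is the kind of input that `Pooch.load_registry` reads, so the same text can seed a data manager created with an empty registry.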