A friend to fetch your data files
npx @tessl/cli install tessl/pypi-pooch@1.8.00
# Pooch

Pooch is a Python library that manages data by downloading files from servers (HTTP, FTP, data repositories like Zenodo and figshare) only when needed and storing them locally in a data cache. It is implemented in pure Python with minimal dependencies, ships built-in post-processors for unzipping and decompressing data, and is designed to be extended with custom downloaders and processors.

## Package Information

- **Package Name**: pooch
- **Language**: Python
- **Installation**: `pip install pooch`

## Core Imports
```python
import pooch
```

For common usage patterns:

```python
from pooch import retrieve, create, Pooch
```

## Basic Usage

```python
import pooch

# Download a single file with hash verification
fname = pooch.retrieve(
    url="https://github.com/fatiando/pooch/raw/v1.8.2/data/tiny-data.txt",
    known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
)

# For managing multiple files, create a Pooch instance
data_manager = pooch.create(
    path=pooch.os_cache("myproject"),
    base_url="https://github.com/myproject/data/raw/{version}/",
    version="v1.0.0",
    registry={
        "dataset1.csv": "md5:ab12cd34ef56...",
        "dataset2.zip": "sha256:12345abc...",
    },
)

# Fetch files from the registry
data_file = data_manager.fetch("dataset1.csv")
```

## Architecture

Pooch is built around three main concepts:

- **Data Management**: The central `Pooch` class manages a registry of files, each with its expected hash and download URL.
- **Download Protocol**: An extensible downloader system supports HTTP, FTP, SFTP, and DOI-based repositories.
- **Post-Processing**: Processors run after a download to handle decompression, unpacking, and custom transformations.

This design supports scientific reproducibility: because each file is tied to an expected hash and, optionally, a data version, every environment works with the same data, while hosting and processing remain flexible.

## Capabilities

### Core Data Management

The primary functions download and cache individual files, or manage collections of data files with versioning and hash verification.

```python { .api }
def retrieve(url, known_hash, fname=None, path=None, processor=None, downloader=None, progressbar=False): ...
def create(path, base_url, version=None, version_dev="master", env=None, registry=None, urls=None, retry_if_failed=0, allow_updates=True): ...
class Pooch: ...
```
[Core Data Management](./core-data-management.md)

### File Download Protocols

Specialized downloader classes for different protocols and authentication methods, including HTTP/HTTPS with custom headers, FTP with authentication, SFTP, and DOI-based repository downloads.

```python { .api }
class HTTPDownloader: ...
class FTPDownloader: ...
class SFTPDownloader: ...
class DOIDownloader: ...
def choose_downloader(url, progressbar=False): ...
def doi_to_url(doi): ...
def doi_to_repository(doi): ...
```
[Download Protocols](./download-protocols.md)

### File Processing

Post-download processors for automatic decompression, archive extraction, and custom file transformations that execute after successful downloads.

```python { .api }
class Decompress: ...
class Unzip: ...
class Untar: ...
```
[File Processing](./file-processing.md)

### Utilities and Helpers

Helper functions for cache management, version handling, file hashing, and registry creation to support data management workflows.

```python { .api }
def os_cache(project): ...
def check_version(version, fallback="master"): ...
def file_hash(fname, alg="sha256"): ...
def make_registry(directory, output, recursive=True): ...
def get_logger(): ...
```
[Utilities and Helpers](./utilities-helpers.md)

## Version Information

```python { .api }
__version__: str  # Package version string with 'v' prefix
```

## Testing

```python { .api }
def test(doctest=True, verbose=True, coverage=False): ...
```