# Core Data Management

This section covers the primary functionality for downloading and caching individual files, and for managing collections of data files with versioning and hash verification. These functions form the foundation of Pooch's data management capabilities.

## Capabilities

### Single File Download

Downloads and caches individual files with hash verification, supporting custom processors and downloaders.

```python { .api }
def retrieve(
    url: str,
    known_hash: str | None,
    fname: str | None = None,
    path: str | None = None,
    processor: callable | None = None,
    downloader: callable | None = None,
    progressbar: bool = False
) -> str:
    """
    Download and cache a single file locally.

    Parameters:
    - url: The URL of the file to be downloaded
    - known_hash: A known hash (checksum) of the file, used to verify the download. Assumes SHA256 by default. To specify a different algorithm, prepend it as 'algorithm:', e.g., 'md5:pw9co2iun29juoh'. If None, will NOT check the hash
    - fname: The name that will be used to save the file. If None, will create a unique file name
    - path: The location of the cache folder on disk. If None, will save to a pooch folder in the default cache location
    - processor: If not None, then a function that will be called before returning the full path and after the file has been downloaded
    - downloader: If not None, then a function that will be called to download a given URL to a provided local file name
    - progressbar: If True, will print a progress bar of the download. Requires tqdm to be installed

    Returns:
    The absolute path (including the file name) of the file in the local storage
    """
```
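
The `known_hash` format described above can be illustrated with a small standard-library sketch. It is not Pooch's implementation: the helper names `split_known_hash` and `matches_known_hash` are hypothetical, and the check runs on in-memory bytes rather than on a downloaded file.

```python
import hashlib


def split_known_hash(known_hash):
    """Split an 'algorithm:digest' string; a bare digest is treated as SHA256."""
    if ":" in known_hash:
        algorithm, digest = known_hash.split(":", 1)
    else:
        algorithm, digest = "sha256", known_hash
    return algorithm.lower(), digest.lower()


def matches_known_hash(data, known_hash):
    """Return True if the bytes in `data` hash to the digest in `known_hash`."""
    if known_hash is None:
        # Mirrors the documented behaviour: no hash means no verification.
        return True
    algorithm, digest = split_known_hash(known_hash)
    return hashlib.new(algorithm, data).hexdigest() == digest


# Usage with an in-memory payload (illustrative data, not a real download).
payload = b"example contents"
md5_spec = "md5:" + hashlib.md5(payload).hexdigest()
is_valid = matches_known_hash(payload, md5_spec)
```

Passing `known_hash=None` skips verification entirely, so it is best kept for exploratory use where the file contents cannot be pinned in advance.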

### Data Manager Factory

Creates a Pooch instance with sensible defaults for managing multiple data files with versioning support.

```python { .api }
def create(
    path: str | list | tuple,
    base_url: str,
    version: str | None = None,
    version_dev: str = "master",
    env: str | None = None,
    registry: dict | None = None,
    urls: dict | None = None,
    retry_if_failed: int = 0,
    allow_updates: bool = True
) -> Pooch:
    """
    Create a Pooch with sensible defaults to fetch data files.

    Parameters:
    - path: The path to the local data storage folder. If this is a list or tuple, will join the parts. The version will be appended to the end of this path
    - base_url: Base URL for the remote data source. Should have a {version} formatting mark in it
    - version: The version string for your project. Should be PEP440 compatible. If None, will not attempt to format base_url and no subfolder will be appended to path
    - version_dev: The name used for the development version of a project. If your data is hosted on GitHub, then "master" is a good choice
    - env: An environment variable that can be used to overwrite path
    - registry: A record of the files that are managed by this Pooch. Keys should be the file names and the values should be their hashes
    - urls: Custom URLs for downloading individual files in the registry
    - retry_if_failed: Retry a file download the specified number of times if it fails
    - allow_updates: Whether existing files in local storage that have a hash mismatch with the registry are allowed to update from the remote URL

    Returns:
    A Pooch instance configured with the given parameters
    """
```
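
The interaction between `path`, `base_url`, `version`, `version_dev`, and `env` can be sketched with the standard library alone. This is a simplified illustration, not Pooch's code: the helper `resolve_locations` is hypothetical, and it assumes that a development version is one carrying a PEP 440 local segment (a `+` suffix).

```python
import os


def resolve_locations(path, base_url, version=None, version_dev="master", env=None):
    """Return the (local path, remote URL) pair implied by the arguments."""
    # A list or tuple of path parts is joined into a single path.
    if isinstance(path, (list, tuple)):
        path = os.path.join(*path)
    # The environment variable, when set, overrides the given path.
    if env is not None and env in os.environ:
        path = os.environ[env]
    if version is not None:
        # Assumption: a '+' marks a local (development) version.
        if "+" in version:
            version = version_dev
        base_url = base_url.format(version=version)
        path = os.path.join(str(path), version)
    return path, base_url


# A release version is used as-is; a development version falls back to version_dev.
release = resolve_locations(
    ["cache", "myproject"], "https://example.com/data/{version}/", version="v1.0.0"
)
development = resolve_locations(
    "cache", "https://example.com/data/{version}/", version="v1.0.0+12.gabc123"
)
```

With `version=None`, neither the URL nor the path is modified, which matches the parameter description above.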

### Data Manager Class

Manager for local data storage that can fetch from remote sources with registry-based file management.

```python { .api }
class Pooch:
    """
    Manager for a local data storage that can fetch from a remote source.

    Avoid creating Pooch instances directly. Use pooch.create instead.
    """

    def __init__(
        self,
        path: str,
        base_url: str,
        registry: dict | None = None,
        urls: dict | None = None,
        retry_if_failed: int = 0,
        allow_updates: bool = True
    ):
        """
        Parameters:
        - path: The path to the local data storage folder
        - base_url: Base URL for the remote data source. All requests will be made relative to this URL
        - registry: A record of the files that are managed by this Pooch. Keys should be the file names and values should be their hashes
        - urls: Custom URLs for downloading individual files in the registry
        - retry_if_failed: Retry a file download the specified number of times if it fails
        - allow_updates: Whether existing files in local storage that have a hash mismatch with the registry are allowed to update from the remote URL
        """

    @property
    def abspath(self) -> Path:
        """Absolute path to the local storage."""

    @property
    def registry_files(self) -> list[str]:
        """List of file names on the registry."""

    def fetch(
        self,
        fname: str,
        processor: callable | None = None,
        downloader: callable | None = None,
        progressbar: bool = False
    ) -> str:
        """
        Get the absolute path to a file in the local storage.

        Parameters:
        - fname: The file name (relative to the base_url of the remote data storage) of the file in the registry
        - processor: If not None, then a function that will be called before returning the full path and after the file has been downloaded
        - downloader: If not None, then a function that will be called to download a given URL to a provided local file name
        - progressbar: If True, will print a progress bar of the download

        Returns:
        The absolute path to the file in the local storage
        """

    def get_url(self, fname: str) -> str:
        """
        Get the download URL for the given file.

        Parameters:
        - fname: The file name (relative to the base_url) in the registry

        Returns:
        The download URL for the file
        """

    def load_registry(self, fname: str | object) -> None:
        """
        Load entries from a file and add them to the registry.

        Each line should contain file name and hash separated by a space.
        Hash can specify algorithm using 'alg:hash' format. Custom URLs
        can be specified as a third element. Line comments start with '#'.

        Parameters:
        - fname: Path to the registry file or an open file object
        """

    def load_registry_from_doi(self) -> None:
        """
        Populate the registry using the data repository API.

        Fill the registry with all files available in the data repository,
        along with their hashes. Makes a request to the repository API to
        retrieve this information. No files are downloaded during this process.

        Requires that the Pooch was created with a DOI base_url.
        """

    def is_available(self, fname: str, downloader: callable | None = None) -> bool:
        """
        Check if a file is available for download from the remote storage.

        Parameters:
        - fname: The file name (relative to the base_url) in the registry
        - downloader: If not None, then a function that will be called to check if the file is available

        Returns:
        True if the file is available, False otherwise
        """
```
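
The `processor` hook accepted by `fetch` (and by `retrieve`) can be sketched as a plain function. The example below assumes the processor is called as `processor(fname, action, pooch)`, where `action` is one of `"download"`, `"update"`, or `"fetch"`, and that its return value replaces the path returned to the caller. The function `uppercase_copy` is hypothetical and only illustrates the pattern.

```python
import os


def uppercase_copy(fname, action, pooch):
    """
    Hypothetical processor: write an upper-cased copy of a text file.

    fname  : full path of the file in local storage
    action : "download", "update", or "fetch" (assumed values)
    pooch  : the calling Pooch instance (unused here; may be None)
    """
    processed = fname + ".upper"
    # Redo the work only when the source was (re)downloaded or the copy is missing.
    if action in ("download", "update") or not os.path.exists(processed):
        with open(fname) as source, open(processed, "w") as target:
            target.write(source.read().upper())
    return processed
```

Under these assumptions, a call such as `data_manager.fetch("readme.txt", processor=uppercase_copy)` would return the path of the processed copy rather than the path of the original file.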

## Usage Examples

### Basic Single File Download

```python
import pooch

# Download a single file with hash verification.
# retrieve() returns the absolute path of the cached file.
fname = pooch.retrieve(
    url="https://github.com/fatiando/pooch/raw/v1.8.2/data/tiny-data.txt",
    known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
)

# Read the cached file from disk
with open(fname) as f:
    data = f.read()
```

### Managing Multiple Files

```python
import pooch

# Create a data manager for your project
data_manager = pooch.create(
    path=pooch.os_cache("myproject"),
    base_url="https://github.com/myproject/data/raw/{version}/",
    version="v1.0.0",
    registry={
        "temperature.csv": "md5:ab12cd34ef56...",
        "pressure.dat": "sha256:12345abc...",
        "readme.txt": "md5:987fde65...",
    }
)

# Fetch files from the registry; fetch() returns the local path of each file
temp_data = data_manager.fetch("temperature.csv")
pressure_data = data_manager.fetch("pressure.dat")

# List the file names in the registry
print(data_manager.registry_files)
```

### Registry Management

```python
import pooch

# Create registry from directory
pooch.make_registry("data/", "registry.txt", recursive=True)

# Load registry from file
data_manager = pooch.create(
    path=pooch.os_cache("myproject"),
    base_url="https://example.com/data/",
)
data_manager.load_registry("registry.txt")
```
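
The registry file format that `load_registry` reads (file name, hash, optional URL, `#` comments) can be illustrated with a minimal parser. This is a sketch under simplifying assumptions, not Pooch's parser: the helper `parse_registry_lines` is hypothetical, file names are assumed to contain no spaces, and the hashes in the sample are placeholders.

```python
def parse_registry_lines(lines):
    """Parse 'name hash [url]' lines into (registry, urls) dictionaries."""
    registry, urls = {}, {}
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        # Skip blank lines and line comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) not in (2, 3):
            raise ValueError(f"Invalid registry entry on line {number}: {line!r}")
        name, file_hash = parts[0], parts[1].lower()
        registry[name] = file_hash
        # An optional third element is a custom download URL for this file.
        if len(parts) == 3:
            urls[name] = parts[2]
    return registry, urls


# Sample registry content with placeholder hashes.
sample = [
    "# name hash [url]",
    "temperature.csv md5:AB12CD34",
    "readme.txt sha256:0123abcd https://example.com/mirror/readme.txt",
]
registry, urls = parse_registry_lines(sample)
```

The two dictionaries correspond to the `registry` and `urls` arguments of `pooch.create`.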