# Core Data Management

Primary functionality for downloading and caching individual files or managing collections of data files with version control and hash verification. These functions form the foundation of Pooch's data management capabilities.

## Capabilities

### Single File Download

Downloads and caches individual files with hash verification, supporting custom processors and downloaders.

```python { .api }
def retrieve(
    url: str,
    known_hash: str | None,
    fname: str | None = None,
    path: str | None = None,
    processor: callable | None = None,
    downloader: callable | None = None,
    progressbar: bool = False
) -> str:
    """
    Download and cache a single file locally.

    Parameters:
    - url: The URL of the file to be downloaded
    - known_hash: A known hash (checksum) of the file. Will be used to verify the download. By default, assumes SHA256. To specify a different algorithm, prepend it as 'algorithm:', e.g., 'md5:pw9co2iun29juoh'. If None, will NOT check the hash
    - fname: The name that will be used to save the file. If None, will create a unique file name
    - path: The location of the cache folder on disk. If None, will save to a pooch folder in the default cache location
    - processor: If not None, then a function that will be called before returning the full path and after the file has been downloaded
    - downloader: If not None, then a function that will be called to download a given URL to a provided local file name
    - progressbar: If True, will print a progress bar of the download. Requires tqdm to be installed

    Returns:
    The absolute path (including the file name) of the file in the local storage
    """
```
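
The `known_hash` format described above (an optional `algorithm:` prefix, SHA256 otherwise, and `None` to skip the check) can be illustrated with the standard library alone. This is a minimal sketch, not Pooch's own implementation: `parse_known_hash` and `hash_matches` are hypothetical helper names introduced here for illustration.

```python
import hashlib


def parse_known_hash(known_hash):
    """Split 'algorithm:digest' into its parts, defaulting to SHA256."""
    if ":" in known_hash:
        algorithm, digest = known_hash.split(":", 1)
    else:
        algorithm, digest = "sha256", known_hash
    # Lower-casing both parts is a choice of this sketch
    return algorithm.lower(), digest.lower()


def hash_matches(path, known_hash, chunk_size=65536):
    """Return True if the file at `path` matches `known_hash`."""
    if known_hash is None:
        # Mirrors the documented behaviour: no hash means no check
        return True
    algorithm, expected = parse_known_hash(known_hash)
    hasher = hashlib.new(algorithm)
    # Read in chunks so large files do not need to fit in memory
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected
```

Reading the file in chunks keeps memory use flat for large downloads, which is the usual reason to avoid hashing `stream.read()` in one call.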

### Data Manager Factory

Creates a Pooch instance with sensible defaults for managing multiple data files with versioning support.

```python { .api }
def create(
    path: str | list | tuple,
    base_url: str,
    version: str | None = None,
    version_dev: str = "master",
    env: str | None = None,
    registry: dict | None = None,
    urls: dict | None = None,
    retry_if_failed: int = 0,
    allow_updates: bool = True
) -> Pooch:
    """
    Create a Pooch with sensible defaults to fetch data files.

    Parameters:
    - path: The path to the local data storage folder. If this is a list or tuple, will join the parts. The version will be appended to the end of this path
    - base_url: Base URL for the remote data source. Should have a {version} formatting mark in it
    - version: The version string for your project. Should be PEP440 compatible. If None, will not attempt to format base_url and no subfolder will be appended to path
    - version_dev: The name used for the development version of a project. If your data is hosted on GitHub, then "master" is a good choice
    - env: An environment variable that can be used to overwrite path
    - registry: A record of the files that are managed by this Pooch. Keys should be the file names and the values should be their hashes
    - urls: Custom URLs for downloading individual files in the registry
    - retry_if_failed: Retry a file download the specified number of times if it fails
    - allow_updates: Whether existing files in local storage that have a hash mismatch with the registry are allowed to update from the remote URL

    Returns:
    A Pooch instance configured with the given parameters
    """
```
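
How `path`, `base_url`, `version`, `version_dev`, and `env` interact is easier to see in code. The sketch below is one reading of the parameter descriptions above, not Pooch's source. `resolve_locations` is a hypothetical helper, and the rule that a `+` in the version string marks a development build is an assumption of this example.

```python
import os


def resolve_locations(path, base_url, version=None, version_dev="master", env=None):
    """Return (storage_path, formatted_base_url) per the parameter descriptions."""
    # Join the parts when a list or tuple is given
    if isinstance(path, (list, tuple)):
        path = os.path.join(*path)
    # A set, non-empty environment variable replaces the path
    if env is not None and os.environ.get(env):
        path = os.environ[env]
    if version is not None:
        # ASSUMPTION: "+" (a PEP 440 local version segment) marks a development build
        label = version_dev if "+" in version else version
        path = os.path.join(path, label)
        base_url = base_url.format(version=label)
    return os.path.expanduser(path), base_url
```

Under these assumptions, `resolve_locations("cache", "https://example.com/{version}/", version="1.2.0")` places files in `cache/1.2.0` and downloads from `https://example.com/1.2.0/`, while a version such as `1.2.0+4.gabc123` falls back to the `version_dev` name for both.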

### Data Manager Class

Manager for local data storage that can fetch from remote sources with registry-based file management.

```python { .api }
class Pooch:
    """
    Manager for a local data storage that can fetch from a remote source.

    Avoid creating Pooch instances directly. Use pooch.create instead.
    """

    def __init__(
        self,
        path: str,
        base_url: str,
        registry: dict | None = None,
        urls: dict | None = None,
        retry_if_failed: int = 0,
        allow_updates: bool = True
    ):
        """
        Parameters:
        - path: The path to the local data storage folder
        - base_url: Base URL for the remote data source. All requests will be made relative to this URL
        - registry: A record of the files that are managed by this Pooch. Keys should be the file names and values should be their hashes
        - urls: Custom URLs for downloading individual files in the registry
        - retry_if_failed: Retry a file download the specified number of times if it fails
        - allow_updates: Whether existing files in local storage that have a hash mismatch with the registry are allowed to update from the remote URL
        """

    @property
    def abspath(self) -> Path:
        """Absolute path to the local storage."""

    @property
    def registry_files(self) -> list[str]:
        """List of file names on the registry."""

    def fetch(
        self,
        fname: str,
        processor: callable | None = None,
        downloader: callable | None = None,
        progressbar: bool = False
    ) -> str:
        """
        Get the absolute path to a file in the local storage.

        Parameters:
        - fname: The file name (relative to the base_url of the remote data storage) of the file in the registry
        - processor: If not None, then a function that will be called before returning the full path and after the file has been downloaded
        - downloader: If not None, then a function that will be called to download a given URL to a provided local file name
        - progressbar: If True, will print a progress bar of the download

        Returns:
        The absolute path to the file in the local storage
        """

    def get_url(self, fname: str) -> str:
        """
        Get the download URL for the given file.

        Parameters:
        - fname: The file name (relative to the base_url) in the registry

        Returns:
        The download URL for the file
        """

    def load_registry(self, fname: str | object) -> None:
        """
        Load entries from a file and add them to the registry.

        Each line should contain file name and hash separated by a space.
        Hash can specify algorithm using 'alg:hash' format. Custom URLs
        can be specified as a third element. Line comments start with '#'.

        Parameters:
        - fname: Path to the registry file or an open file object
        """

    def load_registry_from_doi(self) -> None:
        """
        Populate the registry using the data repository API.

        Fill the registry with all files available in the data repository,
        along with their hashes. Makes a request to the repository API to
        retrieve this information. No files are downloaded during this process.

        Requires that the Pooch was created with a DOI base_url.
        """

    def is_available(self, fname: str, downloader: callable | None = None) -> bool:
        """
        Check if a file is available for download from the remote storage.

        Parameters:
        - fname: The file name (relative to the base_url) in the registry
        - downloader: If not None, then a function that will be called to check if the file is available

        Returns:
        True if the file is available, False otherwise
        """
```
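
The registry file format described for `load_registry` (file name, hash, optional custom URL, `#` comments) can be illustrated with a small stand-alone parser. This is a sketch under stated assumptions rather than the library's parser: `parse_registry` is a hypothetical name, file names are assumed to contain no spaces, and the hashes in the sample are made-up placeholders.

```python
import io


def parse_registry(lines):
    """Parse registry lines into (registry, urls) dictionaries."""
    registry, urls = {}, {}
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Skip blank lines and line comments
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) not in (2, 3):
            raise ValueError(
                f"Line {number}: expected 2 or 3 elements, got {len(parts)}"
            )
        fname, file_hash = parts[0], parts[1].lower()
        registry[fname] = file_hash
        # A third element, when present, is a custom download URL
        if len(parts) == 3:
            urls[fname] = parts[2]
    return registry, urls


# Placeholder content; the hashes are not real checksums
sample = io.StringIO(
    "# file name, hash, optional URL\n"
    "temperature.csv md5:ab12cd34\n"
    "pressure.dat sha256:12345abc https://example.com/mirror/pressure.dat\n"
)
registry, urls = parse_registry(sample)
```

Returning two dictionaries matches the separate `registry` and `urls` arguments in the signatures above.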

## Usage Examples

### Basic Single File Download

```python
import pooch

# Download a single file with hash verification
fname = pooch.retrieve(
    url="https://github.com/fatiando/pooch/raw/v1.8.2/data/tiny-data.txt",
    known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
)

with open(fname) as f:
    data = f.read()
```

### Managing Multiple Files

```python
import pooch

# Create a data manager for your project
data_manager = pooch.create(
    path=pooch.os_cache("myproject"),
    base_url="https://github.com/myproject/data/raw/{version}/",
    version="v1.0.0",
    registry={
        "temperature.csv": "md5:ab12cd34ef56...",
        "pressure.dat": "sha256:12345abc...",
        "readme.txt": "md5:987fde65...",
    }
)

# Fetch files from the registry; fetch returns the local path, not the contents
temp_path = data_manager.fetch("temperature.csv")
pressure_path = data_manager.fetch("pressure.dat")

# List the file names in the registry
print(data_manager.registry_files)
```
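
Each file in the registry needs a download URL, which `get_url` reports. Based on the descriptions of `base_url` and `urls` above, the lookup can be sketched as a custom URL if one is registered, otherwise `base_url` followed by the file name. `resolve_url` is a hypothetical stand-alone function written for illustration, not a Pooch API.

```python
def resolve_url(fname, base_url, registry, urls=None):
    """Return the download URL for a registered file name."""
    if fname not in registry:
        raise ValueError(f"File '{fname}' is not in the registry.")
    urls = urls or {}
    # Prefer a per-file custom URL, else treat fname as relative to base_url
    return urls.get(fname, base_url + fname)
```

Because the file name is appended directly, this sketch relies on `base_url` ending with a trailing slash, as it does in the examples in this document.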

### Registry Management

```python
import pooch

# Create registry from directory
pooch.make_registry("data/", "registry.txt", recursive=True)

# Load registry from file
data_manager = pooch.create(
    path=pooch.os_cache("myproject"),
    base_url="https://example.com/data/",
)
data_manager.load_registry("registry.txt")
```
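
To make the content of a registry file concrete, the sketch below writes one `file name` and `hash` pair per line using only the standard library. It is an illustration of the file format, not a description of how `pooch.make_registry` is implemented: `write_registry` is a hypothetical helper, and the use of SHA256 and of forward-slash relative paths are assumptions of this example.

```python
import hashlib
from pathlib import Path


def write_registry(directory, output, recursive=True):
    """Write '<relative name> <sha256 hash>' lines for files in `directory`."""
    root = Path(directory)
    pattern = "**/*" if recursive else "*"
    # Collect the files first so the listing is stable and sorted
    files = sorted(p for p in root.glob(pattern) if p.is_file())
    with open(output, "w", encoding="utf-8") as stream:
        for path in files:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            # Forward slashes keep names usable as URL path segments
            stream.write(f"{path.relative_to(root).as_posix()} {digest}\n")
```

A file produced this way has the two-element line format that `load_registry` is documented to accept.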