
# Caching and Integrity

Download files with local caching, hash verification, and optional post-processing for reliable automation workflows.

## Capabilities

### Cached Download Function

Downloads a file to a cache path and verifies its integrity against an expected hash. Several hash algorithms are supported.

```python { .api }
from typing import Optional

def cached_download(
    url=None,
    path=None,
    md5=None,
    quiet=False,
    postprocess=None,
    hash: Optional[str] = None,
    **kwargs
) -> str:
    """
    Download file with caching and hash verification.

    Parameters:
    - url (str): URL to download from. Google Drive URLs supported.
    - path (str): Cache file path. If None, auto-generated from URL.
    - md5 (str): Expected MD5 hash (deprecated, use hash parameter).
    - quiet (bool): Suppress terminal output. Default: False.
    - postprocess (callable): Function to call with filename after download.
    - hash (str): Hash in format 'algorithm:hexvalue' (e.g., 'sha256:abc123...').
      Supported: md5, sha1, sha256, sha512.
    - **kwargs: Additional arguments passed to download() function.

    Returns:
    str: Path to cached file.

    Raises:
    AssertionError: When file hash doesn't match expected value.
    ValueError: When both md5 and hash parameters are specified.
    """
```
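
To make the `'algorithm:hexvalue'` format concrete, the following minimal sketch splits such a string into its two parts. The helper name `parse_hash_spec` is hypothetical and not part of gdown; how gdown parses the string internally may differ.

```python
def parse_hash_spec(spec):
    """Split 'algorithm:hexvalue' into its parts (hypothetical helper, not part of gdown)."""
    algorithm, sep, hexvalue = spec.partition(":")
    if not sep or not algorithm or not hexvalue:
        raise ValueError(f"expected 'algorithm:hexvalue', got {spec!r}")
    # hashlib produces lowercase hex digests, so lowercase both parts
    # to keep later comparisons consistent.
    return algorithm.lower(), hexvalue.lower()


algorithm, hexvalue = parse_hash_spec(
    "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
)
print(algorithm)      # sha256
print(len(hexvalue))  # 64
```

A check like this can catch a malformed value (for example, a bare hex digest with no algorithm prefix) before any network request is made.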

### Usage Examples

#### Basic Cached Download

```python
import gdown

# Simple cached download
url = "https://drive.google.com/uc?id=1l_5RK28JRL19wpT22B-DY9We3TVXnnQQ"
cached_path = gdown.cached_download(url)
print(f"File cached at: {cached_path}")

# Subsequent calls return cached file immediately
cached_path_again = gdown.cached_download(url)  # No download, returns cached file
```

#### Hash Verification

```python
import gdown

# SHA256 verification
url = "https://example.com/data.zip"
expected_hash = "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

try:
    path = gdown.cached_download(url, hash=expected_hash)
    print(f"File verified and cached: {path}")
except AssertionError as e:
    print(f"Hash verification failed: {e}")
```

#### Multiple Hash Algorithms

```python
# MD5 verification
gdown.cached_download(url, hash="md5:5d41402abc4b2a76b9719d911017c592")

# SHA1 verification
gdown.cached_download(url, hash="sha1:aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d")

# SHA512 verification
gdown.cached_download(url, hash="sha512:b109f3bbbc244eb82441917ed06d618b9008dd09b3befd1b5e07394c706a8bb980b1d7785e5976ec049b46df5f1326af5a2ea6d103fd07c95385ffab0cacbc86")
```

#### Custom Cache Location

```python
# Specify custom cache path
custom_path = "/tmp/my_cache/important_file.zip"
gdown.cached_download(url, path=custom_path, hash="sha256:abc123...")
```

#### Post-processing

```python
# Extract archive after download
def extract_archive(filepath):
    print(f"Extracting {filepath}")
    gdown.extractall(filepath, to="./extracted/")

# Download, verify, and extract
gdown.cached_download(
    url="https://example.com/archive.zip",
    hash="sha256:expected_hash_here",
    postprocess=extract_archive
)
```

#### Integration with Download Options

```python
# Use all download() parameters
gdown.cached_download(
    url="https://drive.google.com/uc?id=FILE_ID",
    path="./cache/myfile.zip",
    hash="sha256:expected_hash",
    proxy="http://proxy:8080",
    speed=512*1024,  # 512KB/s
    use_cookies=True,
    fuzzy=True
)
```

### Hash Computation Utilities

```python { .api }
def md5sum(filename, blocksize=None) -> str:
    """
    Calculate MD5 hash of file (deprecated).

    Parameters:
    - filename (str): Path to file to hash
    - blocksize (int): Block size for reading file chunks. Default: 65536

    Returns:
    str: MD5 hexdigest string

    Note: Deprecated and will be removed in future versions.
    Use hash parameter in cached_download() instead.
    """
```
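
Because `md5sum` is marked deprecated, a value for the `hash` parameter can also be computed directly with Python's `hashlib`. The sketch below is one way to do that, not gdown's implementation. `file_hash_spec` is a hypothetical helper name, and the 65536-byte block size simply mirrors the default listed above.

```python
import hashlib
import os
import tempfile


def file_hash_spec(path, algorithm="sha256", blocksize=65536):
    """Return 'algorithm:hexvalue' for the file at `path` (hypothetical helper)."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size blocks so large files are never loaded into memory at once.
        for block in iter(lambda: f.read(blocksize), b""):
            hasher.update(block)
    return f"{algorithm}:{hasher.hexdigest()}"


# Demonstrate on a small temporary file containing the bytes b"hello".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name

demo_spec = file_hash_spec(demo_path, "md5")
print(demo_spec)  # md5:5d41402abc4b2a76b9719d911017c592
os.remove(demo_path)
```

The returned string already has the `'algorithm:hexvalue'` shape, so it can be stored alongside a URL and passed as `hash=` on later calls.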

#### Usage Example

```python
import gdown

# Calculate MD5 hash of downloaded file (deprecated usage)
file_path = "downloaded_file.zip"
hash_value = gdown.md5sum(file_path)
print(f"MD5: {hash_value}")

# Preferred approach: Use hash parameter in cached_download
gdown.cached_download(url, hash=f"md5:{hash_value}")
```

## Cache Directory Structure

### Default Cache Location

Files are cached in `~/.cache/gdown/` with URL-based naming:

```
~/.cache/gdown/
├── https-COLON--SLASH--SLASH-drive.google.com-SLASH-uc-QUESTION-id-EQUAL-FILE_ID
├── _dl_lock                    # Download lock file
└── cookies.txt                 # Cookie storage
```

### URL to Filename Mapping

URLs are converted to filenames by replacing special characters:

- `/` → `-SLASH-`
- `:` → `-COLON-`
- `=` → `-EQUAL-`
- `?` → `-QUESTION-`
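
The replacement rules above can be sketched as a small function. `url_to_cache_name` and `default_cache_path` are hypothetical names used only for illustration. The sketch assumes the four replacements are applied as plain string substitutions and that the result is joined onto `~/.cache/gdown/`.

```python
import os.path as osp


def url_to_cache_name(url):
    """Apply the four character replacements listed above (hypothetical helper)."""
    return (
        url.replace("/", "-SLASH-")
        .replace(":", "-COLON-")
        .replace("=", "-EQUAL-")
        .replace("?", "-QUESTION-")
    )


def default_cache_path(url):
    """Join the mapped name onto the default cache directory named in this document."""
    cache_root = osp.join(osp.expanduser("~"), ".cache", "gdown")
    return osp.join(cache_root, url_to_cache_name(url))


name = url_to_cache_name("https://drive.google.com/uc?id=FILE_ID")
print(name)
# https-COLON--SLASH--SLASH-drive.google.com-SLASH-uc-QUESTION-id-EQUAL-FILE_ID
```

A helper like this is handy for checking whether a given URL already has a cache entry without triggering a download.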

## Error Handling

```python
import gdown
from gdown.exceptions import FileURLRetrievalError

try:
    # Hash mismatch example
    gdown.cached_download(
        "https://example.com/file.zip",
        hash="sha256:wrong_hash_value"
    )
except AssertionError as e:
    print(f"Hash verification failed: {e}")
    # Re-download with correct hash or investigate file corruption

try:
    # Download failure
    gdown.cached_download("https://invalid-url.com/file.zip")
except FileURLRetrievalError as e:
    print(f"Download failed: {e}")
```

## Supported Hash Algorithms

All algorithms from Python's `hashlib.algorithms_guaranteed`:

- **md5**: Fast but cryptographically weak (legacy support)
- **sha1**: Better than MD5 but still considered weak
- **sha256**: Recommended for most use cases
- **sha512**: Maximum security for critical files
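
Each algorithm produces a digest of fixed length, which allows a cheap sanity check on an expected hash before a download starts. The sketch below uses only `hashlib`; `expected_hex_length` and `looks_valid` are hypothetical helpers, and restricting the check to the four algorithms listed above is an assumption made for this example.

```python
import hashlib

# The four algorithms listed in this document.
ALGORITHMS = ("md5", "sha1", "sha256", "sha512")


def expected_hex_length(algorithm):
    """Number of hex characters in a digest for `algorithm` (hypothetical helper)."""
    return hashlib.new(algorithm).digest_size * 2


def looks_valid(spec):
    """Cheap pre-check of an 'algorithm:hexvalue' string (hypothetical helper)."""
    algorithm, _, hexvalue = spec.partition(":")
    if algorithm not in ALGORITHMS:
        return False
    if len(hexvalue) != expected_hex_length(algorithm):
        return False
    # Every character must be a hexadecimal digit.
    return all(c in "0123456789abcdefABCDEF" for c in hexvalue)


for name in ALGORITHMS:
    print(name, expected_hex_length(name))
# md5 32
# sha1 40
# sha256 64
# sha512 128
```

A truncated or mistyped digest fails this check immediately, which is cheaper than discovering the mismatch only after the file has been fetched.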

## Best Practices

### Reliable Downloads

```python
import gdown

# Always use hash verification for production
def reliable_download(url, expected_hash, max_retries=3):
    for attempt in range(max_retries):
        try:
            return gdown.cached_download(url, hash=expected_hash)
        except AssertionError:
            # Give up and re-raise once the final attempt has failed
            if attempt == max_retries - 1:
                raise
            print(f"Hash mismatch, retrying... ({attempt + 1}/{max_retries})")
```

### Automation Workflows

```python
import gdown

# Pipeline with post-processing
def process_dataset(url, dataset_hash):
    # Download and verify
    archive_path = gdown.cached_download(url, hash=dataset_hash)

    # Extract
    extract_dir = "./data/"
    extracted_files = gdown.extractall(archive_path, to=extract_dir)

    # Process files
    for file_path in extracted_files:
        if file_path.endswith('.csv'):
            # Process CSV data
            pass

    return extracted_files
```