# Caching and Integrity

Download files with intelligent caching, hash verification, and post-processing capabilities for reliable automation workflows.

## Capabilities

### Cached Download Function

Downloads files with caching support and integrity verification using multiple hash algorithms.

```python { .api }
from typing import Optional

def cached_download(
    url=None,
    path=None,
    md5=None,
    quiet=False,
    postprocess=None,
    hash: Optional[str] = None,
    **kwargs
) -> str:
    """
    Download file with caching and hash verification.

    Parameters:
    - url (str): URL to download from. Google Drive URLs supported.
    - path (str): Cache file path. If None, auto-generated from URL.
    - md5 (str): Expected MD5 hash (deprecated, use hash parameter).
    - quiet (bool): Suppress terminal output. Default: False.
    - postprocess (callable): Function to call with filename after download.
    - hash (str): Hash in format 'algorithm:hexvalue' (e.g., 'sha256:abc123...').
      Supported: md5, sha1, sha256, sha512.
    - **kwargs: Additional arguments passed to download() function.

    Returns:
    str: Path to cached file.

    Raises:
    AssertionError: When file hash doesn't match expected value.
    ValueError: When both md5 and hash parameters are specified.
    """
```

### Usage Examples

#### Basic Cached Download

```python
import gdown

# Simple cached download
url = "https://drive.google.com/uc?id=1l_5RK28JRL19wpT22B-DY9We3TVXnnQQ"
cached_path = gdown.cached_download(url)
print(f"File cached at: {cached_path}")

# Subsequent calls return cached file immediately
cached_path_again = gdown.cached_download(url)  # No download, returns cached file
```

#### Hash Verification

```python
# SHA256 verification
url = "https://example.com/data.zip"
expected_hash = "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

try:
    path = gdown.cached_download(url, hash=expected_hash)
    print(f"File verified and cached: {path}")
except AssertionError as e:
    print(f"Hash verification failed: {e}")
```

#### Multiple Hash Algorithms

```python
# MD5 verification
gdown.cached_download(url, hash="md5:5d41402abc4b2a76b9719d911017c592")

# SHA1 verification
gdown.cached_download(url, hash="sha1:aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d")

# SHA512 verification
gdown.cached_download(url, hash="sha512:b109f3bbbc244eb82441917ed06d618b9008dd09b3befd1b5e07394c706a8bb980b1d7785e5976ec049b46df5f1326af5a2ea6d103fd07c95385ffab0cacbc86")
```

#### Custom Cache Location

```python
# Specify custom cache path
custom_path = "/tmp/my_cache/important_file.zip"
gdown.cached_download(url, path=custom_path, hash="sha256:abc123...")
```

#### Post-processing

```python
# Extract archive after download
def extract_archive(filepath):
    print(f"Extracting {filepath}")
    gdown.extractall(filepath, to="./extracted/")

# Download, verify, and extract
gdown.cached_download(
    url="https://example.com/archive.zip",
    hash="sha256:expected_hash_here",
    postprocess=extract_archive,
)
```

#### Integration with Download Options

```python
# Use all download() parameters
gdown.cached_download(
    url="https://drive.google.com/uc?id=FILE_ID",
    path="./cache/myfile.zip",
    hash="sha256:expected_hash",
    proxy="http://proxy:8080",
    speed=512 * 1024,  # 512 KB/s
    use_cookies=True,
    fuzzy=True,
)
```

### Hash Computation Utilities

```python { .api }
def md5sum(filename, blocksize=65536) -> str:
    """
    Calculate MD5 hash of file (deprecated).

    Parameters:
    - filename (str): Path to file to hash
    - blocksize (int): Block size for reading file chunks. Default: 65536

    Returns:
    str: MD5 hexdigest string

    Note: Deprecated and will be removed in future versions.
    Use hash parameter in cached_download() instead.
    """
```

#### Usage Example

```python
import gdown

# Calculate MD5 hash of downloaded file (deprecated usage)
file_path = "downloaded_file.zip"
hash_value = gdown.md5sum(file_path)
print(f"MD5: {hash_value}")

# Preferred approach: use the hash parameter in cached_download
gdown.cached_download(url, hash=f"md5:{hash_value}")
```

## Cache Directory Structure

### Default Cache Location

Files are cached in `~/.cache/gdown/` with URL-based naming:

```
~/.cache/gdown/
├── https-COLON--SLASH--SLASH-drive.google.com-SLASH-uc-QUESTION-id-EQUAL-FILE_ID
├── _dl_lock       # Download lock file
└── cookies.txt    # Cookie storage
```

### URL to Filename Mapping

URLs are converted to filenames by replacing special characters (a sketch of this mapping follows the list):

- `/` → `-SLASH-`
- `:` → `-COLON-`
- `=` → `-EQUAL-`
- `?` → `-QUESTION-`

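A minimal sketch that reproduces this mapping, assuming the default cache root shown above; the helper name `url_to_cache_path` is illustrative and not part of gdown's public API:

```python
import os.path as osp

def url_to_cache_path(url, cache_root="~/.cache/gdown"):
    # Hypothetical helper: apply the substitutions listed above to derive
    # the default cache filename for a given URL.
    name = (
        url.replace("/", "-SLASH-")
        .replace(":", "-COLON-")
        .replace("=", "-EQUAL-")
        .replace("?", "-QUESTION-")
    )
    return osp.join(osp.expanduser(cache_root), name)

print(url_to_cache_path("https://drive.google.com/uc?id=FILE_ID"))
# e.g. /home/<user>/.cache/gdown/https-COLON--SLASH--SLASH-drive.google.com-SLASH-uc-QUESTION-id-EQUAL-FILE_ID
```
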
## Error Handling

```python
import gdown
from gdown.exceptions import FileURLRetrievalError

try:
    # Hash mismatch example
    gdown.cached_download(
        "https://example.com/file.zip",
        hash="sha256:wrong_hash_value",
    )
except AssertionError as e:
    print(f"Hash verification failed: {e}")
    # Re-download with correct hash or investigate file corruption

try:
    # Download failure
    gdown.cached_download("https://invalid-url.com/file.zip")
except FileURLRetrievalError as e:
    print(f"Download failed: {e}")
```

## Supported Hash Algorithms

Any algorithm in Python's `hashlib.algorithms_guaranteed` is accepted; the most commonly used options are listed below, and the snippet after the list shows how to compute a matching hash string locally:

- **md5**: Fast but cryptographically weak (legacy support)
- **sha1**: Stronger than MD5 but still considered weak
- **sha256**: Recommended for most use cases
- **sha512**: Longer digest for workflows that require it

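If a published digest is not available, one option is to compute the expected hash string once from a trusted copy using Python's standard `hashlib`; the helper name `file_hash_string` below is illustrative, not a gdown API:

```python
import hashlib

def file_hash_string(path, algorithm="sha256", blocksize=65536):
    # Build an "algorithm:hexdigest" string suitable for cached_download(hash=...)
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(blocksize), b""):
            digest.update(chunk)
    return f"{algorithm}:{digest.hexdigest()}"

print(file_hash_string("downloaded_file.zip"))  # e.g. "sha256:<64 hex characters>"
```
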
## Best Practices

### Reliable Downloads

```python
import gdown

# Always use hash verification for production
def reliable_download(url, expected_hash, max_retries=3):
    for attempt in range(max_retries):
        try:
            return gdown.cached_download(url, hash=expected_hash)
        except AssertionError:
            if attempt == max_retries - 1:
                raise
            print(f"Hash mismatch, retrying... ({attempt + 1}/{max_retries})")
```

### Automation Workflows

```python
import gdown

# Pipeline with post-processing
def process_dataset(url, dataset_hash):
    # Download and verify
    archive_path = gdown.cached_download(url, hash=dataset_hash)

    # Extract
    extract_dir = "./data/"
    extracted_files = gdown.extractall(archive_path, to=extract_dir)

    # Process files
    for file_path in extracted_files:
        if file_path.endswith('.csv'):
            # Process CSV data
            pass

    return extracted_files
```