# File Utilities

File handling utilities for downloading, caching, and managing pre-trained model files. These utilities handle automatic download of model weights, configurations, and tokenizer files from remote repositories, with local caching to avoid repeated downloads.

## Capabilities

### cached_path

Main function for downloading and caching files from URLs, or returning local file paths unchanged.

```python { .api }
def cached_path(url_or_filename, cache_dir=None):
    """
    Download and cache a file from a URL, or return the path if it's a local file.

    Parameters:
    - url_or_filename (str): URL to download from, or local file path
    - cache_dir (str, optional): Directory to cache downloaded files.
      Defaults to PYTORCH_TRANSFORMERS_CACHE.

    Returns:
        str: Path to the cached or local file

    Raises:
        EnvironmentError: If the file cannot be found or downloaded
    """
```
**Usage Examples:**

```python
from pytorch_transformers import cached_path

# Download and cache a model file
model_url = "https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin"
local_path = cached_path(model_url)
print(f"Model cached at: {local_path}")

# Use a custom cache directory
custom_cache = "./my_cache"
config_url = "https://huggingface.co/bert-base-uncased/resolve/main/config.json"
config_path = cached_path(config_url, cache_dir=custom_cache)

# A local file path is returned unchanged
local_file = "./my_model.bin"
path = cached_path(local_file)  # Returns "./my_model.bin"
```
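The URL-versus-local-path dispatch above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the library's actual code, and `is_remote` is a hypothetical helper name:

```python
from urllib.parse import urlparse

def is_remote(url_or_filename):
    # cached_path downloads anything with an http/https/s3 scheme and
    # treats everything else as a local filesystem path.
    return urlparse(url_or_filename).scheme in ("http", "https", "s3")

print(is_remote("https://huggingface.co/bert-base-uncased/resolve/main/config.json"))  # True
print(is_remote("./my_model.bin"))  # False
```

Relative paths, absolute paths, and bare filenames all have an empty URL scheme, so they fall through to the local-file branch.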
### Cache Directory Constants

Pre-defined cache directory paths used by the library for storing downloaded files.

```python { .api }
PYTORCH_TRANSFORMERS_CACHE: str
# Default cache directory for pytorch-transformers
# Typically resolves to: ~/.cache/torch/pytorch_transformers/

PYTORCH_PRETRAINED_BERT_CACHE: str
# Legacy cache directory for backward compatibility with pytorch-pretrained-bert
# Typically resolves to: ~/.pytorch_pretrained_bert/
```
**Usage Examples:**

```python
import os

from pytorch_transformers import PYTORCH_TRANSFORMERS_CACHE, PYTORCH_PRETRAINED_BERT_CACHE

# Check the default cache locations
print(f"Default cache: {PYTORCH_TRANSFORMERS_CACHE}")
print(f"Legacy cache: {PYTORCH_PRETRAINED_BERT_CACHE}")

# List cached files
if os.path.exists(PYTORCH_TRANSFORMERS_CACHE):
    cached_files = os.listdir(PYTORCH_TRANSFORMERS_CACHE)
    print(f"Cached files: {len(cached_files)}")
    for file in cached_files[:5]:  # Show the first 5 files
        print(f"  {file}")

# Clear the cache (be careful!)
import shutil
# shutil.rmtree(PYTORCH_TRANSFORMERS_CACHE)  # Uncomment to clear the cache
```
83
84
### Model File Constants
85
86
Standard filenames used by the library for model components.
87
88
```python { .api }
89
WEIGHTS_NAME: str = "pytorch_model.bin"
90
# Default filename for PyTorch model weights
91
92
CONFIG_NAME: str = "config.json"
93
# Default filename for model configuration files
94
95
TF_WEIGHTS_NAME: str = "model.ckpt"
96
# Default filename for TensorFlow model weights
97
```
**Usage Examples:**

```python
import os

from pytorch_transformers import WEIGHTS_NAME, CONFIG_NAME, BertModel

# Check whether model files exist in a directory
model_dir = "./my_model"
weights_path = os.path.join(model_dir, WEIGHTS_NAME)
config_path = os.path.join(model_dir, CONFIG_NAME)

if os.path.exists(weights_path):
    print(f"Model weights found: {weights_path}")

if os.path.exists(config_path):
    print(f"Model config found: {config_path}")

# Save a model with the standard filenames
model = BertModel.from_pretrained("bert-base-uncased")
model.save_pretrained(model_dir)  # Creates pytorch_model.bin and config.json
```
## Caching Behavior

### Automatic Download and Caching

When you load a pre-trained model for the first time, the library automatically:

1. **Downloads** model files from the Hugging Face Model Hub
2. **Caches** the files locally to avoid future downloads
3. **Checks** freshness against the remote file's ETag on later requests
4. **Returns** the cached file path for loading

```python
from pytorch_transformers import BertModel

# First time: downloads and caches the files
model = BertModel.from_pretrained("bert-base-uncased")

# Subsequent times: uses the cached files
model = BertModel.from_pretrained("bert-base-uncased")  # Much faster!
```
### Cache Structure

The cache directory is flat: each downloaded file is stored under a hashed filename (derived from its URL and the remote ETag), alongside a `.json` metadata file that records the original URL:

```
~/.cache/torch/pytorch_transformers/
├── 0123abc...def          # cached file (e.g. model weights), name hashed from URL + ETag
├── 0123abc...def.json     # metadata: original URL and ETag
├── 4567ghi...jkl          # another cached file (e.g. vocab.txt)
├── 4567ghi...jkl.json
└── ...
```
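The hashed filenames can be reproduced with a short sketch. The sha256-of-URL-plus-ETag scheme below mirrors the library's `url_to_filename` helper, but treat the exact format as an assumption rather than a guaranteed contract:

```python
import hashlib

def url_to_filename(url, etag=None):
    # Hash the URL; append a hash of the ETag so that a new remote version
    # gets a new cache entry. (Assumed format: "<sha256(url)>.<sha256(etag)>".)
    filename = hashlib.sha256(url.encode("utf-8")).hexdigest()
    if etag is not None:
        filename += "." + hashlib.sha256(etag.encode("utf-8")).hexdigest()
    return filename

print(url_to_filename("https://example.com/model.bin", etag='"abc123"'))
```

Because the ETag is part of the name, a changed remote file hashes to a different cache entry instead of silently overwriting the old one.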
### Environment Variables

Control caching behavior through environment variables:

```bash
# Set a custom cache directory
export PYTORCH_TRANSFORMERS_CACHE="/path/to/my/cache"

# Cache to a temporary location instead (typically cleared on reboot)
export PYTORCH_TRANSFORMERS_CACHE="/tmp"

# Use offline mode (only use cached files)
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```
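The cache-directory override can be checked from Python. The fallback path below is an assumption based on the typical default location noted earlier; the library may compute it slightly differently:

```python
import os

# The cache location is taken from the PYTORCH_TRANSFORMERS_CACHE environment
# variable when set, falling back to the default under the home directory.
default_cache = os.path.join(
    os.path.expanduser("~"), ".cache", "torch", "pytorch_transformers"
)
cache_dir = os.getenv("PYTORCH_TRANSFORMERS_CACHE", default_cache)
print(f"Resolved cache directory: {cache_dir}")
```

Setting the variable before importing the library ensures every download lands in the overridden location.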
### Cache Management

```python
import json
import os
import shutil

from pytorch_transformers import PYTORCH_TRANSFORMERS_CACHE

def get_cache_size():
    """Get the total size of the cache directory in MB."""
    if not os.path.exists(PYTORCH_TRANSFORMERS_CACHE):
        return 0

    total_size = 0
    for dirpath, dirnames, filenames in os.walk(PYTORCH_TRANSFORMERS_CACHE):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            total_size += os.path.getsize(filepath)

    return total_size / (1024 * 1024)  # Convert bytes to MB

def clear_cache():
    """Clear all cached files."""
    if os.path.exists(PYTORCH_TRANSFORMERS_CACHE):
        shutil.rmtree(PYTORCH_TRANSFORMERS_CACHE)
        print(f"Cache cleared: {PYTORCH_TRANSFORMERS_CACHE}")

def list_cached_urls():
    """List the original URLs of cached files, read from the .json
    metadata file stored alongside each cached file."""
    if not os.path.exists(PYTORCH_TRANSFORMERS_CACHE):
        return []

    urls = []
    for item in os.listdir(PYTORCH_TRANSFORMERS_CACHE):
        if item.endswith(".json"):
            with open(os.path.join(PYTORCH_TRANSFORMERS_CACHE, item)) as f:
                urls.append(json.load(f).get("url", item))
    return urls

# Usage
print(f"Cache size: {get_cache_size():.1f} MB")
print(f"Cached files: {len(list_cached_urls())}")
```
## Network Configuration

### Proxy Support

The caching utilities support HTTP(S) proxies for downloading files in restricted network environments:

```python
import os

# Set proxy environment variables
os.environ['HTTP_PROXY'] = 'http://proxy.company.com:8080'
os.environ['HTTPS_PROXY'] = 'http://proxy.company.com:8080'

# Downloads will now go through the proxy
from pytorch_transformers import BertModel
model = BertModel.from_pretrained("bert-base-uncased")
```

### Timeout Configuration

```python
import os

# Set the download timeout (in seconds)
os.environ['HF_HUB_DOWNLOAD_TIMEOUT'] = '300'  # 5 minutes

# For very slow connections
os.environ['HF_HUB_DOWNLOAD_TIMEOUT'] = '1800'  # 30 minutes
```
### Offline Mode

When working in environments without internet access:

```python
import os

from pytorch_transformers import BertModel

# Enable offline mode - only use cached files
os.environ['HF_DATASETS_OFFLINE'] = '1'
os.environ['TRANSFORMERS_OFFLINE'] = '1'

try:
    # This only works if the files are already cached
    model = BertModel.from_pretrained("bert-base-uncased")
except OSError as e:
    print(f"Model not in cache: {e}")
```
## Error Handling

The file utilities raise informative errors for common issues:

```python
from pytorch_transformers import cached_path

try:
    # Unreachable URL
    path = cached_path("https://invalid-url.com/model.bin")
except EnvironmentError as e:
    print(f"Download failed: {e}")

try:
    # Local file doesn't exist
    path = cached_path("./nonexistent_model.bin")
except EnvironmentError as e:
    print(f"File not found: {e}")

try:
    # Network issues while fetching an otherwise valid URL
    path = cached_path("https://valid-url.com/model.bin")
except EnvironmentError as e:
    print(f"Network error: {e}")
```
## Integration with Model Loading

The file utilities are used automatically by all `from_pretrained()` methods:

```python
# These all use cached_path internally
from pytorch_transformers import (
    AutoModel, AutoTokenizer, AutoConfig,
    BertModel, BertTokenizer, BertConfig
)

# Download and cache if needed
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = AutoConfig.from_pretrained("bert-base-uncased")

# Use a custom cache directory for specific models
model = BertModel.from_pretrained(
    "bert-base-uncased",
    cache_dir="./my_bert_cache"
)
```