
# File Utilities

File handling utilities for downloading, caching, and managing pre-trained model files. These utilities handle automatic download of model weights, configurations, and tokenizer files from remote repositories with local caching support to avoid repeated downloads.

## Capabilities

### cached_path

Main function for downloading and caching files from URLs or returning local file paths.

```python { .api }
def cached_path(url_or_filename, cache_dir=None):
    """
    Download and cache a file from a URL, or return the path if it's a local file.

    Parameters:
    - url_or_filename (str): URL to download from, or a local file path
    - cache_dir (str, optional): Directory to cache downloaded files.
      Defaults to PYTORCH_TRANSFORMERS_CACHE.

    Returns:
    str: Path to the cached or local file

    Raises:
    EnvironmentError: If the file cannot be found or downloaded
    """
```

**Usage Examples:**

```python
from pytorch_transformers import cached_path

# Download and cache a model file
model_url = "https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin"
local_path = cached_path(model_url)
print(f"Model cached at: {local_path}")

# Use a custom cache directory
custom_cache = "./my_cache"
config_url = "https://huggingface.co/bert-base-uncased/resolve/main/config.json"
config_path = cached_path(config_url, cache_dir=custom_cache)

# A local file path is returned unchanged
local_file = "./my_model.bin"
path = cached_path(local_file)  # Returns "./my_model.bin"
```
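The dispatch between remote URLs and local paths can be sketched with the standard library. The `is_remote` helper below is illustrative, not part of the library's API:

```python
from urllib.parse import urlparse

def is_remote(url_or_filename):
    # http/https/s3 schemes are treated as remote downloads;
    # anything else falls through to local-path handling.
    # (On Windows, drive letters like "C:\..." parse as a
    # one-letter scheme, which this simple sketch ignores.)
    return urlparse(url_or_filename).scheme in ("http", "https", "s3")

print(is_remote("https://huggingface.co/bert-base-uncased/resolve/main/config.json"))  # True
print(is_remote("./my_model.bin"))  # False
```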

### Cache Directory Constants

Pre-defined cache directory paths used by the library for storing downloaded files.

```python { .api }
PYTORCH_TRANSFORMERS_CACHE: str
# Default cache directory for pytorch-transformers
# Typically resolves to: ~/.cache/torch/pytorch_transformers/

PYTORCH_PRETRAINED_BERT_CACHE: str
# Legacy cache directory for backward compatibility with pytorch-pretrained-bert
# Typically resolves to: ~/.pytorch_pretrained_bert/
```

**Usage Examples:**

```python
from pytorch_transformers import PYTORCH_TRANSFORMERS_CACHE, PYTORCH_PRETRAINED_BERT_CACHE
import os

# Check default cache locations
print(f"Default cache: {PYTORCH_TRANSFORMERS_CACHE}")
print(f"Legacy cache: {PYTORCH_PRETRAINED_BERT_CACHE}")

# List cached files
if os.path.exists(PYTORCH_TRANSFORMERS_CACHE):
    cached_files = os.listdir(PYTORCH_TRANSFORMERS_CACHE)
    print(f"Cached files: {len(cached_files)}")
    for file in cached_files[:5]:  # Show the first 5 files
        print(f"  {file}")

# Clear the cache (be careful!)
import shutil
# shutil.rmtree(PYTORCH_TRANSFORMERS_CACHE)  # Uncomment to clear the cache
```

### Model File Constants

Standard filenames used by the library for model components.

```python { .api }
WEIGHTS_NAME: str = "pytorch_model.bin"
# Default filename for PyTorch model weights

CONFIG_NAME: str = "config.json"
# Default filename for model configuration files

TF_WEIGHTS_NAME: str = "model.ckpt"
# Default filename for TensorFlow model weights
```

**Usage Examples:**

```python
from pytorch_transformers import WEIGHTS_NAME, CONFIG_NAME, TF_WEIGHTS_NAME, BertModel
import os

# Check if model files exist in a directory
model_dir = "./my_model"
weights_path = os.path.join(model_dir, WEIGHTS_NAME)
config_path = os.path.join(model_dir, CONFIG_NAME)

if os.path.exists(weights_path):
    print(f"Model weights found: {weights_path}")

if os.path.exists(config_path):
    print(f"Model config found: {config_path}")

# Save a model with the standard filenames
model = BertModel.from_pretrained("bert-base-uncased")
model.save_pretrained(model_dir)  # Creates pytorch_model.bin and config.json
```

## Caching Behavior

### Automatic Download and Caching

When you load a pre-trained model for the first time, the library automatically:

1. **Downloads** model files from the Hugging Face Model Hub
2. **Caches** the files locally to avoid future downloads
3. **Checks** the server's ETag so updated remote files are re-downloaded
4. **Returns** the cached file path for loading

```python
from pytorch_transformers import BertModel

# First time: downloads and caches the files
model = BertModel.from_pretrained("bert-base-uncased")

# Subsequent times: uses the cached files
model = BertModel.from_pretrained("bert-base-uncased")  # Much faster!
```

### Cache Structure

The cache directory contains hash-named subdirectories for different file types:

```
~/.cache/torch/pytorch_transformers/
├── 0123abc...def/            # Hash-based subdirectory
│   ├── pytorch_model.bin     # Model weights
│   ├── config.json           # Model configuration
│   └── tokenizer.json        # Tokenizer files
├── 4567ghi...jkl/
│   └── vocab.txt             # Vocabulary files
└── ...
```
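The hash-based entry names are derived from the source URL (and, for downloaded files, the server's ETag). Below is a stdlib-only sketch of a plausible naming scheme; the exact format is an internal detail of the library and may differ:

```python
import hashlib

def cache_entry_name(url, etag=None):
    # Hash the URL so every remote file gets a stable,
    # filesystem-safe cache name; appending a hash of the
    # ETag forces a re-download when the remote file changes.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    if etag is not None:
        name += "." + hashlib.sha256(etag.encode("utf-8")).hexdigest()
    return name

url = "https://huggingface.co/bert-base-uncased/resolve/main/config.json"
print(cache_entry_name(url)[:12])  # stable 64-char hex name, truncated here
```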

### Environment Variables

Control caching behavior through environment variables:

```bash
# Set a custom cache directory
export PYTORCH_TRANSFORMERS_CACHE="/path/to/my/cache"

# Redirect the cache to a throwaway location (e.g. a temp directory)
export PYTORCH_TRANSFORMERS_CACHE="/tmp"

# Use offline mode (only use cached files)
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```
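The cache directory follows the usual env-var-with-fallback pattern. A sketch of how such a default is typically resolved, assuming the default path documented above:

```python
import os

# The environment variable wins; otherwise fall back to the
# documented default under the user's home directory.
default_cache = os.path.join(
    os.path.expanduser("~"), ".cache", "torch", "pytorch_transformers"
)
cache_dir = os.environ.get("PYTORCH_TRANSFORMERS_CACHE", default_cache)
print(cache_dir)
```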

### Cache Management

```python
import os
import shutil
from pytorch_transformers import PYTORCH_TRANSFORMERS_CACHE

def get_cache_size():
    """Get the total size of the cache directory in MB."""
    if not os.path.exists(PYTORCH_TRANSFORMERS_CACHE):
        return 0

    total_size = 0
    for dirpath, dirnames, filenames in os.walk(PYTORCH_TRANSFORMERS_CACHE):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            total_size += os.path.getsize(filepath)

    return total_size / (1024 * 1024)  # Convert to MB

def clear_cache():
    """Clear all cached files."""
    if os.path.exists(PYTORCH_TRANSFORMERS_CACHE):
        shutil.rmtree(PYTORCH_TRANSFORMERS_CACHE)
        print(f"Cache cleared: {PYTORCH_TRANSFORMERS_CACHE}")

def list_cached_models():
    """List all cached model directories."""
    if not os.path.exists(PYTORCH_TRANSFORMERS_CACHE):
        return []

    cached_dirs = []
    for item in os.listdir(PYTORCH_TRANSFORMERS_CACHE):
        item_path = os.path.join(PYTORCH_TRANSFORMERS_CACHE, item)
        if os.path.isdir(item_path):
            # Check whether it contains model files
            has_weights = os.path.exists(os.path.join(item_path, "pytorch_model.bin"))
            has_config = os.path.exists(os.path.join(item_path, "config.json"))
            if has_weights or has_config:
                cached_dirs.append(item)

    return cached_dirs

# Usage
print(f"Cache size: {get_cache_size():.1f} MB")
print(f"Cached models: {len(list_cached_models())}")
```

## Network Configuration

### Proxy Support

The caching utilities support HTTP proxies for downloading files in restricted network environments:

```python
import os

# Set proxy environment variables
os.environ['HTTP_PROXY'] = 'http://proxy.company.com:8080'
os.environ['HTTPS_PROXY'] = 'https://proxy.company.com:8080'

# Downloads will now go through the proxy
from pytorch_transformers import BertModel
model = BertModel.from_pretrained("bert-base-uncased")
```

### Timeout Configuration

```python
import os

# Set the download timeout (in seconds)
os.environ['HF_HUB_DOWNLOAD_TIMEOUT'] = '300'  # 5 minutes

# For very slow connections
os.environ['HF_HUB_DOWNLOAD_TIMEOUT'] = '1800'  # 30 minutes
```

### Offline Mode

When working in environments without internet access:

```python
import os
from pytorch_transformers import BertModel

# Enable offline mode - only use cached files
os.environ['HF_DATASETS_OFFLINE'] = '1'
os.environ['TRANSFORMERS_OFFLINE'] = '1'

try:
    # This only works if the files are already cached
    model = BertModel.from_pretrained("bert-base-uncased")
except OSError as e:
    print(f"Model not in cache: {e}")
```

## Error Handling

The file utilities provide informative error messages for common issues:

```python
from pytorch_transformers import cached_path

try:
    # Invalid URL
    path = cached_path("https://invalid-url.com/model.bin")
except EnvironmentError as e:
    print(f"Download failed: {e}")

try:
    # Local file doesn't exist
    path = cached_path("./nonexistent_model.bin")
except EnvironmentError as e:
    print(f"File not found: {e}")

try:
    # Network issues
    path = cached_path("https://valid-url.com/model.bin")
except EnvironmentError as e:
    print(f"Network error: {e}")
```
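Since transient network failures also surface as `EnvironmentError`, a small retry wrapper with exponential backoff can be layered on top of `cached_path`. This is an illustrative sketch; `with_retries` is not part of the library:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    # Call fn(); on EnvironmentError, sleep with exponential
    # backoff (1s, 2s, 4s, ...) and retry, re-raising the
    # error after the final attempt.
    for attempt in range(attempts):
        try:
            return fn()
        except EnvironmentError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Usage with cached_path (not executed here):
# path = with_retries(lambda: cached_path(model_url))
```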

## Integration with Model Loading

The file utilities are automatically used by all `from_pretrained()` methods:

```python
# These all use cached_path internally
from pytorch_transformers import (
    AutoModel, AutoTokenizer, AutoConfig,
    BertModel, BertTokenizer, BertConfig
)

# Download and cache if needed
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = AutoConfig.from_pretrained("bert-base-uncased")

# Custom cache directory for a specific model
model = BertModel.from_pretrained(
    "bert-base-uncased",
    cache_dir="./my_bert_cache"
)
```