# Utilities

FastText provides various utility functions for model optimization, text processing, pre-trained model management, and advanced model manipulation. These utilities enhance the core functionality with performance optimizations and convenience features.

## Capabilities

### Model Optimization

Optimize model size and performance through quantization and matrix manipulation.

```python { .api }
def quantize(input=None, qout=False, cutoff=0, retrain=False, epoch=None,
             lr=None, thread=None, verbose=None, dsub=2, qnorm=False):
    """
    Quantize model to reduce memory usage and file size.

    Args:
        input (str, optional): Path to training data for retraining
        qout (bool): Quantize output matrix (default: False)
        cutoff (int): Vocabulary cutoff for quantization (default: 0)
        retrain (bool): Retrain model after quantization (default: False)
        epoch (int, optional): Number of retraining epochs
        lr (float, optional): Learning rate for retraining
        thread (int, optional): Number of threads for retraining
        verbose (int, optional): Verbosity level
        dsub (int): Dimension of subspace for quantization (default: 2)
        qnorm (bool): Quantize normalization (default: False)

    Note:
        Quantization reduces model accuracy but significantly decreases size.
        Some operations (get_input_matrix, get_output_matrix) become unavailable.
    """

def set_matrices(input_matrix, output_matrix):
    """
    Set custom input and output matrices.

    Args:
        input_matrix (numpy.ndarray): Custom input matrix of shape (vocab_size, dim), float32 type
        output_matrix (numpy.ndarray): Custom output matrix of shape (vocab_size, dim), float32 type

    Raises:
        ValueError: If model is quantized or matrix dimensions don't match

    Note:
        Matrices are automatically converted to float32 type. Use with caution as this
        replaces the learned representations with custom values.
    """
```

#### Usage Example

```python
import fasttext
import numpy as np

# Load a supervised model (fastText only supports quantizing supervised models)
model = fasttext.load_model('large_model.bin')
print(f"Vector dimension: {model.get_dimension()}")

# Basic quantization
model.quantize()
print(f"Model quantized: {model.is_quantized()}")

# Advanced quantization with retraining
model = fasttext.load_model('large_model.bin')  # Reload original
model.quantize(
    input='train.txt',  # Training data used for retraining
    qout=True,          # Quantize output matrix
    cutoff=100000,      # Vocabulary cutoff; retraining only takes effect when cutoff > 0
    retrain=True,       # Enable retraining
    epoch=5,            # Retraining epochs
    lr=0.01,            # Lower learning rate
    dsub=2,             # Subspace dimension
    verbose=2           # Show progress
)

# Save quantized model (much smaller file)
model.save_model('quantized_model.ftz')

# Custom matrix manipulation (before quantization)
model = fasttext.load_model('model.bin')
if not model.is_quantized():
    input_matrix = model.get_input_matrix()
    output_matrix = model.get_output_matrix()

    # Apply custom transformations
    scaled_input = input_matrix * 0.8
    norms = np.linalg.norm(output_matrix, axis=1, keepdims=True)
    normalized_output = output_matrix / np.maximum(norms, 1e-12)  # Avoid division by zero

    # Set modified matrices
    model.set_matrices(scaled_input, normalized_output)
```
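
To see why `dsub` and quantization affect model size, the arithmetic can be sketched as follows. This is a rough estimate under stated assumptions, not fastText's own accounting: it assumes product quantization with 256 centroids per sub-quantizer (one byte per code) and ignores the dictionary, stored norms, and the output matrix. The function names `estimate_quantized_bytes` and `dense_bytes` are hypothetical helpers, not part of the fastText API.

```python
import math

def estimate_quantized_bytes(rows, dim, dsub=2, centroids=256):
    """Rough size in bytes of a product-quantized matrix (estimate only)."""
    sub_quantizers = math.ceil(dim / dsub)   # sub-vectors per row
    codes = rows * sub_quantizers            # assumed: one byte per sub-vector code
    codebooks = centroids * dim * 4          # float32 centroids over all sub-quantizers
    return codes + codebooks

def dense_bytes(rows, dim):
    """Size in bytes of the same matrix stored as float32."""
    return rows * dim * 4

rows, dim = 2_000_000, 100
for dsub in (2, 4):
    ratio = dense_bytes(rows, dim) / estimate_quantized_bytes(rows, dim, dsub)
    print(f"dsub={dsub}: roughly {ratio:.1f}x smaller")
```

A larger `dsub` stores fewer codes per row, so the estimate shrinks, at the cost of a coarser approximation of each vector.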

### Model Persistence

Save models to disk. `save_model` writes the same binary format whatever the file extension; by convention `.bin` is used for full models and `.ftz` for quantized ones.

```python { .api }
def save_model(path):
    """
    Save model to file.

    Args:
        path (str): Output file path (.bin by convention for full models, .ftz for quantized models)

    Note:
        The file extension does not change what is written.
        A model saved before quantization keeps full precision and all functionality.
        A model saved after quantize() is much smaller but loses some precision.
    """
```

#### Usage Example

```python
import fasttext
import os

# Train a supervised model so that it can be quantized later
# (data.txt must contain lines prefixed with __label__)
model = fasttext.train_supervised('data.txt')

# Save the full-precision model
model.save_model('model.bin')
bin_size = os.path.getsize('model.bin')
print(f"Binary model: {bin_size / 1024 / 1024:.1f} MB")

# Quantize, then save; the size reduction comes from quantize(),
# not from the .ftz extension
model.quantize()
model.save_model('quantized_model.ftz')
quantized_size = os.path.getsize('quantized_model.ftz')
compression_ratio = bin_size / quantized_size

print(f"Quantized model: {quantized_size / 1024 / 1024:.1f} MB")
print(f"Compression ratio: {compression_ratio:.1f}x")
```
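
When several saved models need comparing, the size check can be wrapped in small helpers. The sketch below is illustrative only: `size_report` and `compression_ratio` are hypothetical names, and the demonstration uses stand-in files written to a temporary directory instead of real models.

```python
import os
import tempfile

def size_report(paths):
    """Return {path: size in MB} for the paths that exist on disk."""
    return {p: os.path.getsize(p) / (1024 * 1024) for p in paths if os.path.isfile(p)}

def compression_ratio(original_path, compressed_path):
    """Original size divided by compressed size; None if the compressed file is empty."""
    compressed = os.path.getsize(compressed_path)
    if compressed == 0:
        return None
    return os.path.getsize(original_path) / compressed

# Demonstration with stand-in files, not real fastText models
with tempfile.TemporaryDirectory() as tmp:
    full = os.path.join(tmp, "model.bin")
    quantized = os.path.join(tmp, "quantized_model.ftz")
    with open(full, "wb") as f:
        f.write(b"\0" * 4096)
    with open(quantized, "wb") as f:
        f.write(b"\0" * 512)
    print(f"Compression ratio: {compression_ratio(full, quantized):.1f}x")
```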

### Pre-trained Model Management

Download and manage pre-trained FastText models for multiple languages.

```python { .api }
# Import utility module
import fasttext.util

def download_model(lang_id, if_exists='strict'):
    """
    Download pre-trained FastText model for specified language.

    Args:
        lang_id (str): Language identifier (e.g., 'en', 'fr', 'de')
        if_exists (str): Action if model exists - 'strict', 'ignore', 'overwrite'

    Returns:
        str: Path to downloaded model file (cc.{lang_id}.300.bin)

    Raises:
        Exception: If language ID is not supported

    Note:
        Always downloads 300-dimensional models from Common Crawl vectors
    """

# Set of valid language IDs (157 languages supported)
valid_lang_ids = {"af", "sq", "als", "am", "ar", "an", "hy", "as", "ast",
                  "az", "ba", "eu", "bar", "be", "bn", "bh", "bpy", "bs",
                  "br", "bg", "my", "ca", "ceb", "bcl", "ce", "zh", "cv",
                  "co", "hr", "cs", "da", "dv", "nl", "pa", "arz", "eml",
                  "en", "myv", "eo", "et", "hif", "fi", "fr", "gl", "ka",
                  "de", "gom", "el", "gu", "ht", "he", "mrj", "hi", "hu",
                  "is", "io", "ilo", "id", "ia", "ga", "it", "ja", "jv",
                  "kn", "pam", "kk", "km", "ky", "ko", "ku", "ckb", "la",
                  "lv", "li", "lt", "lmo", "nds", "lb", "mk", "mai", "mg",
                  "ms", "ml", "mt", "gv", "mr", "mzn", "mhr", "min", "xmf",
                  "mwl", "mn", "nah", "nap", "ne", "new", "frr", "nso",
                  "no", "nn", "oc", "or", "os", "pfl", "ps", "fa", "pms",
                  "pl", "pt", "qu", "ro", "rm", "ru", "sah", "sa", "sc",
                  "sco", "gd", "sr", "sh", "scn", "sd", "si", "sk", "sl",
                  "so", "azb", "es", "su", "sw", "sv", "tl", "tg", "ta",
                  "tt", "te", "th", "bo", "tr", "tk", "uk", "hsb", "ur",
                  "ug", "uz", "vec", "vi", "vo", "wa", "war", "cy", "vls",
                  "fy", "pnb", "yi", "yo", "diq", "zea"}
```

#### Usage Example

```python
import fasttext
import fasttext.util

# Download English model
model_path = fasttext.util.download_model('en', if_exists='ignore')
model = fasttext.load_model(model_path)

# Get a smaller French model: download_model always fetches the
# 300-dimensional file, so reduce it after loading
fr_path = fasttext.util.download_model('fr', if_exists='ignore')
fr_model = fasttext.load_model(fr_path)
fasttext.util.reduce_model(fr_model, 100)

# Check available languages
print(f"Available languages: {len(fasttext.util.valid_lang_ids)}")
print(f"Sample languages: {sorted(fasttext.util.valid_lang_ids)[:10]}")

# Download multiple models
languages = ['en', 'es', 'fr', 'de', 'it']
models = {}

for lang in languages:
    try:
        path = fasttext.util.download_model(lang, if_exists='ignore')
        models[lang] = fasttext.load_model(path)
        print(f"Loaded {lang} model: {models[lang].get_dimension()} dimensions")
    except Exception as e:  # Unsupported IDs raise a plain Exception
        print(f"Failed to download {lang}: {e}")

# Use multilingual models
text_samples = {
    'en': 'Hello world',
    'es': 'Hola mundo',
    'fr': 'Bonjour monde',
    'de': 'Hallo Welt'
}

for lang, text in text_samples.items():
    if lang in models:
        vector = models[lang].get_sentence_vector(text)
        print(f"{lang}: '{text}' -> vector shape {vector.shape}")
```
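
Because the pre-trained files are large, it can help to work out what actually needs downloading before calling `download_model`. The following is a minimal sketch, assuming the `cc.{lang_id}.300.bin` naming described above. `KNOWN_LANG_IDS` is a hypothetical five-language subset used for illustration (the real set is `fasttext.util.valid_lang_ids`), and `expected_model_filename` and `plan_downloads` are hypothetical helper names.

```python
import os
import tempfile

# Hypothetical subset for illustration; the real set is fasttext.util.valid_lang_ids
KNOWN_LANG_IDS = {"en", "es", "fr", "de", "it"}

def expected_model_filename(lang_id):
    """File name pattern described above: cc.{lang_id}.300.bin."""
    return f"cc.{lang_id}.300.bin"

def plan_downloads(lang_ids, directory=".", known_ids=KNOWN_LANG_IDS):
    """Split requested IDs into already present, to download, and unsupported."""
    plan = {"present": [], "download": [], "unsupported": []}
    for lang_id in lang_ids:
        if lang_id not in known_ids:
            plan["unsupported"].append(lang_id)
        elif os.path.isfile(os.path.join(directory, expected_model_filename(lang_id))):
            plan["present"].append(lang_id)
        else:
            plan["download"].append(lang_id)
    return plan

# Demonstration with a stub file standing in for a downloaded model
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, expected_model_filename("en")), "wb") as f:
        f.write(b"stub")
    print(plan_downloads(["en", "fr", "xx"], directory=tmp))
```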

### Model Dimension Reduction

Reduce model dimensions using Principal Component Analysis for memory efficiency.

```python { .api }
def reduce_model(ft_model, target_dim):
    """
    Reduce model dimensions using PCA.

    Args:
        ft_model: FastText model object
        target_dim (int): Target dimension size (must be < current dimension)

    Returns:
        _FastText: The model that was passed in, with its matrices reduced in place

    Note:
        Dimension reduction may impact model quality but reduces memory usage.
        The original full-dimension matrices are replaced, so reload the model
        from disk if they are needed again.
    """
```

#### Usage Example

```python
import fasttext
import fasttext.util

# Load high-dimensional model
model = fasttext.load_model('cc.en.300.bin')
print(f"Original dimensions: {model.get_dimension()}")

# Query the original model first: reduce_model modifies the model in place
original_neighbors = model.get_nearest_neighbors('king', k=5)

# Reduce dimensions (the returned object is the same model, now reduced)
reduced_model = fasttext.util.reduce_model(model, 100)
print(f"Reduced dimensions: {reduced_model.get_dimension()}")

# Compare performance
reduced_neighbors = reduced_model.get_nearest_neighbors('king', k=5)

print("Original model neighbors:")
for score, word in original_neighbors:
    print(f"  {word}: {score:.4f}")

print("Reduced model neighbors:")
for score, word in reduced_neighbors:
    print(f"  {word}: {score:.4f}")

# Save reduced model
reduced_model.save_model('cc.en.100.reduced.bin')
```
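
The idea behind PCA-based reduction can be sketched with NumPy alone. This is a generic textbook PCA on a random matrix, offered as an illustration of the technique and not as the code `reduce_model` runs internally. The function name `pca_reduce` is hypothetical.

```python
import numpy as np

def pca_reduce(matrix, target_dim):
    """Project the rows of `matrix` onto its top `target_dim` principal components."""
    if not 0 < target_dim < matrix.shape[1]:
        raise ValueError("target_dim must be positive and smaller than the current dimension")
    data = matrix.astype(np.float64)
    centered = data - data.mean(axis=0, keepdims=True)
    covariance = centered.T @ centered / max(len(data) - 1, 1)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # eigenvalues in ascending order
    projection = eigenvectors[:, ::-1][:, :target_dim]      # largest-variance directions first
    return (centered @ projection).astype(np.float32), projection

# Random stand-in for an embedding matrix
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 20)).astype(np.float32)
reduced, projection = pca_reduce(vectors, 5)
print(vectors.shape, "->", reduced.shape)
```

The same projection would have to be applied to any other matrix that must stay comparable with the reduced one, which is why the projection is returned alongside the reduced data.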

### Evaluation Utilities

Utility functions for model evaluation and metric calculation.

```python { .api }
def test(predictions, labels, k=1):
    """
    Calculate precision and recall from predictions and true labels.

    Args:
        predictions (list): List of predicted label sequences, one per sample
        labels (list): List of true label lists for each sample
        k (int): Number of top predictions to consider (default: 1)

    Returns:
        tuple: (precision, recall) at k
    """

def find_nearest_neighbor(query, vectors, ban_set, cossims=None):
    """
    Find nearest vector to query, excluding banned items.

    Args:
        query (numpy.ndarray): Query vector
        vectors (numpy.ndarray): Matrix of candidate vectors
        ban_set (set): Set of indices to exclude from search
        cossims (numpy.ndarray, optional): Pre-computed cosine similarities

    Returns:
        int: Index of nearest neighbor
    """
```

#### Usage Example

```python
import fasttext
import fasttext.util
import numpy as np

# Evaluate custom predictions
model = fasttext.load_model('classifier.bin')

# Generate predictions
test_texts = [
    "Great movie, loved it!",
    "Terrible film.",
    "It was okay."
]

true_labels = [
    ['__label__positive'],
    ['__label__negative'],
    ['__label__neutral']
]

predictions = []
for text in test_texts:
    pred_labels, pred_probs = model.predict(text, k=1)
    predictions.append(pred_labels)  # Only the labels are compared

# Calculate metrics
precision, recall = fasttext.util.test(predictions, true_labels, k=1)
print(f"Custom evaluation - Precision: {precision:.4f}, Recall: {recall:.4f}")

# Find nearest neighbors with exclusions
# Build one vector per vocabulary word so that row indices match word IDs
vocab = model.get_words()
word_vectors = np.array([model.get_word_vector(w) for w in vocab])
# Normalize rows so that similarity scores rank by cosine similarity
word_vectors /= np.maximum(np.linalg.norm(word_vectors, axis=1, keepdims=True), 1e-12)
query_word = 'king'
query_vector = model.get_word_vector(query_word)

# Exclude the query word itself and some others (-1 marks words missing from the vocabulary)
ban_set = {model.get_word_id(w) for w in (query_word, 'the', 'a')} - {-1}

nearest_idx = fasttext.util.find_nearest_neighbor(
    query_vector,
    word_vectors,
    ban_set
)

# Convert index back to word
nearest_word = vocab[nearest_idx]
print(f"Nearest neighbor to '{query_word}': {nearest_word}")
```
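
To make the metric concrete, precision and recall at k can be computed by hand in a few lines. The sketch below is a plain-Python illustration of the calculation, assuming predictions are given as label lists (one per sample). `precision_recall_at_k` is a hypothetical name and the labels are made-up sample data, not model output.

```python
def precision_recall_at_k(predictions, labels, k=1):
    """Precision and recall at k from predicted and true label lists."""
    hits = 0              # predicted labels that are correct
    samples = 0           # number of samples evaluated
    true_label_count = 0  # total number of true labels
    for predicted, truth in zip(predictions, labels):
        hits += sum(1 for label in predicted[:k] if label in truth)
        samples += 1
        true_label_count += len(truth)
    precision = hits / (k * samples) if samples else 0.0
    recall = hits / true_label_count if true_label_count else 0.0
    return precision, recall

# Made-up predictions for three samples
sample_predictions = [["__label__positive"], ["__label__positive"], ["__label__neutral"]]
sample_truth = [["__label__positive"], ["__label__negative"], ["__label__neutral"]]

precision, recall = precision_recall_at_k(sample_predictions, sample_truth, k=1)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}")
```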

### Text Processing

Additional text processing utilities for consistency and preprocessing.

```python { .api }
def tokenize(text):
    """
    Tokenize text using FastText's internal tokenizer.

    Args:
        text (str): Input text to tokenize

    Returns:
        list: List of tokens following FastText's tokenization rules

    Note:
        This ensures consistency with training data preprocessing
    """
```

#### Usage Example

```python
import fasttext

# Consistent tokenization
# (fastText splits on whitespace, so punctuation stays attached to words)
texts = [
    "Hello, world! How are you?",
    "E-mail: user@domain.com (important)",
    "123.45 is a number, isn't it?",
    "Visit https://example.com for more."
]

for text in texts:
    tokens = fasttext.tokenize(text)
    print(f"'{text}'")
    print(f"  Tokens: {tokens}")
    print(f"  Count: {len(tokens)}")
    print()

# Compare with model preprocessing
model = fasttext.load_model('model.bin')
sample_text = "This is a test sentence."

# Method 1: Direct tokenization
tokens1 = fasttext.tokenize(sample_text)

# Method 2: Model preprocessing
words, labels = model.get_line(sample_text)

print(f"Direct tokenization: {tokens1}")
print(f"Model preprocessing: {words}")
# The two can differ, for example if get_line appends an end-of-sentence token
print(f"Are they equal? {tokens1 == words}")
```
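
For intuition about what whitespace-based tokenization produces, here is a plain-Python approximation. It is a sketch, not a drop-in replacement for `fasttext.tokenize`: it assumes tokens are separated by whitespace only and that each newline yields an end-of-sentence marker `</s>`, and edge cases may behave differently in the real tokenizer. The name `whitespace_tokenize` is hypothetical.

```python
EOS = "</s>"  # assumed end-of-sentence marker emitted for newlines

def whitespace_tokenize(text):
    """Split on whitespace and insert EOS wherever the text has a newline."""
    tokens = []
    lines = text.split("\n")
    for i, line in enumerate(lines):
        tokens.extend(line.split())   # punctuation stays attached to words
        if i < len(lines) - 1:
            tokens.append(EOS)        # one marker per newline
    return tokens

print(whitespace_tokenize("Hello, world! How are you?"))
print(whitespace_tokenize("first line\nsecond line"))
```

If punctuation should become separate tokens, the text has to be preprocessed (for example by padding punctuation with spaces) before training and before prediction, so both see the same tokens.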

## Performance Optimization Tips

### Memory Usage

- **Quantization**: Use `quantize()` to reduce model size, typically by 75-90% depending on the quantization settings
- **Dimension Reduction**: Use `reduce_model()` for further memory savings
- **Model Format**: Save quantized models with the `.ftz` extension; the size reduction comes from quantization, not from the extension

### Speed Optimization

- **Threading**: Set an appropriate `thread` parameter during training
- **Batch Processing**: Process multiple texts together when possible
- **Caching**: Cache frequently accessed vectors and model properties
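
The caching and batching tips above can be sketched with the standard library. This is an illustration under stated assumptions: `make_cached_lookup` and `batches` are hypothetical helpers, and `fake_get_vector` is a stand-in for a real lookup such as `model.get_word_vector`, used here so the example runs without a model file.

```python
from functools import lru_cache

def make_cached_lookup(get_vector, maxsize=10_000):
    """Wrap any word -> vector function with an LRU cache."""
    @lru_cache(maxsize=maxsize)
    def cached(word):
        return tuple(get_vector(word))  # immutable copy, safe to share between callers
    return cached

def batches(items, size):
    """Yield consecutive chunks of `items` for batch processing."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in for a real vector lookup; counts how often it is actually called
calls = {"count": 0}

def fake_get_vector(word):
    calls["count"] += 1
    return [float(len(word)), 0.0]

lookup = make_cached_lookup(fake_get_vector)
for word in ["king", "queen", "king", "king"]:
    lookup(word)

print(f"Underlying lookups: {calls['count']}")
print(list(batches(["a", "b", "c", "d", "e"], 2)))
```

Returning tuples keeps cached values from being modified by one caller and silently changing for the next; convert back to an array where arithmetic is needed.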

### Storage Management

- **Model Formats**:
  - `.bin`: Full precision, all features available
  - `.ftz`: Conventional extension for quantized models; the extension alone does not compress anything
  - Quantized `.ftz`: Maximum compression, limited functionality
- **Pre-trained Models**: Download once and reuse across projects
- **Temporary Files**: Clean up downloaded models when no longer needed