# Utilities

FastText provides utility functions for model optimization, text processing, pre-trained model management, and advanced model manipulation. These utilities complement the core training and inference APIs with performance optimizations and convenience features.

## Capabilities

### Model Optimization

Optimize model size and performance through quantization and matrix manipulation.
```python { .api }
def quantize(input=None, qout=False, cutoff=0, retrain=False, epoch=None,
             lr=None, thread=None, verbose=None, dsub=2, qnorm=False):
    """
    Quantize the model to reduce memory usage and file size.

    Args:
        input (str, optional): Path to training data for retraining
        qout (bool): Quantize the output matrix (default: False)
        cutoff (int): Retain only this many words and ngrams; 0 keeps all (default: 0)
        retrain (bool): Retrain the model after pruning the vocabulary (default: False)
        epoch (int, optional): Number of retraining epochs
        lr (float, optional): Learning rate for retraining
        thread (int, optional): Number of threads for retraining
        verbose (int, optional): Verbosity level
        dsub (int): Size of each sub-vector in product quantization (default: 2)
        qnorm (bool): Quantize the vector norms separately (default: False)

    Note:
        Quantization is only supported for supervised (classifier) models.
        It can slightly reduce accuracy but significantly decreases size, and
        some operations (get_input_matrix, get_output_matrix) become
        unavailable on a quantized model.
    """

def set_matrices(input_matrix, output_matrix):
    """
    Set custom input and output matrices.

    Args:
        input_matrix (numpy.ndarray): Custom input matrix; must match the shape
            of the model's current input matrix
        output_matrix (numpy.ndarray): Custom output matrix; must match the shape
            of the model's current output matrix

    Raises:
        ValueError: If the model is quantized or the matrix dimensions don't match

    Note:
        Matrices are automatically converted to float32. Use with caution: this
        replaces the learned representations with custom values.
    """
```

#### Usage Example

```python
import fasttext
import numpy as np

# Load and quantize a classifier (quantization requires a supervised model)
model = fasttext.load_model('large_model.bin')
print(f"Model dimension: {model.get_dimension()}")

# Basic quantization
model.quantize()
print(f"Model quantized: {model.is_quantized()}")

# Advanced quantization with retraining
model = fasttext.load_model('model.bin')  # Reload the original
model.quantize(
    input='train.txt',  # Training data for retraining
    qout=True,          # Quantize the output matrix
    retrain=True,       # Retrain after pruning
    epoch=5,            # Retraining epochs
    lr=0.01,            # Lower learning rate
    dsub=2,             # Sub-vector size
    verbose=2           # Show progress
)

# Save the quantized model (much smaller file)
model.save_model('quantized_model.ftz')

# Custom matrix manipulation (before quantization)
model = fasttext.load_model('model.bin')
if not model.is_quantized():
    input_matrix = model.get_input_matrix()
    output_matrix = model.get_output_matrix()

    # Apply custom transformations
    scaled_input = input_matrix * 0.8
    normalized_output = output_matrix / np.linalg.norm(output_matrix, axis=1, keepdims=True)

    # Set the modified matrices
    model.set_matrices(scaled_input, normalized_output)
```

### Model Persistence

Save and manage model files in full-precision and quantized forms.

```python { .api }
def save_model(path):
    """
    Save model to file.

    Args:
        path (str): Output file path (.bin by convention for full models,
            .ftz for quantized models)

    Note:
        save_model writes the model as-is; the file only shrinks if the model
        has been quantized first. The .ftz extension is a naming convention
        for quantized models, not a compression switch.
    """
```

#### Usage Example

```python
import fasttext
import os

# Train and save a classifier (quantization only supports supervised models)
model = fasttext.train_supervised('train.txt')

# Full-precision binary model
model.save_model('model.bin')
bin_size = os.path.getsize('model.bin')

# Quantize, then save with the conventional .ftz extension
model.quantize(input='train.txt', retrain=True, cutoff=100000, qnorm=True)
model.save_model('model.ftz')
ftz_size = os.path.getsize('model.ftz')

print(f"Binary model: {bin_size / 1024 / 1024:.1f} MB")
print(f"Quantized model: {ftz_size / 1024 / 1024:.1f} MB")
print(f"Compression ratio: {bin_size / ftz_size:.1f}x")
```

### Pre-trained Model Management

Download and manage pre-trained FastText models for multiple languages.

```python { .api }
# Import utility module
import fasttext.util

def download_model(lang_id, if_exists='strict'):
    """
    Download the pre-trained FastText model for the specified language.

    Args:
        lang_id (str): Language identifier (e.g., 'en', 'fr', 'de')
        if_exists (str): Action if the model already exists - 'strict', 'ignore', 'overwrite'

    Returns:
        str: Path to the downloaded model file (cc.{lang_id}.300.bin)

    Raises:
        Exception: If the language ID is not supported

    Note:
        Always downloads 300-dimensional models trained on Common Crawl
    """

# Set of valid language IDs (157 languages supported)
valid_lang_ids = {"af", "sq", "als", "am", "ar", "an", "hy", "as", "ast",
                  "az", "ba", "eu", "bar", "be", "bn", "bh", "bpy", "bs",
                  "br", "bg", "my", "ca", "ceb", "bcl", "ce", "zh", "cv",
                  "co", "hr", "cs", "da", "dv", "nl", "pa", "arz", "eml",
                  "en", "myv", "eo", "et", "hif", "fi", "fr", "gl", "ka",
                  "de", "gom", "el", "gu", "ht", "he", "mrj", "hi", "hu",
                  "is", "io", "ilo", "id", "ia", "ga", "it", "ja", "jv",
                  "kn", "pam", "kk", "km", "ky", "ko", "ku", "ckb", "la",
                  "lv", "li", "lt", "lmo", "nds", "lb", "mk", "mai", "mg",
                  "ms", "ml", "mt", "gv", "mr", "mzn", "mhr", "min", "xmf",
                  "mwl", "mn", "nah", "nap", "ne", "new", "frr", "nso",
                  "no", "nn", "oc", "or", "os", "pfl", "ps", "fa", "pms",
                  "pl", "pt", "qu", "ro", "rm", "ru", "sah", "sa", "sc",
                  "sco", "gd", "sr", "sh", "scn", "sd", "si", "sk", "sl",
                  "so", "azb", "es", "su", "sw", "sv", "tl", "tg", "ta",
                  "tt", "te", "th", "bo", "tr", "tk", "uk", "hsb", "ur",
                  "ug", "uz", "vec", "vi", "vo", "wa", "war", "cy", "vls",
                  "fy", "pnb", "yi", "yo", "diq", "zea"}
```

#### Usage Example

```python
import fasttext
import fasttext.util

# Download English model
model_path = fasttext.util.download_model('en', if_exists='ignore')
model = fasttext.load_model(model_path)

# Models are always 300-dimensional; reduce after downloading if you need
# fewer dimensions (see reduce_model below)
fasttext.util.download_model('fr', if_exists='ignore')
fr_model = fasttext.load_model('cc.fr.300.bin')
fasttext.util.reduce_model(fr_model, 100)

# Check available languages
print(f"Available languages: {len(fasttext.util.valid_lang_ids)}")
print(f"Sample languages: {list(fasttext.util.valid_lang_ids)[:10]}")

# Download multiple models
languages = ['en', 'es', 'fr', 'de', 'it']
models = {}

for lang in languages:
    try:
        path = fasttext.util.download_model(lang, if_exists='ignore')
        models[lang] = fasttext.load_model(path)
        print(f"Loaded {lang} model: {models[lang].get_dimension()} dimensions")
    except Exception as e:
        print(f"Failed to download {lang}: {e}")

# Use multilingual models
text_samples = {
    'en': 'Hello world',
    'es': 'Hola mundo',
    'fr': 'Bonjour monde',
    'de': 'Hallo Welt'
}

for lang, text in text_samples.items():
    if lang in models:
        vector = models[lang].get_sentence_vector(text)
        print(f"{lang}: '{text}' -> vector shape {vector.shape}")
```

### Model Dimension Reduction

Reduce model dimensions using principal component analysis (PCA) for memory efficiency.

```python { .api }
def reduce_model(ft_model, target_dim):
    """
    Reduce model dimensions using PCA.

    Args:
        ft_model: FastText model object
        target_dim (int): Target dimension (must be smaller than the current dimension)

    Returns:
        _FastText: The same model, with its matrices reduced in place

    Note:
        The model is modified in place, so capture any full-dimension results
        you need before calling. Dimension reduction may impact model quality
        but reduces memory usage.
    """
```

#### Usage Example

```python
import fasttext
import fasttext.util

# Load high-dimensional model
model = fasttext.load_model('cc.en.300.bin')
print(f"Original dimensions: {model.get_dimension()}")

# reduce_model works in place, so capture full-dimension results first
original_neighbors = model.get_nearest_neighbors('king', k=5)

# Reduce dimensions
fasttext.util.reduce_model(model, 100)
print(f"Reduced dimensions: {model.get_dimension()}")

reduced_neighbors = model.get_nearest_neighbors('king', k=5)

print("Original model neighbors:")
for score, word in original_neighbors:
    print(f"  {word}: {score:.4f}")

print("Reduced model neighbors:")
for score, word in reduced_neighbors:
    print(f"  {word}: {score:.4f}")

# Save the reduced model
model.save_model('cc.en.100.reduced.bin')
```

### Evaluation Utilities

Utility functions for model evaluation and metric calculation.

```python { .api }
def test(predictions, labels, k=1):
    """
    Calculate precision and recall from predictions and true labels.

    Args:
        predictions (list): Predicted labels per sample (e.g., the label tuples
            returned by predict)
        labels (list): True label lists per sample
        k (int): Number of top predictions considered per sample (default: 1)

    Returns:
        tuple: (precision, recall) at k
    """

def find_nearest_neighbor(query, vectors, ban_set, cossims=None):
    """
    Find the vector nearest to the query, excluding banned items.

    Args:
        query (numpy.ndarray): Query vector
        vectors (numpy.ndarray): Matrix of candidate vectors
        ban_set (set): Indices to exclude from the search
        cossims (numpy.ndarray, optional): Pre-computed similarity scores

    Returns:
        int: Index of the nearest neighbor

    Note:
        Similarity is computed with dot products, so normalize the rows of
        vectors for true cosine similarity.
    """
```

#### Usage Example

```python
import fasttext
import fasttext.util
import numpy as np

# Evaluate custom predictions
model = fasttext.load_model('classifier.bin')

# Generate predictions
test_texts = [
    "Great movie, loved it!",
    "Terrible film.",
    "It was okay."
]

true_labels = [
    ['__label__positive'],
    ['__label__negative'],
    ['__label__neutral']
]

predictions = []
for text in test_texts:
    pred_labels, _ = model.predict(text, k=1)  # keep only the labels
    predictions.append(pred_labels)

# Calculate metrics
precision, recall = fasttext.util.test(predictions, true_labels, k=1)
print(f"Custom evaluation - Precision: {precision:.4f}, Recall: {recall:.4f}")

# Find nearest neighbors with exclusions; normalize rows so that dot
# products correspond to cosine similarities
word_vectors = model.get_input_matrix()
word_vectors = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)

query_word = 'king'
query_vector = model.get_word_vector(query_word)
query_id = model.get_word_id(query_word)

# Exclude the query word itself and some others
ban_set = {query_id, model.get_word_id('the'), model.get_word_id('a')}

nearest_idx = fasttext.util.find_nearest_neighbor(
    query_vector,
    word_vectors,
    ban_set
)

# Convert the index back to a word (rows beyond the vocabulary are subword buckets)
vocab = model.get_words()
if nearest_idx < len(vocab):
    nearest_word = vocab[nearest_idx]
    print(f"Nearest neighbor to '{query_word}': {nearest_word}")
```

### Text Processing

Additional text processing utilities for consistency and preprocessing.

```python { .api }
def tokenize(text):
    """
    Tokenize text using FastText's internal tokenizer.

    Args:
        text (str): Input text to tokenize

    Returns:
        list: List of tokens following FastText's tokenization rules

    Note:
        This ensures consistency with training data preprocessing
    """
```

#### Usage Example

```python
import fasttext

# Consistent tokenization
texts = [
    "Hello, world! How are you?",
    "E-mail: user@domain.com (important)",
    "123.45 is a number, isn't it?",
    "Visit https://example.com for more."
]

for text in texts:
    tokens = fasttext.tokenize(text)
    print(f"'{text}'")
    print(f"  Tokens: {tokens}")
    print(f"  Count: {len(tokens)}")
    print()

# Compare with model preprocessing
model = fasttext.load_model('model.bin')
sample_text = "This is a test sentence."

# Method 1: Direct tokenization
tokens1 = fasttext.tokenize(sample_text)

# Method 2: Model preprocessing
words, labels = model.get_line(sample_text)

print(f"Direct tokenization: {tokens1}")
print(f"Model preprocessing: {words}")
print(f"Are they equal? {tokens1 == words}")
```

## Performance Optimization Tips

### Memory Usage

- **Quantization**: Use `quantize()` to reduce model size by 75-90%
- **Dimension Reduction**: Use `reduce_model()` for further memory savings (see the sketch below)
- **Model Format**: Quantize before saving; the `.ftz` extension marks quantized models
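
A minimal sketch of the in-memory savings from dimension reduction. It assumes a 300-dimensional vector model such as the `cc.en.300.bin` file downloaded earlier; the file name is an assumption, not part of the API.

```python
import fasttext
import fasttext.util

# Assumes cc.en.300.bin exists locally (e.g., via download_model('en'))
model = fasttext.load_model('cc.en.300.bin')
before = model.get_input_matrix().nbytes  # in-memory size of the input matrix

fasttext.util.reduce_model(model, 100)    # PCA reduction, in place
after = model.get_input_matrix().nbytes

print(f"Input matrix: {before / 1e6:.0f} MB -> {after / 1e6:.0f} MB "
      f"({before / after:.1f}x smaller)")
```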

### Speed Optimization

- **Threading**: Set an appropriate `thread` parameter during training
- **Batch Processing**: Process multiple texts together when possible
- **Caching**: Cache frequently accessed vectors and model properties (see the sketch below)
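
A brief sketch of the batching and caching points above. The file name `classifier.bin` and the helper `cached_word_vector` are illustrative, not part of the FastText API.

```python
import fasttext
from functools import lru_cache

model = fasttext.load_model('classifier.bin')  # hypothetical model file

# Batch processing: predict() accepts a list of strings, which avoids
# per-call overhead compared to predicting one text at a time.
texts = ["first document", "second document", "third document"]
labels, probs = model.predict(texts, k=1)

# Caching: word vectors are deterministic for a loaded model, so repeated
# lookups can be memoized (do not mutate the returned arrays).
@lru_cache(maxsize=100_000)
def cached_word_vector(word):
    return model.get_word_vector(word)
```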

### Storage Management

- **Model Formats**:
  - `.bin`: Full precision, all features available
  - `.ftz`: Conventional extension for quantized models; much smaller, but with limited functionality
- **Pre-trained Models**: Download once and reuse across projects
- **Temporary Files**: Clean up downloaded models when no longer needed (see the sketch below)
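
A small housekeeping sketch for the last point. `download_model` fetches a `.gz` archive and decompresses it next to the `.bin`; the exact file names below are illustrative.

```python
import os

# Remove leftover archives once the extracted .bin files are in place
for leftover in ('cc.en.300.bin.gz', 'cc.fr.300.bin.gz'):
    if os.path.exists(leftover):
        os.remove(leftover)
        print(f"Removed {leftover}")
```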