# Utilities

FastText provides utility functions for model optimization, text processing, pre-trained model management, and advanced model manipulation. These utilities complement the core training and inference APIs with performance optimizations and convenience features.

## Capabilities

### Model Optimization

Optimize model size and performance through quantization and matrix manipulation.
```python { .api }
def quantize(input=None, qout=False, cutoff=0, retrain=False, epoch=None,
             lr=None, thread=None, verbose=None, dsub=2, qnorm=False):
    """
    Quantize the model to reduce memory usage and file size.

    Args:
        input (str, optional): Path to training data for retraining
        qout (bool): Quantize the output matrix (default: False)
        cutoff (int): Retain only this many words and ngrams; 0 keeps all (default: 0)
        retrain (bool): Retrain the model after pruning the vocabulary (default: False)
        epoch (int, optional): Number of retraining epochs
        lr (float, optional): Learning rate for retraining
        thread (int, optional): Number of threads for retraining
        verbose (int, optional): Verbosity level
        dsub (int): Size of each sub-vector in product quantization (default: 2)
        qnorm (bool): Quantize the vector norms separately (default: False)

    Note:
        Quantization is only supported for supervised (classifier) models.
        It can slightly reduce accuracy but significantly decreases size, and
        some operations (get_input_matrix, get_output_matrix) become
        unavailable on a quantized model.
    """

def set_matrices(input_matrix, output_matrix):
    """
    Set custom input and output matrices.

    Args:
        input_matrix (numpy.ndarray): Custom input matrix; must match the shape
            of the model's current input matrix
        output_matrix (numpy.ndarray): Custom output matrix; must match the shape
            of the model's current output matrix

    Raises:
        ValueError: If the model is quantized or the matrix dimensions don't match

    Note:
        Matrices are automatically converted to float32. Use with caution: this
        replaces the learned representations with custom values.
    """
```

#### Usage Example

```python
import fasttext
import numpy as np

# Load and quantize a classifier (quantization requires a supervised model)
model = fasttext.load_model('large_model.bin')
print(f"Model dimension: {model.get_dimension()}")

# Basic quantization
model.quantize()
print(f"Model quantized: {model.is_quantized()}")

# Advanced quantization with retraining
model = fasttext.load_model('model.bin')  # Reload the original
model.quantize(
    input='train.txt',  # Training data for retraining
    qout=True,          # Quantize the output matrix
    retrain=True,       # Retrain after pruning
    epoch=5,            # Retraining epochs
    lr=0.01,            # Lower learning rate
    dsub=2,             # Sub-vector size
    verbose=2           # Show progress
)

# Save the quantized model (much smaller file)
model.save_model('quantized_model.ftz')

# Custom matrix manipulation (before quantization)
model = fasttext.load_model('model.bin')
if not model.is_quantized():
    input_matrix = model.get_input_matrix()
    output_matrix = model.get_output_matrix()

    # Apply custom transformations
    scaled_input = input_matrix * 0.8
    normalized_output = output_matrix / np.linalg.norm(output_matrix, axis=1, keepdims=True)

    # Set the modified matrices
    model.set_matrices(scaled_input, normalized_output)
```

### Model Persistence

Save and manage model files in full-precision and quantized forms.

```python { .api }
def save_model(path):
    """
    Save model to file.

    Args:
        path (str): Output file path (.bin by convention for full models,
            .ftz for quantized models)

    Note:
        save_model writes the model as-is; the file only shrinks if the model
        has been quantized first. The .ftz extension is a naming convention
        for quantized models, not a compression switch.
    """
```

#### Usage Example

```python
import fasttext
import os

# Train and save a classifier (quantization only supports supervised models)
model = fasttext.train_supervised('train.txt')

# Full-precision binary model
model.save_model('model.bin')
bin_size = os.path.getsize('model.bin')

# Quantize, then save with the conventional .ftz extension
model.quantize(input='train.txt', retrain=True, cutoff=100000, qnorm=True)
model.save_model('model.ftz')
ftz_size = os.path.getsize('model.ftz')

print(f"Binary model: {bin_size / 1024 / 1024:.1f} MB")
print(f"Quantized model: {ftz_size / 1024 / 1024:.1f} MB")
print(f"Compression ratio: {bin_size / ftz_size:.1f}x")
```

### Pre-trained Model Management

Download and manage pre-trained FastText models for multiple languages.

```python { .api }
# Import utility module
import fasttext.util

def download_model(lang_id, if_exists='strict'):
    """
    Download the pre-trained FastText model for the specified language.

    Args:
        lang_id (str): Language identifier (e.g., 'en', 'fr', 'de')
        if_exists (str): Action if the model already exists - 'strict', 'ignore', 'overwrite'

    Returns:
        str: Path to the downloaded model file (cc.{lang_id}.300.bin)

    Raises:
        Exception: If the language ID is not supported

    Note:
        Always downloads 300-dimensional models trained on Common Crawl
    """

# Set of valid language IDs (157 languages supported)
valid_lang_ids = {"af", "sq", "als", "am", "ar", "an", "hy", "as", "ast",
                  "az", "ba", "eu", "bar", "be", "bn", "bh", "bpy", "bs",
                  "br", "bg", "my", "ca", "ceb", "bcl", "ce", "zh", "cv",
                  "co", "hr", "cs", "da", "dv", "nl", "pa", "arz", "eml",
                  "en", "myv", "eo", "et", "hif", "fi", "fr", "gl", "ka",
                  "de", "gom", "el", "gu", "ht", "he", "mrj", "hi", "hu",
                  "is", "io", "ilo", "id", "ia", "ga", "it", "ja", "jv",
                  "kn", "pam", "kk", "km", "ky", "ko", "ku", "ckb", "la",
                  "lv", "li", "lt", "lmo", "nds", "lb", "mk", "mai", "mg",
                  "ms", "ml", "mt", "gv", "mr", "mzn", "mhr", "min", "xmf",
                  "mwl", "mn", "nah", "nap", "ne", "new", "frr", "nso",
                  "no", "nn", "oc", "or", "os", "pfl", "ps", "fa", "pms",
                  "pl", "pt", "qu", "ro", "rm", "ru", "sah", "sa", "sc",
                  "sco", "gd", "sr", "sh", "scn", "sd", "si", "sk", "sl",
                  "so", "azb", "es", "su", "sw", "sv", "tl", "tg", "ta",
                  "tt", "te", "th", "bo", "tr", "tk", "uk", "hsb", "ur",
                  "ug", "uz", "vec", "vi", "vo", "wa", "war", "cy", "vls",
                  "fy", "pnb", "yi", "yo", "diq", "zea"}
```

#### Usage Example

```python
import fasttext
import fasttext.util

# Download English model
model_path = fasttext.util.download_model('en', if_exists='ignore')
model = fasttext.load_model(model_path)

# Models are always 300-dimensional; reduce after downloading if you need
# fewer dimensions (see reduce_model below)
fasttext.util.download_model('fr', if_exists='ignore')
fr_model = fasttext.load_model('cc.fr.300.bin')
fasttext.util.reduce_model(fr_model, 100)

# Check available languages
print(f"Available languages: {len(fasttext.util.valid_lang_ids)}")
print(f"Sample languages: {list(fasttext.util.valid_lang_ids)[:10]}")

# Download multiple models
languages = ['en', 'es', 'fr', 'de', 'it']
models = {}

for lang in languages:
    try:
        path = fasttext.util.download_model(lang, if_exists='ignore')
        models[lang] = fasttext.load_model(path)
        print(f"Loaded {lang} model: {models[lang].get_dimension()} dimensions")
    except Exception as e:
        print(f"Failed to download {lang}: {e}")

# Use multilingual models
text_samples = {
    'en': 'Hello world',
    'es': 'Hola mundo',
    'fr': 'Bonjour monde',
    'de': 'Hallo Welt'
}

for lang, text in text_samples.items():
    if lang in models:
        vector = models[lang].get_sentence_vector(text)
        print(f"{lang}: '{text}' -> vector shape {vector.shape}")
```

### Model Dimension Reduction

Reduce model dimensions using principal component analysis (PCA) for memory efficiency.

```python { .api }
def reduce_model(ft_model, target_dim):
    """
    Reduce model dimensions using PCA.

    Args:
        ft_model: FastText model object
        target_dim (int): Target dimension (must be smaller than the current dimension)

    Returns:
        _FastText: The same model, with its matrices reduced in place

    Note:
        The model is modified in place, so capture any full-dimension results
        you need before calling. Dimension reduction may impact model quality
        but reduces memory usage.
    """
```

#### Usage Example

```python
import fasttext
import fasttext.util

# Load high-dimensional model
model = fasttext.load_model('cc.en.300.bin')
print(f"Original dimensions: {model.get_dimension()}")

# reduce_model works in place, so capture full-dimension results first
original_neighbors = model.get_nearest_neighbors('king', k=5)

# Reduce dimensions
fasttext.util.reduce_model(model, 100)
print(f"Reduced dimensions: {model.get_dimension()}")

reduced_neighbors = model.get_nearest_neighbors('king', k=5)

print("Original model neighbors:")
for score, word in original_neighbors:
    print(f"  {word}: {score:.4f}")

print("Reduced model neighbors:")
for score, word in reduced_neighbors:
    print(f"  {word}: {score:.4f}")

# Save the reduced model
model.save_model('cc.en.100.reduced.bin')
```

### Evaluation Utilities

Utility functions for model evaluation and metric calculation.

```python { .api }
def test(predictions, labels, k=1):
    """
    Calculate precision and recall from predictions and true labels.

    Args:
        predictions (list): Predicted labels per sample (e.g., the label tuples
            returned by predict)
        labels (list): True label lists per sample
        k (int): Number of top predictions considered per sample (default: 1)

    Returns:
        tuple: (precision, recall) at k
    """

def find_nearest_neighbor(query, vectors, ban_set, cossims=None):
    """
    Find the vector nearest to the query, excluding banned items.

    Args:
        query (numpy.ndarray): Query vector
        vectors (numpy.ndarray): Matrix of candidate vectors
        ban_set (set): Indices to exclude from the search
        cossims (numpy.ndarray, optional): Pre-computed similarity scores

    Returns:
        int: Index of the nearest neighbor

    Note:
        Similarity is computed with dot products, so normalize the rows of
        vectors for true cosine similarity.
    """
```

#### Usage Example

```python
import fasttext
import fasttext.util
import numpy as np

# Evaluate custom predictions
model = fasttext.load_model('classifier.bin')

# Generate predictions
test_texts = [
    "Great movie, loved it!",
    "Terrible film.",
    "It was okay."
]

true_labels = [
    ['__label__positive'],
    ['__label__negative'],
    ['__label__neutral']
]

predictions = []
for text in test_texts:
    pred_labels, _ = model.predict(text, k=1)  # keep only the labels
    predictions.append(pred_labels)

# Calculate metrics
precision, recall = fasttext.util.test(predictions, true_labels, k=1)
print(f"Custom evaluation - Precision: {precision:.4f}, Recall: {recall:.4f}")

# Find nearest neighbors with exclusions; normalize rows so that dot
# products correspond to cosine similarities
word_vectors = model.get_input_matrix()
word_vectors = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)

query_word = 'king'
query_vector = model.get_word_vector(query_word)
query_id = model.get_word_id(query_word)

# Exclude the query word itself and some others
ban_set = {query_id, model.get_word_id('the'), model.get_word_id('a')}

nearest_idx = fasttext.util.find_nearest_neighbor(
    query_vector,
    word_vectors,
    ban_set
)

# Convert the index back to a word (rows beyond the vocabulary are subword buckets)
vocab = model.get_words()
if nearest_idx < len(vocab):
    nearest_word = vocab[nearest_idx]
    print(f"Nearest neighbor to '{query_word}': {nearest_word}")
```

### Text Processing

Additional text processing utilities for consistency and preprocessing.

```python { .api }
def tokenize(text):
    """
    Tokenize text using FastText's internal tokenizer.

    Args:
        text (str): Input text to tokenize

    Returns:
        list: List of tokens following FastText's tokenization rules

    Note:
        This ensures consistency with training data preprocessing
    """
```

#### Usage Example

```python
import fasttext

# Consistent tokenization
texts = [
    "Hello, world! How are you?",
    "E-mail: user@domain.com (important)",
    "123.45 is a number, isn't it?",
    "Visit https://example.com for more."
]

for text in texts:
    tokens = fasttext.tokenize(text)
    print(f"'{text}'")
    print(f"  Tokens: {tokens}")
    print(f"  Count: {len(tokens)}")
    print()

# Compare with model preprocessing
model = fasttext.load_model('model.bin')
sample_text = "This is a test sentence."

# Method 1: Direct tokenization
tokens1 = fasttext.tokenize(sample_text)

# Method 2: Model preprocessing
words, labels = model.get_line(sample_text)

print(f"Direct tokenization: {tokens1}")
print(f"Model preprocessing: {words}")
print(f"Are they equal? {tokens1 == words}")
```

## Performance Optimization Tips

### Memory Usage

- **Quantization**: Use `quantize()` to reduce model size by 75-90%
- **Dimension Reduction**: Use `reduce_model()` for further memory savings (see the sketch below)
- **Model Format**: Quantize before saving; the `.ftz` extension marks quantized models
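
A minimal sketch of the in-memory savings from dimension reduction. It assumes a 300-dimensional vector model such as the `cc.en.300.bin` file downloaded earlier; the file name is an assumption, not part of the API.

```python
import fasttext
import fasttext.util

# Assumes cc.en.300.bin exists locally (e.g., via download_model('en'))
model = fasttext.load_model('cc.en.300.bin')
before = model.get_input_matrix().nbytes  # in-memory size of the input matrix

fasttext.util.reduce_model(model, 100)    # PCA reduction, in place
after = model.get_input_matrix().nbytes

print(f"Input matrix: {before / 1e6:.0f} MB -> {after / 1e6:.0f} MB "
      f"({before / after:.1f}x smaller)")
```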

### Speed Optimization

- **Threading**: Set an appropriate `thread` parameter during training
- **Batch Processing**: Process multiple texts together when possible
- **Caching**: Cache frequently accessed vectors and model properties (see the sketch below)
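
A brief sketch of the batching and caching points above. The file name `classifier.bin` and the helper `cached_word_vector` are illustrative, not part of the FastText API.

```python
import fasttext
from functools import lru_cache

model = fasttext.load_model('classifier.bin')  # hypothetical model file

# Batch processing: predict() accepts a list of strings, which avoids
# per-call overhead compared to predicting one text at a time.
texts = ["first document", "second document", "third document"]
labels, probs = model.predict(texts, k=1)

# Caching: word vectors are deterministic for a loaded model, so repeated
# lookups can be memoized (do not mutate the returned arrays).
@lru_cache(maxsize=100_000)
def cached_word_vector(word):
    return model.get_word_vector(word)
```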

### Storage Management

- **Model Formats**:
  - `.bin`: Full precision, all features available
  - `.ftz`: Conventional extension for quantized models; much smaller, but with limited functionality
- **Pre-trained Models**: Download once and reuse across projects
- **Temporary Files**: Clean up downloaded models when no longer needed (see the sketch below)
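
A small housekeeping sketch for the last point. `download_model` fetches a `.gz` archive and decompresses it next to the `.bin`; the exact file names below are illustrative.

```python
import os

# Remove leftover archives once the extracted .bin files are in place
for leftover in ('cc.en.300.bin.gz', 'cc.fr.300.bin.gz'):
    if os.path.exists(leftover):
        os.remove(leftover)
        print(f"Removed {leftover}")
```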