# Text Classification

FastText provides comprehensive text classification capabilities, including prediction, evaluation, and detailed performance metrics. It supports multi-class and multi-label classification with confidence thresholds and top-k predictions.

## Capabilities

### Prediction

Classify text into predefined categories with confidence scores and threshold filtering.

```python { .api }
def predict(text, k=1, threshold=0.0, on_unicode_error='strict'):
    """
    Predict labels for input text.

    Args:
        text (str or list): Input text to classify, or list of texts for batch prediction
        k (int): Number of top predictions to return (default: 1)
        threshold (float): Minimum prediction confidence (default: 0.0)
        on_unicode_error (str): Unicode error handling (default: 'strict')

    Returns:
        tuple: If text is str, returns (labels, probabilities) where labels is
            a list of predicted labels and probabilities is a numpy array of scores.
            If text is list, returns (all_labels, all_probabilities) where each
            is a list containing results for each input text.

    Raises:
        ValueError: If text contains newline characters or model is not supervised
    """

def get_line(text, on_unicode_error='strict'):
    """
    Split text into words and labels for internal processing.

    Args:
        text (str or list): Input text or list of texts (must not contain newlines)
        on_unicode_error (str): Unicode error handling (default: 'strict')

    Returns:
        tuple or list: If text is str, returns (words, labels) tuple.
            If text is list, returns list of (words, labels) tuples.
            words is the tokenized text, labels is the list of any labels found.

    Raises:
        ValueError: If text contains newline characters

    Note:
        Labels must start with the prefix used to create the model (__label__ by default)
    """
```

#### Usage Example

```python
import fasttext

# Load trained classifier
model = fasttext.load_model('classifier.bin')

# Single prediction
text = "This movie is absolutely fantastic!"
labels, probabilities = model.predict(text)
print(f"Predicted: {labels[0]} (confidence: {probabilities[0]:.4f})")

# Top-k predictions
labels, probabilities = model.predict(text, k=3)
print("Top 3 predictions:")
for label, prob in zip(labels, probabilities):
    print(f"  {label}: {prob:.4f}")

# Predictions with threshold
labels, probabilities = model.predict(text, k=5, threshold=0.1)
print(f"Predictions above 0.1 confidence: {len(labels)}")

# Batch predictions (predict accepts a list of texts)
texts = [
    "Great movie, loved it!",
    "Terrible film, waste of time.",
    "It was okay, nothing special."
]

all_labels, all_probs = model.predict(texts)
for text, labels, probs in zip(texts, all_labels, all_probs):
    print(f"'{text}' -> {labels[0]} ({probs[0]:.3f})")

# Handle multilabel predictions
multilabel_text = "This is a great action comedy movie"
labels, probs = model.predict(multilabel_text, k=3, threshold=0.2)
print(f"Multiple labels: {labels}")
```

### Model Evaluation

Evaluate classifier performance on test datasets with precision, recall, and F1-score metrics.

```python { .api }
def test(path, k=1, threshold=0.0):
    """
    Evaluate model on test data.

    Args:
        path (str): Path to test file in training format
        k (int): Number of predictions to consider (default: 1)
        threshold (float): Minimum prediction confidence (default: 0.0)

    Returns:
        tuple: (sample_count, precision, recall) where sample_count is the
            number of test samples, precision is P@k, and recall is R@k
    """

def test_label(path, k=1, threshold=0.0):
    """
    Get per-label precision and recall scores.

    Args:
        path (str): Path to test file in training format
        k (int): Number of predictions to consider (default: 1)
        threshold (float): Minimum prediction confidence (default: 0.0)

    Returns:
        dict: Dictionary mapping label names to dictionaries with 'precision' and 'recall' keys
            Example: {'__label__positive': {'precision': 0.7, 'recall': 0.74}}
    """
```

#### Usage Example

```python
import fasttext

model = fasttext.load_model('classifier.bin')

# Overall evaluation
n_samples, precision, recall = model.test('test.txt')
f1_score = 2 * (precision * recall) / (precision + recall)

print("Test Results:")
print(f"  Samples: {n_samples}")
print(f"  Precision@1: {precision:.4f}")
print(f"  Recall@1: {recall:.4f}")
print(f"  F1-Score: {f1_score:.4f}")

# Top-k evaluation
n_samples, precision_k, recall_k = model.test('test.txt', k=3)
print(f"Precision@3: {precision_k:.4f}")
print(f"Recall@3: {recall_k:.4f}")

# Per-label evaluation (test_label returns a dict of metrics per label)
label_scores = model.test_label('test.txt')
print("Per-label scores:")
for label, metrics in label_scores.items():
    precision, recall = metrics['precision'], metrics['recall']
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    print(f"  {label}: P={precision:.3f}, R={recall:.3f}, F1={f1:.3f}")

# Evaluation with threshold
n_samples, precision_t, recall_t = model.test('test.txt', k=1, threshold=0.5)
print(f"With threshold 0.5 - P@1: {precision_t:.4f}, R@1: {recall_t:.4f}")
```

### Advanced Metrics

Access detailed evaluation metrics and precision-recall curves for comprehensive model analysis.

```python { .api }
def get_meter(path, k=-1):
    """
    Get evaluation meter for detailed metrics.

    Args:
        path (str): Path to test file
        k (int): Number of predictions to consider (default: -1 for all)

    Returns:
        _Meter: Meter object for detailed evaluation
    """
```

The `_Meter` class provides advanced metric analysis:

```python { .api }
class _Meter:
    def score_vs_true(self, label):
        """
        Get scores and true labels for a specific label.

        Args:
            label (str): Label to analyze

        Returns:
            tuple: (scores_array, true_labels_array) for ROC/PR analysis
        """

    def precision_recall_curve(self, label=None):
        """
        Get precision-recall curve data.

        Args:
            label (str, optional): Specific label or None for micro-average

        Returns:
            tuple: (precision_array, recall_array, thresholds_array)
        """

    def precision_at_recall(self, recall, label=None):
        """
        Get precision at a specific recall level.

        Args:
            recall (float): Target recall level (0.0-1.0)
            label (str, optional): Specific label or None for micro-average

        Returns:
            float: Precision at the specified recall level
        """

    def recall_at_precision(self, precision, label=None):
        """
        Get recall at a specific precision level.

        Args:
            precision (float): Target precision level (0.0-1.0)
            label (str, optional): Specific label or None for micro-average

        Returns:
            float: Recall at the specified precision level
        """
```

#### Usage Example

```python
import fasttext
import matplotlib.pyplot as plt
import numpy as np

model = fasttext.load_model('classifier.bin')

# Get detailed evaluation meter
meter = model.get_meter('test.txt')

# Analyze specific label
label = '__label__positive'
scores, true_labels = meter.score_vs_true(label)

print(f"Analysis for {label}:")
print(f"  Score range: {scores.min():.3f} to {scores.max():.3f}")
print(f"  Positive samples: {true_labels.sum()}")
print(f"  Negative samples: {len(true_labels) - true_labels.sum()}")

# Get precision-recall curve
precision, recall, thresholds = meter.precision_recall_curve(label)

# Plot PR curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve for {label}')
plt.grid(True)
plt.show()

# Find optimal threshold
f1_scores = 2 * (precision * recall) / (precision + recall)
f1_scores = np.nan_to_num(f1_scores)  # Handle division by zero
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]
optimal_f1 = f1_scores[optimal_idx]

print(f"Optimal threshold: {optimal_threshold:.3f}")
print(f"Optimal F1-score: {optimal_f1:.3f}")

# Precision/recall at specific levels
precision_at_80_recall = meter.precision_at_recall(0.8, label)
recall_at_90_precision = meter.recall_at_precision(0.9, label)

print(f"Precision at 80% recall: {precision_at_80_recall:.3f}")
print(f"Recall at 90% precision: {recall_at_90_precision:.3f}")

# Multi-label analysis
labels = model.get_labels()
for label in labels[:5]:  # Analyze first 5 labels
    pr_at_50 = meter.precision_at_recall(0.5, label)
    re_at_90 = meter.recall_at_precision(0.9, label)
    print(f"{label}: P@50%R={pr_at_50:.3f}, R@90%P={re_at_90:.3f}")
```

### Text Preprocessing

Access FastText's internal text processing for consistency with training.

```python { .api }
def tokenize(text):
    """
    Tokenize text using FastText's internal tokenizer.

    Args:
        text (str): Input text to tokenize

    Returns:
        list: List of tokens
    """
```

#### Usage Example

```python
import fasttext

# Tokenize text consistently with training
text = "Hello, world! This is a test."
tokens = fasttext.tokenize(text)
print(f"Tokens: {tokens}")

# Compare with model prediction preprocessing
model = fasttext.load_model('classifier.bin')
words, labels = model.get_line(text)
print(f"Model preprocessing: {words}")

# Ensure consistency
custom_text = "E-mail addresses like user@domain.com are tricky!"
custom_tokens = fasttext.tokenize(custom_text)
print(f"Custom tokenization: {custom_tokens}")
```

## Classification Best Practices

### Data Preparation

- **Label Format**: Use the `__label__` prefix for all labels (see the file-format sketch after this list)
- **Text Cleaning**: FastText handles basic tokenization, but consider domain-specific preprocessing
- **Class Balance**: Consider stratified sampling for imbalanced datasets
- **Validation Split**: Reserve 10-20% of the data for validation and hyperparameter tuning
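
As a rough illustration of the expected training-file layout, the sketch below writes one example per line with the label(s) first; the file name `train.txt` and the sample data are assumptions for illustration, not part of the FastText API.

```python
# Minimal sketch of the fastText training-file format (assumed file: train.txt).
# Each line carries one or more __label__-prefixed labels followed by the example text.
examples = [
    ("__label__positive", "Great movie, loved it!"),
    ("__label__negative", "Terrible film, waste of time."),
    ("__label__neutral", "It was okay, nothing special."),
]

with open('train.txt', 'w', encoding='utf-8') as f:
    for label, text in examples:
        # Newlines inside the text would break the one-example-per-line format
        f.write(f"{label} {text.replace(chr(10), ' ')}\n")
```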

### Model Configuration

- **Loss Functions**:
  - `softmax`: Multi-class classification (default)
  - `ns`: Negative sampling for large vocabularies
  - `hs`: Hierarchical softmax for efficient training
  - `ova`: One-vs-all for multi-label classification

- **Hyperparameters** (combined in the sketch after this list):
  - `lr=0.1`: Good starting learning rate
  - `wordNgrams=2`: Include bigrams for better context
  - `minn=3, maxn=6`: Character n-grams for robustness
  - `dim=100-300`: Higher dimensions for complex tasks
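
A minimal sketch combining these settings with `fasttext.train_supervised`; the file names (`train.txt`, `valid.txt`) and the specific values are assumptions to adapt to your own dataset.

```python
import fasttext

# Assumed paths; replace with your own training/validation files.
model = fasttext.train_supervised(
    input='train.txt',
    lr=0.1,            # starting learning rate
    dim=100,           # embedding dimension (try 100-300)
    wordNgrams=2,      # include bigrams
    minn=3, maxn=6,    # character n-grams
    epoch=25,
    loss='softmax'     # or 'ova' for multi-label classification
)

# Quick sanity check on held-out data, then save
n, p, r = model.test('valid.txt')
print(f"valid: N={n}, P@1={p:.3f}, R@1={r:.3f}")
model.save_model('classifier.bin')
```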

### Evaluation Strategy

- **Metrics**: Use F1-score for imbalanced classes, accuracy for balanced ones
- **Cross-validation**: Use k-fold CV for small datasets
- **Threshold Optimization**: Tune prediction thresholds for the best F1 (see the sweep sketch after this list)
- **Per-label Analysis**: Monitor per-class performance for multi-class problems
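
A minimal threshold-sweep sketch using `model.test` from this section; `classifier.bin` and `valid.txt` are assumed placeholders for your own model and validation file.

```python
import fasttext

model = fasttext.load_model('classifier.bin')  # assumed model path

# Sweep a few candidate thresholds and keep the one with the best F1@1
best_threshold, best_f1 = 0.0, 0.0
for threshold in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]:
    _, precision, recall = model.test('valid.txt', k=1, threshold=threshold)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1

print(f"Best threshold: {best_threshold:.1f} (F1={best_f1:.3f})")
```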