# Text Classification

FastText provides comprehensive text classification capabilities, including prediction, evaluation, and detailed performance metrics. It supports multi-class and multi-label classification with confidence thresholds and top-k predictions.

## Capabilities

### Prediction

Classify text into predefined categories with confidence scores and threshold filtering.

```python { .api }
def predict(text, k=1, threshold=0.0, on_unicode_error='strict'):
    """
    Predict labels for input text.

    Args:
        text (str or list): Input text to classify, or list of texts for batch prediction
        k (int): Number of top predictions to return (default: 1)
        threshold (float): Minimum prediction confidence (default: 0.0)
        on_unicode_error (str): Unicode error handling (default: 'strict')

    Returns:
        tuple: If text is str, returns (labels, probabilities) where labels is
            a list of predicted labels and probabilities is a numpy array of scores.
            If text is list, returns (all_labels, all_probabilities) where each
            is a list containing results for each input text.

    Raises:
        ValueError: If text contains newline characters or model is not supervised
    """

def get_line(text, on_unicode_error='strict'):
    """
    Split text into words and labels for internal processing.

    Args:
        text (str or list): Input text or list of texts (must not contain newlines)
        on_unicode_error (str): Unicode error handling (default: 'strict')

    Returns:
        tuple or list: If text is str, returns (words, labels) tuple.
            If text is list, returns list of (words, labels) tuples.
            words is the tokenized text, labels is the list of any labels found.

    Raises:
        ValueError: If text contains newline characters

    Note:
        Labels must start with the prefix used to create the model (__label__ by default)
    """
```

#### Usage Example

```python
import fasttext

# Load trained classifier
model = fasttext.load_model('classifier.bin')

# Single prediction
text = "This movie is absolutely fantastic!"
labels, probabilities = model.predict(text)
print(f"Predicted: {labels[0]} (confidence: {probabilities[0]:.4f})")

# Top-k predictions
labels, probabilities = model.predict(text, k=3)
print("Top 3 predictions:")
for label, prob in zip(labels, probabilities):
    print(f"  {label}: {prob:.4f}")

# Predictions with threshold
labels, probabilities = model.predict(text, k=5, threshold=0.1)
print(f"Predictions above 0.1 confidence: {len(labels)}")

# Batch predictions (predict accepts a list of texts)
texts = [
    "Great movie, loved it!",
    "Terrible film, waste of time.",
    "It was okay, nothing special."
]

all_labels, all_probs = model.predict(texts)
for text, labels, probs in zip(texts, all_labels, all_probs):
    print(f"'{text}' -> {labels[0]} ({probs[0]:.3f})")

# Handle multilabel predictions
multilabel_text = "This is a great action comedy movie"
labels, probs = model.predict(multilabel_text, k=3, threshold=0.2)
print(f"Multiple labels: {labels}")
```

### Model Evaluation

Evaluate classifier performance on test datasets with precision, recall, and F1-score metrics.

```python { .api }
def test(path, k=1, threshold=0.0):
    """
    Evaluate model on test data.

    Args:
        path (str): Path to test file in training format
        k (int): Number of predictions to consider (default: 1)
        threshold (float): Minimum prediction confidence (default: 0.0)

    Returns:
        tuple: (sample_count, precision, recall) where sample_count is the
            number of test samples, precision is P@k, and recall is R@k
    """

def test_label(path, k=1, threshold=0.0):
    """
    Get per-label precision and recall scores.

    Args:
        path (str): Path to test file in training format
        k (int): Number of predictions to consider (default: 1)
        threshold (float): Minimum prediction confidence (default: 0.0)

    Returns:
        dict: Dictionary mapping label names to dictionaries with 'precision' and 'recall' keys
            Example: {'__label__positive': {'precision': 0.7, 'recall': 0.74}}
    """
```

#### Usage Example

```python
import fasttext

model = fasttext.load_model('classifier.bin')

# Overall evaluation
n_samples, precision, recall = model.test('test.txt')
f1_score = 2 * (precision * recall) / (precision + recall)

print("Test Results:")
print(f"  Samples: {n_samples}")
print(f"  Precision@1: {precision:.4f}")
print(f"  Recall@1: {recall:.4f}")
print(f"  F1-Score: {f1_score:.4f}")

# Top-k evaluation
n_samples, precision_k, recall_k = model.test('test.txt', k=3)
print(f"Precision@3: {precision_k:.4f}")
print(f"Recall@3: {recall_k:.4f}")

# Per-label evaluation (test_label returns a dict of metrics per label)
label_scores = model.test_label('test.txt')
print("Per-label scores:")
for label, metrics in label_scores.items():
    precision, recall = metrics['precision'], metrics['recall']
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    print(f"  {label}: P={precision:.3f}, R={recall:.3f}, F1={f1:.3f}")

# Evaluation with threshold
n_samples, precision_t, recall_t = model.test('test.txt', k=1, threshold=0.5)
print(f"With threshold 0.5 - P@1: {precision_t:.4f}, R@1: {recall_t:.4f}")
```

### Advanced Metrics

Access detailed evaluation metrics and precision-recall curves for comprehensive model analysis.

```python { .api }
def get_meter(path, k=-1):
    """
    Get evaluation meter for detailed metrics.

    Args:
        path (str): Path to test file
        k (int): Number of predictions to consider (default: -1 for all)

    Returns:
        _Meter: Meter object for detailed evaluation
    """
```

The `_Meter` class provides advanced metric analysis:

```python { .api }
class _Meter:
    def score_vs_true(self, label):
        """
        Get scores and true labels for a specific label.

        Args:
            label (str): Label to analyze

        Returns:
            tuple: (scores_array, true_labels_array) for ROC/PR analysis
        """

    def precision_recall_curve(self, label=None):
        """
        Get precision-recall curve data.

        Args:
            label (str, optional): Specific label or None for micro-average

        Returns:
            tuple: (precision_array, recall_array, thresholds_array)
        """

    def precision_at_recall(self, recall, label=None):
        """
        Get precision at a specific recall level.

        Args:
            recall (float): Target recall level (0.0-1.0)
            label (str, optional): Specific label or None for micro-average

        Returns:
            float: Precision at the specified recall level
        """

    def recall_at_precision(self, precision, label=None):
        """
        Get recall at a specific precision level.

        Args:
            precision (float): Target precision level (0.0-1.0)
            label (str, optional): Specific label or None for micro-average

        Returns:
            float: Recall at the specified precision level
        """
```

#### Usage Example

```python
import fasttext
import matplotlib.pyplot as plt
import numpy as np

model = fasttext.load_model('classifier.bin')

# Get detailed evaluation meter
meter = model.get_meter('test.txt')

# Analyze specific label
label = '__label__positive'
scores, true_labels = meter.score_vs_true(label)

print(f"Analysis for {label}:")
print(f"  Score range: {scores.min():.3f} to {scores.max():.3f}")
print(f"  Positive samples: {true_labels.sum()}")
print(f"  Negative samples: {len(true_labels) - true_labels.sum()}")

# Get precision-recall curve
precision, recall, thresholds = meter.precision_recall_curve(label)

# Plot PR curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve for {label}')
plt.grid(True)
plt.show()

# Find optimal threshold
f1_scores = 2 * (precision * recall) / (precision + recall)
f1_scores = np.nan_to_num(f1_scores)  # Handle division by zero
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]
optimal_f1 = f1_scores[optimal_idx]

print(f"Optimal threshold: {optimal_threshold:.3f}")
print(f"Optimal F1-score: {optimal_f1:.3f}")

# Precision/recall at specific levels
precision_at_80_recall = meter.precision_at_recall(0.8, label)
recall_at_90_precision = meter.recall_at_precision(0.9, label)

print(f"Precision at 80% recall: {precision_at_80_recall:.3f}")
print(f"Recall at 90% precision: {recall_at_90_precision:.3f}")

# Multi-label analysis
labels = model.get_labels()
for label in labels[:5]:  # Analyze first 5 labels
    pr_at_50 = meter.precision_at_recall(0.5, label)
    re_at_90 = meter.recall_at_precision(0.9, label)
    print(f"{label}: P@50%R={pr_at_50:.3f}, R@90%P={re_at_90:.3f}")
```

### Text Preprocessing

Access FastText's internal text processing for consistency with training.

```python { .api }
def tokenize(text):
    """
    Tokenize text using FastText's internal tokenizer.

    Args:
        text (str): Input text to tokenize

    Returns:
        list: List of tokens
    """
```

#### Usage Example

```python
import fasttext

# Tokenize text consistently with training
text = "Hello, world! This is a test."
tokens = fasttext.tokenize(text)
print(f"Tokens: {tokens}")

# Compare with model prediction preprocessing
model = fasttext.load_model('classifier.bin')
words, labels = model.get_line(text)
print(f"Model preprocessing: {words}")

# Ensure consistency
custom_text = "E-mail addresses like user@domain.com are tricky!"
custom_tokens = fasttext.tokenize(custom_text)
print(f"Custom tokenization: {custom_tokens}")
```

## Classification Best Practices

### Data Preparation

- **Label Format**: Use the `__label__` prefix for all labels (see the file-format sketch after this list)
- **Text Cleaning**: FastText handles basic tokenization, but consider domain-specific preprocessing
- **Class Balance**: Consider stratified sampling for imbalanced datasets
- **Validation Split**: Reserve 10-20% of the data for validation and hyperparameter tuning
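
As a rough illustration of the expected training-file layout, the sketch below writes one example per line with the label(s) first; the file name `train.txt` and the sample data are assumptions for illustration, not part of the FastText API.

```python
# Minimal sketch of the fastText training-file format (assumed file: train.txt).
# Each line carries one or more __label__-prefixed labels followed by the example text.
examples = [
    ("__label__positive", "Great movie, loved it!"),
    ("__label__negative", "Terrible film, waste of time."),
    ("__label__neutral", "It was okay, nothing special."),
]

with open('train.txt', 'w', encoding='utf-8') as f:
    for label, text in examples:
        # Newlines inside the text would break the one-example-per-line format
        f.write(f"{label} {text.replace(chr(10), ' ')}\n")
```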

### Model Configuration

- **Loss Functions**:
  - `softmax`: Multi-class classification (default)
  - `ns`: Negative sampling for large vocabularies
  - `hs`: Hierarchical softmax for efficient training
  - `ova`: One-vs-all for multi-label classification

- **Hyperparameters** (combined in the sketch after this list):
  - `lr=0.1`: Good starting learning rate
  - `wordNgrams=2`: Include bigrams for better context
  - `minn=3, maxn=6`: Character n-grams for robustness
  - `dim=100-300`: Higher dimensions for complex tasks
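
A minimal sketch combining these settings with `fasttext.train_supervised`; the file names (`train.txt`, `valid.txt`) and the specific values are assumptions to adapt to your own dataset.

```python
import fasttext

# Assumed paths; replace with your own training/validation files.
model = fasttext.train_supervised(
    input='train.txt',
    lr=0.1,            # starting learning rate
    dim=100,           # embedding dimension (try 100-300)
    wordNgrams=2,      # include bigrams
    minn=3, maxn=6,    # character n-grams
    epoch=25,
    loss='softmax'     # or 'ova' for multi-label classification
)

# Quick sanity check on held-out data, then save
n, p, r = model.test('valid.txt')
print(f"valid: N={n}, P@1={p:.3f}, R@1={r:.3f}")
model.save_model('classifier.bin')
```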

### Evaluation Strategy

- **Metrics**: Use F1-score for imbalanced classes, accuracy for balanced ones
- **Cross-validation**: Use k-fold CV for small datasets
- **Threshold Optimization**: Tune prediction thresholds for the best F1 (see the sweep sketch after this list)
- **Per-label Analysis**: Monitor per-class performance for multi-class problems
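
A minimal threshold-sweep sketch using `model.test` from this section; `classifier.bin` and `valid.txt` are assumed placeholders for your own model and validation file.

```python
import fasttext

model = fasttext.load_model('classifier.bin')  # assumed model path

# Sweep a few candidate thresholds and keep the one with the best F1@1
best_threshold, best_f1 = 0.0, 0.0
for threshold in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]:
    _, precision, recall = model.test('valid.txt', k=1, threshold=threshold)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1

print(f"Best threshold: {best_threshold:.1f} (F1={best_f1:.3f})")
```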