# Text Classification

FastText provides comprehensive text classification capabilities, including prediction, evaluation, and detailed performance metrics. It supports multi-class and multi-label classification with confidence thresholds and top-k predictions.
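
The examples in this document load a supervised model from `classifier.bin`. As a minimal sketch of how such a file might be produced (assuming labeled training data in a hypothetical `train.txt` using the `__label__` prefix described under Best Practices):

```python
import fasttext

# Minimal sketch: train a supervised classifier and save it as 'classifier.bin'.
# 'train.txt' is an assumed file of "__label__<name> example text" lines.
model = fasttext.train_supervised(input='train.txt')
model.save_model('classifier.bin')
```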

## Capabilities

### Prediction

Classify text into predefined categories with confidence scores and threshold filtering.

```python { .api }
def predict(text, k=1, threshold=0.0, on_unicode_error='strict'):
    """
    Predict labels for input text.

    Args:
        text (str or list): Input text to classify or list of texts for batch prediction
        k (int): Number of top predictions to return (default: 1)
        threshold (float): Minimum prediction confidence (default: 0.0)
        on_unicode_error (str): Unicode error handling (default: 'strict')

    Returns:
        tuple: If text is str, returns (labels, probabilities) where labels is
            a list of predicted labels and probabilities is a numpy array of scores.
            If text is list, returns (all_labels, all_probabilities) where each
            is a list containing results for each input text.

    Raises:
        ValueError: If text contains newline characters or model is not supervised
    """

def get_line(text, on_unicode_error='strict'):
    """
    Split text into words and labels for internal processing.

    Args:
        text (str or list): Input text or list of texts (must not contain newlines)
        on_unicode_error (str): Unicode error handling (default: 'strict')

    Returns:
        tuple or list: If text is str, returns (words, labels) tuple.
            If text is list, returns list of (words, labels) tuples.
            words is tokenized text, labels is list of any labels found

    Raises:
        ValueError: If text contains newline characters

    Note:
        Labels must start with the prefix used to create the model (__label__ by default)
    """
```

#### Usage Example

```python
import fasttext

# Load trained classifier
model = fasttext.load_model('classifier.bin')

# Single prediction
text = "This movie is absolutely fantastic!"
labels, probabilities = model.predict(text)
print(f"Predicted: {labels[0]} (confidence: {probabilities[0]:.4f})")

# Top-k predictions
labels, probabilities = model.predict(text, k=3)
print("Top 3 predictions:")
for label, prob in zip(labels, probabilities):
    print(f" {label}: {prob:.4f}")

# Predictions with threshold
labels, probabilities = model.predict(text, k=5, threshold=0.1)
print(f"Predictions above 0.1 confidence: {len(labels)}")

# Batch predictions
texts = [
    "Great movie, loved it!",
    "Terrible film, waste of time.",
    "It was okay, nothing special."
]

for text in texts:
    labels, probs = model.predict(text)
    print(f"'{text}' -> {labels[0]} ({probs[0]:.3f})")

# Handle multi-label predictions
multilabel_text = "This is a great action comedy movie"
labels, probs = model.predict(multilabel_text, k=3, threshold=0.2)
print(f"Multiple labels: {labels}")
```

### Model Evaluation

Evaluate classifier performance on test datasets with precision, recall, and F1-score metrics.

```python { .api }
def test(path, k=1, threshold=0.0):
    """
    Evaluate model on test data.

    Args:
        path (str): Path to test file in training format
        k (int): Number of predictions to consider (default: 1)
        threshold (float): Minimum prediction confidence (default: 0.0)

    Returns:
        tuple: (sample_count, precision, recall) where sample_count is the
            number of test samples, precision is P@k, recall is R@k
    """

def test_label(path, k=1, threshold=0.0):
    """
    Get per-label precision and recall scores.

    Args:
        path (str): Path to test file in training format
        k (int): Number of predictions to consider (default: 1)
        threshold (float): Minimum prediction confidence (default: 0.0)

    Returns:
        dict: Dictionary mapping label names to dictionaries with 'precision' and 'recall' keys
            Example: {'__label__positive': {'precision': 0.7, 'recall': 0.74}}
    """
```

#### Usage Example

```python
import fasttext

model = fasttext.load_model('classifier.bin')

# Overall evaluation
n_samples, precision, recall = model.test('test.txt')
f1_score = 2 * (precision * recall) / (precision + recall)

print("Test Results:")
print(f" Samples: {n_samples}")
print(f" Precision@1: {precision:.4f}")
print(f" Recall@1: {recall:.4f}")
print(f" F1-Score: {f1_score:.4f}")

# Top-k evaluation
n_samples, precision_k, recall_k = model.test('test.txt', k=3)
print(f"Precision@3: {precision_k:.4f}")
print(f"Recall@3: {recall_k:.4f}")

# Per-label evaluation (test_label returns a dict of per-label score dicts)
label_scores = model.test_label('test.txt')
print("Per-label scores:")
for label, scores in label_scores.items():
    precision, recall = scores['precision'], scores['recall']
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    print(f" {label}: P={precision:.3f}, R={recall:.3f}, F1={f1:.3f}")

# Evaluation with threshold
n_samples, precision_t, recall_t = model.test('test.txt', k=1, threshold=0.5)
print(f"With threshold 0.5 - P@1: {precision_t:.4f}, R@1: {recall_t:.4f}")
```

### Advanced Metrics

Access detailed evaluation metrics and precision-recall curves for comprehensive model analysis.

```python { .api }
def get_meter(path, k=-1):
    """
    Get evaluation meter for detailed metrics.

    Args:
        path (str): Path to test file
        k (int): Number of predictions to consider (default: -1 for all)

    Returns:
        _Meter: Meter object for detailed evaluation
    """
```

The `_Meter` class provides advanced metric analysis:

```python { .api }
class _Meter:
    def score_vs_true(self, label):
        """
        Get scores and true labels for a specific label.

        Args:
            label (str): Label to analyze

        Returns:
            tuple: (scores_array, true_labels_array) for ROC/PR analysis
        """

    def precision_recall_curve(self, label=None):
        """
        Get precision-recall curve data.

        Args:
            label (str, optional): Specific label or None for micro-average

        Returns:
            tuple: (precision_array, recall_array, thresholds_array)
        """

    def precision_at_recall(self, recall, label=None):
        """
        Get precision at a specific recall level.

        Args:
            recall (float): Target recall level (0.0-1.0)
            label (str, optional): Specific label or None for micro-average

        Returns:
            float: Precision at the specified recall level
        """

    def recall_at_precision(self, precision, label=None):
        """
        Get recall at a specific precision level.

        Args:
            precision (float): Target precision level (0.0-1.0)
            label (str, optional): Specific label or None for micro-average

        Returns:
            float: Recall at the specified precision level
        """
```

#### Usage Example

```python
import fasttext
import matplotlib.pyplot as plt
import numpy as np

model = fasttext.load_model('classifier.bin')

# Get detailed evaluation meter
meter = model.get_meter('test.txt')

# Analyze specific label
label = '__label__positive'
scores, true_labels = meter.score_vs_true(label)

print(f"Analysis for {label}:")
print(f" Score range: {scores.min():.3f} to {scores.max():.3f}")
print(f" Positive samples: {true_labels.sum()}")
print(f" Negative samples: {len(true_labels) - true_labels.sum()}")

# Get precision-recall curve
precision, recall, thresholds = meter.precision_recall_curve(label)

# Plot PR curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve for {label}')
plt.grid(True)
plt.show()

# Find optimal threshold
f1_scores = 2 * (precision * recall) / (precision + recall)
f1_scores = np.nan_to_num(f1_scores)  # Handle division by zero
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]
optimal_f1 = f1_scores[optimal_idx]

print(f"Optimal threshold: {optimal_threshold:.3f}")
print(f"Optimal F1-score: {optimal_f1:.3f}")

# Precision/recall at specific levels
precision_at_80_recall = meter.precision_at_recall(0.8, label)
recall_at_90_precision = meter.recall_at_precision(0.9, label)

print(f"Precision at 80% recall: {precision_at_80_recall:.3f}")
print(f"Recall at 90% precision: {recall_at_90_precision:.3f}")

# Multi-label analysis
labels = model.get_labels()
for label in labels[:5]:  # Analyze first 5 labels
    pr_at_50 = meter.precision_at_recall(0.5, label)
    re_at_90 = meter.recall_at_precision(0.9, label)
    print(f"{label}: P@50%R={pr_at_50:.3f}, R@90%P={re_at_90:.3f}")
```

### Text Preprocessing

Access FastText's internal text processing for consistency with training.

```python { .api }
def tokenize(text):
    """
    Tokenize text using FastText's internal tokenizer.

    Args:
        text (str): Input text to tokenize

    Returns:
        list: List of tokens
    """
```

#### Usage Example

```python
import fasttext

# Tokenize text consistently with training
text = "Hello, world! This is a test."
tokens = fasttext.tokenize(text)
print(f"Tokens: {tokens}")

# Compare with model prediction preprocessing
model = fasttext.load_model('classifier.bin')
words, labels = model.get_line(text)
print(f"Model preprocessing: {words}")

# Ensure consistency
custom_text = "E-mail addresses like user@domain.com are tricky!"
custom_tokens = fasttext.tokenize(custom_text)
print(f"Custom tokenization: {custom_tokens}")
```

## Classification Best Practices

### Data Preparation

- **Label Format**: Use the `__label__` prefix for all labels (see the file-format sketch below)
- **Text Cleaning**: FastText handles basic tokenization, but consider domain-specific preprocessing
- **Class Balance**: Consider stratified sampling for imbalanced datasets
- **Validation Split**: Reserve 10-20% of data for validation/hyperparameter tuning
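
A minimal sketch of what that label format looks like on disk (the file name `train.txt` and the labels/texts here are illustrative assumptions only):

```python
# Hypothetical excerpt of a training/test file: one example per line, each label
# prefixed with __label__, followed by the example text. A line may carry several
# labels for multi-label classification.
sample_lines = [
    "__label__positive Great movie, loved it!",
    "__label__negative Terrible film, waste of time.",
    "__label__positive __label__action Fast-paced and fun.",
]
with open('train.txt', 'w', encoding='utf-8') as f:
    f.write("\n".join(sample_lines) + "\n")
```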

### Model Configuration

- **Loss Functions**:
  - `softmax`: Multi-class classification (default)
  - `ns`: Negative sampling for large vocabularies
  - `hs`: Hierarchical softmax for efficient training
  - `ova`: One-vs-all for multi-label classification

- **Hyperparameters** (illustrated in the sketch after this list):
  - `lr=0.1`: Good starting learning rate
  - `wordNgrams=2`: Include bigrams for better context
  - `minn=3, maxn=6`: Character n-grams for robustness
  - `dim=100-300`: Higher dimensions for complex tasks
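
A sketch of how these settings might be passed to `fasttext.train_supervised` (the values are just the starting points listed above, and `train.txt` is an assumed training file in fastText label format, not a tuned recommendation):

```python
import fasttext

# Sketch: supervised training with the starting hyperparameters suggested above.
model = fasttext.train_supervised(
    input='train.txt',
    lr=0.1,           # starting learning rate
    wordNgrams=2,     # include bigrams for better context
    minn=3, maxn=6,   # character n-grams for robustness
    dim=100,          # embedding dimension; raise for complex tasks
    loss='softmax',   # multi-class; consider 'ova' for multi-label
)
model.save_model('classifier.bin')
```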

### Evaluation Strategy

- **Metrics**: Use F1-score for imbalanced classes, accuracy for balanced
- **Cross-validation**: Use k-fold CV for small datasets
- **Threshold Optimization**: Tune prediction thresholds for optimal F1 (see the sketch below)
- **Per-label Analysis**: Monitor per-class performance for multi-class problems
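
For threshold optimization, one simple approach is to sweep thresholds on a held-out file and keep the value that maximizes F1. A minimal sketch, assuming `classifier.bin` and a validation file `valid.txt` in the training format:

```python
import fasttext

# Sketch: pick the prediction threshold that maximizes F1 on validation data.
model = fasttext.load_model('classifier.bin')

best_threshold, best_f1 = 0.0, 0.0
for threshold in [i / 20 for i in range(1, 20)]:  # 0.05, 0.10, ..., 0.95
    _, precision, recall = model.test('valid.txt', k=1, threshold=threshold)
    if precision + recall == 0:
        continue
    f1 = 2 * precision * recall / (precision + recall)
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1

print(f"Best threshold: {best_threshold:.2f} (F1={best_f1:.4f})")
```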