# Model Training
FastText provides training functions for both supervised text classification and unsupervised word embedding models. Both support extensive hyperparameter configuration, and supervised training can additionally tune its hyperparameters automatically (autotune).
## Capabilities
### Supervised Training
Train text classification models using labeled data. Supports multi-class and multi-label classification with various loss functions and optimization strategies.
```python { .api }
def train_supervised(input, **kwargs):
    """
    Train a supervised classification model.

    Args:
        input (str): Path to training file with format: __label__<label> <text>
        lr (float): Learning rate (default: 0.1)
        dim (int): Vector dimension (default: 100)
        ws (int): Context window size (default: 5)
        epoch (int): Number of training epochs (default: 5)
        minCount (int): Minimum word count threshold (default: 1)
        minCountLabel (int): Minimum label count threshold (default: 0)
        minn (int): Min character n-gram length (default: 0)
        maxn (int): Max character n-gram length (default: 0)
        neg (int): Number of negative samples (default: 5)
        wordNgrams (int): Word n-gram length (default: 1)
        loss (str): Loss function - 'softmax', 'ns', 'hs', 'ova' (default: 'softmax')
        bucket (int): Hash bucket size (default: 2000000)
        thread (int): Number of threads (default: cpu_count - 1)
        lrUpdateRate (int): Learning rate update frequency (default: 100)
        t (float): Sampling threshold (default: 1e-4)
        label (str): Label prefix (default: '__label__')
        verbose (int): Verbosity level 0-2 (default: 2)
        pretrainedVectors (str): Path to pretrained vectors (default: '')
        seed (int): Random seed (default: 0)

        # AutoTune parameters for hyperparameter optimization
        autotuneValidationFile (str): Path to validation file for autotune
        autotuneMetric (str): Metric to optimize - 'f1', 'f1:labelname'
        autotunePredictions (int): Number of predictions for autotune
        autotuneDuration (int): Autotune duration in seconds
        autotuneModelSize (str): Target model size - '1M', '2M', etc.

    Returns:
        _FastText: Trained model object
    """
```
#### Usage Example
```python
import fasttext

# Basic supervised training
model = fasttext.train_supervised(
    input='train.txt',
    lr=0.1,
    dim=100,
    epoch=25,
    wordNgrams=2,
    loss='softmax'
)

# Advanced training with character n-grams
model = fasttext.train_supervised(
    input='train.txt',
    lr=0.5,
    dim=300,
    epoch=25,
    minn=3,
    maxn=6,
    wordNgrams=2,
    loss='ova'  # One-vs-all for multi-label
)

# Training with pretrained vectors
model = fasttext.train_supervised(
    input='train.txt',
    pretrainedVectors='wiki.en.vec',
    dim=300,  # must match the dimension of the pretrained vectors
    epoch=15,
    lr=0.1
)

# AutoTune training for optimal hyperparameters
model = fasttext.train_supervised(
    input='train.txt',
    autotuneValidationFile='valid.txt',
    autotuneMetric='f1',
    autotuneDuration=300  # 5 minutes
)
```
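
After training, the model can be saved and evaluated. A minimal sketch, assuming a held-out file `test.txt` in the same `__label__` format; `save_model` and `test` are methods on the returned model object:

```python
# Persist the trained model to disk in binary (.bin) format
model.save_model('model.bin')

# Evaluate on a held-out test set:
# returns (number of samples, precision@1, recall@1)
n, precision, recall = model.test('test.txt')
print(f'Tested {n} samples: P@1={precision:.3f}, R@1={recall:.3f}')
```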
### Unsupervised Training
Train word embedding models using unlabeled text data. Supports both CBOW and Skip-gram architectures with subword information.
```python { .api }
def train_unsupervised(input, **kwargs):
    """
    Train an unsupervised word embedding model.

    Args:
        input (str): Path to training text file
        model (str): Model architecture - 'cbow' or 'skipgram' (default: 'skipgram')
        lr (float): Learning rate (default: 0.05)
        dim (int): Vector dimension (default: 100)
        ws (int): Context window size (default: 5)
        epoch (int): Number of training epochs (default: 5)
        minCount (int): Minimum word count threshold (default: 5)
        minn (int): Min character n-gram length (default: 3)
        maxn (int): Max character n-gram length (default: 6)
        neg (int): Number of negative samples (default: 5)
        loss (str): Loss function - 'ns' or 'hs' (default: 'ns')
        bucket (int): Hash bucket size (default: 2000000)
        thread (int): Number of threads (default: cpu_count - 1)
        lrUpdateRate (int): Learning rate update frequency (default: 100)
        t (float): Sampling threshold (default: 1e-4)
        verbose (int): Verbosity level 0-2 (default: 2)
        seed (int): Random seed (default: 0)

    Returns:
        _FastText: Trained model object
    """
```
#### Usage Example
```python
import fasttext

# Basic skip-gram training
model = fasttext.train_unsupervised(
    input='data.txt',
    model='skipgram',
    dim=300,
    epoch=5
)

# CBOW with character n-grams
model = fasttext.train_unsupervised(
    input='data.txt',
    model='cbow',
    lr=0.05,
    dim=100,
    ws=5,
    epoch=5,
    minCount=5,
    minn=3,
    maxn=6,
    loss='ns'
)

# High-quality embeddings with more epochs
model = fasttext.train_unsupervised(
    input='large_corpus.txt',
    model='skipgram',
    lr=0.025,
    dim=300,
    ws=5,
    epoch=50,
    minCount=10,
    minn=3,
    maxn=6,
    neg=10,
    thread=8
)
```
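
Once training finishes, the embeddings can be queried directly on the returned model. A quick sketch using the standard lookup methods; the query word is an arbitrary example:

```python
# Vector for a word; out-of-vocabulary words also get vectors
# via their character n-grams
vec = model.get_word_vector('king')
print(vec.shape)  # (300,) for the dim=300 model above

# Nearest neighbors in the embedding space, as (score, word) pairs
print(model.get_nearest_neighbors('king', k=5))
```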
### Model Loading
Load pre-trained FastText models from disk.
```python { .api }
def load_model(path):
    """
    Load a pre-trained FastText model.

    Args:
        path (str): Path to model file (.bin or .ftz format)

    Returns:
        _FastText: Loaded model object

    Raises:
        ValueError: If model file cannot be loaded
        FileNotFoundError: If model file does not exist
    """
```
#### Usage Example
```python
import fasttext

# Load binary model
model = fasttext.load_model('model.bin')

# Load compressed model
model = fasttext.load_model('model.ftz')

# Load from different directory
model = fasttext.load_model('/path/to/models/wiki.en.bin')
```
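
The compressed `.ftz` format is produced by quantizing a trained supervised model before saving it. A minimal sketch, assuming `train.txt` is the original training file:

```python
import fasttext

# Train, then compress with product quantization (supervised models only)
model = fasttext.train_supervised(input='train.txt')
model.quantize(input='train.txt', retrain=True)  # retrain fine-tunes after quantization
model.save_model('model.ftz')
```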
## Training Data Format
### Supervised Training Data
Training files should contain one sample per line with labels prefixed by `__label__`:
```
__label__positive This movie is great!
__label__negative Terrible film.
__label__neutral It was okay.
__label__positive __label__comedy This is a funny and great movie
```
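
Such a file can be generated programmatically. A small sketch, assuming the labeled samples live in an in-memory list (the variable names are illustrative):

```python
# Hypothetical labeled data: (labels, text) pairs
samples = [
    (['positive'], 'This movie is great!'),
    (['negative'], 'Terrible film.'),
    (['positive', 'comedy'], 'This is a funny and great movie'),
]

with open('train.txt', 'w', encoding='utf-8') as f:
    for labels, text in samples:
        prefix = ' '.join(f'__label__{label}' for label in labels)
        text = ' '.join(text.split())  # keep each sample on a single line
        f.write(f'{prefix} {text}\n')
```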
### Unsupervised Training Data
Training files should contain plain text, one sentence per line:
```
The quick brown fox jumps over the lazy dog.
Natural language processing is fascinating.
FastText learns word representations efficiently.
```
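
Raw corpora usually need light preprocessing first. A rough sketch that lowercases text and writes one sentence per line, assuming a hypothetical `raw_corpus.txt`; the naive punctuation-based splitter should be swapped for a proper sentence tokenizer in a real pipeline:

```python
import re

with open('raw_corpus.txt', encoding='utf-8') as src, \
        open('data.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        # Naive split on sentence-final punctuation followed by whitespace
        for sentence in re.split(r'(?<=[.!?])\s+', line.strip().lower()):
            if sentence:
                dst.write(sentence + '\n')
```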
## Performance Tips
- **Learning Rate**: Start with 0.1 for supervised, 0.05 for unsupervised
- **Dimensions**: 100-300 typical, higher for larger vocabularies
- **Character N-grams**: Use minn=3, maxn=6 for subword information
- **Word N-grams**: Use wordNgrams=1-3 for better text classification
- **Epochs**: 5-25 for most tasks, more for large datasets
- **Threads**: Set to the number of CPU cores for faster training (combined in the sketch below)
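
A short sketch combining several of these tips into one supervised training call; the input file is the same hypothetical `train.txt` used above:

```python
import fasttext
import multiprocessing

model = fasttext.train_supervised(
    input='train.txt',
    lr=0.1,                             # supervised starting point
    dim=100,                            # 100-300 is typical
    minn=3, maxn=6,                     # character n-grams for subword information
    wordNgrams=2,                       # word bigrams often help classification
    epoch=25,
    thread=multiprocessing.cpu_count()  # use all CPU cores
)
```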