# Model Training
FastText provides training functions for both supervised text classification and unsupervised word embedding models. Both support extensive hyperparameter configuration, and supervised training can additionally tune its hyperparameters automatically (autotune).
## Capabilities
### Supervised Training
Train text classification models using labeled data. Supports multi-class and multi-label classification with various loss functions and optimization strategies.
```python { .api }
def train_supervised(input, **kwargs):
    """
    Train a supervised classification model.

    Args:
        input (str): Path to training file with format: __label__<label> <text>
        lr (float): Learning rate (default: 0.1)
        dim (int): Vector dimension (default: 100)
        ws (int): Context window size (default: 5)
        epoch (int): Number of training epochs (default: 5)
        minCount (int): Minimum word count threshold (default: 1)
        minCountLabel (int): Minimum label count threshold (default: 0)
        minn (int): Min character n-gram length (default: 0)
        maxn (int): Max character n-gram length (default: 0)
        neg (int): Number of negative samples (default: 5)
        wordNgrams (int): Word n-gram length (default: 1)
        loss (str): Loss function - 'softmax', 'ns', 'hs', 'ova' (default: 'softmax')
        bucket (int): Hash bucket size (default: 2000000)
        thread (int): Number of threads (default: cpu_count - 1)
        lrUpdateRate (int): Learning rate update frequency (default: 100)
        t (float): Sampling threshold (default: 1e-4)
        label (str): Label prefix (default: '__label__')
        verbose (int): Verbosity level 0-2 (default: 2)
        pretrainedVectors (str): Path to pretrained vectors (default: '')
        seed (int): Random seed (default: 0)

        # AutoTune parameters for hyperparameter optimization
        autotuneValidationFile (str): Path to validation file for autotune
        autotuneMetric (str): Metric to optimize - 'f1', 'f1:labelname'
        autotunePredictions (int): Number of predictions for autotune
        autotuneDuration (int): Autotune duration in seconds
        autotuneModelSize (str): Target model size - '1M', '2M', etc.

    Returns:
        _FastText: Trained model object
    """
```
#### Usage Example
```python
import fasttext

# Basic supervised training
model = fasttext.train_supervised(
    input='train.txt',
    lr=0.1,
    dim=100,
    epoch=25,
    wordNgrams=2,
    loss='softmax'
)

# Advanced training with character n-grams
model = fasttext.train_supervised(
    input='train.txt',
    lr=0.5,
    dim=300,
    epoch=25,
    minn=3,
    maxn=6,
    wordNgrams=2,
    loss='ova'  # One-vs-all for multi-label
)

# Training with pretrained vectors
model = fasttext.train_supervised(
    input='train.txt',
    pretrainedVectors='wiki.en.vec',
    dim=300,  # must match the dimension of the pretrained vectors
    epoch=15,
    lr=0.1
)

# AutoTune training for optimal hyperparameters
model = fasttext.train_supervised(
    input='train.txt',
    autotuneValidationFile='valid.txt',
    autotuneMetric='f1',
    autotuneDuration=300  # 5 minutes
)
```
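
After training, the model can be saved and evaluated. A minimal sketch, assuming a held-out file `test.txt` in the same `__label__` format; `save_model` and `test` are methods on the returned model object:

```python
# Persist the trained model to disk in binary (.bin) format
model.save_model('model.bin')

# Evaluate on a held-out test set:
# returns (number of samples, precision@1, recall@1)
n, precision, recall = model.test('test.txt')
print(f'Tested {n} samples: P@1={precision:.3f}, R@1={recall:.3f}')
```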
### Unsupervised Training
Train word embedding models using unlabeled text data. Supports both CBOW and Skip-gram architectures with subword information.
```python { .api }
def train_unsupervised(input, **kwargs):
    """
    Train an unsupervised word embedding model.

    Args:
        input (str): Path to training text file
        model (str): Model architecture - 'cbow' or 'skipgram' (default: 'skipgram')
        lr (float): Learning rate (default: 0.05)
        dim (int): Vector dimension (default: 100)
        ws (int): Context window size (default: 5)
        epoch (int): Number of training epochs (default: 5)
        minCount (int): Minimum word count threshold (default: 5)
        minn (int): Min character n-gram length (default: 3)
        maxn (int): Max character n-gram length (default: 6)
        neg (int): Number of negative samples (default: 5)
        loss (str): Loss function - 'ns' or 'hs' (default: 'ns')
        bucket (int): Hash bucket size (default: 2000000)
        thread (int): Number of threads (default: cpu_count - 1)
        lrUpdateRate (int): Learning rate update frequency (default: 100)
        t (float): Sampling threshold (default: 1e-4)
        verbose (int): Verbosity level 0-2 (default: 2)
        seed (int): Random seed (default: 0)

    Returns:
        _FastText: Trained model object
    """
```
#### Usage Example
```python
import fasttext

# Basic skip-gram training
model = fasttext.train_unsupervised(
    input='data.txt',
    model='skipgram',
    dim=300,
    epoch=5
)

# CBOW with character n-grams
model = fasttext.train_unsupervised(
    input='data.txt',
    model='cbow',
    lr=0.05,
    dim=100,
    ws=5,
    epoch=5,
    minCount=5,
    minn=3,
    maxn=6,
    loss='ns'
)

# High-quality embeddings with more epochs
model = fasttext.train_unsupervised(
    input='large_corpus.txt',
    model='skipgram',
    lr=0.025,
    dim=300,
    ws=5,
    epoch=50,
    minCount=10,
    minn=3,
    maxn=6,
    neg=10,
    thread=8
)
```
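
Once training finishes, the embeddings can be queried directly on the returned model. A quick sketch using the standard lookup methods; the query word is an arbitrary example:

```python
# Vector for a word; out-of-vocabulary words also get vectors
# via their character n-grams
vec = model.get_word_vector('king')
print(vec.shape)  # (300,) for the dim=300 model above

# Nearest neighbors in the embedding space, as (score, word) pairs
print(model.get_nearest_neighbors('king', k=5))
```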
### Model Loading
Load pre-trained FastText models from disk.
```python { .api }
def load_model(path):
    """
    Load a pre-trained FastText model.

    Args:
        path (str): Path to model file (.bin or .ftz format)

    Returns:
        _FastText: Loaded model object

    Raises:
        ValueError: If model file cannot be loaded
        FileNotFoundError: If model file does not exist
    """
```
#### Usage Example
```python
import fasttext

# Load binary model
model = fasttext.load_model('model.bin')

# Load compressed model
model = fasttext.load_model('model.ftz')

# Load from different directory
model = fasttext.load_model('/path/to/models/wiki.en.bin')
```
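
The compressed `.ftz` format is produced by quantizing a trained supervised model before saving it. A minimal sketch, assuming `train.txt` is the original training file:

```python
import fasttext

# Train, then compress with product quantization (supervised models only)
model = fasttext.train_supervised(input='train.txt')
model.quantize(input='train.txt', retrain=True)  # retrain fine-tunes after quantization
model.save_model('model.ftz')
```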
## Training Data Format
### Supervised Training Data
Training files should contain one sample per line with labels prefixed by `__label__`:
```
__label__positive This movie is great!
__label__negative Terrible film.
__label__neutral It was okay.
__label__positive __label__comedy This is a funny and great movie
```
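
Such a file can be generated programmatically. A small sketch, assuming the labeled samples live in an in-memory list (the variable names are illustrative):

```python
# Hypothetical labeled data: (labels, text) pairs
samples = [
    (['positive'], 'This movie is great!'),
    (['negative'], 'Terrible film.'),
    (['positive', 'comedy'], 'This is a funny and great movie'),
]

with open('train.txt', 'w', encoding='utf-8') as f:
    for labels, text in samples:
        prefix = ' '.join(f'__label__{label}' for label in labels)
        text = ' '.join(text.split())  # keep each sample on a single line
        f.write(f'{prefix} {text}\n')
```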
### Unsupervised Training Data
Training files should contain plain text, one sentence per line:
```
The quick brown fox jumps over the lazy dog.
Natural language processing is fascinating.
FastText learns word representations efficiently.
```
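
Raw corpora usually need light preprocessing first. A rough sketch that lowercases text and writes one sentence per line, assuming a hypothetical `raw_corpus.txt`; the naive punctuation-based splitter should be swapped for a proper sentence tokenizer in a real pipeline:

```python
import re

with open('raw_corpus.txt', encoding='utf-8') as src, \
        open('data.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        # Naive split on sentence-final punctuation followed by whitespace
        for sentence in re.split(r'(?<=[.!?])\s+', line.strip().lower()):
            if sentence:
                dst.write(sentence + '\n')
```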
## Performance Tips
- **Learning Rate**: Start with 0.1 for supervised, 0.05 for unsupervised
- **Dimensions**: 100-300 typical, higher for larger vocabularies
- **Character N-grams**: Use minn=3, maxn=6 for subword information
- **Word N-grams**: Use wordNgrams=1-3 for better text classification
- **Epochs**: 5-25 for most tasks, more for large datasets
- **Threads**: Set to the number of CPU cores for faster training (combined in the sketch below)
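
A short sketch combining several of these tips into one supervised training call; the input file is the same hypothetical `train.txt` used above:

```python
import fasttext
import multiprocessing

model = fasttext.train_supervised(
    input='train.txt',
    lr=0.1,                             # supervised starting point
    dim=100,                            # 100-300 is typical
    minn=3, maxn=6,                     # character n-grams for subword information
    wordNgrams=2,                       # word bigrams often help classification
    epoch=25,
    thread=multiprocessing.cpu_count()  # use all CPU cores
)
```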