# FastText

FastText is a library for efficient learning of word representations and sentence classification developed by Facebook Research. The Python bindings provide comprehensive access to FastText's C++ core, enabling unsupervised word representation learning, supervised text classification, and subword information processing.

## Package Information

- **Package Name**: fasttext
- **Language**: Python (with C++ core)
- **Installation**: `pip install fasttext`
## Core Imports

```python
import fasttext
```

The main entry points can also be imported directly:

```python
from fasttext import train_supervised, train_unsupervised, load_model, tokenize
```
## Basic Usage

### Training a Word Embedding Model

```python
import fasttext

# Train an unsupervised skip-gram model on a plain-text file
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# Get the vector for a word
word_vector = model.get_word_vector('king')

# Find the most similar words, returned as (similarity, word) pairs
neighbors = model.get_nearest_neighbors('king')
print(neighbors)
```
### Training a Text Classification Model

```python
import fasttext

# Train a supervised classifier; each line of train.txt holds the labels
# (prefixed with __label__) followed by the text
model = fasttext.train_supervised('train.txt')

# Predict labels for a piece of text
predictions = model.predict('This is a sample text')
print(predictions)

# Evaluate on test data; the result is (sample_count, precision, recall)
results = model.test('test.txt')
print(f"P@1: {results[1]}, R@1: {results[2]}")
```
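The training file passed to `train_supervised` is plain text with one example per line, where each label carries a prefix (`__label__` by default). A minimal sketch of writing such a file follows; the helper name `format_example`, the sample rows, and the temporary file path are hypothetical and chosen only for illustration.

```python
import os
import tempfile

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def format_example(labels, text):
    """Return one training line: prefixed labels followed by the text."""
    # A newline would split the example in two, so collapse all whitespace.
    clean_text = " ".join(text.split())
    prefixed = " ".join(f"{LABEL_PREFIX}{label}" for label in labels)
    return f"{prefixed} {clean_text}"


# Hypothetical sample data: (labels, text) pairs.
rows = [
    (["positive"], "Great product, works as described"),
    (["negative", "shipping"], "Arrived late and\nthe box was damaged"),
]

lines = [format_example(labels, text) for labels, text in rows]

# Write to a temporary file; its path could then be given to train_supervised.
train_path = os.path.join(tempfile.mkdtemp(), "train.txt")
with open(train_path, "w", encoding="utf-8") as handle:
    handle.write("\n".join(lines) + "\n")

print(lines[0])
```

Real data usually needs further preprocessing, such as lowercasing or separating punctuation, before training.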
### Loading Pre-trained Models

```python
import fasttext

# Load a pre-trained model
model = fasttext.load_model('model.bin')

# Get sentence vector
sentence_vector = model.get_sentence_vector('Hello world')
```
## Architecture

FastText combines several key innovations:

- **Subword Information**: Represents words as bags of character n-grams, so vectors can be built for out-of-vocabulary words
- **Hierarchical Softmax**: Efficient training for large vocabularies
- **Bag-of-Words Models**: CBOW and Skip-gram architectures for unsupervised learning
- **Fast Text Classification**: Linear classifiers with efficient training and inference

The Python bindings expose the C++ core through pybind11, providing both high-level training functions and low-level model manipulation capabilities.
## Capabilities

### Model Training

Core training functions for both supervised classification and unsupervised word embeddings with extensive hyperparameter control.

```python { .api }
def train_supervised(input, **kwargs):
    """
    Train a supervised classification model.

    Args:
        input (str): Path to training file
        **kwargs: Training parameters (lr, dim, epoch, etc.)

    Returns:
        FastText model object
    """

def train_unsupervised(input, **kwargs):
    """
    Train an unsupervised word embedding model.

    Args:
        input (str): Path to training file
        **kwargs: Training parameters (model, lr, dim, etc.)

    Returns:
        FastText model object
    """

def load_model(path):
    """
    Load a pre-trained FastText model.

    Args:
        path (str): Path to model file

    Returns:
        FastText model object
    """
```
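Hyperparameters are passed as keyword arguments. The sketch below merges overrides into a baseline and forwards the result to `train_supervised`. The names `lr`, `epoch`, `wordNgrams`, `dim`, and `loss` are fastText training parameters, while the baseline values and the helpers `merge_params` and `train_classifier` are illustrative assumptions, not recommended settings.

```python
# Illustrative baseline; the values are assumptions, not tuned defaults.
BASE_PARAMS = {"lr": 0.5, "epoch": 25, "wordNgrams": 2, "dim": 100, "loss": "softmax"}


def merge_params(**overrides):
    """Return the baseline hyperparameters with any overrides applied."""
    params = dict(BASE_PARAMS)
    params.update(overrides)
    return params


def train_classifier(train_path, **overrides):
    """Train a supervised model with the merged hyperparameters."""
    import fasttext  # imported lazily so the helpers above work without it

    return fasttext.train_supervised(input=train_path, **merge_params(**overrides))


# Inspect the parameters that a shorter run with hierarchical softmax would use.
print(merge_params(epoch=5, loss="hs"))
```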
[Model Training](./training.md)

### Word Vector Operations

Model methods for retrieving word and sentence vectors, finding similar words, and solving analogies through vector arithmetic.

```python { .api }
def get_word_vector(word):
    """Get vector representation of a word."""

def get_sentence_vector(text):
    """Get vector representation of a sentence."""

def get_nearest_neighbors(word, k=10):
    """Find k nearest neighbors of a word."""

def get_analogies(wordA, wordB, wordC, k=10):
    """Find the k words closest to wordA - wordB + wordC."""
```
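Nearest-neighbor queries rank words by how close their vectors are, typically measured with cosine similarity. A minimal pure-Python sketch of that ranking follows, using made-up three-dimensional vectors in place of real `get_word_vector` output; the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for model.get_word_vector(...) results.
vectors = {
    "king": [0.9, 0.1, 0.3],
    "queen": [0.8, 0.2, 0.4],
    "apple": [0.1, 0.9, 0.0],
}


def nearest(word, k=2):
    """Rank the other words by similarity to `word`, highest first."""
    scored = [
        (cosine_similarity(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, reverse=True)[:k]


# Pairs are (similarity, word), the same order get_nearest_neighbors uses.
print(nearest("king"))
```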
[Word Vectors](./word-vectors.md)

### Text Classification

Predict labels for text and evaluate model performance on a labeled test file.

```python { .api }
def predict(text, k=1, threshold=0.0):
    """
    Predict labels for input text.

    Args:
        text (str): Input text to classify
        k (int): Number of top predictions to return
        threshold (float): Minimum probability for a label to be returned

    Returns:
        Tuple of (labels, probabilities)
    """

def test(path, k=1, threshold=0.0):
    """
    Evaluate model on test data.

    Returns:
        Tuple of (sample_count, precision, recall)
    """
```
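`predict` returns labels with their prefix still attached, and `test` returns precision and recall but no F1 score. The sketch below post-processes both kinds of result; the sample tuples are made-up stand-ins for real model output, and the helper names `clean_predictions` and `f1_from_test` are hypothetical.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def strip_prefix(label):
    """Remove the label prefix if it is present."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label


def clean_predictions(prediction):
    """Turn a (labels, probabilities) tuple into [(label, probability), ...]."""
    labels, probabilities = prediction
    return [(strip_prefix(label), float(prob)) for label, prob in zip(labels, probabilities)]


def f1_from_test(result):
    """Compute F1 from a (sample_count, precision, recall) tuple."""
    _, precision, recall = result
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up values shaped like model.predict(..., k=2) and model.test(...) output.
sample_prediction = (("__label__positive", "__label__neutral"), (0.75, 0.25))
sample_test_result = (1000, 0.5, 0.5)

print(clean_predictions(sample_prediction))
print(f1_from_test(sample_test_result))
```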
[Classification](./classification.md)

### Utility Functions

Helper functions for text processing, model manipulation, and downloading pre-trained models.

```python { .api }
def tokenize(text):
    """Tokenize text into a list of tokens."""

def quantize(**kwargs):
    """Quantize model to reduce memory usage."""

# Utility module functions
import fasttext.util
fasttext.util.download_model(lang_id, if_exists='strict')
fasttext.util.reduce_model(model, target_dim)
```
[Utilities](./utilities.md)

## Constants and Enums

```python { .api }
# Model type enums (from C++ bindings via fasttext_pybind)
import fasttext
model_name = fasttext.model_name  # Enum with values: cbow, skipgram, supervised
loss_name = fasttext.loss_name    # Enum with values: hs, ns, softmax, ova

# Special tokens used in text processing
EOS = "</s>"  # End of sentence token - marks sentence boundaries
BOW = "<"     # Beginning of word token - used in subword processing
EOW = ">"     # End of word token - used in subword processing

# Deprecated functions (raise exceptions with migration guidance)
cbow = fasttext.cbow              # Raises exception, use train_unsupervised(model='cbow')
skipgram = fasttext.skipgram      # Raises exception, use train_unsupervised(model='skipgram')
supervised = fasttext.supervised  # Raises exception, use train_supervised()
```
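The `BOW` and `EOW` markers above wrap a word before its character n-grams are extracted, which keeps prefixes and suffixes distinct from sequences inside the word. A minimal sketch of that extraction follows. It illustrates the idea only and is not the library's implementation, which also hashes n-grams into buckets. The function name `char_ngrams` is hypothetical, and its `minn`/`maxn` arguments borrow the names of fastText's n-gram length parameters.

```python
BOW = "<"  # beginning-of-word marker
EOW = ">"  # end-of-word marker


def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` with lengths minn..maxn."""
    wrapped = f"{BOW}{word}{EOW}"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", minn=3, maxn=3))
```

With `minn=3` and `maxn=3`, the word `where` yields `<wh`, `whe`, `her`, `ere`, and `re>`.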
## Error Handling

Several FastText methods accept an `on_unicode_error` parameter that controls how Unicode errors are handled:

- `'strict'` (default): Raise an exception on Unicode errors
- `'ignore'`: Skip invalid Unicode characters
- `'replace'`: Replace invalid Unicode with a placeholder
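
These option names match the error handlers of Python's own codecs, so their effect can be previewed with `bytes.decode`. The sketch below assumes fastText applies the same semantics when it decodes words and labels; the sample bytes are made up.

```python
# A byte string containing one invalid UTF-8 byte (0xff).
raw = b"caf\xff"

# 'ignore' drops the invalid byte.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode("utf-8", errors="replace")

# 'strict' raises UnicodeDecodeError.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```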