FastText library for efficient learning of word representations and sentence classification
FastText provides training functions for both supervised text classification and unsupervised word embedding models. Both support extensive hyperparameter configuration, and supervised training additionally supports automatic hyperparameter tuning (autotune).
Train text classification models using labeled data. Supports multi-class and multi-label classification with various loss functions and optimization strategies.
def train_supervised(input, **kwargs):
    """
    Train a supervised classification model.

    Args:
        input (str): Path to training file with format: __label__<label> <text>
        lr (float): Learning rate (default: 0.1)
        dim (int): Vector dimension (default: 100)
        ws (int): Context window size (default: 5)
        epoch (int): Number of training epochs (default: 5)
        minCount (int): Minimum word count threshold (default: 1)
        minCountLabel (int): Minimum label count threshold (default: 0)
        minn (int): Min character n-gram length (default: 0)
        maxn (int): Max character n-gram length (default: 0)
        neg (int): Number of negative samples (default: 5)
        wordNgrams (int): Word n-gram length (default: 1)
        loss (str): Loss function - 'softmax', 'ns', 'hs', 'ova' (default: 'softmax')
        bucket (int): Hash bucket size (default: 2000000)
        thread (int): Number of threads (default: cpu_count - 1)
        lrUpdateRate (int): Learning rate update frequency (default: 100)
        t (float): Sampling threshold (default: 1e-4)
        label (str): Label prefix (default: '__label__')
        verbose (int): Verbosity level 0-2 (default: 2)
        pretrainedVectors (str): Path to pretrained vectors (default: '')
        seed (int): Random seed (default: 0)

        # AutoTune parameters for hyperparameter optimization
        autotuneValidationFile (str): Path to validation file for autotune
        autotuneMetric (str): Metric to optimize - 'f1', or 'f1:<label>' to optimize for a single label
        autotunePredictions (int): Number of predictions (k) used when evaluating during autotune
        autotuneDuration (int): Autotune duration in seconds
        autotuneModelSize (str): Target model size constraint - '1M', '2M', etc. (produces a quantized model)

    Returns:
        _FastText: Trained model object
    """

import fasttext
# Basic supervised training
model = fasttext.train_supervised(
    input='train.txt',
    lr=0.1,
    dim=100,
    epoch=25,
    wordNgrams=2,
    loss='softmax'
)
# Advanced training with character n-grams
model = fasttext.train_supervised(
    input='train.txt',
    lr=0.5,
    dim=300,
    epoch=25,
    minn=3,
    maxn=6,
    wordNgrams=2,
    loss='ova'  # one-vs-all, suitable for multi-label classification
)
# Training with pretrained vectors
model = fasttext.train_supervised(
    input='train.txt',
    pretrainedVectors='wiki.en.vec',
    dim=300,  # must match the dimension of the pretrained vectors
    epoch=15,
    lr=0.1
)
# AutoTune training for optimal hyperparameters
model = fasttext.train_supervised(
    input='train.txt',
    autotuneValidationFile='valid.txt',
    autotuneMetric='f1',
    autotuneDuration=300  # 5 minutes
)
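Once trained, the returned model object can be evaluated, queried, and saved. A minimal sketch using the standard model methods (test, predict, save_model, quantize); valid.txt is assumed to follow the same __label__ format as the training file:

# Evaluate on a held-out file: returns (number of samples, precision@1, recall@1)
n, precision, recall = model.test('valid.txt')
print(f"N={n} P@1={precision:.3f} R@1={recall:.3f}")

# Predict the top 2 labels for one sentence
labels, probs = model.predict("This movie is great!", k=2)

# Save the full model, or quantize it first for a much smaller .ftz file
model.save_model('model.bin')
model.quantize(input='train.txt', retrain=True)
model.save_model('model.ftz')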
Train word embedding models using unlabeled text data. Supports both CBOW and Skip-gram architectures with subword information.

def train_unsupervised(input, **kwargs):
    """
    Train an unsupervised word embedding model.

    Args:
        input (str): Path to training text file
        model (str): Model architecture - 'cbow' or 'skipgram' (default: 'skipgram')
        lr (float): Learning rate (default: 0.05)
        dim (int): Vector dimension (default: 100)
        ws (int): Context window size (default: 5)
        epoch (int): Number of training epochs (default: 5)
        minCount (int): Minimum word count threshold (default: 5)
        minn (int): Min character n-gram length (default: 3)
        maxn (int): Max character n-gram length (default: 6)
        neg (int): Number of negative samples (default: 5)
        loss (str): Loss function - 'ns' or 'hs' (default: 'ns')
        bucket (int): Hash bucket size (default: 2000000)
        thread (int): Number of threads (default: cpu_count - 1)
        lrUpdateRate (int): Learning rate update frequency (default: 100)
        t (float): Sampling threshold (default: 1e-4)
        verbose (int): Verbosity level 0-2 (default: 2)
        seed (int): Random seed (default: 0)

    Returns:
        _FastText: Trained model object
    """

import fasttext
# Basic skip-gram training
model = fasttext.train_unsupervised(
    input='data.txt',
    model='skipgram',
    dim=300,
    epoch=5
)
# CBOW with character n-grams
model = fasttext.train_unsupervised(
    input='data.txt',
    model='cbow',
    lr=0.05,
    dim=100,
    ws=5,
    epoch=5,
    minCount=5,
    minn=3,
    maxn=6,
    loss='ns'
)
# High-quality embeddings with more epochs
model = fasttext.train_unsupervised(
    input='large_corpus.txt',
    model='skipgram',
    lr=0.025,
    dim=300,
    ws=5,
    epoch=50,
    minCount=10,
    minn=3,
    maxn=6,
    neg=10,
    thread=8
)
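The resulting model exposes the learned vectors directly. A minimal sketch of the common lookups; because of subword n-grams, these also work for out-of-vocabulary words:

# Look up a single word vector (a numpy array of length dim)
vec = model.get_word_vector('king')

# Find the most similar words in the vocabulary
neighbors = model.get_nearest_neighbors('king', k=5)

# Average-based vector for a whole sentence
sent_vec = model.get_sentence_vector('The quick brown fox')

# Persist the embeddings for later reuse
model.save_model('embeddings.bin')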
Load pre-trained FastText models from disk.

def load_model(path):
    """
    Load a pre-trained FastText model.

    Args:
        path (str): Path to model file (.bin or .ftz format)

    Returns:
        _FastText: Loaded model object

    Raises:
        ValueError: If model file cannot be loaded
        FileNotFoundError: If model file does not exist
    """

import fasttext
# Load binary model
model = fasttext.load_model('model.bin')

# Load compressed model
model = fasttext.load_model('model.ftz')

# Load from different directory
model = fasttext.load_model('/path/to/models/wiki.en.bin')
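A loaded model behaves exactly like one returned by the training functions. A short sketch of inspecting it; predict applies only to supervised models, while vector lookups work for any model:

print(model.get_dimension())   # embedding dimension
print(model.words[:10])        # first ten vocabulary entries

# Supervised models: classify text
# labels, probs = model.predict("Some input text", k=1)

# Any model: look up a word vector
vec = model.get_word_vector('example')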
Training files should contain one sample per line with labels prefixed by __label__:

__label__positive This movie is great!
__label__negative Terrible film.
__label__neutral It was okay.
__label__positive __label__comedy This is a funny and great movie
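For illustration, a minimal sketch that generates a file in this format; the sample data and the train.txt file name are hypothetical:

samples = [
    (['positive'], 'This movie is great!'),
    (['negative'], 'Terrible film.'),
    (['positive', 'comedy'], 'This is a funny and great movie'),
]

with open('train.txt', 'w', encoding='utf-8') as f:
    for labels, text in samples:
        prefix = ' '.join(f'__label__{label}' for label in labels)
        f.write(f'{prefix} {text}\n')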
Training files should contain plain text, one sentence per line:

The quick brown fox jumps over the lazy dog.
Natural language processing is fascinating.
FastText learns word representations efficiently.
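FastText treats whitespace-separated tokens as words, so corpora are commonly lowercased and punctuation separated before training. A minimal sketch with hypothetical file names; the right preprocessing depends on the corpus:

import re

def preprocess(line):
    # Lowercase and surround punctuation with spaces so it tokenizes separately
    line = line.lower()
    line = re.sub(r"([.!?,'()])", r" \1 ", line)
    return re.sub(r"\s+", " ", line).strip()

with open('raw.txt', encoding='utf-8') as src, open('data.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(preprocess(line) + '\n')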
Install with Tessl CLI

npx tessl i tessl/pypi-fasttext