tessl/pypi-fasttext

FastText library for efficient learning of word representations and sentence classification

FastText

FastText is a library for efficient learning of word representations and sentence classification developed by Facebook Research. The Python bindings provide comprehensive access to FastText's C++ core, enabling unsupervised word representation learning, supervised text classification, and subword information processing.

Package Information

  • Package Name: fasttext
  • Language: Python (with C++ core)
  • Installation: pip install fasttext

Core Imports

import fasttext

Main functions:

from fasttext import train_supervised, train_unsupervised, load_model, tokenize

Basic Usage

Training a Word Embedding Model

import fasttext

# Train an unsupervised model on a text file
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# Get word vector
word_vector = model.get_word_vector('king')

# Find similar words
neighbors = model.get_nearest_neighbors('king')
print(neighbors)

Training a Text Classification Model

import fasttext

# Train supervised classifier
model = fasttext.train_supervised('train.txt')

# Predict labels for text
predictions = model.predict('This is a sample text')
print(predictions)

# Evaluate on test data
results = model.test('test.txt')
print(f"P@1: {results[1]}, R@1: {results[2]}")
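The supervised trainer reads a plain-text file with one example per line, where each label carries the `__label__` prefix (fastText's default). A minimal sketch of preparing such a file with the standard library follows. The sample texts, the label names, and the helpers `format_example` and `write_training_file` are hypothetical and exist only for illustration.

```python
import os
import tempfile

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def format_example(labels, text):
    """Return one training line: prefixed labels followed by the text."""
    # A newline inside the text would split the example, so collapse whitespace.
    clean_text = " ".join(text.split())
    prefixed = " ".join(LABEL_PREFIX + label for label in labels)
    return f"{prefixed} {clean_text}"


def write_training_file(examples, path):
    """Write (labels, text) pairs to `path`, one example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for labels, text in examples:
            handle.write(format_example(labels, text) + "\n")


# Hypothetical sample data, for illustration only.
examples = [
    (["positive"], "I loved this film"),
    (["negative", "boring"], "Slow and\npredictable"),
]

train_path = os.path.join(tempfile.mkdtemp(), "train.txt")
write_training_file(examples, train_path)

with open(train_path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
```

The resulting path can then be passed to `fasttext.train_supervised` in place of `'train.txt'`.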

Loading Pre-trained Models

import fasttext

# Load a pre-trained model
model = fasttext.load_model('model.bin')

# Get sentence vector
sentence_vector = model.get_sentence_vector('Hello world')

Architecture

FastText combines several key innovations:

  • Subword Information: Handles out-of-vocabulary words by learning representations for character n-grams
  • Hierarchical Softmax: Speeds up training when the output space (vocabulary or label set) is large
  • Unsupervised Architectures: CBOW (continuous bag-of-words) and skip-gram models for learning word vectors
  • Fast Text Classification: Linear classifiers with efficient training and inference
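To make the subword idea concrete, the sketch below extracts character n-grams from a word wrapped in the `<` and `>` boundary markers. It is a simplified illustration, not the library's implementation: the helper `char_ngrams` is hypothetical, the default lengths 3 to 6 mirror fastText's defaults for unsupervised models, and the real library additionally hashes n-grams into a fixed number of buckets.

```python
BOW = "<"  # beginning-of-word marker
EOW = ">"  # end-of-word marker


def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` with lengths minn..maxn.

    Hypothetical helper: it skips the hashing step the library performs.
    """
    wrapped = BOW + word + EOW
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


trigrams = char_ngrams("where", minn=3, maxn=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# A word absent from the vocabulary still shares n-grams with known words,
# which is what allows a vector to be composed for it.
shared = set(char_ngrams("whereas", minn=3, maxn=3)) & set(trigrams)
print(sorted(shared))  # ['<wh', 'ere', 'her', 'whe']
```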

The Python bindings expose the complete C++ API through pybind11, providing both high-level training functions and low-level model manipulation capabilities.

Capabilities

Model Training

Core training functions for both supervised classification and unsupervised word embeddings with extensive hyperparameter control.

def train_supervised(input, **kwargs):
    """
    Train a supervised classification model.
    
    Args:
        input (str): Path to training file
        **kwargs: Training parameters (lr, dim, epoch, etc.)
    
    Returns:
        FastText model object
    """

def train_unsupervised(input, **kwargs):
    """
    Train an unsupervised word embedding model.
    
    Args:
        input (str): Path to training file
        **kwargs: Training parameters (model, lr, dim, etc.)
    
    Returns:
        FastText model object
    """

def load_model(path):
    """
    Load a pre-trained FastText model.
    
    Args:
        path (str): Path to model file
    
    Returns:
        FastText model object
    """

Word Vector Operations

Access and manipulate word vectors, find similar words, and perform vector arithmetic operations.

def get_word_vector(word):
    """Get vector representation of a word."""

def get_sentence_vector(text):
    """Get vector representation of a sentence."""

def get_nearest_neighbors(word, k=10):
    """Find k nearest neighbors of a word."""

def get_analogies(wordA, wordB, wordC, k=10):
    """Find the k words closest to vector(wordA) - vector(wordB) + vector(wordC)."""

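Nearest-neighbor and analogy queries both reduce to ranking words by cosine similarity against a query vector. The sketch below shows that idea on hand-made toy vectors using only the standard library. The vectors and the helpers `cosine`, `nearest_neighbors`, and `analogy` are hypothetical, and the sketch skips the vector normalization the library applies before combining vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(vectors, query, k=10, exclude=()):
    """Return up to k (similarity, word) pairs, most similar first."""
    scored = [
        (cosine(query, vec), word)
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(reverse=True)
    return scored[:k]


def analogy(vectors, a, b, c, k=10):
    """Rank words by similarity to the query vector a - b + c."""
    query = [va - vb + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    return nearest_neighbors(vectors, query, k=k, exclude={a, b, c})


# Hypothetical 3-dimensional vectors, chosen by hand for illustration.
toy_vectors = {
    "king": [1, 1, 0],
    "queen": [1, 0, 1],
    "man": [0, 1, 0],
    "woman": [0, 0, 1],
    "apple": [-1, 0, 0],
}

neighbors = nearest_neighbors(toy_vectors, toy_vectors["king"], k=2, exclude={"king"})
answers = analogy(toy_vectors, "king", "man", "woman", k=1)
print([word for _, word in neighbors])  # ['man', 'queen']
print(answers[0][1])                    # queen
```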
Text Classification

Predict labels for text, evaluate model performance, and access detailed classification metrics.

def predict(text, k=1, threshold=0.0):
    """
    Predict labels for input text.
    
    Args:
        text (str): Input text to classify
        k (int): Number of top predictions to return
        threshold (float): Minimum prediction confidence
    
    Returns:
        Tuple of (labels, probabilities)
    """

def test(path, k=1, threshold=0.0):
    """
    Evaluate model on test data.
    
    Returns:
        Tuple of (sample_count, precision, recall)
    """

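To show how the precision and recall figures relate to `k`, the sketch below computes both from lists of gold and predicted labels. It assumes micro-averaging over all examples (correct predictions divided by total predictions, and by total gold labels). The helper `precision_recall_at_k` and the sample labels are hypothetical.

```python
def precision_recall_at_k(gold_labels, predicted_labels, k=1):
    """Return (sample_count, precision, recall) using the top-k predictions.

    Assumption: counts are pooled over all examples (micro-averaging).
    """
    n_correct = 0
    n_predicted = 0
    n_gold = 0
    for gold, predicted in zip(gold_labels, predicted_labels):
        top_k = predicted[:k]
        n_correct += len(set(gold) & set(top_k))
        n_predicted += len(top_k)
        n_gold += len(gold)
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return len(gold_labels), precision, recall


# Hypothetical data: predictions are ordered from most to least probable.
gold = [["sports"], ["politics", "economy"], ["tech"]]
predicted = [["sports", "tech"], ["economy", "sports"], ["sports", "politics"]]

at_1 = precision_recall_at_k(gold, predicted, k=1)
at_2 = precision_recall_at_k(gold, predicted, k=2)
```

In this toy data, raising `k` from 1 to 2 adds predictions without adding correct ones, so precision drops while recall stays the same.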
Utility Functions

Helper functions for text processing, model manipulation, and downloading pre-trained models.

def tokenize(text):
    """Tokenize text into list of tokens."""

def quantize(**kwargs):
    """Quantize model to reduce memory usage."""

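# Utility module functions (signatures; lang_id, model, and target_dim are placeholders)
import fasttext.util
fasttext.util.download_model(lang_id, if_exists='strict')  # lang_id: language code such as 'en'
fasttext.util.reduce_model(model, target_dim)              # shrink vectors to target_dim dimensions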

Constants and Enums

# Model type enums (from C++ bindings via fasttext_pybind)
import fasttext
model_name = fasttext.model_name  # Enum with values: cbow, skipgram, supervised
loss_name = fasttext.loss_name    # Enum with values: hs, ns, softmax, ova

# Special tokens used in text processing
EOS = "</s>"       # End of sentence token - marks sentence boundaries
BOW = "<"          # Beginning of word token - used in subword processing
EOW = ">"          # End of word token - used in subword processing

# Deprecated functions (raise exceptions with migration guidance)
cbow = fasttext.cbow              # Raises exception, use train_unsupervised(model='cbow')
skipgram = fasttext.skipgram      # Raises exception, use train_unsupervised(model='skipgram') 
supervised = fasttext.supervised  # Raises exception, use train_supervised()

Error Handling

Several FastText methods, such as predict, accept an on_unicode_error parameter that controls how invalid Unicode is handled:

  • 'strict' (default): Raise exception on Unicode errors
  • 'ignore': Skip invalid Unicode characters
  • 'replace': Replace invalid Unicode with placeholder
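These three names match Python's standard error handlers for decoding bytes. The sketch below uses plain `bytes.decode` rather than fastText itself, on the assumption that `on_unicode_error` behaves analogously. The sample bytes and the helper `decode_with` are hypothetical.

```python
raw = b"caf\xc3"  # ends with a truncated UTF-8 sequence, so it is invalid


def decode_with(policy):
    """Decode `raw` as UTF-8 using one of Python's error-handling policies."""
    return raw.decode("utf-8", errors=policy)


ignored = decode_with("ignore")    # the invalid byte is dropped
replaced = decode_with("replace")  # the invalid byte becomes U+FFFD

try:
    decode_with("strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

With `'strict'`, callers should be prepared to catch the exception. The other two policies never raise but may silently alter the text.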
