tessl/pypi-fasttext

FastText library for efficient learning of word representations and sentence classification

FastText

Name: tessl/pypi-fasttext
Author: tessl

FastText is a library for efficient learning of word representations and sentence classification developed by Facebook Research. The Python bindings provide comprehensive access to FastText's C++ core, enabling unsupervised word representation learning, supervised text classification, and subword information processing.

Package Information

Package Name: fasttext
Language: Python (with C++ core)
Installation: pip install fasttext

Core Imports

import fasttext

Main functions and model class:

from fasttext import train_supervised, train_unsupervised, load_model, tokenize

Basic Usage

Training a Word Embedding Model

import fasttext

# Train an unsupervised model on a text file
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# Get word vector
word_vector = model.get_word_vector('king')

# Find similar words
neighbors = model.get_nearest_neighbors('king')
print(neighbors)

Training a Text Classification Model

import fasttext

# Train supervised classifier
model = fasttext.train_supervised('train.txt')

# Predict labels for text
predictions = model.predict('This is a sample text')
print(predictions)

# Evaluate on test data
results = model.test('test.txt')
print(f"P@1: {results[1]}, R@1: {results[2]}")
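The supervised trainer reads one example per line, with each label marked by a prefix (`__label__` by default). Below is a minimal sketch of preparing such lines with the standard library only; the helper name `format_example` and the sample data are hypothetical and not part of fastText.

```python
def format_example(labels, text, prefix="__label__"):
    """Build one training line: prefixed labels followed by the text."""
    # Collapse whitespace so a newline inside the text cannot split the example.
    clean_text = " ".join(text.split())
    tagged = " ".join(f"{prefix}{label}" for label in labels)
    return f"{tagged} {clean_text}"


# Hypothetical samples: (labels, text) pairs.
samples = [
    (["positive"], "I loved this movie"),
    (["negative", "boring"], "Too long\nand too slow"),
]
lines = [format_example(labels, text) for labels, text in samples]
print("\n".join(lines))
```

Writing these lines to a file such as train.txt gives input in the shape that train_supervised expects.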

Loading Pre-trained Models

import fasttext

# Load a pre-trained model
model = fasttext.load_model('model.bin')

# Get sentence vector
sentence_vector = model.get_sentence_vector('Hello world')

Architecture

FastText combines several key innovations:

Subword Information: Handles out-of-vocabulary words by learning representations for character n-grams
Hierarchical Softmax: Efficient training for large vocabularies
Word Embedding Models: CBOW (continuous bag-of-words) and skip-gram architectures for unsupervised learning
Fast Text Classification: Linear classifiers with efficient training and inference

The Python bindings expose the complete C++ API through pybind11, providing both high-level training functions and low-level model manipulation capabilities.
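To make the subword idea concrete, the sketch below enumerates the character n-grams of a word after wrapping it in the `<` and `>` boundary markers. It is an illustration only: the function name `char_ngrams` is hypothetical, the 3 to 6 size range is an assumed setting, and the library itself additionally hashes n-grams into buckets in its C++ core.

```python
def char_ngrams(word, minn=3, maxn=6, bow="<", eow=">"):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"{bow}{word}{eow}"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of size n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word still shares n-grams with known words, a vector can be composed for it from the n-gram representations.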

Capabilities

Model Training

Core training functions for both supervised classification and unsupervised word embeddings with extensive hyperparameter control.

def train_supervised(input, **kwargs):
    """
    Train a supervised classification model.
    
    Args:
        input (str): Path to training file
        **kwargs: Training parameters (lr, dim, epoch, etc.)
    
    Returns:
        FastText model object
    """

def train_unsupervised(input, **kwargs):
    """
    Train an unsupervised word embedding model.
    
    Args:
        input (str): Path to training file
        **kwargs: Training parameters (model, lr, dim, etc.)
    
    Returns:
        FastText model object
    """

def load_model(path):
    """
    Load a pre-trained FastText model.
    
    Args:
        path (str): Path to model file
    
    Returns:
        FastText model object
    """

Word Vector Operations

Access and manipulate word vectors, find similar words, and perform vector arithmetic operations.

def get_word_vector(word):
    """Get vector representation of a word."""

def get_sentence_vector(text):
    """Get vector representation of a sentence."""

def get_nearest_neighbors(word, k=10):
    """Find k nearest neighbors of a word."""

def get_analogies(wordA, wordB, wordC, k=10):
    """Find analogies of the form A:B::C:?"""

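Nearest-neighbor lookup ranks words by the similarity of their vectors. The sketch below shows the idea with cosine similarity over a tiny hand-written vocabulary; the vectors, the `cosine` and `nearest` helpers, and the (score, word) result shape are illustrative assumptions, not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, vectors, k=2):
    """Return the k most similar words to `query` as (score, word) pairs."""
    scored = [
        (cosine(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scored, reverse=True)[:k]


# Hypothetical 3-dimensional vectors; real models use e.g. 100 to 300 dimensions.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}
print(nearest("king", toy_vectors))
```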
Text Classification

Predict labels for text, evaluate model performance, and access detailed classification metrics.

def predict(text, k=1, threshold=0.0):
    """
    Predict labels for input text.
    
    Args:
        text (str): Input text to classify
        k (int): Number of top predictions to return
        threshold (float): Minimum prediction confidence
    
    Returns:
        Tuple of (labels, probabilities)
    """

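The labels returned by predict keep their training prefix. A small post-processing sketch, assuming the default `__label__` prefix and a hard-coded sample tuple standing in for a real prediction; `clean_predictions` is a hypothetical helper.

```python
def clean_predictions(prediction, prefix="__label__"):
    """Turn a (labels, probabilities) pair into [(label, probability), ...]."""
    labels, probabilities = prediction
    cleaned = []
    for label, prob in zip(labels, probabilities):
        # Drop the prefix only when it is actually present.
        name = label[len(prefix):] if label.startswith(prefix) else label
        cleaned.append((name, float(prob)))
    return cleaned


# Stand-in for the result of a predict call with k=2 (values are made up).
sample = (("__label__positive", "__label__neutral"), [0.91, 0.07])
print(clean_predictions(sample))
# [('positive', 0.91), ('neutral', 0.07)]
```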
def test(path, k=1, threshold=0.0):
    """
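    Evaluate model on test data.
    
    Args:
        path (str): Path to test file
        k (int): Number of top predictions considered per example
        threshold (float): Minimum prediction confidence
    
    Returns:
        Tuple of (sample_count, precision at k, recall at k)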
    """

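To clarify what the precision and recall figures measure, the sketch below computes them by hand for a toy multi-label set: precision divides correct predictions by the number of predictions made, recall divides them by the number of gold labels. The function name and data are hypothetical, and this is an illustration rather than the library's evaluation code.

```python
def precision_recall(gold, predicted):
    """Return (example_count, precision, recall) for parallel label lists."""
    n_correct = sum(len(set(g) & set(p)) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return len(gold), precision, recall


# Three examples, one prediction each (k=1); the second example has two gold labels.
gold = [["sports"], ["politics", "economy"], ["tech"]]
predicted = [["sports"], ["economy"], ["sports"]]

n, p, r = precision_recall(gold, predicted)
f1 = 2 * p * r / (p + r) if (p + r) else 0.0
print(f"N={n} P@1={p:.3f} R@1={r:.3f} F1={f1:.3f}")
```

With one prediction per example, recall falls below precision whenever examples carry more than one gold label, as the second example does here.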
Utility Functions

Helper functions for text processing, model manipulation, and downloading pre-trained models.

def tokenize(text):
    """Tokenize text into list of tokens."""

def quantize(**kwargs):
    """Quantize model to reduce memory usage."""

# Utility module functions
import fasttext.util
fasttext.util.download_model(lang_id, if_exists='strict')
fasttext.util.reduce_model(model, target_dim)

Constants and Enums

# Model type enums (from C++ bindings via fasttext_pybind, exposed by the fasttext.FastText module)
import fasttext
model_name = fasttext.FastText.model_name  # Enum with values: cbow, skipgram, supervised
loss_name = fasttext.FastText.loss_name    # Enum with values: hs, ns, softmax, ova

# Special tokens used in text processing
EOS = "</s>"       # End of sentence token - marks sentence boundaries
BOW = "<"          # Beginning of word token - used in subword processing
EOW = ">"          # End of word token - used in subword processing

# Deprecated functions (calling them raises an exception with migration guidance)
cbow = fasttext.cbow              # Raises when called; use train_unsupervised(model='cbow')
skipgram = fasttext.skipgram      # Raises when called; use train_unsupervised(model='skipgram')
supervised = fasttext.supervised  # Raises when called; use train_supervised()

Error Handling

Several model methods, such as predict, accept an on_unicode_error parameter that controls how invalid UTF-8 in returned text is handled:

'strict' (default): Raise exception on Unicode errors
'ignore': Skip invalid Unicode characters
'replace': Replace invalid Unicode with placeholder
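These three values match the standard Python codec error handlers. The sketch below shows their effect on a byte string containing one invalid byte, using plain `bytes.decode`; it assumes the option behaves like Python's own decoding and does not call fastText. The helper `decode_with` is hypothetical.

```python
def decode_with(raw, mode):
    """Decode UTF-8 bytes with the given error mode; return None on failure."""
    try:
        return raw.decode("utf-8", errors=mode)
    except UnicodeDecodeError:
        return None


# Valid "café", then a lone 0xff byte that is never valid in UTF-8.
raw = b"caf\xc3\xa9 \xff ok"

for mode in ("strict", "ignore", "replace"):
    print(mode, repr(decode_with(raw, mode)))
```

Here 'strict' fails on the bad byte, 'ignore' drops it, and 'replace' substitutes the U+FFFD replacement character.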

Install with Tessl CLI

npx tessl i tessl/pypi-fasttext

Workspace: tessl
Visibility: Public
Created: 6 months ago
Last updated: about 1 month ago
Describes: pkg:pypi/fasttext@0.9.x
Publish Source: CLI