tessl/pypi-gensim

Python library for topic modelling, document indexing and similarity retrieval with large corpora

Incremental Embedding Lifecycle

Name: tessl/pypi-gensim
Rating: 0.78 (1 review)
Author: tessl

Build a small module that trains a lightweight word embedding model, persists it, reloads it in read-only form, and performs incremental updates without losing previously learned vectors.

Capabilities

Create and persist initial model

Training on a small list of tokenized sentences writes a checkpoint file to the requested path and returns the on-disk path actually used. @test
Reloading the saved checkpoint exposes cosine similarity for words that appeared in the initial corpus without retraining. @test

Memory-mapped inference

Loading the checkpoint in memory-mapped/read-only mode answers similarity queries while keeping the checkpoint file unmodified. @test

Incremental updates

Supplying new sentences that introduce unseen tokens updates the model and saves a new checkpoint; vectors for new tokens become available while previously learned tokens remain queryable. @test

Lifecycle logging

Each training, loading, and update operation appends a timestamped lifecycle entry retrievable via the API. @test
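
gensim 4.x models keep a `lifecycle_events` list of dictionaries, each carrying at least a `datetime` and an `event` key, which is one plausible source for these entries. The sketch below formats such a list most recent first; it assumes events are appended chronologically and uses a stand-in object rather than a real model so that it runs without gensim installed.

```python
from types import SimpleNamespace
from typing import Any, List


def lifecycle_log(model: Any) -> List[str]:
    # Models without the attribute (or with logging disabled) yield an empty log.
    events = getattr(model, "lifecycle_events", None) or []
    lines = [f"{event.get('datetime', '?')} {event.get('event', 'unknown')}"
             for event in events]
    # Events are assumed to be stored oldest first, so reverse for newest first.
    return list(reversed(lines))


# Stand-in for a model object; the event values here are made up for illustration.
fake_model = SimpleNamespace(lifecycle_events=[
    {"datetime": "2024-01-01T00:00:00", "event": "created"},
    {"datetime": "2024-01-02T00:00:00", "event": "loaded"},
])
log = lifecycle_log(fake_model)
```

With a real gensim model, custom entries for an update step could be recorded through `model.add_lifecycle_event("update", ...)` before calling this function.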

Implementation

@generates

API

from typing import Iterable, List, Any, Sequence

def train_checkpoint(sentences: Iterable[Sequence[str]], checkpoint_path: str, vector_size: int = 50, window: int = 5) -> str:
    """Train from scratch on tokenized sentences, persist a checkpoint, and return the path used."""

def load_for_inference(checkpoint_path: str, mmap: bool = True) -> Any:
    """Load a saved checkpoint for read-only inference; supports memory mapping when mmap is True."""

def update_with_sentences(model: Any, new_sentences: Iterable[Sequence[str]], checkpoint_path: str) -> str:
    """Incrementally update an existing model with additional sentences, persist a new checkpoint, and return its path."""

def similarity(model: Any, word_a: str, word_b: str) -> float:
    """Return cosine similarity for two tokens from the current model."""

def lifecycle_log(model: Any) -> List[str]:
    """Return lifecycle entries (most recent first) describing train/load/update steps with timestamps."""

Dependencies { .dependencies }

gensim { .dependency }

Provides persistence, incremental training, and lifecycle logging utilities for word embedding models.

Install with Tessl CLI

npx tessl i tessl/pypi-gensim