
tessl/pypi-gensim

tessl install tessl/pypi-gensim@4.3.0

Python library for topic modelling, document indexing and similarity retrieval with large corpora

Agent Success: 78% (agent success rate when using this tile)

Improvement: 1.03x (agent success rate improvement when using this tile compared to baseline)

Baseline: 76% (agent success rate without this tile)

evals/scenario-4/task.md

Embedding Retrieval Toolkit

Builds word embeddings from short sentences, exposes similarity queries, and derives sentence-level vectors for downstream retrieval.

Capabilities

Train embeddings from sentences

  • Given tokenized sentences and hyperparameters, calling train builds embeddings in which every token that occurs at least min_count times is in the vocabulary, every vector has length vector_size, and repeated runs with the same seed and corpus produce the same similarity orderings. @test

Word similarity lookup

  • After training on a small corpus about royalty plus distractors (e.g., [['king', 'queen'], ['king', 'prince'], ['queen', 'princess'], ['river', 'flow']]), most_similar("king", topn=3) returns queen with a positive similarity score and ranks it ahead of unrelated tokens such as river. @test

Sentence vector inference

  • Calling infer_sentence_vector on ["spicy", "taco"] yields a dense list of floating-point numbers whose length matches vector_size, and no entry is NaN or infinity. @test
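The task does not say how a sentence vector is derived. One common choice, sketched here as an assumption, is to average the vectors of the tokens that are in the vocabulary and to fall back to a zero vector when none are, which keeps every entry finite even for unseen words such as "spicy" or "taco". To stay library-free, this version takes the word vectors as an explicit dictionary, so its signature differs from the `infer_sentence_vector` method in the API; the toy vectors are invented for illustration.

```python
import math
from typing import Dict, List, Sequence


def infer_sentence_vector(
    sentence: Sequence[str],
    word_vectors: Dict[str, List[float]],
    vector_size: int,
) -> List[float]:
    """Average the vectors of in-vocabulary tokens (assumed strategy)."""
    known = [word_vectors[token] for token in sentence if token in word_vectors]
    if not known:
        # No known tokens: fall back to zeros so the result stays finite.
        return [0.0] * vector_size
    # zip(*known) walks the vectors dimension by dimension.
    return [sum(column) / len(known) for column in zip(*known)]


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {"spicy": [1.0, 0.0, 2.0], "taco": [3.0, 2.0, 0.0]}

averaged = infer_sentence_vector(["spicy", "taco"], toy_vectors, vector_size=3)
# averaged == [2.0, 1.0, 1.0]
unknown = infer_sentence_vector(["river"], toy_vectors, vector_size=3)
# unknown == [0.0, 0.0, 0.0]
all_finite = all(math.isfinite(value) for value in averaged + unknown)
```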

Sentence similarity comparison

  • When comparing two royalty-themed sentences (e.g., ["king", "and", "queen", "rule"] vs ["prince", "and", "princess", "rule"]) against an unrelated nature sentence (e.g., ["river", "rocks", "flow"]), sentence_similarity reports the related pair at least 0.2 higher than the unrelated pair. @test
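A natural way to compare two sentence vectors is cosine similarity, though the task does not name the metric, so treat this as an assumption. The sketch below is library-free and uses invented toy vectors; returning 0.0 when either vector has zero length (for example, a sentence with no known tokens) is a further assumption that avoids a division by zero.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as unrelated to everything.
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Toy sentence vectors, invented for illustration only.
royal_a = [1.0, 0.0]
royal_b = [3.0, 0.0]  # same direction as royal_a, different length
nature = [0.0, 2.0]   # orthogonal to both royalty vectors

related = cosine_similarity(royal_a, royal_b)
unrelated = cosine_similarity(royal_a, nature)
margin = related - unrelated
```

Because cosine similarity ignores vector length, `royal_a` and `royal_b` score as identical in direction while the orthogonal `nature` vector scores zero, so the margin comfortably exceeds the 0.2 threshold in this toy case. Real trained vectors will give less clear-cut values.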

Implementation

@generates

API

from typing import Iterable, List, Sequence, Tuple

class EmbeddingService:
    def __init__(self, vector_size: int = 50, window: int = 2, min_count: int = 1, seed: int = 42): ...
    def train(self, sentences: Iterable[Sequence[str]], epochs: int = 15) -> None: ...
    def most_similar(self, word: str, topn: int = 5) -> List[Tuple[str, float]]: ...
    def infer_sentence_vector(self, sentence: Sequence[str]) -> List[float]: ...
    def sentence_similarity(self, sentence_a: Sequence[str], sentence_b: Sequence[str]) -> float: ...

Dependencies { .dependencies }

gensim { .dependency }

Provides tools for training and querying word and document embeddings.

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/gensim@4.3.x