
tessl/pypi-gensim

tessl install tessl/pypi-gensim@4.3.0

Python library for topic modelling, document indexing and similarity retrieval with large corpora

Agent success rate when using this tile: 78%

Agent success rate without this tile (baseline): 76%

Improvement over baseline: 1.03x

evals/scenario-6/task.md

Weighted Corpus Toolkit

Build a small utility that learns weighting models over a tokenized corpus and exposes helpers to transform documents, rank corpus items, and optionally compress vectors.

Capabilities

Train and transform corpus

  • Fitting on a two-document corpus [["alpha", "beta", "beta"], ["beta", "gamma"]] builds the vocabulary and allows transforming any document into a weighted vector using either TF-IDF or log-entropy. Transforming ["alpha", "beta"] in TF-IDF mode yields a vector sorted by descending weight where the rarer token "alpha" weighs more than "beta", and unseen tokens are omitted. @test
  • Using the same corpus with log-entropy weighting and normalization enabled, transforming ["beta", "gamma"] produces a vector whose L2 norm is approximately 1.0 (within a tiny tolerance). @test
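
The two weighting behaviours above can be sketched in plain Python. This is a minimal illustration, not gensim's implementation: the helper names (`tfidf`, `log_entropy`, `l2_normalize`, `entropy_weight`) are hypothetical, the IDF variant `log2(N / df)` is an assumption, and the log-entropy global weight divides by `log(N)`, which some libraries replace with a slightly different denominator.

```python
import math
from collections import Counter

corpus = [["alpha", "beta", "beta"], ["beta", "gamma"]]

# Vocabulary: token -> integer id, in first-seen order.
vocab = {}
for doc in corpus:
    for tok in doc:
        vocab.setdefault(tok, len(vocab))

n_docs = len(corpus)
# Document frequency: number of documents containing each token.
doc_freq = Counter(tok for doc in corpus for tok in set(doc))
# Corpus-wide frequency of each token, needed by log-entropy.
glob_freq = Counter(tok for doc in corpus for tok in doc)


def l2_normalize(vec):
    norm = math.sqrt(sum(w * w for _, w in vec))
    return [(i, w / norm) for i, w in vec] if norm > 0 else vec


def tfidf(doc, normalize=True):
    counts = Counter(t for t in doc if t in vocab)  # unseen tokens are omitted
    # Assumed IDF variant: log2(N / df); a token in every document gets 0.
    vec = [(vocab[t], tf * math.log2(n_docs / doc_freq[t])) for t, tf in counts.items()]
    vec = l2_normalize(vec) if normalize else vec
    return sorted(vec, key=lambda p: -p[1])


def entropy_weight(tok):
    # Global weight: 1 + sum_j p_j * log(p_j) / log(N), with p_j = tf_j / corpus freq.
    # Assumes N > 1, otherwise log(N) is zero.
    total = sum(
        (doc.count(tok) / glob_freq[tok]) * math.log(doc.count(tok) / glob_freq[tok])
        for doc in corpus if tok in doc
    )
    return 1.0 + total / math.log(n_docs)


def log_entropy(doc, normalize=True):
    counts = Counter(t for t in doc if t in vocab)
    vec = [(vocab[t], math.log(1 + tf) * entropy_weight(t)) for t, tf in counts.items()]
    vec = l2_normalize(vec) if normalize else vec
    return sorted(vec, key=lambda p: -p[1])
```

In this sketch "beta" occurs in every document, so its TF-IDF weight is exactly 0 and "alpha" carries all the weight. The sketch keeps the zero entry; some implementations filter zero weights out of the vector, which affects how many entries a transformed document has.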

BM25 ranking

  • After fitting on [["apple", "apple", "banana"], ["apple", "banana", "banana"]], evaluating the query ["apple", "apple"] returns document indices with scores ordered from most to least relevant, and the first document (with two "apple" tokens) ranks above the second. @test
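
The ranking behaviour can be sketched with a hand-rolled BM25 scorer. This is an illustration under stated assumptions, not the gensim API: `bm25_rank` and `idf` are hypothetical names, `K1 = 1.5` and `B = 0.75` are common defaults chosen here, and the IDF uses the "+1 inside the log" variant. That variant is used because both terms occur in both documents, and the classic `log((N - n + 0.5) / (n + 0.5))` form would be negative on this corpus and invert the ordering.

```python
import math
from collections import Counter

docs = [["apple", "apple", "banana"], ["apple", "banana", "banana"]]
K1, B = 1.5, 0.75  # assumed parameters, not taken from the task description

n_docs = len(docs)
avg_len = sum(len(d) for d in docs) / n_docs
doc_freq = Counter(tok for d in docs for tok in set(d))
term_counts = [Counter(d) for d in docs]


def idf(tok):
    # Non-negative IDF variant: log(1 + (N - n + 0.5) / (n + 0.5)).
    n = doc_freq[tok]
    return math.log(1.0 + (n_docs - n + 0.5) / (n + 0.5))


def bm25_rank(query):
    scores = []
    for idx, counts in enumerate(term_counts):
        # Length normalisation term shared by every query token for this document.
        length_norm = K1 * (1.0 - B + B * len(docs[idx]) / avg_len)
        score = 0.0
        for tok in query:  # repeated query tokens contribute once per occurrence
            tf = counts.get(tok, 0)
            if tf:
                score += idf(tok) * tf * (K1 + 1.0) / (tf + length_norm)
        scores.append((idx, score))
    # Most relevant first.
    return sorted(scores, key=lambda p: p[1], reverse=True)


ranked = bm25_rank(["apple", "apple"])
```

Both documents have the same length here, so the ordering is decided purely by term frequency: document 0 contains "apple" twice and scores higher than document 1.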

Random projection

  • If projection_dim is set to 2, projecting the TF-IDF vector for ["alpha", "gamma"] produces a dense vector of length 2 whose values remain identical across repeated calls with the same input. @test
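
Repeatable output from a random projection comes from fixing the projection matrix once, for example by seeding the generator at fit time. The sketch below is plain Python rather than a gensim call; `make_projection`, `project`, the seed value, and the sparse weights are all hypothetical and chosen only for illustration.

```python
import math
import random


def make_projection(vocab_size, dim, seed=0):
    # Seeded generator so the same matrix is rebuilt on every run.
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    # Rows are output dimensions; entries are +/- 1/sqrt(dim).
    return [[rng.choice((-1.0, 1.0)) * scale for _ in range(vocab_size)]
            for _ in range(dim)]


def project(sparse_vec, matrix):
    # Multiply the (id, weight) sparse vector by the dense projection matrix.
    return [sum(row[i] * w for i, w in sparse_vec) for row in matrix]


matrix = make_projection(vocab_size=3, dim=2, seed=42)
# Illustrative weights for ids 0 ("alpha") and 2 ("gamma"); not computed values.
sparse = [(0, 0.6), (2, 0.8)]
first = project(sparse, matrix)
second = project(sparse, matrix)
```

Because the matrix is built once and reused, `first` and `second` are identical. A toolkit that rebuilt the matrix on every call without a fixed seed would not meet the repeatability requirement.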

Top term inspection

  • Calling top_terms on the TF-IDF vector for ["alpha", "beta"] with limit=2 returns two (token, weight) pairs ordered by descending weight, so the highest-weighted tokens of any transformed document can be surfaced. @test
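
Term inspection amounts to sorting a sparse vector by weight and mapping ids back to tokens. A minimal sketch follows, assuming a hypothetical `id2token` mapping and made-up weights. The standalone function here takes an already-transformed vector, unlike the method in the API below, which takes a document.

```python
# Hypothetical id -> token mapping for the three-token vocabulary.
id2token = {0: "alpha", 1: "beta", 2: "gamma"}


def top_terms(sparse_vec, id2token, limit=5):
    # Highest weight first, then keep at most `limit` entries.
    ordered = sorted(sparse_vec, key=lambda p: p[1], reverse=True)
    return [(id2token[i], w) for i, w in ordered[:limit]]


# Made-up weights, used only to show the ordering and the cut-off.
pairs = top_terms([(1, 0.2), (0, 0.9), (2, 0.4)], id2token, limit=2)
# pairs == [("alpha", 0.9), ("gamma", 0.4)]
```

The slice returns at most `limit` pairs, so a vector with fewer entries than `limit` yields fewer pairs.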

Implementation

@generates

API

from typing import Iterable, List, Tuple, Optional, Union

class WeightedCorpusToolkit:
    def __init__(self, weighting: str = "tfidf", projection_dim: Optional[int] = None, normalize: bool = True): ...

    def fit(self, documents: Iterable[Iterable[str]]) -> None: ...

    def transform(
        self,
        document: Iterable[str],
        weighting: Optional[str] = None,
        project: bool = False
    ) -> Union[List[Tuple[int, float]], List[float]]: ...

    def rank(self, query: Iterable[str]) -> List[Tuple[int, float]]: ...

    def top_terms(
        self,
        document: Iterable[str],
        weighting: Optional[str] = None,
        limit: int = 5
    ) -> List[Tuple[str, float]]: ...

Dependencies { .dependencies }

gensim { .dependency }

Provides vector-space weighting and transforms over bag-of-words corpora.

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/gensim@4.3.x