tessl/pypi-gensim

tessl install tessl/pypi-gensim@4.3.0

Python library for topic modelling, document indexing and similarity retrieval with large corpora

Agent success rate when using this tile: 78%

Improvement over baseline: 1.03x

Baseline (agent success rate without this tile): 76%

Weighted Corpus Toolkit

Build a small utility that learns weighting models over a tokenized corpus and exposes helpers to transform documents, rank corpus items, and optionally compress vectors.

Capabilities

Train and transform corpus

Fitting on the two-document corpus [["alpha", "beta", "beta"], ["beta", "gamma"]] builds the vocabulary, after which any document can be transformed into a weighted vector using either TF-IDF or log-entropy weighting. Transforming ["alpha", "beta"] in TF-IDF mode yields a vector sorted by descending weight in which the rarer token "alpha" outweighs "beta"; tokens not seen during fitting are omitted. @test
Using the same corpus with log-entropy weighting and normalization enabled, transforming ["beta", "gamma"] produces a vector whose L2 norm is approximately 1.0 (within a tiny tolerance). @test
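The two weighting schemes above can be sketched in plain Python to make the expected numbers concrete. This is only an illustration under stated assumptions, not the gensim-backed implementation the task asks for: the helper names (`fit_stats`, `tfidf`, `log_entropy`, `l2_normalize`) are hypothetical, vectors are keyed by token string rather than token id, and the formulas are the textbook forms (unsmoothed `log2(N / df)` for IDF, `log(N)` as the entropy denominator). gensim's `TfidfModel` and `LogEntropyModel` may differ in details such as smoothing or dropping zero weights.

```python
import math
from collections import Counter


def l2_normalize(weights):
    """Scale weights to unit L2 norm and sort by descending weight."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    scaled = {tok: (w / norm if norm else 0.0) for tok, w in weights.items()}
    return sorted(scaled.items(), key=lambda pair: (-pair[1], pair[0]))


def fit_stats(documents):
    """Return per-document term counts and document frequencies."""
    docs = [Counter(doc) for doc in documents]
    df = Counter()
    for counts in docs:
        df.update(counts.keys())  # each document counts a token once
    return docs, df


def tfidf(document, docs, df):
    """Textbook TF-IDF: tf * log2(N / df), then L2-normalized."""
    n_docs = len(docs)
    counts = Counter(tok for tok in document if tok in df)  # drop unseen tokens
    weights = {tok: tf * math.log2(n_docs / df[tok]) for tok, tf in counts.items()}
    return l2_normalize(weights)


def log_entropy(document, docs, df):
    """Textbook log-entropy: log(1 + tf) * (1 + sum_j p_j * log(p_j) / log(N)).

    Assumes the fitted corpus has at least two documents, so log(N) > 0.
    """
    n_docs = len(docs)
    counts = Counter(tok for tok in document if tok in df)  # drop unseen tokens
    weights = {}
    for tok, tf in counts.items():
        global_freq = sum(d[tok] for d in docs)
        entropy = sum(
            (d[tok] / global_freq) * math.log(d[tok] / global_freq)
            for d in docs
            if d[tok] > 0
        )
        weights[tok] = math.log(1 + tf) * (1 + entropy / math.log(n_docs))
    return l2_normalize(weights)


docs, df = fit_stats([["alpha", "beta", "beta"], ["beta", "gamma"]])
tfidf_vec = tfidf(["alpha", "beta", "delta"], docs, df)  # "delta" is unseen
le_vec = log_entropy(["beta", "gamma"], docs, df)
```

One detail worth noticing: because "beta" occurs in both training documents, its unsmoothed IDF is `log2(2 / 2) = 0`, so "alpha" outranks it by the widest possible margin. An implementation that drops zero-weight entries would return only "alpha" for this input.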

BM25 ranking

After fitting on [["apple", "apple", "banana"], ["apple", "banana", "banana"]], evaluating the query ["apple", "apple"] returns document indices with scores ordered from most to least relevant, and the first document (with two "apple" tokens) ranks above the second. @test
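The ranking behaviour described above can be illustrated with a small plain-Python BM25 scorer. This is a sketch under assumptions, not the required implementation: the function name `bm25_rank` and the parameters `k1=1.5` and `b=0.75` are illustrative choices, and the IDF uses the non-negative variant `log((N - df + 0.5) / (df + 0.5) + 1)`. That choice matters for this corpus: "apple" appears in every document, so the classic Okapi IDF would be negative and would invert the expected order. gensim provides BM25 models of its own (for example `OkapiBM25Model`), whose exact formula may differ.

```python
import math
from collections import Counter


def bm25_rank(documents, query, k1=1.5, b=0.75):
    """Score every document against the query and sort best-first."""
    docs = [Counter(doc) for doc in documents]
    n_docs = len(docs)
    lengths = [sum(counts.values()) for counts in docs]
    avg_len = sum(lengths) / n_docs

    # Document frequency: number of documents containing each token.
    df = Counter()
    for counts in docs:
        df.update(counts.keys())

    scores = []
    for idx, counts in enumerate(docs):
        score = 0.0
        for tok in query:  # repeated query tokens contribute repeatedly
            tf = counts[tok]
            if tf == 0:
                continue
            idf = math.log((n_docs - df[tok] + 0.5) / (df[tok] + 0.5) + 1)
            length_norm = 1 - b + b * lengths[idx] / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append((idx, score))

    # Highest score first; ties broken by document index for determinism.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


ranking = bm25_rank(
    [["apple", "apple", "banana"], ["apple", "banana", "banana"]],
    ["apple", "apple"],
)
```

Both documents have the same length here, so the length normalization cancels out and the ordering is driven purely by the term frequency of "apple".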

Random projection

If projection_dim is set to 2, projecting the TF-IDF vector for ["alpha", "gamma"] produces a dense vector of length 2 whose values remain identical across repeated calls with the same input. @test
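The determinism requirement above can be met by building the projection matrix once from a seeded generator and reusing it for every call. The following plain-Python sketch shows that idea. The names `make_projection` and `project`, the seed value, the sign-based matrix entries, and the token ids (alpha=0, beta=1, gamma=2) are all assumptions for illustration. A gensim-based solution would more likely wrap `RpModel`, whose internals are not reproduced here.

```python
import math
import random


def make_projection(vocab_size, projection_dim, seed=0):
    """Build a fixed random sign matrix of shape (projection_dim, vocab_size)."""
    rng = random.Random(seed)  # private, seeded generator keeps results repeatable
    scale = 1.0 / math.sqrt(projection_dim)
    return [
        [rng.choice((-scale, scale)) for _ in range(vocab_size)]
        for _ in range(projection_dim)
    ]


def project(sparse_vector, matrix):
    """Map a sparse [(token_id, weight), ...] vector to a dense list of floats."""
    return [sum(row[i] * w for i, w in sparse_vector) for row in matrix]


matrix = make_projection(vocab_size=3, projection_dim=2, seed=0)

# L2-normalized TF-IDF weights for ["alpha", "gamma"] under unsmoothed log2 IDF,
# with hypothetical ids alpha=0 and gamma=2.
sparse = [(0, math.sqrt(0.5)), (2, math.sqrt(0.5))]

first = project(sparse, matrix)
second = project(sparse, matrix)
```

Because the matrix is created once and never re-sampled, repeated projections of the same input are bit-for-bit identical. Re-creating the matrix with the same seed yields the same values as well.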

Top term inspection

Calling top_terms on the TF-IDF vector for ["alpha", "beta"] with limit=2 returns two (token, weight) pairs ordered by weight, allowing the highest-weighted tokens to be surfaced for any transformed document. @test
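As a rough sketch of the lookup step behind this capability, the function below takes an already-transformed sparse vector and an id-to-token mapping, then returns the heaviest terms. It differs from the spec's `top_terms`, which accepts a raw document. The helper name `top_terms_from_vector`, the mapping, and the example weights are hypothetical.

```python
def top_terms_from_vector(vector, id2token, limit=5):
    """Return up to `limit` (token, weight) pairs, heaviest first."""
    # Sort by descending weight; break ties by token id for a stable order.
    ranked = sorted(vector, key=lambda pair: (-pair[1], pair[0]))
    return [(id2token[token_id], weight) for token_id, weight in ranked[:limit]]


id2token = {0: "alpha", 1: "beta", 2: "gamma"}  # hypothetical vocabulary
vector = [(1, 0.2), (0, 0.9)]                    # illustrative weights only
pairs = top_terms_from_vector(vector, id2token, limit=2)
```

With these illustrative weights, `pairs` is `[("alpha", 0.9), ("beta", 0.2)]`.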

Implementation

@generates

API

from typing import Iterable, List, Tuple, Optional, Union

class WeightedCorpusToolkit:
    def __init__(self, weighting: str = "tfidf", projection_dim: Optional[int] = None, normalize: bool = True): ...

    def fit(self, documents: Iterable[Iterable[str]]) -> None: ...

    def transform(
        self,
        document: Iterable[str],
        weighting: Optional[str] = None,
        project: bool = False
    ) -> Union[List[Tuple[int, float]], List[float]]: ...

    def rank(self, query: Iterable[str]) -> List[Tuple[int, float]]: ...

    def top_terms(
        self,
        document: Iterable[str],
        weighting: Optional[str] = None,
        limit: int = 5
    ) -> List[Tuple[str, float]]: ...

Dependencies { .dependencies }

gensim { .dependency }

Provides vector-space weighting and transforms over bag-of-words corpora.

tessl/pypi-gensim
