tessl install tessl/pypi-gensim@4.3.0
Python library for topic modelling, document indexing and similarity retrieval with large corpora
Agent Success
Agent success rate when using this tile
78%
Improvement
Agent success rate improvement when using this tile compared to baseline
1.03x
Baseline
Agent success rate without this tile
76%
Build a small utility that learns weighting models over a tokenized corpus and exposes helpers to transform documents, rank corpus items, and optionally compress vectors.
Fitting on [["alpha", "beta", "beta"], ["beta", "gamma"]] builds the vocabulary and allows transforming any document into a weighted vector using either TF-IDF or log-entropy. Transforming ["alpha", "beta"] in TF-IDF mode yields a vector sorted by descending weight where the rarer token "alpha" weighs more than "beta", and unseen tokens are omitted. @test
With normalization enabled, transforming ["beta", "gamma"] produces a vector whose L2 norm is approximately 1.0 (within a tiny tolerance). @test
After fitting on [["apple", "apple", "banana"], ["apple", "banana", "banana"]], evaluating the query ["apple", "apple"] returns document indices with scores ordered from most to least relevant, and the first document (with two "apple" tokens) ranks above the second. @test
When projection_dim is set to 2, projecting the TF-IDF vector for ["alpha", "gamma"] produces a dense vector of length 2 whose values remain identical across repeated calls with the same input. @test
Calling top_terms on the TF-IDF vector for ["alpha", "beta"] with limit=2 returns two (token, weight) pairs ordered by weight, allowing the highest-weighted tokens to be surfaced for any transformed document. @test
@generates
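The TF-IDF requirements above can be checked by hand with a small standard-library sketch. It assumes the weighting tf * log2(N / df) followed by L2 normalization, which is understood to match the defaults of gensim's TfidfModel; the helper names fit_idf and tfidf are hypothetical, and the sketch keys weights by token string rather than by the integer ids the toolkit API returns.

```python
import math
from collections import Counter


def fit_idf(documents):
    """Return {token: log2(N / df)} for a tokenized corpus (hypothetical helper)."""
    docs = [list(doc) for doc in documents]
    # Document frequency: count each token at most once per document.
    df = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    return {token: math.log2(n_docs / count) for token, count in df.items()}


def tfidf(document, idf, normalize=True):
    """Weight a tokenized document; unseen and zero-weight tokens are omitted."""
    counts = Counter(token for token in document if token in idf)
    weights = {t: tf * idf[t] for t, tf in counts.items() if tf * idf[t] > 0.0}
    if normalize and weights:
        norm = math.sqrt(sum(w * w for w in weights.values()))
        weights = {t: w / norm for t, w in weights.items()}
    # Sort by descending weight, as the first requirement asks.
    return sorted(weights.items(), key=lambda item: -item[1])


idf = fit_idf([["alpha", "beta", "beta"], ["beta", "gamma"]])
vec_ab = tfidf(["alpha", "beta"], idf)
vec_bg = tfidf(["beta", "gamma"], idf)
```

Under this assumed formula "beta" occurs in every fitted document, so its weight is zero and it drops out of the vector. "alpha" therefore still outweighs "beta", and the normalized vector for ["beta", "gamma"] has unit L2 norm because only "gamma" contributes.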
from typing import Iterable, List, Tuple, Optional, Union

class WeightedCorpusToolkit:
    def __init__(self, weighting: str = "tfidf", projection_dim: Optional[int] = None, normalize: bool = True): ...
    def fit(self, documents: Iterable[Iterable[str]]) -> None: ...
    def transform(
        self,
        document: Iterable[str],
        weighting: Optional[str] = None,
        project: bool = False
    ) -> Union[List[Tuple[int, float]], List[float]]: ...
    def rank(self, query: Iterable[str]) -> List[Tuple[int, float]]: ...
    def top_terms(
        self,
        document: Iterable[str],
        weighting: Optional[str] = None,
        limit: int = 5
    ) -> List[Tuple[str, float]]: ...

Provides vector-space weighting and transforms over bag-of-words corpora.
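The ranking requirement is worth sketching separately, because in the apple/banana corpus every token appears in every document, so an unsmoothed log2(N / df) weight is zero for all of them and no ordering can emerge. The standard-library sketch below assumes a smoothed weight log2(1 + N / df) and cosine similarity between unit vectors. Both are illustrative choices rather than the toolkit's confirmed behaviour, and fit_smoothed_idf, unit_vector and the standalone rank function are hypothetical names.

```python
import math
from collections import Counter


def fit_smoothed_idf(documents):
    """Return {token: log2(1 + N / df)}; the smoothing is an assumption."""
    docs = [list(doc) for doc in documents]
    df = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    return {t: math.log2(1.0 + n_docs / count) for t, count in df.items()}


def unit_vector(document, idf):
    """Return an L2-normalized {token: weight} mapping, skipping unseen tokens."""
    counts = Counter(token for token in document if token in idf)
    weights = {t: tf * idf[t] for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def rank(query, documents):
    """Return (document index, cosine score) pairs, most relevant first."""
    docs = [list(doc) for doc in documents]
    idf = fit_smoothed_idf(docs)
    query_vec = unit_vector(query, idf)
    scores = []
    for index, doc in enumerate(docs):
        doc_vec = unit_vector(doc, idf)
        # Dot product of two unit vectors equals their cosine similarity.
        score = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
        scores.append((index, score))
    return sorted(scores, key=lambda pair: -pair[1])


corpus = [["apple", "apple", "banana"], ["apple", "banana", "banana"]]
ranking = rank(["apple", "apple"], corpus)
```

With these assumptions the first document, which contains two "apple" tokens, scores higher than the second, so index 0 is returned first.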