tessl install tessl/pypi-gensim@4.3.0
Python library for topic modelling, document indexing and similarity retrieval with large corpora
Agent Success: 78% (agent success rate when using this tile)
Improvement: 1.03x (agent success rate improvement when using this tile compared to baseline)
Baseline: 76% (agent success rate without this tile)
Builds and persists a bag-of-words corpus from tokenized documents, with filtering options and reloadable corpus streaming.
Given the documents [["graph", "graph", "tree"], ["root", "tree", "leaf"], ["graph", "root"]], creating the manager assigns contiguous token IDs in order of first appearance, and encode_corpus() returns [(0, 2), (1, 1)], [(2, 1), (1, 1), (3, 1)], and [(0, 1), (2, 1)] for the three documents. @test
After filter_tokens(min_docs=2, max_doc_proportion=0.8), the token that appears in only one document is removed from the mapping, and encode_corpus() drops it, so the first document still encodes as [(0, 2), (1, 1)] while the second encodes as [(2, 1), (1, 1)]. @test
After save() followed by load(), iterating the reloaded corpus yields the same encoded (token ID, count) pairs as before and preserves token IDs. @test
@generates
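The encoding and filtering behaviour described above can be checked by hand with a small pure-Python sketch. The helper names (`build_token_ids`, `encode`, `filter_token_ids`) are hypothetical and are not part of the tile's API. The sketch also assumes that filtering keeps the surviving tokens' original IDs, which the spec does not state explicitly (gensim's own `Dictionary.filter_extremes` compacts IDs by default). For this example both choices give the same result.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Bow = List[Tuple[int, int]]  # one document as (token_id, count) pairs


def build_token_ids(documents: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign contiguous IDs to tokens in order of first appearance."""
    token_ids: Dict[str, int] = {}
    for doc in documents:
        for token in doc:
            # setdefault only stores an ID the first time a token is seen
            token_ids.setdefault(token, len(token_ids))
    return token_ids


def encode(doc: Iterable[str], token_ids: Dict[str, int]) -> Bow:
    """Encode one document, skipping tokens that are not in the mapping."""
    counts = Counter(t for t in doc if t in token_ids)
    # Counter keeps first-insertion order, so pairs follow document order
    return [(token_ids[t], n) for t, n in counts.items()]


def filter_token_ids(
    documents: List[List[str]],
    token_ids: Dict[str, int],
    min_docs: int,
    max_doc_proportion: float,
) -> Dict[str, int]:
    """Keep tokens whose document frequency lies within the given bounds.

    Assumption: surviving tokens keep their original IDs.
    """
    doc_freq = Counter(t for doc in documents for t in set(doc))
    max_docs = max_doc_proportion * len(documents)
    return {
        t: i for t, i in token_ids.items()
        if min_docs <= doc_freq[t] <= max_docs
    }


docs = [["graph", "graph", "tree"], ["root", "tree", "leaf"], ["graph", "root"]]

token_ids = build_token_ids(docs)
corpus = [encode(d, token_ids) for d in docs]

# "leaf" occurs in only one document, so min_docs=2 removes it
filtered_ids = filter_token_ids(docs, token_ids, min_docs=2, max_doc_proportion=0.8)
filtered_corpus = [encode(d, filtered_ids) for d in docs]
```

With three documents, `max_doc_proportion=0.8` allows a token to appear in at most 2.4 documents, so only the lower bound removes anything here.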
from typing import Iterable, List, Tuple, Dict

class CorpusManager:
    def __init__(self, documents: Iterable[Iterable[str]]): ...
    def encode_corpus(self) -> List[List[Tuple[int, int]]]: ...
    def filter_tokens(self, min_docs: int, max_doc_proportion: float) -> None: ...
    def save(self, directory: str) -> None: ...
    @classmethod
    def load(cls, directory: str) -> "CorpusManager": ...
    @property
    def token_ids(self) -> Dict[str, int]: ...

Provides streaming vocabulary management and sparse corpus serialization utilities. @satisfied-by
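The spec leaves the on-disk layout behind save() and load() open. As a minimal sketch, assuming a JSON vocabulary file plus a JSON-lines corpus file, the round trip could look like the following. The file names and the helper functions (`save_corpus`, `load_token_ids`, `stream_corpus`) are hypothetical. A real implementation of the tile would more likely delegate to gensim's own serializers, such as `Dictionary.save` and `MmCorpus.serialize`.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, Iterator, List, Tuple

Bow = List[Tuple[int, int]]  # one document as (token_id, count) pairs

VOCAB_FILE = "vocab.json"     # hypothetical file name
CORPUS_FILE = "corpus.jsonl"  # hypothetical file name


def save_corpus(directory: str, token_ids: Dict[str, int], corpus: Iterable[Bow]) -> None:
    """Write the token mapping and one encoded document per line."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, VOCAB_FILE), "w", encoding="utf-8") as f:
        json.dump(token_ids, f)
    with open(os.path.join(directory, CORPUS_FILE), "w", encoding="utf-8") as f:
        for bow in corpus:
            f.write(json.dumps(bow) + "\n")


def load_token_ids(directory: str) -> Dict[str, int]:
    """Read the token mapping back with its original IDs."""
    with open(os.path.join(directory, VOCAB_FILE), encoding="utf-8") as f:
        return json.load(f)


def stream_corpus(directory: str) -> Iterator[Bow]:
    """Yield encoded documents one at a time without loading the whole file."""
    with open(os.path.join(directory, CORPUS_FILE), encoding="utf-8") as f:
        for line in f:
            # JSON stores pairs as lists, so convert them back to tuples
            yield [(token_id, count) for token_id, count in json.loads(line)]


token_ids = {"graph": 0, "tree": 1, "root": 2, "leaf": 3}
corpus = [[(0, 2), (1, 1)], [(2, 1), (1, 1), (3, 1)], [(0, 1), (2, 1)]]

with tempfile.TemporaryDirectory() as tmp:
    save_corpus(tmp, token_ids, corpus)
    reloaded_ids = load_token_ids(tmp)
    reloaded_corpus = list(stream_corpus(tmp))
```

Writing one document per line lets the reloaded corpus be iterated lazily, which is in the spirit of the "reloadable corpus streaming" requirement even for corpora that do not fit in memory.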