tessl/pypi-gensim

tessl install tessl/pypi-gensim@4.3.0

Python library for topic modelling, document indexing and similarity retrieval with large corpora

Agent Success

Agent success rate when using this tile

78%

Improvement

Agent success rate improvement when using this tile compared to baseline

1.03x

Baseline

Agent success rate without this tile

76%

evals/scenario-8/task.md

Bag-of-Words Corpus Manager

Builds and persists a bag-of-words corpus from tokenized documents, with filtering options and reloadable corpus streaming.

Capabilities

Build vocabulary and encode corpus

  • Given documents [["graph", "graph", "tree"], ["root", "tree", "leaf"], ["graph", "root"]], creating the manager assigns contiguous token IDs in order of first appearance (graph=0, tree=1, root=2, leaf=3), and encode_corpus() returns [(0, 2), (1, 1)], [(2, 1), (1, 1), (3, 1)], and [(0, 1), (2, 1)] for the three documents, respectively @test
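The ID assignment and encoding described above can be sketched in plain Python. This is a minimal illustration rather than the reference implementation: the helper names `build_token_ids` and `encode_document` are hypothetical, and the sketch deliberately avoids gensim so the first-appearance ordering is explicit.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_token_ids(documents: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign contiguous IDs to tokens in order of first appearance (hypothetical helper)."""
    token_ids: Dict[str, int] = {}
    for doc in documents:
        for token in doc:
            if token not in token_ids:
                # The next free ID equals the current vocabulary size.
                token_ids[token] = len(token_ids)
    return token_ids


def encode_document(doc: Iterable[str], token_ids: Dict[str, int]) -> List[Tuple[int, int]]:
    """Return (token_id, count) pairs, ordered by first appearance within the document."""
    # Counter keeps insertion order, so pairs follow the order tokens are first seen.
    counts = Counter(token for token in doc if token in token_ids)
    return [(token_ids[token], count) for token, count in counts.items()]


documents = [["graph", "graph", "tree"], ["root", "tree", "leaf"], ["graph", "root"]]
token_ids = build_token_ids(documents)
encoded = [encode_document(doc, token_ids) for doc in documents]
print(token_ids)
print(encoded)
```

Relying on `Counter` insertion order keeps the pairs in the per-document order the example expects; a library that sorts pairs by ID would produce a different ordering for the second document.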

Filter tokens by document frequency

  • After building the same corpus and applying filter_tokens(min_docs=2, max_doc_proportion=0.8), the one token that appears in only a single document ("leaf", ID 3) is removed from the mapping and the remaining token IDs are unchanged. encode_corpus() then omits it: the first document still encodes as [(0, 2), (1, 1)], while the second encodes as [(2, 1), (1, 1)] @test
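One way to read the document-frequency filter is sketched below. It assumes both bounds are inclusive (a token is kept when `min_docs <= df <= max_doc_proportion * num_docs`), which the task text does not spell out. The helper names are hypothetical, and the token map is hard-coded from the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

documents = [["graph", "graph", "tree"], ["root", "tree", "leaf"], ["graph", "root"]]
token_ids = {"graph": 0, "tree": 1, "root": 2, "leaf": 3}


def filter_token_ids(
    token_ids: Dict[str, int],
    documents: List[List[str]],
    min_docs: int,
    max_doc_proportion: float,
) -> Dict[str, int]:
    """Keep tokens whose document frequency lies within the (assumed inclusive) bounds."""
    doc_freq: Counter = Counter()
    for doc in documents:
        # set() ensures each token counts once per document.
        doc_freq.update(set(doc))
    max_docs = max_doc_proportion * len(documents)
    # Surviving tokens keep their original IDs; nothing is renumbered.
    return {t: i for t, i in token_ids.items() if min_docs <= doc_freq[t] <= max_docs}


def encode_document(doc: List[str], token_ids: Dict[str, int]) -> List[Tuple[int, int]]:
    """Encode a document, silently dropping tokens no longer in the mapping."""
    counts = Counter(token for token in doc if token in token_ids)
    return [(token_ids[token], count) for token, count in counts.items()]


kept = filter_token_ids(token_ids, documents, min_docs=2, max_doc_proportion=0.8)
encoded = [encode_document(doc, kept) for doc in documents]
print(kept)
print(encoded)
```

In this example every remaining token appears in two of three documents (about 0.67), so the upper bound of 0.8 removes nothing and only "leaf" is dropped.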

Persist and reload corpus

  • When saving the filtered vocabulary and encoded corpus to a directory (using a human-readable token map and a matrix-style sparse corpus file) and reloading via load(), iterating the reloaded corpus yields the same encoded (token_id, count) pairs as before, and token IDs are preserved @test
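The round trip could look roughly like the following sketch. It is written under assumptions: the file names `tokens.tsv` and `corpus.mm`, the tab-separated token map layout, and the functions `save_corpus` and `load_corpus` are all hypothetical, and the corpus file is a hand-written Matrix Market coordinate file, not gensim's own serializer.

```python
import os
import tempfile
from typing import Dict, List, Tuple

Bow = List[Tuple[int, int]]


def save_corpus(directory: str, token_ids: Dict[str, int], corpus: List[Bow]) -> None:
    """Write a tab-separated token map and a Matrix Market style corpus file."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "tokens.tsv"), "w", encoding="utf-8") as fh:
        for token, token_id in sorted(token_ids.items(), key=lambda kv: kv[1]):
            fh.write(f"{token_id}\t{token}\n")
    num_terms = max(token_ids.values(), default=-1) + 1
    num_entries = sum(len(doc) for doc in corpus)
    with open(os.path.join(directory, "corpus.mm"), "w", encoding="utf-8") as fh:
        fh.write("%%MatrixMarket matrix coordinate integer general\n")
        fh.write(f"{len(corpus)} {num_terms} {num_entries}\n")
        for doc_no, doc in enumerate(corpus, start=1):
            for token_id, count in doc:
                # Matrix Market indices are 1-based, so shift the token ID by one.
                fh.write(f"{doc_no} {token_id + 1} {count}\n")


def load_corpus(directory: str) -> Tuple[Dict[str, int], List[Bow]]:
    """Read both files back, restoring the original 0-based token IDs."""
    token_ids: Dict[str, int] = {}
    with open(os.path.join(directory, "tokens.tsv"), encoding="utf-8") as fh:
        for line in fh:
            token_id, token = line.rstrip("\n").split("\t", 1)
            token_ids[token] = int(token_id)
    with open(os.path.join(directory, "corpus.mm"), encoding="utf-8") as fh:
        lines = [ln for ln in fh if ln.strip() and not ln.startswith("%")]
    num_docs = int(lines[0].split()[0])
    corpus: List[Bow] = [[] for _ in range(num_docs)]
    for line in lines[1:]:
        doc_no, term_no, count = line.split()
        corpus[int(doc_no) - 1].append((int(term_no) - 1, int(count)))
    return token_ids, corpus


token_ids = {"graph": 0, "tree": 1, "root": 2}
corpus = [[(0, 2), (1, 1)], [(2, 1), (1, 1)], [(0, 1), (2, 1)]]
with tempfile.TemporaryDirectory() as tmp:
    save_corpus(tmp, token_ids, corpus)
    loaded_ids, loaded_corpus = load_corpus(tmp)
print(loaded_ids == token_ids, loaded_corpus == corpus)
```

Writing entries in their in-memory order means the reloaded pairs come back in the same order. A serializer that sorts each document by term ID would still preserve the IDs and counts, but not necessarily the order of pairs within a document.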

Implementation

@generates

API

from typing import Iterable, List, Tuple, Dict

class CorpusManager:
    def __init__(self, documents: Iterable[Iterable[str]]): ...
    def encode_corpus(self) -> List[List[Tuple[int, int]]]: ...
    def filter_tokens(self, min_docs: int, max_doc_proportion: float) -> None: ...
    def save(self, directory: str) -> None: ...
    @classmethod
    def load(cls, directory: str) -> "CorpusManager": ...
    @property
    def token_ids(self) -> Dict[str, int]: ...

Dependencies { .dependencies }

gensim { .dependency }

Provides streaming vocabulary management and sparse corpus serialization utilities. @satisfied-by

Workspace: tessl
Visibility: Public
Describes: pypipkg:pypi/gensim@4.3.x