tessl/pypi-gensim

tessl install tessl/pypi-gensim@4.3.0

Python library for topic modelling, document indexing and similarity retrieval with large corpora

Agent Success: 78% (agent success rate when using this tile)

Improvement: 1.03x (agent success rate improvement compared to the baseline)

Baseline: 76% (agent success rate without this tile)

evals/scenario-5/task.md

Streaming Text Cleaner

Transforms raw documents into normalized token streams for downstream models without loading everything into memory.

Capabilities

Normalize streaming documents

  • Processing ["Café---Mocha!!!", "Numbers 123 and dots..."] yields [["cafe", "mocha"], ["numbers", "dots"]], preserving document order while lowercasing, deaccenting, stripping punctuation/digits, removing default stopwords (which drops "and"), and excluding tokens outside the configured length bounds. @test
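
A minimal sketch (illustrative, not part of the spec) of how gensim's simple_preprocess and default STOPWORDS could reproduce this example:

from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

docs = ["Café---Mocha!!!", "Numbers 123 and dots..."]
cleaned = [
    [tok for tok in simple_preprocess(doc, deacc=True, min_len=2, max_len=15)
     if tok not in STOPWORDS]
    for doc in docs
]
print(cleaned)  # [['cafe', 'mocha'], ['numbers', 'dots']]  ("and" is a default stopword)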

Stopword filtering

  • Given docs ["This is a simple simple thing"] and extra stopwords {"simple"}, the tokens become ["thing"] after combining custom and default stopwords with length limits applied. @test
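
A sketch of the stopword combination, assuming gensim's STOPWORDS as the default set:

from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

combined = STOPWORDS | {"simple"}  # defaults plus caller-supplied stopwords
tokens = [t for t in simple_preprocess("This is a simple simple thing", min_len=2, max_len=15)
          if t not in combined]
print(tokens)  # ['thing']  ("a" fails min_len; "this"/"is" are default stopwords)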

Custom pre-token filters

  • Accepting a callable that removes hashtags, processing ["check #topic scaling"] returns ["check", "scaling"], with filters applied before tokenization. @test
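
One possible pre-token filter; the hashtag-stripping callable below is hypothetical, not part of the spec:

import re

def drop_hashtags(text: str) -> str:
    # remove "#word" terms from the raw text before tokenization
    return re.sub(r"#\w+", " ", text)

raw = "check #topic scaling"
for f in [drop_hashtags]:  # extra_filters run in order on the raw text
    raw = f(raw)
# raw no longer contains "#topic"; tokenization then yields ["check", "scaling"]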

Persist cleaned corpus

  • Writing cleaned tokens for ["Stream once", "Stream twice!"] to a file produces two lines ("stream once" and "stream twice") and reports that two documents were written. @test
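
A hypothetical usage sketch, assuming the write_clean_corpus API specified below behaves as documented:

count = write_clean_corpus(["Stream once", "Stream twice!"], "corpus.txt")
print(count)  # 2
# corpus.txt then contains, one document per line:
#   stream once
#   stream twice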

Implementation

@generates

API

from typing import Callable, Collection, Iterable, List, Optional, Sequence

def iter_clean_tokens(
    docs: Iterable[str],
    *,
    min_len: int = 2,
    max_len: int = 15,
    stopwords: Optional[Collection[str]] = None,
    extra_filters: Optional[Sequence[Callable[[str], str]]] = None,
) -> Iterable[List[str]]:
    """
    Yield per-document lists of normalized tokens from a potentially large iterable of text.
    Tokens must be lowercase, ASCII-normalized, stripped of punctuation and digits,
    and respect min_len/max_len. Stopwords combine defaults with provided stopwords.
    extra_filters run in order on raw text before tokenization.
    """

def write_clean_corpus(
    docs: Iterable[str],
    output_path: str,
    *,
    delimiter: str = " ",
    **kwargs,
) -> int:
    """
    Write cleaned tokens for each doc to output_path, one line per document,
    using iter_clean_tokens(**kwargs). Returns the number of documents written.
    """

Dependencies { .dependencies }

gensim { .dependency }

Provides streaming-friendly text preprocessing helpers such as token normalization, deaccenting, and stopword removal.
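
A quick illustration of the gensim helpers mentioned above (calls as in gensim 4.3):

from gensim.parsing.preprocessing import STOPWORDS, remove_stopwords
from gensim.utils import deaccent, simple_preprocess

deaccent("Café")                           # 'Cafe'
simple_preprocess("Numbers 123, dots...")  # ['numbers', 'dots']
remove_stopwords("this is a thing")        # 'thing'
"and" in STOPWORDS                         # True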

Version: 4.3.0
Workspace: tessl
Visibility: Public
Created
Last updated
Describes: pypi (pkg:pypi/gensim@4.3.x)
tile.json