tessl/pypi-gensim

tessl install tessl/pypi-gensim@4.3.0

Python library for topic modelling, document indexing and similarity retrieval with large corpora

Agent Success: 78% (agent success rate when using this tile)

Improvement: 1.03x (agent success rate improvement compared to the baseline)

Baseline: 76% (agent success rate without this tile)

evals/scenario-5/task.md

Streaming Text Cleaner

Transforms raw documents into normalized token streams for downstream models without loading everything into memory.

Capabilities

Normalize streaming documents

  • Processing ["Café---Mocha!!!", "Numbers 123 and dots..."] yields [["cafe", "mocha"], ["numbers", "dots"]], preserving document order while lowercasing, deaccenting, stripping punctuation/digits, removing default stopwords (which drops "and"), and excluding tokens outside the configured length bounds. @test
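
A minimal sketch (illustrative, not part of the spec) of how gensim's simple_preprocess and default STOPWORDS could reproduce this example:

from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

docs = ["Café---Mocha!!!", "Numbers 123 and dots..."]
cleaned = [
    [tok for tok in simple_preprocess(doc, deacc=True, min_len=2, max_len=15)
     if tok not in STOPWORDS]
    for doc in docs
]
print(cleaned)  # [['cafe', 'mocha'], ['numbers', 'dots']]  ("and" is a default stopword)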

Stopword filtering

  • Given docs ["This is a simple simple thing"] and extra stopwords {"simple"}, the tokens become ["thing"] after combining custom and default stopwords with length limits applied. @test
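
A sketch of the stopword combination, assuming gensim's STOPWORDS as the default set:

from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

combined = STOPWORDS | {"simple"}  # defaults plus caller-supplied stopwords
tokens = [t for t in simple_preprocess("This is a simple simple thing", min_len=2, max_len=15)
          if t not in combined]
print(tokens)  # ['thing']  ("a" fails min_len; "this"/"is" are default stopwords)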

Custom pre-token filters

  • Accepting a callable that removes hashtags, processing ["check #topic scaling"] returns ["check", "scaling"], with filters applied before tokenization. @test
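
One possible pre-token filter; the hashtag-stripping callable below is hypothetical, not part of the spec:

import re

def drop_hashtags(text: str) -> str:
    # remove "#word" terms from the raw text before tokenization
    return re.sub(r"#\w+", " ", text)

raw = "check #topic scaling"
for f in [drop_hashtags]:  # extra_filters run in order on the raw text
    raw = f(raw)
# raw no longer contains "#topic"; tokenization then yields ["check", "scaling"]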

Persist cleaned corpus

  • Writing cleaned tokens for ["Stream once", "Stream twice!"] to a file produces two lines ("stream once" and "stream twice") and reports that two documents were written. @test
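
A hypothetical usage sketch, assuming the write_clean_corpus API specified below behaves as documented:

count = write_clean_corpus(["Stream once", "Stream twice!"], "corpus.txt")
print(count)  # 2
# corpus.txt then contains, one document per line:
#   stream once
#   stream twice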

Implementation

@generates

API

from typing import Callable, Collection, Iterable, List, Optional, Sequence

def iter_clean_tokens(
    docs: Iterable[str],
    *,
    min_len: int = 2,
    max_len: int = 15,
    stopwords: Optional[Collection[str]] = None,
    extra_filters: Optional[Sequence[Callable[[str], str]]] = None,
) -> Iterable[List[str]]:
    """
    Yield per-document lists of normalized tokens from a potentially large iterable of text.
    Tokens must be lowercase, ASCII-normalized, stripped of punctuation and digits,
    and respect min_len/max_len. Stopwords combine defaults with provided stopwords.
    extra_filters run in order on raw text before tokenization.
    """

def write_clean_corpus(
    docs: Iterable[str],
    output_path: str,
    *,
    delimiter: str = " ",
    **kwargs,
) -> int:
    """
    Write cleaned tokens for each doc to output_path, one line per document,
    using iter_clean_tokens(**kwargs). Returns the number of documents written.
    """

Dependencies { .dependencies }

gensim { .dependency }

Provides streaming-friendly text preprocessing helpers such as token normalization, deaccenting, and stopword removal.
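
A quick illustration of the gensim helpers mentioned above (calls as in gensim 4.3):

from gensim.parsing.preprocessing import STOPWORDS, remove_stopwords
from gensim.utils import deaccent, simple_preprocess

deaccent("Café")                           # 'Cafe'
simple_preprocess("Numbers 123, dots...")  # ['numbers', 'dots']
remove_stopwords("this is a thing")        # 'thing'
"and" in STOPWORDS                         # True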

Version: 4.3.0
Workspace: tessl
Visibility: Public
Created
Last updated
Describes: pypi (pkg:pypi/gensim@4.3.x)
tile.json