
tessl/pypi-gensim

Python library for topic modelling, document indexing and similarity retrieval with large corpora


Streaming Text Cleaner

Transforms raw documents into normalized token streams for downstream models without loading everything into memory.

Capabilities

Normalize streaming documents

  • Processing ["Café---Mocha!!!", "Numbers 123 and dots..."] yields [["cafe", "mocha"], ["numbers", "dots"]], preserving document order while lowercasing, deaccenting, stripping punctuation/digits, and dropping tokens outside the configured length bounds. @test
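The normalization pipeline above can be sketched with the standard library alone; the tiny stopword set here is an assumption standing in for the default list the spec implies:

```python
import re
import unicodedata

ASSUMED_STOPWORDS = {"a", "an", "and", "the"}  # placeholder for the default list

def normalize_doc(text, min_len=2, max_len=15):
    # Lowercase and deaccent ("Café" -> "cafe") via NFKD + ASCII fold.
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Keep alphabetic runs only, so punctuation and digits are dropped.
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens
            if t not in ASSUMED_STOPWORDS and min_len <= len(t) <= max_len]

docs = ["Café---Mocha!!!", "Numbers 123 and dots..."]
print([normalize_doc(d) for d in docs])  # [['cafe', 'mocha'], ['numbers', 'dots']]
```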

Stopword filtering

  • Given docs ["This is a simple simple thing"] and extra stopwords {"simple"}, the tokens become ["thing"] after combining custom and default stopwords with length limits applied. @test
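The key detail in this scenario is that custom stopwords extend the defaults rather than replace them; a minimal sketch (the default stopword subset is an assumption):

```python
import re

DEFAULT_STOPWORDS = {"a", "an", "and", "is", "the", "this"}  # assumed subset

def filter_tokens(doc, extra_stopwords=frozenset(), min_len=2, max_len=15):
    # Union, not replacement: defaults always apply.
    stop = DEFAULT_STOPWORDS | set(extra_stopwords)
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in stop and min_len <= len(t) <= max_len]

print(filter_tokens("This is a simple simple thing", extra_stopwords={"simple"}))
# ['thing']
```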

Custom pre-token filters

  • Given a callable filter that removes hashtags, processing ["check #topic scaling"] returns ["check", "scaling"]; filters are applied to the raw text before tokenization. @test
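Pre-token filters receive and return raw strings, so ordering matters: each filter rewrites the text before any tokenization happens. A sketch of that contract (the filter and helper names are illustrative):

```python
import re

def strip_hashtags(text):
    # Pre-token filter: operates on the raw string, before tokenization.
    return re.sub(r"#\w+", "", text)

def tokenize(text, filters=()):
    for f in filters:  # filters run in the order given, on raw text
        text = f(text)
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("check #topic scaling", filters=[strip_hashtags]))
# ['check', 'scaling']
```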

Persist cleaned corpus

  • Writing cleaned tokens for ["Stream once", "Stream twice!"] to a file produces two lines: stream once and stream twice, and reports that two documents were written. @test

Implementation

@generates

API

from typing import Callable, Collection, Iterable, List, Optional, Sequence

def iter_clean_tokens(
    docs: Iterable[str],
    *,
    min_len: int = 2,
    max_len: int = 15,
    stopwords: Optional[Collection[str]] = None,
    extra_filters: Optional[Sequence[Callable[[str], str]]] = None,
) -> Iterable[List[str]]:
    """
    Yield per-document lists of normalized tokens from a potentially large iterable of text.
    Tokens must be lowercase, ASCII-normalized, stripped of punctuation and digits,
    and respect min_len/max_len. Stopwords combine defaults with provided stopwords.
    extra_filters run in order on raw text before tokenization.
    """

def write_clean_corpus(
    docs: Iterable[str],
    output_path: str,
    *,
    delimiter: str = " ",
    **kwargs,
) -> int:
    """
    Write cleaned tokens for each doc to output_path, one line per document,
    using iter_clean_tokens(**kwargs). Returns the number of documents written.
    """

Dependencies { .dependencies }

gensim { .dependency }

Provides streaming-friendly text preprocessing helpers such as token normalization, deaccenting, and stopword removal.

Install with Tessl CLI

npx tessl i tessl/pypi-gensim
