```
tessl install tessl/pypi-gensim@4.3.0
```

Python library for topic modelling, document indexing and similarity retrieval with large corpora.
| Metric | Value | Description |
| --- | --- | --- |
| Agent Success | 78% | Agent success rate when using this tile |
| Improvement | 1.03x | Agent success rate improvement when using this tile compared to baseline |
| Baseline | 76% | Agent success rate without this tile |
Transforms raw documents into normalized token streams for downstream models without loading everything into memory.
["Café---Mocha!!!", "Numbers 123 and dots..."] yields [["cafe", "mocha"], ["numbers", "dots"]], preserving document order while lowercasing, deaccenting, stripping punctuation/digits, and dropping tokens outside the configured length bounds. @test["This is a simple simple thing"] and extra stopwords {"simple"}, the tokens become ["thing"] after combining custom and default stopwords with length limits applied. @test["check #topic scaling"] returns ["check", "scaling"], with filters applied before tokenization. @test["Stream once", "Stream twice!"] to a file produces two lines: stream once and stream twice, and reports that two documents were written. @test@generates
@generates

```python
from typing import Callable, Collection, Iterable, List, Optional, Sequence


def iter_clean_tokens(
    docs: Iterable[str],
    *,
    min_len: int = 2,
    max_len: int = 15,
    stopwords: Optional[Collection[str]] = None,
    extra_filters: Optional[Sequence[Callable[[str], str]]] = None,
) -> Iterable[List[str]]:
    """
    Yield per-document lists of normalized tokens from a potentially large
    iterable of text. Tokens must be lowercase, ASCII-normalized, stripped of
    punctuation and digits, and respect min_len/max_len. Stopwords combine the
    defaults with any provided stopwords. extra_filters run in order on the
    raw text before tokenization.
    """

def write_clean_corpus(
    docs: Iterable[str],
    output_path: str,
    *,
    delimiter: str = " ",
    **kwargs,
) -> int:
    """
    Write the cleaned tokens for each doc to output_path, one line per
    document, joining tokens with delimiter and cleaning via
    iter_clean_tokens(**kwargs). Returns the number of documents written.
    """
```

Provides streaming-friendly text preprocessing helpers such as token normalization, deaccenting, and stopword removal.