tessl install tessl/pypi-gensim@4.3.0

Python library for topic modelling, document indexing and similarity retrieval with large corpora
Agent Success: 78% (agent success rate when using this tile)
Improvement: 1.03x (agent success rate improvement when using this tile compared to baseline)
Baseline: 76% (agent success rate without this tile)
{
  "context": "Evaluates whether the solution uses gensim's streaming text preprocessing to normalize documents, filter stopwords, support custom filters, and write cleaned corpora without loading everything into memory. Focus is on leveraging built-in tokenization, filtering, and stopword utilities rather than manual reimplementation.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Streaming tokenization",
      "description": "iter_clean_tokens relies on gensim.utils.simple_preprocess (or an equivalent gensim tokenizer) with deaccenting and min_len/max_len to yield per-document tokens lazily.",
      "max_score": 30
    },
    {
      "name": "Stopword merge",
      "description": "Default stopwords come from gensim.parsing.preprocessing.STOPWORDS (or remove_stopwords) and are combined with provided stopwords before filtering tokens.",
      "max_score": 20
    },
    {
      "name": "Built-in cleaners",
      "description": "Punctuation/numeric stripping and length filtering use gensim preprocessing filters (e.g., strip_punctuation, strip_numeric, strip_short) or preprocess_string rather than hand-rolled regexes.",
      "max_score": 20
    },
    {
      "name": "Custom filters hook",
      "description": "extra_filters are threaded through gensim's preprocessing pipeline (e.g., preprocess_string with appended callables) so caller-provided filters run in order before tokenization.",
      "max_score": 15
    },
    {
      "name": "Streaming write",
      "description": "write_clean_corpus consumes the generator incrementally and writes joined tokens to disk without materializing the full corpus in memory, matching the requested delimiter and returning the document count.",
      "max_score": 15
    }
  ]
}