tessl/pypi-gensim

Python library for topic modelling, document indexing and similarity retrieval with large corpora

evals/scenario-2/rubric.json

{
  "context": "Evaluates whether the solution uses gensim's streaming text preprocessing to normalize documents, filter stopwords, support custom filters, and write cleaned corpora without loading everything into memory. Focus is on leveraging built-in tokenization, filtering, and stopword utilities rather than manual reimplementation.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Streaming tokenization",
      "description": "iter_clean_tokens relies on gensim.utils.simple_preprocess (or equivalent gensim tokenizer) with deaccenting and min_len/max_len to yield per-document tokens lazily.",
      "max_score": 30
    },
    {
      "name": "Stopword merge",
      "description": "Default stopwords come from gensim.parsing.preprocessing.STOPWORDS (or remove_stopwords) and are combined with provided stopwords before filtering tokens.",
      "max_score": 20
    },
    {
      "name": "Built-in cleaners",
      "description": "Punctuation/numeric stripping and length filtering use gensim preprocessing filters (e.g., strip_punctuation, strip_numeric, strip_short) or preprocess_string rather than hand-rolled regexes.",
      "max_score": 20
    },
    {
      "name": "Custom filters hook",
      "description": "extra_filters are threaded through gensim's preprocessing pipeline (e.g., preprocess_string with appended callables) so caller-provided filters run in order before tokenization.",
      "max_score": 15
    },
    {
      "name": "Streaming write",
      "description": "write_clean_corpus consumes the generator incrementally and writes joined tokens to disk without materializing the full corpus in memory, matching the requested delimiter and returning the document count.",
      "max_score": 15
    }
  ]
}

Install with Tessl CLI

npx tessl i tessl/pypi-gensim
