
tessl/pypi-gensim

tessl install tessl/pypi-gensim@4.3.0

Python library for topic modelling, document indexing and similarity retrieval with large corpora

Agent Success

Agent success rate when using this tile

78%

Improvement

Agent success rate improvement when using this tile compared to baseline

1.03x

Baseline

Agent success rate without this tile

76%

evals/scenario-5/rubric.json

{
  "context": "Evaluates whether the solution uses gensim's streaming text preprocessing to normalize documents, filter stopwords, support custom filters, and write cleaned corpora without loading everything into memory. Focus is on leveraging built-in tokenization, filtering, and stopword utilities rather than manual reimplementation.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Streaming tokenization",
      "description": "iter_clean_tokens relies on gensim.utils.simple_preprocess (or equivalent gensim tokenizer) with deaccenting and min_len/max_len to yield per-document tokens lazily.",
      "max_score": 30
    },
    {
      "name": "Stopword merge",
      "description": "Default stopwords come from gensim.parsing.preprocessing.STOPWORDS (or remove_stopwords) and are combined with provided stopwords before filtering tokens.",
      "max_score": 20
    },
    {
      "name": "Built-in cleaners",
      "description": "Punctuation/numeric stripping and length filtering use gensim preprocessing filters (e.g., strip_punctuation, strip_numeric, strip_short) or preprocess_string rather than hand-rolled regexes.",
      "max_score": 20
    },
    {
      "name": "Custom filters hook",
      "description": "extra_filters are threaded through gensim's preprocessing pipeline (e.g., preprocess_string with appended callables) so caller-provided filters run in order before tokenization.",
      "max_score": 15
    },
    {
      "name": "Streaming write",
      "description": "write_clean_corpus consumes the generator incrementally and writes joined tokens to disk without materializing the full corpus in memory, matching the requested delimiter and returning the document count.",
      "max_score": 15
    }
  ]
}

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/gensim@4.3.x