
tessl/pypi-gensim

tessl install tessl/pypi-gensim@4.3.0

Python library for topic modelling, document indexing and similarity retrieval with large corpora

Agent success rate with this tile: 78%
Improvement over baseline: 1.03x
Baseline (agent success rate without this tile): 76%


Topic Coherence Reporter

Build a helper that scores candidate topics against a small reference corpus using coherence metrics to identify which topics align with the documents.

Capabilities

Corpus preparation

  • Preparing the texts ["Cat, cat! Dog3", "Dog plays DOG."] yields normalized tokens [["cat", "cat", "dog"], ["dog", "plays", "dog"]]; the vocabulary contains exactly {"cat", "dog", "plays"}; the first bag-of-words entry counts "cat" twice and "dog" once. @test
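
The normalization described above can be sketched with the standard library alone. This is a minimal illustration, not the expected solution: the helper names `normalize` and `build_bow` are hypothetical, and in a gensim-based implementation `gensim.utils.simple_preprocess` and `gensim.corpora.Dictionary.doc2bow` would likely fill the same roles.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Runs of letters only: punctuation and digits act as separators,
# so "Dog3" contributes the token "dog".
_ALPHA_RUN = re.compile(r"[^\W\d_]+")


def normalize(text: str) -> List[str]:
    """Lowercase a document and keep only its alphabetic runs (hypothetical helper)."""
    return [match.group(0).lower() for match in _ALPHA_RUN.finditer(text)]


def build_bow(docs: List[List[str]]) -> Tuple[Dict[str, int], List[List[Tuple[int, int]]]]:
    """Assign ids in first-seen order and count tokens per document (hypothetical helper)."""
    vocab: Dict[str, int] = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    bows = [sorted(Counter(vocab[token] for token in doc).items()) for doc in docs]
    return vocab, bows


tokens = [normalize(text) for text in ["Cat, cat! Dog3", "Dog plays DOG."]]
vocab, bows = build_bow(tokens)
```

Here `bows[0]` pairs the id of "cat" with a count of 2 and the id of "dog" with a count of 1, which mirrors the bag-of-words expectation in the test case.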

Coherence scoring

  • Given the reference texts ["human machine interface for lab abc computer applications", "a survey of user opinion of computer system response time", "the eps user interface management system", "system and human system engineering testing of eps", "user response time from eps interface"] and topics [["system", "user", "interface"], ["pineapple", "kiwi", "mango"]], the returned structure includes both u_mass and c_v sections, each with two per-topic scores matching the topic order and an average equal to the mean of those values. The coherent topic (["system", "user", "interface"]) yields higher u_mass and c_v scores than the unrelated topic. @test
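
The shape of each metric section can be sketched independently of how the scores are computed. The example below assumes the per-topic scores are already available; `metric_section` is a hypothetical helper, and the numbers are placeholders chosen for easy arithmetic, not real coherence values.

```python
from typing import Dict, List, Union


def metric_section(per_topic: List[float]) -> Dict[str, Union[List[float], float]]:
    """Wrap per-topic scores with their arithmetic mean (hypothetical helper)."""
    if not per_topic:
        raise ValueError("at least one topic score is required")
    scores = [float(score) for score in per_topic]
    return {"per_topic": scores, "average": sum(scores) / len(scores)}


# Placeholder inputs only: the first topic is deliberately scored higher
# on both metrics, matching the ordering the test case expects.
placeholder_scores = {"u_mass": [-1.5, -3.0], "c_v": [0.75, 0.25]}
report = {name: metric_section(values) for name, values in placeholder_scores.items()}
```

Keeping the averaging in one helper means both metrics follow the same contract: the order of `per_topic` matches the input topics, and `average` is derived from exactly those values.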

Top-word control

  • With the same reference texts, scoring topics [["system", "user", "interface", "response", "eps", "zzz"]] and [["system", "user", "interface"]] using topn=3 returns identical per-topic coherence values for those two topics, confirming that trailing words beyond the limit are ignored. @test
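
One way to satisfy this requirement is to truncate every topic before scoring, so that any scorer receives identical input for the two topics. The sketch below shows only that truncation step; `limit_topics` is a hypothetical helper. gensim's `CoherenceModel` also accepts a `topn` argument, which may make explicit truncation unnecessary.

```python
from typing import List


def limit_topics(topics: List[List[str]], topn: int) -> List[List[str]]:
    """Keep only the first `topn` words of each topic (hypothetical helper)."""
    if topn < 1:
        raise ValueError("topn must be a positive integer")
    return [list(topic[:topn]) for topic in topics]


long_topic = ["system", "user", "interface", "response", "eps", "zzz"]
short_topic = ["system", "user", "interface"]
trimmed = limit_topics([long_topic, short_topic], topn=3)
```

After truncation both entries of `trimmed` hold the same three words, so a deterministic scorer cannot return different values for them.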

Implementation

@generates

API

from typing import Any, Dict, List, Tuple

def prepare_corpus(texts: List[str]) -> Tuple[List[List[str]], Any, Any]:
    """
    Normalizes raw documents into token lists, a vocabulary mapping, and a bag-of-words corpus suitable for coherence metrics.
    Tokens must be lowercased with punctuation and digits removed (so "Dog3" becomes "dog"); numeric-only tokens are dropped.
    """

def score_topics(topics: List[List[str]], texts: List[str], topn: int = 10) -> Dict[str, Any]:
    """
    Computes topic coherence against reference texts using the top words from each topic.
    Returns a structure with two metrics ('u_mass' and 'c_v'), each containing:
      - 'per_topic': list of floats aligned with the input topics
      - 'average': mean of the per-topic values
    Also returns 'ranked_topics': list of {'topic': index, 'c_v': score} sorted by descending c_v.
    """
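
The `ranked_topics` entry described in the docstring can be derived from the c_v scores alone. The following is a minimal sketch under that assumption; `rank_topics` is a hypothetical helper and the scores are placeholders, not measured coherence values.

```python
from typing import Dict, List, Union


def rank_topics(c_v_scores: List[float]) -> List[Dict[str, Union[int, float]]]:
    """Order topic indices by descending c_v score (hypothetical helper)."""
    order = sorted(range(len(c_v_scores)), key=lambda i: c_v_scores[i], reverse=True)
    return [{"topic": i, "c_v": float(c_v_scores[i])} for i in order]


# Placeholder scores for three topics, listed in input order.
ranked = rank_topics([0.2, 0.7, 0.5])
```

Each entry keeps the original topic index, so callers can map a rank back to the corresponding position in `per_topic`.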

Dependencies { .dependencies }

gensim { .dependency }

Provides topic coherence evaluation utilities and text preprocessing helpers.
