tessl/pypi-gensim

Python library for topic modelling, document indexing and similarity retrieval with large corpora

Topic Coherence Reporter

Build a helper that scores candidate topics against a small reference corpus with the u_mass and c_v coherence metrics, showing which topics align with the documents.

Capabilities

Corpus preparation

  • Preparing the texts ["Cat, cat! Dog3", "Dog plays DOG."] yields normalized tokens [["cat", "cat", "dog"], ["dog", "plays", "dog"]]; the vocabulary contains exactly {"cat", "dog", "plays"}; the first bag-of-words entry counts "cat" twice and "dog" once. @test
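The normalization rule in this test case can be sketched with the standard library alone. This is a minimal illustration, not the intended gensim-based implementation: `prepare_corpus_sketch` is a hypothetical name, the plain `dict` stands in for a gensim dictionary, and treating digits as token separators is an assumption chosen so that "Dog3" becomes "dog".

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Runs of ASCII letters only: punctuation and digits act as separators, so
# "Dog3" yields "dog" and a purely numeric token such as "2024" yields nothing.
# (Assumption for this sketch; a gensim-based version may tokenize differently.)
_WORD = re.compile(r"[a-z]+")


def prepare_corpus_sketch(
    texts: List[str],
) -> Tuple[List[List[str]], Dict[str, int], List[List[Tuple[int, int]]]]:
    """Hypothetical stdlib-only stand-in for prepare_corpus."""
    tokens = [_WORD.findall(text.lower()) for text in texts]

    # Assign ids in order of first appearance (an arbitrary choice here).
    vocab: Dict[str, int] = {}
    for doc in tokens:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))

    # One bag-of-words per document: sorted (token_id, count) pairs.
    bows = [sorted(Counter(vocab[t] for t in doc).items()) for doc in tokens]
    return tokens, vocab, bows


tokens, vocab, bows = prepare_corpus_sketch(["Cat, cat! Dog3", "Dog plays DOG."])
print(tokens)         # [['cat', 'cat', 'dog'], ['dog', 'plays', 'dog']]
print(sorted(vocab))  # ['cat', 'dog', 'plays']
print(bows[0])        # [(0, 2), (1, 1)]
```

The `(token_id, count)` pair layout mirrors the bag-of-words shape that gensim corpora use, so the first entry records "cat" twice and "dog" once.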

Coherence scoring

  • Given the reference texts ["human machine interface for lab abc computer applications", "a survey of user opinion of computer system response time", "the eps user interface management system", "system and human system engineering testing of eps", "user response time from eps interface"] and topics [["system", "user", "interface"], ["pineapple", "kiwi", "mango"]], the returned structure includes both u_mass and c_v sections, each with two per-topic scores matching the topic order and an average equal to the mean of those values. The coherent topic (["system", "user", "interface"]) yields higher u_mass and c_v scores than the unrelated topic. @test
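The shape of the returned structure can be sketched separately from the coherence computation itself. The helper below is hypothetical (`summarize_scores` is not part of the spec) and only packages scores that are assumed to have been computed already; the numbers passed in are placeholders chosen for easy arithmetic, not real coherence values for the texts above. The `ranked_topics` key follows the description in the API section.

```python
from typing import Any, Dict, List


def summarize_scores(u_mass: List[float], c_v: List[float]) -> Dict[str, Any]:
    """Hypothetical helper: package per-topic scores into the report shape."""
    if len(u_mass) != len(c_v) or not u_mass:
        raise ValueError("expected one u_mass and one c_v score per topic")

    def section(scores: List[float]) -> Dict[str, Any]:
        # Keep the input order so scores stay aligned with the topics.
        return {"per_topic": list(scores), "average": sum(scores) / len(scores)}

    # Rank topic indices by descending c_v.
    ranked = sorted(
        ({"topic": i, "c_v": s} for i, s in enumerate(c_v)),
        key=lambda item: item["c_v"],
        reverse=True,
    )
    return {"u_mass": section(u_mass), "c_v": section(c_v), "ranked_topics": ranked}


# Placeholder scores (illustrative only): topic 0 is the coherent one.
report = summarize_scores(u_mass=[-1.0, -3.0], c_v=[0.75, 0.25])
print(report["u_mass"]["average"])  # -2.0
print(report["c_v"]["average"])     # 0.5
print(report["ranked_topics"][0])   # {'topic': 0, 'c_v': 0.75}
```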

Top-word control

  • With the same reference texts, scoring topics [["system", "user", "interface", "response", "eps", "zzz"]] and [["system", "user", "interface"]] using topn=3 returns identical per-topic coherence values for those two topics, confirming that trailing words beyond the limit are ignored. @test
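One way to satisfy this requirement is to truncate each topic before scoring, so both inputs reach the coherence step as the same word list. The sketch below assumes that approach; `truncate_topics` is a hypothetical helper name. gensim's `CoherenceModel` also accepts a `topn` argument, which may make explicit truncation unnecessary.

```python
from typing import List


def truncate_topics(topics: List[List[str]], topn: int = 10) -> List[List[str]]:
    """Hypothetical helper: keep only the first `topn` words of each topic."""
    if topn < 1:
        raise ValueError("topn must be a positive integer")
    return [topic[:topn] for topic in topics]


long_topic = ["system", "user", "interface", "response", "eps", "zzz"]
short_topic = ["system", "user", "interface"]

# After truncation the two topics are the same word list, so any
# deterministic scorer will return the same value for both.
print(truncate_topics([long_topic], topn=3) == truncate_topics([short_topic], topn=3))  # True
```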

Implementation

@generates

API

from typing import Any, Dict, List, Tuple

def prepare_corpus(texts: List[str]) -> Tuple[List[List[str]], Any, Any]:
    """
    Normalizes raw documents into token lists, a vocabulary mapping, and a bag-of-words corpus suitable for coherence metrics.
    Tokens must be lowercased, with punctuation and digits removed (so "Dog3" becomes "dog") and numeric-only tokens dropped.
    """

def score_topics(topics: List[List[str]], texts: List[str], topn: int = 10) -> Dict[str, Any]:
    """
    Computes topic coherence against reference texts using only the first `topn` words of each topic.
    Returns a structure with two metrics ('u_mass' and 'c_v'), each containing:
      - 'per_topic': list of floats aligned with the input topics
      - 'average': mean of the per-topic values
    Also returns 'ranked_topics': list of {'topic': index, 'c_v': score} sorted by descending c_v.
    """

Dependencies { .dependencies }

gensim { .dependency }

Provides topic coherence evaluation utilities and text preprocessing helpers.
