tessl/pypi-gensim

Python library for topic modelling, document indexing and similarity retrieval with large corpora

Topic Coherence Reporter

Name: tessl/pypi-gensim
Rating: 0.78 (1 review)
Author: tessl

Build a helper that scores candidate topics against a small reference corpus using coherence metrics to identify which topics align with the documents.

Capabilities

Corpus preparation

Preparing the texts ["Cat, cat! Dog3", "Dog plays DOG."] yields normalized tokens [["cat", "cat", "dog"], ["dog", "plays", "dog"]]; the vocabulary contains exactly {"cat", "dog", "plays"}; the first bag-of-words entry counts "cat" twice and "dog" once. @test
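
The expected output above can be reproduced with a minimal standard-library sketch. This is not the gensim-based implementation the spec asks for; `normalize`, `build_vocab`, and `to_bow` are hypothetical names, and the plain `dict` is only a stand-in for whatever vocabulary object the real helper returns. It assumes ASCII input and that digits are deleted from inside tokens, which is what the "Dog3" to "dog" example implies.

```python
import re
from collections import Counter

def normalize(text):
    # Lowercase, delete every character that is not a-z or whitespace
    # (punctuation and digits), then split on whitespace. A token made
    # only of digits becomes empty and disappears in the split.
    return re.sub(r"[^a-z\s]", "", text.lower()).split()

def build_vocab(docs):
    # Hypothetical stand-in for a vocabulary mapping: token -> integer id,
    # assigned in sorted order so the ids are deterministic.
    unique_tokens = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(unique_tokens)}

def to_bow(doc, token2id):
    # One bag-of-words entry: sorted (token_id, count) pairs.
    return sorted(Counter(token2id[tok] for tok in doc).items())

texts = ["Cat, cat! Dog3", "Dog plays DOG."]
tokens = [normalize(t) for t in texts]
token2id = build_vocab(tokens)
bow = [to_bow(doc, token2id) for doc in tokens]
```

Here `bow[0]` holds a count of 2 for "cat" and 1 for "dog", matching the test description. A gensim-based version would presumably delegate tokenization and id assignment to the library instead.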

Coherence scoring

Given the reference texts ["human machine interface for lab abc computer applications", "a survey of user opinion of computer system response time", "the eps user interface management system", "system and human system engineering testing of eps", "user response time from eps interface"] and topics [["system", "user", "interface"], ["pineapple", "kiwi", "mango"]], the returned structure includes both u_mass and c_v sections, each with two per-topic scores matching the topic order and an average equal to the mean of those values. The coherent topic (["system", "user", "interface"]) yields higher u_mass and c_v scores than the unrelated topic. @test
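
The shape of the returned structure can be sketched independently of how the scores are computed. The snippet below assumes the per-topic scores are already available as plain lists; `summarize` and `build_report` are hypothetical helper names, and the numbers passed in are placeholders chosen for easy arithmetic, not real coherence values for the texts above.

```python
from statistics import fmean

def summarize(per_topic):
    # Keep the scores in input-topic order and attach their mean.
    scores = list(per_topic)
    return {"per_topic": scores, "average": fmean(scores)}

def build_report(u_mass_scores, c_v_scores):
    # Rank topics by descending c_v while remembering the original index.
    ranked = sorted(
        ({"topic": i, "c_v": score} for i, score in enumerate(c_v_scores)),
        key=lambda row: row["c_v"],
        reverse=True,
    )
    return {
        "u_mass": summarize(u_mass_scores),
        "c_v": summarize(c_v_scores),
        "ranked_topics": ranked,
    }

# Placeholder scores only: topic 0 stands in for the coherent topic.
report = build_report(u_mass_scores=[-1.0, -3.0], c_v_scores=[0.5, 0.25])
```

With these placeholder inputs, `report["u_mass"]["average"]` is -2.0 and `report["ranked_topics"][0]["topic"]` is 0, which is the relationship the test expects between the coherent and the unrelated topic.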

Top-word control

With the same reference texts, scoring topics [["system", "user", "interface", "response", "eps", "zzz"]] and [["system", "user", "interface"]] using topn=3 returns identical per-topic coherence values for those two topics, confirming that trailing words beyond the limit are ignored. @test
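
One way to obtain this behavior is to cut every topic down to its first `topn` words before any scoring happens. The sketch below shows only that truncation step; `truncate_topics` is a hypothetical helper, and the spec does not say whether the real implementation truncates manually or relies on a library parameter.

```python
def truncate_topics(topics, topn=10):
    # Keep only the leading `topn` words of each topic; shorter topics
    # are returned unchanged because slicing past the end is harmless.
    if topn < 1:
        raise ValueError("topn must be a positive integer")
    return [list(topic[:topn]) for topic in topics]

long_topic = ["system", "user", "interface", "response", "eps", "zzz"]
short_topic = ["system", "user", "interface"]
truncated = truncate_topics([long_topic, short_topic], topn=3)
```

After truncation both entries contain the same three words, so any deterministic scorer applied to them returns the same value.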

Implementation

@generates

API

from typing import Any, Dict, List, Tuple

def prepare_corpus(texts: List[str]) -> Tuple[List[List[str]], Any, Any]:
    """
    Normalizes raw documents into token lists, a vocabulary mapping, and a bag-of-words corpus suitable for coherence metrics.
    Tokens must be lowercased with punctuation and digits stripped (e.g. "Dog3" becomes "dog"); tokens left empty, such as numeric-only tokens, are dropped.
    """

def score_topics(topics: List[List[str]], texts: List[str], topn: int = 10) -> Dict[str, Any]:
    """
    Computes topic coherence against reference texts using the top words from each topic.
    Returns a structure with two metrics ('u_mass' and 'c_v'), each containing:
      - 'per_topic': list of floats aligned with the input topics
      - 'average': mean of the per-topic values
    Also returns 'ranked_topics': list of {'topic': index, 'c_v': score} sorted by descending c_v.
    """

Dependencies { .dependencies }

gensim { .dependency }

Provides topic coherence evaluation utilities and text preprocessing helpers.

Install with Tessl CLI

npx tessl i tessl/pypi-gensim