tessl install tessl/pypi-gensim@4.3.0
Python library for topic modelling, document indexing and similarity retrieval with large corpora.
Agent Success: 78% (agent success rate when using this tile)
Improvement: 1.03x (agent success rate improvement when using this tile compared to baseline)
Baseline: 76% (agent success rate without this tile)
Build a helper that scores candidate topics against a small reference corpus using coherence metrics to identify which topics align with the documents.
["Cat, cat! Dog3", "Dog plays DOG."] yields normalized tokens [["cat", "cat", "dog"], ["dog", "plays", "dog"]]; the vocabulary contains exactly {"cat", "dog", "plays"}; the first bag-of-words entry counts "cat" twice and "dog" once. @test["human machine interface for lab abc computer applications", "a survey of user opinion of computer system response time", "the eps user interface management system", "system and human system engineering testing of eps", "user response time from eps interface"] and topics [["system", "user", "interface"], ["pineapple", "kiwi", "mango"]], the returned structure includes both u_mass and c_v sections, each with two per-topic scores matching the topic order and an average equal to the mean of those values. The coherent topic (["system", "user", "interface"]) yields higher u_mass and c_v scores than the unrelated topic. @test[["system", "user", "interface", "response", "eps", "zzz"]] and [["system", "user", "interface"]] using topn=3 returns identical per-topic coherence values for those two topics, confirming that trailing words beyond the limit are ignored. @test@generates
from typing import Any, Dict, List, Tuple

def prepare_corpus(texts: List[str]) -> Tuple[List[List[str]], Any, Any]:
    """
    Normalizes raw documents into token lists, a vocabulary mapping, and a bag-of-words corpus suitable for coherence metrics.
    Tokens must be lowercased with punctuation removed and numeric-only tokens dropped.
    """
def score_topics(topics: List[List[str]], texts: List[str], topn: int = 10) -> Dict[str, Any]:
    """
    Computes topic coherence against reference texts using the top words from each topic.
    Returns a structure with two metrics ('u_mass' and 'c_v'), each containing:
    - 'per_topic': list of floats aligned with the input topics
    - 'average': mean of the per-topic values
    Also returns 'ranked_topics': a list of {'topic': index, 'c_v': score} sorted by descending c_v.
    """

Provides topic coherence evaluation utilities and text preprocessing helpers.