Python library for topic modelling, document indexing and similarity retrieval with large corpora
npx @tessl/cli install tessl/pypi-gensim@4.3.0
A comprehensive Python library for natural language processing and information retrieval that specializes in topic modeling, document indexing, and similarity retrieval for large text corpora. Gensim provides memory-efficient implementations of popular algorithms like Word2Vec, Doc2Vec, FastText, Latent Dirichlet Allocation (LDA), and Latent Semantic Analysis (LSA) with optimized C/C++ extensions for production-scale applications.
pip install gensim
import gensim
Access main modules:
from gensim import corpora, models, similarities
from gensim.models import Word2Vec, LdaModel, Doc2Vec
from gensim.corpora import Dictionary
import gensim.downloader as api
from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import preprocess_string
import gensim.downloader as api
# Load a dataset
dataset = api.load("text8")  # text8: the first 100 MB of cleaned English Wikipedia text
# Create a dictionary and corpus
dictionary = corpora.Dictionary(dataset)
corpus = [dictionary.doc2bow(text) for text in dataset]
# Train an LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    passes=10
)
# Get topics
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)
# Load pre-trained word vectors
word_vectors = api.load("glove-twitter-25")
similar_words = word_vectors.most_similar("python", topn=5)
print(similar_words)
Gensim follows a modular architecture built around three core concepts:
This design enables memory-efficient processing of corpora larger than available RAM through streaming and online algorithms. The library integrates deeply with NumPy and SciPy for mathematical operations and provides optional Cython extensions for performance-critical components.
Core machine learning models including topic models (LDA, HDP), word embeddings (Word2Vec, FastText, Doc2Vec), and dimensionality reduction techniques (LSI, TF-IDF). These models support streaming training and can process datasets larger than memory.
# Topic Models
class LdaModel: ...
class HdpModel: ...
class LdaMulticore: ...
# Word Embeddings
class Word2Vec: ...
class Doc2Vec: ...
class FastText: ...
class KeyedVectors: ...
# Dimensionality Reduction
class LsiModel: ...
class TfidfModel: ...
class RpModel: ...
NLP Models and Transformations
Comprehensive corpus I/O supporting 13+ formats including Matrix Market, SVMlight, and Wikipedia dumps. Provides dictionary management for word-to-ID mappings with frequency statistics and corpus preprocessing utilities.
# Core Corpus Classes
class Dictionary: ...
class MmCorpus: ...
class TextCorpus: ...
class WikiCorpus: ...
# Additional Formats
class BleiCorpus: ...
class SvmLightCorpus: ...
class UciCorpus: ...
Efficient similarity calculations for documents and terms including cosine similarity, soft cosine similarity with term relationships, and Word Mover's Distance. Supports both dense and sparse similarity matrices with sharded indexing for large corpora.
# Document Similarity
class Similarity: ...
class MatrixSimilarity: ...
class SoftCosineSimilarity: ...
class WmdSimilarity: ...
# Term Similarity
class WordEmbeddingSimilarityIndex: ...
class SparseTermSimilarityMatrix: ...
Comprehensive text preprocessing pipeline with stemming, stopword removal, tokenization, and text cleaning functions. Supports customizable preprocessing chains for document preparation.
# Preprocessing Functions
def preprocess_string(s: str, filters: list = None) -> list: ...
def remove_stopwords(s: str) -> str: ...
def strip_punctuation(s: str) -> str: ...
def stem_text(text: str) -> str: ...
# Stemming Classes
class PorterStemmer: ...
Linear algebra operations, vector manipulations, and distance metrics optimized for NLP tasks. Includes BLAS integration, sparse/dense matrix conversions, and statistical measures like KL divergence and Jensen-Shannon distance.
# Vector Operations
def unitvec(vec): ...
def cossim(vec1, vec2): ...
def veclen(vec): ...
# Matrix Operations
def corpus2csc(corpus): ...
def sparse2full(vec, length): ...
# Distance Metrics
def kullback_leibler(vec1, vec2): ...
def jensen_shannon(vec1, vec2): ...
Convenient API for downloading pre-trained models and datasets including Word2Vec, GloVe, FastText models, and text corpora. Handles caching, version management, and integrity verification.
def load(name: str, return_path: bool = False): ...
def info(name: str = None): ...
# Base Interfaces
class CorpusABC:
    def __iter__(self): ...
    def __len__(self): ...
class TransformationABC:
    def __getitem__(self, bow): ...
class SimilarityABC:
    def __getitem__(self, query): ...
# Common Types
BowDocument = list[tuple[int, float]] # Bag-of-words document representation
Corpus = Iterable[BowDocument] # Stream of documents
Dictionary = dict[str, int] # Word to ID mapping