tessl/pypi-gensim

Python library for topic modelling, document indexing and similarity retrieval with large corpora

Workspace: tessl
Visibility: Public
Describes: pkg:pypi/gensim@4.3.x

To install, run

npx @tessl/cli install tessl/pypi-gensim@4.3.0


Gensim

A comprehensive Python library for natural language processing and information retrieval that specializes in topic modeling, document indexing, and similarity retrieval for large text corpora. Gensim provides memory-efficient implementations of popular algorithms like Word2Vec, Doc2Vec, FastText, Latent Dirichlet Allocation (LDA), and Latent Semantic Analysis (LSA) with optimized C/C++ extensions for production-scale applications.

Package Information

  • Package Name: gensim
  • Language: Python
  • Installation: pip install gensim
  • Version: 4.3.3
  • License: LGPL-2.1

Core Imports

import gensim

Access main modules:

from gensim import corpora, models, similarities
from gensim.models import Word2Vec, LdaModel, Doc2Vec
from gensim.corpora import Dictionary
import gensim.downloader as api

Basic Usage

from gensim import corpora
from gensim.models import LdaModel
import gensim.downloader as api

# Load a dataset (text8: ~100 MB of cleaned English Wikipedia text, streamed as lists of tokens)
dataset = api.load("text8")

# Create a dictionary and bag-of-words corpus
dictionary = corpora.Dictionary(dataset)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop very rare and very common words
corpus = [dictionary.doc2bow(text) for text in dataset]

# Train an LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    passes=10
)

# Get topics
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

# Load pre-trained word vectors
word_vectors = api.load("glove-twitter-25")
similar_words = word_vectors.most_similar("python", topn=5)
print(similar_words)

Architecture

Gensim follows a modular architecture built around three core concepts:

  • Corpora: Streaming document collections with various I/O formats (Matrix Market, SVMlight, etc.)
  • Models: Transformation algorithms that convert documents between vector representations
  • Similarities: Efficient similarity queries for large document collections

This design enables memory-efficient processing of corpora larger than available RAM through streaming and online algorithms. The library integrates deeply with NumPy and SciPy for mathematical operations and provides optional Cython extensions for performance-critical components.
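
To make the flow concrete, here is a minimal sketch of the corpora → models → similarities pipeline; the toy documents are invented for illustration, and in real use the token lists would be streamed from disk rather than held in a Python list.

from gensim import corpora, models, similarities

# Toy documents; a streamed iterable of token lists works the same way
documents = [
    ["human", "interface", "computer"],
    ["graph", "minors", "survey"],
    ["graph", "trees", "computer", "survey"],
]

# Corpora: map tokens to integer IDs and build bag-of-words vectors
dictionary = corpora.Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Models: transform bag-of-words counts into TF-IDF weights
tfidf = models.TfidfModel(bow_corpus)

# Similarities: index the transformed corpus and run a query against it
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))
query = tfidf[dictionary.doc2bow(["graph", "survey"])]
print(list(index[query]))  # cosine similarity of the query to each document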

Capabilities

NLP Models and Transformations

Core machine learning models including topic models (LDA, HDP), word and document embeddings (Word2Vec, FastText, Doc2Vec), and vector-space transformations such as TF-IDF, LSI, and random projections. These models support streaming training and can process datasets larger than memory.

# Topic Models
class LdaModel: ...
class HdpModel: ...
class LdaMulticore: ...

# Word Embeddings  
class Word2Vec: ...
class Doc2Vec: ...
class FastText: ...
class KeyedVectors: ...

# Dimensionality Reduction
class LsiModel: ...
class TfidfModel: ...
class RpModel: ...
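
As a brief illustration of the embedding side, the sketch below trains a small Word2Vec model; the sentences and hyperparameters are arbitrary toy values, not recommendations.

from gensim.models import Word2Vec

# Illustrative toy sentences; any restartable iterable of token lists will do
sentences = [
    ["machine", "learning", "with", "python"],
    ["topic", "modelling", "with", "gensim"],
    ["word", "embeddings", "capture", "meaning"],
]

# min_count=1 keeps rare words so the toy corpus is not filtered away entirely
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)

# Trained vectors are exposed on model.wv as a KeyedVectors instance
vector = model.wv["python"]
print(model.wv.most_similar("python", topn=3))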

See docs/nlp-models.md.

Corpus Management

Comprehensive corpus I/O supporting 13+ formats including Matrix Market, SVMlight, and Wikipedia dumps. Provides dictionary management for word-to-ID mappings with frequency statistics and corpus preprocessing utilities.

# Core Corpus Classes
class Dictionary: ...
class MmCorpus: ...
class TextCorpus: ...
class WikiCorpus: ...

# Additional Formats
class BleiCorpus: ...
class SvmLightCorpus: ...
class UciCorpus: ...
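
For example, a Dictionary can be built from tokenized documents and the resulting bag-of-words corpus serialized to Matrix Market format, then streamed back from disk; the documents and file path below are illustrative.

from gensim.corpora import Dictionary, MmCorpus

documents = [
    ["survey", "of", "topic", "models"],
    ["topic", "models", "for", "large", "corpora"],
]

# Dictionary assigns integer IDs to tokens and tracks document frequencies
dictionary = Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Serialize to Matrix Market format, then stream it back lazily from disk
MmCorpus.serialize("corpus.mm", bow_corpus)  # example path
corpus = MmCorpus("corpus.mm")
for doc in corpus:
    print(doc)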

See docs/corpus-management.md.

Similarity Computations

Efficient similarity calculations for documents and terms including cosine similarity, soft cosine similarity with term relationships, and Word Mover's Distance. Supports both dense and sparse similarity matrices with sharded indexing for large corpora.

# Document Similarity
class Similarity: ...
class MatrixSimilarity: ...  
class SoftCosineSimilarity: ...
class WmdSimilarity: ...

# Term Similarity
class WordEmbeddingSimilarityIndex: ...
class SparseTermSimilarityMatrix: ...
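
A sketch of querying a sharded Similarity index over a TF-IDF corpus; the documents and the index prefix path are illustrative, and MatrixSimilarity could be used instead when the whole index fits in RAM.

from gensim import corpora, similarities
from gensim.models import TfidfModel

docs = [
    ["shipment", "of", "gold", "damaged", "in", "a", "fire"],
    ["delivery", "of", "silver", "arrived", "in", "a", "silver", "truck"],
    ["shipment", "of", "gold", "arrived", "in", "a", "truck"],
]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
tfidf = TfidfModel(bow)

# Similarity shards its index on disk, so corpora larger than RAM can be queried;
# "sim_index" is just a prefix for the shard files
index = similarities.Similarity("sim_index", tfidf[bow], num_features=len(dictionary))

query = tfidf[dictionary.doc2bow(["gold", "silver", "truck"])]
for doc_id, score in sorted(enumerate(index[query]), key=lambda item: -item[1]):
    print(doc_id, score)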

See docs/similarity-computations.md.

Text Preprocessing

Comprehensive text preprocessing pipeline with stemming, stopword removal, tokenization, and text cleaning functions. Supports customizable preprocessing chains for document preparation.

# Preprocessing Functions
def preprocess_string(s: str, filters: list = None) -> list: ...
def remove_stopwords(s: str) -> str: ...
def strip_punctuation(s: str) -> str: ...
def stem_text(text: str) -> str: ...

# Stemming Classes
class PorterStemmer: ...
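
For example, preprocess_string applies the default filter chain in a single call, and an explicit chain of filters can be supplied instead; the sample sentence is arbitrary.

from gensim.parsing.preprocessing import (
    preprocess_string,
    remove_stopwords,
    strip_punctuation,
    stem_text,
)

raw = "Topic models, such as LDA, discover hidden themes in large corpora!"

# Default chain: lowercase, strip tags/punctuation/whitespace/numerics,
# remove stopwords, drop short tokens, and stem
print(preprocess_string(raw))

# Individual steps can also be applied directly to strings
print(stem_text(remove_stopwords(strip_punctuation(raw).lower())))

# Or pass an explicit filter chain
custom_filters = [lambda s: s.lower(), strip_punctuation, remove_stopwords]
print(preprocess_string(raw, filters=custom_filters))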

See docs/text-preprocessing.md.

Mathematical Utilities

Linear algebra operations, vector manipulations, and distance metrics optimized for NLP tasks. Includes BLAS integration, sparse/dense matrix conversions, and statistical measures like KL divergence and Jensen-Shannon distance.

# Vector Operations
def unitvec(vec): ...
def cossim(vec1, vec2): ...
def veclen(vec): ...

# Matrix Operations  
def corpus2csc(corpus): ...
def sparse2full(vec, length): ...

# Distance Metrics
def kullback_leibler(vec1, vec2): ...
def jensen_shannon(vec1, vec2): ...
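
A few of these helpers in action on gensim's sparse (id, weight) vectors; the vectors themselves are made up for illustration.

import numpy as np
from gensim import matutils

# Sparse vectors in gensim's bag-of-words form: lists of (id, weight) pairs
v1 = [(0, 1.0), (1, 2.0), (2, 0.5)]
v2 = [(0, 0.5), (2, 1.0), (3, 2.0)]

print(matutils.cossim(v1, v2))                 # cosine similarity of sparse vectors
print(matutils.veclen(v1))                     # Euclidean length of a sparse vector
print(matutils.unitvec(np.array([3.0, 4.0])))  # scale a dense vector to unit length

# Convert a (possibly streamed) corpus into a scipy CSC matrix of shape (terms, documents)
csc = matutils.corpus2csc([v1, v2], num_terms=4)
print(csc.shape)

# Expand a sparse vector into a dense numpy array of the given length
print(matutils.sparse2full(v1, length=4))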

See docs/mathematical-utilities.md.

Data Downloading

Convenient API for downloading pre-trained models and datasets including Word2Vec, GloVe, FastText models, and text corpora. Handles caching, version management, and integrity verification.

def load(name: str, return_path: bool = False): ...
def info(name: str = None): ...
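
Typical usage, assuming network access on first call (downloads are cached under ~/.gensim-data):

import gensim.downloader as api

# Catalogue of available models and corpora
catalogue = api.info()
print(sorted(catalogue["models"])[:5])

# Metadata for one entry
print(api.info("glove-twitter-25")["description"])

# Download (if not cached) and load ready-to-use word vectors
vectors = api.load("glove-twitter-25")
print(vectors.most_similar("coffee", topn=3))

# return_path=True downloads without loading and returns the local file path
print(api.load("text8", return_path=True))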

See docs/data-downloading.md.

Types

# Base Interfaces
class CorpusABC:
    def __iter__(self): ...
    def __len__(self): ...

class TransformationABC:
    def __getitem__(self, bow): ...

class SimilarityABC:
    def __getitem__(self, query): ...

# Common Types
BowDocument = list[tuple[int, float]]  # Bag-of-words document representation
Corpus = Iterable[BowDocument]  # Stream of documents
Dictionary = dict[str, int]  # Word to ID mapping
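
These interfaces are duck-typed: any object whose __iter__ yields bag-of-words documents can be used wherever a corpus is expected. A minimal sketch of such a streamed corpus follows; the file name and one-document-per-line layout are assumptions for the example.

from gensim.corpora import Dictionary
from gensim.models import TfidfModel

class LineCorpus:
    """Streams one bag-of-words document per line of a text file,
    the pattern that CorpusABC formalizes."""

    def __init__(self, path: str, dictionary: Dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                yield self.dictionary.doc2bow(line.lower().split())

# Usage sketch, assuming a plain-text file "docs.txt" with one document per line:
# dictionary = Dictionary(line.lower().split() for line in open("docs.txt", encoding="utf-8"))
# corpus = LineCorpus("docs.txt", dictionary)
# tfidf = TfidfModel(corpus)  # streams the corpus without loading it into memory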