Python library for topic modelling, document indexing and similarity retrieval with large corpora
npx @tessl/cli install tessl/pypi-gensim@4.3.3
# Gensim

A comprehensive Python library for natural language processing and information retrieval that specializes in topic modeling, document indexing, and similarity retrieval for large text corpora. Gensim provides memory-efficient implementations of popular algorithms like Word2Vec, Doc2Vec, FastText, Latent Dirichlet Allocation (LDA), and Latent Semantic Analysis (LSA) with optimized C/C++ extensions for production-scale applications.

## Package Information

- **Package Name**: gensim
- **Language**: Python
- **Installation**: `pip install gensim`
- **Version**: 4.3.3
- **License**: LGPL-2.1

## Core Imports

```python
import gensim
```

Access main modules:

```python
from gensim import corpora, models, similarities
from gensim.models import Word2Vec, LdaModel, Doc2Vec
from gensim.corpora import Dictionary
import gensim.downloader as api
```

## Basic Usage

```python
from gensim import corpora
from gensim.models import LdaModel
import gensim.downloader as api

# Load the text8 dataset (derived from English Wikipedia)
dataset = api.load("text8")

# Create a dictionary and bag-of-words corpus
dictionary = corpora.Dictionary(dataset)
corpus = [dictionary.doc2bow(text) for text in dataset]

# Train an LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    passes=10
)

# Get topics
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

# Load pre-trained word vectors
word_vectors = api.load("glove-twitter-25")
similar_words = word_vectors.most_similar("python", topn=5)
print(similar_words)
```
## Architecture

Gensim follows a modular architecture built around three core concepts:

- **Corpora**: Streaming document collections with various I/O formats (Matrix Market, SVMlight, etc.)
- **Models**: Transformation algorithms that convert documents between vector representations
- **Similarities**: Efficient similarity queries for large document collections

This design enables memory-efficient processing of corpora larger than available RAM through streaming and online algorithms. The library integrates deeply with NumPy and SciPy for mathematical operations and provides optional Cython extensions for performance-critical components.
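
A minimal sketch of how the three concepts compose, using in-memory toy documents purely for illustration (real corpora would be streamed from disk):

```python
from gensim import corpora, models, similarities

# Toy tokenized documents; in practice these would be streamed from disk
documents = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "minors", "trees"],
]

# Corpora: map words to integer IDs and convert documents to bag-of-words vectors
dictionary = corpora.Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Models: transform bag-of-words vectors into another space (here TF-IDF)
tfidf = models.TfidfModel(bow_corpus)

# Similarities: index the transformed corpus and run cosine-similarity queries against it
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))
query = tfidf[dictionary.doc2bow(["human", "computer"])]
print(list(index[query]))
```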
## Capabilities

### NLP Models and Transformations

Core machine learning models including topic models (LDA, HDP), word embeddings (Word2Vec, FastText, Doc2Vec), and dimensionality reduction techniques (LSI, TF-IDF). These models support streaming training and can process datasets larger than memory.

```python { .api }
# Topic Models
class LdaModel: ...
class HdpModel: ...
class LdaMulticore: ...

# Word Embeddings
class Word2Vec: ...
class Doc2Vec: ...
class FastText: ...
class KeyedVectors: ...

# Dimensionality Reduction
class LsiModel: ...
class TfidfModel: ...
class RpModel: ...
```
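
A brief sketch of streaming Word2Vec training: `LineSentence` reads one whitespace-tokenized sentence per line, so the corpus never has to fit in memory; `corpus.txt` is a placeholder path.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream sentences from disk ("corpus.txt" is a hypothetical file, one sentence per line)
sentences = LineSentence("corpus.txt")

# Train word embeddings incrementally over the stream
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, workers=4)

# Query the learned vectors
print(model.wv.most_similar("computer", topn=3))
```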
[NLP Models and Transformations](./nlp-models.md)

### Corpus Management

Comprehensive corpus I/O supporting 13+ formats including Matrix Market, SVMlight, and Wikipedia dumps. Provides dictionary management for word-to-ID mappings with frequency statistics and corpus preprocessing utilities.

```python { .api }
# Core Corpus Classes
class Dictionary: ...
class MmCorpus: ...
class TextCorpus: ...
class WikiCorpus: ...

# Additional Formats
class BleiCorpus: ...
class SvmLightCorpus: ...
class UciCorpus: ...
```
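
A hedged sketch of dictionary management plus corpus serialization; the tokenized documents and the `corpus.mm` output path are illustrative only:

```python
from gensim.corpora import Dictionary, MmCorpus

docs = [["cat", "sat", "mat"], ["dog", "ate", "log"]]  # illustrative tokenized documents

dictionary = Dictionary(docs)                    # word -> integer ID mapping with frequency stats
bow = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

# Stream the corpus to disk in Matrix Market format, then read it back lazily
MmCorpus.serialize("corpus.mm", bow)
corpus = MmCorpus("corpus.mm")
for doc in corpus:
    print(doc)
```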
[Corpus Management](./corpus-management.md)

### Similarity Computations

Efficient similarity calculations for documents and terms including cosine similarity, soft cosine similarity with term relationships, and Word Mover's Distance. Supports both dense and sparse similarity matrices with sharded indexing for large corpora.

```python { .api }
# Document Similarity
class Similarity: ...
class MatrixSimilarity: ...
class SoftCosineSimilarity: ...
class WmdSimilarity: ...

# Term Similarity
class WordEmbeddingSimilarityIndex: ...
class SparseTermSimilarityMatrix: ...
```
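
A minimal sketch of a sharded similarity index; the shard prefix `/tmp/gensim_index` and the toy documents are assumptions for illustration:

```python
from gensim import corpora, similarities

docs = [["user", "interface", "system"], ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Similarity keeps its index in shards on disk, so it scales beyond available RAM
index = similarities.Similarity("/tmp/gensim_index", corpus, num_features=len(dictionary))

query = dictionary.doc2bow(["graph", "system"])
print(list(index[query]))  # cosine similarity of the query against every indexed document
```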
[Similarity Computations](./similarity-computations.md)

### Text Preprocessing

Comprehensive text preprocessing pipeline with stemming, stopword removal, tokenization, and text cleaning functions. Supports customizable preprocessing chains for document preparation.

```python { .api }
# Preprocessing Functions
def preprocess_string(s: str, filters: list = None) -> list: ...
def remove_stopwords(s: str) -> str: ...
def strip_punctuation(s: str) -> str: ...
def stem_text(text: str) -> str: ...

# Stemming Classes
class PorterStemmer: ...
```
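
A short example of the default and a custom preprocessing chain; the sample sentence is arbitrary:

```python
from gensim.parsing.preprocessing import (
    preprocess_string,
    strip_punctuation,
    remove_stopwords,
    stem_text,
)

text = "The quick brown foxes were jumping over the lazy dogs!"

# Default filter chain: lowercase, strip tags/punctuation/numerics, drop stopwords and short tokens, stem
print(preprocess_string(text))

# Custom chain: lowercase, strip punctuation, drop stopwords, then stem
custom_filters = [lambda s: s.lower(), strip_punctuation, remove_stopwords, stem_text]
print(preprocess_string(text, filters=custom_filters))
```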
[Text Preprocessing](./text-preprocessing.md)

### Mathematical Utilities

Linear algebra operations, vector manipulations, and distance metrics optimized for NLP tasks. Includes BLAS integration, sparse/dense matrix conversions, and statistical measures like KL divergence and Jensen-Shannon distance.

```python { .api }
# Vector Operations
def unitvec(vec): ...
def cossim(vec1, vec2): ...
def veclen(vec): ...

# Matrix Operations
def corpus2csc(corpus): ...
def sparse2full(vec, length): ...

# Distance Metrics
def kullback_leibler(vec1, vec2): ...
def jensen_shannon(vec1, vec2): ...
```
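
A small sketch of the vector helpers in `gensim.matutils`, using hand-made sparse vectors:

```python
from gensim import matutils

# Bag-of-words vectors: lists of (feature_id, weight) pairs
vec1 = [(0, 1.0), (1, 2.0), (2, 3.0)]
vec2 = [(0, 2.0), (2, 1.0)]

print(matutils.cossim(vec1, vec2))           # cosine similarity between two sparse vectors
print(matutils.unitvec(vec1))                # scale a sparse vector to unit length
print(matutils.sparse2full(vec1, length=4))  # densify to a NumPy array of length 4
```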
[Mathematical Utilities](./mathematical-utilities.md)

### Data Downloading

Convenient API for downloading pre-trained models and datasets including Word2Vec, GloVe, FastText models, and text corpora. Handles caching, version management, and integrity verification.

```python { .api }
def load(name: str, return_path: bool = False): ...
def info(name: str = None): ...
```
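
Typical downloader usage; models are downloaded once and cached locally (by default under `~/gensim-data`):

```python
import gensim.downloader as api

# Inspect the catalogue of available corpora and models
catalogue = api.info()
print(list(catalogue["models"])[:5])

# Download and load a small pre-trained GloVe model
vectors = api.load("glove-twitter-25")
print(vectors.most_similar("coffee", topn=3))

# return_path=True returns the local file path instead of loading the model
print(api.load("glove-twitter-25", return_path=True))
```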
[Data Downloading](./data-downloading.md)

## Types

```python { .api }
# Base Interfaces
class CorpusABC:
    def __iter__(self): ...
    def __len__(self): ...

class TransformationABC:
    def __getitem__(self, bow): ...

class SimilarityABC:
    def __getitem__(self, query): ...

# Common Types
BowDocument = list[tuple[int, float]]  # Bag-of-words document representation
Corpus = Iterable[BowDocument]         # Stream of documents
Dictionary = dict[str, int]            # Word-to-ID mapping
```
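
For illustration, a minimal custom streaming corpus that satisfies the `CorpusABC` contract (an iterable of bag-of-words documents with a length); the file path is hypothetical:

```python
from gensim.corpora import Dictionary

class MyStreamingCorpus:
    """Yields one bag-of-words document per line of a text file, never loading it all at once."""

    def __init__(self, path: str, dictionary: Dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path) as fh:
            for line in fh:
                yield self.dictionary.doc2bow(line.lower().split())

    def __len__(self):
        with open(self.path) as fh:
            return sum(1 for _ in fh)
```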