Python library for topic modelling, document indexing and similarity retrieval with large corpora
npx @tessl/cli install tessl/pypi-gensim@4.3.3
# Gensim

A comprehensive Python library for natural language processing and information retrieval that specializes in topic modeling, document indexing, and similarity retrieval for large text corpora. Gensim provides memory-efficient implementations of popular algorithms like Word2Vec, Doc2Vec, FastText, Latent Dirichlet Allocation (LDA), and Latent Semantic Analysis (LSA) with optimized C/C++ extensions for production-scale applications.

## Package Information

- **Package Name**: gensim
- **Language**: Python
- **Installation**: `pip install gensim`
- **Version**: 4.3.3
- **License**: LGPL-2.1

## Core Imports

```python
import gensim
```

Access main modules:

```python
from gensim import corpora, models, similarities
from gensim.models import Word2Vec, LdaModel, Doc2Vec
from gensim.corpora import Dictionary
import gensim.downloader as api
```

## Basic Usage

```python
from gensim import corpora
from gensim.models import LdaModel
import gensim.downloader as api

# Load the text8 dataset (derived from English Wikipedia)
dataset = api.load("text8")

# Create a dictionary and bag-of-words corpus
dictionary = corpora.Dictionary(dataset)
corpus = [dictionary.doc2bow(text) for text in dataset]

# Train an LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    passes=10
)

# Get topics
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

# Load pre-trained word vectors
word_vectors = api.load("glove-twitter-25")
similar_words = word_vectors.most_similar("python", topn=5)
print(similar_words)
```
## Architecture

Gensim follows a modular architecture built around three core concepts:

- **Corpora**: Streaming document collections with various I/O formats (Matrix Market, SVMlight, etc.)
- **Models**: Transformation algorithms that convert documents between vector representations
- **Similarities**: Efficient similarity queries for large document collections

This design enables memory-efficient processing of corpora larger than available RAM through streaming and online algorithms. The library integrates deeply with NumPy and SciPy for mathematical operations and provides optional Cython extensions for performance-critical components.
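
A minimal sketch of how the three concepts compose, using in-memory toy documents purely for illustration (real corpora would be streamed from disk):

```python
from gensim import corpora, models, similarities

# Toy tokenized documents; in practice these would be streamed from disk
documents = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "minors", "trees"],
]

# Corpora: map words to integer IDs and convert documents to bag-of-words vectors
dictionary = corpora.Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Models: transform bag-of-words vectors into another space (here TF-IDF)
tfidf = models.TfidfModel(bow_corpus)

# Similarities: index the transformed corpus and run cosine-similarity queries against it
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))
query = tfidf[dictionary.doc2bow(["human", "computer"])]
print(list(index[query]))
```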
## Capabilities

### NLP Models and Transformations

Core machine learning models including topic models (LDA, HDP), word embeddings (Word2Vec, FastText, Doc2Vec), and dimensionality reduction techniques (LSI, TF-IDF). These models support streaming training and can process datasets larger than memory.

```python { .api }
# Topic Models
class LdaModel: ...
class HdpModel: ...
class LdaMulticore: ...

# Word Embeddings
class Word2Vec: ...
class Doc2Vec: ...
class FastText: ...
class KeyedVectors: ...

# Dimensionality Reduction
class LsiModel: ...
class TfidfModel: ...
class RpModel: ...
```
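
A brief sketch of streaming Word2Vec training: `LineSentence` reads one whitespace-tokenized sentence per line, so the corpus never has to fit in memory; `corpus.txt` is a placeholder path.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream sentences from disk ("corpus.txt" is a hypothetical file, one sentence per line)
sentences = LineSentence("corpus.txt")

# Train word embeddings incrementally over the stream
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, workers=4)

# Query the learned vectors
print(model.wv.most_similar("computer", topn=3))
```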
[NLP Models and Transformations](./nlp-models.md)

### Corpus Management

Comprehensive corpus I/O supporting 13+ formats including Matrix Market, SVMlight, and Wikipedia dumps. Provides dictionary management for word-to-ID mappings with frequency statistics and corpus preprocessing utilities.

```python { .api }
# Core Corpus Classes
class Dictionary: ...
class MmCorpus: ...
class TextCorpus: ...
class WikiCorpus: ...

# Additional Formats
class BleiCorpus: ...
class SvmLightCorpus: ...
class UciCorpus: ...
```
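
A hedged sketch of dictionary management plus corpus serialization; the tokenized documents and the `corpus.mm` output path are illustrative only:

```python
from gensim.corpora import Dictionary, MmCorpus

docs = [["cat", "sat", "mat"], ["dog", "ate", "log"]]  # illustrative tokenized documents

dictionary = Dictionary(docs)                    # word -> integer ID mapping with frequency stats
bow = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

# Stream the corpus to disk in Matrix Market format, then read it back lazily
MmCorpus.serialize("corpus.mm", bow)
corpus = MmCorpus("corpus.mm")
for doc in corpus:
    print(doc)
```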
[Corpus Management](./corpus-management.md)

### Similarity Computations

Efficient similarity calculations for documents and terms including cosine similarity, soft cosine similarity with term relationships, and Word Mover's Distance. Supports both dense and sparse similarity matrices with sharded indexing for large corpora.

```python { .api }
# Document Similarity
class Similarity: ...
class MatrixSimilarity: ...
class SoftCosineSimilarity: ...
class WmdSimilarity: ...

# Term Similarity
class WordEmbeddingSimilarityIndex: ...
class SparseTermSimilarityMatrix: ...
```
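
A minimal sketch of a sharded similarity index; the shard prefix `/tmp/gensim_index` and the toy documents are assumptions for illustration:

```python
from gensim import corpora, similarities

docs = [["user", "interface", "system"], ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Similarity keeps its index in shards on disk, so it scales beyond available RAM
index = similarities.Similarity("/tmp/gensim_index", corpus, num_features=len(dictionary))

query = dictionary.doc2bow(["graph", "system"])
print(list(index[query]))  # cosine similarity of the query against every indexed document
```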
[Similarity Computations](./similarity-computations.md)

### Text Preprocessing

Comprehensive text preprocessing pipeline with stemming, stopword removal, tokenization, and text cleaning functions. Supports customizable preprocessing chains for document preparation.

```python { .api }
# Preprocessing Functions
def preprocess_string(s: str, filters: list = None) -> list: ...
def remove_stopwords(s: str) -> str: ...
def strip_punctuation(s: str) -> str: ...
def stem_text(text: str) -> str: ...

# Stemming Classes
class PorterStemmer: ...
```
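
A short example of the default and a custom preprocessing chain; the sample sentence is arbitrary:

```python
from gensim.parsing.preprocessing import (
    preprocess_string,
    strip_punctuation,
    remove_stopwords,
    stem_text,
)

text = "The quick brown foxes were jumping over the lazy dogs!"

# Default filter chain: lowercase, strip tags/punctuation/numerics, drop stopwords and short tokens, stem
print(preprocess_string(text))

# Custom chain: lowercase, strip punctuation, drop stopwords, then stem
custom_filters = [lambda s: s.lower(), strip_punctuation, remove_stopwords, stem_text]
print(preprocess_string(text, filters=custom_filters))
```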
[Text Preprocessing](./text-preprocessing.md)

### Mathematical Utilities

Linear algebra operations, vector manipulations, and distance metrics optimized for NLP tasks. Includes BLAS integration, sparse/dense matrix conversions, and statistical measures like KL divergence and Jensen-Shannon distance.

```python { .api }
# Vector Operations
def unitvec(vec): ...
def cossim(vec1, vec2): ...
def veclen(vec): ...

# Matrix Operations
def corpus2csc(corpus): ...
def sparse2full(vec, length): ...

# Distance Metrics
def kullback_leibler(vec1, vec2): ...
def jensen_shannon(vec1, vec2): ...
```
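
A small sketch of the vector helpers in `gensim.matutils`, using hand-made sparse vectors:

```python
from gensim import matutils

# Bag-of-words vectors: lists of (feature_id, weight) pairs
vec1 = [(0, 1.0), (1, 2.0), (2, 3.0)]
vec2 = [(0, 2.0), (2, 1.0)]

print(matutils.cossim(vec1, vec2))           # cosine similarity between two sparse vectors
print(matutils.unitvec(vec1))                # scale a sparse vector to unit length
print(matutils.sparse2full(vec1, length=4))  # densify to a NumPy array of length 4
```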
[Mathematical Utilities](./mathematical-utilities.md)

### Data Downloading

Convenient API for downloading pre-trained models and datasets including Word2Vec, GloVe, FastText models, and text corpora. Handles caching, version management, and integrity verification.

```python { .api }
def load(name: str, return_path: bool = False): ...
def info(name: str = None): ...
```
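
Typical downloader usage; models are downloaded once and cached locally (by default under `~/gensim-data`):

```python
import gensim.downloader as api

# Inspect the catalogue of available corpora and models
catalogue = api.info()
print(list(catalogue["models"])[:5])

# Download and load a small pre-trained GloVe model
vectors = api.load("glove-twitter-25")
print(vectors.most_similar("coffee", topn=3))

# return_path=True returns the local file path instead of loading the model
print(api.load("glove-twitter-25", return_path=True))
```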
[Data Downloading](./data-downloading.md)

## Types

```python { .api }
# Base Interfaces
class CorpusABC:
    def __iter__(self): ...
    def __len__(self): ...

class TransformationABC:
    def __getitem__(self, bow): ...

class SimilarityABC:
    def __getitem__(self, query): ...

# Common Types
BowDocument = list[tuple[int, float]]  # Bag-of-words document representation
Corpus = Iterable[BowDocument]         # Stream of documents
Dictionary = dict[str, int]            # Word-to-ID mapping
```
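
For illustration, a minimal custom streaming corpus that satisfies the `CorpusABC` contract (an iterable of bag-of-words documents with a length); the file path is hypothetical:

```python
from gensim.corpora import Dictionary

class MyStreamingCorpus:
    """Yields one bag-of-words document per line of a text file, never loading it all at once."""

    def __init__(self, path: str, dictionary: Dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path) as fh:
            for line in fh:
                yield self.dictionary.doc2bow(line.lower().split())

    def __len__(self):
        with open(self.path) as fh:
            return sum(1 for _ in fh)
```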