
# Gensim


Gensim is a Python library for natural language processing and information retrieval that specializes in topic modeling, document indexing, and similarity retrieval for large text corpora. It provides memory-efficient implementations of popular algorithms such as Word2Vec, Doc2Vec, FastText, Latent Dirichlet Allocation (LDA), and Latent Semantic Analysis (LSA, exposed in Gensim as LSI), with compiled extensions for performance-critical routines in production-scale applications.


## Package Information


- **Package Name**: gensim
- **Language**: Python
- **Installation**: `pip install gensim`
- **Version**: 4.3.3
- **License**: LGPL-2.1


## Core Imports

```python
import gensim
```

Access the main modules:

```python
from gensim import corpora, models, similarities
from gensim.models import Word2Vec, LdaModel, Doc2Vec
from gensim.corpora import Dictionary
import gensim.downloader as api
```

## Basic Usage

```python
from gensim import corpora
from gensim.models import LdaModel
import gensim.downloader as api

# Load a dataset: text8 is a cleaned sample of English Wikipedia text,
# already split into lists of tokens (downloaded on first use)
dataset = api.load("text8")

# Create a dictionary and a bag-of-words corpus
dictionary = corpora.Dictionary(dataset)
corpus = [dictionary.doc2bow(text) for text in dataset]

# Train an LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    passes=10
)

# Get topics
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

# Load pre-trained word vectors
word_vectors = api.load("glove-twitter-25")
similar_words = word_vectors.most_similar("python", topn=5)
print(similar_words)
```

## Architecture


Gensim follows a modular architecture built around three core concepts:

- **Corpora**: Streaming document collections with various I/O formats (Matrix Market, SVMlight, etc.)
- **Models**: Transformation algorithms that convert documents between vector representations
- **Similarities**: Efficient similarity queries for large document collections

This design enables memory-efficient processing of corpora larger than available RAM through streaming and online algorithms. The library integrates deeply with NumPy and SciPy for mathematical operations and provides optional Cython extensions for performance-critical components.
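The streaming idea can be sketched in plain Python: a corpus object whose `__iter__` re-reads its source and yields one bag-of-words vector at a time, so only a single document is held in memory. The class name `StreamingBowCorpus` and the one-document-per-line file layout are assumptions made for illustration; this is not Gensim's own implementation.

```python
import os
import tempfile
from collections import Counter


class StreamingBowCorpus:
    """Hypothetical corpus that streams bag-of-words vectors from a text file.

    Assumes one whitespace-tokenized document per line.
    """

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id  # fixed word -> integer ID mapping

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated
        # repeatedly without loading all documents at once.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(
                    self.token2id[tok] for tok in line.split() if tok in self.token2id
                )
                yield sorted(counts.items())


# Write a tiny two-document corpus to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("human computer interaction\n")
    tmp.write("computer graph computer\n")
    corpus_path = tmp.name

token2id = {"human": 0, "computer": 1, "interaction": 2, "graph": 3}
corpus = StreamingBowCorpus(corpus_path, token2id)

first_pass = list(corpus)
second_pass = list(corpus)  # a second pass re-reads the file
os.remove(corpus_path)
```

Because the object is re-iterable rather than a one-shot generator, an algorithm that needs several passes over the data can simply loop over it again.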


## Capabilities


### NLP Models and Transformations


Core machine learning models including topic models (LDA, HDP), word embeddings (Word2Vec, FastText, Doc2Vec), and dimensionality reduction and term-weighting transformations (LSI, random projections, TF-IDF). These models support streaming training and can process datasets larger than memory.

```python { .api }
# Topic Models
class LdaModel: ...
class HdpModel: ...
class LdaMulticore: ...

# Word Embeddings
class Word2Vec: ...
class Doc2Vec: ...
class FastText: ...
class KeyedVectors: ...

# Dimensionality Reduction and Term Weighting
class LsiModel: ...
class TfidfModel: ...
class RpModel: ...
```

[NLP Models and Transformations](./nlp-models.md)
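As an illustration of what a transformation does, the sketch below reweights bag-of-words vectors with TF-IDF in plain Python. It assumes the common weighting `tf * log2(N / df)` followed by L2 normalization; the class name `TinyTfidf` is hypothetical, and the bracket-style call only mirrors the `__getitem__` convention rather than reproducing `TfidfModel`.

```python
import math


class TinyTfidf:
    """Hypothetical TF-IDF transformation over bag-of-words vectors."""

    def __init__(self, corpus):
        docs = list(corpus)
        self.num_docs = len(docs)
        # Document frequency: in how many documents each term ID appears.
        self.dfs = {}
        for bow in docs:
            for term_id, _ in bow:
                self.dfs[term_id] = self.dfs.get(term_id, 0) + 1

    def __getitem__(self, bow):
        # Weight each known term by tf * log2(N / df).
        weighted = [
            (term_id, tf * math.log2(self.num_docs / self.dfs[term_id]))
            for term_id, tf in bow
            if term_id in self.dfs
        ]
        # Terms that occur in every document get weight 0 and are dropped.
        weighted = [(term_id, w) for term_id, w in weighted if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        return [(term_id, w / norm) for term_id, w in weighted] if norm else []


corpus = [
    [(0, 1), (1, 1)],  # document 0 contains terms 0 and 1
    [(1, 2), (2, 1)],  # document 1 contains terms 1 and 2
]
tfidf = TinyTfidf(corpus)
vector = tfidf[corpus[0]]
```

Term 1 appears in both documents, so it carries no discriminating weight and disappears from the transformed vector.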


### Corpus Management


Comprehensive corpus I/O supporting 13+ formats including Matrix Market, SVMlight, and Wikipedia dumps. Provides dictionary management for word-to-ID mappings with frequency statistics and corpus preprocessing utilities.

```python { .api }
# Core Corpus Classes
class Dictionary: ...
class MmCorpus: ...
class TextCorpus: ...
class WikiCorpus: ...

# Additional Formats
class BleiCorpus: ...
class SvmLightCorpus: ...
class UciCorpus: ...
```

[Corpus Management](./corpus-management.md)
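The word-to-ID mapping and its frequency statistics can be pictured with a short plain-Python sketch. The helper names `build_dictionary` and `doc2bow` below, and the shape of their return values, are assumptions for illustration and not the `Dictionary` class itself.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign integer IDs to tokens and count document frequencies.

    Hypothetical helper: returns (token2id, doc_freqs), where doc_freqs maps
    a token ID to the number of documents containing that token.
    """
    token2id = {}
    doc_freqs = Counter()
    for tokens in documents:
        for token in tokens:
            # IDs are handed out in order of first appearance.
            token2id.setdefault(token, len(token2id))
        # Count each token once per document, however often it repeats.
        doc_freqs.update({token2id[token] for token in tokens})
    return token2id, doc_freqs


def doc2bow(tokens, token2id):
    """Convert a token list to a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


documents = [
    ["graph", "minors", "graph"],
    ["graph", "trees"],
]
token2id, doc_freqs = build_dictionary(documents)
bow = doc2bow(["trees", "graph", "graph", "unseen"], token2id)
```

Tokens missing from the mapping (here `"unseen"`) are silently skipped in this sketch, so new documents can be converted without growing the vocabulary.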


### Similarity Computations


Efficient similarity calculations for documents and terms including cosine similarity, soft cosine similarity with term relationships, and Word Mover's Distance. Supports both dense and sparse similarity matrices with sharded indexing for large corpora.

```python { .api }
# Document Similarity
class Similarity: ...
class MatrixSimilarity: ...
class SoftCosineSimilarity: ...
class WmdSimilarity: ...

# Term Similarity
class WordEmbeddingSimilarityIndex: ...
class SparseTermSimilarityMatrix: ...
```

[Similarity Computations](./similarity-computations.md)
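For intuition, cosine similarity over sparse `(id, weight)` vectors can be written in a few lines of plain Python. This is a minimal sketch: the function names `cosine_similarity` and `rank_documents` are hypothetical, and the brute-force loop stands in for the indexed, sharded lookups that the classes above are described as providing.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine similarity between two sparse (id, weight) vectors."""
    weights1, weights2 = dict(vec1), dict(vec2)
    norm1 = math.sqrt(sum(w * w for w in weights1.values()))
    norm2 = math.sqrt(sum(w * w for w in weights2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty or all-zero vector is similar to nothing
    # Only IDs present in both vectors contribute to the dot product.
    dot = sum(w * weights2[i] for i, w in weights1.items() if i in weights2)
    return dot / (norm1 * norm2)


def rank_documents(query, corpus):
    """Return (doc_index, similarity) pairs, most similar first."""
    scores = [(idx, cosine_similarity(query, doc)) for idx, doc in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


corpus = [
    [(0, 1.0), (1, 1.0)],
    [(2, 3.0)],
    [(0, 2.0), (1, 2.0)],
]
query = [(0, 1.0), (1, 1.0)]
ranking = rank_documents(query, corpus)
```

Documents 0 and 2 point in the same direction as the query and score close to 1, while document 1 shares no terms with it and scores 0.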


### Text Preprocessing


Comprehensive text preprocessing pipeline with stemming, stopword removal, tokenization, and text cleaning functions. Supports customizable preprocessing chains for document preparation.

```python { .api }
# Preprocessing Functions
def preprocess_string(s: str, filters: list = None) -> list: ...
def remove_stopwords(s: str) -> str: ...
def strip_punctuation(s: str) -> str: ...
def stem_text(text: str) -> str: ...

# Stemming Classes
class PorterStemmer: ...
```

[Text Preprocessing](./text-preprocessing.md)
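A customizable preprocessing chain amounts to a list of string-to-string filters applied in order before tokenizing. The sketch below shows that pattern in plain Python; the filter functions, the `run_chain` helper, and the tiny stopword set are all hypothetical stand-ins rather than the library's own filters.

```python
import string

# Small illustrative stopword set; real stopword lists are much larger.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "is", "for"})

# Translation table mapping every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def to_lower(text):
    return text.lower()


def drop_punctuation(text):
    return text.translate(PUNCT_TO_SPACE)


def drop_stopwords(text):
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def run_chain(text, filters):
    """Apply each str -> str filter in order, then split into tokens."""
    for apply_filter in filters:
        text = apply_filter(text)
    return text.split()


tokens = run_chain(
    "The Design of a Topic-Model, explained!",
    [to_lower, drop_punctuation, drop_stopwords],
)
```

Order matters: lowercasing runs first here so that the capitalized "The" matches the lowercase stopword list.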


### Mathematical Utilities


Linear algebra operations, vector manipulations, and distance metrics optimized for NLP tasks. Includes BLAS integration, sparse/dense matrix conversions, and statistical measures like KL divergence and Jensen-Shannon distance.

```python { .api }
# Vector Operations
def unitvec(vec): ...
def cossim(vec1, vec2): ...
def veclen(vec): ...

# Matrix Operations
def corpus2csc(corpus): ...
def sparse2full(vec, length): ...

# Distance Metrics
def kullback_leibler(vec1, vec2): ...
def jensen_shannon(vec1, vec2): ...
```

[Mathematical Utilities](./mathematical-utilities.md)
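The statistical measures mentioned above have compact definitions, sketched here in plain Python under stated assumptions: inputs are probability distributions, logarithms are natural, and the helper names `sparse_to_dense`, `kl_divergence`, and `js_divergence` are hypothetical. The square root of the Jensen-Shannon divergence is the quantity usually called the Jensen-Shannon distance.

```python
import math


def sparse_to_dense(vec, length):
    """Expand a sparse (id, value) vector into a dense list of floats."""
    dense = [0.0] * length
    for index, value in vec:
        dense[index] = float(value)
    return dense


def kl_divergence(p, q):
    """KL(p || q) in nats for two dense probability distributions.

    Terms with p[i] == 0 contribute nothing; q[i] must be positive
    wherever p[i] is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    mid = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


p = sparse_to_dense([(0, 0.5), (1, 0.5)], 3)
q = sparse_to_dense([(0, 0.25), (2, 0.75)], 3)

js_pq = js_divergence(p, q)
js_qp = js_divergence(q, p)
```

Unlike KL divergence, the Jensen-Shannon divergence is symmetric and stays finite even when the two distributions have non-overlapping support, as `p` and `q` partly do here.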


### Data Downloading


Convenient API for downloading pre-trained models and datasets including Word2Vec, GloVe, FastText models, and text corpora. Handles caching, version management, and integrity verification.

```python { .api }
def load(name: str, return_path: bool = False): ...
def info(name: str = None): ...
```

[Data Downloading](./data-downloading.md)
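To make "caching and integrity verification" concrete, here is a generic plain-Python sketch of the pattern: fetch a resource only when it is not already cached, then compare its checksum with an expected value. Everything in it is an assumption for illustration, including the `fetch_cached` helper, the use of SHA-256, and the fake download function; it does not describe how `gensim.downloader` is implemented.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_cached(name, cache_dir, expected_sha256, download):
    """Return the cached path for `name`, downloading only when needed.

    `download` is a hypothetical callable that writes the resource to the
    given path; a real client would fetch it over the network.
    """
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        download(path)
    if sha256_of(path) != expected_sha256:
        os.remove(path)  # discard the corrupted copy so a retry can refetch
        raise ValueError(f"checksum mismatch for {name!r}")
    return path


calls = []


def fake_download(path):
    # Stand-in for a network fetch: writes fixed bytes and records the call.
    calls.append(path)
    with open(path, "wb") as handle:
        handle.write(b"tiny model payload")


expected = hashlib.sha256(b"tiny model payload").hexdigest()
with tempfile.TemporaryDirectory() as cache_dir:
    first = fetch_cached("demo-model", cache_dir, expected, fake_download)
    second = fetch_cached("demo-model", cache_dir, expected, fake_download)
    download_count = len(calls)
```

The second call finds the file already in the cache directory, so the download function runs only once.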


## Types

```python { .api }
from typing import Iterable

# Base Interfaces
class CorpusABC:
    def __iter__(self): ...
    def __len__(self): ...

class TransformationABC:
    def __getitem__(self, bow): ...

class SimilarityABC:
    def __getitem__(self, query): ...

# Common Types
BowDocument = list[tuple[int, float]]  # Bag-of-words document: (token_id, weight) pairs
Corpus = Iterable[BowDocument]  # Stream of documents
Dictionary = dict[str, int]  # Word to ID mapping
```