# Global Vectors (GloVe)

GloVe learns word embeddings by factorizing a word co-occurrence matrix, combining global corpus statistics with local context windows. Because it captures co-occurrence statistics across large corpora, the resulting word representations are suited to downstream NLP tasks.

## Capabilities

### GloVe Model

`Glove` is the GlobalVectors implementation, based on the Stanford GloVe algorithm. It extends `SequenceVectors` with co-occurrence matrix optimization.

```java { .api }
/**
 * GlobalVectors standalone implementation for DL4j
 * Based on original Stanford GloVe algorithm
 */
public class Glove extends SequenceVectors<VocabWord> {
    // Protected constructor - use Builder to create instances
}
```

### GloVe Builder

Builder for configuring and constructing a GloVe model, including algorithm-specific parameters for co-occurrence processing and matrix factorization.

```java { .api }
/**
 * Builder for GloVe configuration and construction
 */
public static class Glove.Builder extends SequenceVectors.Builder<VocabWord> {

    /**
     * Build the configured GloVe instance
     * @return Configured GloVe model ready for training
     */
    public Glove build();

    /**
     * Set sequence iterator for training data
     * @param iterator SequenceIterator providing tokenized sequences
     * @return Builder instance for method chaining
     */
    public Builder iterate(SequenceIterator<VocabWord> iterator);

    /**
     * Set mini-batch size for training
     * @param batchSize Number of co-occurrence entries per batch
     * @return Builder instance for method chaining
     */
    public Builder batchSize(int batchSize);

    /**
     * Set number of training iterations (same as epochs in GloVe)
     * @param iterations Number of training iterations
     * @return Builder instance for method chaining
     */
    public Builder iterations(int iterations);

    /**
     * Set number of epochs for training
     * @param numEpochs Number of training epochs
     * @return Builder instance for method chaining
     */
    public Builder epochs(int numEpochs);

    /**
     * Enable AdaGrad optimizer (always enabled for GloVe)
     * @param reallyUse AdaGrad usage flag (forced to true)
     * @return Builder instance for method chaining
     */
    public Builder useAdaGrad(boolean reallyUse);

    /**
     * Set vector dimensionality
     * @param layerSize Number of dimensions for output vectors
     * @return Builder instance for method chaining
     */
    public Builder layerSize(int layerSize);

    /**
     * Set learning rate for optimization
     * @param learningRate Learning rate for gradient descent
     * @return Builder instance for method chaining
     */
    public Builder learningRate(double learningRate);

    /**
     * Set minimum word frequency threshold
     * @param minWordFrequency Words below this frequency are excluded
     * @return Builder instance for method chaining
     */
    public Builder minWordFrequency(int minWordFrequency);

    /**
     * Set minimum learning rate threshold
     * @param minLearningRate Minimum learning rate value
     * @return Builder instance for method chaining
     */
    public Builder minLearningRate(double minLearningRate);

    /**
     * Set whether to reset model before building
     * @param reallyReset Whether to clear model state
     * @return Builder instance for method chaining
     */
    public Builder resetModel(boolean reallyReset);

    /**
     * Set external vocabulary cache
     * @param vocabCache VocabCache instance to use
     * @return Builder instance for method chaining
     */
    public Builder vocabCache(VocabCache<VocabWord> vocabCache);

    /**
     * Set external weight lookup table
     * @param lookupTable WeightLookupTable instance to use
     * @return Builder instance for method chaining
     */
    public Builder lookupTable(WeightLookupTable<VocabWord> lookupTable);

    /**
     * Set subsampling parameter (deprecated for GloVe)
     * @param sampling Subsampling rate (not used in GloVe)
     * @return Builder instance for method chaining
     */
    @Deprecated
    public Builder sampling(double sampling);

    /**
     * Set negative sampling parameter (deprecated for GloVe)
     * @param negative Negative sampling rate (not used in GloVe)
     * @return Builder instance for method chaining
     */
    @Deprecated
    public Builder negativeSample(double negative);

    /**
     * Set stop words list
     * @param stopList List of stop words to exclude
     * @return Builder instance for method chaining
     */
    public Builder stopWords(List<String> stopList);

    /**
     * Force elements representation training (always true for GloVe)
     * @param trainElements Whether to train element representations
     * @return Builder instance for method chaining
     */
    public Builder trainElementsRepresentation(boolean trainElements);

    /**
     * Force sequence representation training (deprecated for GloVe)
     * @param trainSequences Whether to train sequence representations
     * @return Builder instance for method chaining
     */
    @Deprecated
    public Builder trainSequencesRepresentation(boolean trainSequences);

    /**
     * Set stop words collection
     * @param stopList Collection of VocabWord stop words
     * @return Builder instance for method chaining
     */
    public Builder stopWords(Collection<VocabWord> stopList);

    /**
     * Set context window size
     * @param windowSize Context window size for co-occurrence calculation
     * @return Builder instance for method chaining
     */
    public Builder windowSize(int windowSize);

    /**
     * Set random seed for reproducibility
     * @param randomSeed Random seed value
     * @return Builder instance for method chaining
     */
    public Builder seed(long randomSeed);

    /**
     * Set number of worker threads
     * @param numWorkers Number of parallel worker threads
     * @return Builder instance for method chaining
     */
    public Builder workers(int numWorkers);

    /**
     * Set TokenizerFactory for training
     * @param tokenizerFactory TokenizerFactory for text tokenization
     * @return Builder instance for method chaining
     */
    public Builder tokenizerFactory(TokenizerFactory tokenizerFactory);

    /**
     * Set cutoff parameter in weighting function
     * @param xMax Cutoff value in weighting function (default: 100.0)
     * @return Builder instance for method chaining
     */
    public Builder xMax(double xMax);

    /**
     * Set whether co-occurrences should be built in both directions
     * @param reallySymmetric Whether to build symmetric co-occurrence matrix
     * @return Builder instance for method chaining
     */
    public Builder symmetric(boolean reallySymmetric);

    /**
     * Set whether co-occurrences should be shuffled between epochs
     * @param reallyShuffle Whether to shuffle co-occurrence list
     * @return Builder instance for method chaining
     */
    public Builder shuffle(boolean reallyShuffle);

    /**
     * Set alpha parameter in weighting function exponent
     * @param alpha Exponent parameter in weighting function (default: 0.75)
     * @return Builder instance for method chaining
     */
    public Builder alpha(double alpha);

    /**
     * Set sentence iterator for training
     * @param iterator SentenceIterator providing training sentences
     * @return Builder instance for method chaining
     */
    public Builder iterate(SentenceIterator iterator);

    /**
     * Set document iterator for training
     * @param iterator DocumentIterator providing training documents
     * @return Builder instance for method chaining
     */
    public Builder iterate(DocumentIterator iterator);

    /**
     * Set model utilities for vector operations
     * @param modelUtils ModelUtils instance for similarity calculations
     * @return Builder instance for method chaining
     */
    public Builder modelUtils(ModelUtils<VocabWord> modelUtils);

    /**
     * Set event listeners for training progress
     * @param vectorsListeners Collection of training event listeners
     * @return Builder instance for method chaining
     */
    public Builder setVectorsListeners(Collection<VectorsListener<VocabWord>> vectorsListeners);

    /**
     * Set maximum memory available for co-occurrence map building
     * @param gbytes Memory limit in gigabytes
     * @return Builder instance for method chaining
     */
    public Builder maxMemory(int gbytes);

    /**
     * Set unknown element for out-of-vocabulary words
     * @param element VocabWord element for unknown words
     * @return Builder instance for method chaining
     */
    public Builder unknownElement(VocabWord element);

    /**
     * Enable or disable unknown word handling
     * @param reallyUse Whether to use UNK token
     * @return Builder instance for method chaining
     */
    public Builder useUnknown(boolean reallyUse);
}
```

**Usage Examples:**

```java
import java.util.Arrays;
import java.util.Collection;

import org.deeplearning4j.models.glove.Glove;
import org.deeplearning4j.text.sentenceiterator.CollectionSentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.nd4j.linalg.api.ndarray.INDArray;

// Basic GloVe training
Collection<String> corpus = Arrays.asList(
    "Global vectors for word representation are effective",
    "Matrix factorization captures global statistics",
    "Local context windows provide semantic information"
);

// Lowercase tokens so that "Global" and "global" share one vocabulary entry
TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

Glove glove = new Glove.Builder()
    .learningRate(0.05)
    .epochs(50)
    .xMax(100.0)
    .alpha(0.75)
    .layerSize(100)
    .minWordFrequency(1) // keep every word; this toy corpus is tiny
    .iterate(new CollectionSentenceIterator(corpus))
    .tokenizerFactory(tokenizerFactory)
    .build();

glove.fit();

// Use trained model
double similarity = glove.similarity("global", "matrix");
Collection<String> nearest = glove.wordsNearest("vectors", 10);

// Advanced GloVe configuration
// largeCorpus is a Collection<String> of sentences, assumed to be defined elsewhere
Glove advancedGlove = new Glove.Builder()
    .learningRate(0.075)
    .epochs(100)
    .layerSize(300)
    .xMax(150.0)
    .alpha(0.8)
    .symmetric(true)
    .shuffle(true)
    .windowSize(10)
    .minWordFrequency(5)
    .workers(4)
    .maxMemory(8) // 8GB memory limit
    .iterate(new CollectionSentenceIterator(largeCorpus))
    .tokenizerFactory(tokenizerFactory)
    .build();

advancedGlove.fit();

// Extract word vectors
INDArray wordVector = advancedGlove.getWordVectorMatrix("representation");
System.out.println("Vector for 'representation': " + wordVector);
```
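To make the `similarity` call above more concrete: similarity between word vectors is commonly measured as the cosine of the angle between them. The sketch below shows that calculation on plain `double[]` arrays. It assumes cosine similarity is the measure in use; `CosineSketch` and `cosine` are hypothetical names chosen for illustration and are not part of the DL4J API.

```java
final class CosineSketch {

    /**
     * Cosine similarity: dot(a, b) / (|a| * |b|).
     * Returns 0.0 when either vector has zero length (a convention chosen for this sketch).
     */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimensionality");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] u = {1.0, 2.0, 3.0};
        double[] v = {2.0, 4.0, 6.0};   // same direction as u
        double[] w = {-3.0, 0.0, 1.0};  // orthogonal to u
        System.out.println("cosine(u, v) = " + cosine(u, v));
        System.out.println("cosine(u, w) = " + cosine(u, w));
    }
}
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and negative values indicate opposing directions.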

### GloVe Algorithm Parameters

Key parameters specific to the GloVe algorithm that control co-occurrence matrix construction and optimization:

- **xMax**: Cutoff in the weighting function. Co-occurrence counts at or above `xMax` receive the maximum weight of 1 (default: 100.0)
- **alpha**: Exponent in the weighting function. It controls how steeply the weight rises with the count for pairs below `xMax` (default: 0.75)
- **symmetric**: Whether to build the co-occurrence matrix in both directions from each target word
- **shuffle**: Whether to shuffle co-occurrence pairs between training epochs, which can help convergence
- **maxMemory**: Memory limit, in gigabytes, for co-occurrence map construction, intended to prevent out-of-memory errors
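As a rough illustration of what the window size and `symmetric` settings influence, the sketch below counts co-occurrences over a single tokenized sentence. It is a simplified stand-in, not DL4J's implementation: the class `CooccurrenceSketch`, the string-keyed map, and the choice to record only reading-order pairs when `symmetric` is false are all assumptions made for this example. The `1 / distance` increment follows the weighting described in the original GloVe paper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CooccurrenceSketch {

    /**
     * Counts co-occurrences within a sliding window.
     * A pair of tokens that are d positions apart contributes 1/d to its count.
     * Keys have the form "first second" (hypothetical encoding for this sketch).
     */
    static Map<String, Double> count(List<String> tokens, int windowSize, boolean symmetric) {
        Map<String, Double> counts = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            int end = Math.min(tokens.size(), i + windowSize + 1);
            for (int j = i + 1; j < end; j++) {
                double increment = 1.0 / (j - i);
                // Pair in reading order: (earlier word, later word)
                counts.merge(tokens.get(i) + " " + tokens.get(j), increment, Double::sum);
                if (symmetric) {
                    // Mirror entry so the matrix covers both directions
                    counts.merge(tokens.get(j) + " " + tokens.get(i), increment, Double::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("matrix", "factorization", "captures", "global", "statistics");
        System.out.println("one direction:   " + count(tokens, 2, false));
        System.out.println("both directions: " + count(tokens, 2, true));
    }
}
```

With `symmetric` enabled, every pair appears under both orderings, which roughly doubles the number of stored entries. This is one reason a memory limit such as `maxMemory` matters on large corpora.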

The GloVe weighting function uses `xMax` and `alpha` as `f(X_ij) = min(1, (X_ij / xMax)^alpha)`, where `X_ij` is the co-occurrence count between words i and j.
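The weighting function can be sketched in a few lines of plain Java. The example also shows how the weight scales the squared error for a single word pair, following the objective from the original GloVe paper: `f(X_ij) * (w_i · w_j + b_i + b_j - log X_ij)^2`. The class and method names (`GloveWeightingSketch`, `weight`, `pairLoss`) are hypothetical, and whether DL4J's internal computation matches this form exactly is not confirmed here.

```java
final class GloveWeightingSketch {

    /** f(x) = min(1, (x / xMax)^alpha), where x is a co-occurrence count. */
    static double weight(double count, double xMax, double alpha) {
        return Math.min(1.0, Math.pow(count / xMax, alpha));
    }

    /**
     * Weighted squared error for one word pair:
     * f(x) * (dot + biasI + biasJ - ln(x))^2,
     * where dot is the dot product of the two word vectors.
     */
    static double pairLoss(double dot, double biasI, double biasJ,
                           double count, double xMax, double alpha) {
        double diff = dot + biasI + biasJ - Math.log(count);
        return weight(count, xMax, alpha) * diff * diff;
    }

    public static void main(String[] args) {
        double xMax = 100.0;  // default cutoff listed in the Builder docs
        double alpha = 0.75;  // default exponent listed in the Builder docs
        for (double x : new double[] {1, 10, 50, 100, 500}) {
            System.out.printf("f(%.0f) = %.4f%n", x, weight(x, xMax, alpha));
        }
    }
}
```

Rare pairs receive small weights, so noisy low counts contribute little to the loss, while every pair at or above `xMax` is capped at weight 1 so that very frequent pairs do not dominate training.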