# Bag of Words Vectorization

Traditional text vectorization methods, including TF-IDF and bag-of-words representations, for document classification, information retrieval, and feature extraction. These methods produce sparse vector representations that complement dense neural embeddings.

## Capabilities

### Text Vectorization Interface

Base interface for text vectorization implementations providing consistent API across different vectorization strategies.

```java { .api }
/**
 * Text vectorization interface for converting text to numerical representations
 */
public interface TextVectorizer {
    // Core vectorization interface - implementations provide specific vectorization methods
}
```

### Bag of Words Vectorizer

Classic bag-of-words vectorization creating sparse representations based on word frequency counts.

```java { .api }
/**
 * Bag of Words vectorization implementation
 * Creates sparse vector representations based on word frequency counts
 */
public class BagOfWordsVectorizer implements TextVectorizer {
    // Bag of words implementation with configurable vocabulary and normalization
}
```

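To make the word-count idea concrete, here is a minimal plain-Java sketch. It is not the library's `BagOfWordsVectorizer`: the class `BagOfWordsSketch`, its method names, and the tokenization rule (lowercase, then split on non-alphanumeric characters) are assumptions made for illustration. It returns dense `int[]` counts for readability; a sparse implementation would store only the nonzero entries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not the library's BagOfWordsVectorizer. */
final class BagOfWordsSketch {

    /** Lowercases the text and splits it on runs of non-letter, non-digit characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Assigns each distinct token a column index in order of first appearance. */
    static Map<String, Integer> buildVocabulary(List<String> documents) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (String document : documents) {
            for (String token : tokenize(document)) {
                vocabulary.putIfAbsent(token, vocabulary.size());
            }
        }
        return vocabulary;
    }

    /** Counts in-vocabulary tokens; tokens outside the vocabulary are ignored. */
    static int[] vectorize(String document, Map<String, Integer> vocabulary) {
        int[] counts = new int[vocabulary.size()];
        for (String token : tokenize(document)) {
            Integer index = vocabulary.get(token);
            if (index != null) {
                counts[index]++;
            }
        }
        return counts;
    }
}
```

With a vocabulary built from `"The quick brown fox jumps over the lazy dog"`, the token `the` is assigned column 0, and vectorizing that same sentence yields a count of 2 in that column.
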
### TF-IDF Vectorizer

Term Frequency-Inverse Document Frequency vectorization for weighted sparse representations emphasizing discriminative terms.

```java { .api }
/**
 * TF-IDF vectorization implementation
 * Creates weighted sparse vectors emphasizing discriminative terms
 */
public class TfidfVectorizer implements TextVectorizer {
    // TF-IDF implementation with configurable term weighting schemes
}
```

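The weighting itself can be sketched in plain Java. This example assumes one common textbook variant, `tf = count / documentLength` and `idf = ln(N / df)`; the actual `TfidfVectorizer` may use a different scheme. The class `TfIdfSketch` and its methods are hypothetical and operate on already-tokenized documents.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the library's TfidfVectorizer. */
final class TfIdfSketch {

    /** Document frequency: the number of documents in which each term occurs at least once. */
    static Map<String, Integer> documentFrequencies(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> document : corpus) {
            // A set ensures each term is counted once per document.
            for (String term : new HashSet<>(document)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Sparse TF-IDF weights for one tokenized document, as a term -> weight map. */
    static Map<String, Double> weights(List<String> document,
                                       Map<String, Integer> df,
                                       int corpusSize) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : document) {
            counts.merge(term, 1, Integer::sum);
        }

        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            Integer docFreq = df.get(entry.getKey());
            if (docFreq == null) {
                continue; // term never seen in the corpus: no idf available, skip it
            }
            double tf = entry.getValue() / (double) document.size();
            double idf = Math.log((double) corpusSize / docFreq);
            weights.put(entry.getKey(), tf * idf);
        }
        return weights;
    }
}
```

Under this variant a term that occurs in every document gets `idf = ln(1) = 0`, so its weight vanishes, which is how TF-IDF de-emphasizes terms that do not discriminate between documents.
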
### Base Text Vectorizer

Abstract base implementation providing common functionality for text vectorization implementations.

```java { .api }
/**
 * Abstract base class for text vectorizers
 * Provides common functionality and configuration patterns
 */
public abstract class BaseTextVectorizer implements TextVectorizer {
    // Common vectorization infrastructure and utilities
}
```

### Vectorizer Builder

Builder pattern for configuring text vectorizers with various parameters and data sources.

```java { .api }
/**
 * Builder for text vectorizer configuration
 * Supports various vectorization strategies and parameters
 */
public class Builder {
    // Configurable builder for text vectorization components
}
```

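The builder idea can be sketched in plain Java as follows. Everything here is hypothetical: `VocabularyConfigSketch`, its nested `Builder`, and the `minWordFrequency` and `maxFeatures` options are illustrative names (the latter two borrowed from the commented-out options in the usage example), not the library's confirmed API. The sketch shows how such options might prune a vocabulary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical configuration holder; not the library's Builder. */
final class VocabularyConfigSketch {
    private final int minWordFrequency;
    private final int maxFeatures;

    private VocabularyConfigSketch(Builder builder) {
        this.minWordFrequency = builder.minWordFrequency;
        this.maxFeatures = builder.maxFeatures;
    }

    /** Fluent builder: each setter returns this so that calls can be chained. */
    static final class Builder {
        private int minWordFrequency = 1;
        private int maxFeatures = Integer.MAX_VALUE;

        Builder minWordFrequency(int value) {
            this.minWordFrequency = value;
            return this;
        }

        Builder maxFeatures(int value) {
            this.maxFeatures = value;
            return this;
        }

        VocabularyConfigSketch build() {
            if (minWordFrequency < 1 || maxFeatures < 1) {
                throw new IllegalArgumentException("options must be positive");
            }
            return new VocabularyConfigSketch(this);
        }
    }

    /** Keeps terms seen at least minWordFrequency times, most frequent first, capped at maxFeatures. */
    List<String> selectVocabulary(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minWordFrequency) {
                kept.add(entry.getKey());
            }
        }
        // Descending count; ties are broken alphabetically for a deterministic order.
        kept.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(kept.subList(0, Math.min(maxFeatures, kept.size())));
    }
}
```

Validating in `build()` rather than in each setter keeps the setters trivial and means an invalid combination is rejected once, before any configured object exists.
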
### Input Stream Creator

Utility for creating input streams from various text sources for vectorization processing.

```java { .api }
/**
 * Default input stream creator for text sources
 * Handles various input formats and encodings
 */
public class DefaultInputStreamCreator {
    // Input stream creation utilities for text processing
}
```

**Usage Examples:**

```java
import java.util.Arrays;
import java.util.Collection;

import org.deeplearning4j.bagofwords.vectorizer.*;

// Sample documents for basic bag of words vectorization
Collection<String> documents = Arrays.asList(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown animal leaps over the sleeping canine",
    "Natural language processing with machine learning"
);

// Configure bag of words vectorizer
BagOfWordsVectorizer bowVectorizer = new BagOfWordsVectorizer();
// Additional configuration would be done here based on actual API

// TF-IDF vectorization for document similarity
TfidfVectorizer tfidfVectorizer = new TfidfVectorizer();
// Configure TF-IDF parameters based on actual implementation

// Example usage pattern (actual API may vary):
// INDArray vectors = bowVectorizer.vectorize(documents);
// INDArray tfidfVectors = tfidfVectorizer.vectorize(documents);

// Builder pattern usage example
Builder vectorizerBuilder = new Builder()
    // Configure vectorization parameters
    // .vocabulary(customVocabulary)
    // .minWordFrequency(2)
    // .maxFeatures(10000)
    ;

// TextVectorizer vectorizer = vectorizerBuilder.build();
```

## Integration with Neural Models

Bag of words and TF-IDF vectorizers can be used for:

- **Feature extraction**: Converting text to numerical features for traditional ML algorithms
- **Preprocessing**: Initial text representation before neural processing
- **Baseline comparison**: Comparing neural embeddings against classical methods
- **Hybrid approaches**: Combining sparse and dense representations for improved performance
- **Cold start solutions**: Handling new documents without neural model inference overhead

The sparse representations from these vectorizers complement the dense embeddings from Word2Vec, GloVe, and ParagraphVectors. Sparse vectors capture exact term overlap, while dense embeddings capture similarity between related words, so the two can be chosen or combined to suit the use case and performance requirements.
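
One simple way to combine the two kinds of representation is to normalize each vector and concatenate them. The sketch below is a plain-Java illustration under that assumption; `HybridFeaturesSketch` and its methods are hypothetical names, and normalize-then-concatenate is only one of several possible combination strategies.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the documented API. */
final class HybridFeaturesSketch {

    /** Returns a copy of v scaled to unit L2 norm; an all-zero vector is returned unchanged. */
    static double[] l2Normalize(double[] v) {
        double sumOfSquares = 0.0;
        for (double x : v) {
            sumOfSquares += x * x;
        }
        double[] out = Arrays.copyOf(v, v.length);
        if (sumOfSquares == 0.0) {
            return out;
        }
        double norm = Math.sqrt(sumOfSquares);
        for (int i = 0; i < out.length; i++) {
            out[i] /= norm;
        }
        return out;
    }

    /** Normalizes each block separately, then concatenates: [sparse block | dense block]. */
    static double[] combine(double[] sparseVector, double[] denseEmbedding) {
        double[] left = l2Normalize(sparseVector);
        double[] right = l2Normalize(denseEmbedding);
        double[] combined = Arrays.copyOf(left, left.length + right.length);
        System.arraycopy(right, 0, combined, left.length, right.length);
        return combined;
    }
}
```

Normalizing each block before concatenation keeps raw word counts, which can be large, from dominating the typically small-magnitude embedding values in any downstream distance computation.
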