Tessl Tile for maven/org.deeplearning4j/deeplearning4j-nlp@0.9.0

# Bag of Words Vectorization

Traditional text vectorization methods, including bag-of-words and TF-IDF representations, for document classification, information retrieval, and feature extraction. These vectorizers produce sparse vector representations that complement dense neural embeddings.

## Capabilities

### Text Vectorization Interface

Base interface for text vectorization implementations providing consistent API across different vectorization strategies.

```java { .api }
/**
 * Text vectorization interface for converting text to numerical representations
 */
public interface TextVectorizer {
    // Core vectorization interface - implementations provide specific vectorization methods
}
```

### Bag of Words Vectorizer

Classic bag-of-words vectorization creating sparse representations based on word frequency counts.

```java { .api }
/**
 * Bag of Words vectorization implementation
 * Creates sparse vector representations based on word frequency counts
 */
public class BagOfWordsVectorizer implements TextVectorizer {
    // Bag of words implementation with configurable vocabulary and normalization
}
```

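To make the counting idea concrete, here is a minimal, library-independent sketch of bag-of-words vectorization. It does not use or reproduce the DL4J implementation: the class `BagOfWordsSketch` and its methods are hypothetical names chosen for illustration, and tokenization is assumed to be a plain lowercase whitespace split.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical standalone sketch; not the DL4J BagOfWordsVectorizer.
class BagOfWordsSketch {

    // Lowercase and split on whitespace; a real tokenizer would do more.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Assign each distinct word an index in order of first appearance.
    static Map<String, Integer> buildVocabulary(List<String> documents) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (String document : documents) {
            for (String token : tokenize(document)) {
                vocabulary.putIfAbsent(token, vocabulary.size());
            }
        }
        return vocabulary;
    }

    // Count occurrences of each vocabulary word; unknown words are ignored.
    static int[] vectorize(String text, Map<String, Integer> vocabulary) {
        int[] counts = new int[vocabulary.size()];
        for (String token : tokenize(text)) {
            Integer index = vocabulary.get(token);
            if (index != null) {
                counts[index]++;
            }
        }
        return counts;
    }
}
```

Each vector has one slot per vocabulary word, so most entries are zero for any single document. That sparsity is why production implementations usually store only the non-zero counts.
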
### TF-IDF Vectorizer

Term Frequency-Inverse Document Frequency vectorization for weighted sparse representations emphasizing discriminative terms.

```java { .api }
/**
 * TF-IDF vectorization implementation
 * Creates weighted sparse vectors emphasizing discriminative terms
 */
public class TfidfVectorizer implements TextVectorizer {
    // TF-IDF implementation with configurable term weighting schemes
}
```

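The weighting idea can be sketched without the library. The example below assumes one common textbook formulation, tf = count / document length and idf = ln(N / df); the DL4J implementation may use a different weighting or smoothing scheme. `TfIdfSketch` is a hypothetical name, and documents are assumed to be pre-tokenized.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical standalone sketch; not the DL4J TfidfVectorizer.
class TfIdfSketch {

    // Returns one map per document: term -> tf * idf,
    // with tf = count / document length and idf = ln(N / df).
    static List<Map<String, Double>> tfidf(List<List<String>> documents) {
        // Document frequency: in how many documents each term appears.
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> document : documents) {
            for (String term : new HashSet<>(document)) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> document : documents) {
            // Raw term counts within this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : document) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                double tf = (double) entry.getValue() / document.size();
                double idf = Math.log((double) documents.size()
                        / documentFrequency.get(entry.getKey()));
                weights.put(entry.getKey(), tf * idf);
            }
            result.add(weights);
        }
        return result;
    }
}
```

Under this formulation a term that occurs in every document gets idf = ln(1) = 0, so it contributes nothing to the vector, while terms confined to a few documents are weighted up.
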
### Base Text Vectorizer

Abstract base implementation providing common functionality for text vectorization implementations.

```java { .api }
/**
 * Abstract base class for text vectorizers
 * Provides common functionality and configuration patterns
 */
public abstract class BaseTextVectorizer implements TextVectorizer {
    // Common vectorization infrastructure and utilities
}
```

### Vectorizer Builder

Builder pattern for configuring text vectorizers with various parameters and data sources.

```java { .api }
/**
 * Builder for text vectorizer configuration
 * Supports various vectorization strategies and parameters
 */
public class Builder {
    // Configurable builder for text vectorization components
}
```

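For readers unfamiliar with the pattern, the following is a generic sketch of how a fluent builder collects settings before producing an immutable configuration. `SketchVectorizerConfig`, `minWordFrequency`, and `stopWords` are hypothetical names chosen for illustration and are not taken from the DL4J `Builder`.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical configuration object; not a DL4J class.
final class SketchVectorizerConfig {
    final int minWordFrequency;
    final Set<String> stopWords;

    private SketchVectorizerConfig(Builder builder) {
        this.minWordFrequency = builder.minWordFrequency;
        // Defensive copy so later builder changes cannot leak in.
        this.stopWords = Collections.unmodifiableSet(new HashSet<>(builder.stopWords));
    }

    static final class Builder {
        private int minWordFrequency = 1;
        private final Set<String> stopWords = new HashSet<>();

        // Each setter validates its argument and returns the builder for chaining.
        Builder minWordFrequency(int value) {
            if (value < 1) {
                throw new IllegalArgumentException("minWordFrequency must be >= 1");
            }
            this.minWordFrequency = value;
            return this;
        }

        Builder stopWords(Collection<String> words) {
            this.stopWords.addAll(words);
            return this;
        }

        SketchVectorizerConfig build() {
            return new SketchVectorizerConfig(this);
        }
    }
}
```

A call such as `new SketchVectorizerConfig.Builder().minWordFrequency(2).build()` keeps optional settings readable and leaves defaults in place for anything not set.
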
### Input Stream Creator

Utility for creating input streams from various text sources for vectorization processing.

```java { .api }
/**
 * Default input stream creator for text sources
 * Handles various input formats and encodings
 */
public class DefaultInputStreamCreator {
    // Input stream creation utilities for text processing
}
```

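As a rough illustration of what exposing text as a stream involves, the sketch below wraps an in-memory string in an `InputStream` with an explicit encoding and reads it back. It uses only the Java standard library; `InputStreamSketch` and its methods are hypothetical and do not mirror `DefaultInputStreamCreator`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone sketch; not the DL4J DefaultInputStreamCreator.
class InputStreamSketch {

    // Expose in-memory text as a stream, encoding it explicitly as UTF-8.
    static InputStream fromText(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    // Read a stream fully and decode it with the same charset.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Naming the charset on both sides avoids depending on the platform default encoding, which matters for non-ASCII text.
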
**Usage Examples:**

```java
import org.deeplearning4j.bagofwords.vectorizer.BagOfWordsVectorizer;
import org.deeplearning4j.bagofwords.vectorizer.TfidfVectorizer;
import org.deeplearning4j.text.sentenceiterator.CollectionSentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.nd4j.linalg.api.ndarray.INDArray;

import java.util.Arrays;
import java.util.Collection;

// Sample corpus
Collection<String> documents = Arrays.asList(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown animal leaps over the sleeping canine",
    "Natural language processing with machine learning"
);

TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();

// Vectorizers are configured through their nested Builder classes.
// Builder method names follow DL4J 0.9.x as best understood; the actual API
// may vary, so verify them against the Javadoc for your version.
BagOfWordsVectorizer bowVectorizer = new BagOfWordsVectorizer.Builder()
    .setTokenizerFactory(tokenizerFactory)
    .setIterator(new CollectionSentenceIterator(documents))
    .setMinWordFrequency(1)
    .build();
bowVectorizer.fit(); // scan the corpus and build the vocabulary

// TF-IDF vectorization for document similarity
TfidfVectorizer tfidfVectorizer = new TfidfVectorizer.Builder()
    .setTokenizerFactory(tokenizerFactory)
    .setIterator(new CollectionSentenceIterator(documents))
    .setMinWordFrequency(1)
    .build();
tfidfVectorizer.fit();

// Transform new text into vectors over the learned vocabulary
INDArray bowVector = bowVectorizer.transform("The quick brown fox");
INDArray tfidfVector = tfidfVectorizer.transform("The quick brown fox");
```

## Integration with Neural Models

Bag-of-words and TF-IDF vectorizers can be used for:

- **Feature extraction**: Converting text to numerical features for traditional ML algorithms
- **Preprocessing**: Initial text representation before neural processing
- **Baseline comparison**: Comparing neural embeddings against classical methods
- **Hybrid approaches**: Combining sparse and dense representations for improved performance
- **Cold start solutions**: Handling new documents without neural model inference overhead

The sparse representations from these vectorizers complement the dense embeddings from Word2Vec, GloVe, and ParagraphVectors, providing multiple perspectives on text data for different use cases and performance requirements.
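
As a small illustration of the hybrid and baseline ideas above, the sketch below concatenates a sparse count vector with a dense embedding and compares two feature vectors by cosine similarity. It works on plain `double[]` arrays rather than DL4J types, and `HybridFeaturesSketch` is a hypothetical name.

```java
// Hypothetical standalone sketch; uses plain arrays instead of DL4J INDArray.
class HybridFeaturesSketch {

    // Append a dense embedding to a sparse count/TF-IDF vector.
    static double[] concat(double[] sparse, double[] dense) {
        double[] combined = new double[sparse.length + dense.length];
        System.arraycopy(sparse, 0, combined, 0, sparse.length);
        System.arraycopy(dense, 0, combined, sparse.length, dense.length);
        return combined;
    }

    // Cosine similarity; returns 0 when either vector has zero norm.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

In practice the two parts often need rescaling before concatenation, since raw counts and embedding components can differ in magnitude by a wide margin.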
