# Bag of Words Vectorization

Traditional text vectorization methods including TF-IDF and bag-of-words representations for document classification, information retrieval, and feature extraction tasks. Provides sparse vector representations that complement dense neural embeddings.

## Capabilities

### Text Vectorization Interface

Base interface for text vectorization implementations, providing a consistent API across different vectorization strategies.

```java { .api }
/**
 * Text vectorization interface for converting text to numerical representations
 */
public interface TextVectorizer {
    // Core vectorization interface - implementations provide specific vectorization methods
}
```

### Bag of Words Vectorizer

Classic bag-of-words vectorization creating sparse representations based on word frequency counts.

```java { .api }
/**
 * Bag of Words vectorization implementation
 * Creates sparse vector representations based on word frequency counts
 */
public class BagOfWordsVectorizer implements TextVectorizer {
    // Bag of words implementation with configurable vocabulary and normalization
}
```
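
To make the counting idea concrete, here is a minimal, library-independent sketch of bag-of-words vectorization in plain Java. It is not the `BagOfWordsVectorizer` implementation: the class `BagOfWordsSketch`, its method names, and the naive letter-only tokenizer are all hypothetical, and the result is a dense `int[]` for brevity where a real vectorizer would typically store the vector sparsely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

/** Hypothetical, library-independent sketch of bag-of-words counting. */
final class BagOfWordsSketch {

    /** Lowercases the text and splits on non-letter characters (a deliberately naive tokenizer). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Builds a sorted vocabulary so that every word gets a stable column index. */
    static List<String> buildVocabulary(List<String> documents) {
        TreeSet<String> vocabulary = new TreeSet<>();
        for (String document : documents) {
            vocabulary.addAll(tokenize(document));
        }
        return new ArrayList<>(vocabulary);
    }

    /** Returns one count per vocabulary entry; out-of-vocabulary tokens are ignored. */
    static int[] vectorize(String document, List<String> vocabulary) {
        int[] counts = new int[vocabulary.size()];
        for (String token : tokenize(document)) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
            "The quick brown fox jumps over the lazy dog",
            "A fast brown animal leaps over the sleeping canine");
        List<String> vocabulary = buildVocabulary(documents);
        System.out.println(vocabulary);
        System.out.println(Arrays.toString(vectorize(documents.get(0), vocabulary)));
    }
}
```

Word order is discarded entirely; only the per-word counts survive, which is what makes the representation cheap to compute and easy to compare across documents.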

### TF-IDF Vectorizer

Term Frequency-Inverse Document Frequency vectorization for weighted sparse representations emphasizing discriminative terms.

```java { .api }
/**
 * TF-IDF vectorization implementation
 * Creates weighted sparse vectors emphasizing discriminative terms
 */
public class TfidfVectorizer implements TextVectorizer {
    // TF-IDF implementation with configurable term weighting schemes
}
```
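
As an illustration of how TF-IDF weighting emphasizes discriminative terms, the sketch below computes weights in plain Java. It assumes one common textbook variant, `tf = count / documentLength` and `idf = ln(N / df)`; the weighting scheme that `TfidfVectorizer` actually applies may differ. The class `TfidfSketch` and its methods are hypothetical and not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of TF-IDF weighting using tf = count/length and idf = ln(N/df). */
final class TfidfSketch {

    /** Lowercases the text and splits on whitespace. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Counts, for each term, the number of corpus documents that contain it. */
    static Map<String, Integer> documentFrequencies(List<String> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (String document : corpus) {
            // A set ensures each term is counted at most once per document.
            for (String term : new HashSet<>(tokenize(document))) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Returns term -> TF-IDF weight for one document; terms unseen in the corpus are skipped. */
    static Map<String, Double> weights(String document, List<String> corpus) {
        Map<String, Integer> df = documentFrequencies(corpus);
        List<String> tokens = tokenize(document);
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        Map<String, Double> weights = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            Integer docsWithTerm = df.get(entry.getKey());
            if (docsWithTerm == null) {
                continue;
            }
            double tf = entry.getValue() / (double) tokens.size();
            double idf = Math.log(corpus.size() / (double) docsWithTerm);
            weights.put(entry.getKey(), tf * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("cat dog", "cat bird");
        System.out.println(weights("cat dog", corpus));
    }
}
```

Under this variant a term that occurs in every document gets an IDF of zero, so only terms that distinguish one document from the others contribute weight.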

### Base Text Vectorizer

Abstract base implementation providing common functionality for text vectorization implementations.

```java { .api }
/**
 * Abstract base class for text vectorizers
 * Provides common functionality and configuration patterns
 */
public abstract class BaseTextVectorizer implements TextVectorizer {
    // Common vectorization infrastructure and utilities
}
```

### Vectorizer Builder

Builder pattern for configuring text vectorizers with various parameters and data sources.

```java { .api }
/**
 * Builder for text vectorizer configuration
 * Supports various vectorization strategies and parameters
 */
public class Builder {
    // Configurable builder for text vectorization components
}
```

### Input Stream Creator

Utility for creating input streams from various text sources for vectorization processing.

```java { .api }
/**
 * Default input stream creator for text sources
 * Handles various input formats and encodings
 */
public class DefaultInputStreamCreator {
    // Input stream creation utilities for text processing
}
```

**Usage Examples:**

```java
import java.util.Arrays;
import java.util.Collection;

import org.deeplearning4j.bagofwords.vectorizer.*;

// Basic bag of words vectorization
Collection<String> documents = Arrays.asList(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown animal leaps over the sleeping canine",
    "Natural language processing with machine learning"
);

// Configure bag of words vectorizer
BagOfWordsVectorizer bowVectorizer = new BagOfWordsVectorizer();
// Additional configuration would be done here based on actual API

// TF-IDF vectorization for document similarity
TfidfVectorizer tfidfVectorizer = new TfidfVectorizer();
// Configure TF-IDF parameters based on actual implementation

// Example usage pattern (actual API may vary):
// INDArray vectors = bowVectorizer.vectorize(documents);
// INDArray tfidfVectors = tfidfVectorizer.vectorize(documents);

// Builder pattern usage example
Builder vectorizerBuilder = new Builder()
    // Configure vectorization parameters
    // .vocabulary(customVocabulary)
    // .minWordFrequency(2)
    // .maxFeatures(10000)
    ;

// TextVectorizer vectorizer = vectorizerBuilder.build();
```

## Integration with Neural Models

Bag of words and TF-IDF vectorizers can be used as:

- **Feature extraction**: Converting text to numerical features for traditional ML algorithms
- **Preprocessing**: Initial text representation before neural processing
- **Baseline comparison**: Comparing neural embeddings against classical methods
- **Hybrid approaches**: Combining sparse and dense representations for improved performance
- **Cold start solutions**: Handling new documents without neural model inference overhead

The sparse representations from these vectorizers complement the dense embeddings from Word2Vec, GloVe, and ParagraphVectors, providing multiple perspectives on text data for different use cases and performance requirements.
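
One simple way to realize the hybrid approach is to concatenate a bag-of-words or TF-IDF vector with a dense embedding of the same document. The sketch below is a plain-Java illustration under stated assumptions: both vectors are already available as `double[]` arrays, and each part is L2-normalized first so that neither dominates by scale (one possible design choice, not a library requirement). `HybridFeaturesSketch` and its methods are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical sketch: combine a sparse-style count vector with a dense embedding. */
final class HybridFeaturesSketch {

    /** Returns a copy scaled to unit L2 norm; an all-zero vector is returned unchanged. */
    static double[] l2Normalize(double[] vector) {
        double sumOfSquares = 0.0;
        for (double value : vector) {
            sumOfSquares += value * value;
        }
        double norm = Math.sqrt(sumOfSquares);
        double[] result = Arrays.copyOf(vector, vector.length);
        if (norm == 0.0) {
            return result;
        }
        for (int i = 0; i < result.length; i++) {
            result[i] /= norm;
        }
        return result;
    }

    /** Concatenates the normalized sparse-style features and the normalized dense embedding. */
    static double[] combine(double[] sparseFeatures, double[] denseEmbedding) {
        double[] sparsePart = l2Normalize(sparseFeatures);
        double[] densePart = l2Normalize(denseEmbedding);
        double[] combined = new double[sparsePart.length + densePart.length];
        System.arraycopy(sparsePart, 0, combined, 0, sparsePart.length);
        System.arraycopy(densePart, 0, combined, sparsePart.length, densePart.length);
        return combined;
    }

    public static void main(String[] args) {
        double[] counts = {3.0, 0.0, 4.0};   // e.g. word counts for one document
        double[] embedding = {0.5, -0.5};    // e.g. a 2-dimensional document embedding
        System.out.println(Arrays.toString(combine(counts, embedding)));
    }
}
```

The combined vector can then be fed to any downstream classifier that accepts fixed-length numeric features.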