
# Text Processing and Tokenization


Advanced text analysis capabilities for word cloud generation, including intelligent tokenization, stopword filtering, plural normalization, and statistical bigram detection for meaningful phrase extraction.


## Capabilities


### Unigram and Bigram Extraction


Extract individual words and statistically significant two-word phrases from text, with automatic filtering and collocation detection.


```python { .api }
def unigrams_and_bigrams(words, stopwords, normalize_plurals=True, collocation_threshold=30):
    """
    Extract unigrams and statistically significant bigrams from a word list.

    Processes a list of word tokens to identify meaningful single words and
    two-word phrases based on statistical collocation analysis using Dunning
    likelihood ratios. Filters out stopwords and optionally normalizes
    plural forms.

    Parameters:
    - words (list): List of word tokens from text
    - stopwords (set): Set of stopwords to filter out
    - normalize_plurals (bool): Whether to merge plural forms with singular (default: True)
    - collocation_threshold (int): Minimum collocation score for bigram inclusion (default: 30)

    Returns:
    - dict: Dictionary mapping words/phrases to their frequencies, with bigrams
      included only if they exceed the collocation threshold
    """
```


### Token Processing and Normalization


Normalize word tokens by handling case variations and plural forms for consistent word cloud representation.


```python { .api }
def process_tokens(words, normalize_plurals=True):
    """
    Normalize word cases and optionally merge plural forms.

    Processes word tokens to establish canonical forms: the most frequent case
    representation is used for each word, and plural forms are optionally
    merged with singular forms based on a simple heuristic.

    Parameters:
    - words (iterable): Iterable of word strings to process
    - normalize_plurals (bool): Whether to merge plurals with singular forms (default: True)

    Returns:
    - tuple: (counts_dict, standard_forms_dict) where:
      - counts_dict (dict): Word frequencies with canonical case forms
      - standard_forms_dict (dict): Mapping from lowercase to canonical case
    """
```
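The plural merge can be illustrated with a small standalone sketch. The `merge_plurals` helper below is hypothetical: it mirrors the kind of "s"-suffix heuristic described above, not the library's exact code.

```python
from collections import Counter

def merge_plurals(counts):
    # Hypothetical helper illustrating a simple "s"-suffix heuristic:
    # fold "cats" into "cat" when the singular form also occurs,
    # leaving "-ss" words like "class" untouched.
    merged = dict(counts)
    for word in list(merged):
        if word.endswith("s") and not word.endswith("ss"):
            singular = word[:-1]
            if singular in merged:
                merged[singular] += merged.pop(word)
    return merged

print(merge_plurals(Counter(["cat", "cats", "cats", "class", "dog"])))
# {'cat': 3, 'class': 1, 'dog': 1}
```

Note that a plural is only folded in when its singular also appears in the input; a lone "dogs" with no "dog" stays as-is under this heuristic.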


### Statistical Collocation Scoring


Internal function for calculating statistical significance of word pairs using Dunning likelihood ratios.


```python { .api }
def score(count_bigram, count1, count2, n_words):
    """
    Calculate the Dunning likelihood-ratio collocation score for a word pair.

    Computes the statistical significance of bigram co-occurrence using a
    likelihood ratio test to distinguish meaningful phrases from random
    word combinations.

    Parameters:
    - count_bigram (int): Frequency of the bigram
    - count1 (int): Frequency of the first word
    - count2 (int): Frequency of the second word
    - n_words (int): Total number of words in the corpus

    Returns:
    - float: Collocation score (higher values indicate stronger association)
    """
```
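To see how a likelihood-ratio test separates real collocations from chance co-occurrence, here is a standalone sketch of a Dunning-style score. The `dunning_score` and `log_likelihood` names are my own; this is a reimplementation of the standard formula under the same four-argument interface, not the library's exact code.

```python
from math import log

def log_likelihood(k, n, x):
    # Binomial log-likelihood of k successes in n trials with probability x,
    # clamped away from 0 and 1 for numerical safety.
    x = min(max(x, 1e-10), 1 - 1e-10)
    return k * log(x) + (n - k) * log(1 - x)

def dunning_score(count_bigram, count1, count2, n_words):
    # Likelihood ratio comparing "word2 occurs independently of word1"
    # against "word2 is more likely immediately after word1".
    # Higher values indicate a stronger association.
    p = count2 / n_words
    p1 = count_bigram / count1
    p2 = (count2 - count_bigram) / (n_words - count1)
    ratio = (log_likelihood(count_bigram, count1, p)
             + log_likelihood(count2 - count_bigram, n_words - count1, p)
             - log_likelihood(count_bigram, count1, p1)
             - log_likelihood(count2 - count_bigram, n_words - count1, p2))
    return -2 * ratio

# A tightly bound pair scores far higher than a chance co-occurrence
strong = dunning_score(12, 15, 14, 1000)  # e.g. "machine" + "learning"
weak = dunning_score(1, 50, 60, 1000)     # two common but unrelated words
print(strong > weak)  # True
```

This is the shape of test that makes a fixed `collocation_threshold` meaningful: genuinely associated pairs score orders of magnitude above incidental neighbors.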


## STOPWORDS Constant


Pre-defined set of common English words to filter from word cloud generation.


```python { .api }
STOPWORDS: set[str]  # Comprehensive set of English stopwords
```


The `STOPWORDS` constant contains common English words (articles, prepositions, pronouns, etc.) that are typically filtered out of word clouds to focus on meaningful content words.


## Usage Examples


### Basic Tokenization


```python
from wordcloud.tokenization import unigrams_and_bigrams
from wordcloud import STOPWORDS

# Process text tokens
words = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
word_counts = unigrams_and_bigrams(words, STOPWORDS)
print(word_counts)  # {'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, ...}
```


### Custom Stopwords


```python
from wordcloud.tokenization import unigrams_and_bigrams
from wordcloud import STOPWORDS

# Add custom stopwords
custom_stopwords = STOPWORDS.copy()
custom_stopwords.update(['custom', 'specific', 'terms'])

# Process with custom stopwords (using an illustrative sample text)
text = "specific terms recur in this custom sample text"
words = text.split()
word_counts = unigrams_and_bigrams(words, custom_stopwords)
```


### Bigram Detection


```python
from wordcloud.tokenization import unigrams_and_bigrams
from wordcloud import STOPWORDS

# Text with potential bigrams
text = "machine learning algorithms are powerful tools for data science"
words = text.split()

# Extract with bigram detection
word_counts = unigrams_and_bigrams(
    words,
    STOPWORDS,
    normalize_plurals=True,
    collocation_threshold=10  # Lower threshold admits more bigrams
)

# May include bigrams like "machine learning" or "data science"
print(word_counts)
```


### Token Normalization


```python
from wordcloud.tokenization import process_tokens

# Words with case variations and plurals
words = ["Python", "python", "PYTHON", "pythons", "cats", "cat", "Dogs", "dog"]

# Normalize tokens
counts, standard_forms = process_tokens(words, normalize_plurals=True)

# Case variants and plurals are merged; the canonical casing chosen
# for a group may vary when case counts tie
print(counts)          # e.g. {'Python': 4, 'cat': 2, 'dog': 2}
print(standard_forms)  # e.g. {'python': 'Python', 'cat': 'cat', ...}
```


### Integration with WordCloud


```python
from wordcloud import WordCloud, STOPWORDS
import re

# Custom text processing with WordCloud
def custom_preprocess(text):
    # Custom tokenization: words of three or more letters
    words = re.findall(r'\b[a-zA-Z]{3,}\b', text.lower())

    # Add domain-specific stopwords
    custom_stops = STOPWORDS.copy()
    custom_stops.update(['said', 'would', 'could'])

    return words, custom_stops

# Use with WordCloud
text = "Your text data here..."
words, stopwords = custom_preprocess(text)

# WordCloud applies its own internal processing, but you can also
# use the tokenization functions directly for more control
from wordcloud.tokenization import unigrams_and_bigrams
frequencies = unigrams_and_bigrams(words, stopwords)

wc = WordCloud().generate_from_frequencies(frequencies)
```


### Controlling Plural Normalization


```python
from wordcloud.tokenization import process_tokens

# Keep plurals separate
words = ["cat", "cats", "dog", "dogs", "analysis", "analyses"]
counts_separate, _ = process_tokens(words, normalize_plurals=False)
print(counts_separate)  # {'cat': 1, 'cats': 1, 'dog': 1, 'dogs': 1, ...}

# Merge plurals (default behavior); the simple "s"-suffix heuristic merges
# cat/cats and dog/dogs but not irregular plurals like analysis/analyses
counts_merged, _ = process_tokens(words, normalize_plurals=True)
print(counts_merged)  # {'cat': 2, 'dog': 2, ...}
```


### Advanced Collocation Control


```python
from wordcloud.tokenization import unigrams_and_bigrams
from wordcloud import STOPWORDS

text = "New York City is a great place to visit in New York state"
words = text.split()

# High threshold - fewer bigrams
strict_counts = unigrams_and_bigrams(words, STOPWORDS, collocation_threshold=50)

# Low threshold - more bigrams
loose_counts = unigrams_and_bigrams(words, STOPWORDS, collocation_threshold=5)

# Compare results
print("Strict:", strict_counts)
print("Loose:", loose_counts)  # May include "New York" as a bigram
```