# Text Processing and Tokenization

Advanced text analysis capabilities for word cloud generation, including intelligent tokenization, stopword filtering, plural normalization, and statistical bigram detection for meaningful phrase extraction.

## Capabilities

### Unigram and Bigram Extraction

Extract individual words and statistically significant two-word phrases from text, with automatic filtering and collocation detection.
```python { .api }
def unigrams_and_bigrams(words, stopwords, normalize_plurals=True, collocation_threshold=30):
    """
    Extract unigrams and statistically significant bigrams from a word list.

    Processes a list of word tokens to identify meaningful single words and
    two-word phrases based on statistical collocation analysis using Dunning
    likelihood ratios. Filters out stopwords and optionally normalizes plural
    forms.

    Parameters:
    - words (list): List of word tokens from text
    - stopwords (set): Set of stopwords to filter out
    - normalize_plurals (bool): Whether to merge plural forms with singular (default: True)
    - collocation_threshold (int): Minimum collocation score for bigram inclusion (default: 30)

    Returns:
    - dict: Dictionary mapping words/phrases to their frequencies, with bigrams
      included only if they exceed the collocation threshold
    """
```
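As a conceptual illustration (not the library's exact code), candidate bigrams are simply adjacent token pairs, and any pair containing a stopword is discarded before collocation scoring:

```python
# Illustrative sketch: candidate bigrams are adjacent token pairs.
# Pairs containing a stopword are dropped; the survivors are later
# scored against the collocation threshold.
words = ["machine", "learning", "is", "fun", "machine", "learning"]
stopwords = {"is"}

pairs = list(zip(words, words[1:]))
candidates = [p for p in pairs if not any(w.lower() in stopwords for w in p)]
print(candidates)  # [('machine', 'learning'), ('fun', 'machine'), ('machine', 'learning')]
```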
### Token Processing and Normalization

Normalize word tokens by handling case variations and plural forms for consistent word cloud representation.

```python { .api }
def process_tokens(words, normalize_plurals=True):
    """
    Normalize word cases and optionally merge plural forms.

    Processes word tokens to establish canonical forms: the most frequent
    case variant is used as the canonical form of each word, and plural
    forms are optionally merged with singular forms based on a simple
    heuristic.

    Parameters:
    - words (iterable): Iterable of word strings to process
    - normalize_plurals (bool): Whether to merge plurals with singular forms (default: True)

    Returns:
    - tuple: (counts_dict, standard_forms_dict) where:
      - counts_dict (dict): Word frequencies keyed by canonical case form
      - standard_forms_dict (dict): Mapping from lowercase word to canonical case form
    """
```
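The plural handling can be pictured with a simplified sketch (illustrative only; the library's actual heuristic may differ in detail): counts for `word + "s"` are folded into `word` when both forms occur:

```python
from collections import Counter

def merge_plurals(counts):
    # Simplified sketch of plural merging: fold "word" + "s" into "word"
    # when the singular also occurs; words ending in "ss" are left alone.
    merged = dict(counts)
    for word in list(merged):
        if word.endswith("s") and not word.endswith("ss"):
            singular = word[:-1]
            if singular in merged:
                merged[singular] += merged.pop(word)
    return merged

counts = Counter(["cat", "cats", "cats", "dog", "glass"])
print(merge_plurals(counts))  # {'cat': 3, 'dog': 1, 'glass': 1}
```

Because the heuristic only strips a trailing "s", irregular plurals (e.g. "analyses") are not matched to their singulars.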
### Statistical Collocation Scoring

Internal function that scores the statistical significance of word pairs using Dunning likelihood ratios.

```python { .api }
def score(count_bigram, count1, count2, n_words):
    """
    Calculate the Dunning likelihood collocation score for a word pair.

    Computes the statistical significance of bigram co-occurrence using a
    likelihood ratio test, distinguishing meaningful phrases from random
    word combinations.

    Parameters:
    - count_bigram (int): Frequency of the bigram
    - count1 (int): Frequency of the first word
    - count2 (int): Frequency of the second word
    - n_words (int): Total number of words in the corpus

    Returns:
    - float: Collocation score (higher values indicate stronger association)
    """
```
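The score follows Dunning's log-likelihood ratio test for collocations. The following is a self-contained illustrative sketch of that statistic, not the library's exact implementation:

```python
from math import log

def _l(k, n, x):
    # Log-likelihood of k successes in n Bernoulli trials with probability x,
    # clamped away from 0 to avoid log(0).
    return k * log(max(x, 1e-10)) + (n - k) * log(max(1 - x, 1e-10))

def dunning_score(count_bigram, count1, count2, n_words):
    # Likelihood ratio of "word2 is independent of word1" versus
    # "word2 depends on word1"; higher means stronger association.
    p = count2 / n_words                               # P(w2)
    p1 = count_bigram / count1                         # P(w2 | w1)
    p2 = (count2 - count_bigram) / (n_words - count1)  # P(w2 | not w1)
    return -2 * (
        _l(count_bigram, count1, p)
        + _l(count2 - count_bigram, n_words - count1, p)
        - _l(count_bigram, count1, p1)
        - _l(count2 - count_bigram, n_words - count1, p2)
    )

# A pair that always co-occurs scores far higher than a rare co-occurrence.
print(dunning_score(10, 10, 10, 1000) > dunning_score(1, 10, 10, 1000))  # True
```

A high score means the second word is far more likely immediately after the first than elsewhere in the corpus, which is what separates "machine learning" from incidental adjacency.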
## STOPWORDS Constant

A predefined set of common English words to filter from word cloud generation.

```python { .api }
STOPWORDS: set[str]  # Comprehensive set of English stopwords
```

The `STOPWORDS` constant contains common English words (articles, prepositions, pronouns, etc.) that are typically filtered out of word clouds to focus on meaningful content words.
## Usage Examples

### Basic Tokenization

```python
from wordcloud.tokenization import unigrams_and_bigrams
from wordcloud import STOPWORDS

# Process text tokens
words = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
word_counts = unigrams_and_bigrams(words, STOPWORDS)
print(word_counts)  # {'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, ...}
```
### Custom Stopwords

```python
from wordcloud.tokenization import unigrams_and_bigrams
from wordcloud import STOPWORDS

# Add custom stopwords
custom_stopwords = STOPWORDS.copy()
custom_stopwords.update(['custom', 'specific', 'terms'])

# Process with custom stopwords
text = "custom terms appear throughout this domain specific text"
words = text.split()
word_counts = unigrams_and_bigrams(words, custom_stopwords)
```
### Bigram Detection

```python
from wordcloud.tokenization import unigrams_and_bigrams
from wordcloud import STOPWORDS

# Text with potential bigrams
text = "machine learning algorithms are powerful tools for data science"
words = text.split()

# Extract with bigram detection
word_counts = unigrams_and_bigrams(
    words,
    STOPWORDS,
    normalize_plurals=True,
    collocation_threshold=10  # Lower threshold admits more bigrams
)

# May include bigrams like "machine learning" or "data science"
print(word_counts)
```
### Token Normalization

```python
from wordcloud.tokenization import process_tokens

# Words with case variations and plurals
words = ["Python", "python", "PYTHON", "pythons", "cats", "cat", "Dogs", "dog"]

# Normalize tokens: case variants are merged under the most frequent
# case form, and plurals are folded into their singulars
counts, standard_forms = process_tokens(words, normalize_plurals=True)

print(counts)          # e.g. {'python': 4, 'cat': 2, 'dog': 2}; exact casing depends on tie-breaking
print(standard_forms)  # maps lowercase forms to their canonical case forms
```
### Integration with WordCloud

```python
from wordcloud import WordCloud, STOPWORDS
import re

# Custom text processing with WordCloud
def custom_preprocess(text):
    # Custom tokenization
    words = re.findall(r'\b[a-zA-Z]{3,}\b', text.lower())

    # Add domain-specific stopwords
    custom_stops = STOPWORDS.copy()
    custom_stops.update(['said', 'would', 'could'])

    return words, custom_stops

# Use with WordCloud
text = "Your text data here..."
words, stopwords = custom_preprocess(text)

# WordCloud will use its internal processing, but you can also
# use the tokenization functions directly for more control
from wordcloud.tokenization import unigrams_and_bigrams
frequencies = unigrams_and_bigrams(words, stopwords)

wc = WordCloud().generate_from_frequencies(frequencies)
```
### Controlling Plural Normalization

```python
from wordcloud.tokenization import process_tokens

# Keep plurals separate
words = ["cat", "cats", "dog", "dogs", "analysis", "analyses"]
counts_separate, _ = process_tokens(words, normalize_plurals=False)
print(counts_separate)  # {'cat': 1, 'cats': 1, 'dog': 1, 'dogs': 1, ...}

# Merge plurals (default behavior); the heuristic only strips a trailing 's',
# so irregular plurals like 'analyses' are not merged into 'analysis'
counts_merged, _ = process_tokens(words, normalize_plurals=True)
print(counts_merged)  # {'cat': 2, 'dog': 2, 'analysis': 1, 'analyses': 1}
```
### Advanced Collocation Control

```python
from wordcloud.tokenization import unigrams_and_bigrams
from wordcloud import STOPWORDS

text = "New York City is a great place to visit in New York state"
words = text.split()

# High threshold - fewer bigrams
strict_counts = unigrams_and_bigrams(words, STOPWORDS, collocation_threshold=50)

# Low threshold - more bigrams
loose_counts = unigrams_and_bigrams(words, STOPWORDS, collocation_threshold=5)

# Compare results
print("Strict:", strict_counts)
print("Loose:", loose_counts)  # May include "New York" as a bigram
```