# Text Processing and Tokenization

Advanced text analysis capabilities for word cloud generation, including intelligent tokenization, stopword filtering, plural normalization, and statistical bigram detection for meaningful phrase extraction.

## Capabilities

### Unigram and Bigram Extraction

Extract individual words and statistically significant two-word phrases from text, with automatic filtering and collocation detection.
```python { .api }
def unigrams_and_bigrams(words, stopwords, normalize_plurals=True, collocation_threshold=30):
    """
    Extract unigrams and statistically significant bigrams from a word list.

    Processes a list of word tokens to identify meaningful single words and
    two-word phrases based on statistical collocation analysis using Dunning
    likelihood ratios. Filters out stopwords and optionally normalizes plural
    forms.

    Parameters:
    - words (list): List of word tokens from text
    - stopwords (set): Set of stopwords to filter out
    - normalize_plurals (bool): Whether to merge plural forms with singular (default: True)
    - collocation_threshold (int): Minimum collocation score for bigram inclusion (default: 30)

    Returns:
    - dict: Dictionary mapping words/phrases to their frequencies, with bigrams
      included only if they exceed the collocation threshold
    """
```
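As a conceptual illustration (not the library's exact code), candidate bigrams are simply adjacent token pairs, and any pair containing a stopword is discarded before collocation scoring:

```python
# Illustrative sketch: candidate bigrams are adjacent token pairs.
# Pairs containing a stopword are dropped; the survivors are later
# scored against the collocation threshold.
words = ["machine", "learning", "is", "fun", "machine", "learning"]
stopwords = {"is"}

pairs = list(zip(words, words[1:]))
candidates = [p for p in pairs if not any(w.lower() in stopwords for w in p)]
print(candidates)  # [('machine', 'learning'), ('fun', 'machine'), ('machine', 'learning')]
```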
### Token Processing and Normalization

Normalize word tokens by handling case variations and plural forms for consistent word cloud representation.

```python { .api }
def process_tokens(words, normalize_plurals=True):
    """
    Normalize word cases and optionally merge plural forms.

    Processes word tokens to establish canonical forms: the most frequent
    case variant is used as the canonical form of each word, and plural
    forms are optionally merged with singular forms based on a simple
    heuristic.

    Parameters:
    - words (iterable): Iterable of word strings to process
    - normalize_plurals (bool): Whether to merge plurals with singular forms (default: True)

    Returns:
    - tuple: (counts_dict, standard_forms_dict) where:
      - counts_dict (dict): Word frequencies keyed by canonical case form
      - standard_forms_dict (dict): Mapping from lowercase word to canonical case form
    """
```
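The plural handling can be pictured with a simplified sketch (illustrative only; the library's actual heuristic may differ in detail): counts for `word + "s"` are folded into `word` when both forms occur:

```python
from collections import Counter

def merge_plurals(counts):
    # Simplified sketch of plural merging: fold "word" + "s" into "word"
    # when the singular also occurs; words ending in "ss" are left alone.
    merged = dict(counts)
    for word in list(merged):
        if word.endswith("s") and not word.endswith("ss"):
            singular = word[:-1]
            if singular in merged:
                merged[singular] += merged.pop(word)
    return merged

counts = Counter(["cat", "cats", "cats", "dog", "glass"])
print(merge_plurals(counts))  # {'cat': 3, 'dog': 1, 'glass': 1}
```

Because the heuristic only strips a trailing "s", irregular plurals (e.g. "analyses") are not matched to their singulars.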
### Statistical Collocation Scoring

Internal function that scores the statistical significance of word pairs using Dunning likelihood ratios.

```python { .api }
def score(count_bigram, count1, count2, n_words):
    """
    Calculate the Dunning likelihood collocation score for a word pair.

    Computes the statistical significance of bigram co-occurrence using a
    likelihood ratio test, distinguishing meaningful phrases from random
    word combinations.

    Parameters:
    - count_bigram (int): Frequency of the bigram
    - count1 (int): Frequency of the first word
    - count2 (int): Frequency of the second word
    - n_words (int): Total number of words in the corpus

    Returns:
    - float: Collocation score (higher values indicate stronger association)
    """
```
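The score follows Dunning's log-likelihood ratio test for collocations. The following is a self-contained illustrative sketch of that statistic, not the library's exact implementation:

```python
from math import log

def _l(k, n, x):
    # Log-likelihood of k successes in n Bernoulli trials with probability x,
    # clamped away from 0 to avoid log(0).
    return k * log(max(x, 1e-10)) + (n - k) * log(max(1 - x, 1e-10))

def dunning_score(count_bigram, count1, count2, n_words):
    # Likelihood ratio of "word2 is independent of word1" versus
    # "word2 depends on word1"; higher means stronger association.
    p = count2 / n_words                               # P(w2)
    p1 = count_bigram / count1                         # P(w2 | w1)
    p2 = (count2 - count_bigram) / (n_words - count1)  # P(w2 | not w1)
    return -2 * (
        _l(count_bigram, count1, p)
        + _l(count2 - count_bigram, n_words - count1, p)
        - _l(count_bigram, count1, p1)
        - _l(count2 - count_bigram, n_words - count1, p2)
    )

# A pair that always co-occurs scores far higher than a rare co-occurrence.
print(dunning_score(10, 10, 10, 1000) > dunning_score(1, 10, 10, 1000))  # True
```

A high score means the second word is far more likely immediately after the first than elsewhere in the corpus, which is what separates "machine learning" from incidental adjacency.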
## STOPWORDS Constant

A predefined set of common English words to filter from word cloud generation.

```python { .api }
STOPWORDS: set[str]  # Comprehensive set of English stopwords
```

The `STOPWORDS` constant contains common English words (articles, prepositions, pronouns, etc.) that are typically filtered out of word clouds to focus on meaningful content words.
## Usage Examples

### Basic Tokenization

```python
from wordcloud.tokenization import unigrams_and_bigrams
from wordcloud import STOPWORDS

# Process text tokens
words = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
word_counts = unigrams_and_bigrams(words, STOPWORDS)
print(word_counts)  # {'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, ...}
```
### Custom Stopwords

```python
from wordcloud.tokenization import unigrams_and_bigrams
from wordcloud import STOPWORDS

# Add custom stopwords
custom_stopwords = STOPWORDS.copy()
custom_stopwords.update(['custom', 'specific', 'terms'])

# Process with custom stopwords
text = "custom terms appear throughout this domain specific text"
words = text.split()
word_counts = unigrams_and_bigrams(words, custom_stopwords)
```
### Bigram Detection

```python
from wordcloud.tokenization import unigrams_and_bigrams
from wordcloud import STOPWORDS

# Text with potential bigrams
text = "machine learning algorithms are powerful tools for data science"
words = text.split()

# Extract with bigram detection
word_counts = unigrams_and_bigrams(
    words,
    STOPWORDS,
    normalize_plurals=True,
    collocation_threshold=10  # Lower threshold admits more bigrams
)

# May include bigrams like "machine learning" or "data science"
print(word_counts)
```
### Token Normalization

```python
from wordcloud.tokenization import process_tokens

# Words with case variations and plurals
words = ["Python", "python", "PYTHON", "pythons", "cats", "cat", "Dogs", "dog"]

# Normalize tokens: case variants are merged under the most frequent
# case form, and plurals are folded into their singulars
counts, standard_forms = process_tokens(words, normalize_plurals=True)

print(counts)          # e.g. {'python': 4, 'cat': 2, 'dog': 2}; exact casing depends on tie-breaking
print(standard_forms)  # maps lowercase forms to their canonical case forms
```
### Integration with WordCloud

```python
from wordcloud import WordCloud, STOPWORDS
import re

# Custom text processing with WordCloud
def custom_preprocess(text):
    # Custom tokenization
    words = re.findall(r'\b[a-zA-Z]{3,}\b', text.lower())

    # Add domain-specific stopwords
    custom_stops = STOPWORDS.copy()
    custom_stops.update(['said', 'would', 'could'])

    return words, custom_stops

# Use with WordCloud
text = "Your text data here..."
words, stopwords = custom_preprocess(text)

# WordCloud will use its internal processing, but you can also
# use the tokenization functions directly for more control
from wordcloud.tokenization import unigrams_and_bigrams
frequencies = unigrams_and_bigrams(words, stopwords)

wc = WordCloud().generate_from_frequencies(frequencies)
```
### Controlling Plural Normalization

```python
from wordcloud.tokenization import process_tokens

# Keep plurals separate
words = ["cat", "cats", "dog", "dogs", "analysis", "analyses"]
counts_separate, _ = process_tokens(words, normalize_plurals=False)
print(counts_separate)  # {'cat': 1, 'cats': 1, 'dog': 1, 'dogs': 1, ...}

# Merge plurals (default behavior); the heuristic only strips a trailing 's',
# so irregular plurals like 'analyses' are not merged into 'analysis'
counts_merged, _ = process_tokens(words, normalize_plurals=True)
print(counts_merged)  # {'cat': 2, 'dog': 2, 'analysis': 1, 'analyses': 1}
```
### Advanced Collocation Control

```python
from wordcloud.tokenization import unigrams_and_bigrams
from wordcloud import STOPWORDS

text = "New York City is a great place to visit in New York state"
words = text.split()

# High threshold - fewer bigrams
strict_counts = unigrams_and_bigrams(words, STOPWORDS, collocation_threshold=50)

# Low threshold - more bigrams
loose_counts = unigrams_and_bigrams(words, STOPWORDS, collocation_threshold=5)

# Compare results
print("Strict:", strict_counts)
print("Loose:", loose_counts)  # May include "New York" as a bigram
```