# Pattern Mining

Association rule mining and frequent pattern discovery algorithms for transaction data analysis. These algorithms are commonly used in market basket analysis, web usage mining, and bioinformatics.

## Capabilities

### Apriori Algorithm

Classic algorithm for frequent itemset mining using the downward closure property: every subset of a frequent itemset must itself be frequent, which allows candidates with an infrequent subset to be pruned early.
```python { .api }
def apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0,
            low_memory=False):
    """
    Apriori algorithm for frequent itemset mining.

    Parameters:
    - df: DataFrame or array-like, binary transaction matrix
    - min_support: float, minimum support threshold (0.0 to 1.0)
    - use_colnames: bool, use column names in output instead of indices
    - max_len: int, maximum length of itemsets to evaluate
    - verbose: int, verbosity level (0=silent, 1=progress bar)
    - low_memory: bool, use memory-efficient implementation

    Returns:
    - frequent_itemsets: DataFrame with columns ['support', 'itemsets']
      - support: float, support value of the itemset
      - itemsets: frozenset, the frequent itemset
    """
```
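
The downward-closure property Apriori relies on can be illustrated with a brute-force support count. A minimal pure-Python sketch (the transactions and the list-of-sets encoding here are assumptions for the demo, not mlxtend's internals):

```python
# Downward closure: if an itemset is frequent, every subset of it is at
# least as frequent. This is why Apriori can prune candidates whose
# subsets are already known to be infrequent.
from itertools import combinations

transactions = [
    {'bread', 'milk'},
    {'bread', 'diapers', 'beer'},
    {'milk', 'diapers', 'beer'},
    {'bread', 'milk', 'diapers'},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

freq = {'bread', 'milk'}  # a frequent 2-itemset in this toy data
# Every non-empty subset has support >= the itemset itself.
assert all(support(set(sub)) >= support(freq)
           for r in (1, 2) for sub in combinations(freq, r))
print(support(freq))  # → 0.5
```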

### FP-Growth Algorithm

Frequent Pattern Growth algorithm for efficient frequent itemset mining without candidate generation.
```python { .api }
def fpgrowth(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0):
    """
    FP-Growth algorithm for frequent itemset mining.

    Parameters:
    - df: DataFrame or array-like, binary transaction matrix
    - min_support: float, minimum support threshold (0.0 to 1.0)
    - use_colnames: bool, use column names in output instead of indices
    - max_len: int, maximum length of itemsets to evaluate
    - verbose: int, verbosity level (0=silent, 1=progress bar)

    Returns:
    - frequent_itemsets: DataFrame with columns ['support', 'itemsets']
      - support: float, support value of the itemset
      - itemsets: frozenset, the frequent itemset
    """
```
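
The "no candidate generation" claim rests on compressing the dataset into an FP-tree: a prefix tree of transactions with items ordered by descending frequency, so transactions sharing frequent items share tree paths. A minimal construction sketch under that idea (this is an illustration, not mlxtend's internal implementation):

```python
# Build a tiny FP-tree: count items, drop infrequent ones, then insert each
# transaction with its items sorted by descending global frequency so that
# common prefixes collapse into shared paths.
from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fptree(transactions, min_count):
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_count}
    root = Node(None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

tree = build_fptree([['A', 'B'], ['B', 'C'], ['A', 'B', 'C']], min_count=2)
# 'B' is the most frequent item, so all three transactions share the 'B' prefix.
print(tree.children['B'].count)  # → 3
```

Mining then proceeds recursively on conditional subtrees instead of enumerating candidate itemsets, which is what makes FP-Growth efficient on dense data.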

### FPMax Algorithm

FPMax algorithm for mining maximal frequent itemsets efficiently. A frequent itemset is maximal if none of its proper supersets is frequent.
```python { .api }
def fpmax(df, min_support=0.5, use_colnames=False, verbose=0):
    """
    FPMax algorithm for maximal frequent itemset mining.

    Parameters:
    - df: DataFrame or array-like, binary transaction matrix
    - min_support: float, minimum support threshold (0.0 to 1.0)
    - use_colnames: bool, use column names in output instead of indices
    - verbose: int, verbosity level (0=silent, 1=progress bar)

    Returns:
    - frequent_itemsets: DataFrame with columns ['support', 'itemsets']
      - support: float, support value of the itemset
      - itemsets: frozenset, the maximal frequent itemset
    """
```
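
Reporting only maximal itemsets compresses the output, since every frequent itemset is a subset of some maximal one. The maximality check itself is a simple filter; a pure-Python sketch over an assumed collection of frequent itemsets:

```python
# Keep only the maximal frequent itemsets: those with no strict superset
# that is also frequent. The frequent itemsets below are assumed for
# illustration (in practice they come from a miner such as apriori).
frequent = [
    frozenset({'A'}), frozenset({'B'}), frozenset({'C'}),
    frozenset({'A', 'B'}), frozenset({'B', 'C'}),
    frozenset({'A', 'B', 'C'}),
]

maximal = [s for s in frequent
           if not any(s < other for other in frequent)]
print(maximal)  # → [frozenset({'A', 'B', 'C'})]
```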

### H-mine Algorithm

H-mine algorithm for frequent itemset mining with memory optimization.
```python { .api }
def hmine(df, min_support=0.5, use_colnames=False, verbose=0):
    """
    H-mine algorithm for frequent itemset mining.

    Parameters:
    - df: DataFrame or array-like, binary transaction matrix
    - min_support: float, minimum support threshold (0.0 to 1.0)
    - use_colnames: bool, use column names in output instead of indices
    - verbose: int, verbosity level (0=silent, 1=progress bar)

    Returns:
    - frequent_itemsets: DataFrame with columns ['support', 'itemsets']
      - support: float, support value of the itemset
      - itemsets: frozenset, the frequent itemset
    """
```

### Association Rules Generation

Generate association rules from frequent itemsets with various interest measures.
```python { .api }
def association_rules(df, metric="confidence", min_threshold=0.8, support_only=False,
                      num_itemsets=None, verbose=0):
    """
    Generate association rules from frequent itemsets.

    Parameters:
    - df: DataFrame, frequent itemsets with 'support' and 'itemsets' columns
    - metric: str, interest measure for rule evaluation:
      - 'confidence': confidence measure
      - 'lift': lift measure
      - 'support': support measure
      - 'leverage': leverage measure
      - 'conviction': conviction measure
      - 'zhangs_metric': Zhang's metric
    - min_threshold: float, minimum threshold for the specified metric
    - support_only: bool, only compute the support metric (faster)
    - num_itemsets: int, number of itemsets to use (uses all if None)
    - verbose: int, verbosity level

    Returns:
    - rules: DataFrame with columns:
      - 'antecedents': frozenset, left-hand side of rule
      - 'consequents': frozenset, right-hand side of rule
      - 'antecedent support': float, support of antecedent
      - 'consequent support': float, support of consequent
      - 'support': float, support of the rule (antecedent ∪ consequent)
      - 'confidence': float, confidence of the rule
      - 'lift': float, lift of the rule
      - 'leverage': float, leverage of the rule
      - 'conviction': float, conviction of the rule
      - 'zhangs_metric': float, Zhang's metric
    """
```
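
Conceptually, rule generation splits each frequent itemset: every non-empty proper subset becomes an antecedent and the remainder the consequent, and the chosen metric is thresholded. A pure-Python sketch of that enumeration (the support values are assumed for the demo, not computed from real data):

```python
# Enumerate candidate rules from one frequent itemset and keep those whose
# confidence clears a threshold, mirroring metric="confidence".
from itertools import combinations

supports = {  # hypothetical supports for a toy dataset
    frozenset({'beer'}): 0.6,
    frozenset({'diapers'}): 0.8,
    frozenset({'beer', 'diapers'}): 0.6,
}

itemset = frozenset({'beer', 'diapers'})
for r in range(1, len(itemset)):
    for antecedent in map(frozenset, combinations(itemset, r)):
        consequent = itemset - antecedent
        confidence = supports[itemset] / supports[antecedent]
        if confidence >= 0.8:  # min_threshold of 0.8 on confidence
            print(f"{set(antecedent)} -> {set(consequent)}: "
                  f"confidence={confidence:.2f}")
```

With these numbers only {beer} → {diapers} survives (confidence 1.0); the reverse rule has confidence 0.75 and is filtered out.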

## Interest Measures

The association rules generation supports multiple interest measures for evaluating rule quality:

### Confidence
```
confidence(A → B) = support(A ∪ B) / support(A)
```
Measures the reliability of the inference made by the rule.

### Lift
```
lift(A → B) = confidence(A → B) / support(B)
```
Measures how much more likely B is given A, compared to B occurring independently.

### Support
```
support(A → B) = support(A ∪ B)
```
Measures how frequently the itemset appears in the dataset.

### Leverage
```
leverage(A → B) = support(A ∪ B) - support(A) × support(B)
```
Measures the difference between observed and expected support if A and B were independent.

### Conviction
```
conviction(A → B) = (1 - support(B)) / (1 - confidence(A → B))
```
Measures the ratio of the expected frequency that A occurs without B (if A and B were independent) to the observed frequency of such incorrect predictions. A value of 1 indicates independence; higher values indicate stronger rules.
170
171
### Zhang's Metric
172
```
173
zhang(A → B) = (confidence(A → B) - support(B)) / max(confidence(A → B), support(B)) × (1 - max(confidence(A → B), support(B)))
174
```
175
Balances statistical significance and practical significance.

## Usage Examples

### Basic Market Basket Analysis
```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Sample transaction data
transactions = [
    ['bread', 'milk'],
    ['bread', 'diapers', 'beer', 'eggs'],
    ['milk', 'diapers', 'beer', 'cola'],
    ['bread', 'milk', 'diapers', 'beer'],
    ['bread', 'milk', 'diapers', 'cola']
]

# Encode transactions as a binary matrix
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

print("Transaction Matrix:")
print(df)

# Find frequent itemsets using Apriori
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
print("\nFrequent Itemsets:")
print(frequent_itemsets)

# Generate association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```

### Comparing Different Algorithms
```python
import time

import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, fpmax
from mlxtend.preprocessing import TransactionEncoder

# Sample transaction dataset
transactions = [
    ['A', 'B', 'C'],
    ['A', 'C'],
    ['B', 'C', 'D'],
    ['A', 'B', 'C', 'D'],
    ['A', 'C'],
    ['B', 'C'],
    ['A', 'B', 'C'],
    ['A', 'B', 'D'],
    ['B', 'C', 'E'],
    ['A', 'B', 'C']
]

# Encode transactions
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

min_support = 0.3

# Compare algorithm performance. Note that FPMax returns only maximal
# itemsets, so its itemset count is not directly comparable to the others.
algorithms = {
    'Apriori': apriori,
    'FP-Growth': fpgrowth,
    'FPMax': fpmax
}

for name, algorithm in algorithms.items():
    start_time = time.time()
    frequent_itemsets = algorithm(df, min_support=min_support, use_colnames=True)
    end_time = time.time()

    print(f"\n{name}:")
    print(f"Execution time: {end_time - start_time:.4f} seconds")
    print(f"Number of frequent itemsets: {len(frequent_itemsets)}")
    print("Sample itemsets:")
    print(frequent_itemsets.head())
```

### Advanced Rule Analysis
```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Sample grocery transaction data
transactions = [
    ['milk', 'eggs', 'bread', 'cheese'],
    ['eggs', 'bread'],
    ['milk', 'bread'],
    ['beer', 'chips', 'cheese', 'nuts'],
    ['beer', 'chips'],
    ['milk', 'eggs', 'bread', 'butter'],
    ['beer', 'chips', 'nuts'],
    ['milk', 'cheese'],
    ['bread', 'butter'],
    ['beer', 'nuts']
]

# Encode and find frequent itemsets
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)

# Generate rules with different metrics
metrics = ['confidence', 'lift', 'leverage', 'conviction']
thresholds = [0.5, 1.0, 0.1, 1.0]

for metric, threshold in zip(metrics, thresholds):
    rules = association_rules(frequent_itemsets, metric=metric, min_threshold=threshold)
    print(f"\nRules with {metric} >= {threshold}:")
    print(f"Number of rules: {len(rules)}")

    if len(rules) > 0:
        # Sort by the metric and show the top rules
        top_rules = rules.nlargest(3, metric)
        for idx, rule in top_rules.iterrows():
            antecedent = ', '.join(rule['antecedents'])
            consequent = ', '.join(rule['consequents'])
            print(f"  {antecedent} → {consequent}")
            print(f"    Support: {rule['support']:.3f}, Confidence: {rule['confidence']:.3f}")
            print(f"    Lift: {rule['lift']:.3f}, {metric.title()}: {rule[metric]:.3f}")
```

### Working with Pandas DataFrame Input
```python
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Create a binary transaction matrix directly. Cast to bool:
# mlxtend expects boolean values rather than 0/1 integers.
data = {
    'Apple': [1, 0, 1, 1, 0],
    'Banana': [1, 1, 1, 0, 1],
    'Orange': [0, 1, 1, 1, 0],
    'Grapes': [1, 0, 0, 1, 1],
    'Mango': [0, 1, 0, 0, 1]
}

df = pd.DataFrame(data).astype(bool)
print("Binary Transaction Matrix:")
print(df)

# Mine frequent itemsets
frequent_itemsets = fpgrowth(df, min_support=0.4, use_colnames=True)
print("\nFrequent Itemsets:")
print(frequent_itemsets.sort_values('support', ascending=False))

# Generate and analyze rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print("\nAssociation Rules:")
for idx, rule in rules.iterrows():
    antecedent = ', '.join(rule['antecedents'])
    consequent = ', '.join(rule['consequents'])
    print(f"{antecedent} → {consequent}")
    print(f"  Confidence: {rule['confidence']:.3f}, Lift: {rule['lift']:.3f}")
```

## Performance Considerations

- **Apriori**: Good for educational purposes and small datasets. Generates many candidate itemsets.
- **FP-Growth**: More efficient than Apriori, especially for dense datasets with many frequent items.
- **FPMax**: Fastest for finding only maximal frequent itemsets, but provides less detailed information.
- **H-mine**: Memory-efficient variant suitable for large datasets with memory constraints.

Choose the algorithm based on your specific requirements:
- Use **Apriori** for learning and small datasets
- Use **FP-Growth** for general-purpose frequent itemset mining
- Use **FPMax** when you only need maximal itemsets
- Use **H-mine** for memory-constrained environments