# Pattern Mining

Association rule mining and frequent pattern discovery algorithms for transaction data analysis. These algorithms are commonly used in market basket analysis, web usage mining, and bioinformatics.

## Capabilities

### Apriori Algorithm

Classic algorithm for frequent itemset mining using the downward closure property (every subset of a frequent itemset must itself be frequent, so candidates with an infrequent subset can be pruned).

```python { .api }
def apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0,
            low_memory=False):
    """
    Apriori algorithm for frequent itemset mining.

    Parameters:
    - df: DataFrame or array-like, binary transaction matrix
    - min_support: float, minimum support threshold (0.0 to 1.0)
    - use_colnames: bool, use column names in output instead of indices
    - max_len: int, maximum length of itemsets to evaluate
    - verbose: int, verbosity level (0=silent, 1=progress bar)
    - low_memory: bool, use memory-efficient implementation

    Returns:
    - frequent_itemsets: DataFrame with columns ['support', 'itemsets']
      - support: float, support value of the itemset
      - itemsets: frozenset, the frequent itemset
    """
```

### FP-Growth Algorithm

Frequent Pattern Growth algorithm for efficient frequent itemset mining without candidate generation.

```python { .api }
def fpgrowth(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0):
    """
    FP-Growth algorithm for frequent itemset mining.

    Parameters:
    - df: DataFrame or array-like, binary transaction matrix
    - min_support: float, minimum support threshold (0.0 to 1.0)
    - use_colnames: bool, use column names in output instead of indices
    - max_len: int, maximum length of itemsets to evaluate
    - verbose: int, verbosity level (0=silent, 1=progress bar)

    Returns:
    - frequent_itemsets: DataFrame with columns ['support', 'itemsets']
      - support: float, support value of the itemset
      - itemsets: frozenset, the frequent itemset
    """
```

### FPMax Algorithm

FPMax algorithm for mining maximal frequent itemsets efficiently.

```python { .api }
def fpmax(df, min_support=0.5, use_colnames=False, verbose=0):
    """
    FPMax algorithm for maximal frequent itemset mining.

    Parameters:
    - df: DataFrame or array-like, binary transaction matrix
    - min_support: float, minimum support threshold (0.0 to 1.0)
    - use_colnames: bool, use column names in output instead of indices
    - verbose: int, verbosity level (0=silent, 1=progress bar)

    Returns:
    - frequent_itemsets: DataFrame with columns ['support', 'itemsets']
      - support: float, support value of the itemset
      - itemsets: frozenset, the maximal frequent itemset
    """
```

### H-mine Algorithm

H-mine algorithm for frequent itemset mining with memory optimization.

```python { .api }
def hmine(df, min_support=0.5, use_colnames=False, verbose=0):
    """
    H-mine algorithm for frequent itemset mining.

    Parameters:
    - df: DataFrame or array-like, binary transaction matrix
    - min_support: float, minimum support threshold (0.0 to 1.0)
    - use_colnames: bool, use column names in output instead of indices
    - verbose: int, verbosity level (0=silent, 1=progress bar)

    Returns:
    - frequent_itemsets: DataFrame with columns ['support', 'itemsets']
      - support: float, support value of the itemset
      - itemsets: frozenset, the frequent itemset
    """
```

### Association Rules Generation

Generate association rules from frequent itemsets with various interest measures.

```python { .api }
def association_rules(df, metric="confidence", min_threshold=0.8, support_only=False,
                      num_itemsets=None, verbose=0):
    """
    Generate association rules from frequent itemsets.

    Parameters:
    - df: DataFrame, frequent itemsets with 'support' and 'itemsets' columns
    - metric: str, interest measure for rule evaluation:
      - 'confidence': confidence measure
      - 'lift': lift measure
      - 'support': support measure
      - 'leverage': leverage measure
      - 'conviction': conviction measure
      - 'zhangs_metric': Zhang's metric
    - min_threshold: float, minimum threshold for the specified metric
    - support_only: bool, only compute the support metric (faster)
    - num_itemsets: int, number of itemsets to use (uses all if None)
    - verbose: int, verbosity level

    Returns:
    - rules: DataFrame with columns:
      - 'antecedents': frozenset, left-hand side of rule
      - 'consequents': frozenset, right-hand side of rule
      - 'antecedent support': float, support of antecedent
      - 'consequent support': float, support of consequent
      - 'support': float, support of the rule (antecedent ∪ consequent)
      - 'confidence': float, confidence of the rule
      - 'lift': float, lift of the rule
      - 'leverage': float, leverage of the rule
      - 'conviction': float, conviction of the rule
      - 'zhangs_metric': float, Zhang's metric
    """
```

## Interest Measures

The association rules generation supports multiple interest measures for evaluating rule quality:

### Confidence
```
confidence(A → B) = support(A ∪ B) / support(A)
```
Measures the reliability of the inference made by the rule.

### Lift
```
lift(A → B) = confidence(A → B) / support(B)
```
Measures how much more likely B is given A, compared to B occurring independently.

### Support
```
support(A → B) = support(A ∪ B)
```
Measures how frequently the itemset appears in the dataset.

### Leverage
```
leverage(A → B) = support(A ∪ B) - support(A) × support(B)
```
Measures the difference between observed and expected support if A and B were independent.

### Conviction
```
conviction(A → B) = (1 - support(B)) / (1 - confidence(A → B))
```

Compares the expected frequency of A occurring without B under independence against the observed frequency of incorrect predictions; values greater than 1 indicate a stronger rule.

### Zhang's Metric
```
zhang(A → B) = leverage(A → B) / max(support(A ∪ B) × (1 - support(A)), support(A) × (support(B) - support(A ∪ B)))
```
Balances statistical significance and practical significance; it ranges from -1 (perfect negative association) to +1 (perfect positive association).
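
The measures above can be verified by hand on a toy transaction list. A self-contained sketch that computes confidence, lift, leverage, and conviction for the rule {bread} → {milk} directly from the support definitions (the transactions are illustrative):

```python
# Hand-computing interest measures for the rule {bread} -> {milk}
transactions = [
    {'bread', 'milk'},
    {'bread', 'diapers', 'beer', 'eggs'},
    {'milk', 'diapers', 'beer', 'cola'},
    {'bread', 'milk', 'diapers', 'beer'},
    {'bread', 'milk', 'diapers', 'cola'},
]
n = len(transactions)

def support(items):
    """Fraction of transactions containing every item in `items`."""
    return sum(items <= t for t in transactions) / n

s_a = support({'bread'})           # 4/5 = 0.8
s_b = support({'milk'})            # 4/5 = 0.8
s_ab = support({'bread', 'milk'})  # 3/5 = 0.6

confidence = s_ab / s_a                    # 0.75
lift = confidence / s_b                    # 0.9375
leverage = s_ab - s_a * s_b                # -0.04
conviction = (1 - s_b) / (1 - confidence)  # 0.8

print(confidence, lift, leverage, conviction)
```

Note that lift below 1 and negative leverage both say the same thing here: bread and milk co-occur slightly less often than independence would predict.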

## Usage Examples

### Basic Market Basket Analysis

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Sample transaction data
transactions = [
    ['bread', 'milk'],
    ['bread', 'diapers', 'beer', 'eggs'],
    ['milk', 'diapers', 'beer', 'cola'],
    ['bread', 'milk', 'diapers', 'beer'],
    ['bread', 'milk', 'diapers', 'cola'],
]

# Encode transactions as a binary matrix
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

print("Transaction Matrix:")
print(df)

# Find frequent itemsets using Apriori
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
print("\nFrequent Itemsets:")
print(frequent_itemsets)

# Generate association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```

### Comparing Different Algorithms

```python
import time

import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, fpmax
from mlxtend.preprocessing import TransactionEncoder

# Sample transaction dataset
transactions = [
    ['A', 'B', 'C'],
    ['A', 'C'],
    ['B', 'C', 'D'],
    ['A', 'B', 'C', 'D'],
    ['A', 'C'],
    ['B', 'C'],
    ['A', 'B', 'C'],
    ['A', 'B', 'D'],
    ['B', 'C', 'E'],
    ['A', 'B', 'C'],
]

# Encode transactions
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

min_support = 0.3

# Compare algorithm runtimes; note that FPMax returns only maximal
# itemsets, so its itemset count is not directly comparable
algorithms = {
    'Apriori': apriori,
    'FP-Growth': fpgrowth,
    'FPMax': fpmax,
}

for name, algorithm in algorithms.items():
    start_time = time.time()
    frequent_itemsets = algorithm(df, min_support=min_support, use_colnames=True)
    end_time = time.time()

    print(f"\n{name}:")
    print(f"Execution time: {end_time - start_time:.4f} seconds")
    print(f"Number of frequent itemsets: {len(frequent_itemsets)}")
    print("Sample itemsets:")
    print(frequent_itemsets.head())
```

### Advanced Rule Analysis

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Create sample grocery transaction data
transactions = [
    ['milk', 'eggs', 'bread', 'cheese'],
    ['eggs', 'bread'],
    ['milk', 'bread'],
    ['beer', 'chips', 'cheese', 'nuts'],
    ['beer', 'chips'],
    ['milk', 'eggs', 'bread', 'butter'],
    ['beer', 'chips', 'nuts'],
    ['milk', 'cheese'],
    ['bread', 'butter'],
    ['beer', 'nuts'],
]

# Encode and find frequent itemsets
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)

# Generate rules with different metrics
metrics = ['confidence', 'lift', 'leverage', 'conviction']
thresholds = [0.5, 1.0, 0.1, 1.0]

for metric, threshold in zip(metrics, thresholds):
    rules = association_rules(frequent_itemsets, metric=metric, min_threshold=threshold)
    print(f"\nRules with {metric} >= {threshold}:")
    print(f"Number of rules: {len(rules)}")

    if len(rules) > 0:
        # Sort by the metric and show the top rules
        top_rules = rules.nlargest(3, metric)
        for idx, rule in top_rules.iterrows():
            antecedent = ', '.join(rule['antecedents'])
            consequent = ', '.join(rule['consequents'])
            print(f"  {antecedent} → {consequent}")
            print(f"    Support: {rule['support']:.3f}, Confidence: {rule['confidence']:.3f}")
            print(f"    Lift: {rule['lift']:.3f}, {metric.title()}: {rule[metric]:.3f}")
```

### Working with Pandas DataFrame Input

```python
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Create a binary transaction matrix directly; mlxtend expects
# boolean values, so cast the 0/1 entries to bool
data = {
    'Apple':  [1, 0, 1, 1, 0],
    'Banana': [1, 1, 1, 0, 1],
    'Orange': [0, 1, 1, 1, 0],
    'Grapes': [1, 0, 0, 1, 1],
    'Mango':  [0, 1, 0, 0, 1],
}

df = pd.DataFrame(data).astype(bool)
print("Binary Transaction Matrix:")
print(df)

# Mine frequent itemsets
frequent_itemsets = fpgrowth(df, min_support=0.4, use_colnames=True)
print("\nFrequent Itemsets:")
print(frequent_itemsets.sort_values('support', ascending=False))

# Generate and analyze rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print("\nAssociation Rules:")
for idx, rule in rules.iterrows():
    antecedent = ', '.join(rule['antecedents'])
    consequent = ', '.join(rule['consequents'])
    print(f"{antecedent} → {consequent}")
    print(f"  Confidence: {rule['confidence']:.3f}, Lift: {rule['lift']:.3f}")
```

## Performance Considerations

- **Apriori**: Good for educational purposes and small datasets; generates many candidate itemsets.
- **FP-Growth**: More efficient than Apriori, especially for dense datasets with many frequent items.
- **FPMax**: Fastest when only maximal frequent itemsets are needed, but provides less detailed information.
- **H-mine**: Memory-efficient variant suitable for large datasets under memory constraints.

Choose the algorithm based on your specific requirements:

- Use **Apriori** for learning and small datasets
- Use **FP-Growth** for general-purpose frequent itemset mining
- Use **FPMax** when you only need maximal itemsets
- Use **H-mine** for memory-constrained environments