Machine Learning Library Extensions (mlxtend): essential tools for day-to-day data science tasks
Association rule mining and frequent pattern discovery algorithms for transaction data analysis. These algorithms are commonly used in market basket analysis, web usage mining, and bioinformatics.
Classic algorithm for frequent itemset mining using the downward closure property.
def apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0,
            low_memory=False):
"""
Apriori algorithm for frequent itemset mining.
Parameters:
- df: DataFrame or array-like, binary transaction matrix
- min_support: float, minimum support threshold (0.0 to 1.0)
- use_colnames: bool, use column names in output instead of indices
- max_len: int, maximum length of itemsets to evaluate
- verbose: int, verbosity level (0=silent, 1=progress bar)
- low_memory: bool, use memory-efficient implementation
Returns:
- frequent_itemsets: DataFrame with columns ['support', 'itemsets']
- support: float, support value of the itemset
- itemsets: frozenset, the frequent itemset
"""Frequent Pattern Growth algorithm for efficient frequent itemset mining without candidate generation.
def fpgrowth(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0):
"""
FP-Growth algorithm for frequent itemset mining.
Parameters:
- df: DataFrame or array-like, binary transaction matrix
- min_support: float, minimum support threshold (0.0 to 1.0)
- use_colnames: bool, use column names in output instead of indices
- max_len: int, maximum length of itemsets to evaluate
- verbose: int, verbosity level (0=silent, 1=progress bar)
Returns:
- frequent_itemsets: DataFrame with columns ['support', 'itemsets']
- support: float, support value of the itemset
- itemsets: frozenset, the frequent itemset
"""FPMax algorithm for mining maximal frequent itemsets efficiently.
def fpmax(df, min_support=0.5, use_colnames=False, verbose=0):
"""
FPMax algorithm for maximal frequent itemset mining.
Parameters:
- df: DataFrame or array-like, binary transaction matrix
- min_support: float, minimum support threshold (0.0 to 1.0)
- use_colnames: bool, use column names in output instead of indices
- verbose: int, verbosity level (0=silent, 1=progress bar)
Returns:
- frequent_itemsets: DataFrame with columns ['support', 'itemsets']
- support: float, support value of the itemset
- itemsets: frozenset, the maximal frequent itemset
"""H-mine algorithm for frequent itemset mining with memory optimization.
def hmine(df, min_support=0.5, use_colnames=False, verbose=0):
"""
H-mine algorithm for frequent itemset mining.
Parameters:
- df: DataFrame or array-like, binary transaction matrix
- min_support: float, minimum support threshold (0.0 to 1.0)
- use_colnames: bool, use column names in output instead of indices
- verbose: int, verbosity level (0=silent, 1=progress bar)
Returns:
- frequent_itemsets: DataFrame with columns ['support', 'itemsets']
- support: float, support value of the itemset
- itemsets: frozenset, the frequent itemset
"""Generate association rules from frequent itemsets with various interest measures.
def association_rules(df, metric="confidence", min_threshold=0.8, support_only=False,
                      num_itemsets=None, verbose=0):
"""
Generate association rules from frequent itemsets.
Parameters:
- df: DataFrame, frequent itemsets with 'support' and 'itemsets' columns
- metric: str, interest measure for rule evaluation:
- 'confidence': confidence measure
- 'lift': lift measure
- 'support': support measure
- 'leverage': leverage measure
- 'conviction': conviction measure
- 'zhangs_metric': Zhang's metric
- min_threshold: float, minimum threshold for the specified metric
- support_only: bool, only computes support metric (faster)
- num_itemsets: int, number of itemsets to use (uses all if None)
- verbose: int, verbosity level
Returns:
- rules: DataFrame with columns:
- 'antecedents': frozenset, left-hand side of rule
- 'consequents': frozenset, right-hand side of rule
- 'antecedent support': float, support of antecedent
- 'consequent support': float, support of consequent
- 'support': float, support of the rule (antecedent ∪ consequent)
- 'confidence': float, confidence of the rule
- 'lift': float, lift of the rule
- 'leverage': float, leverage of the rule
- 'conviction': float, conviction of the rule
- 'zhangs_metric': float, Zhang's metric
"""The association rules generation supports multiple interest measures for evaluating rule quality:
confidence(A → B) = support(A ∪ B) / support(A)
  Measures the reliability of the inference made by the rule.

lift(A → B) = confidence(A → B) / support(B)
  Measures how much more likely B is given A, compared to B occurring independently.

support(A → B) = support(A ∪ B)
  Measures how frequently the itemset appears in the dataset.

leverage(A → B) = support(A ∪ B) - support(A) × support(B)
  Measures the difference between observed support and the support expected if A and B were independent.

conviction(A → B) = (1 - support(B)) / (1 - confidence(A → B))
  Ratio of the expected frequency of A occurring without B (assuming independence) to the observed frequency of A without B.

zhang(A → B) = leverage(A → B) / max(support(A ∪ B) × (1 - support(A)), support(A) × (support(B) - support(A ∪ B)))
  Balances statistical and practical significance; ranges from -1 (dissociation) to 1 (association).
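As a quick sanity check of these formulas, the measures can be computed by hand from raw support values in plain Python, independent of mlxtend. The support numbers below are assumed for illustration, and Zhang's metric is written in the leverage-based form:

```python
# Toy support values (assumed): in 100 transactions,
# A appears in 60, B in 50, and A and B together in 40.
support_A = 0.60
support_B = 0.50
support_AB = 0.40

confidence = support_AB / support_A                  # 0.4 / 0.6 ≈ 0.667
lift = confidence / support_B                        # ≈ 1.333
leverage = support_AB - support_A * support_B        # 0.40 - 0.30 = 0.100
conviction = (1 - support_B) / (1 - confidence)      # 0.5 / 0.333 = 1.500
# Leverage-based form of Zhang's metric
zhang = leverage / max(support_AB * (1 - support_A),
                       support_A * (support_B - support_AB))  # 0.1 / 0.16 = 0.625

print(f"confidence = {confidence:.3f}")
print(f"lift       = {lift:.3f}")
print(f"leverage   = {leverage:.3f}")
print(f"conviction = {conviction:.3f}")
print(f"zhang      = {zhang:.3f}")
```

Since lift > 1, leverage > 0, and conviction > 1 here, all measures agree that A and B co-occur more often than independence would predict.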
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
# Sample transaction data
transactions = [
    ['bread', 'milk'],
    ['bread', 'diapers', 'beer', 'eggs'],
    ['milk', 'diapers', 'beer', 'cola'],
    ['bread', 'milk', 'diapers', 'beer'],
    ['bread', 'milk', 'diapers', 'cola']
]
# Encode transactions as binary matrix
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
print("Transaction Matrix:")
print(df)
# Find frequent itemsets using Apriori
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
print("\nFrequent Itemsets:")
print(frequent_itemsets)
# Generate association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, fpmax
from mlxtend.preprocessing import TransactionEncoder
import time
# Sample large transaction dataset
transactions = [
    ['A', 'B', 'C'],
    ['A', 'C'],
    ['B', 'C', 'D'],
    ['A', 'B', 'C', 'D'],
    ['A', 'C'],
    ['B', 'C'],
    ['A', 'B', 'C'],
    ['A', 'B', 'D'],
    ['B', 'C', 'E'],
    ['A', 'B', 'C']
]
# Encode transactions
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
min_support = 0.3
# Compare algorithm performance
algorithms = {
    'Apriori': apriori,
    'FP-Growth': fpgrowth,
    'FPMax': fpmax
}
for name, algorithm in algorithms.items():
    start_time = time.time()
    frequent_itemsets = algorithm(df, min_support=min_support, use_colnames=True)
    end_time = time.time()
    print(f"\n{name}:")
    print(f"Execution time: {end_time - start_time:.4f} seconds")
    print(f"Number of frequent itemsets: {len(frequent_itemsets)}")
    print("Sample itemsets:")
    print(frequent_itemsets.head())

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
# Create sample grocery transaction data
transactions = [
    ['milk', 'eggs', 'bread', 'cheese'],
    ['eggs', 'bread'],
    ['milk', 'bread'],
    ['beer', 'chips', 'cheese', 'nuts'],
    ['beer', 'chips'],
    ['milk', 'eggs', 'bread', 'butter'],
    ['beer', 'chips', 'nuts'],
    ['milk', 'cheese'],
    ['bread', 'butter'],
    ['beer', 'nuts']
]
# Encode and find frequent itemsets
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
# Generate rules with different metrics
metrics = ['confidence', 'lift', 'leverage', 'conviction']
thresholds = [0.5, 1.0, 0.1, 1.0]
for metric, threshold in zip(metrics, thresholds):
    rules = association_rules(frequent_itemsets, metric=metric, min_threshold=threshold)
    print(f"\nRules with {metric} >= {threshold}:")
    print(f"Number of rules: {len(rules)}")
    if len(rules) > 0:
        # Sort by the metric and show top rules
        top_rules = rules.nlargest(3, metric)
        for idx, rule in top_rules.iterrows():
            antecedent = ', '.join(list(rule['antecedents']))
            consequent = ', '.join(list(rule['consequents']))
            print(f"  {antecedent} → {consequent}")
            print(f"    Support: {rule['support']:.3f}, Confidence: {rule['confidence']:.3f}")
            print(f"    Lift: {rule['lift']:.3f}, {metric.title()}: {rule[metric]:.3f}")

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules
# Create binary transaction matrix directly
data = {
    'Apple': [1, 0, 1, 1, 0],
    'Banana': [1, 1, 1, 0, 1],
    'Orange': [0, 1, 1, 1, 0],
    'Grapes': [1, 0, 0, 1, 1],
    'Mango': [0, 1, 0, 0, 1]
}
df = pd.DataFrame(data)
print("Binary Transaction Matrix:")
print(df)
# Mine frequent itemsets
frequent_itemsets = fpgrowth(df, min_support=0.4, use_colnames=True)
print("\nFrequent Itemsets:")
print(frequent_itemsets.sort_values('support', ascending=False))
# Generate and analyze rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print("\nAssociation Rules:")
for idx, rule in rules.iterrows():
    antecedent = ', '.join(list(rule['antecedents']))
    consequent = ', '.join(list(rule['consequents']))
    print(f"{antecedent} → {consequent}")
    print(f"  Confidence: {rule['confidence']:.3f}, Lift: {rule['lift']:.3f}")

Choose the algorithm based on your specific requirements:
- apriori: simple and easy to interpret; a solid default for small to medium datasets
- fpgrowth: typically faster and more scalable, since it avoids candidate generation
- fpmax: returns only maximal frequent itemsets, giving a compact summary of the result
- hmine: memory-efficient mining, useful when memory is the bottleneck
Install with Tessl CLI
npx tessl i tessl/pypi-mlxtend