# Data Preprocessing

Data transformation utilities for scaling, encoding, and array manipulation, compatible with scikit-learn pipelines.

## Capabilities

### Mean Centering

Center each feature column on zero by subtracting its mean.

```python { .api }
class MeanCenterer:
    def __init__(self):
        """Mean centering transformer"""

    def fit(self, X, y=None):
        """Compute the mean to be used for centering"""

    def transform(self, X):
        """Center data around the mean"""

    def fit_transform(self, X, y=None):
        """Fit and transform data"""

    col_means: "numpy.ndarray"  # Per-column means computed by fit()
```
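
As a rough illustration of what mean centering computes, the sketch below subtracts column means with plain NumPy. It is not mlxtend's implementation: the helper name `center_columns` is hypothetical, and it assumes a 2-D array with samples in rows and features in columns.

```python
import numpy as np


def center_columns(X):
    """Hypothetical helper: subtract each column's mean from that column."""
    X = np.asarray(X, dtype=float)
    col_means = X.mean(axis=0)  # one mean per feature (column)
    return X - col_means, col_means


X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
X_centered, means = center_columns(X)
print(means)       # column means: 2.0 and 20.0
print(X_centered)  # every column now averages to zero
```

Returning the means alongside the centered data mirrors the fit/transform split: the same means can later be subtracted from new data.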

### Transaction Encoding

Encode transaction data (lists of items) as a binary item matrix for frequent pattern mining algorithms.

```python { .api }
class TransactionEncoder:
    def __init__(self):
        """Encode transaction data to binary matrix format"""

    def fit(self, X):
        """Learn the unique items in the transaction dataset"""

    def transform(self, X):
        """Transform transactions to binary matrix"""

    def fit_transform(self, X):
        """Fit and transform transactions"""

    columns_: list  # Column names (unique items), set by fit()
```
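
To make the binary matrix format concrete, here is a minimal pure-Python sketch of the encoding idea. The helper `encode_transactions` is hypothetical and only illustrates the concept; it assumes the columns are the sorted unique items, and it is not the library's code.

```python
def encode_transactions(transactions):
    """Hypothetical helper: build a boolean item matrix from item lists."""
    # Sorted unique items become the column labels.
    columns = sorted({item for row in transactions for item in row})
    matrix = []
    for row in transactions:
        present = set(row)
        # One boolean per column: is this item in the transaction?
        matrix.append([item in present for item in columns])
    return matrix, columns


transactions = [["bread", "milk"], ["bread", "beer"], ["milk", "beer"]]
matrix, columns = encode_transactions(transactions)
print(columns)  # ['beer', 'bread', 'milk']
for row in matrix:
    print(row)
```

Each row corresponds to one transaction and each column to one item, which is the shape frequent pattern mining algorithms typically expect.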

### Scaling Functions

Scaling and standardization utilities for feature normalization.

```python { .api }
def standardize(array, columns=None, ddof=0):
    """
    Z-score standardization of features.

    Parameters:
    - array: array-like, input data
    - columns: list, columns to standardize (all if None)
    - ddof: int, delta degrees of freedom used when computing the standard deviation

    Returns:
    - standardized_array: array-like, standardized data
    """

def minmax_scaling(array, columns=None, min_val=0, max_val=1):
    """
    Min-max feature scaling to specified range.

    Parameters:
    - array: array-like, input data
    - columns: list, columns to scale (all if None)
    - min_val: float, minimum value of scaled range
    - max_val: float, maximum value of scaled range

    Returns:
    - scaled_array: array-like, scaled data
    """
```
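
The two formulas behind these functions can be sketched in a few lines of NumPy. The helpers `zscore` and `minmax` below are hypothetical stand-ins written for illustration, not the library functions; they assume a 2-D array and operate on every column.

```python
import numpy as np


def zscore(X, ddof=0):
    """Hypothetical helper: (x - mean) / std, computed per column."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=ddof)


def minmax(X, min_val=0.0, max_val=1.0):
    """Hypothetical helper: rescale each column linearly to [min_val, max_val]."""
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    unit = (X - col_min) / (col_max - col_min)  # each column now spans [0, 1]
    return unit * (max_val - min_val) + min_val


X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 400.0]])
X_z = zscore(X)
X_mm = minmax(X, min_val=-1.0, max_val=1.0)
print(X_z.mean(axis=0), X_z.std(axis=0))    # roughly 0 and 1 per column
print(X_mm.min(axis=0), X_mm.max(axis=0))   # -1 and 1 per column
```

This sketch does not guard against constant columns, where the standard deviation or the range is zero and the division is undefined.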

### Additional Transformers

Utility transformers and helper functions for data pipeline integration.

```python { .api }
class CopyTransformer:
    def __init__(self):
        """Identity transformer that copies input data"""

    def fit(self, X, y=None):
        """Fit transformer (no-op)"""

    def transform(self, X):
        """Return copy of input data"""

class DenseTransformer:
    def __init__(self):
        """Convert sparse matrices to dense format"""

    def fit(self, X, y=None):
        """Fit transformer (no-op)"""

    def transform(self, X):
        """Convert sparse matrix to dense"""

def one_hot(y, dtype=int):
    """
    One-hot encode categorical labels.

    Parameters:
    - y: array-like, categorical labels
    - dtype: data type for output array

    Returns:
    - encoded: array, one-hot encoded matrix
    """

def shuffle_arrays_unison(*arrays, random_seed=None):
    """
    Shuffle multiple arrays in unison.

    Parameters:
    - arrays: array-like objects to shuffle together
    - random_seed: int, random seed for reproducibility

    Returns:
    - shuffled_arrays: tuple of shuffled arrays
    """
```
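
For the two helper functions, the underlying ideas can be sketched with NumPy as follows. The names `one_hot_labels` and `shuffle_together` are hypothetical and this is not the library's code; the one-hot sketch assumes integer labels in the range 0..k-1, and the shuffle sketch assumes all arrays share the same length along the first axis.

```python
import numpy as np


def one_hot_labels(y, dtype=int):
    """Hypothetical helper: one-hot encode integer labels 0..k-1."""
    y = np.asarray(y)
    n_classes = int(y.max()) + 1
    encoded = np.zeros((y.shape[0], n_classes), dtype=dtype)
    encoded[np.arange(y.shape[0]), y] = 1  # set one column per row
    return encoded


def shuffle_together(*arrays, random_seed=None):
    """Hypothetical helper: apply one random permutation to every array."""
    rng = np.random.default_rng(random_seed)
    order = rng.permutation(len(arrays[0]))
    return tuple(np.asarray(a)[order] for a in arrays)


y = np.array([0, 2, 1, 2])
print(one_hot_labels(y))

X = np.arange(8).reshape(4, 2)
X_shuf, y_shuf = shuffle_together(X, y, random_seed=0)
print(X_shuf, y_shuf)  # rows stay paired with their labels
```

Using a single permutation for all arrays is what keeps samples and labels aligned after shuffling.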

## Usage Examples

```python
from mlxtend.preprocessing import TransactionEncoder, MeanCenterer, standardize
import pandas as pd
import numpy as np

# Transaction encoding example
transactions = [['bread', 'milk'], ['bread', 'beer'], ['milk', 'beer']]
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Mean centering example
X = np.random.randn(100, 5)
mc = MeanCenterer()
X_centered = mc.fit_transform(X)

# Standardization example
X_std = standardize(X)
```