# Preprocessing

Preprocessing techniques transform features before model training to reduce their correlation with sensitive attributes, addressing fairness at the data preparation stage and lowering the potential for discriminatory outcomes.

## Capabilities

### CorrelationRemover

Removes linear correlations between non-sensitive features and sensitive attributes using a linear projection. This preprocessing step prevents a model from recovering sensitive attributes through linear relationships in the remaining features; note that nonlinear dependence may survive the projection.

```python { .api }
class CorrelationRemover:
    def __init__(self, *, sensitive_feature_ids, alpha=1.0):
        """
        Remove correlations between features and sensitive attributes.

        Parameters:
        - sensitive_feature_ids: list of int or str, indices or names of sensitive features
        - alpha: float, strength of correlation removal (0.0 = no removal, 1.0 = full removal)
        """

    def fit(self, X, y=None):
        """
        Learn the transformation to remove correlations.

        Parameters:
        - X: array-like or DataFrame, feature matrix including sensitive features
        - y: array-like, target values (unused, present for sklearn compatibility)

        Returns:
        self
        """

    def transform(self, X):
        """
        Apply the correlation removal transformation.

        Parameters:
        - X: array-like or DataFrame, feature matrix to transform

        Returns:
        array-like: Non-sensitive features with their linear correlation to the
        sensitive features removed; the sensitive columns themselves are dropped
        from the output
        """

    def fit_transform(self, X, y=None):
        """
        Fit and transform the data in one step.

        Parameters:
        - X: array-like or DataFrame, feature matrix
        - y: array-like, target values (unused)

        Returns:
        array-like: Transformed non-sensitive features
        """

    @property
    def mean_(self):
        """Mean values used for centering during transformation."""

    @property
    def projection_matrix_(self):
        """Projection matrix used for correlation removal."""
```

#### Usage Example

```python
import pandas as pd
from fairlearn.preprocessing import CorrelationRemover
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load data with sensitive features included
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [2, 4, 6, 8, 10],
    'sensitive_gender': [0, 1, 0, 1, 0],
    'sensitive_age': [25, 35, 45, 30, 40]
})
target = [0, 1, 0, 1, 1]

# Specify which columns are sensitive
cr = CorrelationRemover(
    sensitive_feature_ids=['sensitive_gender', 'sensitive_age'],
    alpha=1.0  # Full correlation removal
)

# Fit and transform the data; the result keeps only the non-sensitive
# columns, with their linear correlation to the sensitive columns removed
data_transformed = cr.fit_transform(data)

# Continue with the normal ML pipeline
X_train, X_test, y_train, y_test = train_test_split(
    data_transformed, target, test_size=0.3, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
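
To verify the effect, compare a non-sensitive column's correlation with a sensitive column before and after the transform. A minimal check continuing the example above (it assumes the default array output, where column 0 of `data_transformed` is the transformed `feature1`):

```python
import numpy as np

# Before: raw correlation between feature1 and sensitive_gender
print(np.corrcoef(data['feature1'], data['sensitive_gender'])[0, 1])

# After: the transformed copy of feature1 against the same sensitive column
print(np.corrcoef(data_transformed[:, 0], data['sensitive_gender'])[0, 1])  # ~0 with alpha=1.0
```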

#### Working with Numeric Indices

```python
import numpy as np
from fairlearn.preprocessing import CorrelationRemover

# Data as a numpy array where columns 2 and 3 are sensitive
X = np.array([
    [1.0, 2.0, 0, 25],  # features + gender + age
    [2.0, 4.0, 1, 35],
    [3.0, 6.0, 0, 45],
    [4.0, 8.0, 1, 30]
])

# Use numeric indices for sensitive features
cr = CorrelationRemover(
    sensitive_feature_ids=[2, 3],  # Gender and age columns
    alpha=0.8  # Partial correlation removal
)

X_transformed = cr.fit_transform(X)  # shape (4, 2): only the non-sensitive columns remain
```

## Algorithm Details

### Correlation Removal Process

The CorrelationRemover works by:

1. **Centering**: Centers the sensitive features around their mean values
2. **Fitting**: Fits a linear least-squares regression of the non-sensitive features on the centered sensitive features
3. **Projection**: Subtracts the fitted component, leaving a residual that is linearly uncorrelated with the sensitive features
4. **Blending**: Mixes the residual with the original features according to the alpha parameter

The mathematical approach:

- Let X be the non-sensitive features and S the centered sensitive features
- The algorithm finds weights W minimizing ‖X − S·W‖ and forms the residual X̂ = X − S·W, which has zero linear correlation with S
- The output is alpha·X̂ + (1 − alpha)·X, so alpha controls the strength of the removal
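
A minimal NumPy sketch of this computation (an illustration of the idea, not the library's exact implementation; `X_ns` and `S` are hypothetical arrays holding the non-sensitive and sensitive columns):

```python
import numpy as np

def remove_linear_correlation(X_ns, S, alpha=1.0):
    """Sketch: remove linear correlation with S via least squares."""
    S_centered = S - S.mean(axis=0)               # center sensitive features
    # Weights W minimizing ||X_ns - S_centered @ W||
    W, *_ = np.linalg.lstsq(S_centered, X_ns, rcond=None)
    residual = X_ns - S_centered @ W              # linearly uncorrelated with S
    return alpha * residual + (1 - alpha) * X_ns  # blend by alpha
```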

### Hyperparameter Tuning

The `alpha` parameter controls the trade-off between fairness and utility:

- **alpha = 0.0**: No correlation removal (original features preserved)
- **alpha = 1.0**: Full correlation removal (may reduce predictive power)
- **alpha ∈ (0, 1)**: Partial correlation removal (a balanced approach)

```python
from fairlearn.preprocessing import CorrelationRemover
from sklearn.linear_model import LogisticRegression

# Test different alpha values, reusing X and the sensitive column
# indices from the numeric example above; y holds the targets
alphas = [0.0, 0.3, 0.6, 1.0]
results = {}

for alpha in alphas:
    cr = CorrelationRemover(sensitive_feature_ids=[2, 3], alpha=alpha)
    X_transformed = cr.fit_transform(X)

    # Train a model on the transformed features
    model = LogisticRegression()
    model.fit(X_transformed, y)

    # Store training accuracy for comparison; in practice, also compute
    # fairness metrics on held-out data
    results[alpha] = model.score(X_transformed, y)
```

## Integration with Scikit-learn

CorrelationRemover follows scikit-learn conventions and can be used in pipelines:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create a preprocessing pipeline; CorrelationRemover runs first, so later
# steps only ever see the transformed, non-sensitive columns
pipeline = Pipeline([
    ('correlation_removal', CorrelationRemover(sensitive_feature_ids=[2, 3])),
    ('scaling', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Fit the entire pipeline
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```

## Considerations and Limitations

### Data Requirements

- **Feature Types**: Works with continuous features, and with categorical features once they are encoded (see the sketch below)
- **Sensitive Features**: Can handle multiple sensitive attributes simultaneously
- **Sample Size**: More reliable on larger datasets, which yield stable correlation estimates
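
For categorical sensitive attributes, one-hot encode first and pass the indicator columns as the sensitive features. A sketch with hypothetical column names:

```python
import pandas as pd
from fairlearn.preprocessing import CorrelationRemover

df = pd.DataFrame({
    'income': [40_000, 55_000, 62_000, 48_000],
    'tenure': [2, 5, 7, 3],
    'ethnicity': ['a', 'b', 'a', 'c'],  # categorical sensitive attribute
})

# One-hot encode the categorical column, then treat the resulting
# indicator columns as the sensitive features
df_encoded = pd.get_dummies(df, columns=['ethnicity'], dtype=float)
sensitive_cols = [c for c in df_encoded.columns if c.startswith('ethnicity_')]

cr = CorrelationRemover(sensitive_feature_ids=sensitive_cols)
X_out = cr.fit_transform(df_encoded)  # only income and tenure remain, decorrelated
```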

### Fairness Trade-offs

- **Utility Loss**: Removing correlations may reduce predictive performance
- **Fairness Gain**: Reduces the model's ability to discriminate through linear relationships with sensitive attributes
- **Proxy Variables**: Cannot prevent discrimination through unmeasured proxy variables, nor through purely nonlinear dependence (illustrated below)
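
Because the projection is linear, a feature that depends on a sensitive attribute nonlinearly can pass through almost unchanged. A small synthetic illustration:

```python
import numpy as np
from fairlearn.preprocessing import CorrelationRemover

rng = np.random.default_rng(0)
s = rng.normal(size=1000)                 # sensitive feature
x = s ** 2 + 0.1 * rng.normal(size=1000)  # nonlinear proxy for s

X = np.column_stack([x, s])
x_t = CorrelationRemover(sensitive_feature_ids=[1]).fit_transform(X).ravel()

# The linear correlation with s was already near zero, so x is barely
# changed, yet the transformed feature still tracks the magnitude of s
print(np.corrcoef(x, s)[0, 1])            # ~0
print(np.corrcoef(x_t, np.abs(s))[0, 1])  # strongly positive
```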

### Best Practices

1. **Preprocessing Order**: Apply CorrelationRemover before other preprocessing steps that might reintroduce correlations
2. **Cross-validation**: Use cross-validation to select the alpha value
3. **Fairness Assessment**: Always evaluate both fairness and performance after preprocessing
4. **Domain Knowledge**: Use domain-specific relationships to decide which attributes to treat as sensitive

```python
# Recommended workflow
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from fairlearn.preprocessing import CorrelationRemover
from fairlearn.metrics import MetricFrame

# Grid search over alpha values
param_grid = {'correlation_removal__alpha': [0.0, 0.3, 0.6, 1.0]}

pipeline = Pipeline([
    ('correlation_removal', CorrelationRemover(sensitive_feature_ids=[2, 3])),
    ('classifier', LogisticRegression())
])

# GridSearchCV selects alpha by predictive accuracy alone;
# fairness is assessed separately below
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Evaluate the fairness of the best model
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)

# sensitive_features_test holds the sensitive columns of the test set
fairness_metrics = MetricFrame(
    metrics={'accuracy': accuracy_score},
    y_true=y_test,
    y_pred=predictions,
    sensitive_features=sensitive_features_test
)
```