# Preprocessing

Preprocessing techniques transform features before model training to reduce their correlation with sensitive attributes, addressing fairness at the data preparation stage and lowering the potential for discriminatory outcomes.

## Capabilities

### CorrelationRemover

Removes linear correlations between non-sensitive features and sensitive attributes using a linear projection. This preprocessing step prevents a model from recovering sensitive attributes through linear relationships in the remaining features; note that nonlinear dependence may survive the projection.

```python { .api }
class CorrelationRemover:
    def __init__(self, *, sensitive_feature_ids, alpha=1.0):
        """
        Remove correlations between features and sensitive attributes.

        Parameters:
        - sensitive_feature_ids: list of int or str, indices or names of sensitive features
        - alpha: float, strength of correlation removal (0.0 = no removal, 1.0 = full removal)
        """

    def fit(self, X, y=None):
        """
        Learn the transformation to remove correlations.

        Parameters:
        - X: array-like or DataFrame, feature matrix including sensitive features
        - y: array-like, target values (unused, present for sklearn compatibility)

        Returns:
        self
        """

    def transform(self, X):
        """
        Apply the correlation removal transformation.

        Parameters:
        - X: array-like or DataFrame, feature matrix to transform

        Returns:
        array-like: Non-sensitive features with their linear correlation to the
        sensitive features removed; the sensitive columns themselves are dropped
        from the output
        """

    def fit_transform(self, X, y=None):
        """
        Fit and transform the data in one step.

        Parameters:
        - X: array-like or DataFrame, feature matrix
        - y: array-like, target values (unused)

        Returns:
        array-like: Transformed non-sensitive features
        """

    @property
    def mean_(self):
        """Mean values used for centering during transformation."""

    @property
    def projection_matrix_(self):
        """Projection matrix used for correlation removal."""
```

#### Usage Example

```python
import pandas as pd
from fairlearn.preprocessing import CorrelationRemover
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load data with sensitive features included
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [2, 4, 6, 8, 10],
    'sensitive_gender': [0, 1, 0, 1, 0],
    'sensitive_age': [25, 35, 45, 30, 40]
})
target = [0, 1, 0, 1, 1]

# Specify which columns are sensitive
cr = CorrelationRemover(
    sensitive_feature_ids=['sensitive_gender', 'sensitive_age'],
    alpha=1.0  # Full correlation removal
)

# Fit and transform the data; the result keeps only the non-sensitive
# columns, with their linear correlation to the sensitive columns removed
data_transformed = cr.fit_transform(data)

# Continue with the normal ML pipeline
X_train, X_test, y_train, y_test = train_test_split(
    data_transformed, target, test_size=0.3, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```
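
To verify the effect, compare a non-sensitive column's correlation with a sensitive column before and after the transform. A minimal check continuing the example above (it assumes the default array output, where column 0 of `data_transformed` is the transformed `feature1`):

```python
import numpy as np

# Before: raw correlation between feature1 and sensitive_gender
print(np.corrcoef(data['feature1'], data['sensitive_gender'])[0, 1])

# After: the transformed copy of feature1 against the same sensitive column
print(np.corrcoef(data_transformed[:, 0], data['sensitive_gender'])[0, 1])  # ~0 with alpha=1.0
```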

#### Working with Numeric Indices

```python
import numpy as np
from fairlearn.preprocessing import CorrelationRemover

# Data as a numpy array where columns 2 and 3 are sensitive
X = np.array([
    [1.0, 2.0, 0, 25],  # features + gender + age
    [2.0, 4.0, 1, 35],
    [3.0, 6.0, 0, 45],
    [4.0, 8.0, 1, 30]
])

# Use numeric indices for sensitive features
cr = CorrelationRemover(
    sensitive_feature_ids=[2, 3],  # Gender and age columns
    alpha=0.8  # Partial correlation removal
)

X_transformed = cr.fit_transform(X)  # shape (4, 2): only the non-sensitive columns remain
```

## Algorithm Details

### Correlation Removal Process

The CorrelationRemover works by:

1. **Centering**: Centers the sensitive features around their mean values
2. **Fitting**: Fits a linear least-squares regression of the non-sensitive features on the centered sensitive features
3. **Projection**: Subtracts the fitted component, leaving a residual that is linearly uncorrelated with the sensitive features
4. **Blending**: Mixes the residual with the original features according to the alpha parameter

The mathematical approach:

- Let X be the non-sensitive features and S the centered sensitive features
- The algorithm finds weights W minimizing ‖X − S·W‖ and forms the residual X̂ = X − S·W, which has zero linear correlation with S
- The output is alpha·X̂ + (1 − alpha)·X, so alpha controls the strength of the removal
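
A minimal NumPy sketch of this computation (an illustration of the idea, not the library's exact implementation; `X_ns` and `S` are hypothetical arrays holding the non-sensitive and sensitive columns):

```python
import numpy as np

def remove_linear_correlation(X_ns, S, alpha=1.0):
    """Sketch: remove linear correlation with S via least squares."""
    S_centered = S - S.mean(axis=0)               # center sensitive features
    # Weights W minimizing ||X_ns - S_centered @ W||
    W, *_ = np.linalg.lstsq(S_centered, X_ns, rcond=None)
    residual = X_ns - S_centered @ W              # linearly uncorrelated with S
    return alpha * residual + (1 - alpha) * X_ns  # blend by alpha
```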

### Hyperparameter Tuning

The `alpha` parameter controls the trade-off between fairness and utility:

- **alpha = 0.0**: No correlation removal (original features preserved)
- **alpha = 1.0**: Full correlation removal (may reduce predictive power)
- **alpha ∈ (0, 1)**: Partial correlation removal (a balanced approach)

```python
from fairlearn.preprocessing import CorrelationRemover
from sklearn.linear_model import LogisticRegression

# Test different alpha values, reusing X and the sensitive column
# indices from the numeric example above; y holds the targets
alphas = [0.0, 0.3, 0.6, 1.0]
results = {}

for alpha in alphas:
    cr = CorrelationRemover(sensitive_feature_ids=[2, 3], alpha=alpha)
    X_transformed = cr.fit_transform(X)

    # Train a model on the transformed features
    model = LogisticRegression()
    model.fit(X_transformed, y)

    # Store training accuracy for comparison; in practice, also compute
    # fairness metrics on held-out data
    results[alpha] = model.score(X_transformed, y)
```

## Integration with Scikit-learn

CorrelationRemover follows scikit-learn conventions and can be used in pipelines:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create a preprocessing pipeline; CorrelationRemover runs first, so later
# steps only ever see the transformed, non-sensitive columns
pipeline = Pipeline([
    ('correlation_removal', CorrelationRemover(sensitive_feature_ids=[2, 3])),
    ('scaling', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Fit the entire pipeline
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```

## Considerations and Limitations

### Data Requirements

- **Feature Types**: Works with continuous features, and with categorical features once they are encoded (see the sketch below)
- **Sensitive Features**: Can handle multiple sensitive attributes simultaneously
- **Sample Size**: More reliable on larger datasets, which yield stable correlation estimates
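
For categorical sensitive attributes, one-hot encode first and pass the indicator columns as the sensitive features. A sketch with hypothetical column names:

```python
import pandas as pd
from fairlearn.preprocessing import CorrelationRemover

df = pd.DataFrame({
    'income': [40_000, 55_000, 62_000, 48_000],
    'tenure': [2, 5, 7, 3],
    'ethnicity': ['a', 'b', 'a', 'c'],  # categorical sensitive attribute
})

# One-hot encode the categorical column, then treat the resulting
# indicator columns as the sensitive features
df_encoded = pd.get_dummies(df, columns=['ethnicity'], dtype=float)
sensitive_cols = [c for c in df_encoded.columns if c.startswith('ethnicity_')]

cr = CorrelationRemover(sensitive_feature_ids=sensitive_cols)
X_out = cr.fit_transform(df_encoded)  # only income and tenure remain, decorrelated
```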

### Fairness Trade-offs

- **Utility Loss**: Removing correlations may reduce predictive performance
- **Fairness Gain**: Reduces the model's ability to discriminate through linear relationships with sensitive attributes
- **Proxy Variables**: Cannot prevent discrimination through unmeasured proxy variables, nor through purely nonlinear dependence (illustrated below)
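
Because the projection is linear, a feature that depends on a sensitive attribute nonlinearly can pass through almost unchanged. A small synthetic illustration:

```python
import numpy as np
from fairlearn.preprocessing import CorrelationRemover

rng = np.random.default_rng(0)
s = rng.normal(size=1000)                 # sensitive feature
x = s ** 2 + 0.1 * rng.normal(size=1000)  # nonlinear proxy for s

X = np.column_stack([x, s])
x_t = CorrelationRemover(sensitive_feature_ids=[1]).fit_transform(X).ravel()

# The linear correlation with s was already near zero, so x is barely
# changed, yet the transformed feature still tracks the magnitude of s
print(np.corrcoef(x, s)[0, 1])            # ~0
print(np.corrcoef(x_t, np.abs(s))[0, 1])  # strongly positive
```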

### Best Practices

1. **Preprocessing Order**: Apply CorrelationRemover before other preprocessing steps that might reintroduce correlations
2. **Cross-validation**: Use cross-validation to select the alpha value
3. **Fairness Assessment**: Always evaluate both fairness and performance after preprocessing
4. **Domain Knowledge**: Use domain-specific relationships to decide which attributes to treat as sensitive

```python
# Recommended workflow
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from fairlearn.preprocessing import CorrelationRemover
from fairlearn.metrics import MetricFrame

# Grid search over alpha values
param_grid = {'correlation_removal__alpha': [0.0, 0.3, 0.6, 1.0]}

pipeline = Pipeline([
    ('correlation_removal', CorrelationRemover(sensitive_feature_ids=[2, 3])),
    ('classifier', LogisticRegression())
])

# GridSearchCV selects alpha by predictive accuracy alone;
# fairness is assessed separately below
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Evaluate the fairness of the best model
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)

# sensitive_features_test holds the sensitive columns of the test set
fairness_metrics = MetricFrame(
    metrics={'accuracy': accuracy_score},
    y_true=y_test,
    y_pred=predictions,
    sensitive_features=sensitive_features_test
)
```