
# Preprocessing


Preprocessing techniques transform features to reduce their correlation with sensitive attributes, addressing fairness at the data preparation stage. These methods modify the input data before model training to reduce the potential for discriminatory outcomes.


## Capabilities


### CorrelationRemover


Removes linear correlations between non-sensitive features and sensitive attributes using a linear projection. This makes it harder for a downstream model, particularly a linear one, to recover the sensitive attributes from the remaining features. Nonlinear dependence on the sensitive attributes can remain.


```python { .api }
class CorrelationRemover:
    def __init__(self, *, sensitive_feature_ids, alpha=1.0):
        """
        Remove correlations between features and sensitive attributes.

        Parameters:
        - sensitive_feature_ids: list of int or str, indices or names of sensitive features
        - alpha: float, strength of correlation removal (0.0 = no removal, 1.0 = full removal)
        """

    def fit(self, X, y=None):
        """
        Learn the transformation to remove correlations.

        Parameters:
        - X: array-like or DataFrame, feature matrix including sensitive features
        - y: array-like, target values (unused, present for sklearn compatibility)

        Returns:
        self
        """

    def transform(self, X):
        """
        Apply the correlation removal transformation.

        Parameters:
        - X: array-like or DataFrame, feature matrix to transform

        Returns:
        array-like or DataFrame: Transformed features with reduced correlation
        """

    def fit_transform(self, X, y=None):
        """
        Fit and transform the data in one step.

        Parameters:
        - X: array-like or DataFrame, feature matrix
        - y: array-like, target values (unused)

        Returns:
        array-like or DataFrame: Transformed features
        """

    @property
    def mean_(self):
        """Mean values used for centering during transformation."""

    @property
    def projection_matrix_(self):
        """Projection matrix used for correlation removal."""
```


#### Usage Example


```python
import pandas as pd
from fairlearn.preprocessing import CorrelationRemover
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load data with sensitive features included
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [2, 4, 6, 8, 10],
    'sensitive_gender': [0, 1, 0, 1, 0],
    'sensitive_age': [25, 35, 45, 30, 40]
})
target = [0, 1, 0, 1, 1]

# Specify which columns are sensitive
cr = CorrelationRemover(
    sensitive_feature_ids=['sensitive_gender', 'sensitive_age'],
    alpha=1.0  # Full correlation removal
)

# Fit and transform the data
data_transformed = cr.fit_transform(data)

# The non-sensitive features now have reduced linear correlation with the
# sensitive columns. Check data_transformed.shape on your Fairlearn version:
# the sensitive columns themselves are expected to be absent from the output.
# Continue with normal ML pipeline
X_train, X_test, y_train, y_test = train_test_split(
    data_transformed, target, test_size=0.3, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```


#### Working with Numeric Indices


```python
import numpy as np
from fairlearn.preprocessing import CorrelationRemover

# Data as numpy array where columns 2 and 3 are sensitive
X = np.array([
    [1.0, 2.0, 0, 25],  # features + gender + age
    [2.0, 4.0, 1, 35],
    [3.0, 6.0, 0, 45],
    [4.0, 8.0, 1, 30]
])

# Use numeric indices for sensitive features
cr = CorrelationRemover(
    sensitive_feature_ids=[2, 3],  # Gender and age columns
    alpha=0.8  # Partial correlation removal
)

X_transformed = cr.fit_transform(X)
```


## Algorithm Details


### Correlation Removal Process


The CorrelationRemover works by:

1. **Centering**: Centers the sensitive features around their mean values
2. **Identifying Correlations**: Estimates, by least squares, how much of each non-sensitive feature is linearly explained by the sensitive features
3. **Projection**: Creates a linear projection that removes these correlations
4. **Transformation**: Applies the projection to the input features, blended with the original features according to `alpha`
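
The steps above can be sketched in a few lines of NumPy. This is an illustrative re-implementation under the assumption that the projection is obtained by ordinary least squares on the centered sensitive columns; `remove_linear_correlation` and the demo data are hypothetical names introduced here, not part of Fairlearn.

```python
import numpy as np


def remove_linear_correlation(X, sensitive_ids, alpha=1.0):
    """Illustrative sketch only, not Fairlearn's implementation.

    Returns the non-sensitive columns of X with their linear relationship
    to the sensitive columns removed (alpha=1.0) or partly removed.
    """
    X = np.asarray(X, dtype=float)
    sensitive_ids = list(sensitive_ids)
    other_ids = [i for i in range(X.shape[1]) if i not in sensitive_ids]

    S = X[:, sensitive_ids]  # sensitive columns
    Z = X[:, other_ids]      # non-sensitive columns

    # Step 1: center the sensitive columns.
    S_centered = S - S.mean(axis=0)

    # Step 2: least-squares coefficients predicting each non-sensitive
    # column from the centered sensitive columns.
    beta, *_ = np.linalg.lstsq(S_centered, Z, rcond=None)

    # Step 3: subtract the part of Z that S explains linearly.
    Z_filtered = Z - S_centered @ beta

    # Step 4: blend filtered and original features according to alpha.
    return alpha * Z_filtered + (1.0 - alpha) * Z


# Synthetic demo: one feature z that depends on a binary sensitive column s.
rng = np.random.default_rng(0)
s = rng.integers(0, 2, size=200).astype(float)
z = 2.0 * s + rng.normal(size=200)
X_demo = np.column_stack([z, s])

Z_out = remove_linear_correlation(X_demo, sensitive_ids=[1], alpha=1.0)
corr_before = np.corrcoef(z, s)[0, 1]
corr_after = np.corrcoef(Z_out[:, 0], s)[0, 1]
print(f"correlation before: {corr_before:.3f}, after: {corr_after:.3e}")
```

With `alpha=1.0` the remaining sample correlation is zero up to floating-point error, and with `alpha=0.0` the non-sensitive columns come back unchanged.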


The mathematical approach:

- Let X be the non-sensitive features and S be the sensitive features
- The algorithm finds a projection matrix P such that P·X has no linear correlation with S on the data used for fitting
- The `alpha` parameter controls the strength of removal by interpolating between the original features (alpha = 0) and the fully projected features (alpha = 1)
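
One way to write this out is the least-squares form below. It is a sketch of the formulation as described in Fairlearn's user guide, not a transcription of the library code; the symbols $S_c$, $\hat{W}$, $X_{\text{filtered}}$ and $X_{\text{out}}$ are notation introduced here, and $S_c$ is assumed to have full column rank.

```latex
\begin{aligned}
S_c &= S - \mathbf{1}\,\bar{s}^{\top}
  && \text{(center the sensitive features)} \\
\hat{W} &= \arg\min_{W} \lVert X - S_c W \rVert_F^2
  = (S_c^{\top} S_c)^{-1} S_c^{\top} X \\
X_{\text{filtered}} &= X - S_c \hat{W}
  = \underbrace{\bigl(I - S_c (S_c^{\top} S_c)^{-1} S_c^{\top}\bigr)}_{P}\, X \\
X_{\text{out}} &= \alpha\, X_{\text{filtered}} + (1 - \alpha)\, X
\end{aligned}
```

Because $S_c^{\top} X_{\text{filtered}} = 0$, the sample covariance between $X_{\text{out}}$ and $S$ equals $(1 - \alpha)$ times the sample covariance between $X$ and $S$.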


### Hyperparameter Tuning


The `alpha` parameter controls the trade-off between fairness and utility:

- **alpha = 0.0**: No correlation removal (original features preserved)
- **alpha = 1.0**: Maximum correlation removal (may reduce predictive power)
- **alpha ∈ (0, 1)**: Partial correlation removal (balanced approach)


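```python
import numpy as np
from fairlearn.preprocessing import CorrelationRemover
from sklearn.linear_model import LogisticRegression

# Example of testing different alpha values.
# X is the array from "Working with Numeric Indices" (sensitive columns 2 and 3);
# y is an illustrative label vector with one entry per row of X.
y = np.array([0, 1, 0, 1])
alphas = [0.0, 0.3, 0.6, 1.0]
results = {}

for alpha in alphas:
    cr = CorrelationRemover(sensitive_feature_ids=[2, 3], alpha=alpha)
    X_transformed = cr.fit_transform(X)

    # Train model and evaluate fairness/accuracy
    model = LogisticRegression()
    model.fit(X_transformed, y)

    # Store results for comparison (training accuracy here;
    # swap in your own accuracy/fairness evaluation)
    results[alpha] = model.score(X_transformed, y)
```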
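
To see the fairness side of the trade-off in isolation, the sketch below measures how much covariance with a sensitive feature is left for each alpha. It uses a hand-rolled single-feature version of the blend (an assumption made for illustration, with synthetic data), not `CorrelationRemover` itself.

```python
import numpy as np

# Synthetic data: feature x depends linearly on sensitive feature s.
rng = np.random.default_rng(1)
s = rng.normal(size=500)
x = 1.5 * s + rng.normal(size=500)

# Least-squares removal of the linear effect of s on x.
s_centered = s - s.mean()
beta = (s_centered @ x) / (s_centered @ s_centered)
x_filtered = x - beta * s_centered

cov_original = np.cov(x, s)[0, 1]
ratios = {}
for alpha in (0.0, 0.3, 0.6, 1.0):
    # Blend filtered and original feature, as alpha is described above.
    x_out = alpha * x_filtered + (1.0 - alpha) * x
    ratios[alpha] = np.cov(x_out, s)[0, 1] / cov_original

for alpha, ratio in ratios.items():
    print(f"alpha={alpha:.1f}  remaining covariance ratio={ratio:.3f}")
```

Under these assumptions the remaining covariance shrinks in proportion to `1 - alpha`, which gives a simple way to reason about intermediate settings.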


## Integration with Scikit-learn


CorrelationRemover follows scikit-learn conventions and can be used in pipelines:


```python
from fairlearn.preprocessing import CorrelationRemover
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create preprocessing pipeline
pipeline = Pipeline([
    ('correlation_removal', CorrelationRemover(sensitive_feature_ids=[2, 3])),
    ('scaling', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Fit entire pipeline.
# X_train / X_test are raw feature arrays that still contain the
# sensitive columns at positions 2 and 3.
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```


## Considerations and Limitations


### Data Requirements


- **Feature Types**: Works with numeric features; categorical features must be encoded (for example, one-hot) first
- **Sensitive Features**: Can handle multiple sensitive attributes simultaneously
- **Sample Size**: Correlation estimates are more stable on larger datasets, so results are more reliable there


### Fairness Trade-offs


- **Utility Loss**: Removing correlations may reduce predictive performance
- **Fairness Gain**: Reduces the model's ability to discriminate based on linear relationships with the sensitive attributes
- **Proxy Variables**: Cannot prevent discrimination through proxies for sensitive attributes that are unmeasured or not listed in `sensitive_feature_ids`
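
Because the removal is linear, a related limitation is worth checking on your own data: a feature can depend entirely on a sensitive attribute and still pass through unchanged if the dependence is nonlinear. The sketch below illustrates this with a hand-rolled single-feature removal on synthetic data (an assumption for illustration, not Fairlearn's code).

```python
import numpy as np

# A sensitive attribute that is symmetric around zero, and a feature
# that is fully determined by it, but not linearly.
s = np.linspace(-1.0, 1.0, 201)
proxy = s ** 2

# Least-squares removal of the linear effect of s on the feature.
s_centered = s - s.mean()
beta = (s_centered @ proxy) / (s_centered @ s_centered)
proxy_filtered = proxy - beta * s_centered

# Linear correlation with s is (numerically) zero before and after,
# yet the filtered feature is still a function of s.
linear_corr = np.corrcoef(proxy_filtered, s)[0, 1]
nonlinear_corr = np.corrcoef(proxy_filtered, s ** 2)[0, 1]
print(f"corr with s: {linear_corr:.2e}, corr with s**2: {nonlinear_corr:.3f}")
```

A nonlinear downstream model could still recover information about `s` from such a feature, which is one reason to pair this technique with a fairness assessment of the final model.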


### Best Practices


1. **Preprocessing Order**: Apply CorrelationRemover before other preprocessing steps that might reintroduce correlations
2. **Cross-validation**: Use cross-validation to select optimal alpha values
3. **Fairness Assessment**: Always evaluate both fairness and performance after preprocessing
4. **Domain Knowledge**: Consider domain-specific relationships when choosing sensitive features


```python
# Recommended workflow
from fairlearn.metrics import MetricFrame
from fairlearn.preprocessing import CorrelationRemover
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Grid search over alpha values
param_grid = {'correlation_removal__alpha': [0.0, 0.3, 0.6, 1.0]}

pipeline = Pipeline([
    ('correlation_removal', CorrelationRemover(sensitive_feature_ids=[2, 3])),
    ('classifier', LogisticRegression())
])

# With default scoring, the search picks alpha by cross-validated accuracy only;
# fairness is assessed separately below.
# X_train / X_test are raw arrays with the sensitive columns at positions 2 and 3.
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Evaluate fairness of best model
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)

# Assumption for illustration: group by the sensitive column at index 2.
sensitive_features_test = X_test[:, 2]

fairness_metrics = MetricFrame(
    metrics={'accuracy': accuracy_score},
    y_true=y_test,
    y_pred=predictions,
    sensitive_features=sensitive_features_test
)
print(fairness_metrics.by_group)
```

```