# Datasets

Standard datasets commonly used for fairness research and benchmarking, with consistent interfaces and built-in sensitive feature identification. These datasets provide realistic examples for testing fairness algorithms and include known fairness challenges.

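All loaders follow the same calling convention and return the same Bunch attributes, so they can be swapped freely. A minimal sketch of that shared pattern, assuming the interfaces documented in the capabilities below:

```python
from fairlearn.datasets import fetch_adult, fetch_bank_marketing, fetch_diabetes_hospital

# Every loader accepts the same core keyword arguments and returns a Bunch
# with data, target, and sensitive_features attributes
for fetch in (fetch_adult, fetch_bank_marketing, fetch_diabetes_hospital):
    bunch = fetch(as_frame=True)
    print(fetch.__name__, bunch.data.shape, bunch.sensitive_feature_names)
```
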
## Capabilities

### Adult (Census Income) Dataset

The Adult dataset predicts whether income exceeds $50K/year based on census data. It is one of the most commonly used datasets in fairness research.

```python { .api }
def fetch_adult(*, cache=True, data_home=None, as_frame=True, return_X_y=False):
    """
    Load the Adult (Census Income) dataset.

    Parameters:
    - cache: bool, whether to cache downloaded data locally
    - data_home: str, path to store cached data (default: ~/fairlearn_data)
    - as_frame: bool, return pandas DataFrame and Series (True) or numpy arrays (False)
    - return_X_y: bool, return (X, y, sensitive_features) tuple instead of Bunch object

    Returns:
    Bunch object with:
    - data: DataFrame or array, feature matrix
    - target: Series or array, target values (0: <=50K, 1: >50K)
    - feature_names: list, names of features
    - target_names: list, names of target classes
    - sensitive_features: DataFrame or array, sensitive attributes (sex, race)
    - sensitive_feature_names: list, names of sensitive features
    """
```

#### Usage Example

```python
from fairlearn.datasets import fetch_adult

# Load as DataFrames (recommended)
adult_data = fetch_adult(as_frame=True)
X = adult_data.data
y = adult_data.target
sensitive_features = adult_data.sensitive_features

print(f"Features shape: {X.shape}")
print(f"Target distribution: {y.value_counts()}")
print(f"Sensitive features: {adult_data.sensitive_feature_names}")

# Load for direct use
X, y, A = fetch_adult(return_X_y=True)
```

### ACS Income Dataset

American Community Survey (ACS) income dataset providing more recent census-like data with state-level filtering options.

```python { .api }
def fetch_acs_income(*, cache=True, data_home=None, as_frame=True, return_X_y=False,
                     state="CA", year=2018, with_nulls=False,
                     optimization="mem", accept_download=False):
    """
    Load the ACS Income dataset from the American Community Survey.

    Parameters:
    - cache: bool, whether to cache downloaded data
    - data_home: str, path to store cached data
    - as_frame: bool, return pandas DataFrame and Series
    - return_X_y: bool, return (X, y, sensitive_features) tuple
    - state: str, state abbreviation for data filtering (e.g., "CA", "NY", "TX")
    - year: int, year of survey data (2014-2018 available)
    - with_nulls: bool, whether to include missing values
    - optimization: str, optimize for memory or speed ("mem" or "speed")
    - accept_download: bool, whether to accept downloading the large dataset

    Returns:
    Bunch object with census data and sensitive features including race and sex
    """
```

#### Usage Example

```python
from fairlearn.datasets import fetch_acs_income

# Load California 2018 data
acs_data = fetch_acs_income(
    state="CA",
    year=2018,
    accept_download=True  # Required for first download
)

X = acs_data.data
y = acs_data.target
sensitive_features = acs_data.sensitive_features

print(f"ACS Income dataset for CA 2018: {X.shape[0]} samples")
```

### Bank Marketing Dataset

Portuguese bank marketing campaign dataset for predicting term deposit subscriptions.

```python { .api }
def fetch_bank_marketing(*, cache=True, data_home=None, as_frame=True, return_X_y=False):
    """
    Load the Bank Marketing dataset.

    Parameters:
    - cache: bool, whether to cache downloaded data
    - data_home: str, path to store cached data
    - as_frame: bool, return pandas DataFrame and Series
    - return_X_y: bool, return (X, y, sensitive_features) tuple

    Returns:
    Bunch object with:
    - data: feature matrix with client information
    - target: binary target (subscribed to term deposit)
    - sensitive_features: age group as sensitive attribute
    """
```

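#### Usage Example

A minimal loading sketch, assuming the same Bunch interface as the other loaders:

```python
from fairlearn.datasets import fetch_bank_marketing

# Load client and campaign features; age group is the sensitive attribute
bank = fetch_bank_marketing(as_frame=True)
X, y = bank.data, bank.target
A = bank.sensitive_features

print(f"Samples: {X.shape[0]}")
print("Subscription distribution:")
print(y.value_counts(normalize=True))
```
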
### Boston Housing Dataset

Boston housing prices dataset (note: deprecated due to ethical concerns).

```python { .api }
def fetch_boston(*, cache=True, data_home=None, as_frame=True, return_X_y=False, warn=True):
    """
    Load the Boston Housing dataset.

    **Warning**: This dataset has known fairness issues and is deprecated.

    Parameters:
    - cache: bool, whether to cache data
    - data_home: str, path to store cached data
    - as_frame: bool, return pandas DataFrame and Series
    - return_X_y: bool, return (X, y, sensitive_features) tuple
    - warn: bool, whether to display fairness warning

    Returns:
    Bunch object with housing data and racial composition as sensitive feature
    """
```

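#### Usage Example

If the dataset must be loaded at all (e.g., to reproduce earlier results), keep the warning enabled; a hedged sketch using the `warn` flag from the signature above:

```python
from fairlearn.datasets import fetch_boston

# Deprecated dataset: load only to reproduce prior work, never for new
# benchmarks; warn=True keeps the fairness warning visible
boston = fetch_boston(as_frame=True, warn=True)
print(boston.data.shape)
```
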
### Credit Card Fraud Dataset

Credit card fraud detection dataset for binary classification.

```python { .api }
def fetch_credit_card(*, cache=True, data_home=None, as_frame=True, return_X_y=False):
    """
    Load the Credit Card Fraud dataset.

    Parameters:
    - cache: bool, whether to cache downloaded data
    - data_home: str, path to store cached data
    - as_frame: bool, return pandas DataFrame and Series
    - return_X_y: bool, return (X, y, sensitive_features) tuple

    Returns:
    Bunch object with:
    - data: anonymized credit card transaction features
    - target: binary fraud indicator (0: legitimate, 1: fraud)
    - sensitive_features: derived sensitive attributes
    """
```

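#### Usage Example

Fraud labels are typically a tiny minority class in this kind of data, so check class balance before picking metrics. A minimal sketch, assuming the Bunch interface above:

```python
from fairlearn.datasets import fetch_credit_card

fraud = fetch_credit_card(as_frame=True)
X, y = fraud.data, fraud.target

# Fraud is usually rare; accuracy alone is misleading on data this imbalanced
print(y.value_counts(normalize=True))
print(f"Sensitive features: {fraud.sensitive_feature_names}")
```
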
### Diabetes Hospital Dataset

Hospital diabetes patient dataset for predicting readmission risk.

```python { .api }
def fetch_diabetes_hospital(*, as_frame=True, cache=True, data_home=None, return_X_y=False):
    """
    Load the Diabetes Hospital dataset.

    Parameters:
    - as_frame: bool, return pandas DataFrame and Series
    - cache: bool, whether to cache downloaded data
    - data_home: str, path to store cached data
    - return_X_y: bool, return (X, y, sensitive_features) tuple

    Returns:
    Bunch object with:
    - data: patient medical features
    - target: readmission outcome
    - sensitive_features: race and gender
    """
```

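#### Usage Example

A minimal sketch, assuming the Bunch interface above, that compares readmission outcomes across the documented race and gender attributes:

```python
from fairlearn.datasets import fetch_diabetes_hospital

diabetes = fetch_diabetes_hospital(as_frame=True)
X, y = diabetes.data, diabetes.target
A = diabetes.sensitive_features

# Outcome distribution within each sensitive group
for col in diabetes.sensitive_feature_names:
    print(f"\n{col}:")
    print(y.groupby(A[col]).value_counts(normalize=True))
```
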
## Common Usage Patterns

### Basic Data Loading

```python
from fairlearn.datasets import fetch_adult, fetch_acs_income

# Load Adult dataset
adult = fetch_adult(as_frame=True)
X_adult, y_adult, A_adult = adult.data, adult.target, adult.sensitive_features

# Load ACS dataset for specific state and year
acs = fetch_acs_income(state="NY", year=2017, accept_download=True)
X_acs, y_acs, A_acs = acs.data, acs.target, acs.sensitive_features
```

### Direct Unpacking

```python
from fairlearn.datasets import fetch_adult
from sklearn.model_selection import train_test_split

# Get data ready for ML pipeline
X, y, sensitive_features = fetch_adult(return_X_y=True)

# Split for training, keeping sensitive features aligned with X and y
X_train, X_test, y_train, y_test, A_train, A_test = train_test_split(
    X, y, sensitive_features, test_size=0.3, random_state=42, stratify=y
)
```

### Exploring Dataset Properties

```python
from fairlearn.datasets import fetch_adult

# Load and explore Adult dataset
adult = fetch_adult(as_frame=True)

print("Dataset shape:", adult.data.shape)
print("Target distribution:")
print(adult.target.value_counts())

print("\nSensitive features:")
print(adult.sensitive_features.head())

print("\nFeature names:")
print(adult.feature_names)

print("\nSensitive feature breakdown:")
for col in adult.sensitive_feature_names:
    print(f"{col}: {adult.sensitive_features[col].value_counts()}")
```

### Data Preprocessing Pipeline

```python
from fairlearn.datasets import fetch_adult
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load dataset
X, y, A = fetch_adult(return_X_y=True)

# Identify categorical and numerical columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# Create preprocessing pipeline; note that LabelEncoder is intended for
# targets and does not work inside a ColumnTransformer, so categorical
# features are one-hot encoded instead
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Preprocess features
X_processed = preprocessor.fit_transform(X)
```

## Dataset Characteristics

### Adult Dataset
- **Size**: ~48,000 samples
- **Features**: 14 (age, workclass, education, etc.)
- **Target**: Binary income classification (>50K/year)
- **Sensitive Features**: Sex, race
- **Fairness Issues**: Historical biases in income across demographic groups (see the base-rate check below)

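A quick way to see these biases is to compare base rates of the positive label across groups. A minimal sketch, assuming the Bunch interface documented above:

```python
from fairlearn.datasets import fetch_adult

adult = fetch_adult(as_frame=True)
y, A = adult.target, adult.sensitive_features

# Label distribution per group; large gaps reflect the historical disparities
for col in adult.sensitive_feature_names:
    print(f"\n{col}:")
    print(y.groupby(A[col]).value_counts(normalize=True))
```
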
### ACS Income Dataset
- **Size**: Variable by state and year (10K-500K samples)
- **Features**: 10 census-related features
- **Target**: Binary income classification
- **Sensitive Features**: Sex, race
- **Advantages**: More recent data, state-specific filtering

### Bank Marketing Dataset
- **Size**: ~45,000 samples
- **Features**: 16 client and campaign features
- **Target**: Binary term deposit subscription
- **Sensitive Features**: Age groups
- **Use Case**: Marketing fairness, age discrimination

### Other Datasets
Each dataset includes appropriate sensitive features and represents realistic fairness challenges in different domains (finance, healthcare, housing, etc.).

## Best Practices

### Data Exploration

Always explore the dataset before use:

```python
import pandas as pd

from fairlearn.datasets import fetch_adult

def explore_fairness_dataset(data_bunch):
    """Explore fairness-related properties of a dataset."""
    print(f"Dataset shape: {data_bunch.data.shape}")
    print(f"Missing values: {data_bunch.data.isnull().sum().sum()}")

    # Target distribution
    print("\nTarget distribution:")
    print(data_bunch.target.value_counts(normalize=True))

    # Sensitive feature distributions
    print("\nSensitive feature distributions:")
    for col in data_bunch.sensitive_feature_names:
        print(f"\n{col}:")
        print(data_bunch.sensitive_features[col].value_counts())

    # Cross-tabulation of target and sensitive features
    for col in data_bunch.sensitive_feature_names:
        print(f"\nTarget vs {col}:")
        crosstab = pd.crosstab(
            data_bunch.sensitive_features[col],
            data_bunch.target,
            normalize='index'
        )
        print(crosstab)

# Use the exploration function
adult = fetch_adult(as_frame=True)
explore_fairness_dataset(adult)
```

### Ethical Considerations

1. **Data Awareness**: Understand the historical context and potential biases
2. **Boston Housing**: Avoid using due to known racial bias issues
3. **Sensitive Feature Selection**: Consider which attributes should be treated as sensitive
4. **Intersectionality**: Consider interactions between multiple sensitive attributes, as shown in the sketch below

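Groups defined by combinations of sensitive attributes can hide disparities that neither attribute shows alone. A minimal sketch, assuming the Adult interface above; the toy decision tree (trained on numeric columns only) exists just to produce predictions to audit:

```python
from fairlearn.datasets import fetch_adult
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y, A = fetch_adult(return_X_y=True)

# Toy model on numeric features only, to keep the sketch self-contained
X_num = X.select_dtypes(include='number')
pred = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_num, y).predict(X_num)

# Passing the full sensitive-features DataFrame yields one row per
# (sex, race) combination rather than per single attribute
mf = MetricFrame(metrics=accuracy_score, y_true=y, y_pred=pred, sensitive_features=A)
print(mf.by_group)
```
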
### Performance Baselines

Establish fairness baselines:

```python
import pandas as pd

from fairlearn.datasets import fetch_acs_income, fetch_adult
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from sklearn.ensemble import RandomForestClassifier

def establish_baseline(dataset_name="adult"):
    """Establish baseline fairness metrics for a dataset."""
    if dataset_name == "adult":
        X, y, A = fetch_adult(return_X_y=True)
    elif dataset_name == "acs":
        X, y, A = fetch_acs_income(return_X_y=True, accept_download=True)
    else:
        raise ValueError(f"Unknown dataset: {dataset_name}")

    # One-hot encode categoricals; random forests require numeric inputs
    X_enc = pd.get_dummies(X)

    # Train a simple baseline model (in-sample evaluation; use a held-out
    # split for real benchmarking)
    model = RandomForestClassifier(random_state=42)
    model.fit(X_enc, y)
    predictions = model.predict(X_enc)

    # Compute fairness metrics
    mf = MetricFrame(
        metrics={'accuracy': lambda y_true, y_pred: (y_true == y_pred).mean()},
        y_true=y,
        y_pred=predictions,
        sensitive_features=A
    )

    dp_diff = demographic_parity_difference(y, predictions, sensitive_features=A)

    return {
        'overall_accuracy': mf.overall['accuracy'],
        'group_accuracies': mf.by_group['accuracy'],
        'demographic_parity_difference': dp_diff
    }

baseline = establish_baseline("adult")
print("Baseline fairness metrics:", baseline)
```