# Datasets

Standard datasets commonly used for fairness research and benchmarking, with consistent interfaces and built-in sensitive feature identification. These datasets provide realistic examples for testing fairness algorithms and include known fairness challenges.

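All loaders follow the same calling convention and return the same Bunch attributes, so they can be swapped freely. A minimal sketch of that shared pattern, assuming the interfaces documented in the capabilities below:

```python
from fairlearn.datasets import fetch_adult, fetch_bank_marketing, fetch_diabetes_hospital

# Every loader accepts the same core keyword arguments and returns a Bunch
# with data, target, and sensitive_features attributes
for fetch in (fetch_adult, fetch_bank_marketing, fetch_diabetes_hospital):
    bunch = fetch(as_frame=True)
    print(fetch.__name__, bunch.data.shape, bunch.sensitive_feature_names)
```
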
## Capabilities

### Adult (Census Income) Dataset

The Adult dataset predicts whether income exceeds $50K/year based on census data. It is one of the most commonly used datasets in fairness research.

```python { .api }
def fetch_adult(*, cache=True, data_home=None, as_frame=True, return_X_y=False):
    """
    Load the Adult (Census Income) dataset.

    Parameters:
    - cache: bool, whether to cache downloaded data locally
    - data_home: str, path to store cached data (default: ~/fairlearn_data)
    - as_frame: bool, return pandas DataFrame and Series (True) or numpy arrays (False)
    - return_X_y: bool, return (X, y, sensitive_features) tuple instead of Bunch object

    Returns:
    Bunch object with:
    - data: DataFrame or array, feature matrix
    - target: Series or array, target values (0: <=50K, 1: >50K)
    - feature_names: list, names of features
    - target_names: list, names of target classes
    - sensitive_features: DataFrame or array, sensitive attributes (sex, race)
    - sensitive_feature_names: list, names of sensitive features
    """
```

#### Usage Example

```python
from fairlearn.datasets import fetch_adult

# Load as DataFrames (recommended)
adult_data = fetch_adult(as_frame=True)
X = adult_data.data
y = adult_data.target
sensitive_features = adult_data.sensitive_features

print(f"Features shape: {X.shape}")
print(f"Target distribution: {y.value_counts()}")
print(f"Sensitive features: {adult_data.sensitive_feature_names}")

# Load for direct use
X, y, A = fetch_adult(return_X_y=True)
```

### ACS Income Dataset

American Community Survey (ACS) income dataset providing more recent census-like data with state-level filtering options.

```python { .api }
def fetch_acs_income(*, cache=True, data_home=None, as_frame=True, return_X_y=False,
                     state="CA", year=2018, with_nulls=False,
                     optimization="mem", accept_download=False):
    """
    Load the ACS Income dataset from the American Community Survey.

    Parameters:
    - cache: bool, whether to cache downloaded data
    - data_home: str, path to store cached data
    - as_frame: bool, return pandas DataFrame and Series
    - return_X_y: bool, return (X, y, sensitive_features) tuple
    - state: str, state abbreviation for data filtering (e.g., "CA", "NY", "TX")
    - year: int, year of survey data (2014-2018 available)
    - with_nulls: bool, whether to include missing values
    - optimization: str, optimize for memory or speed ("mem" or "speed")
    - accept_download: bool, whether to accept downloading the large dataset

    Returns:
    Bunch object with census data and sensitive features including race and sex
    """
```

#### Usage Example

```python
from fairlearn.datasets import fetch_acs_income

# Load California 2018 data
acs_data = fetch_acs_income(
    state="CA",
    year=2018,
    accept_download=True  # Required for first download
)

X = acs_data.data
y = acs_data.target
sensitive_features = acs_data.sensitive_features

print(f"ACS Income dataset for CA 2018: {X.shape[0]} samples")
```

### Bank Marketing Dataset

Portuguese bank marketing campaign dataset for predicting term deposit subscriptions.

```python { .api }
def fetch_bank_marketing(*, cache=True, data_home=None, as_frame=True, return_X_y=False):
    """
    Load the Bank Marketing dataset.

    Parameters:
    - cache: bool, whether to cache downloaded data
    - data_home: str, path to store cached data
    - as_frame: bool, return pandas DataFrame and Series
    - return_X_y: bool, return (X, y, sensitive_features) tuple

    Returns:
    Bunch object with:
    - data: feature matrix with client information
    - target: binary target (subscribed to term deposit)
    - sensitive_features: age group as sensitive attribute
    """
```

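#### Usage Example

A minimal loading sketch, assuming the same Bunch interface as the other loaders:

```python
from fairlearn.datasets import fetch_bank_marketing

# Load client and campaign features; age group is the sensitive attribute
bank = fetch_bank_marketing(as_frame=True)
X, y = bank.data, bank.target
A = bank.sensitive_features

print(f"Samples: {X.shape[0]}")
print("Subscription distribution:")
print(y.value_counts(normalize=True))
```
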
### Boston Housing Dataset

Boston housing prices dataset (note: deprecated due to ethical concerns).

```python { .api }
def fetch_boston(*, cache=True, data_home=None, as_frame=True, return_X_y=False, warn=True):
    """
    Load the Boston Housing dataset.

    **Warning**: This dataset has known fairness issues and is deprecated.

    Parameters:
    - cache: bool, whether to cache data
    - data_home: str, path to store cached data
    - as_frame: bool, return pandas DataFrame and Series
    - return_X_y: bool, return (X, y, sensitive_features) tuple
    - warn: bool, whether to display fairness warning

    Returns:
    Bunch object with housing data and racial composition as sensitive feature
    """
```

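#### Usage Example

If the dataset must be loaded at all (e.g., to reproduce earlier results), keep the warning enabled; a hedged sketch using the `warn` flag from the signature above:

```python
from fairlearn.datasets import fetch_boston

# Deprecated dataset: load only to reproduce prior work, never for new
# benchmarks; warn=True keeps the fairness warning visible
boston = fetch_boston(as_frame=True, warn=True)
print(boston.data.shape)
```
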
### Credit Card Fraud Dataset

Credit card fraud detection dataset for binary classification.

```python { .api }
def fetch_credit_card(*, cache=True, data_home=None, as_frame=True, return_X_y=False):
    """
    Load the Credit Card Fraud dataset.

    Parameters:
    - cache: bool, whether to cache downloaded data
    - data_home: str, path to store cached data
    - as_frame: bool, return pandas DataFrame and Series
    - return_X_y: bool, return (X, y, sensitive_features) tuple

    Returns:
    Bunch object with:
    - data: anonymized credit card transaction features
    - target: binary fraud indicator (0: legitimate, 1: fraud)
    - sensitive_features: derived sensitive attributes
    """
```

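#### Usage Example

Fraud labels are typically a tiny minority class in this kind of data, so check class balance before picking metrics. A minimal sketch, assuming the Bunch interface above:

```python
from fairlearn.datasets import fetch_credit_card

fraud = fetch_credit_card(as_frame=True)
X, y = fraud.data, fraud.target

# Fraud is usually rare; accuracy alone is misleading on data this imbalanced
print(y.value_counts(normalize=True))
print(f"Sensitive features: {fraud.sensitive_feature_names}")
```
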
### Diabetes Hospital Dataset

Hospital diabetes patient dataset for predicting readmission risk.

```python { .api }
def fetch_diabetes_hospital(*, as_frame=True, cache=True, data_home=None, return_X_y=False):
    """
    Load the Diabetes Hospital dataset.

    Parameters:
    - as_frame: bool, return pandas DataFrame and Series
    - cache: bool, whether to cache downloaded data
    - data_home: str, path to store cached data
    - return_X_y: bool, return (X, y, sensitive_features) tuple

    Returns:
    Bunch object with:
    - data: patient medical features
    - target: readmission outcome
    - sensitive_features: race and gender
    """
```

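#### Usage Example

A minimal sketch, assuming the Bunch interface above, that compares readmission outcomes across the documented race and gender attributes:

```python
from fairlearn.datasets import fetch_diabetes_hospital

diabetes = fetch_diabetes_hospital(as_frame=True)
X, y = diabetes.data, diabetes.target
A = diabetes.sensitive_features

# Outcome distribution within each sensitive group
for col in diabetes.sensitive_feature_names:
    print(f"\n{col}:")
    print(y.groupby(A[col]).value_counts(normalize=True))
```
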
## Common Usage Patterns

### Basic Data Loading

```python
from fairlearn.datasets import fetch_adult, fetch_acs_income

# Load Adult dataset
adult = fetch_adult(as_frame=True)
X_adult, y_adult, A_adult = adult.data, adult.target, adult.sensitive_features

# Load ACS dataset for specific state and year
acs = fetch_acs_income(state="NY", year=2017, accept_download=True)
X_acs, y_acs, A_acs = acs.data, acs.target, acs.sensitive_features
```

### Direct Unpacking

```python
from fairlearn.datasets import fetch_adult
from sklearn.model_selection import train_test_split

# Get data ready for ML pipeline
X, y, sensitive_features = fetch_adult(return_X_y=True)

# Split for training, keeping sensitive features aligned with X and y
X_train, X_test, y_train, y_test, A_train, A_test = train_test_split(
    X, y, sensitive_features, test_size=0.3, random_state=42, stratify=y
)
```

### Exploring Dataset Properties

```python
from fairlearn.datasets import fetch_adult

# Load and explore Adult dataset
adult = fetch_adult(as_frame=True)

print("Dataset shape:", adult.data.shape)
print("Target distribution:")
print(adult.target.value_counts())

print("\nSensitive features:")
print(adult.sensitive_features.head())

print("\nFeature names:")
print(adult.feature_names)

print("\nSensitive feature breakdown:")
for col in adult.sensitive_feature_names:
    print(f"{col}: {adult.sensitive_features[col].value_counts()}")
```

### Data Preprocessing Pipeline

```python
from fairlearn.datasets import fetch_adult
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load dataset
X, y, A = fetch_adult(return_X_y=True)

# Identify categorical and numerical columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# Create preprocessing pipeline; note that LabelEncoder is intended for
# targets and does not work inside a ColumnTransformer, so categorical
# features are one-hot encoded instead
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Preprocess features
X_processed = preprocessor.fit_transform(X)
```

## Dataset Characteristics

### Adult Dataset
- **Size**: ~48,000 samples
- **Features**: 14 (age, workclass, education, etc.)
- **Target**: Binary income classification (>50K/year)
- **Sensitive Features**: Sex, race
- **Fairness Issues**: Historical biases in income across demographic groups (see the base-rate check below)

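A quick way to see these biases is to compare base rates of the positive label across groups. A minimal sketch, assuming the Bunch interface documented above:

```python
from fairlearn.datasets import fetch_adult

adult = fetch_adult(as_frame=True)
y, A = adult.target, adult.sensitive_features

# Label distribution per group; large gaps reflect the historical disparities
for col in adult.sensitive_feature_names:
    print(f"\n{col}:")
    print(y.groupby(A[col]).value_counts(normalize=True))
```
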
### ACS Income Dataset
- **Size**: Variable by state and year (10K-500K samples)
- **Features**: 10 census-related features
- **Target**: Binary income classification
- **Sensitive Features**: Sex, race
- **Advantages**: More recent data, state-specific filtering

### Bank Marketing Dataset
- **Size**: ~45,000 samples
- **Features**: 16 client and campaign features
- **Target**: Binary term deposit subscription
- **Sensitive Features**: Age groups
- **Use Case**: Marketing fairness, age discrimination

### Other Datasets
Each dataset includes appropriate sensitive features and represents realistic fairness challenges in different domains (finance, healthcare, housing, etc.).

## Best Practices

### Data Exploration

Always explore the dataset before use:

```python
import pandas as pd

from fairlearn.datasets import fetch_adult

def explore_fairness_dataset(data_bunch):
    """Explore fairness-related properties of a dataset."""
    print(f"Dataset shape: {data_bunch.data.shape}")
    print(f"Missing values: {data_bunch.data.isnull().sum().sum()}")

    # Target distribution
    print("\nTarget distribution:")
    print(data_bunch.target.value_counts(normalize=True))

    # Sensitive feature distributions
    print("\nSensitive feature distributions:")
    for col in data_bunch.sensitive_feature_names:
        print(f"\n{col}:")
        print(data_bunch.sensitive_features[col].value_counts())

    # Cross-tabulation of target and sensitive features
    for col in data_bunch.sensitive_feature_names:
        print(f"\nTarget vs {col}:")
        crosstab = pd.crosstab(
            data_bunch.sensitive_features[col],
            data_bunch.target,
            normalize='index'
        )
        print(crosstab)

# Use the exploration function
adult = fetch_adult(as_frame=True)
explore_fairness_dataset(adult)
```

### Ethical Considerations

1. **Data Awareness**: Understand the historical context and potential biases
2. **Boston Housing**: Avoid using due to known racial bias issues
3. **Sensitive Feature Selection**: Consider which attributes should be treated as sensitive
4. **Intersectionality**: Consider interactions between multiple sensitive attributes, as shown in the sketch below

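Groups defined by combinations of sensitive attributes can hide disparities that neither attribute shows alone. A minimal sketch, assuming the Adult interface above; the toy decision tree (trained on numeric columns only) exists just to produce predictions to audit:

```python
from fairlearn.datasets import fetch_adult
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y, A = fetch_adult(return_X_y=True)

# Toy model on numeric features only, to keep the sketch self-contained
X_num = X.select_dtypes(include='number')
pred = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_num, y).predict(X_num)

# Passing the full sensitive-features DataFrame yields one row per
# (sex, race) combination rather than per single attribute
mf = MetricFrame(metrics=accuracy_score, y_true=y, y_pred=pred, sensitive_features=A)
print(mf.by_group)
```
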
### Performance Baselines

Establish fairness baselines:

```python
import pandas as pd

from fairlearn.datasets import fetch_acs_income, fetch_adult
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from sklearn.ensemble import RandomForestClassifier

def establish_baseline(dataset_name="adult"):
    """Establish baseline fairness metrics for a dataset."""
    if dataset_name == "adult":
        X, y, A = fetch_adult(return_X_y=True)
    elif dataset_name == "acs":
        X, y, A = fetch_acs_income(return_X_y=True, accept_download=True)
    else:
        raise ValueError(f"Unknown dataset: {dataset_name}")

    # One-hot encode categoricals; random forests require numeric inputs
    X_enc = pd.get_dummies(X)

    # Train a simple baseline model (in-sample evaluation; use a held-out
    # split for real benchmarking)
    model = RandomForestClassifier(random_state=42)
    model.fit(X_enc, y)
    predictions = model.predict(X_enc)

    # Compute fairness metrics
    mf = MetricFrame(
        metrics={'accuracy': lambda y_true, y_pred: (y_true == y_pred).mean()},
        y_true=y,
        y_pred=predictions,
        sensitive_features=A
    )

    dp_diff = demographic_parity_difference(y, predictions, sensitive_features=A)

    return {
        'overall_accuracy': mf.overall['accuracy'],
        'group_accuracies': mf.by_group['accuracy'],
        'demographic_parity_difference': dp_diff
    }

baseline = establish_baseline("adult")
print("Baseline fairness metrics:", baseline)
```