# Datasets

Standard datasets commonly used for fairness research and benchmarking, with consistent interfaces and built-in sensitive feature identification. These datasets provide realistic examples for testing fairness algorithms and include known fairness challenges.

## Capabilities

### Adult (Census Income) Dataset

The Adult dataset predicts whether income exceeds $50K/year based on census data. This is one of the most commonly used datasets in fairness research.

```python { .api }
def fetch_adult(*, cache=True, data_home=None, as_frame=True, return_X_y=False):
    """
    Load the Adult (Census Income) dataset.

    Parameters:
    - cache: bool, whether to cache downloaded data locally
    - data_home: str, path to store cached data (default: ~/fairlearn_data)
    - as_frame: bool, return pandas DataFrame and Series (True) or numpy arrays (False)
    - return_X_y: bool, return (X, y, sensitive_features) tuple instead of Bunch object

    Returns:
    Bunch object with:
    - data: DataFrame or array, feature matrix
    - target: Series or array, target values (0: <=50K, 1: >50K)
    - feature_names: list, names of features
    - target_names: list, names of target classes
    - sensitive_features: DataFrame or array, sensitive attributes (sex, race)
    - sensitive_feature_names: list, names of sensitive features
    """
```

#### Usage Example

```python
from fairlearn.datasets import fetch_adult

# Load as DataFrames (recommended)
adult_data = fetch_adult(as_frame=True)
X = adult_data.data
y = adult_data.target
sensitive_features = adult_data.sensitive_features

print(f"Features shape: {X.shape}")
print(f"Target distribution: {y.value_counts()}")
print(f"Sensitive features: {adult_data.sensitive_feature_names}")

# Load for direct use
X, y, A = fetch_adult(return_X_y=True)
```

### ACS Income Dataset

American Community Survey (ACS) income dataset providing more recent census-like data with state-level filtering options.

```python { .api }
def fetch_acs_income(*, cache=True, data_home=None, as_frame=True, return_X_y=False,
                     state="CA", year=2018, with_nulls=False,
                     optimization="mem", accept_download=False):
    """
    Load the ACS Income dataset from the American Community Survey.

    Parameters:
    - cache: bool, whether to cache downloaded data
    - data_home: str, path to store cached data
    - as_frame: bool, return pandas DataFrame and Series
    - return_X_y: bool, return (X, y, sensitive_features) tuple
    - state: str, state abbreviation for data filtering (e.g., "CA", "NY", "TX")
    - year: int, year of survey data (2014-2018 available)
    - with_nulls: bool, whether to include missing values
    - optimization: str, memory optimization ("mem" or "speed")
    - accept_download: bool, whether to accept downloading the large dataset

    Returns:
    Bunch object with census data and sensitive features including race and sex
    """
```

#### Usage Example

```python
from fairlearn.datasets import fetch_acs_income

# Load California 2018 data
acs_data = fetch_acs_income(
    state="CA",
    year=2018,
    accept_download=True  # Required for first download
)

X = acs_data.data
y = acs_data.target
sensitive_features = acs_data.sensitive_features

print(f"ACS Income dataset for CA 2018: {X.shape[0]} samples")
```

### Bank Marketing Dataset

Portuguese bank marketing campaign dataset for predicting term deposit subscriptions.

```python { .api }
def fetch_bank_marketing(*, cache=True, data_home=None, as_frame=True, return_X_y=False):
    """
    Load the Bank Marketing dataset.

    Parameters:
    - cache: bool, whether to cache downloaded data
    - data_home: str, path to store cached data
    - as_frame: bool, return pandas DataFrame and Series
    - return_X_y: bool, return (X, y, sensitive_features) tuple

    Returns:
    Bunch object with:
    - data: feature matrix with client information
    - target: binary target (subscribed to term deposit)
    - sensitive_features: age group as sensitive attribute
    """
```
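
#### Usage Example

A minimal usage sketch following the Bunch interface described above; the exact age-group categories are not specified here, so inspect the loaded sensitive feature before relying on particular values.

```python
from fairlearn.datasets import fetch_bank_marketing

# Load the marketing data as DataFrames
bank = fetch_bank_marketing(as_frame=True)
X = bank.data
y = bank.target
age_group = bank.sensitive_features  # age group, per the description above

print(f"Samples: {X.shape[0]}, features: {X.shape[1]}")
print("Target distribution:")
print(y.value_counts())
print("Sensitive feature(s):", bank.sensitive_feature_names)
```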

### Boston Housing Dataset

Boston housing prices dataset (note: deprecated due to ethical concerns).

```python { .api }
def fetch_boston(*, cache=True, data_home=None, as_frame=True, return_X_y=False, warn=True):
    """
    Load the Boston Housing dataset.

    **Warning**: This dataset has known fairness issues and is deprecated.

    Parameters:
    - cache: bool, whether to cache data
    - data_home: str, path to store cached data
    - as_frame: bool, return pandas DataFrame and Series
    - return_X_y: bool, return (X, y, sensitive_features) tuple
    - warn: bool, whether to display fairness warning

    Returns:
    Bunch object with housing data and racial composition as sensitive feature
    """
```

### Credit Card Fraud Dataset

Credit card fraud detection dataset for binary classification.

```python { .api }
def fetch_credit_card(*, cache=True, data_home=None, as_frame=True, return_X_y=False):
    """
    Load the Credit Card Fraud dataset.

    Parameters:
    - cache: bool, whether to cache downloaded data
    - data_home: str, path to store cached data
    - as_frame: bool, return pandas DataFrame and Series
    - return_X_y: bool, return (X, y, sensitive_features) tuple

    Returns:
    Bunch object with:
    - data: anonymized credit card transaction features
    - target: binary fraud indicator (0: legitimate, 1: fraud)
    - sensitive_features: derived sensitive attributes
    """
```
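
#### Usage Example

A minimal usage sketch following the Bunch interface described above. Fraud data is typically highly imbalanced, so the sketch checks the class proportions before any fairness analysis; treat that expectation as an assumption rather than a documented property.

```python
from fairlearn.datasets import fetch_credit_card

# Load transactions as DataFrames
fraud = fetch_credit_card(as_frame=True)
X = fraud.data
y = fraud.target
A = fraud.sensitive_features  # derived sensitive attributes, per the description above

# Check class proportions (fraud vs. legitimate) before modelling
print(y.value_counts(normalize=True))
print("Sensitive feature(s):", fraud.sensitive_feature_names)
```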

### Diabetes Hospital Dataset

Hospital diabetes patient dataset for predicting readmission risk.

```python { .api }
def fetch_diabetes_hospital(*, as_frame=True, cache=True, data_home=None, return_X_y=False):
    """
    Load the Diabetes Hospital dataset.

    Parameters:
    - as_frame: bool, return pandas DataFrame and Series
    - cache: bool, whether to cache downloaded data
    - data_home: str, path to store cached data
    - return_X_y: bool, return (X, y, sensitive_features) tuple

    Returns:
    Bunch object with:
    - data: patient medical features
    - target: readmission outcome
    - sensitive_features: race and gender
    """
```
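
#### Usage Example

A minimal usage sketch following the Bunch interface described above; it compares readmission outcomes across the documented sensitive features (race and gender) with a simple cross-tabulation.

```python
import pandas as pd

from fairlearn.datasets import fetch_diabetes_hospital

# Load patient records as DataFrames
diabetes = fetch_diabetes_hospital(as_frame=True)
X = diabetes.data
y = diabetes.target
A = diabetes.sensitive_features

# Readmission outcome rates within each sensitive feature group
for col in diabetes.sensitive_feature_names:
    print(f"\nReadmission outcome by {col}:")
    print(pd.crosstab(A[col], y, normalize='index'))
```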

## Common Usage Patterns

### Basic Data Loading

```python
from fairlearn.datasets import fetch_adult, fetch_acs_income

# Load Adult dataset
adult = fetch_adult(as_frame=True)
X_adult, y_adult, A_adult = adult.data, adult.target, adult.sensitive_features

# Load ACS dataset for a specific state and year
acs = fetch_acs_income(state="NY", year=2017, accept_download=True)
X_acs, y_acs, A_acs = acs.data, acs.target, acs.sensitive_features
```

### Direct Unpacking

```python
from fairlearn.datasets import fetch_adult
from sklearn.model_selection import train_test_split

# Get data ready for an ML pipeline
X, y, sensitive_features = fetch_adult(return_X_y=True)

# Split features, target, and sensitive features together for training
X_train, X_test, y_train, y_test, A_train, A_test = train_test_split(
    X, y, sensitive_features, test_size=0.3, random_state=42, stratify=y
)
```

217

218

### Exploring Dataset Properties

219

220

```python

221

import pandas as pd

222

223

# Load and explore Adult dataset

224

adult = fetch_adult(as_frame=True)

225

226

print("Dataset shape:", adult.data.shape)

227

print("Target distribution:")

228

print(adult.target.value_counts())

229

230

print("\nSensitive features:")

231

print(adult.sensitive_features.head())

232

233

print("\nFeature names:")

234

print(adult.feature_names)

235

236

print("\nSensitive feature breakdown:")

237

for col in adult.sensitive_feature_names:

238

print(f"{col}: {adult.sensitive_features[col].value_counts()}")

239

```

240

241

### Data Preprocessing Pipeline

242

243

```python

244

from sklearn.preprocessing import StandardScaler, LabelEncoder

245

from sklearn.compose import ColumnTransformer

246

from sklearn.pipeline import Pipeline

247

248

# Load dataset

249

X, y, A = fetch_adult(return_X_y=True)

250

251

# Identify categorical and numerical columns

252

categorical_features = X.select_dtypes(include=['object']).columns

253

numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

254

255

# Create preprocessing pipeline

256

preprocessor = ColumnTransformer(

257

transformers=[

258

('num', StandardScaler(), numerical_features),

259

('cat', LabelEncoder(), categorical_features)

260

]

261

)

262

263

# Preprocess features

264

X_processed = preprocessor.fit_transform(X)

265

```

## Dataset Characteristics

### Adult Dataset

- **Size**: ~48,000 samples
- **Features**: 14 (age, workclass, education, etc.)
- **Target**: Binary income classification (>50K/year)
- **Sensitive Features**: Sex, race
- **Fairness Issues**: Historical biases in income based on demographics

### ACS Income Dataset

- **Size**: Variable by state and year (10K-500K samples)
- **Features**: 10 census-related features
- **Target**: Binary income classification
- **Sensitive Features**: Sex, race
- **Advantages**: More recent data, state-specific filtering

### Bank Marketing Dataset

- **Size**: ~45,000 samples
- **Features**: 16 client and campaign features
- **Target**: Binary term deposit subscription
- **Sensitive Features**: Age groups
- **Use Case**: Marketing fairness, age discrimination

### Other Datasets

Each dataset includes appropriate sensitive features and represents realistic fairness challenges in different domains (finance, healthcare, housing, etc.).
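
The summaries above can be spot-checked directly from the loaders. A short sketch, limited to the smaller datasets that do not require an explicit download confirmation:

```python
from fairlearn.datasets import fetch_adult, fetch_bank_marketing

# Print basic characteristics for a couple of the bundled datasets
for name, fetcher in [("Adult", fetch_adult), ("Bank Marketing", fetch_bank_marketing)]:
    bunch = fetcher(as_frame=True)
    print(f"{name}: {bunch.data.shape[0]} samples, "
          f"{bunch.data.shape[1]} features, "
          f"sensitive features: {bunch.sensitive_feature_names}")
```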

## Best Practices

### Data Exploration

Always explore the dataset before use:

```python
import pandas as pd

from fairlearn.datasets import fetch_adult


def explore_fairness_dataset(data_bunch):
    """Explore fairness-related properties of a dataset."""
    print(f"Dataset shape: {data_bunch.data.shape}")
    print(f"Missing values: {data_bunch.data.isnull().sum().sum()}")

    # Target distribution
    print("\nTarget distribution:")
    print(data_bunch.target.value_counts(normalize=True))

    # Sensitive feature distributions
    print("\nSensitive feature distributions:")
    for col in data_bunch.sensitive_feature_names:
        print(f"\n{col}:")
        print(data_bunch.sensitive_features[col].value_counts())

    # Cross-tabulation of target and sensitive features
    for col in data_bunch.sensitive_feature_names:
        print(f"\nTarget vs {col}:")
        crosstab = pd.crosstab(
            data_bunch.sensitive_features[col],
            data_bunch.target,
            normalize='index'
        )
        print(crosstab)


# Use the exploration function
adult = fetch_adult(as_frame=True)
explore_fairness_dataset(adult)
```

### Ethical Considerations

1. **Data Awareness**: Understand the historical context and potential biases
2. **Boston Housing**: Avoid using due to known racial bias issues
3. **Sensitive Feature Selection**: Consider which attributes should be treated as sensitive
4. **Intersectionality**: Consider interactions between multiple sensitive attributes (a sketch follows below)
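
A minimal sketch of point 4: combining all sensitive columns of the Adult dataset into a single intersectional group label. Column names are taken from whatever the loaded sensitive-features frame contains, so nothing beyond the interface described above is assumed.

```python
from fairlearn.datasets import fetch_adult

adult = fetch_adult(as_frame=True)
A = adult.sensitive_features

# Build one label per row, e.g. "Female & Black", so that downstream metrics
# can be computed per intersection rather than per single attribute
intersectional = A.astype(str).agg(" & ".join, axis=1)
print(intersectional.value_counts())
```

Note that MetricFrame also accepts a multi-column sensitive_features DataFrame directly, which yields per-intersection groups without manual combination.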

### Performance Baselines

Establish fairness baselines:

```python
import pandas as pd

from fairlearn.datasets import fetch_acs_income, fetch_adult
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from sklearn.ensemble import RandomForestClassifier


def establish_baseline(dataset_name="adult"):
    """Establish baseline fairness metrics for a dataset."""
    if dataset_name == "adult":
        X, y, A = fetch_adult(return_X_y=True)
    elif dataset_name == "acs":
        X, y, A = fetch_acs_income(return_X_y=True, accept_download=True)
    else:
        raise ValueError(f"Unknown dataset: {dataset_name}")

    # One-hot encode categorical features so the tree ensemble can train on them
    X = pd.get_dummies(X)

    # Train simple baseline model
    model = RandomForestClassifier(random_state=42)
    model.fit(X, y)
    predictions = model.predict(X)

    # Compute fairness metrics
    mf = MetricFrame(
        metrics={'accuracy': lambda y_true, y_pred: (y_true == y_pred).mean()},
        y_true=y,
        y_pred=predictions,
        sensitive_features=A
    )

    dp_diff = demographic_parity_difference(y, predictions, sensitive_features=A)

    return {
        'overall_accuracy': mf.overall['accuracy'],
        'group_accuracies': mf.by_group['accuracy'],
        'demographic_parity_difference': dp_diff
    }


baseline = establish_baseline("adult")
print("Baseline fairness metrics:", baseline)
```