A Python package to assess and improve the fairness of machine learning models.
—
Standard datasets commonly used for fairness research and benchmarking, with consistent interfaces and built-in sensitive feature identification. These datasets provide realistic examples for testing fairness algorithms and include known fairness challenges.
The Adult dataset predicts whether income exceeds $50K/year based on census data. This is one of the most commonly used datasets in fairness research.
def fetch_adult(*, cache: bool = True, data_home=None, as_frame: bool = True, return_X_y: bool = False):
"""
Load the Adult (Census Income) dataset.
Parameters:
- cache: bool, whether to cache downloaded data locally
- data_home: str, path to store cached data (default: ~/fairlearn_data)
- as_frame: bool, return pandas DataFrame and Series (True) or numpy arrays (False)
- return_X_y: bool, return (X, y, sensitive_features) tuple instead of Bunch object
Returns:
Bunch object with:
- data: DataFrame or array, feature matrix
- target: Series or array, target values (0: <=50K, 1: >50K)
- feature_names: list, names of features
- target_names: list, names of target classes
- sensitive_features: DataFrame or array, sensitive attributes (sex, race)
- sensitive_feature_names: list, names of sensitive features
"""from fairlearn.datasets import fetch_adult
# Load as DataFrames (recommended)
adult_data = fetch_adult(as_frame=True)
# Bunch keeps features, labels, and sensitive attributes as separate objects
X = adult_data.data
y = adult_data.target
sensitive_features = adult_data.sensitive_features
print(f"Features shape: {X.shape}")
print(f"Target distribution: {y.value_counts()}")
print(f"Sensitive features: {adult_data.sensitive_feature_names}")
# Load for direct use
X, y, A = fetch_adult(return_X_y=True)American Community Survey (ACS) income dataset providing more recent census-like data with state-level filtering options.
def fetch_acs_income(*, cache: bool = True, data_home=None, as_frame: bool = True, return_X_y: bool = False,
state: str = "CA", year: int = 2018, with_nulls: bool = False,
optimization: str = "mem", accept_download: bool = False):
"""
Load the ACS Income dataset from American Community Survey.
Parameters:
- cache: bool, whether to cache downloaded data
- data_home: str, path to store cached data
- as_frame: bool, return pandas DataFrame and Series
- return_X_y: bool, return (X, y, sensitive_features) tuple
- state: str, state abbreviation for data filtering (e.g., "CA", "NY", "TX")
- year: int, year of survey data (2014-2018 available)
- with_nulls: bool, whether to include missing values
- optimization: str, memory optimization ("mem" or "speed")
- accept_download: bool, whether to accept downloading large dataset
Returns:
Bunch object with census data and sensitive features including race and sex
"""from fairlearn.datasets import fetch_acs_income
# Load California 2018 data
acs_data = fetch_acs_income(
state="CA",
year=2018,
accept_download=True # Required for first download
)
# Bunch attributes mirror fetch_adult: .data, .target, .sensitive_features
X = acs_data.data
y = acs_data.target
sensitive_features = acs_data.sensitive_features
print(f"ACS Income dataset for CA 2018: {X.shape[0]} samples")Portuguese bank marketing campaign dataset for predicting term deposit subscriptions.
def fetch_bank_marketing(*, cache: bool = True, data_home=None, as_frame: bool = True, return_X_y: bool = False):
"""
Load the Bank Marketing dataset.
Parameters:
- cache: bool, whether to cache downloaded data
- data_home: str, path to store cached data
- as_frame: bool, return pandas DataFrame and Series
- return_X_y: bool, return (X, y, sensitive_features) tuple
Returns:
Bunch object with:
- data: feature matrix with client information
- target: binary target (subscribed to term deposit)
- sensitive_features: age group as sensitive attribute
"""Boston housing prices dataset (note: deprecated due to ethical concerns).
def fetch_boston(*, cache: bool = True, data_home=None, as_frame: bool = True, return_X_y: bool = False, warn: bool = True):
"""
Load the Boston Housing dataset.
**Warning**: This dataset has known fairness issues and is deprecated.
Parameters:
- cache: bool, whether to cache data
- data_home: str, path to store cached data
- as_frame: bool, return pandas DataFrame and Series
- return_X_y: bool, return (X, y, sensitive_features) tuple
- warn: bool, whether to display fairness warning
Returns:
Bunch object with housing data and racial composition as sensitive feature
"""Credit card fraud detection dataset for binary classification.
def fetch_credit_card(*, cache: bool = True, data_home=None, as_frame: bool = True, return_X_y: bool = False):
"""
Load the Credit Card Fraud dataset.
Parameters:
- cache: bool, whether to cache downloaded data
- data_home: str, path to store cached data
- as_frame: bool, return pandas DataFrame and Series
- return_X_y: bool, return (X, y, sensitive_features) tuple
Returns:
Bunch object with:
- data: anonymized credit card transaction features
- target: binary fraud indicator (0: legitimate, 1: fraud)
- sensitive_features: derived sensitive attributes
"""Hospital diabetes patient dataset for predicting readmission risk.
def fetch_diabetes_hospital(*, as_frame: bool = True, cache: bool = True, data_home=None, return_X_y: bool = False):
"""
Load the Diabetes Hospital dataset.
Parameters:
- as_frame: bool, return pandas DataFrame and Series
- cache: bool, whether to cache downloaded data
- data_home: str, path to store cached data
- return_X_y: bool, return (X, y, sensitive_features) tuple
Returns:
Bunch object with:
- data: patient medical features
- target: readmission outcome
- sensitive_features: race and gender
"""from fairlearn.datasets import fetch_adult, fetch_acs_income
# Load Adult dataset
adult = fetch_adult(as_frame=True)
X_adult, y_adult, A_adult = adult.data, adult.target, adult.sensitive_features
# Load ACS dataset for specific state and year
acs = fetch_acs_income(state="NY", year=2017, accept_download=True)
X_acs, y_acs, A_acs = acs.data, acs.target, acs.sensitive_features# Get data ready for ML pipeline
X, y, sensitive_features = fetch_adult(return_X_y=True)
# Split for training
from sklearn.model_selection import train_test_split
# Passing sensitive_features through train_test_split keeps the sensitive
# attribute rows aligned with X and y in both splits; stratify=y preserves
# the class balance of the binary target.
X_train, X_test, y_train, y_test, A_train, A_test = train_test_split(
X, y, sensitive_features, test_size=0.3, random_state=42, stratify=y
)import pandas as pd
# Load and explore Adult dataset
adult = fetch_adult(as_frame=True)
print("Dataset shape:", adult.data.shape)
print("Target distribution:")
print(adult.target.value_counts())
print("\nSensitive features:")
print(adult.sensitive_features.head())
print("\nFeature names:")
print(adult.feature_names)
print("\nSensitive feature breakdown:")
# Per-column value counts for each declared sensitive feature
for col in adult.sensitive_feature_names:
print(f"{col}: {adult.sensitive_features[col].value_counts()}")from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Load dataset
X, y, A = fetch_adult(return_X_y=True)
# Identify categorical and numerical columns
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numerical_features),
('cat', LabelEncoder(), categorical_features)
]
)
# Preprocess features
X_processed = preprocessor.fit_transform(X)Each dataset includes appropriate sensitive features and represents realistic fairness challenges in different domains (finance, healthcare, housing, etc.).
Always explore the dataset before use:
def explore_fairness_dataset(data_bunch):
    """Print a fairness-oriented summary of a dataset Bunch.

    Expects a bunch exposing .data (DataFrame), .target (Series),
    .sensitive_features (DataFrame) and .sensitive_feature_names (list).
    Reports shape, missing values, target balance, per-group counts, and
    per-group target rates. Returns None; output goes to stdout.
    """
    features, labels = data_bunch.data, data_bunch.target
    print(f"Dataset shape: {features.shape}")
    print(f"Missing values: {features.isnull().sum().sum()}")
    # Target distribution (as proportions)
    print("\nTarget distribution:")
    print(labels.value_counts(normalize=True))
    # How each sensitive attribute is distributed
    print("\nSensitive feature distributions:")
    for name in data_bunch.sensitive_feature_names:
        print(f"\n{name}:")
        print(data_bunch.sensitive_features[name].value_counts())
    # Target rate within each sensitive group (rows normalized to sum to 1)
    for name in data_bunch.sensitive_feature_names:
        print(f"\nTarget vs {name}:")
        group_rates = pd.crosstab(
            data_bunch.sensitive_features[name], labels, normalize='index'
        )
        print(group_rates)
# Use the exploration function
adult = fetch_adult(as_frame=True)
explore_fairness_dataset(adult)Establish fairness baselines:
# Fairness metrics plus a simple classifier for the baseline function below
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from sklearn.ensemble import RandomForestClassifier
def establish_baseline(dataset_name="adult"):
    """Establish baseline fairness metrics for a dataset.

    Parameters:
    - dataset_name: str, dataset to load ("adult" or "acs")

    Returns:
    dict with overall accuracy, per-group accuracies, and the
    demographic parity difference of a naive RandomForest baseline.

    Raises:
    ValueError: if dataset_name is not a supported dataset name.
    """
    if dataset_name == "adult":
        X, y, A = fetch_adult(return_X_y=True)
    elif dataset_name == "acs":
        X, y, A = fetch_acs_income(return_X_y=True, accept_download=True)
    else:
        # Fail fast: previously an unknown name fell through and crashed
        # with UnboundLocalError on X below.
        raise ValueError(
            f"Unknown dataset_name: {dataset_name!r}; expected 'adult' or 'acs'"
        )
    # Train simple baseline model.
    # NOTE: metrics are computed on the training data itself — this is a quick
    # fairness baseline, not an estimate of generalization performance.
    model = RandomForestClassifier(random_state=42)
    model.fit(X, y)
    predictions = model.predict(X)
    # Compute accuracy overall and per sensitive group
    mf = MetricFrame(
        metrics={'accuracy': lambda y_true, y_pred: (y_true == y_pred).mean()},
        y_true=y,
        y_pred=predictions,
        sensitive_features=A
    )
    dp_diff = demographic_parity_difference(y, predictions, sensitive_features=A)
    return {
        'overall_accuracy': mf.overall['accuracy'],
        'group_accuracies': mf.by_group['accuracy'],
        'demographic_parity_difference': dp_diff
    }
# Run the baseline on the Adult dataset and report its fairness metrics
baseline = establish_baseline("adult")
print("Baseline fairness metrics:", baseline)Install with Tessl CLI
npx tessl i tessl/pypi-fairlearn