Toolbox for imbalanced datasets in machine learning
—
Cross-validation and model selection tools adapted for imbalanced datasets, providing specialized splitting strategies that consider instance hardness and class distribution to ensure more reliable model evaluation.
Imbalanced-learn extends scikit-learn's model selection capabilities with specialized cross-validation strategies that account for class imbalance. These tools help ensure fair evaluation of models on imbalanced datasets by considering instance difficulty and maintaining appropriate class distributions across folds.
class InstanceHardnessCV:
    def __init__(
        self,
        estimator,
        *,
        n_splits=5,
        pos_label=None
    ): ...
    def split(self, X, y, groups=None): ...
    def get_n_splits(self, X=None, y=None, groups=None): ...

Instance-hardness cross-validation splitter that distributes samples with large instance hardness equally over the folds.
Parameters:
estimator (object): Classifier used to estimate the instance hardness of the samples. This classifier must implement predict_proba.
n_splits (int, default=5): Number of folds. Must be at least 2.
pos_label (int, float, bool or str, default=None): The class considered the positive class when selecting the probability representing the instance hardness. If None, the positive class is automatically inferred from the estimator as estimator.classes_[1].
Methods:
def split(self, X, y, groups=None) -> Generator[tuple[ndarray, ndarray], None, None]

Generate indices to split data into training and test set.
Parameters:
X (array-like of shape (n_samples, n_features)): Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)): The target variable for supervised learning problems.
groups (object): Always ignored, exists for compatibility.
Yields:
train (ndarray): The training set indices for that split.
test (ndarray): The testing set indices for that split.

def get_n_splits(self, X=None, y=None, groups=None) -> int

Returns the number of splitting iterations in the cross-validator.
Parameters:
X (object): Always ignored, exists for compatibility.
y (object): Always ignored, exists for compatibility.
groups (object): Always ignored, exists for compatibility.
Returns:
n_splits (int): The number of splitting iterations in the cross-validator.

Instance Hardness Concept:
The instance hardness is internally estimated using the provided estimator and stratified cross-validation. Samples with higher instance hardness (those that are harder to classify correctly) are distributed more evenly across folds to ensure each fold contains a representative mix of easy and difficult samples.
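The estimate described above can be reproduced with scikit-learn alone. This sketch (dataset and estimator choices are illustrative, not the library's internals) computes hardness as one minus the out-of-fold probability assigned to each sample's true class:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(weights=[0.9, 0.1], n_samples=500, random_state=0)

# Out-of-fold class probabilities from a stratified split
proba = cross_val_predict(
    LogisticRegression(), X, y,
    cv=StratifiedKFold(n_splits=5), method="predict_proba",
)

# Hardness: 1 - probability assigned to each sample's true class.
# Hard samples (misclassified or near the boundary) get values close to 1.
hardness = 1 - proba[np.arange(len(y)), y]
print(hardness.shape)                             # (500,)
print(hardness.min() >= 0, hardness.max() <= 1)   # True True
```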
Algorithm:
1. Obtain out-of-fold class probabilities for every sample by running the provided estimator through internal stratified cross-validation (the estimator must implement predict_proba).
2. Compute each sample's instance hardness from the probability of the class selected via pos_label.
3. Assign samples to folds so that hard samples are spread evenly, while keeping each fold's class distribution close to the overall distribution.
Example:
from imblearn.model_selection import InstanceHardnessCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
# Create imbalanced dataset
X, y = make_classification(
    weights=[0.9, 0.1],
    class_sep=2,
    n_informative=3,
    n_redundant=1,
    flip_y=0.05,
    n_samples=1000,
    random_state=10
)
# Create instance hardness CV
estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(estimator)
# Use in cross-validation
cv_result = cross_validate(estimator, X, y, cv=ih_cv)
print(f"Standard deviation of test_scores: {cv_result['test_score'].std():.3f}")
# Manual splitting
for train_idx, test_idx in ih_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate the model here

Cross-validation Functions:
from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV
from imblearn.model_selection import InstanceHardnessCV
# Use with cross_val_score
scores = cross_val_score(estimator, X, y, cv=InstanceHardnessCV(estimator))
# Use with cross_validate
cv_results = cross_validate(estimator, X, y, cv=InstanceHardnessCV(estimator))
# Use with GridSearchCV (param_grid shown here for a LogisticRegression estimator)
param_grid = {"C": [0.1, 1.0, 10.0]}
grid_search = GridSearchCV(
    estimator,
    param_grid,
    cv=InstanceHardnessCV(estimator)
)

Pipeline Integration:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
# Create pipeline with sampling
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])
# Use instance hardness CV for evaluation
ih_cv = InstanceHardnessCV(LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=ih_cv)

Standard StratifiedKFold:
- Preserves the overall class proportions in every fold.
- Hard-to-classify samples can end up concentrated in a few folds, inflating the fold-to-fold variance of test scores.
InstanceHardnessCV:
- Additionally distributes hard samples evenly, so each fold contains a representative mix of easy and difficult samples.
- Typically yields more stable test scores across folds on imbalanced data.
When to Use:
- Evaluating classifiers on imbalanced datasets where test scores vary widely between folds.
- Comparing models or resampling pipelines where a stable estimate of generalization performance matters.
Limitations:
- The estimator used to measure hardness must implement predict_proba.
- Estimating instance hardness requires an extra internal cross-validation pass, which adds computation.
- The hardness estimates depend on the choice of estimator.

Complete Example:
from imblearn.model_selection import InstanceHardnessCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.datasets import make_classification
# Create imbalanced dataset
X, y = make_classification(
    n_classes=2,
    weights=[0.8, 0.2],
    n_samples=1000,
    random_state=42
)
# Create pipeline
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])
# Use instance hardness CV
base_estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(base_estimator, n_splits=5)
# Evaluate model
cv_results = cross_validate(
    pipeline, X, y,
    cv=ih_cv,
    scoring=['accuracy', 'f1', 'roc_auc'],
    return_train_score=True
)
print(f"Test accuracy: {cv_results['test_accuracy'].mean():.3f} ± {cv_results['test_accuracy'].std():.3f}")

Install with Tessl CLI
npx tessl i tessl/pypi-imbalanced-learn