
tessl/pypi-imbalanced-learn

Toolbox for imbalanced datasets in machine learning

docs/model-selection.md

Model Selection

Cross-validation and model selection tools adapted for imbalanced datasets, providing specialized splitting strategies that consider instance hardness and class distribution to ensure more reliable model evaluation.

Overview

Imbalanced-learn extends scikit-learn's model selection capabilities with specialized cross-validation strategies that account for class imbalance. These tools help ensure fair evaluation of models on imbalanced datasets by considering instance difficulty and maintaining appropriate class distributions across folds.

Key Features

  • Instance hardness awareness: Cross-validation that considers sample difficulty
  • Balanced fold distribution: Ensures minority class representation across all folds
  • Compatible with scikit-learn: Seamless integration with existing model selection workflows
  • Binary classification focus: Specialized for binary imbalanced problems

Cross-Validation Strategies

InstanceHardnessCV

class InstanceHardnessCV:
    def __init__(
        self,
        estimator,
        *,
        n_splits=5,
        pos_label=None
    ): ...
    def split(self, X, y, groups=None): ...
    def get_n_splits(self, X=None, y=None, groups=None): ...

Instance-hardness cross-validation splitter that distributes samples with large instance hardness equally over the folds.

Parameters:

  • estimator (estimator object): Classifier to be used to estimate instance hardness of the samples. This classifier should implement predict_proba
  • n_splits (int, default=5): Number of folds. Must be at least 2
  • pos_label (int, float, bool or str, default=None): The class considered the positive class when selecting the probability representing the instance hardness. If None, the positive class is automatically inferred by the estimator as estimator.classes_[1]

Methods:

split
def split(self, X, y, groups=None) -> Generator[tuple[ndarray, ndarray], None, None]

Generate indices to split data into training and test set.

Parameters:

  • X (array-like of shape (n_samples, n_features)): Training data, where n_samples is the number of samples and n_features is the number of features
  • y (array-like of shape (n_samples,)): The target variable for supervised learning problems
  • groups (object): Always ignored, exists for compatibility

Yields:

  • train (ndarray): The training set indices for that split
  • test (ndarray): The testing set indices for that split
get_n_splits
def get_n_splits(self, X=None, y=None, groups=None) -> int

Returns the number of splitting iterations in the cross-validator.

Parameters:

  • X (object): Always ignored, exists for compatibility
  • y (object): Always ignored, exists for compatibility
  • groups (object): Always ignored, exists for compatibility

Returns:

  • n_splits (int): Returns the number of splitting iterations in the cross-validator

Instance Hardness Concept: The instance hardness is internally estimated using the provided estimator and stratified cross-validation. Samples with higher instance hardness (those that are harder to classify correctly) are distributed more evenly across folds to ensure each fold contains a representative mix of easy and difficult samples.
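
This hardness estimation can be sketched with plain scikit-learn. Note this is a hedged illustration: defining hardness as `1 - p(true class)` from out-of-fold probabilities is one common formulation, not necessarily the library's exact internals:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(weights=[0.9, 0.1], n_samples=200, random_state=0)

# Out-of-fold class probabilities from cross-validation
# (cross_val_predict uses stratified folds for classifiers by default)
proba = cross_val_predict(LogisticRegression(), X, y, cv=5, method="predict_proba")

# One common definition: hardness = 1 - probability assigned to the true class
hardness = 1.0 - proba[np.arange(len(y)), y]
print(hardness.shape)
```

Samples near the decision boundary, or mislabeled ones, get probabilities close to (or below) 0.5 for their true class, and therefore high hardness.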

Algorithm:

  1. Uses cross-validation to estimate instance hardness via predict_proba
  2. Sorts samples first by class label, then by instance hardness
  3. Distributes samples across folds to balance both class distribution and hardness levels
  4. Ensures each fold has similar difficulty characteristics
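
Steps 2 and 3 can be illustrated with a toy NumPy sketch. The round-robin dealing below is an assumed assignment rule for illustration, not necessarily the library's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=20)   # binary labels (step 2 sorts by class first)
hardness = rng.random(20)         # per-sample hardness from step 1
n_splits = 5

# Step 2: sort samples by class label, then by hardness within each class
order = np.lexsort((hardness, y))

# Step 3: deal the sorted samples round-robin into folds, so each fold
# receives a similar mix of classes and hardness levels
folds = np.empty(20, dtype=int)
folds[order] = np.arange(20) % n_splits

# Every fold ends up the same size here (20 / 5 = 4 samples)
print(np.bincount(folds))
```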

Example:

from imblearn.model_selection import InstanceHardnessCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

# Create imbalanced dataset
X, y = make_classification(
    weights=[0.9, 0.1], 
    class_sep=2,
    n_informative=3, 
    n_redundant=1, 
    flip_y=0.05, 
    n_samples=1000, 
    random_state=10
)

# Create instance hardness CV
estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(estimator)

# Use in cross-validation
cv_result = cross_validate(estimator, X, y, cv=ih_cv)
print(f"Standard deviation of test_scores: {cv_result['test_score'].std():.3f}")

# Manual splitting
for train_idx, test_idx in ih_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate a model on this fold
    fold_model = LogisticRegression().fit(X_train, y_train)
    print(f"Fold accuracy: {fold_model.score(X_test, y_test):.3f}")

Integration with scikit-learn

Compatible Workflows

Cross-validation Functions:

from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV
from imblearn.model_selection import InstanceHardnessCV

# Use with cross_val_score
scores = cross_val_score(estimator, X, y, cv=InstanceHardnessCV(estimator))

# Use with cross_validate
cv_results = cross_validate(estimator, X, y, cv=InstanceHardnessCV(estimator))

# Use with GridSearchCV (param_grid shown is illustrative,
# e.g. for a LogisticRegression estimator)
param_grid = {"C": [0.1, 1.0, 10.0]}
grid_search = GridSearchCV(
    estimator,
    param_grid,
    cv=InstanceHardnessCV(estimator)
)

Pipeline Integration:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Create pipeline with sampling
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Use instance hardness CV for evaluation
ih_cv = InstanceHardnessCV(LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=ih_cv)

Comparison with Standard CV

Advantages over Standard Cross-Validation

Standard StratifiedKFold:

  • Only considers class distribution
  • May create folds with varying difficulty levels
  • Can lead to optimistic or pessimistic performance estimates

InstanceHardnessCV:

  • Considers both class distribution and sample difficulty
  • Creates folds with balanced hardness levels
  • Provides more reliable performance estimates on imbalanced data
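
The contrast can be seen with a scikit-learn-only sketch: StratifiedKFold constrains only the per-fold class ratio, saying nothing about sample difficulty:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(weights=[0.9, 0.1], n_samples=500, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fractions = []
for _, test_idx in skf.split(X, y):
    # Each fold preserves the overall minority-class ratio...
    fractions.append(y[test_idx].mean())
print([round(f, 2) for f in fractions])
# ...but nothing constrains how hard-to-classify samples are spread,
# so per-fold difficulty can still vary.
```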

When to Use:

  • Binary classification problems with class imbalance
  • When sample difficulty varies significantly within classes
  • For more reliable model selection on imbalanced datasets
  • When you need consistent cross-validation performance

Limitations:

  • Currently supports only binary classification
  • Requires additional computation for hardness estimation
  • The base estimator must implement predict_proba
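
The predict_proba requirement can be checked before constructing the splitter. For example, scikit-learn's SVC exposes predict_proba only when built with probability=True:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# LogisticRegression always exposes predict_proba; SVC only does so
# when constructed with probability=True.
for clf in (LogisticRegression(), SVC(), SVC(probability=True)):
    print(type(clf).__name__, hasattr(clf, "predict_proba"))
```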

Best Practices

  1. Choose appropriate base estimator: Use a fast, reasonable classifier for hardness estimation
  2. Consider computational cost: Instance hardness estimation adds overhead
  3. Validate assumptions: Ensure your problem benefits from hardness-aware splitting
  4. Combine with sampling: Use alongside imblearn sampling techniques for a comprehensive approach

Complete Example:

from imblearn.model_selection import InstanceHardnessCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, 
    weights=[0.8, 0.2], 
    n_samples=1000,
    random_state=42
)

# Create pipeline
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Use instance hardness CV
base_estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(base_estimator, n_splits=5)

# Evaluate model
cv_results = cross_validate(
    pipeline, X, y, 
    cv=ih_cv, 
    scoring=['accuracy', 'f1', 'roc_auc'],
    return_train_score=True
)

print(f"Test accuracy: {cv_results['test_accuracy'].mean():.3f} ± {cv_results['test_accuracy'].std():.3f}")

Install with Tessl CLI

npx tessl i tessl/pypi-imbalanced-learn
