Toolbox for imbalanced datasets in machine learning
—
Cross-validation and model selection tools adapted for imbalanced datasets, providing specialized splitting strategies that consider instance hardness and class distribution to ensure more reliable model evaluation.
Imbalanced-learn extends scikit-learn's model selection capabilities with specialized cross-validation strategies that account for class imbalance. These tools help ensure fair evaluation of models on imbalanced datasets by considering instance difficulty and maintaining appropriate class distributions across folds.
class InstanceHardnessCV:
    def __init__(
        self,
        estimator,
        *,
        n_splits=5,
        pos_label=None
    ): ...
    def split(self, X, y, groups=None): ...
    def get_n_splits(self, X=None, y=None, groups=None): ...

Instance-hardness cross-validation splitter that distributes samples with large instance hardness equally over the folds.
Parameters:
estimator (object): Classifier used to estimate the instance hardness of the samples. This classifier must implement predict_proba.
n_splits (int, default=5): Number of folds. Must be at least 2.
pos_label (int, float, bool or str, default=None): The class considered the positive class when selecting the probability representing the instance hardness. If None, the positive class is automatically inferred from the estimator as estimator.classes_[1].
Methods:
def split(self, X, y, groups=None) -> Generator[tuple[ndarray, ndarray], None, None]

Generate indices to split data into training and test set.
Parameters:
X (array-like of shape (n_samples, n_features)): Training data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)): The target variable for supervised learning problems.
groups (object): Always ignored, exists for compatibility.
Yields:
train (ndarray): The training set indices for that split.
test (ndarray): The testing set indices for that split.

def get_n_splits(self, X=None, y=None, groups=None) -> int

Returns the number of splitting iterations in the cross-validator.
Parameters:
X (object): Always ignored, exists for compatibility.
y (object): Always ignored, exists for compatibility.
groups (object): Always ignored, exists for compatibility.
Returns:
n_splits (int): The number of splitting iterations in the cross-validator.

Instance Hardness Concept:
The instance hardness is internally estimated using the provided estimator and stratified cross-validation. Samples with higher instance hardness (those that are harder to classify correctly) are distributed more evenly across folds to ensure each fold contains a representative mix of easy and difficult samples.
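The estimate described above can be reproduced with scikit-learn alone. This sketch (dataset and estimator choices are illustrative, not the library's internals) computes hardness as one minus the out-of-fold probability assigned to each sample's true class:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(weights=[0.9, 0.1], n_samples=500, random_state=0)

# Out-of-fold class probabilities from a stratified split
proba = cross_val_predict(
    LogisticRegression(), X, y,
    cv=StratifiedKFold(n_splits=5), method="predict_proba",
)

# Hardness: 1 - probability assigned to each sample's true class.
# Hard samples (misclassified or near the boundary) get values close to 1.
hardness = 1 - proba[np.arange(len(y)), y]
print(hardness.shape)                             # (500,)
print(hardness.min() >= 0, hardness.max() <= 1)   # True True
```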
Algorithm:
1. Obtain out-of-fold class probabilities for every sample by running the provided estimator through internal stratified cross-validation (the estimator must implement predict_proba).
2. Compute each sample's instance hardness from the probability of the class selected via pos_label.
3. Assign samples to folds so that hard samples are spread evenly, while keeping each fold's class distribution close to the overall distribution.
Example:
from imblearn.model_selection import InstanceHardnessCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
# Create imbalanced dataset
X, y = make_classification(
    weights=[0.9, 0.1],
    class_sep=2,
    n_informative=3,
    n_redundant=1,
    flip_y=0.05,
    n_samples=1000,
    random_state=10
)
# Create instance hardness CV
estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(estimator)
# Use in cross-validation
cv_result = cross_validate(estimator, X, y, cv=ih_cv)
print(f"Standard deviation of test_scores: {cv_result['test_score'].std():.3f}")
# Manual splitting
for train_idx, test_idx in ih_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate the model here

Cross-validation Functions:
from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV
from imblearn.model_selection import InstanceHardnessCV
# Use with cross_val_score
scores = cross_val_score(estimator, X, y, cv=InstanceHardnessCV(estimator))
# Use with cross_validate
cv_results = cross_validate(estimator, X, y, cv=InstanceHardnessCV(estimator))
# Use with GridSearchCV (param_grid shown here for a LogisticRegression estimator)
param_grid = {"C": [0.1, 1.0, 10.0]}
grid_search = GridSearchCV(
    estimator,
    param_grid,
    cv=InstanceHardnessCV(estimator)
)

Pipeline Integration:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
# Create pipeline with sampling
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])
# Use instance hardness CV for evaluation
ih_cv = InstanceHardnessCV(LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=ih_cv)

Standard StratifiedKFold:
- Preserves the overall class proportions in every fold.
- Hard-to-classify samples can end up concentrated in a few folds, inflating the fold-to-fold variance of test scores.
InstanceHardnessCV:
- Additionally distributes hard samples evenly, so each fold contains a representative mix of easy and difficult samples.
- Typically yields more stable test scores across folds on imbalanced data.
When to Use:
- Evaluating classifiers on imbalanced datasets where test scores vary widely between folds.
- Comparing models or resampling pipelines where a stable estimate of generalization performance matters.
Limitations:
- The estimator used to measure hardness must implement predict_proba.
- Estimating instance hardness requires an extra internal cross-validation pass, which adds computation.
- The hardness estimates depend on the choice of estimator.

Complete Example:
from imblearn.model_selection import InstanceHardnessCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.datasets import make_classification
# Create imbalanced dataset
X, y = make_classification(
    n_classes=2,
    weights=[0.8, 0.2],
    n_samples=1000,
    random_state=42
)
# Create pipeline
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])
# Use instance hardness CV
base_estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(base_estimator, n_splits=5)
# Evaluate model
cv_results = cross_validate(
    pipeline, X, y,
    cv=ih_cv,
    scoring=['accuracy', 'f1', 'roc_auc'],
    return_train_score=True
)
print(f"Test accuracy: {cv_results['test_accuracy'].mean():.3f} ± {cv_results['test_accuracy'].std():.3f}")

Install with Tessl CLI
npx tessl i tessl/pypi-imbalanced-learn