tessl/pypi-imbalanced-learn

Toolbox for imbalanced datasets in machine learning

Ensemble Methods for Imbalanced Learning

Overview

Ensemble methods combine multiple base learners to improve classification performance beyond what individual models can achieve. However, traditional ensemble methods often struggle with imbalanced datasets where minority classes are underrepresented. The imbalanced-learn library provides specialized ensemble classifiers that integrate resampling techniques directly into the ensemble learning process.

These ensemble methods address class imbalance by applying resampling strategies during training, ensuring that each base learner in the ensemble receives balanced training data. This typically improves performance on minority classes while maintaining overall classification accuracy.

The ensemble module includes four main approaches:

  • BalancedBaggingClassifier: Applies random under-sampling to each bootstrap sample in bagging
  • BalancedRandomForestClassifier: Integrates random under-sampling into random forest construction
  • EasyEnsembleClassifier: Combines multiple balanced AdaBoost classifiers
  • RUSBoostClassifier: Integrates random under-sampling directly into the AdaBoost algorithm

BalancedBaggingClassifier

A bagging classifier with additional balancing that applies resampling to each bootstrap sample before training base estimators.

class BalancedBaggingClassifier(BaggingClassifier):
    def __init__(
        self,
        estimator=None,
        n_estimators=10,
        *,
        max_samples=1.0,
        max_features=1.0,
        bootstrap=True,
        bootstrap_features=False,
        oob_score=False,
        warm_start=False,
        sampling_strategy="auto",
        replacement=False,
        n_jobs=None,
        random_state=None,
        verbose=0,
        sampler=None,
    )

Parameters

  • estimator : estimator object, default=None
    • The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a DecisionTreeClassifier
  • n_estimators : int, default=10
    • The number of base estimators in the ensemble
  • max_samples : int or float, default=1.0
    • The number of samples to draw from X to train each base estimator
  • max_features : int or float, default=1.0
    • The number of features to draw from X to train each base estimator
  • bootstrap : bool, default=True
    • Whether samples are drawn with replacement (applied after resampling)
  • bootstrap_features : bool, default=False
    • Whether features are drawn with replacement
  • oob_score : bool, default=False
    • Whether to use out-of-bag samples to estimate generalization error
  • warm_start : bool, default=False
    • When set to True, reuse the solution of the previous call to fit
  • sampling_strategy : float, str, dict, callable, default="auto"
    • Sampling information to resample the dataset
  • replacement : bool, default=False
    • Whether to sample randomly with replacement when using RandomUnderSampler
  • n_jobs : int, default=None
    • The number of jobs to run in parallel for both fit and predict
  • random_state : int or RandomState, default=None
    • Controls the random seed given to each base estimator
  • verbose : int, default=0
    • Controls the verbosity of the building process
  • sampler : sampler object, default=None
    • The sampler used to balance the dataset before bootstrapping. By default, RandomUnderSampler is used

Methods

def fit(self, X, y):
    """Build a Bagging ensemble of estimators from the training set (X, y).
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)
        
    Returns
    -------
    self : object
        Fitted estimator
    """

def predict(self, X):
    """Predict class for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities of the input samples
    """

Attributes

  • estimator_ : estimator - The base estimator from which the ensemble is grown
  • estimators_ : list of estimators - The collection of fitted base estimators
  • sampler_ : sampler object - The validated sampler created from the sampler parameter
  • estimators_samples_ : list of ndarray - The subset of drawn samples for each base estimator
  • estimators_features_ : list of ndarray - The subset of drawn features for each base estimator
  • classes_ : ndarray - The class labels
  • n_classes_ : int or list - The number of classes
  • oob_score_ : float - Score using out-of-bag estimate (if oob_score=True)

Example Usage

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9], 
    n_informative=3, n_redundant=1, n_features=20, 
    n_clusters_per_class=1, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train balanced bagging classifier
bbc = BalancedBaggingClassifier(n_estimators=10, random_state=42)
bbc.fit(X_train, y_train)

# Make predictions
y_pred = bbc.predict(X_test)
y_proba = bbc.predict_proba(X_test)

BalancedRandomForestClassifier

A balanced random forest classifier that applies random under-sampling to balance each bootstrap sample during forest construction.

class BalancedRandomForestClassifier(RandomForestClassifier):
    def __init__(
        self,
        n_estimators=100,
        *,
        criterion="gini",
        max_depth=None,
        min_samples_split=2,
        min_samples_leaf=1,
        min_weight_fraction_leaf=0.0,
        max_features="sqrt",
        max_leaf_nodes=None,
        min_impurity_decrease=0.0,
        bootstrap=False,
        oob_score=False,
        sampling_strategy="all",
        replacement=True,
        n_jobs=None,
        random_state=None,
        verbose=0,
        warm_start=False,
        class_weight=None,
        ccp_alpha=0.0,
        max_samples=None,
        monotonic_cst=None,
    )

Parameters

  • n_estimators : int, default=100
    • The number of trees in the forest
  • criterion : {"gini", "entropy"}, default="gini"
    • The function to measure the quality of a split
  • max_depth : int, default=None
    • The maximum depth of the tree
  • min_samples_split : int or float, default=2
    • The minimum number of samples required to split an internal node
  • min_samples_leaf : int or float, default=1
    • The minimum number of samples required to be at a leaf node
  • min_weight_fraction_leaf : float, default=0.0
    • The minimum weighted fraction of the sum total of weights required to be at a leaf node
  • max_features : {"sqrt", "log2", None}, int or float, default="sqrt"
    • The number of features to consider when looking for the best split
  • max_leaf_nodes : int, default=None
    • Grow trees with max_leaf_nodes in best-first fashion
  • min_impurity_decrease : float, default=0.0
    • A node will be split if this split induces a decrease of impurity greater than or equal to this value
  • bootstrap : bool, default=False
    • Whether bootstrap samples are used when building trees (applied after resampling)
  • oob_score : bool, default=False
    • Whether to use out-of-bag samples to estimate generalization accuracy
  • sampling_strategy : float, str, dict, callable, default="all"
    • Sampling information to resample the dataset
  • replacement : bool, default=True
    • Whether to sample randomly with replacement or not
  • n_jobs : int, default=None
    • The number of jobs to run in parallel
  • random_state : int or RandomState, default=None
    • Controls both the randomness of the bootstrap and feature sampling
  • verbose : int, default=0
    • Controls the verbosity of the tree building process
  • warm_start : bool, default=False
    • When set to True, reuse the solution of the previous call to fit
  • class_weight : dict, list of dicts, {"balanced", "balanced_subsample"}, default=None
    • Weights associated with classes
  • ccp_alpha : non-negative float, default=0.0
    • Complexity parameter used for Minimal Cost-Complexity Pruning
  • max_samples : int or float, default=None
    • The number of samples to draw from X to train each base estimator
  • monotonic_cst : array-like of int, default=None
    • Indicates the monotonicity constraint to enforce on each feature

Methods

def fit(self, X, y, sample_weight=None):
    """Build a forest of trees from the training set (X, y).
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,) or (n_samples, n_outputs)
        The target values (class labels)
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights
        
    Returns
    -------
    self : object
        The fitted instance
    """

def predict(self, X):
    """Predict class for samples in X.
    
    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.
    
    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """

Attributes

  • estimator_ : DecisionTreeClassifier - The child estimator template used to create the collection
  • estimators_ : list of DecisionTreeClassifier - The collection of fitted sub-estimators
  • base_sampler_ : RandomUnderSampler - The base sampler used to construct subsequent samplers
  • samplers_ : list of RandomUnderSampler - The collection of fitted samplers
  • pipelines_ : list of Pipeline - The collection of fitted pipelines (samplers + trees)
  • classes_ : ndarray - The class labels
  • n_classes_ : int or list - The number of classes
  • feature_importances_ : ndarray - The feature importances
  • oob_score_ : float - Score using out-of-bag estimate (if oob_score=True)

Example Usage

from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=4, 
    weights=[0.2, 0.3, 0.5], random_state=0
)

# Train balanced random forest
brf = BalancedRandomForestClassifier(
    n_estimators=10,
    sampling_strategy="all", 
    replacement=True,
    max_depth=2, 
    random_state=0,
    bootstrap=False
)
brf.fit(X, y)

# Make predictions
y_pred = brf.predict(X)
feature_importances = brf.feature_importances_

EasyEnsembleClassifier

Bag of balanced boosted learners, also known as EasyEnsemble. This classifier is an ensemble of AdaBoost learners, each trained on a balanced bootstrap sample obtained by random under-sampling.

class EasyEnsembleClassifier(BaggingClassifier):
    def __init__(
        self,
        n_estimators=10,
        estimator=None,
        *,
        warm_start=False,
        sampling_strategy="auto",
        replacement=False,
        n_jobs=None,
        random_state=None,
        verbose=0,
    )

Parameters

  • n_estimators : int, default=10
    • Number of AdaBoost learners in the ensemble
  • estimator : estimator object, default=AdaBoostClassifier()
    • The base AdaBoost classifier used in the inner ensemble. You can set the number of inner learners by passing your own instance
  • warm_start : bool, default=False
    • When set to True, reuse the solution of the previous call to fit
  • sampling_strategy : float, str, dict, callable, default="auto"
    • Sampling information to resample the dataset
  • replacement : bool, default=False
    • Whether to sample randomly with replacement or not
  • n_jobs : int, default=None
    • The number of jobs to run in parallel for both fit and predict
  • random_state : int or RandomState, default=None
    • Controls the random seed given to each base estimator
  • verbose : int, default=0
    • Controls the verbosity of the building process

Methods

def fit(self, X, y):
    """Build a Bagging ensemble of estimators from the training set (X, y).
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)
        
    Returns
    -------
    self : object
        Fitted estimator
    """

def predict(self, X):
    """Predict class for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """

Attributes

  • estimator_ : estimator - The base estimator from which the ensemble is grown
  • estimators_ : list of estimators - The collection of fitted base estimators
  • estimators_samples_ : list of arrays - The subset of drawn samples for each base estimator
  • estimators_features_ : list of arrays - The subset of drawn features for each base estimator
  • classes_ : ndarray - The class labels
  • n_classes_ : int or list - The number of classes

Example Usage

from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9], 
    n_informative=3, n_redundant=1, n_features=20, 
    n_clusters_per_class=1, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create a custom AdaBoost estimator; algorithm="SAMME" targets older
# scikit-learn releases (recent ones use SAMME unconditionally and
# deprecate the parameter, so omit it there)
ada_estimator = AdaBoostClassifier(n_estimators=10, algorithm="SAMME")

# Train EasyEnsemble classifier
eec = EasyEnsembleClassifier(
    n_estimators=10, 
    estimator=ada_estimator,
    random_state=42
)
eec.fit(X_train, y_train)

# Make predictions
y_pred = eec.predict(X_test)
y_proba = eec.predict_proba(X_test)

RUSBoostClassifier

Random under-sampling integrated into the learning of AdaBoost. During learning, class imbalance is alleviated by randomly under-sampling the dataset at each iteration of the boosting algorithm.

class RUSBoostClassifier(AdaBoostClassifier):
    def __init__(
        self,
        estimator=None,
        *,
        n_estimators=50,
        learning_rate=1.0,
        algorithm="deprecated",
        sampling_strategy="auto",
        replacement=False,
        random_state=None,
    )

Parameters

  • estimator : estimator object, default=None
    • The base estimator from which the boosted ensemble is built. If None, then DecisionTreeClassifier(max_depth=1)
  • n_estimators : int, default=50
    • The maximum number of estimators at which boosting is terminated
  • learning_rate : float, default=1.0
    • Learning rate shrinks the contribution of each classifier
  • algorithm : str, default="deprecated"
    • Deprecated; retained only for backward compatibility with older scikit-learn releases, where "SAMME" (discrete boosting) or "SAMME.R" (real boosting) could be selected. Leave at the default on recent versions
  • sampling_strategy : float, str, dict, callable, default="auto"
    • Sampling information to resample the dataset
  • replacement : bool, default=False
    • Whether to sample randomly with replacement or not
  • random_state : int or RandomState, default=None
    • Controls the random seed given to each base estimator

Methods

def fit(self, X, y, sample_weight=None):
    """Build a boosted classifier from the training set (X, y).
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights
        
    Returns
    -------
    self : object
        Returns self
    """

def predict(self, X):
    """Predict classes for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """

Attributes

  • estimator_ : estimator - The base estimator from which the ensemble is grown
  • estimators_ : list of classifiers - The collection of fitted sub-estimators
  • base_sampler_ : RandomUnderSampler - The base sampler used to generate subsequent samplers
  • samplers_ : list of RandomUnderSampler - The collection of fitted samplers
  • pipelines_ : list of Pipeline - The collection of fitted pipelines (samplers + trees)
  • classes_ : ndarray - The class labels
  • n_classes_ : int - The number of classes
  • estimator_weights_ : ndarray - Weights for each estimator in the boosted ensemble
  • estimator_errors_ : ndarray - Classification error for each estimator
  • feature_importances_ : ndarray - The feature importances (if supported by base estimator)

Example Usage

from imblearn.ensemble import RUSBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=4, 
    weights=[0.2, 0.3, 0.5], random_state=0
)

# Use custom base estimator
base_estimator = DecisionTreeClassifier(max_depth=2)

# Train RUSBoost classifier
rusboost = RUSBoostClassifier(
    estimator=base_estimator,
    n_estimators=10,
    learning_rate=1.0,
    sampling_strategy="auto",
    random_state=0
)
rusboost.fit(X, y)

# Make predictions
y_pred = rusboost.predict(X)
y_proba = rusboost.predict_proba(X)

# Access ensemble information
print(f"Estimator weights: {rusboost.estimator_weights_}")
print(f"Estimator errors: {rusboost.estimator_errors_}")

Algorithm Details and Relationships

Relationship to Scikit-learn

All imbalanced-learn ensemble classifiers extend their corresponding scikit-learn base classes:

  • BalancedBaggingClassifier extends sklearn.ensemble.BaggingClassifier
  • BalancedRandomForestClassifier extends sklearn.ensemble.RandomForestClassifier
  • EasyEnsembleClassifier extends sklearn.ensemble.BaggingClassifier
  • RUSBoostClassifier extends sklearn.ensemble.AdaBoostClassifier

This inheritance ensures compatibility with scikit-learn's API while adding resampling capabilities.

Resampling Integration

Each ensemble method integrates resampling differently:

  1. Bagging approaches (BalancedBaggingClassifier, EasyEnsembleClassifier) apply resampling to each bootstrap sample before training individual estimators

  2. Random Forest (BalancedRandomForestClassifier) applies resampling before constructing each tree, then optionally applies additional bootstrapping

  3. Boosting (RUSBoostClassifier) applies resampling at each boosting iteration, ensuring balanced training data throughout the adaptive process

Performance Considerations

  • BalancedRandomForestClassifier typically provides the best balance of performance and training speed
  • RUSBoostClassifier can be more sensitive to noise but often performs well on structured data
  • EasyEnsembleClassifier provides good performance but requires more computational resources
  • BalancedBaggingClassifier offers the most flexibility in base estimator selection

Best Practices

  1. Start with BalancedRandomForestClassifier for most imbalanced classification tasks
  2. Use sampling_strategy="all" with replacement=True for BalancedRandomForestClassifier to follow the original algorithm
  3. Consider RUSBoostClassifier for problems where boosting has shown advantages
  4. Tune n_estimators based on dataset size and computational constraints
  5. Use cross-validation with appropriate metrics (balanced accuracy, F1-score, geometric mean) for model selection

Integration with Pipelines

All ensemble classifiers can be used within scikit-learn pipelines:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.ensemble import BalancedRandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', BalancedRandomForestClassifier(random_state=42))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

This modular design enables easy integration into existing machine learning workflows while providing the benefits of balanced ensemble learning for imbalanced datasets.

Install with Tessl CLI

npx tessl i tessl/pypi-imbalanced-learn
