tessl/pypi-imbalanced-learn

Toolbox for imbalanced datasets in machine learning

Ensemble Methods for Imbalanced Learning

Overview

Ensemble methods combine multiple base learners to improve classification performance beyond what individual models can achieve. However, traditional ensemble methods often struggle with imbalanced datasets where minority classes are underrepresented. The imbalanced-learn library provides specialized ensemble classifiers that integrate resampling techniques directly into the ensemble learning process.

These ensemble methods address class imbalance by applying resampling strategies during training, ensuring that each base learner in the ensemble receives balanced training data. This typically improves performance on minority classes while maintaining overall classification accuracy.

The ensemble module includes four main approaches:

  • BalancedBaggingClassifier: Applies random under-sampling to each bootstrap sample in bagging
  • BalancedRandomForestClassifier: Integrates random under-sampling into random forest construction
  • EasyEnsembleClassifier: Combines multiple balanced AdaBoost classifiers
  • RUSBoostClassifier: Integrates random under-sampling directly into the AdaBoost algorithm

BalancedBaggingClassifier

A bagging classifier with additional balancing that applies resampling to each bootstrap sample before training base estimators.

class BalancedBaggingClassifier(BaggingClassifier):
    def __init__(
        self,
        estimator=None,
        n_estimators=10,
        *,
        max_samples=1.0,
        max_features=1.0,
        bootstrap=True,
        bootstrap_features=False,
        oob_score=False,
        warm_start=False,
        sampling_strategy="auto",
        replacement=False,
        n_jobs=None,
        random_state=None,
        verbose=0,
        sampler=None,
    )

Parameters

  • estimator : estimator object, default=None
    • The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a DecisionTreeClassifier
  • n_estimators : int, default=10
    • The number of base estimators in the ensemble
  • max_samples : int or float, default=1.0
    • The number of samples to draw from X to train each base estimator
  • max_features : int or float, default=1.0
    • The number of features to draw from X to train each base estimator
  • bootstrap : bool, default=True
    • Whether samples are drawn with replacement (applied after resampling)
  • bootstrap_features : bool, default=False
    • Whether features are drawn with replacement
  • oob_score : bool, default=False
    • Whether to use out-of-bag samples to estimate generalization error
  • warm_start : bool, default=False
    • When set to True, reuse the solution of the previous call to fit
  • sampling_strategy : float, str, dict, callable, default="auto"
    • Sampling information to resample the dataset
  • replacement : bool, default=False
    • Whether to sample randomly with replacement when using RandomUnderSampler
  • n_jobs : int, default=None
    • The number of jobs to run in parallel for both fit and predict
  • random_state : int or RandomState, default=None
    • Controls the random seed given to each base estimator
  • verbose : int, default=0
    • Controls the verbosity of the building process
  • sampler : sampler object, default=None
    • The sampler used to balance the dataset before bootstrapping. By default, RandomUnderSampler is used

Methods

def fit(self, X, y):
    """Build a Bagging ensemble of estimators from the training set (X, y).
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)
        
    Returns
    -------
    self : object
        Fitted estimator
    """

def predict(self, X):
    """Predict class for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities of the input samples
    """

Attributes

  • estimator_ : estimator - The base estimator from which the ensemble is grown
  • estimators_ : list of estimators - The collection of fitted base estimators
  • sampler_ : sampler object - The validated sampler created from the sampler parameter
  • estimators_samples_ : list of ndarray - The subset of drawn samples for each base estimator
  • estimators_features_ : list of ndarray - The subset of drawn features for each base estimator
  • classes_ : ndarray - The class labels
  • n_classes_ : int or list - The number of classes
  • oob_score_ : float - Score using out-of-bag estimate (if oob_score=True)

Example Usage

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9], 
    n_informative=3, n_redundant=1, n_features=20, 
    n_clusters_per_class=1, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train balanced bagging classifier
bbc = BalancedBaggingClassifier(n_estimators=10, random_state=42)
bbc.fit(X_train, y_train)

# Make predictions
y_pred = bbc.predict(X_test)
y_proba = bbc.predict_proba(X_test)

BalancedRandomForestClassifier

A balanced random forest classifier that applies random under-sampling to balance each bootstrap sample during forest construction.

class BalancedRandomForestClassifier(RandomForestClassifier):
    def __init__(
        self,
        n_estimators=100,
        *,
        criterion="gini",
        max_depth=None,
        min_samples_split=2,
        min_samples_leaf=1,
        min_weight_fraction_leaf=0.0,
        max_features="sqrt",
        max_leaf_nodes=None,
        min_impurity_decrease=0.0,
        bootstrap=False,
        oob_score=False,
        sampling_strategy="all",
        replacement=True,
        n_jobs=None,
        random_state=None,
        verbose=0,
        warm_start=False,
        class_weight=None,
        ccp_alpha=0.0,
        max_samples=None,
        monotonic_cst=None,
    )

Parameters

  • n_estimators : int, default=100
    • The number of trees in the forest
  • criterion : {"gini", "entropy"}, default="gini"
    • The function to measure the quality of a split
  • max_depth : int, default=None
    • The maximum depth of the tree
  • min_samples_split : int or float, default=2
    • The minimum number of samples required to split an internal node
  • min_samples_leaf : int or float, default=1
    • The minimum number of samples required to be at a leaf node
  • min_weight_fraction_leaf : float, default=0.0
    • The minimum weighted fraction of the sum total of weights required to be at a leaf node
  • max_features : {"sqrt", "log2", None}, int or float, default="sqrt"
    • The number of features to consider when looking for the best split
  • max_leaf_nodes : int, default=None
    • Grow trees with max_leaf_nodes in best-first fashion
  • min_impurity_decrease : float, default=0.0
    • A node will be split if this split induces a decrease of impurity greater than or equal to this value
  • bootstrap : bool, default=False
    • Whether bootstrap samples are used when building trees (applied after resampling)
  • oob_score : bool, default=False
    • Whether to use out-of-bag samples to estimate generalization accuracy
  • sampling_strategy : float, str, dict, callable, default="all"
    • Sampling information to resample the dataset
  • replacement : bool, default=True
    • Whether to sample randomly with replacement or not
  • n_jobs : int, default=None
    • The number of jobs to run in parallel
  • random_state : int or RandomState, default=None
    • Controls both the randomness of the bootstrap and feature sampling
  • verbose : int, default=0
    • Controls the verbosity of the tree building process
  • warm_start : bool, default=False
    • When set to True, reuse the solution of the previous call to fit
  • class_weight : dict, list of dicts, {"balanced", "balanced_subsample"}, default=None
    • Weights associated with classes
  • ccp_alpha : non-negative float, default=0.0
    • Complexity parameter used for Minimal Cost-Complexity Pruning
  • max_samples : int or float, default=None
    • The number of samples to draw from X to train each base estimator
  • monotonic_cst : array-like of int, default=None
    • Indicates the monotonicity constraint to enforce on each feature

Methods

def fit(self, X, y, sample_weight=None):
    """Build a forest of trees from the training set (X, y).
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,) or (n_samples, n_outputs)
        The target values (class labels)
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights
        
    Returns
    -------
    self : object
        The fitted instance
    """

def predict(self, X):
    """Predict class for samples in X.
    
    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.
    
    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """

Attributes

  • estimator_ : DecisionTreeClassifier - The child estimator template used to create the collection
  • estimators_ : list of DecisionTreeClassifier - The collection of fitted sub-estimators
  • base_sampler_ : RandomUnderSampler - The base sampler used to construct subsequent samplers
  • samplers_ : list of RandomUnderSampler - The collection of fitted samplers
  • pipelines_ : list of Pipeline - The collection of fitted pipelines (samplers + trees)
  • classes_ : ndarray - The class labels
  • n_classes_ : int or list - The number of classes
  • feature_importances_ : ndarray - The feature importances
  • oob_score_ : float - Score using out-of-bag estimate (if oob_score=True)

Example Usage

from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=4, 
    weights=[0.2, 0.3, 0.5], random_state=0
)

# Train balanced random forest
brf = BalancedRandomForestClassifier(
    n_estimators=10,
    sampling_strategy="all", 
    replacement=True,
    max_depth=2, 
    random_state=0,
    bootstrap=False
)
brf.fit(X, y)

# Make predictions
y_pred = brf.predict(X)
feature_importances = brf.feature_importances_

EasyEnsembleClassifier

Bag of balanced boosted learners, also known as EasyEnsemble. This classifier is an ensemble of AdaBoost learners, each trained on a balanced bootstrap sample obtained by random under-sampling.

class EasyEnsembleClassifier(BaggingClassifier):
    def __init__(
        self,
        n_estimators=10,
        estimator=None,
        *,
        warm_start=False,
        sampling_strategy="auto",
        replacement=False,
        n_jobs=None,
        random_state=None,
        verbose=0,
    )

Parameters

  • n_estimators : int, default=10
    • Number of AdaBoost learners in the ensemble
  • estimator : estimator object, default=AdaBoostClassifier()
    • The base AdaBoost classifier used in the inner ensemble. You can set the number of inner learners by passing your own instance
  • warm_start : bool, default=False
    • When set to True, reuse the solution of the previous call to fit
  • sampling_strategy : float, str, dict, callable, default="auto"
    • Sampling information to resample the dataset
  • replacement : bool, default=False
    • Whether to sample randomly with replacement or not
  • n_jobs : int, default=None
    • The number of jobs to run in parallel for both fit and predict
  • random_state : int or RandomState, default=None
    • Controls the random seed given to each base estimator
  • verbose : int, default=0
    • Controls the verbosity of the building process

Methods

def fit(self, X, y):
    """Build a Bagging ensemble of estimators from the training set (X, y).
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)
        
    Returns
    -------
    self : object
        Fitted estimator
    """

def predict(self, X):
    """Predict class for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """

Attributes

  • estimator_ : estimator - The base estimator from which the ensemble is grown
  • estimators_ : list of estimators - The collection of fitted base estimators
  • estimators_samples_ : list of arrays - The subset of drawn samples for each base estimator
  • estimators_features_ : list of arrays - The subset of drawn features for each base estimator
  • classes_ : ndarray - The class labels
  • n_classes_ : int or list - The number of classes

Example Usage

from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9], 
    n_informative=3, n_redundant=1, n_features=20, 
    n_clusters_per_class=1, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create a custom AdaBoost estimator; algorithm="SAMME" targets older
# scikit-learn releases (recent ones use SAMME unconditionally and
# deprecate the parameter, so omit it there)
ada_estimator = AdaBoostClassifier(n_estimators=10, algorithm="SAMME")

# Train EasyEnsemble classifier
eec = EasyEnsembleClassifier(
    n_estimators=10, 
    estimator=ada_estimator,
    random_state=42
)
eec.fit(X_train, y_train)

# Make predictions
y_pred = eec.predict(X_test)
y_proba = eec.predict_proba(X_test)

RUSBoostClassifier

Random under-sampling integrated into the learning of AdaBoost. During learning, class imbalance is alleviated by randomly under-sampling the dataset at each iteration of the boosting algorithm.

class RUSBoostClassifier(AdaBoostClassifier):
    def __init__(
        self,
        estimator=None,
        *,
        n_estimators=50,
        learning_rate=1.0,
        algorithm="deprecated",
        sampling_strategy="auto",
        replacement=False,
        random_state=None,
    )

Parameters

  • estimator : estimator object, default=None
    • The base estimator from which the boosted ensemble is built. If None, then DecisionTreeClassifier(max_depth=1)
  • n_estimators : int, default=50
    • The maximum number of estimators at which boosting is terminated
  • learning_rate : float, default=1.0
    • Learning rate shrinks the contribution of each classifier
  • algorithm : str, default="deprecated"
    • Deprecated; retained only for backward compatibility with older scikit-learn releases, where "SAMME" (discrete boosting) or "SAMME.R" (real boosting) could be selected. Leave at the default on recent versions
  • sampling_strategy : float, str, dict, callable, default="auto"
    • Sampling information to resample the dataset
  • replacement : bool, default=False
    • Whether to sample randomly with replacement or not
  • random_state : int or RandomState, default=None
    • Controls the random seed given to each base estimator

Methods

def fit(self, X, y, sample_weight=None):
    """Build a boosted classifier from the training set (X, y).
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights
        
    Returns
    -------
    self : object
        Returns self
    """

def predict(self, X):
    """Predict classes for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples
        
    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """

Attributes

  • estimator_ : estimator - The base estimator from which the ensemble is grown
  • estimators_ : list of classifiers - The collection of fitted sub-estimators
  • base_sampler_ : RandomUnderSampler - The base sampler used to generate subsequent samplers
  • samplers_ : list of RandomUnderSampler - The collection of fitted samplers
  • pipelines_ : list of Pipeline - The collection of fitted pipelines (samplers + trees)
  • classes_ : ndarray - The class labels
  • n_classes_ : int - The number of classes
  • estimator_weights_ : ndarray - Weights for each estimator in the boosted ensemble
  • estimator_errors_ : ndarray - Classification error for each estimator
  • feature_importances_ : ndarray - The feature importances (if supported by base estimator)

Example Usage

from imblearn.ensemble import RUSBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=4, 
    weights=[0.2, 0.3, 0.5], random_state=0
)

# Use custom base estimator
base_estimator = DecisionTreeClassifier(max_depth=2)

# Train RUSBoost classifier
rusboost = RUSBoostClassifier(
    estimator=base_estimator,
    n_estimators=10,
    learning_rate=1.0,
    sampling_strategy="auto",
    random_state=0
)
rusboost.fit(X, y)

# Make predictions
y_pred = rusboost.predict(X)
y_proba = rusboost.predict_proba(X)

# Access ensemble information
print(f"Estimator weights: {rusboost.estimator_weights_}")
print(f"Estimator errors: {rusboost.estimator_errors_}")

Algorithm Details and Relationships

Relationship to Scikit-learn

All imbalanced-learn ensemble classifiers extend their corresponding scikit-learn base classes:

  • BalancedBaggingClassifier extends sklearn.ensemble.BaggingClassifier
  • BalancedRandomForestClassifier extends sklearn.ensemble.RandomForestClassifier
  • EasyEnsembleClassifier extends sklearn.ensemble.BaggingClassifier
  • RUSBoostClassifier extends sklearn.ensemble.AdaBoostClassifier

This inheritance ensures compatibility with scikit-learn's API while adding resampling capabilities.

Resampling Integration

Each ensemble method integrates resampling differently:

  1. Bagging approaches (BalancedBaggingClassifier, EasyEnsembleClassifier) apply resampling to each bootstrap sample before training individual estimators

  2. Random Forest (BalancedRandomForestClassifier) applies resampling before constructing each tree, then optionally applies additional bootstrapping

  3. Boosting (RUSBoostClassifier) applies resampling at each boosting iteration, ensuring balanced training data throughout the adaptive process

Performance Considerations

  • BalancedRandomForestClassifier typically provides the best balance of performance and training speed
  • RUSBoostClassifier can be more sensitive to noise but often performs well on structured data
  • EasyEnsembleClassifier provides good performance but requires more computational resources
  • BalancedBaggingClassifier offers the most flexibility in base estimator selection

Best Practices

  1. Start with BalancedRandomForestClassifier for most imbalanced classification tasks
  2. Use sampling_strategy="all" with replacement=True for BalancedRandomForestClassifier to follow the original algorithm
  3. Consider RUSBoostClassifier for problems where boosting has shown advantages
  4. Tune n_estimators based on dataset size and computational constraints
  5. Use cross-validation with appropriate metrics (balanced accuracy, F1-score, geometric mean) for model selection

Integration with Pipelines

All ensemble classifiers can be used within scikit-learn pipelines:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.ensemble import BalancedRandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', BalancedRandomForestClassifier(random_state=42))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

This modular design enables easy integration into existing machine learning workflows while providing the benefits of balanced ensemble learning for imbalanced datasets.

Install with Tessl CLI

npx tessl i tessl/pypi-imbalanced-learn
