Combination Methods

Combination methods in imbalanced-learn provide a powerful approach to handling imbalanced datasets by sequentially applying both over-sampling and under-sampling techniques. These hybrid methods first generate synthetic samples to balance the dataset, then remove noisy or problematic samples to improve data quality.

Overview

Combination methods work by:

  1. Over-sampling phase: Generate synthetic samples using techniques like SMOTE to increase minority class representation
  2. Under-sampling phase: Remove noisy, borderline, or problematic samples using cleaning techniques like Edited Nearest Neighbours or Tomek Links removal

This two-step approach aims to achieve both balanced class distribution and improved data quality, potentially leading to better classifier performance than using either technique alone.

Available Methods

The imblearn.combine module provides two main combination methods:

  • SMOTEENN: Combines SMOTE over-sampling with Edited Nearest Neighbours cleaning
  • SMOTETomek: Combines SMOTE over-sampling with Tomek Links removal

Both methods follow the same general pattern: apply SMOTE first to generate synthetic samples, then apply a cleaning technique to remove noisy samples from the augmented dataset.


SMOTEENN

class SMOTEENN(
    *,
    sampling_strategy="auto",
    random_state=None,
    smote=None,
    enn=None,
    n_jobs=None
)

Over-sampling using SMOTE and cleaning using Edited Nearest Neighbours.

This method combines the SMOTE over-sampling technique with Edited Nearest Neighbours (ENN) cleaning. It first applies SMOTE to generate synthetic samples for minority classes, then uses ENN to remove noisy samples from the resulting dataset.

Parameters

  • sampling_strategy : float, str, dict or callable, default='auto'

    Sampling information to resample the data set.

    • When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.

      Warning: float is only available for binary classification. An error is raised for multi-class classification.

    • When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:

      • 'minority': resample only the minority class
      • 'not minority': resample all classes but the minority class
      • 'not majority': resample all classes but the majority class
      • 'all': resample all classes
      • 'auto': equivalent to 'not majority'
    • When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.

    • When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.

  • random_state : int, RandomState instance, default=None

    Control the randomization of the algorithm.

    • If int, random_state is the seed used by the random number generator
    • If RandomState instance, random_state is the random number generator
    • If None, the random number generator is the RandomState instance used by np.random
  • smote : sampler object, default=None

    The SMOTE object to use. If not given, a SMOTE object with default parameters will be used.

  • enn : sampler object, default=None

    The EditedNearestNeighbours object to use. If not given, an EditedNearestNeighbours object with sampling_strategy='all' will be used.

  • n_jobs : int, default=None

    Number of CPU cores used during the cross-validation loop. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

Attributes

  • sampling_strategy_ : dict

    Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.

  • smote_ : sampler object

    The validated SMOTE instance.

  • enn_ : sampler object

    The validated EditedNearestNeighbours instance.

  • n_features_in_ : int

    Number of features in the input dataset.

  • feature_names_in_ : ndarray of shape (n_features_in_,)

    Names of features seen during fit. Defined only when X has feature names that are all strings.

Methods

def fit_resample(X, y, **params)

Resample the dataset.

Parameters:

  • X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

    Matrix containing the data which have to be sampled.

  • y : array-like of shape (n_samples,)

    Corresponding label for each sample in X.

  • **params : dict

    Extra parameters to use by the sampler.

Returns:

  • X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)

    The array containing the resampled data.

  • y_resampled : array-like of shape (n_samples_new,)

    The corresponding label of X_resampled.

Example Usage

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# Create an imbalanced dataset
X, y = make_classification(
    n_classes=2, 
    class_sep=2,
    weights=[0.1, 0.9], 
    n_informative=3, 
    n_redundant=1, 
    flip_y=0,
    n_features=20, 
    n_clusters_per_class=1, 
    n_samples=1000, 
    random_state=10
)

print('Original dataset shape:', Counter(y))
# Original dataset shape: Counter({1: 900, 0: 100})

# Apply SMOTEENN
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)

print('Resampled dataset shape:', Counter(y_res))
# Resampled dataset shape: Counter({0: 900, 1: 881})

# Using custom SMOTE and ENN parameters
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

custom_smote = SMOTE(k_neighbors=3, random_state=42)
custom_enn = EditedNearestNeighbours(n_neighbors=5, kind_sel='mode')

sme_custom = SMOTEENN(
    smote=custom_smote,
    enn=custom_enn,
    random_state=42
)
X_res_custom, y_res_custom = sme_custom.fit_resample(X, y)

Notes

  • The method was first presented in Batista et al. (2004)
  • Supports multi-class resampling following the schemes used by SMOTE and ENN
  • The ENN cleaning step removes samples that are misclassified by their nearest neighbors, which can help remove both noisy samples and borderline cases created by SMOTE
  • The final dataset size is typically smaller than what SMOTE alone would produce due to the cleaning step

SMOTETomek

class SMOTETomek(
    *,
    sampling_strategy="auto",
    random_state=None,
    smote=None,
    tomek=None,
    n_jobs=None
)

Over-sampling using SMOTE and cleaning using Tomek links.

This method combines the SMOTE over-sampling technique with Tomek links removal. It first applies SMOTE to generate synthetic samples for minority classes, then removes Tomek links (pairs of nearest neighbors from different classes) from the resulting dataset.

Parameters

  • sampling_strategy : float, str, dict or callable, default='auto'

    Sampling information to resample the data set.

    • When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.

      Warning: float is only available for binary classification. An error is raised for multi-class classification.

    • When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:

      • 'minority': resample only the minority class
      • 'not minority': resample all classes but the minority class
      • 'not majority': resample all classes but the majority class
      • 'all': resample all classes
      • 'auto': equivalent to 'not majority'
    • When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.

    • When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.

  • random_state : int, RandomState instance, default=None

    Control the randomization of the algorithm.

    • If int, random_state is the seed used by the random number generator
    • If RandomState instance, random_state is the random number generator
    • If None, the random number generator is the RandomState instance used by np.random
  • smote : sampler object, default=None

    The SMOTE object to use. If not given, a SMOTE object with default parameters will be used.

  • tomek : sampler object, default=None

    The TomekLinks object to use. If not given, a TomekLinks object with sampling_strategy='all' will be used.

  • n_jobs : int, default=None

    Number of CPU cores used during the cross-validation loop. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

Attributes

  • sampling_strategy_ : dict

    Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.

  • smote_ : sampler object

    The validated SMOTE instance.

  • tomek_ : sampler object

    The validated TomekLinks instance.

  • n_features_in_ : int

    Number of features in the input dataset.

  • feature_names_in_ : ndarray of shape (n_features_in_,)

    Names of features seen during fit. Defined only when X has feature names that are all strings.

Methods

def fit_resample(X, y, **params)

Resample the dataset.

Parameters:

  • X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

    Matrix containing the data which have to be sampled.

  • y : array-like of shape (n_samples,)

    Corresponding label for each sample in X.

  • **params : dict

    Extra parameters to use by the sampler.

Returns:

  • X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)

    The array containing the resampled data.

  • y_resampled : array-like of shape (n_samples_new,)

    The corresponding label of X_resampled.

Example Usage

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

# Create an imbalanced dataset
X, y = make_classification(
    n_classes=2, 
    class_sep=2,
    weights=[0.1, 0.9], 
    n_informative=3, 
    n_redundant=1, 
    flip_y=0,
    n_features=20, 
    n_clusters_per_class=1, 
    n_samples=1000, 
    random_state=10
)

print('Original dataset shape:', Counter(y))
# Original dataset shape: Counter({1: 900, 0: 100})

# Apply SMOTETomek
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)

print('Resampled dataset shape:', Counter(y_res))
# Resampled dataset shape: Counter({0: 900, 1: 900})

# Using custom SMOTE and Tomek parameters
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

custom_smote = SMOTE(k_neighbors=5, random_state=42)
custom_tomek = TomekLinks(sampling_strategy='majority')

smt_custom = SMOTETomek(
    smote=custom_smote,
    tomek=custom_tomek,
    random_state=42
)
X_res_custom, y_res_custom = smt_custom.fit_resample(X, y)

Notes

  • The method was first presented in Batista et al. (2003)
  • Supports multi-class resampling following the schemes used by SMOTE and TomekLinks
  • Tomek links removal focuses on cleaning the decision boundary by removing ambiguous samples
  • Generally preserves more samples than SMOTEENN since Tomek links removal is less aggressive than ENN

Comparison: SMOTEENN vs SMOTETomek

| Aspect | SMOTEENN | SMOTETomek |
| --- | --- | --- |
| Cleaning Method | Edited Nearest Neighbours | Tomek Links |
| Cleaning Aggressiveness | More aggressive | Less aggressive |
| Typical Sample Reduction | Higher | Lower |
| Focus | Removes misclassified samples | Removes ambiguous boundary samples |
| Best Use Case | Noisy datasets | Clean decision boundaries |

When to Use Each Method

Use SMOTEENN when:

  • Your dataset contains significant noise
  • You want more aggressive cleaning
  • Class boundaries are poorly defined
  • You can afford to lose more samples for better quality

Use SMOTETomek when:

  • Your dataset is relatively clean
  • You want to preserve more samples
  • You need to clean decision boundaries
  • Class overlap is the main issue

Algorithm Workflow

Both methods follow the same general workflow:

  1. Input: Imbalanced dataset (X, y)
  2. SMOTE Phase: Apply SMOTE over-sampling to generate synthetic minority class samples
  3. Cleaning Phase:
    • SMOTEENN: Apply ENN to remove misclassified samples
    • SMOTETomek: Remove Tomek links from the dataset
  4. Output: Balanced and cleaned dataset

This sequential approach ensures that the benefits of both techniques are realized: balanced class distribution from SMOTE and improved data quality from the cleaning step.
