Toolbox for imbalanced dataset in machine learning
—
Combination methods in imbalanced-learn provide a powerful approach to handling imbalanced datasets by sequentially applying both over-sampling and under-sampling techniques. These hybrid methods first generate synthetic samples to balance the dataset, then remove noisy or problematic samples to improve data quality.
Combination methods work by:

1. Over-sampling the minority class(es) with SMOTE to balance the class distribution.
2. Cleaning the augmented dataset with an under-sampling technique (Edited Nearest Neighbours or Tomek links removal) to discard noisy or ambiguous samples.
This two-step approach aims to achieve both balanced class distribution and improved data quality, potentially leading to better classifier performance than using either technique alone.
The imblearn.combine module provides two main combination methods:

- SMOTEENN: over-sampling with SMOTE followed by cleaning with Edited Nearest Neighbours
- SMOTETomek: over-sampling with SMOTE followed by removal of Tomek links
Both methods follow the same general pattern: apply SMOTE first to generate synthetic samples, then apply a cleaning technique to remove noisy samples from the augmented dataset.
class SMOTEENN(
*,
sampling_strategy="auto",
random_state=None,
smote=None,
enn=None,
n_jobs=None
)

Over-sampling using SMOTE and cleaning using Edited Nearest Neighbours.
This method combines the SMOTE over-sampling technique with Edited Nearest Neighbours (ENN) cleaning. It first applies SMOTE to generate synthetic samples for minority classes, then uses ENN to remove noisy samples from the resulting dataset.
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.
Warning: float is only available for binary classification. An error is raised for multi-class classification.
When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
- 'minority': resample only the minority class
- 'not minority': resample all classes but the minority class
- 'not majority': resample all classes but the majority class
- 'all': resample all classes
- 'auto': equivalent to 'not majority'

When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.
When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
random_state : int, RandomState instance, default=None
Control the randomization of the algorithm.
- If int, random_state is the seed used by the random number generator.
- If RandomState instance, random_state is the random number generator.
- If None, the random number generator is the RandomState instance used by np.random.

smote : sampler object, default=None
The SMOTE object to use. If not given, a SMOTE object with default parameters will be used.
enn : sampler object, default=None
The EditedNearestNeighbours object to use. If not given, an EditedNearestNeighbours object with sampling_strategy='all' will be used.
n_jobs : int, default=None
Number of CPU cores used during the nearest-neighbours computations. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
sampling_strategy_ : dict
Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.
smote_ : sampler object
The validated SMOTE instance.
enn_ : sampler object
The validated EditedNearestNeighbours instance.
n_features_in_ : int
Number of features in the input dataset.
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
def fit_resample(X, y, **params)

Resample the dataset.
Parameters:
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Matrix containing the data which have to be sampled.
y : array-like of shape (n_samples,)
Corresponding label for each sample in X.
**params : dict
Extra parameters to be used by the sampler.
Returns:
X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of X_resampled.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN
# Create an imbalanced dataset
X, y = make_classification(
n_classes=2,
class_sep=2,
weights=[0.1, 0.9],
n_informative=3,
n_redundant=1,
flip_y=0,
n_features=20,
n_clusters_per_class=1,
n_samples=1000,
random_state=10
)
print('Original dataset shape:', Counter(y))
# Original dataset shape: Counter({1: 900, 0: 100})
# Apply SMOTEENN
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print('Resampled dataset shape:', Counter(y_res))
# Resampled dataset shape: Counter({0: 900, 1: 881})
# Using custom SMOTE and ENN parameters
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
custom_smote = SMOTE(k_neighbors=3, random_state=42)
custom_enn = EditedNearestNeighbours(n_neighbors=5, kind_sel='mode')
sme_custom = SMOTEENN(
smote=custom_smote,
enn=custom_enn,
random_state=42
)
X_res_custom, y_res_custom = sme_custom.fit_resample(X, y)

class SMOTETomek(
*,
sampling_strategy="auto",
random_state=None,
smote=None,
tomek=None,
n_jobs=None
)

Over-sampling using SMOTE and cleaning using Tomek links.
This method combines the SMOTE over-sampling technique with Tomek links removal. It first applies SMOTE to generate synthetic samples for minority classes, then removes Tomek links (pairs of nearest neighbors from different classes) from the resulting dataset.
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.
Warning: float is only available for binary classification. An error is raised for multi-class classification.
When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
- 'minority': resample only the minority class
- 'not minority': resample all classes but the minority class
- 'not majority': resample all classes but the majority class
- 'all': resample all classes
- 'auto': equivalent to 'not majority'

When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.
When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
random_state : int, RandomState instance, default=None
Control the randomization of the algorithm.
- If int, random_state is the seed used by the random number generator.
- If RandomState instance, random_state is the random number generator.
- If None, the random number generator is the RandomState instance used by np.random.

smote : sampler object, default=None
The SMOTE object to use. If not given, a SMOTE object with default parameters will be used.
tomek : sampler object, default=None
The TomekLinks object to use. If not given, a TomekLinks object with sampling_strategy='all' will be used.
n_jobs : int, default=None
Number of CPU cores used during the nearest-neighbours computations. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
sampling_strategy_ : dict
Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.
smote_ : sampler object
The validated SMOTE instance.
tomek_ : sampler object
The validated TomekLinks instance.
n_features_in_ : int
Number of features in the input dataset.
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
def fit_resample(X, y, **params)

Resample the dataset.
Parameters:
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Matrix containing the data which have to be sampled.
y : array-like of shape (n_samples,)
Corresponding label for each sample in X.
**params : dict
Extra parameters to be used by the sampler.
Returns:
X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of X_resampled.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek
# Create an imbalanced dataset
X, y = make_classification(
n_classes=2,
class_sep=2,
weights=[0.1, 0.9],
n_informative=3,
n_redundant=1,
flip_y=0,
n_features=20,
n_clusters_per_class=1,
n_samples=1000,
random_state=10
)
print('Original dataset shape:', Counter(y))
# Original dataset shape: Counter({1: 900, 0: 100})
# Apply SMOTETomek
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)
print('Resampled dataset shape:', Counter(y_res))
# Resampled dataset shape: Counter({0: 900, 1: 900})
# Using custom SMOTE and Tomek parameters
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
custom_smote = SMOTE(k_neighbors=5, random_state=42)
custom_tomek = TomekLinks(sampling_strategy='majority')
smt_custom = SMOTETomek(
smote=custom_smote,
tomek=custom_tomek,
random_state=42
)
X_res_custom, y_res_custom = smt_custom.fit_resample(X, y)

| Aspect | SMOTEENN | SMOTETomek |
|---|---|---|
| Cleaning Method | Edited Nearest Neighbours | Tomek Links |
| Cleaning Aggressiveness | More aggressive | Less aggressive |
| Typical Sample Reduction | Higher | Lower |
| Focus | Removes misclassified samples | Removes boundary ambiguous samples |
| Best Use Case | Noisy datasets | Clean decision boundaries |
Use SMOTEENN when:

- the dataset is noisy and benefits from aggressive cleaning;
- you can accept a larger reduction in sample count in exchange for cleaner class regions.

Use SMOTETomek when:

- you want to keep as many samples as possible and only remove ambiguous boundary pairs;
- the data is relatively clean and you mainly need sharper decision boundaries.
Both methods follow the same general workflow:

1. Apply SMOTE to over-sample the minority class(es) toward the target distribution.
2. Apply the cleaning step (ENN or Tomek links removal) to the augmented dataset.
3. Return the resampled feature matrix and labels.
This sequential approach ensures that the benefits of both techniques are realized: balanced class distribution from SMOTE and improved data quality from the cleaning step.
Install with Tessl CLI
npx tessl i tessl/pypi-imbalanced-learn