Toolbox for imbalanced dataset in machine learning
—
Combination methods in imbalanced-learn provide a powerful approach to handling imbalanced datasets by sequentially applying both over-sampling and under-sampling techniques. These hybrid methods first generate synthetic samples to balance the dataset, then remove noisy or problematic samples to improve data quality.
Combination methods work by:

1. Over-sampling the minority class(es) with SMOTE to balance the class distribution.
2. Cleaning the augmented dataset with an under-sampling technique (Edited Nearest Neighbours or Tomek links removal) to discard noisy or ambiguous samples.
This two-step approach aims to achieve both balanced class distribution and improved data quality, potentially leading to better classifier performance than using either technique alone.
The imblearn.combine module provides two main combination methods:

- SMOTEENN: over-sampling with SMOTE followed by cleaning with Edited Nearest Neighbours
- SMOTETomek: over-sampling with SMOTE followed by removal of Tomek links
Both methods follow the same general pattern: apply SMOTE first to generate synthetic samples, then apply a cleaning technique to remove noisy samples from the augmented dataset.
class SMOTEENN(
*,
sampling_strategy="auto",
random_state=None,
smote=None,
enn=None,
n_jobs=None
)

Over-sampling using SMOTE and cleaning using Edited Nearest Neighbours.
This method combines the SMOTE over-sampling technique with Edited Nearest Neighbours (ENN) cleaning. It first applies SMOTE to generate synthetic samples for minority classes, then uses ENN to remove noisy samples from the resulting dataset.
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.
Warning: float is only available for binary classification. An error is raised for multi-class classification.
When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
- 'minority': resample only the minority class
- 'not minority': resample all classes but the minority class
- 'not majority': resample all classes but the majority class
- 'all': resample all classes
- 'auto': equivalent to 'not majority'

When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.
When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
random_state : int, RandomState instance, default=None
Control the randomization of the algorithm.
- If int, random_state is the seed used by the random number generator.
- If RandomState instance, random_state is the random number generator.
- If None, the random number generator is the RandomState instance used by np.random.

smote : sampler object, default=None
The SMOTE object to use. If not given, a SMOTE object with default parameters will be used.
enn : sampler object, default=None
The EditedNearestNeighbours object to use. If not given, an EditedNearestNeighbours object with sampling_strategy='all' will be used.
n_jobs : int, default=None
Number of CPU cores used during the nearest-neighbours computations. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
sampling_strategy_ : dict
Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.
smote_ : sampler object
The validated SMOTE instance.
enn_ : sampler object
The validated EditedNearestNeighbours instance.
n_features_in_ : int
Number of features in the input dataset.
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
def fit_resample(X, y, **params)

Resample the dataset.
Parameters:
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Matrix containing the data which have to be sampled.
y : array-like of shape (n_samples,)
Corresponding label for each sample in X.
**params : dict
Extra parameters to be used by the sampler.
Returns:
X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of X_resampled.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN
# Create an imbalanced dataset
X, y = make_classification(
n_classes=2,
class_sep=2,
weights=[0.1, 0.9],
n_informative=3,
n_redundant=1,
flip_y=0,
n_features=20,
n_clusters_per_class=1,
n_samples=1000,
random_state=10
)
print('Original dataset shape:', Counter(y))
# Original dataset shape: Counter({1: 900, 0: 100})
# Apply SMOTEENN
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print('Resampled dataset shape:', Counter(y_res))
# Resampled dataset shape: Counter({0: 900, 1: 881})
# Using custom SMOTE and ENN parameters
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
custom_smote = SMOTE(k_neighbors=3, random_state=42)
custom_enn = EditedNearestNeighbours(n_neighbors=5, kind_sel='mode')
sme_custom = SMOTEENN(
smote=custom_smote,
enn=custom_enn,
random_state=42
)
X_res_custom, y_res_custom = sme_custom.fit_resample(X, y)

class SMOTETomek(
*,
sampling_strategy="auto",
random_state=None,
smote=None,
tomek=None,
n_jobs=None
)

Over-sampling using SMOTE and cleaning using Tomek links.
This method combines the SMOTE over-sampling technique with Tomek links removal. It first applies SMOTE to generate synthetic samples for minority classes, then removes Tomek links (pairs of nearest neighbors from different classes) from the resulting dataset.
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.
Warning: float is only available for binary classification. An error is raised for multi-class classification.
When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
- 'minority': resample only the minority class
- 'not minority': resample all classes but the minority class
- 'not majority': resample all classes but the majority class
- 'all': resample all classes
- 'auto': equivalent to 'not majority'

When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.
When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
random_state : int, RandomState instance, default=None
Control the randomization of the algorithm.
- If int, random_state is the seed used by the random number generator.
- If RandomState instance, random_state is the random number generator.
- If None, the random number generator is the RandomState instance used by np.random.

smote : sampler object, default=None
The SMOTE object to use. If not given, a SMOTE object with default parameters will be used.
tomek : sampler object, default=None
The TomekLinks object to use. If not given, a TomekLinks object with sampling_strategy='all' will be used.
n_jobs : int, default=None
Number of CPU cores used during the nearest-neighbours computations. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
sampling_strategy_ : dict
Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.
smote_ : sampler object
The validated SMOTE instance.
tomek_ : sampler object
The validated TomekLinks instance.
n_features_in_ : int
Number of features in the input dataset.
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
def fit_resample(X, y, **params)

Resample the dataset.
Parameters:
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Matrix containing the data which have to be sampled.
y : array-like of shape (n_samples,)
Corresponding label for each sample in X.
**params : dict
Extra parameters to be used by the sampler.
Returns:
X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of X_resampled.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek
# Create an imbalanced dataset
X, y = make_classification(
n_classes=2,
class_sep=2,
weights=[0.1, 0.9],
n_informative=3,
n_redundant=1,
flip_y=0,
n_features=20,
n_clusters_per_class=1,
n_samples=1000,
random_state=10
)
print('Original dataset shape:', Counter(y))
# Original dataset shape: Counter({1: 900, 0: 100})
# Apply SMOTETomek
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)
print('Resampled dataset shape:', Counter(y_res))
# Resampled dataset shape: Counter({0: 900, 1: 900})
# Using custom SMOTE and Tomek parameters
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
custom_smote = SMOTE(k_neighbors=5, random_state=42)
custom_tomek = TomekLinks(sampling_strategy='majority')
smt_custom = SMOTETomek(
smote=custom_smote,
tomek=custom_tomek,
random_state=42
)
X_res_custom, y_res_custom = smt_custom.fit_resample(X, y)

| Aspect | SMOTEENN | SMOTETomek |
|---|---|---|
| Cleaning Method | Edited Nearest Neighbours | Tomek Links |
| Cleaning Aggressiveness | More aggressive | Less aggressive |
| Typical Sample Reduction | Higher | Lower |
| Focus | Removes misclassified samples | Removes boundary ambiguous samples |
| Best Use Case | Noisy datasets | Clean decision boundaries |
Use SMOTEENN when:

- the dataset is noisy and benefits from aggressive cleaning;
- you can accept a larger reduction in sample count in exchange for cleaner class regions.

Use SMOTETomek when:

- you want to keep as many samples as possible and only remove ambiguous boundary pairs;
- the data is relatively clean and you mainly need sharper decision boundaries.
Both methods follow the same general workflow:

1. Apply SMOTE to over-sample the minority class(es) toward the target distribution.
2. Apply the cleaning step (ENN or Tomek links removal) to the augmented dataset.
3. Return the resampled feature matrix and labels.
This sequential approach ensures that the benefits of both techniques are realized: balanced class distribution from SMOTE and improved data quality from the cleaning step.
Install with Tessl CLI
npx tessl i tessl/pypi-imbalanced-learn