Toolbox for imbalanced datasets in machine learning
—
Over-sampling techniques address class imbalance by generating synthetic samples for minority classes. Unlike under-sampling, which removes samples, over-sampling increases the dataset size by creating new instances that follow the distribution patterns of existing minority class samples.
The imbalanced-learn library provides several sophisticated over-sampling algorithms that use different strategies for synthetic sample generation:
All over-sampling methods inherit from the BaseOverSampler class and implement the standard fit_resample(X, y) interface.
Random over-sampling with optional smoothed bootstrap generation.
{ .api }
class RandomOverSampler(BaseOverSampler):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
shrinkage=None,
):
"""
Parameters
----------
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
shrinkage : float or dict, default=None
Parameter controlling the shrinkage applied to the covariance matrix
when a smoothed bootstrap is generated. If None, normal bootstrap
without perturbation. If float, same shrinkage for all classes.
If dict, class-specific shrinkage factors.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""The RandomOverSampler performs basic over-sampling by selecting samples at random with replacement. When shrinkage is specified, it generates smoothed bootstrap samples by adding small perturbations, also known as Random Over-Sampling Examples (ROSE).
Synthetic Minority Over-sampling Technique - the original algorithm for generating synthetic samples.
{ .api }
class SMOTE(BaseSMOTE):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
k_neighbors=5,
):
"""
Parameters
----------
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
k_neighbors : int or object, default=5
The nearest neighbors used to define the neighborhood of samples
for generating synthetic samples. Can be int for number of neighbors
or a fitted neighbors estimator with kneighbors and kneighbors_graph methods.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""SMOTE generates synthetic samples by interpolating between a minority sample and its k nearest neighbors. For each minority sample, it selects one of its k nearest neighbors randomly and creates a synthetic sample somewhere along the line segment between them.
SMOTE for datasets containing both numerical and categorical features.
{ .api }
class SMOTENC(SMOTE):
def __init__(
self,
categorical_features,
*,
categorical_encoder=None,
sampling_strategy="auto",
random_state=None,
k_neighbors=5,
):
"""
Parameters
----------
categorical_features : "auto" or array-like of shape (n_cat_features,) or (n_features,)
Specifies which features are categorical. Can be:
- "auto" to automatically detect from pandas DataFrame with CategoricalDtype
- array of int corresponding to categorical feature indices
- array of str corresponding to feature names (requires pandas DataFrame)
- boolean mask array of shape (n_features,)
categorical_encoder : estimator, default=None
One-hot encoder used to encode categorical features. If None,
uses OneHotEncoder with handle_unknown='ignore'.
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
k_neighbors : int or object, default=5
The nearest neighbors used for generating synthetic samples.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""SMOTENC handles mixed-type datasets by applying standard SMOTE interpolation to numerical features while using mode-based selection for categorical features. Categorical features are encoded with one-hot encoding during processing.
SMOTE variant specifically designed for categorical features only.
{ .api }
class SMOTEN(SMOTE):
def __init__(
self,
categorical_encoder=None,
*,
sampling_strategy="auto",
random_state=None,
k_neighbors=5,
):
"""
Parameters
----------
categorical_encoder : estimator, default=None
Ordinal encoder used to encode categorical features. If None,
uses OrdinalEncoder with default parameters.
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
k_neighbors : int or object, default=5
The nearest neighbors used for generating synthetic samples.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""SMOTEN works exclusively with categorical features and uses the Value Difference Metric (VDM) to compute distances between categorical samples. Synthetic samples are generated by selecting the most frequent category among nearest neighbors for each feature.
SMOTE variant that focuses on samples near class boundaries.
{ .api }
class BorderlineSMOTE(BaseSMOTE):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
k_neighbors=5,
m_neighbors=10,
kind="borderline-1",
):
"""
Parameters
----------
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
k_neighbors : int or object, default=5
The nearest neighbors used for generating synthetic samples.
m_neighbors : int or object, default=10
The nearest neighbors used to determine if a minority sample
is in "danger" (near the boundary).
kind : {"borderline-1", "borderline-2"}, default='borderline-1'
The type of borderline SMOTE algorithm:
- "borderline-1": considers only positive class for neighbor selection
- "borderline-2": considers whole dataset, applies weight adjustments
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""BorderlineSMOTE identifies "danger" samples that are close to the decision boundary (having more majority class neighbors than minority). It generates synthetic samples only from these borderline cases, focusing oversampling where it's most needed.
SVM-based SMOTE that uses support vectors to identify critical samples.
{ .api }
class SVMSMOTE(BaseSMOTE):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
k_neighbors=5,
m_neighbors=10,
svm_estimator=None,
out_step=0.5,
):
"""
Parameters
----------
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
k_neighbors : int or object, default=5
The nearest neighbors used for generating synthetic samples.
m_neighbors : int or object, default=10
The nearest neighbors used to determine sample safety/danger status.
svm_estimator : estimator object, default=SVC()
SVM classifier used to identify support vectors. Must expose
support_ attribute after fitting.
out_step : float, default=0.5
Step size when extrapolating from safe support vectors.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""SVMSMOTE trains an SVM classifier and uses the minority class support vectors as seed points for synthetic sample generation. It classifies support vectors as "safe" or "danger" and applies different generation strategies accordingly.
Adaptive Synthetic Sampling approach that adjusts generation density based on local distributions.
{ .api }
class ADASYN(BaseOverSampler):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
n_neighbors=5,
):
"""
Parameters
----------
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
n_neighbors : int or estimator object, default=5
The nearest neighbors used to determine local distribution and
generate synthetic samples. Can be int for number of neighbors
or fitted neighbors estimator.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""ADASYN calculates a difficulty coefficient for each minority sample based on the ratio of majority class neighbors. Samples in more difficult regions (surrounded by majority samples) generate more synthetic samples, adapting to local class distributions.
Applies K-Means clustering before SMOTE generation to handle complex data distributions.
{ .api }
class KMeansSMOTE(BaseSMOTE):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
k_neighbors=2,
n_jobs=None,
kmeans_estimator=None,
cluster_balance_threshold="auto",
density_exponent="auto",
):
"""
Parameters
----------
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
k_neighbors : int or object, default=2
The nearest neighbors used for generating synthetic samples.
n_jobs : int, default=None
Number of CPU cores used during the cross-validation loop.
kmeans_estimator : int or object, default=None
K-Means clustering estimator or number of clusters. If None,
uses MiniBatchKMeans. If int, creates MiniBatchKMeans with
that number of clusters.
cluster_balance_threshold : "auto" or float, default="auto"
Threshold for determining balanced clusters. If "auto",
determined by class ratios. Manual threshold can be set.
density_exponent : "auto" or float, default="auto"
Exponent for cluster density calculation. If "auto", uses
feature-length based exponent.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""KMeansSMOTE first clusters the data, then identifies imbalanced clusters where the minority class representation falls below a threshold. It applies SMOTE within these clusters, distributing synthetic samples based on cluster sparsity to achieve better balance in complex, multimodal datasets.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# Create imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
weights=[0.1, 0.9], n_informative=3,
n_redundant=1, flip_y=0, n_features=20,
n_clusters_per_class=1, n_samples=1000,
random_state=10)
print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({1: 900, 0: 100})
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({0: 900, 1: 900})

import numpy as np
from numpy.random import RandomState
from imblearn.over_sampling import SMOTENC
# Simulate mixed dataset with categorical features
X, y = make_classification(n_classes=2, class_sep=2,
weights=[0.1, 0.9], n_informative=3,
n_redundant=1, flip_y=0, n_features=20,
n_clusters_per_class=1, n_samples=1000,
random_state=10)
# Make last 2 columns categorical
X[:, -2:] = RandomState(10).randint(0, 4, size=(1000, 2))
sm = SMOTENC(random_state=42, categorical_features=[18, 19])
X_res, y_res = sm.fit_resample(X, y)
print(f'Resampled dataset samples per class {Counter(y_res)}')
# Resampled dataset samples per class Counter({0: 900, 1: 900})

from imblearn.over_sampling import BorderlineSMOTE
# Focus on borderline samples
sm = BorderlineSMOTE(random_state=42, kind='borderline-1')
X_res, y_res = sm.fit_resample(X, y)
print('Borderline SMOTE result %s' % Counter(y_res))
# Generates samples only from minority samples near decision boundary

{ .api }
from typing import Union, Dict, Callable, Optional, Any
import numpy as np
from numpy import ndarray
from scipy.sparse import spmatrix
from sklearn.base import BaseEstimator
ArrayLike = Union[ndarray, spmatrix]
SamplingStrategy = Union[float, str, Dict[Any, int], Callable[[ndarray], Dict[Any, int]]]
NeighborsLike = Union[int, BaseEstimator]
RandomState = Union[int, np.random.RandomState, None]

All over-sampling methods share common characteristics: they inherit from BaseOverSampler and implement the standard fit_resample(X, y) interface described above.
Install with Tessl CLI
npx tessl i tessl/pypi-imbalanced-learn