Toolbox for imbalanced datasets in machine learning
—
Over-sampling techniques address class imbalance by generating synthetic samples for minority classes. Unlike under-sampling, which removes samples, over-sampling increases the dataset size by creating new instances that follow the distribution patterns of existing minority class samples.
The imbalanced-learn library provides several sophisticated over-sampling algorithms that use different strategies for synthetic sample generation:
All over-sampling methods inherit from the BaseOverSampler class and implement the standard fit_resample(X, y) interface.
Random over-sampling with optional smoothed bootstrap generation.
{ .api }
class RandomOverSampler(BaseOverSampler):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
shrinkage=None,
):
"""
Parameters
----------
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
shrinkage : float or dict, default=None
Parameter controlling the shrinkage applied to the covariance matrix
when a smoothed bootstrap is generated. If None, normal bootstrap
without perturbation. If float, same shrinkage for all classes.
If dict, class-specific shrinkage factors.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""The RandomOverSampler performs basic over-sampling by selecting samples at random with replacement. When shrinkage is specified, it generates smoothed bootstrap samples by adding small perturbations, also known as Random Over-Sampling Examples (ROSE).
Synthetic Minority Over-sampling Technique - the original algorithm for generating synthetic samples.
{ .api }
class SMOTE(BaseSMOTE):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
k_neighbors=5,
):
"""
Parameters
----------
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
k_neighbors : int or object, default=5
The nearest neighbors used to define the neighborhood of samples
for generating synthetic samples. Can be int for number of neighbors
or a fitted neighbors estimator with kneighbors and kneighbors_graph methods.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""SMOTE generates synthetic samples by interpolating between a minority sample and its k nearest neighbors. For each minority sample, it selects one of its k nearest neighbors randomly and creates a synthetic sample somewhere along the line segment between them.
SMOTE for datasets containing both numerical and categorical features.
{ .api }
class SMOTENC(SMOTE):
def __init__(
self,
categorical_features,
*,
categorical_encoder=None,
sampling_strategy="auto",
random_state=None,
k_neighbors=5,
):
"""
Parameters
----------
categorical_features : "auto" or array-like of shape (n_cat_features,) or (n_features,)
Specifies which features are categorical. Can be:
- "auto" to automatically detect from pandas DataFrame with CategoricalDtype
- array of int corresponding to categorical feature indices
- array of str corresponding to feature names (requires pandas DataFrame)
- boolean mask array of shape (n_features,)
categorical_encoder : estimator, default=None
One-hot encoder used to encode categorical features. If None,
uses OneHotEncoder with handle_unknown='ignore'.
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
k_neighbors : int or object, default=5
The nearest neighbors used for generating synthetic samples.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""SMOTENC handles mixed-type datasets by applying standard SMOTE interpolation to numerical features while using mode-based selection for categorical features. Categorical features are encoded with one-hot encoding during processing.
SMOTE variant specifically designed for categorical features only.
{ .api }
class SMOTEN(SMOTE):
def __init__(
self,
categorical_encoder=None,
*,
sampling_strategy="auto",
random_state=None,
k_neighbors=5,
):
"""
Parameters
----------
categorical_encoder : estimator, default=None
Ordinal encoder used to encode categorical features. If None,
uses OrdinalEncoder with default parameters.
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
k_neighbors : int or object, default=5
The nearest neighbors used for generating synthetic samples.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""SMOTEN works exclusively with categorical features and uses the Value Difference Metric (VDM) to compute distances between categorical samples. Synthetic samples are generated by selecting the most frequent category among nearest neighbors for each feature.
SMOTE variant that focuses on samples near class boundaries.
{ .api }
class BorderlineSMOTE(BaseSMOTE):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
k_neighbors=5,
m_neighbors=10,
kind="borderline-1",
):
"""
Parameters
----------
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
k_neighbors : int or object, default=5
The nearest neighbors used for generating synthetic samples.
m_neighbors : int or object, default=10
The nearest neighbors used to determine if a minority sample
is in "danger" (near the boundary).
kind : {"borderline-1", "borderline-2"}, default='borderline-1'
The type of borderline SMOTE algorithm:
- "borderline-1": considers only positive class for neighbor selection
- "borderline-2": considers whole dataset, applies weight adjustments
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""BorderlineSMOTE identifies "danger" samples that are close to the decision boundary (having more majority class neighbors than minority). It generates synthetic samples only from these borderline cases, focusing oversampling where it's most needed.
SVM-based SMOTE that uses support vectors to identify critical samples.
{ .api }
class SVMSMOTE(BaseSMOTE):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
k_neighbors=5,
m_neighbors=10,
svm_estimator=None,
out_step=0.5,
):
"""
Parameters
----------
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
k_neighbors : int or object, default=5
The nearest neighbors used for generating synthetic samples.
m_neighbors : int or object, default=10
The nearest neighbors used to determine sample safety/danger status.
svm_estimator : estimator object, default=SVC()
SVM classifier used to identify support vectors. Must expose
support_ attribute after fitting.
out_step : float, default=0.5
Step size when extrapolating from safe support vectors.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""SVMSMOTE trains an SVM classifier and uses the minority class support vectors as seed points for synthetic sample generation. It classifies support vectors as "safe" or "danger" and applies different generation strategies accordingly.
Adaptive Synthetic Sampling approach that adjusts generation density based on local distributions.
{ .api }
class ADASYN(BaseOverSampler):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
n_neighbors=5,
):
"""
Parameters
----------
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
n_neighbors : int or estimator object, default=5
The nearest neighbors used to determine local distribution and
generate synthetic samples. Can be int for number of neighbors
or fitted neighbors estimator.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""ADASYN calculates a difficulty coefficient for each minority sample based on the ratio of majority class neighbors. Samples in more difficult regions (surrounded by majority samples) generate more synthetic samples, adapting to local class distributions.
Applies K-Means clustering before SMOTE generation to handle complex data distributions.
{ .api }
class KMeansSMOTE(BaseSMOTE):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
k_neighbors=2,
n_jobs=None,
kmeans_estimator=None,
cluster_balance_threshold="auto",
density_exponent="auto",
):
"""
Parameters
----------
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
random_state : int, RandomState instance or None, default=None
Control the randomization of the algorithm.
k_neighbors : int or object, default=2
The nearest neighbors used for generating synthetic samples.
n_jobs : int, default=None
Number of CPU cores used during the cross-validation loop.
kmeans_estimator : int or object, default=None
K-Means clustering estimator or number of clusters. If None,
uses MiniBatchKMeans. If int, creates MiniBatchKMeans with
that number of clusters.
cluster_balance_threshold : "auto" or float, default="auto"
Threshold for determining balanced clusters. If "auto",
determined by class ratios. Manual threshold can be set.
density_exponent : "auto" or float, default="auto"
Exponent for cluster density calculation. If "auto", uses
feature-length based exponent.
"""
def fit_resample(self, X, y):
"""
Resample the dataset.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The input samples.
y : array-like of shape (n_samples,)
The input targets.
Returns
-------
X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like of shape (n_samples_new,)
The corresponding label of `X_resampled`.
"""KMeansSMOTE first clusters the data, then identifies imbalanced clusters where the minority class representation falls below a threshold. It applies SMOTE within these clusters, distributing synthetic samples based on cluster sparsity to achieve better balance in complex, multimodal datasets.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# Create imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
weights=[0.1, 0.9], n_informative=3,
n_redundant=1, flip_y=0, n_features=20,
n_clusters_per_class=1, n_samples=1000,
random_state=10)
print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({1: 900, 0: 100})
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({0: 900, 1: 900})

import numpy as np
from numpy.random import RandomState
from imblearn.over_sampling import SMOTENC
# Simulate mixed dataset with categorical features
X, y = make_classification(n_classes=2, class_sep=2,
weights=[0.1, 0.9], n_informative=3,
n_redundant=1, flip_y=0, n_features=20,
n_clusters_per_class=1, n_samples=1000,
random_state=10)
# Make last 2 columns categorical
X[:, -2:] = RandomState(10).randint(0, 4, size=(1000, 2))
sm = SMOTENC(random_state=42, categorical_features=[18, 19])
X_res, y_res = sm.fit_resample(X, y)
print(f'Resampled dataset samples per class {Counter(y_res)}')
# Resampled dataset samples per class Counter({0: 900, 1: 900})

from imblearn.over_sampling import BorderlineSMOTE
# Focus on borderline samples
sm = BorderlineSMOTE(random_state=42, kind='borderline-1')
X_res, y_res = sm.fit_resample(X, y)
print('Borderline SMOTE result %s' % Counter(y_res))
# Generates samples only from minority samples near decision boundary

{ .api }
from typing import Union, Dict, Callable, Optional, Any
import numpy as np
from numpy import ndarray
from scipy.sparse import spmatrix
from sklearn.base import BaseEstimator
ArrayLike = Union[ndarray, spmatrix]
SamplingStrategy = Union[float, str, Dict[Any, int], Callable[[ndarray], Dict[Any, int]]]
NeighborsLike = Union[int, BaseEstimator]
RandomState = Union[int, np.random.RandomState, None]

All over-sampling methods share common characteristics: they inherit from BaseOverSampler and implement the standard fit_resample(X, y) interface described above.
Install with Tessl CLI
npx tessl i tessl/pypi-imbalanced-learn