
Under-Sampling Methods

Under-sampling methods reduce the size of the majority class(es) to address class imbalance. These techniques remove samples from the dataset, either randomly or using intelligent selection criteria to preserve important boundary information.

Categories of Under-Sampling Methods

Random Under-Sampling

Methods that randomly select samples to remove from majority classes.

Prototype Generation

Methods that replace the original majority samples with a smaller set of newly generated samples (for example, cluster centroids) that summarize the original distribution.

Prototype Selection

Methods that intelligently select which samples to keep based on neighborhood analysis, distance metrics, or classification difficulty.

Neighborhood Cleaning

Methods that remove noisy samples or samples that negatively affect classification performance.
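
The usage examples below assume an imbalanced feature matrix X and label vector y are already defined; one way to create such data, shown purely for illustration, is scikit-learn's make_classification:

from collections import Counter
from sklearn.datasets import make_classification

# A 2-class dataset with roughly a 9:1 class imbalance
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)
print(Counter(y))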


Random Under-Sampling

RandomUnderSampler

Random under-sampling of majority class samples with or without replacement.

class RandomUnderSampler:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        replacement=False
    ):

Parameters:

  • sampling_strategy (str, dict, list): Strategy to control sampling. Default is "auto".
  • random_state (int, RandomState, None): Random number generator seed for reproducibility.
  • replacement (bool): Whether sampling is with or without replacement. Default is False.

Attributes:

  • sampling_strategy_ (dict): Dictionary containing sampling information per class.
  • sample_indices_ (ndarray): Indices of selected samples.
  • n_features_in_ (int): Number of input features.
  • feature_names_in_ (ndarray): Names of input features when available.

Methods:

  • fit_resample(X, y): Fit the sampler and resample the dataset.

Usage Example:

from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# Create random under-sampler
rus = RandomUnderSampler(random_state=42)

# Apply under-sampling
X_resampled, y_resampled = rus.fit_resample(X, y)
print(f"Original: {Counter(y)}")
print(f"Resampled: {Counter(y_resampled)}")

Prototype Generation

ClusterCentroids

Under-sample by generating centroids based on clustering methods. Replaces clusters of majority samples with their centroids.

class ClusterCentroids:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        estimator=None,
        voting="auto"
    ):

Parameters:

  • sampling_strategy (str, dict, list): Strategy to control sampling. Default is "auto".
  • random_state (int, RandomState, None): Random number generator seed.
  • estimator (estimator object): Clustering estimator with n_clusters parameter and cluster_centers_ attribute. Defaults to KMeans.
  • voting (str): Voting strategy for generating new samples:
    • "hard": Use the nearest original samples to the centroids
    • "soft": Use the centroids themselves
    • "auto": "hard" if the input is sparse, "soft" otherwise

Attributes:

  • sampling_strategy_ (dict): Dictionary containing sampling information per class.
  • estimator_ (estimator object): The validated clustering estimator.
  • voting_ (str): The validated voting strategy.
  • n_features_in_ (int): Number of input features.
  • feature_names_in_ (ndarray): Names of input features when available.

Methods:

  • fit_resample(X, y): Fit the sampler and resample the dataset.

Usage Example:

from imblearn.under_sampling import ClusterCentroids
from sklearn.cluster import MiniBatchKMeans

# Create cluster centroids sampler with custom estimator
cc = ClusterCentroids(
    estimator=MiniBatchKMeans(n_init=1, random_state=0),
    random_state=42
)

# Apply cluster-based under-sampling
X_resampled, y_resampled = cc.fit_resample(X, y)
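
If the resampled data should contain only original samples rather than synthetic centroids, voting="hard" substitutes each centroid with its nearest real sample, as in this short sketch:

from imblearn.under_sampling import ClusterCentroids

# voting="hard" replaces each centroid with its nearest original sample,
# so the resampled majority class contains only real data points
cc_hard = ClusterCentroids(voting="hard", random_state=42)
X_hard, y_hard = cc_hard.fit_resample(X, y)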

Prototype Selection Methods

NearMiss

Under-sample using the NearMiss heuristics, which select majority class samples according to their distances to minority class samples.

class NearMiss:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        version=1,
        n_neighbors=3,
        n_neighbors_ver3=3,
        n_jobs=None
    ):

Parameters:

  • sampling_strategy (str, dict, list): Strategy to control sampling.
  • version (int): NearMiss version (1, 2, or 3):
    • Version 1: Keep majority samples with the smallest average distance to their closest minority samples
    • Version 2: Keep majority samples with the smallest average distance to their farthest minority samples
    • Version 3: Two steps: keep the nearest majority neighbors of each minority sample, then keep those with the largest average distance to their closest minority samples
  • n_neighbors (int, estimator): Number of neighbors or KNN estimator.
  • n_neighbors_ver3 (int, estimator): Number of neighbors for version 3 pre-selection.
  • n_jobs (int): Number of parallel jobs.

Attributes:

  • sampling_strategy_ (dict): Dictionary containing sampling information.
  • nn_ (estimator object): Validated K-nearest neighbors estimator.
  • nn_ver3_ (estimator object): K-nearest neighbors estimator for version 3.
  • sample_indices_ (ndarray): Indices of selected samples.

Usage Example:

from imblearn.under_sampling import NearMiss

# NearMiss version 1 (select closest to minority)
nm1 = NearMiss(version=1)
X_res1, y_res1 = nm1.fit_resample(X, y)

# NearMiss version 3 (two-step selection)
nm3 = NearMiss(version=3, n_neighbors=3, n_neighbors_ver3=3)
X_res3, y_res3 = nm3.fit_resample(X, y)
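
All three versions produce the same class counts under the default strategy but retain different majority rows; a quick way to compare them (building on the import above):

from collections import Counter

# Same class counts, different retained rows per version
for version in (1, 2, 3):
    nm = NearMiss(version=version)
    X_v, y_v = nm.fit_resample(X, y)
    print(version, Counter(y_v), nm.sample_indices_[:5])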

InstanceHardnessThreshold

Under-sample by removing the samples that are hardest to classify, where hardness is estimated from cross-validated prediction probabilities.

class InstanceHardnessThreshold:
    def __init__(
        self,
        *,
        estimator=None,
        sampling_strategy="auto", 
        random_state=None,
        cv=5,
        n_jobs=None
    ):

Parameters:

  • estimator (estimator object): Classifier with predict_proba method. Defaults to RandomForestClassifier.
  • sampling_strategy (str, dict, list): Strategy to control sampling.
  • random_state (int, RandomState, None): Random number generator seed.
  • cv (int): Number of cross-validation folds for hardness estimation.
  • n_jobs (int): Number of parallel jobs.

Attributes:

  • sampling_strategy_ (dict): Dictionary containing sampling information.
  • estimator_ (estimator object): The validated classifier.
  • sample_indices_ (ndarray): Indices of selected samples.

Usage Example:

from imblearn.under_sampling import InstanceHardnessThreshold
from sklearn.ensemble import RandomForestClassifier

# Use custom classifier for hardness estimation
iht = InstanceHardnessThreshold(
    estimator=RandomForestClassifier(n_estimators=50),
    cv=3,
    random_state=42
)
X_resampled, y_resampled = iht.fit_resample(X, y)
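
Instance hardness is essentially one minus the cross-validated probability a classifier assigns to a sample's true class; the hardest samples in the over-represented classes are dropped. A minimal sketch of that quantity (not the estimator's internal code):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

y_arr = np.asarray(y)

# Out-of-fold class probabilities for every sample
proba = cross_val_predict(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y_arr, cv=3, method="predict_proba",
)

# Hardness = 1 - probability assigned to the sample's true class
classes = np.unique(y_arr)
true_col = np.searchsorted(classes, y_arr)
hardness = 1 - proba[np.arange(len(y_arr)), true_col]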

TomekLinks

Under-sample by removing Tomek links: pairs of samples from different classes that are each other's nearest neighbor.

class TomekLinks:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_jobs=None
    ):

Parameters:

  • sampling_strategy (str, dict, list): Strategy to control which classes to clean.
  • n_jobs (int): Number of parallel jobs.

Attributes:

  • sampling_strategy_ (dict): Dictionary containing sampling information.
  • sample_indices_ (ndarray): Indices of selected samples.

Methods:

  • fit_resample(X, y): Remove Tomek links from the dataset.
  • is_tomek(y, nn_index, class_type): Static method to detect Tomek pairs.

Usage Example:

from imblearn.under_sampling import TomekLinks

# Remove Tomek links (noisy border samples)
tl = TomekLinks()
X_cleaned, y_cleaned = tl.fit_resample(X, y)
print(f"Removed {len(y) - len(y_cleaned)} Tomek links")

EditedNearestNeighbours

Under-sample by removing samples that disagree with their neighborhood: a sample is removed when any of its nearest neighbors (or most of them, depending on kind_sel) belongs to a different class.

class EditedNearestNeighbours:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_neighbors=3,
        kind_sel="all",
        n_jobs=None
    ):

Parameters:

  • sampling_strategy (str, dict, list): Strategy to control sampling.
  • n_neighbors (int, estimator): Number of neighbors to examine or KNN estimator.
  • kind_sel (str): Selection strategy:
    • "all": Remove if any neighbor is from different class
    • "mode": Remove if most neighbors are from different class
  • n_jobs (int): Number of parallel jobs.

Attributes:

  • sampling_strategy_ (dict): Dictionary containing sampling information.
  • nn_ (estimator object): Validated K-nearest neighbors estimator.
  • sample_indices_ (ndarray): Indices of selected samples.

Usage Example:

from imblearn.under_sampling import EditedNearestNeighbours

# Aggressive cleaning (remove if any neighbor differs)
enn_all = EditedNearestNeighbours(kind_sel="all", n_neighbors=3)
X_clean_all, y_clean_all = enn_all.fit_resample(X, y)

# Less aggressive cleaning (remove if majority neighbors differ)  
enn_mode = EditedNearestNeighbours(kind_sel="mode", n_neighbors=5)
X_clean_mode, y_clean_mode = enn_mode.fit_resample(X, y)
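
Conceptually, kind_sel="all" keeps a sample only when every one of its k nearest neighbors shares its class. A rough illustration of that check with scikit-learn's NearestNeighbors (this ignores sampling_strategy, which the real estimator uses to edit only the targeted classes):

import numpy as np
from sklearn.neighbors import NearestNeighbors

y_arr = np.asarray(y)

# 3 nearest neighbours of every sample (column 0 is the sample itself)
nn = NearestNeighbors(n_neighbors=4).fit(X)
_, idx = nn.kneighbors(X)
neighbor_labels = y_arr[idx[:, 1:]]

# kind_sel="all": keep a sample only if all neighbours share its label
keep_mask = np.all(neighbor_labels == y_arr[:, None], axis=1)
print(f"Samples kept by the 'all' rule: {keep_mask.sum()} of {len(y_arr)}")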

RepeatedEditedNearestNeighbours

Repeated application of EditedNearestNeighbours until no further samples are removed or a stopping criterion (such as max_iter) is reached.

class RepeatedEditedNearestNeighbours:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_neighbors=3,
        max_iter=100,
        kind_sel="all", 
        n_jobs=None
    ):

Parameters:

  • sampling_strategy (str, dict, list): Strategy to control sampling.
  • n_neighbors (int, estimator): Number of neighbors or KNN estimator.
  • max_iter (int): Maximum number of iterations.
  • kind_sel (str): Selection strategy ("all" or "mode").
  • n_jobs (int): Number of parallel jobs.

Attributes:

  • sampling_strategy_ (dict): Dictionary containing sampling information.
  • nn_ (estimator object): Validated K-nearest neighbors estimator.
  • enn_ (sampler object): The EditedNearestNeighbours instance.
  • sample_indices_ (ndarray): Indices of selected samples.
  • n_iter_ (int): Number of iterations performed.

Usage Example:

from imblearn.under_sampling import RepeatedEditedNearestNeighbours

# Repeat ENN until convergence
renn = RepeatedEditedNearestNeighbours(
    n_neighbors=3,
    max_iter=50,
    kind_sel="all"
)
X_resampled, y_resampled = renn.fit_resample(X, y)
print(f"Converged after {renn.n_iter_} iterations")

AllKNN

Apply EditedNearestNeighbours with increasing neighborhood sizes from 1 to n_neighbors.

class AllKNN:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_neighbors=3,
        kind_sel="all",
        allow_minority=False,
        n_jobs=None
    ):

Parameters:

  • sampling_strategy (str, dict, list): Strategy to control sampling.
  • n_neighbors (int, estimator): Maximum number of neighbors or KNN estimator.
  • kind_sel (str): Selection strategy ("all" or "mode").
  • allow_minority (bool): Allow majority classes to become minority classes.
  • n_jobs (int): Number of parallel jobs.

Attributes:

  • sampling_strategy_ (dict): Dictionary containing sampling information.
  • nn_ (estimator object): Validated K-nearest neighbors estimator.
  • enn_ (sampler object): The EditedNearestNeighbours instance.
  • sample_indices_ (ndarray): Indices of selected samples.

Usage Example:

from imblearn.under_sampling import AllKNN

# Progressive neighborhood cleaning  
allknn = AllKNN(n_neighbors=5, kind_sel="all")
X_resampled, y_resampled = allknn.fit_resample(X, y)

OneSidedSelection

Under-sample using the one-sided selection method, which combines a 1-NN condensation step with Tomek link removal.

class OneSidedSelection:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None
    ):

Parameters:

  • sampling_strategy (str, dict, list): Strategy to control sampling.
  • random_state (int, RandomState, None): Random number generator seed.
  • n_neighbors (int, estimator, None): Number of neighbors or KNN estimator. Defaults to 1-NN.
  • n_seeds_S (int): Number of seed samples to extract for set S.
  • n_jobs (int): Number of parallel jobs.

Attributes:

  • sampling_strategy_ (dict): Dictionary containing sampling information.
  • estimators_ (list): List of KNN estimators used per class.
  • sample_indices_ (ndarray): Indices of selected samples.

Usage Example:

from imblearn.under_sampling import OneSidedSelection

# One-sided selection with custom parameters
oss = OneSidedSelection(
    n_neighbors=3,
    n_seeds_S=1,
    random_state=42
)
X_resampled, y_resampled = oss.fit_resample(X, y)

CondensedNearestNeighbour

Under-sample using the condensed nearest neighbor rule to find a consistent subset of samples.

class CondensedNearestNeighbour:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None
    ):

Parameters:

  • sampling_strategy (str, dict, list): Strategy to control sampling.
  • random_state (int, RandomState, None): Random number generator seed.
  • n_neighbors (int, estimator, None): Number of neighbors or KNN estimator. Defaults to 1-NN.
  • n_seeds_S (int): Number of seed samples for set S initialization.
  • n_jobs (int): Number of parallel jobs.

Attributes:

  • sampling_strategy_ (dict): Dictionary containing sampling information.
  • estimators_ (list): List of KNN estimators used per class.
  • sample_indices_ (ndarray): Indices of selected samples.

Usage Example:

from imblearn.under_sampling import CondensedNearestNeighbour

# Condensed nearest neighbor selection
cnn = CondensedNearestNeighbour(
    n_neighbors=1,
    n_seeds_S=1,
    random_state=42
)
X_resampled, y_resampled = cnn.fit_resample(X, y)

Neighborhood Cleaning Methods

NeighbourhoodCleaningRule

Under-sample using the neighborhood cleaning rule, which combines ENN with a KNN-based rule for noise removal.

class NeighbourhoodCleaningRule:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        edited_nearest_neighbours=None,
        n_neighbors=3,
        threshold_cleaning=0.5,
        n_jobs=None
    ):

Parameters:

  • sampling_strategy (str, dict, list): Strategy to control sampling.
  • edited_nearest_neighbours (estimator, None): ENN estimator for initial cleaning. Defaults to ENN with kind_sel="mode".
  • n_neighbors (int, estimator): Number of neighbors or KNN estimator.
  • threshold_cleaning (float): Threshold for including a class in the second cleaning phase: class i is cleaned when Ci > C × threshold, where Ci is the number of samples in class i and C is the total number of samples.
  • n_jobs (int): Number of parallel jobs.

Attributes:

  • sampling_strategy_ (dict): Dictionary containing sampling information.
  • edited_nearest_neighbours_ (estimator): The ENN object for first cleaning phase.
  • nn_ (estimator object): Validated K-nearest neighbors estimator.
  • classes_to_clean_ (list): Classes considered for second cleaning phase.
  • sample_indices_ (ndarray): Indices of selected samples.

Usage Example:

from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.under_sampling import EditedNearestNeighbours

# Default neighborhood cleaning
ncr = NeighbourhoodCleaningRule()
X_cleaned, y_cleaned = ncr.fit_resample(X, y)

# Custom ENN for first phase
custom_enn = EditedNearestNeighbours(kind_sel="all", n_neighbors=5)
ncr_custom = NeighbourhoodCleaningRule(
    edited_nearest_neighbours=custom_enn,
    threshold_cleaning=0.3
)
X_cleaned_custom, y_cleaned_custom = ncr_custom.fit_resample(X, y)

Method Selection Guidelines

When to Use Each Method

Random Under-Sampling:

  • Simple baseline approach
  • When computational resources are limited
  • For initial experimentation

Prototype Generation (ClusterCentroids):

  • When you want to preserve cluster structure
  • For high-dimensional data where centroids can represent regions well
  • When interpretability of synthetic samples is important

Prototype Selection (NearMiss, ENN variants):

  • When preserving decision boundary information is crucial
  • For datasets where border samples are informative
  • When you want to remove noisy/outlier samples

Neighborhood Cleaning:

  • When dataset contains significant noise
  • For improving classifier performance through data cleaning
  • When combining multiple cleaning strategies

Computational Complexity

  • RandomUnderSampler: O(n) - fastest
  • ClusterCentroids: O(n × k × iterations) - depends on clustering algorithm
  • NearMiss: O(n²) - distance calculations between all samples
  • ENN variants: O(n × k × neighbors) - depends on neighborhood size
  • TomekLinks: O(n²) - pairwise distance calculations
  • CNN/OSS: O(n²) - iterative neighbor searches

Multi-Class Support

All methods support multi-class resampling:

  • One-vs.-rest: NearMiss, ENN variants, TomekLinks, NeighbourhoodCleaningRule
  • One-vs.-one: OneSidedSelection, CondensedNearestNeighbour
  • Independent sampling: RandomUnderSampler, ClusterCentroids, InstanceHardnessThreshold
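
Beyond the string strategies, sampling_strategy also accepts a dict mapping each class label to its desired number of samples after resampling, which gives explicit per-class control in the multi-class case. A small sketch (the dataset and counts are illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# A 3-class dataset with a 70/20/10 class split
X_mc, y_mc = make_classification(
    n_samples=3000, n_classes=3, n_informative=4,
    weights=[0.7, 0.2, 0.1], random_state=0,
)
print(Counter(y_mc))

# A dict requests an explicit per-class sample count after resampling
rus = RandomUnderSampler(
    sampling_strategy={0: 200, 1: 200, 2: 200}, random_state=0
)
X_bal, y_bal = rus.fit_resample(X_mc, y_mc)
print(Counter(y_bal))  # {0: 200, 1: 200, 2: 200}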

Pipeline Integration

from imblearn.pipeline import Pipeline  # samplers require the imbalanced-learn Pipeline
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler

# Create preprocessing pipeline
pipeline = Pipeline([
    ('sampler', RandomUnderSampler(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit pipeline
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
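
Because the sampler is embedded in the pipeline, cross-validation resamples only the training portion of each fold, so scores are computed on untouched test data. For example:

from sklearn.model_selection import cross_val_score

# The sampler runs only on each training fold; test folds stay untouched
scores = cross_val_score(pipeline, X, y, cv=5, scoring="balanced_accuracy")
print(f"Balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")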
