Toolbox for imbalanced datasets in machine learning
—
Under-sampling methods reduce the size of the majority class(es) to address class imbalance. These techniques remove samples from the dataset, either randomly or using intelligent selection criteria to preserve important boundary information.
Random under-sampling: methods that randomly select samples to remove from majority classes.
Prototype generation: methods that generate new synthetic samples to represent the original data distribution.
Prototype selection: methods that intelligently select which samples to keep based on neighborhood analysis, distance metrics, or classification difficulty.
Neighborhood cleaning: methods that remove noisy samples or samples that negatively affect classification performance.
Random under-sampling of majority class samples with or without replacement.
class RandomUnderSampler:
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
replacement=False
):

Parameters:
sampling_strategy (str, dict, list): Strategy to control sampling. Default is "auto".
random_state (int, RandomState, None): Random number generator seed for reproducibility.
replacement (bool): Whether sampling is with or without replacement. Default is False.

Attributes:
sampling_strategy_ (dict): Dictionary containing sampling information per class.
sample_indices_ (ndarray): Indices of selected samples.
n_features_in_ (int): Number of input features.
feature_names_in_ (ndarray): Names of input features when available.

Methods:
fit_resample(X, y): Fit the sampler and resample the dataset.

Usage Example:
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
# Create random under-sampler
rus = RandomUnderSampler(random_state=42)
# Apply under-sampling
X_resampled, y_resampled = rus.fit_resample(X, y)
print(f"Original: {Counter(y)}")
print(f"Resampled: {Counter(y_resampled)}")

Under-sample by generating centroids based on clustering methods. Replaces clusters of majority samples with their centroids.
class ClusterCentroids:
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
estimator=None,
voting="auto"
):

Parameters:
sampling_strategy (str, dict, list): Strategy to control sampling. Default is "auto".
random_state (int, RandomState, None): Random number generator seed.
estimator (estimator object): Clustering estimator with n_clusters parameter and cluster_centers_ attribute. Defaults to KMeans.
voting (str): Voting strategy for generating new samples:
- "hard": use the nearest neighbors of the centroids found by the clustering estimator.
- "soft": use the centroids themselves.
- "auto": "hard" for sparse input, "soft" otherwise.

Attributes:
sampling_strategy_ (dict): Dictionary containing sampling information per class.
estimator_ (estimator object): The validated clustering estimator.
voting_ (str): The validated voting strategy.
n_features_in_ (int): Number of input features.
feature_names_in_ (ndarray): Names of input features when available.

Methods:
fit_resample(X, y): Fit the sampler and resample the dataset.

Usage Example:
from imblearn.under_sampling import ClusterCentroids
from sklearn.cluster import MiniBatchKMeans
# Create cluster centroids sampler with custom estimator
cc = ClusterCentroids(
estimator=MiniBatchKMeans(n_init=1, random_state=0),
random_state=42
)
# Apply cluster-based under-sampling
X_resampled, y_resampled = cc.fit_resample(X, y)

Under-sample with the NearMiss methods, which select majority class samples according to their distance to minority class samples.
class NearMiss:
def __init__(
self,
*,
sampling_strategy="auto",
version=1,
n_neighbors=3,
n_neighbors_ver3=3,
n_jobs=None
):

Parameters:
sampling_strategy (str, dict, list): Strategy to control sampling.
version (int): NearMiss version (1, 2, or 3):
- 1: keep majority samples with the smallest average distance to their N closest minority samples.
- 2: keep majority samples with the smallest average distance to their N farthest minority samples.
- 3: two-step selection: first keep the M nearest majority neighbors of each minority sample, then keep the majority samples with the largest average distance to their N nearest minority samples.
n_neighbors (int, estimator): Number of neighbors or KNN estimator.
n_neighbors_ver3 (int, estimator): Number of neighbors for version 3 pre-selection.
n_jobs (int): Number of parallel jobs.

Attributes:
sampling_strategy_ (dict): Dictionary containing sampling information.
nn_ (estimator object): Validated K-nearest neighbors estimator.
nn_ver3_ (estimator object): K-nearest neighbors estimator for version 3.
sample_indices_ (ndarray): Indices of selected samples.

Usage Example:
from imblearn.under_sampling import NearMiss
# NearMiss version 1 (select closest to minority)
nm1 = NearMiss(version=1)
X_res1, y_res1 = nm1.fit_resample(X, y)
# NearMiss version 3 (two-step selection)
nm3 = NearMiss(version=3, n_neighbors=3, n_neighbors_ver3=3)
X_res3, y_res3 = nm3.fit_resample(X, y)

Under-sample based on an instance hardness threshold computed from cross-validated predictions.
class InstanceHardnessThreshold:
def __init__(
self,
*,
estimator=None,
sampling_strategy="auto",
random_state=None,
cv=5,
n_jobs=None
):

Parameters:
estimator (estimator object): Classifier with predict_proba method. Defaults to RandomForestClassifier.
sampling_strategy (str, dict, list): Strategy to control sampling.
random_state (int, RandomState, None): Random number generator seed.
cv (int): Number of cross-validation folds for hardness estimation.
n_jobs (int): Number of parallel jobs.

Attributes:
sampling_strategy_ (dict): Dictionary containing sampling information.
estimator_ (estimator object): The validated classifier.
sample_indices_ (ndarray): Indices of selected samples.

Usage Example:
from imblearn.under_sampling import InstanceHardnessThreshold
from sklearn.ensemble import RandomForestClassifier
# Use custom classifier for hardness estimation
iht = InstanceHardnessThreshold(
estimator=RandomForestClassifier(n_estimators=50),
cv=3,
random_state=42
)
X_resampled, y_resampled = iht.fit_resample(X, y)

Under-sample by removing Tomek's links: pairs of mutual nearest neighbors that belong to different classes.
class TomekLinks:
def __init__(
self,
*,
sampling_strategy="auto",
n_jobs=None
):

Parameters:
sampling_strategy (str, dict, list): Strategy to control which classes to clean.
n_jobs (int): Number of parallel jobs.

Attributes:
sampling_strategy_ (dict): Dictionary containing sampling information.
sample_indices_ (ndarray): Indices of selected samples.

Methods:
fit_resample(X, y): Remove Tomek links from the dataset.
is_tomek(y, nn_index, class_type): Static method to detect Tomek pairs.

Usage Example:
from imblearn.under_sampling import TomekLinks
# Remove Tomek links (noisy border samples)
tl = TomekLinks()
X_cleaned, y_cleaned = tl.fit_resample(X, y)
print(f"Removed {len(y) - len(y_cleaned)} samples in Tomek links")

Under-sample by removing samples whose neighborhood contains samples from different classes.
class EditedNearestNeighbours:
def __init__(
self,
*,
sampling_strategy="auto",
n_neighbors=3,
kind_sel="all",
n_jobs=None
):

Parameters:
sampling_strategy (str, dict, list): Strategy to control sampling.
n_neighbors (int, estimator): Number of neighbors to examine or KNN estimator.
kind_sel (str): Selection strategy:
- "all": keep a sample only if all of its neighbors belong to its own class (removes more samples).
- "mode": keep a sample if the majority of its neighbors belong to its own class (removes fewer samples).
n_jobs (int): Number of parallel jobs.

Attributes:
sampling_strategy_ (dict): Dictionary containing sampling information.
nn_ (estimator object): Validated K-nearest neighbors estimator.
sample_indices_ (ndarray): Indices of selected samples.

Usage Example:
from imblearn.under_sampling import EditedNearestNeighbours
# Aggressive cleaning (remove if any neighbor differs)
enn_all = EditedNearestNeighbours(kind_sel="all", n_neighbors=3)
X_clean_all, y_clean_all = enn_all.fit_resample(X, y)
# More conservative cleaning (remove only if most neighbors differ)
enn_mode = EditedNearestNeighbours(kind_sel="mode", n_neighbors=5)
X_clean_mode, y_clean_mode = enn_mode.fit_resample(X, y)

Repeated application of EditedNearestNeighbours until convergence or a stopping criterion is met.
class RepeatedEditedNearestNeighbours:
def __init__(
self,
*,
sampling_strategy="auto",
n_neighbors=3,
max_iter=100,
kind_sel="all",
n_jobs=None
):

Parameters:
sampling_strategy (str, dict, list): Strategy to control sampling.
n_neighbors (int, estimator): Number of neighbors or KNN estimator.
max_iter (int): Maximum number of iterations.
kind_sel (str): Selection strategy ("all" or "mode").
n_jobs (int): Number of parallel jobs.

Attributes:
sampling_strategy_ (dict): Dictionary containing sampling information.
nn_ (estimator object): Validated K-nearest neighbors estimator.
enn_ (sampler object): The EditedNearestNeighbours instance.
sample_indices_ (ndarray): Indices of selected samples.
n_iter_ (int): Number of iterations performed.

Usage Example:
from imblearn.under_sampling import RepeatedEditedNearestNeighbours
# Repeat ENN until convergence
renn = RepeatedEditedNearestNeighbours(
n_neighbors=3,
max_iter=50,
kind_sel="all"
)
X_resampled, y_resampled = renn.fit_resample(X, y)
print(f"Converged after {renn.n_iter_} iterations")

Apply EditedNearestNeighbours with increasing neighborhood sizes from 1 to n_neighbors.
class AllKNN:
def __init__(
self,
*,
sampling_strategy="auto",
n_neighbors=3,
kind_sel="all",
allow_minority=False,
n_jobs=None
):

Parameters:
sampling_strategy (str, dict, list): Strategy to control sampling.
n_neighbors (int, estimator): Maximum number of neighbors or KNN estimator.
kind_sel (str): Selection strategy ("all" or "mode").
allow_minority (bool): Allow majority classes to become minority classes.
n_jobs (int): Number of parallel jobs.

Attributes:
sampling_strategy_ (dict): Dictionary containing sampling information.
nn_ (estimator object): Validated K-nearest neighbors estimator.
enn_ (sampler object): The EditedNearestNeighbours instance.
sample_indices_ (ndarray): Indices of selected samples.

Usage Example:
from imblearn.under_sampling import AllKNN
# Progressive neighborhood cleaning
allknn = AllKNN(n_neighbors=5, kind_sel="all")
X_resampled, y_resampled = allknn.fit_resample(X, y)

Under-sample using the one-sided selection method, which combines the condensed nearest neighbor rule with Tomek links removal.
class OneSidedSelection:
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
n_neighbors=None,
n_seeds_S=1,
n_jobs=None
):

Parameters:
sampling_strategy (str, dict, list): Strategy to control sampling.
random_state (int, RandomState, None): Random number generator seed.
n_neighbors (int, estimator, None): Number of neighbors or KNN estimator. Defaults to 1-NN.
n_seeds_S (int): Number of seed samples to extract for set S.
n_jobs (int): Number of parallel jobs.

Attributes:
sampling_strategy_ (dict): Dictionary containing sampling information.
estimators_ (list): List of KNN estimators used per class.
sample_indices_ (ndarray): Indices of selected samples.

Usage Example:
from imblearn.under_sampling import OneSidedSelection
# One-sided selection with custom parameters
oss = OneSidedSelection(
n_neighbors=3,
n_seeds_S=1,
random_state=42
)
X_resampled, y_resampled = oss.fit_resample(X, y)

Under-sample using the condensed nearest neighbor rule to find a consistent subset.
class CondensedNearestNeighbour:
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
n_neighbors=None,
n_seeds_S=1,
n_jobs=None
):

Parameters:
sampling_strategy (str, dict, list): Strategy to control sampling.
random_state (int, RandomState, None): Random number generator seed.
n_neighbors (int, estimator, None): Number of neighbors or KNN estimator. Defaults to 1-NN.
n_seeds_S (int): Number of seed samples for set S initialization.
n_jobs (int): Number of parallel jobs.

Attributes:
sampling_strategy_ (dict): Dictionary containing sampling information.
estimators_ (list): List of KNN estimators used per class.
sample_indices_ (ndarray): Indices of selected samples.

Usage Example:
from imblearn.under_sampling import CondensedNearestNeighbour
# Condensed nearest neighbor selection
cnn = CondensedNearestNeighbour(
n_neighbors=1,
n_seeds_S=1,
random_state=42
)
X_resampled, y_resampled = cnn.fit_resample(X, y)

Under-sample using the neighborhood cleaning rule, which combines ENN and KNN-based cleaning to remove noisy samples.
class NeighbourhoodCleaningRule:
def __init__(
self,
*,
sampling_strategy="auto",
edited_nearest_neighbours=None,
n_neighbors=3,
threshold_cleaning=0.5,
n_jobs=None
):

Parameters:
sampling_strategy (str, dict, list): Strategy to control sampling.
edited_nearest_neighbours (estimator, None): ENN estimator for initial cleaning. Defaults to ENN with kind_sel="mode".
n_neighbors (int, estimator): Number of neighbors or KNN estimator.
threshold_cleaning (float): Threshold for considering classes in the second cleaning phase: a class i is cleaned when Ci > C × threshold.
n_jobs (int): Number of parallel jobs.

Attributes:
sampling_strategy_ (dict): Dictionary containing sampling information.
edited_nearest_neighbours_ (estimator): The ENN object for the first cleaning phase.
nn_ (estimator object): Validated K-nearest neighbors estimator.
classes_to_clean_ (list): Classes considered for the second cleaning phase.
sample_indices_ (ndarray): Indices of selected samples.

Usage Example:
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.under_sampling import EditedNearestNeighbours
# Default neighborhood cleaning
ncr = NeighbourhoodCleaningRule()
X_cleaned, y_cleaned = ncr.fit_resample(X, y)
# Custom ENN for first phase
custom_enn = EditedNearestNeighbours(kind_sel="all", n_neighbors=5)
ncr_custom = NeighbourhoodCleaningRule(
edited_nearest_neighbours=custom_enn,
threshold_cleaning=0.3
)
X_cleaned_custom, y_cleaned_custom = ncr_custom.fit_resample(X, y)

Choosing a method:
Random Under-Sampling: fast and simple, but discards majority samples at random and may lose informative ones.
Prototype Generation (ClusterCentroids): replaces clusters of majority samples with synthetic centroids that summarize the original distribution.
Prototype Selection (NearMiss, ENN variants): keeps a subset of the original samples chosen by distance or neighborhood criteria.
Neighborhood Cleaning: removes noisy and borderline samples rather than enforcing an exact class ratio.
All methods support multi-class resampling. Samplers can also be chained with an estimator in a pipeline:
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
# Create preprocessing pipeline
pipeline = Pipeline([
('sampler', RandomUnderSampler(random_state=42)),
('classifier', RandomForestClassifier(random_state=42))
])
# Fit pipeline
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

Install with Tessl CLI
npx tessl i tessl/pypi-imbalanced-learn