Toolbox for imbalanced datasets in machine learning
—
Utilities for handling imbalanced datasets in deep learning frameworks, providing balanced batch generators for Keras and TensorFlow that ensure fair representation of all classes during training.
Imbalanced-learn provides specialized batch generators for deep learning frameworks that address class imbalance by creating balanced batches during training. These tools integrate seamlessly with Keras and TensorFlow workflows while maintaining the benefits of sampling techniques.
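Before the API details, the core idea can be sketched in plain Python: each epoch, draw an equal number of indices from every class (under-sampling the majority classes), then slice batches from the balanced index set. This is a conceptual sketch of the balancing idea, not imbalanced-learn's actual implementation:

```python
import random
from collections import Counter, defaultdict

def balanced_batches(X, y, batch_size=4, seed=0):
    """Yield batches drawn from a class-balanced view of (X, y):
    every class is under-sampled to the size of the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    n_min = min(len(idx) for idx in by_class.values())
    # Under-sample each class to n_min indices, then shuffle them together.
    indices = [i for idx in by_class.values() for i in rng.sample(idx, n_min)]
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield [X[i] for i in batch], [y[i] for i in batch]

# A 90/10 imbalanced toy dataset: across one epoch, both classes
# contribute the same number of samples.
X = list(range(100))
y = [0] * 90 + [1] * 10
labels = [label for _, yb in balanced_batches(X, y) for label in yb]
print(Counter(labels))  # both classes appear 10 times
```

The library's generators follow the same pattern but delegate the index selection to any imblearn sampler that records `sample_indices_`.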
BalancedBatchGenerator class and balanced_batch_generator function

BalancedBatchGenerator { .api }
```python
class BalancedBatchGenerator:
    def __init__(
        self,
        X,
        y,
        *,
        sample_weight=None,
        sampler=None,
        batch_size=32,
        keep_sparse=False,
        random_state=None
    ): ...
    def __len__(self): ...
    def __getitem__(self, index): ...
```

Create balanced batches when training a Keras model using the Sequence API.
Parameters:
- X (ndarray of shape (n_samples, n_features)): Original imbalanced dataset
- y (ndarray of shape (n_samples,) or (n_samples, n_classes)): Associated targets
- sample_weight (ndarray of shape (n_samples,), default=None): Sample weights
- sampler (sampler object, default=None): A sampler instance which has an attribute sample_indices_. By default, the sampler used is a RandomUnderSampler
- batch_size (int, default=32): Number of samples per gradient update
- keep_sparse (bool, default=False): Whether or not to conserve the sparsity of the input. By default, the returned batches will be dense
- random_state (int, RandomState instance or None, default=None): Control the randomization of the algorithm

Attributes:
- sampler_ (sampler object): The sampler used to balance the dataset
- indices_ (ndarray of shape (n_samples, n_features)): The indices of the samples selected during sampling

Methods:

def __len__(self) -> int

Returns the number of batches per epoch.

def __getitem__(self, index) -> tuple[ndarray, ndarray] | tuple[ndarray, ndarray, ndarray]

Generate one batch of data.

Parameters:
- index (int): Batch index

Returns:
- (tuple): Either (X_batch, y_batch) or (X_batch, y_batch, sample_weight_batch) if sample weights are provided

Usage with Keras:
The class implements the Keras Sequence interface for use with model.fit():
```python
from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import NearMiss
import tensorflow.keras as keras

# Create balanced batch generator
training_generator = BalancedBatchGenerator(
    X, y,
    sampler=NearMiss(),
    batch_size=32,
    random_state=42
)

# Use with Keras model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(training_generator, epochs=10)
```

balanced_batch_generator (imblearn.keras) { .api }
```python
def balanced_batch_generator(
    X,
    y,
    *,
    sample_weight=None,
    sampler=None,
    batch_size=32,
    keep_sparse=False,
    random_state=None
) -> tuple[Generator, int]
```

Create a balanced batch generator to train a Keras model.
Parameters:
- X (ndarray of shape (n_samples, n_features)): Original imbalanced dataset
- y (ndarray of shape (n_samples,) or (n_samples, n_classes)): Associated targets
- sample_weight (ndarray of shape (n_samples,), default=None): Sample weights
- sampler (sampler object, default=None): A sampler instance which has an attribute sample_indices_. By default, the sampler used is a RandomUnderSampler
- batch_size (int, default=32): Number of samples per gradient update
- keep_sparse (bool, default=False): Whether or not to conserve the sparsity of the input. By default, the returned batches will be dense
- random_state (int, RandomState instance or None, default=None): Control the randomization of the algorithm

Returns:
- generator (generator of tuple): Generates batches of data. Each tuple is either (X_batch, y_batch) or (X_batch, y_batch, sample_weight_batch)
- steps_per_epoch (int): The number of steps (batches) per epoch. Required by fit_generator in Keras

Usage Example:
```python
from imblearn.keras import balanced_batch_generator
from imblearn.under_sampling import EditedNearestNeighbours

training_generator, steps_per_epoch = balanced_batch_generator(
    X, y,
    sampler=EditedNearestNeighbours(),
    batch_size=64,
    random_state=42
)

# Use with older Keras API
history = model.fit_generator(
    training_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=20
)
```

balanced_batch_generator (imblearn.tensorflow) { .api }
```python
def balanced_batch_generator(
    X,
    y,
    *,
    sample_weight=None,
    sampler=None,
    batch_size=32,
    keep_sparse=False,
    random_state=None
) -> tuple[Generator, int]
```

Create a balanced batch generator to train a TensorFlow model.
Parameters:
- X (ndarray of shape (n_samples, n_features)): Original imbalanced dataset
- y (ndarray of shape (n_samples,) or (n_samples, n_classes)): Associated targets
- sample_weight (ndarray of shape (n_samples,), default=None): Sample weights
- sampler (sampler object, default=None): A sampler instance which has an attribute sample_indices_. By default, the sampler used is a RandomUnderSampler
- batch_size (int, default=32): Number of samples per gradient update
- keep_sparse (bool, default=False): Whether or not to conserve the sparsity of the input X. By default, the returned batches will be dense
- random_state (int, RandomState instance or None, default=None): Control the randomization of the algorithm

Returns:
- generator (generator of tuple): Generates batches of data. Each tuple is either (X_batch, y_batch) or (X_batch, y_batch, sample_weight_batch)
- steps_per_epoch (int): The number of steps (batches) per epoch

Generator Function: The returned generator loops through balanced batches indefinitely, which is why the caller must pass steps_per_epoch so the framework knows where each epoch ends.
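For intuition, the infinite looping behaviour can be pictured with a plain Python generator: once a balanced pass over the data is exhausted, it reshuffles and starts over. This is a conceptual sketch under that assumption, not the library's code:

```python
import random
from itertools import islice

def infinite_balanced_generator(X, y, batch_size=2, seed=0):
    """Loop forever over a class-balanced view of binary-labelled (X, y)."""
    rng = random.Random(seed)
    class0 = [i for i, label in enumerate(y) if label == 0]
    class1 = [i for i, label in enumerate(y) if label == 1]
    n_min = min(len(class0), len(class1))
    while True:  # never raises StopIteration: reshuffles on every pass
        indices = rng.sample(class0, n_min) + rng.sample(class1, n_min)
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            yield [X[i] for i in batch], [y[i] for i in batch]

X = list(range(12))
y = [0] * 10 + [1] * 2
gen = infinite_balanced_generator(X, y)
steps_per_epoch = 2           # 2 classes * n_min samples / batch_size
epoch = list(islice(gen, steps_per_epoch))  # one "epoch" of batches
more = list(islice(gen, 10))  # keeps producing batches indefinitely
```

Without an explicit step count the consumer would iterate forever, which is exactly the role steps_per_epoch plays in model.fit.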
Usage with TensorFlow:
```python
from imblearn.tensorflow import balanced_batch_generator
from imblearn.under_sampling import RandomUnderSampler
import tensorflow as tf

# Create generator. The sampler must expose sample_indices_;
# SMOTE creates synthetic samples and does not, so an
# indices-based sampler is used here.
training_generator, steps_per_epoch = balanced_batch_generator(
    X, y,
    sampler=RandomUnderSampler(random_state=42),
    batch_size=128,
    random_state=42
)

# Use with tf.keras
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation='softmax')
])
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
history = model.fit(
    training_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=50,
    validation_data=(X_val, y_val)
)
```

All imblearn samplers with the sample_indices_ attribute can be used:
Over-sampling Methods:
Only over-samplers that duplicate existing rows expose sample_indices_. SMOTE, ADASYN, and BorderlineSMOTE synthesize new samples and do not provide this attribute, so they cannot be passed as sampler; use RandomOverSampler instead:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.keras import BalancedBatchGenerator

# RandomOverSampler duplicates minority samples and records their indices
generator = BalancedBatchGenerator(X, y, sampler=RandomOverSampler())
```

Under-sampling Methods:
```python
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, EditedNearestNeighbours

# Using random under-sampling
generator = BalancedBatchGenerator(X, y, sampler=RandomUnderSampler())

# Using Tomek links cleaning
generator = BalancedBatchGenerator(X, y, sampler=TomekLinks())
```

Combination Methods:
The combination samplers SMOTEENN and SMOTETomek include a SMOTE step that creates synthetic samples, so they do not expose sample_indices_ and will raise an error if passed to the batch generators. Use an indices-based cleaning sampler such as TomekLinks or EditedNearestNeighbours instead.

Multi-class Example:

```python
from sklearn.datasets import make_classification
from imblearn.keras import BalancedBatchGenerator
from imblearn.over_sampling import RandomOverSampler
import tensorflow.keras as keras

# Create multi-class imbalanced dataset
X, y = make_classification(
    n_classes=3,
    n_informative=5,
    weights=[0.7, 0.2, 0.1],
    n_samples=2000,
    random_state=42
)

# Convert to categorical
y_cat = keras.utils.to_categorical(y, 3)

# Create balanced generator (RandomOverSampler exposes sample_indices_)
generator = BalancedBatchGenerator(
    X, y_cat,
    sampler=RandomOverSampler(random_state=42),
    batch_size=64,
    random_state=42
)

# Multi-class model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(3, activation='softmax')
])
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy', 'categorical_accuracy']
)
history = model.fit(generator, epochs=100, verbose=1)
```

Sparse Matrix Support:

```python
from scipy.sparse import csr_matrix, issparse
from imblearn.tensorflow import balanced_batch_generator

# Convert to sparse matrix
X_sparse = csr_matrix(X)

# Keep data sparse during batch generation
generator, steps = balanced_batch_generator(
    X_sparse, y,
    keep_sparse=True,
    batch_size=32
)

# Use with a TensorFlow model that handles sparse input
for batch_X, batch_y in generator:
    if issparse(batch_X):
        batch_X = batch_X.toarray()  # Convert if needed
    # Train with batch; remember the generator loops forever,
    # so bound the iteration with steps in a real training loop
```

Sample Weights:

```python
from sklearn.utils.class_weight import compute_sample_weight
from imblearn.under_sampling import RandomUnderSampler

# Compute sample weights
sample_weights = compute_sample_weight('balanced', y)

# Use with generator
generator = BalancedBatchGenerator(
    X, y,
    sample_weight=sample_weights,
    sampler=RandomUnderSampler(),
    batch_size=32
)

# Each batch will include sample weights
for X_batch, y_batch, weights_batch in generator:
    # Use weights in training
    ...
```

Comparison:

| Feature | Keras BalancedBatchGenerator | TensorFlow balanced_batch_generator |
|---|---|---|
| API | Keras Sequence interface | Plain generator function |
| Integration | model.fit(generator) | model.fit(generator, steps_per_epoch=steps) |
| Memory | Sequence protocol | Manual iteration control |
| Features | Full Keras integration | More flexible, lower-level |
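The table's distinction is essentially between Python's sequence protocol (__len__ and __getitem__, so the framework can infer epoch length and access batches by index) and the iterator protocol (next() only, so the caller must supply steps_per_epoch). A minimal illustration of the two styles, independent of either library:

```python
class BatchSequence:
    """Sequence-style: callers can ask len() and index any batch."""
    def __init__(self, n_samples, batch_size):
        self.n_samples, self.batch_size = n_samples, batch_size

    def __len__(self):
        # Number of batches per epoch, known up front
        return (self.n_samples + self.batch_size - 1) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        return list(range(start, min(start + self.batch_size, self.n_samples)))

def batch_iterator(n_samples, batch_size):
    """Generator-style: no length; bounding the loop is the caller's job."""
    while True:
        for start in range(0, n_samples, batch_size):
            yield list(range(start, min(start + batch_size, n_samples)))

seq = BatchSequence(10, 4)
assert len(seq) == 3        # epoch length known without iterating
assert seq[2] == [8, 9]     # random access to any batch

it = batch_iterator(10, 4)
first_epoch = [next(it) for _ in range(3)]  # caller enforces steps_per_epoch
```

This is why model.fit needs no step count for the Sequence-based generator but requires steps_per_epoch for the plain generator.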
Set random_state for consistent results.

Complete Training Example:
```python
from imblearn.keras import BalancedBatchGenerator
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras

# Split data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Create balanced training generator (RandomOverSampler exposes
# sample_indices_, which the generator requires)
train_generator = BalancedBatchGenerator(
    X_train, y_train,
    sampler=RandomOverSampler(random_state=42),
    batch_size=64,
    random_state=42
)

# Build model
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile with class-aware metrics
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy', 'precision', 'recall']
)

# Train with early stopping
callbacks = [
    keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
]
history = model.fit(
    train_generator,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=callbacks,
    verbose=1
)
```

Install with Tessl CLI
```shell
npx tessl i tessl/pypi-imbalanced-learn
```