Toolbox for imbalanced datasets in machine learning
—
Utilities for handling imbalanced datasets in deep learning frameworks, providing balanced batch generators for Keras and TensorFlow that ensure fair representation of all classes during training.
Imbalanced-learn provides specialized batch generators for deep learning frameworks that address class imbalance by creating balanced batches during training. These tools integrate seamlessly with Keras and TensorFlow workflows while maintaining the benefits of sampling techniques.
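Before the API details, the core idea can be sketched in plain Python: each epoch, draw an equal number of indices from every class (under-sampling the majority classes), then slice batches from the balanced index set. This is a conceptual sketch of the balancing idea, not imbalanced-learn's actual implementation:

```python
import random
from collections import Counter, defaultdict

def balanced_batches(X, y, batch_size=4, seed=0):
    """Yield batches drawn from a class-balanced view of (X, y):
    every class is under-sampled to the size of the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    n_min = min(len(idx) for idx in by_class.values())
    # Under-sample each class to n_min indices, then shuffle them together.
    indices = [i for idx in by_class.values() for i in rng.sample(idx, n_min)]
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield [X[i] for i in batch], [y[i] for i in batch]

# A 90/10 imbalanced toy dataset: across one epoch, both classes
# contribute the same number of samples.
X = list(range(100))
y = [0] * 90 + [1] * 10
labels = [label for _, yb in balanced_batches(X, y) for label in yb]
print(Counter(labels))  # both classes appear 10 times
```

The library's generators follow the same pattern but delegate the index selection to any imblearn sampler that records `sample_indices_`.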
BalancedBatchGenerator class and balanced_batch_generator function

BalancedBatchGenerator { .api }
```python
class BalancedBatchGenerator:
    def __init__(
        self,
        X,
        y,
        *,
        sample_weight=None,
        sampler=None,
        batch_size=32,
        keep_sparse=False,
        random_state=None
    ): ...
    def __len__(self): ...
    def __getitem__(self, index): ...
```

Create balanced batches when training a Keras model using the Sequence API.
Parameters:
- X (ndarray of shape (n_samples, n_features)): Original imbalanced dataset
- y (ndarray of shape (n_samples,) or (n_samples, n_classes)): Associated targets
- sample_weight (ndarray of shape (n_samples,), default=None): Sample weights
- sampler (sampler object, default=None): A sampler instance which has an attribute sample_indices_. By default, the sampler used is a RandomUnderSampler
- batch_size (int, default=32): Number of samples per gradient update
- keep_sparse (bool, default=False): Whether or not to conserve the sparsity of the input. By default, the returned batches will be dense
- random_state (int, RandomState instance or None, default=None): Control the randomization of the algorithm

Attributes:
- sampler_ (sampler object): The sampler used to balance the dataset
- indices_ (ndarray of shape (n_samples, n_features)): The indices of the samples selected during sampling

Methods:

def __len__(self) -> int

Returns the number of batches per epoch.

def __getitem__(self, index) -> tuple[ndarray, ndarray] | tuple[ndarray, ndarray, ndarray]

Generate one batch of data.

Parameters:
- index (int): Batch index

Returns:
- (tuple): Either (X_batch, y_batch) or (X_batch, y_batch, sample_weight_batch) if sample weights are provided

Usage with Keras:
The class implements the Keras Sequence interface for use with model.fit():
```python
from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import NearMiss
import tensorflow.keras as keras

# Create balanced batch generator
training_generator = BalancedBatchGenerator(
    X, y,
    sampler=NearMiss(),
    batch_size=32,
    random_state=42
)

# Use with Keras model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(training_generator, epochs=10)
```

balanced_batch_generator (imblearn.keras) { .api }
```python
def balanced_batch_generator(
    X,
    y,
    *,
    sample_weight=None,
    sampler=None,
    batch_size=32,
    keep_sparse=False,
    random_state=None
) -> tuple[Generator, int]
```

Create a balanced batch generator to train a Keras model.
Parameters:
- X (ndarray of shape (n_samples, n_features)): Original imbalanced dataset
- y (ndarray of shape (n_samples,) or (n_samples, n_classes)): Associated targets
- sample_weight (ndarray of shape (n_samples,), default=None): Sample weights
- sampler (sampler object, default=None): A sampler instance which has an attribute sample_indices_. By default, the sampler used is a RandomUnderSampler
- batch_size (int, default=32): Number of samples per gradient update
- keep_sparse (bool, default=False): Whether or not to conserve the sparsity of the input. By default, the returned batches will be dense
- random_state (int, RandomState instance or None, default=None): Control the randomization of the algorithm

Returns:
- generator (generator of tuple): Generates batches of data. Each tuple is either (X_batch, y_batch) or (X_batch, y_batch, sample_weight_batch)
- steps_per_epoch (int): The number of steps (batches) per epoch. Required by fit_generator in Keras

Usage Example:
```python
from imblearn.keras import balanced_batch_generator
from imblearn.under_sampling import EditedNearestNeighbours

training_generator, steps_per_epoch = balanced_batch_generator(
    X, y,
    sampler=EditedNearestNeighbours(),
    batch_size=64,
    random_state=42
)

# Use with older Keras API
history = model.fit_generator(
    training_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=20
)
```

balanced_batch_generator (imblearn.tensorflow) { .api }
```python
def balanced_batch_generator(
    X,
    y,
    *,
    sample_weight=None,
    sampler=None,
    batch_size=32,
    keep_sparse=False,
    random_state=None
) -> tuple[Generator, int]
```

Create a balanced batch generator to train a TensorFlow model.
Parameters:
- X (ndarray of shape (n_samples, n_features)): Original imbalanced dataset
- y (ndarray of shape (n_samples,) or (n_samples, n_classes)): Associated targets
- sample_weight (ndarray of shape (n_samples,), default=None): Sample weights
- sampler (sampler object, default=None): A sampler instance which has an attribute sample_indices_. By default, the sampler used is a RandomUnderSampler
- batch_size (int, default=32): Number of samples per gradient update
- keep_sparse (bool, default=False): Whether or not to conserve the sparsity of the input X. By default, the returned batches will be dense
- random_state (int, RandomState instance or None, default=None): Control the randomization of the algorithm

Returns:
- generator (generator of tuple): Generates batches of data. Each tuple is either (X_batch, y_batch) or (X_batch, y_batch, sample_weight_batch)
- steps_per_epoch (int): The number of steps (batches) per epoch

Generator Function: The returned generator loops through balanced batches indefinitely, which is why the caller must pass steps_per_epoch so the framework knows where each epoch ends.
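For intuition, the infinite looping behaviour can be pictured with a plain Python generator: once a balanced pass over the data is exhausted, it reshuffles and starts over. This is a conceptual sketch under that assumption, not the library's code:

```python
import random
from itertools import islice

def infinite_balanced_generator(X, y, batch_size=2, seed=0):
    """Loop forever over a class-balanced view of binary-labelled (X, y)."""
    rng = random.Random(seed)
    class0 = [i for i, label in enumerate(y) if label == 0]
    class1 = [i for i, label in enumerate(y) if label == 1]
    n_min = min(len(class0), len(class1))
    while True:  # never raises StopIteration: reshuffles on every pass
        indices = rng.sample(class0, n_min) + rng.sample(class1, n_min)
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            yield [X[i] for i in batch], [y[i] for i in batch]

X = list(range(12))
y = [0] * 10 + [1] * 2
gen = infinite_balanced_generator(X, y)
steps_per_epoch = 2           # 2 classes * n_min samples / batch_size
epoch = list(islice(gen, steps_per_epoch))  # one "epoch" of batches
more = list(islice(gen, 10))  # keeps producing batches indefinitely
```

Without an explicit step count the consumer would iterate forever, which is exactly the role steps_per_epoch plays in model.fit.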
Usage with TensorFlow:
```python
from imblearn.tensorflow import balanced_batch_generator
from imblearn.under_sampling import RandomUnderSampler
import tensorflow as tf

# Create generator. The sampler must expose sample_indices_;
# SMOTE creates synthetic samples and does not, so an
# indices-based sampler is used here.
training_generator, steps_per_epoch = balanced_batch_generator(
    X, y,
    sampler=RandomUnderSampler(random_state=42),
    batch_size=128,
    random_state=42
)

# Use with tf.keras
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation='softmax')
])
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
history = model.fit(
    training_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=50,
    validation_data=(X_val, y_val)
)
```

All imblearn samplers with the sample_indices_ attribute can be used:
Over-sampling Methods:
Only over-samplers that duplicate existing rows expose sample_indices_. SMOTE, ADASYN, and BorderlineSMOTE synthesize new samples and do not provide this attribute, so they cannot be passed as sampler; use RandomOverSampler instead:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.keras import BalancedBatchGenerator

# RandomOverSampler duplicates minority samples and records their indices
generator = BalancedBatchGenerator(X, y, sampler=RandomOverSampler())
```

Under-sampling Methods:
```python
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, EditedNearestNeighbours

# Using random under-sampling
generator = BalancedBatchGenerator(X, y, sampler=RandomUnderSampler())

# Using Tomek links cleaning
generator = BalancedBatchGenerator(X, y, sampler=TomekLinks())
```

Combination Methods:
The combination samplers SMOTEENN and SMOTETomek include a SMOTE step that creates synthetic samples, so they do not expose sample_indices_ and will raise an error if passed to the batch generators. Use an indices-based cleaning sampler such as TomekLinks or EditedNearestNeighbours instead.

Multi-class Example:

```python
from sklearn.datasets import make_classification
from imblearn.keras import BalancedBatchGenerator
from imblearn.over_sampling import RandomOverSampler
import tensorflow.keras as keras

# Create multi-class imbalanced dataset
X, y = make_classification(
    n_classes=3,
    n_informative=5,
    weights=[0.7, 0.2, 0.1],
    n_samples=2000,
    random_state=42
)

# Convert to categorical
y_cat = keras.utils.to_categorical(y, 3)

# Create balanced generator (RandomOverSampler exposes sample_indices_)
generator = BalancedBatchGenerator(
    X, y_cat,
    sampler=RandomOverSampler(random_state=42),
    batch_size=64,
    random_state=42
)

# Multi-class model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(3, activation='softmax')
])
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy', 'categorical_accuracy']
)
history = model.fit(generator, epochs=100, verbose=1)
```

Sparse Matrix Support:

```python
from scipy.sparse import csr_matrix, issparse
from imblearn.tensorflow import balanced_batch_generator

# Convert to sparse matrix
X_sparse = csr_matrix(X)

# Keep data sparse during batch generation
generator, steps = balanced_batch_generator(
    X_sparse, y,
    keep_sparse=True,
    batch_size=32
)

# Use with a TensorFlow model that handles sparse input
for batch_X, batch_y in generator:
    if issparse(batch_X):
        batch_X = batch_X.toarray()  # Convert if needed
    # Train with batch; remember the generator loops forever,
    # so bound the iteration with steps in a real training loop
```

Sample Weights:

```python
from sklearn.utils.class_weight import compute_sample_weight
from imblearn.under_sampling import RandomUnderSampler

# Compute sample weights
sample_weights = compute_sample_weight('balanced', y)

# Use with generator
generator = BalancedBatchGenerator(
    X, y,
    sample_weight=sample_weights,
    sampler=RandomUnderSampler(),
    batch_size=32
)

# Each batch will include sample weights
for X_batch, y_batch, weights_batch in generator:
    # Use weights in training
    ...
```

Comparison:

| Feature | Keras BalancedBatchGenerator | TensorFlow balanced_batch_generator |
|---|---|---|
| API | Keras Sequence interface | Plain generator function |
| Integration | model.fit(generator) | model.fit(generator, steps_per_epoch=steps) |
| Memory | Sequence protocol | Manual iteration control |
| Features | Full Keras integration | More flexible, lower-level |
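The table's distinction is essentially between Python's sequence protocol (__len__ and __getitem__, so the framework can infer epoch length and access batches by index) and the iterator protocol (next() only, so the caller must supply steps_per_epoch). A minimal illustration of the two styles, independent of either library:

```python
class BatchSequence:
    """Sequence-style: callers can ask len() and index any batch."""
    def __init__(self, n_samples, batch_size):
        self.n_samples, self.batch_size = n_samples, batch_size

    def __len__(self):
        # Number of batches per epoch, known up front
        return (self.n_samples + self.batch_size - 1) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        return list(range(start, min(start + self.batch_size, self.n_samples)))

def batch_iterator(n_samples, batch_size):
    """Generator-style: no length; bounding the loop is the caller's job."""
    while True:
        for start in range(0, n_samples, batch_size):
            yield list(range(start, min(start + batch_size, n_samples)))

seq = BatchSequence(10, 4)
assert len(seq) == 3        # epoch length known without iterating
assert seq[2] == [8, 9]     # random access to any batch

it = batch_iterator(10, 4)
first_epoch = [next(it) for _ in range(3)]  # caller enforces steps_per_epoch
```

This is why model.fit needs no step count for the Sequence-based generator but requires steps_per_epoch for the plain generator.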
Set random_state for consistent results.

Complete Training Example:
```python
from imblearn.keras import BalancedBatchGenerator
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras

# Split data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Create balanced training generator (RandomOverSampler exposes
# sample_indices_, which the generator requires)
train_generator = BalancedBatchGenerator(
    X_train, y_train,
    sampler=RandomOverSampler(random_state=42),
    batch_size=64,
    random_state=42
)

# Build model
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile with class-aware metrics
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy', 'precision', 'recall']
)

# Train with early stopping
callbacks = [
    keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
]
history = model.fit(
    train_generator,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=callbacks,
    verbose=1
)
```

Install with Tessl CLI
```shell
npx tessl i tessl/pypi-imbalanced-learn
```