tessl/pypi-imbalanced-learn

Toolbox for imbalanced datasets in machine learning


docs/deep-learning.md

Deep Learning Integration

Utilities for handling imbalanced datasets in deep learning frameworks, providing balanced batch generators for Keras and TensorFlow that ensure fair representation of all classes during training.

Overview

Imbalanced-learn provides specialized batch generators for deep learning frameworks that address class imbalance by creating balanced batches during training. These tools integrate seamlessly with Keras and TensorFlow workflows while maintaining the benefits of sampling techniques.

Key Features

  • Balanced batch generation: Ensures each batch contains balanced class representation
  • Framework compatibility: Native support for Keras and TensorFlow
  • Sampling integration: Uses imblearn samplers for batch balancing
  • Memory efficiency: Generates balanced batches on-demand without duplicating entire dataset
  • Sparse data support: Handles both dense and sparse input matrices

Supported Frameworks

  • Keras: Via BalancedBatchGenerator class and balanced_batch_generator function
  • TensorFlow: Via balanced_batch_generator function

Keras Integration

BalancedBatchGenerator

{ .api }
class BalancedBatchGenerator:
    def __init__(
        self,
        X,
        y,
        *,
        sample_weight=None,
        sampler=None,
        batch_size=32,
        keep_sparse=False,
        random_state=None
    ): ...
    def __len__(self): ...
    def __getitem__(self, index): ...

Create balanced batches when training a Keras model, using the Keras Sequence API.

Parameters:

  • X (ndarray of shape (n_samples, n_features)): Original imbalanced dataset
  • y (ndarray of shape (n_samples,) or (n_samples, n_classes)): Associated targets
  • sample_weight (ndarray of shape (n_samples,), default=None): Sample weight
  • sampler (sampler object, default=None): A sampler instance exposing a sample_indices_ attribute. By default, a RandomUnderSampler is used
  • batch_size (int, default=32): Number of samples per gradient update
  • keep_sparse (bool, default=False): Whether to conserve the sparsity of the input. By default, the returned batches will be dense
  • random_state (int, RandomState instance or None, default=None): Control the randomization of the algorithm

Attributes:

  • sampler_ (sampler object): The sampler used to balance the dataset
  • indices_ (ndarray of shape (n_samples,)): The indices of the samples selected during sampling

Methods:

__len__
def __len__(self) -> int

Returns the number of batches per epoch.

__getitem__
def __getitem__(self, index) -> tuple[ndarray, ndarray] | tuple[ndarray, ndarray, ndarray]

Generate one batch of data.

Parameters:

  • index (int): Batch index

Returns:

  • batch (tuple): Either (X_batch, y_batch) or (X_batch, y_batch, sample_weight_batch) if sample weights are provided
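
The Sequence protocol that BalancedBatchGenerator implements can be illustrated with a minimal stand-in. This is a toy class for illustration only, not the real imblearn implementation (it does no balancing); it only shows how __len__ and __getitem__ partition data into batches:

```python
class MiniBatchSequence:
    """Toy illustration of the Keras Sequence protocol (__len__/__getitem__)
    that BalancedBatchGenerator implements. Not the real imblearn class."""

    def __init__(self, X, y, batch_size):
        self.X, self.y, self.batch_size = X, y, batch_size

    def __len__(self):
        # Number of batches per epoch
        return len(self.X) // self.batch_size

    def __getitem__(self, index):
        # One batch of data, selected by batch index
        start = index * self.batch_size
        stop = start + self.batch_size
        return self.X[start:stop], self.y[start:stop]
```

Keras calls __len__ once per epoch to know how many steps to run, then calls __getitem__ with each batch index; the real generator additionally maps batch indices through the balanced indices_ produced by its sampler.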

Usage with Keras: The class implements the Keras Sequence interface for use with model.fit():

from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import NearMiss
import tensorflow.keras as keras

# Create balanced batch generator
training_generator = BalancedBatchGenerator(
    X, y, 
    sampler=NearMiss(), 
    batch_size=32, 
    random_state=42
)

# Use with Keras model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(training_generator, epochs=10)

balanced_batch_generator (Keras)

{ .api }
def balanced_batch_generator(
    X,
    y,
    *,
    sample_weight=None,
    sampler=None,
    batch_size=32,
    keep_sparse=False,
    random_state=None
) -> tuple[Generator, int]

Create a balanced batch generator for training a Keras model.

Parameters:

  • X (ndarray of shape (n_samples, n_features)): Original imbalanced dataset
  • y (ndarray of shape (n_samples,) or (n_samples, n_classes)): Associated targets
  • sample_weight (ndarray of shape (n_samples,), default=None): Sample weight
  • sampler (sampler object, default=None): A sampler instance exposing a sample_indices_ attribute. By default, a RandomUnderSampler is used
  • batch_size (int, default=32): Number of samples per gradient update
  • keep_sparse (bool, default=False): Whether to conserve the sparsity of the input. By default, the returned batches will be dense
  • random_state (int, RandomState instance or None, default=None): Control the randomization of the algorithm

Returns:

  • generator (generator of tuple): Generates batches of data. Each yielded tuple is either (X_batch, y_batch) or (X_batch, y_batch, sample_weight_batch)
  • steps_per_epoch (int): The number of batches per epoch. Required by fit_generator in Keras

Usage Example:

from imblearn.keras import balanced_batch_generator
from imblearn.under_sampling import EditedNearestNeighbours

training_generator, steps_per_epoch = balanced_batch_generator(
    X, y, 
    sampler=EditedNearestNeighbours(), 
    batch_size=64, 
    random_state=42
)

# Use with the legacy fit_generator API (deprecated; modern Keras accepts generators in model.fit)
history = model.fit_generator(
    training_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=20
)

TensorFlow Integration

balanced_batch_generator (TensorFlow)

{ .api }
def balanced_batch_generator(
    X,
    y,
    *,
    sample_weight=None,
    sampler=None,
    batch_size=32,
    keep_sparse=False,
    random_state=None
) -> tuple[Generator, int]

Create a balanced batch generator for training a TensorFlow model.

Parameters:

  • X (ndarray of shape (n_samples, n_features)): Original imbalanced dataset
  • y (ndarray of shape (n_samples,) or (n_samples, n_classes)): Associated targets
  • sample_weight (ndarray of shape (n_samples,), default=None): Sample weight
  • sampler (sampler object, default=None): A sampler instance exposing a sample_indices_ attribute. By default, a RandomUnderSampler is used
  • batch_size (int, default=32): Number of samples per gradient update
  • keep_sparse (bool, default=False): Whether to conserve the sparsity of the input X. By default, the returned batches will be dense
  • random_state (int, RandomState instance or None, default=None): Control the randomization of the algorithm

Returns:

  • generator (generator of tuple): Generates batches of data. Each yielded tuple is either (X_batch, y_batch) or (X_batch, y_batch, sample_weight_batch)
  • steps_per_epoch (int): The number of batches per epoch

Generator Function: The returned generator infinitely loops through balanced batches:

  1. Applies the sampler to balance the dataset
  2. Shuffles the resampled indices
  3. Creates batches of the specified size
  4. Yields batches cyclically for training
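
The steps above can be sketched in pure Python. This is a simplified illustration, not the imblearn implementation: it hard-codes random under-sampling to the minority class size in place of a pluggable sampler, and works on plain lists rather than arrays:

```python
import random

def balanced_cyclic_batches(X, y, batch_size, seed=0):
    """Sketch of the TensorFlow generator's loop: balance once by random
    under-sampling, then shuffle and yield fixed-size batches forever."""
    rng = random.Random(seed)

    # 1. Apply the "sampler": keep an equal number of indices per class
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    indices = [i for idx in by_class.values() for i in rng.sample(idx, n_min)]

    # 2.-4. Shuffle the resampled indices each pass and yield batches cyclically
    while True:
        rng.shuffle(indices)
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            batch = indices[start:start + batch_size]
            yield [X[i] for i in batch], [y[i] for i in batch]
```

Because the loop never terminates, the caller (here, model.fit) must stop after steps_per_epoch batches; that is why the real function returns the step count alongside the generator.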

Usage with TensorFlow:

from imblearn.tensorflow import balanced_batch_generator
from imblearn.over_sampling import SMOTE
import tensorflow as tf

# Create generator
training_generator, steps_per_epoch = balanced_batch_generator(
    X, y,
    sampler=SMOTE(random_state=42),
    batch_size=128,
    random_state=42
)

# Use with tf.keras
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy', 
    metrics=['accuracy']
)

history = model.fit(
    training_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=50,
    validation_data=(X_val, y_val)
)

Sampler Integration

Compatible Samplers

All imblearn samplers with the sample_indices_ attribute can be used:
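
The contract the generators rely on is small: after fit_resample, the sampler must expose the chosen row indices as sample_indices_. A toy sampler satisfying that contract (illustrative only; the real imblearn samplers implement this on top of scikit-learn machinery):

```python
import random

class ToyRandomUnderSampler:
    """Toy sampler exposing the sample_indices_ attribute the batch
    generators read. Not the imblearn implementation."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit_resample(self, X, y):
        rng = random.Random(self.random_state)
        # Group row indices by class, then keep minority-class-many per class
        by_class = {}
        for i, label in enumerate(y):
            by_class.setdefault(label, []).append(i)
        n_min = min(len(idx) for idx in by_class.values())
        picked = sorted(i for idx in by_class.values()
                        for i in rng.sample(idx, n_min))
        self.sample_indices_ = picked  # the attribute the generators require
        return [X[i] for i in picked], [y[i] for i in picked]
```

The generators index the original X and y through sample_indices_ rather than materializing the resampled dataset, which is what keeps batch generation memory-efficient.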

Over-sampling Methods:

from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.keras import BalancedBatchGenerator

# Using SMOTE
generator = BalancedBatchGenerator(X, y, sampler=SMOTE(k_neighbors=3))

# Using ADASYN  
generator = BalancedBatchGenerator(X, y, sampler=ADASYN(n_neighbors=5))

Under-sampling Methods:

from imblearn.under_sampling import RandomUnderSampler, TomekLinks, EditedNearestNeighbours

# Using random under-sampling
generator = BalancedBatchGenerator(X, y, sampler=RandomUnderSampler())

# Using Tomek links cleaning
generator = BalancedBatchGenerator(X, y, sampler=TomekLinks())

Combination Methods:

from imblearn.combine import SMOTEENN, SMOTETomek

# Using SMOTE + Edited Nearest Neighbours
generator = BalancedBatchGenerator(X, y, sampler=SMOTEENN())

# Using SMOTE + Tomek links
generator = BalancedBatchGenerator(X, y, sampler=SMOTETomek())

Advanced Usage Patterns

Multi-Class Classification

from sklearn.datasets import make_classification
from imblearn.keras import BalancedBatchGenerator
from imblearn.over_sampling import SMOTE
import tensorflow.keras as keras

# Create multi-class imbalanced dataset
X, y = make_classification(
    n_classes=3, 
    n_informative=5,
    weights=[0.7, 0.2, 0.1],
    n_samples=2000,
    random_state=42
)

# Convert to categorical
y_cat = keras.utils.to_categorical(y, 3)

# Create balanced generator
generator = BalancedBatchGenerator(
    X, y_cat,
    sampler=SMOTE(random_state=42),
    batch_size=64,
    random_state=42
)

# Multi-class model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(3, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy', 'categorical_accuracy']
)

history = model.fit(generator, epochs=100, verbose=1)

Sparse Data Handling

from scipy.sparse import csr_matrix
from imblearn.tensorflow import balanced_batch_generator

# Convert to sparse matrix
X_sparse = csr_matrix(X)

# Keep data sparse during batch generation
generator, steps = balanced_batch_generator(
    X_sparse, y,
    keep_sparse=True,
    batch_size=32
)

# Use with a TensorFlow model that handles sparse input
from scipy.sparse import issparse

for batch_X, batch_y in generator:  # note: the generator loops indefinitely
    if issparse(batch_X):
        batch_X = batch_X.toarray()  # densify if the model expects dense input
    # Train with batch; break after `steps` batches to end an epoch

Sample Weight Integration

from sklearn.utils.class_weight import compute_sample_weight

# Compute sample weights
sample_weights = compute_sample_weight('balanced', y)

# Use with generator
generator = BalancedBatchGenerator(
    X, y,
    sample_weight=sample_weights,
    sampler=SMOTE(),
    batch_size=32
)

# Each batch will include sample weights
for i in range(len(generator)):
    X_batch, y_batch, weights_batch = generator[i]
    # Use weights in training

Framework Comparison

Keras vs TensorFlow Generators

| Feature     | Keras BalancedBatchGenerator | TensorFlow balanced_batch_generator        |
| ----------- | ---------------------------- | ------------------------------------------ |
| API         | Keras Sequence interface     | Plain generator function                   |
| Integration | model.fit(generator)         | model.fit(generator, steps_per_epoch=steps)|
| Memory      | Sequence protocol            | Manual iteration control                   |
| Features    | Full Keras integration       | More flexible, lower-level                 |

Best Practices

  1. Choose appropriate sampler: Match sampler to your problem characteristics
  2. Batch size considerations: Balance memory usage with training stability
  3. Reproducibility: Always set random_state for consistent results
  4. Validation strategy: Use separate validation data, don't apply sampling to validation
  5. Monitor class distribution: Verify balanced batches are being generated
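
Practice 5 can be checked with a quick tally over a few batches. A minimal sketch, assuming a generator that yields (X_batch, y_batch) tuples; the toy generator below stands in for the real one:

```python
from collections import Counter
from itertools import islice

def batch_class_counts(generator, n_batches=5):
    """Tally label frequencies over the first n_batches to verify the
    generator is actually producing balanced batches."""
    counts = Counter()
    for X_batch, y_batch in islice(generator, n_batches):
        counts.update(y_batch)
    return counts

# Toy stand-in for balanced_batch_generator's output
def toy_balanced_batches():
    while True:
        yield [0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1]

print(batch_class_counts(toy_balanced_batches()))  # Counter({0: 10, 1: 10})
```

Roughly equal per-class counts confirm the sampler is configured correctly; a skewed tally usually means the sampler was not applied or the targets were passed in an unexpected encoding.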

Complete Training Example:

from imblearn.keras import BalancedBatchGenerator
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras

# Split data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Create balanced training generator
train_generator = BalancedBatchGenerator(
    X_train, y_train,
    sampler=SMOTE(random_state=42),
    batch_size=64,
    random_state=42
)

# Build model
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.BatchNormalization(), 
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile with class-aware metrics
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()]
)

# Train with early stopping
callbacks = [
    keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
]

history = model.fit(
    train_generator,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=callbacks,
    verbose=1
)

Install with Tessl CLI

npx tessl i tessl/pypi-imbalanced-learn
