# Deep Learning Integration

Utilities for handling imbalanced datasets in deep learning frameworks, providing balanced batch generators for Keras and TensorFlow that ensure fair representation of all classes during training.

## Overview

Imbalanced-learn provides specialized batch generators for deep learning frameworks that address class imbalance by creating balanced batches during training. These tools integrate seamlessly with Keras and TensorFlow workflows while maintaining the benefits of sampling techniques. A quick-start sketch follows the lists below.

### Key Features
- **Balanced batch generation**: Ensures each batch contains balanced class representation
- **Framework compatibility**: Native support for Keras and TensorFlow
- **Sampling integration**: Uses imblearn samplers for batch balancing
- **Memory efficiency**: Generates balanced batches on demand without duplicating the entire dataset
- **Sparse data support**: Handles both dense and sparse input matrices

### Supported Frameworks
- **Keras**: Via `BalancedBatchGenerator` class and `balanced_batch_generator` function
- **TensorFlow**: Via `balanced_batch_generator` function

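As a quick start, here is a minimal sketch of the two entry points side by side. It is illustrative only; it assumes a toy imbalanced dataset from scikit-learn's `make_classification` and relies on the default sampler (a `RandomUnderSampler`, described below):

```python
from sklearn.datasets import make_classification

from imblearn.keras import BalancedBatchGenerator
from imblearn.tensorflow import balanced_batch_generator

# Toy imbalanced dataset: roughly 90% majority class, 10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Keras: a Sequence object usable directly with model.fit()
keras_generator = BalancedBatchGenerator(X, y, batch_size=32, random_state=42)

# TensorFlow: a plain generator plus the number of steps per epoch
tf_generator, steps_per_epoch = balanced_batch_generator(
    X, y, batch_size=32, random_state=42
)
```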
## Keras Integration

### BalancedBatchGenerator

```python { .api }
class BalancedBatchGenerator:
    def __init__(
        self,
        X,
        y,
        *,
        sample_weight=None,
        sampler=None,
        batch_size=32,
        keep_sparse=False,
        random_state=None
    ): ...

    def __len__(self): ...

    def __getitem__(self, index): ...
```
Create balanced batches when training a Keras model, using the Keras `Sequence` API.

**Parameters:**
- **X** (`ndarray` of shape `(n_samples, n_features)`): Original imbalanced dataset
- **y** (`ndarray` of shape `(n_samples,)` or `(n_samples, n_classes)`): Associated targets
- **sample_weight** (`ndarray` of shape `(n_samples,)`, default=`None`): Sample weights
- **sampler** (`sampler object`, default=`None`): A sampler instance which has an attribute `sample_indices_`. By default, a `RandomUnderSampler` is used
- **batch_size** (`int`, default=`32`): Number of samples per gradient update
- **keep_sparse** (`bool`, default=`False`): Whether or not to conserve the sparsity of the input. By default, the returned batches are dense
- **random_state** (`int`, `RandomState` instance or `None`, default=`None`): Controls the randomization of the algorithm

**Attributes:**
- **sampler_** (`sampler object`): The sampler used to balance the dataset
- **indices_** (`ndarray` of shape `(n_selected_samples,)`): The indices of the samples selected during sampling

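Both attributes can be inspected after construction. A minimal sketch, assuming the default `RandomUnderSampler` and reusing `X`, `y` from the quick-start example above:

```python
import numpy as np

generator = BalancedBatchGenerator(X, y, batch_size=32, random_state=42)

print(generator.sampler_)                  # RandomUnderSampler(...)
print(generator.indices_[:10])             # indices of the retained samples
print(np.bincount(y[generator.indices_]))  # classes should be balanced
```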
**Methods:**
##### __len__

```python
def __len__(self) -> int
```

Returns the number of batches per epoch.

##### __getitem__

```python
def __getitem__(self, index) -> tuple[ndarray, ndarray] | tuple[ndarray, ndarray, ndarray]
```

Generate one batch of data.

**Parameters:**
- **index** (`int`): Batch index

**Returns:**
- **batch** (`tuple`): Either `(X_batch, y_batch)` or `(X_batch, y_batch, sample_weight_batch)` if sample weights are provided

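A short sketch of how the two methods behave, assuming the `generator` built in the sketch above:

```python
# Number of batches per epoch; with the default under-sampler this is
# approximately n_resampled_samples // batch_size
n_batches = len(generator)

# Fetch one balanced batch by index
X_batch, y_batch = generator[0]
print(X_batch.shape, y_batch.shape)  # e.g. (32, 20) (32,)
```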
**Usage with Keras:**
The class implements the Keras `Sequence` interface for use with `model.fit()`:

```python
from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import NearMiss
import tensorflow.keras as keras

# Create balanced batch generator
training_generator = BalancedBatchGenerator(
    X, y,
    sampler=NearMiss(),
    batch_size=32,
    random_state=42
)

# Use with Keras model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(training_generator, epochs=10)
```
### balanced_batch_generator (Keras)

```python { .api }
def balanced_batch_generator(
    X,
    y,
    *,
    sample_weight=None,
    sampler=None,
    batch_size=32,
    keep_sparse=False,
    random_state=None
) -> tuple[Generator, int]
```
Create a balanced batch generator to train a Keras model.

**Parameters:**
- **X** (`ndarray` of shape `(n_samples, n_features)`): Original imbalanced dataset
- **y** (`ndarray` of shape `(n_samples,)` or `(n_samples, n_classes)`): Associated targets
- **sample_weight** (`ndarray` of shape `(n_samples,)`, default=`None`): Sample weights
- **sampler** (`sampler object`, default=`None`): A sampler instance which has an attribute `sample_indices_`. By default, a `RandomUnderSampler` is used
- **batch_size** (`int`, default=`32`): Number of samples per gradient update
- **keep_sparse** (`bool`, default=`False`): Whether or not to conserve the sparsity of the input. By default, the returned batches are dense
- **random_state** (`int`, `RandomState` instance or `None`, default=`None`): Controls the randomization of the algorithm

**Returns:**
- **generator** (`generator` of `tuple`): Generates batches of data. The tuples generated are either `(X_batch, y_batch)` or `(X_batch, y_batch, sample_weight_batch)`
- **steps_per_epoch** (`int`): The number of batches per epoch. Required by `fit_generator` in older versions of Keras

**Usage Example:**
```python
from imblearn.keras import balanced_batch_generator
from imblearn.under_sampling import EditedNearestNeighbours

training_generator, steps_per_epoch = balanced_batch_generator(
    X, y,
    sampler=EditedNearestNeighbours(),
    batch_size=64,
    random_state=42
)

# Use with older Keras API
history = model.fit_generator(
    training_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=20
)
```
## TensorFlow Integration

### balanced_batch_generator (TensorFlow)

```python { .api }
def balanced_batch_generator(
    X,
    y,
    *,
    sample_weight=None,
    sampler=None,
    batch_size=32,
    keep_sparse=False,
    random_state=None
) -> tuple[Generator, int]
```
Create a balanced batch generator to train a TensorFlow model.

**Parameters:**
- **X** (`ndarray` of shape `(n_samples, n_features)`): Original imbalanced dataset
- **y** (`ndarray` of shape `(n_samples,)` or `(n_samples, n_classes)`): Associated targets
- **sample_weight** (`ndarray` of shape `(n_samples,)`, default=`None`): Sample weights
- **sampler** (`sampler object`, default=`None`): A sampler instance which has an attribute `sample_indices_`. By default, a `RandomUnderSampler` is used
- **batch_size** (`int`, default=`32`): Number of samples per gradient update
- **keep_sparse** (`bool`, default=`False`): Whether or not to conserve the sparsity of the input `X`. By default, the returned batches are dense
- **random_state** (`int`, `RandomState` instance or `None`, default=`None`): Controls the randomization of the algorithm

**Returns:**
- **generator** (`generator` of `tuple`): Generates batches of data. The tuples generated are either `(X_batch, y_batch)` or `(X_batch, y_batch, sample_weight_batch)`
- **steps_per_epoch** (`int`): The number of batches per epoch

**Generator Function:**
The returned generator loops through balanced batches indefinitely (see the sketch after this list). It:
1. Applies the sampler to balance the dataset
2. Shuffles the resampled indices
3. Creates batches of the specified size
4. Yields batches cyclically for training

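A minimal sketch of this logic, written against plain NumPy arrays; it illustrates the four steps above and is not the library's actual implementation:

```python
from sklearn.utils import check_random_state
from imblearn.under_sampling import RandomUnderSampler

def make_balanced_generator(X, y, batch_size=32, random_state=None):
    rng = check_random_state(random_state)

    # 1. Apply the sampler to balance the dataset and keep its indices
    sampler = RandomUnderSampler(random_state=rng)
    sampler.fit_resample(X, y)
    indices = sampler.sample_indices_.copy()

    def generator():
        while True:  # 4. yield batches cyclically
            rng.shuffle(indices)  # 2. shuffle the resampled indices
            # 3. create batches of the specified size
            for start in range(0, len(indices) - batch_size + 1, batch_size):
                batch = indices[start:start + batch_size]
                yield X[batch], y[batch]

    steps_per_epoch = int(len(indices) // batch_size)
    return generator(), steps_per_epoch
```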
**Usage with TensorFlow:**
```python
from imblearn.tensorflow import balanced_batch_generator
from imblearn.over_sampling import RandomOverSampler
import tensorflow as tf

# Create generator (RandomOverSampler exposes the required sample_indices_;
# synthetic samplers such as SMOTE do not)
training_generator, steps_per_epoch = balanced_batch_generator(
    X, y,
    sampler=RandomOverSampler(random_state=42),
    batch_size=128,
    random_state=42
)

# Use with tf.keras
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

history = model.fit(
    training_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=50,
    validation_data=(X_val, y_val)
)
```
## Sampler Integration

### Compatible Samplers

All imblearn samplers that expose the `sample_indices_` attribute can be used. A programmatic compatibility check is sketched after the examples below.

**Over-sampling Methods:**
Among the over-samplers, only `RandomOverSampler` exposes `sample_indices_`; samplers that create synthetic samples (`SMOTE`, `ADASYN`, `BorderlineSMOTE`) do not, and therefore cannot be used with the batch generators:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.keras import BalancedBatchGenerator

# Using random over-sampling
generator = BalancedBatchGenerator(X, y, sampler=RandomOverSampler())
```
**Under-sampling Methods:**
```python
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, EditedNearestNeighbours

# Using random under-sampling
generator = BalancedBatchGenerator(X, y, sampler=RandomUnderSampler())

# Using Tomek links cleaning
generator = BalancedBatchGenerator(X, y, sampler=TomekLinks())
```
**Combination Methods:**
The combination samplers (`SMOTEENN`, `SMOTETomek`) rely on SMOTE internally and do not expose `sample_indices_`, so they cannot be passed to the batch generators. Resample the dataset up front instead:

```python
from imblearn.combine import SMOTEENN

# Resample once, then train on the balanced data directly
X_resampled, y_resampled = SMOTEENN().fit_resample(X, y)
```
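Compatibility can be checked programmatically by fitting a candidate sampler and testing for the attribute. A minimal sketch, reusing `X`, `y` from the earlier examples:

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE

def supports_batch_generation(sampler, X, y):
    """Return True if the sampler exposes `sample_indices_` after fitting."""
    sampler.fit_resample(X, y)
    return hasattr(sampler, "sample_indices_")

print(supports_batch_generation(RandomOverSampler(), X, y))  # True
print(supports_batch_generation(SMOTE(), X, y))              # False
```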
## Advanced Usage Patterns

### Multi-Class Classification

```python
from sklearn.datasets import make_classification
from imblearn.keras import BalancedBatchGenerator
from imblearn.over_sampling import RandomOverSampler
import tensorflow.keras as keras

# Create multi-class imbalanced dataset
X, y = make_classification(
    n_classes=3,
    n_informative=5,
    weights=[0.7, 0.2, 0.1],
    n_samples=2000,
    random_state=42
)

# Convert to categorical
y_cat = keras.utils.to_categorical(y, 3)

# Create balanced generator (RandomOverSampler provides sample_indices_)
generator = BalancedBatchGenerator(
    X, y_cat,
    sampler=RandomOverSampler(random_state=42),
    batch_size=64,
    random_state=42
)

# Multi-class model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(3, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy', 'categorical_accuracy']
)

history = model.fit(generator, epochs=100, verbose=1)
```
### Sparse Data Handling

```python
from scipy import sparse
from imblearn.tensorflow import balanced_batch_generator

# Convert to sparse matrix
X_sparse = sparse.csr_matrix(X)

# Keep data sparse during batch generation
generator, steps = balanced_batch_generator(
    X_sparse, y,
    keep_sparse=True,
    batch_size=32
)

# Use with a TensorFlow model that handles sparse input;
# the generator loops indefinitely, so bound the iteration by steps
for _, (batch_X, batch_y) in zip(range(steps), generator):
    if sparse.issparse(batch_X):
        batch_X = batch_X.toarray()  # Convert if needed
    # Train with batch
```
### Sample Weight Integration

```python
from sklearn.utils.class_weight import compute_sample_weight
from imblearn.keras import BalancedBatchGenerator
from imblearn.over_sampling import RandomOverSampler

# Compute sample weights
sample_weights = compute_sample_weight('balanced', y)

# Use with generator
generator = BalancedBatchGenerator(
    X, y,
    sample_weight=sample_weights,
    sampler=RandomOverSampler(),
    batch_size=32
)

# Each batch will include sample weights
for i in range(len(generator)):
    X_batch, y_batch, weights_batch = generator[i]
    # Use weights in training
```
## Framework Comparison

### Keras vs TensorFlow Generators

| Feature | Keras `BalancedBatchGenerator` | TensorFlow `balanced_batch_generator` |
|---------|--------------------------------|---------------------------------------|
| **API** | Keras `Sequence` interface | Plain generator function |
| **Integration** | `model.fit(generator)` | `model.fit(generator, steps_per_epoch=steps)` |
| **Memory** | Indexed batch access via the `Sequence` protocol | Manual iteration control |
| **Features** | Full Keras integration | More flexible, lower-level |
## Best Practices

1. **Choose an appropriate sampler**: Match the sampler to your problem characteristics, and make sure it exposes `sample_indices_`
2. **Batch size considerations**: Balance memory usage with training stability
3. **Reproducibility**: Always set `random_state` for consistent results
4. **Validation strategy**: Use separate validation data; don't apply sampling to the validation set
5. **Monitor class distribution**: Verify that balanced batches are being generated, as in the sketch below

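A minimal check along these lines, assuming a `BalancedBatchGenerator` instance named `generator` as in the earlier examples:

```python
import numpy as np

# Inspect the class distribution of the first few batches
for i in range(3):
    X_batch, y_batch = generator[i][:2]  # drop sample weights if present
    labels = y_batch.argmax(axis=1) if y_batch.ndim == 2 else y_batch
    print(f"batch {i}:", np.bincount(labels.astype(int)))
```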
**Complete Training Example:**
```python
from imblearn.keras import BalancedBatchGenerator
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras

# Split data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Create balanced training generator
train_generator = BalancedBatchGenerator(
    X_train, y_train,
    sampler=RandomOverSampler(random_state=42),
    batch_size=64,
    random_state=42
)

# Build model
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile with class-aware metrics
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy', 'precision', 'recall']
)

# Train with early stopping
callbacks = [
    keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
]

history = model.fit(
    train_generator,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=callbacks,
    verbose=1
)
```