# Deep Learning Integration

Utilities for handling imbalanced datasets in deep learning frameworks, providing balanced batch generators for Keras and TensorFlow that ensure fair representation of all classes during training.

## Overview

Imbalanced-learn provides specialized batch generators for deep learning frameworks that address class imbalance by creating balanced batches during training. These tools integrate with Keras and TensorFlow workflows while preserving the benefits of sampling techniques.

### Key Features

- **Balanced batch generation**: Ensures each batch contains balanced class representation
- **Framework compatibility**: Native support for Keras and TensorFlow
- **Sampling integration**: Uses imblearn samplers for batch balancing
- **Memory efficiency**: Generates balanced batches on demand without duplicating the entire dataset
- **Sparse data support**: Handles both dense and sparse input matrices

### Supported Frameworks

- **Keras**: Via `BalancedBatchGenerator` class and `balanced_batch_generator` function
- **TensorFlow**: Via `balanced_batch_generator` function

## Keras Integration

### BalancedBatchGenerator

#### BalancedBatchGenerator

```python { .api }
class BalancedBatchGenerator:
    def __init__(
        self,
        X,
        y,
        *,
        sample_weight=None,
        sampler=None,
        batch_size=32,
        keep_sparse=False,
        random_state=None
    ): ...
    def __len__(self): ...
    def __getitem__(self, index): ...
```

Create balanced batches for training a Keras model using the `Sequence` API.

**Parameters:**

- **X** (`ndarray` of shape `(n_samples, n_features)`): Original imbalanced dataset
- **y** (`ndarray` of shape `(n_samples,)` or `(n_samples, n_classes)`): Associated targets
- **sample_weight** (`ndarray` of shape `(n_samples,)`, default=`None`): Sample weights
- **sampler** (`sampler object`, default=`None`): A sampler instance that exposes a `sample_indices_` attribute after resampling. If `None`, a `RandomUnderSampler` is used
- **batch_size** (`int`, default=`32`): Number of samples per gradient update
- **keep_sparse** (`bool`, default=`False`): Whether to preserve the sparsity of the input. By default, the returned batches are dense
- **random_state** (`int`, `RandomState` instance or `None`, default=`None`): Controls the randomization of the algorithm

**Attributes:**

- **sampler_** (`sampler object`): The sampler used to balance the dataset
- **indices_** (`ndarray` of shape `(n_samples,)`): The indices of the samples selected during sampling

**Methods:**

##### __len__

```python
def __len__(self) -> int
```

Returns the number of batches per epoch.

##### __getitem__

```python
def __getitem__(self, index) -> tuple[ndarray, ndarray] | tuple[ndarray, ndarray, ndarray]
```

Generate one batch of data.

**Parameters:**

- **index** (`int`): Batch index

**Returns:**

- **batch** (`tuple`): Either `(X_batch, y_batch)` or `(X_batch, y_batch, sample_weight_batch)` if sample weights are provided (see the sketch below)
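
Since the class implements the `Sequence` protocol, `__len__` and `__getitem__` can be exercised directly before training. A minimal sketch, assuming scikit-learn and a Keras backend are installed; the toy dataset is illustrative only:

```python
from sklearn.datasets import make_classification
from imblearn.keras import BalancedBatchGenerator

# Toy imbalanced binary dataset (roughly 80% / 20%)
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

gen = BalancedBatchGenerator(X, y, batch_size=32, random_state=0)

print(len(gen))                      # __len__: number of batches per epoch
X_batch, y_batch = gen[0]            # __getitem__: one batch of data
print(X_batch.shape, y_batch.shape)
print(type(gen.sampler_).__name__)   # RandomUnderSampler by default
```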

**Usage with Keras:**

The class implements the Keras `Sequence` interface for use with `model.fit()`:

```python
from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import NearMiss
import tensorflow.keras as keras

# X, y: an imbalanced binary dataset (e.g. from sklearn.datasets)

# Create balanced batch generator
training_generator = BalancedBatchGenerator(
    X, y,
    sampler=NearMiss(),
    batch_size=32,
    random_state=42
)

# Use with Keras model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(training_generator, epochs=10)
```

#### balanced_batch_generator (Keras)

```python { .api }
def balanced_batch_generator(
    X,
    y,
    *,
    sample_weight=None,
    sampler=None,
    batch_size=32,
    keep_sparse=False,
    random_state=None
) -> tuple[Generator, int]
```

Create a balanced batch generator to train a Keras model.

**Parameters:**

- **X** (`ndarray` of shape `(n_samples, n_features)`): Original imbalanced dataset
- **y** (`ndarray` of shape `(n_samples,)` or `(n_samples, n_classes)`): Associated targets
- **sample_weight** (`ndarray` of shape `(n_samples,)`, default=`None`): Sample weights
- **sampler** (`sampler object`, default=`None`): A sampler instance that exposes a `sample_indices_` attribute after resampling. If `None`, a `RandomUnderSampler` is used
- **batch_size** (`int`, default=`32`): Number of samples per gradient update
- **keep_sparse** (`bool`, default=`False`): Whether to preserve the sparsity of the input. By default, the returned batches are dense
- **random_state** (`int`, `RandomState` instance or `None`, default=`None`): Controls the randomization of the algorithm

**Returns:**

- **generator** (`generator` of `tuple`): Generates batches of data. The tuples generated are either `(X_batch, y_batch)` or `(X_batch, y_batch, sample_weight_batch)`
- **steps_per_epoch** (`int`): The number of batches (steps) per epoch. Required by `fit_generator` in Keras

**Usage Example:**

```python
from imblearn.keras import balanced_batch_generator
from imblearn.under_sampling import EditedNearestNeighbours

training_generator, steps_per_epoch = balanced_batch_generator(
    X, y,
    sampler=EditedNearestNeighbours(),
    batch_size=64,
    random_state=42
)

# Use with the older Keras API (fit_generator is deprecated in TF2)
history = model.fit_generator(
    training_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=20
)
```
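
In TensorFlow 2, `model.fit()` accepts generators directly, so `fit_generator` is unnecessary. A sketch of the modern call, assuming `training_generator`, `steps_per_epoch`, and `model` are defined as above:

```python
# Modern equivalent: model.fit() consumes the generator directly.
# steps_per_epoch is still required because the generator is infinite.
history = model.fit(
    training_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=20
)
```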

## TensorFlow Integration

### balanced_batch_generator (TensorFlow)

#### balanced_batch_generator

```python { .api }
def balanced_batch_generator(
    X,
    y,
    *,
    sample_weight=None,
    sampler=None,
    batch_size=32,
    keep_sparse=False,
    random_state=None
) -> tuple[Generator, int]
```

Create a balanced batch generator to train a TensorFlow model.

**Parameters:**

- **X** (`ndarray` of shape `(n_samples, n_features)`): Original imbalanced dataset
- **y** (`ndarray` of shape `(n_samples,)` or `(n_samples, n_classes)`): Associated targets
- **sample_weight** (`ndarray` of shape `(n_samples,)`, default=`None`): Sample weights
- **sampler** (`sampler object`, default=`None`): A sampler instance that exposes a `sample_indices_` attribute after resampling. If `None`, a `RandomUnderSampler` is used
- **batch_size** (`int`, default=`32`): Number of samples per gradient update
- **keep_sparse** (`bool`, default=`False`): Whether to preserve the sparsity of the input `X`. By default, the returned batches are dense
- **random_state** (`int`, `RandomState` instance or `None`, default=`None`): Controls the randomization of the algorithm

**Returns:**

- **generator** (`generator` of `tuple`): Generates batches of data. The tuples generated are either `(X_batch, y_batch)` or `(X_batch, y_batch, sample_weight_batch)`
- **steps_per_epoch** (`int`): The number of batches (steps) per epoch

**Generator Function:**

The returned generator loops infinitely through balanced batches (see the consumption sketch after this list):

1. Applies the sampler to balance the dataset
2. Shuffles the resampled indices
3. Creates batches of the specified size
4. Yields batches cyclically for training
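
Because the generator yields cyclically and never terminates, manual consumption has to be bounded explicitly. A minimal sketch, assuming scikit-learn is installed, using `itertools.islice` to take exactly one epoch of batches:

```python
from itertools import islice
from sklearn.datasets import make_classification
from imblearn.tensorflow import balanced_batch_generator

# Toy imbalanced dataset for illustration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

generator, steps_per_epoch = balanced_batch_generator(
    X, y, batch_size=32, random_state=0
)

# The raw generator never stops, so bound each epoch explicitly
for X_batch, y_batch in islice(generator, steps_per_epoch):
    pass  # one training step per batch would go here
```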

**Usage with TensorFlow:**

```python
from imblearn.tensorflow import balanced_batch_generator
from imblearn.under_sampling import RandomUnderSampler
import tensorflow as tf

# X, y: an imbalanced multi-class dataset with integer labels;
# X_val, y_val: a held-out validation split

# Create generator (the sampler must expose sample_indices_)
training_generator, steps_per_epoch = balanced_batch_generator(
    X, y,
    sampler=RandomUnderSampler(random_state=42),
    batch_size=128,
    random_state=42
)

# Use with tf.keras
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  # integer labels, not one-hot
    metrics=['accuracy']
)

history = model.fit(
    training_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=50,
    validation_data=(X_val, y_val)
)
```

## Sampler Integration

### Compatible Samplers

Only imblearn samplers that expose a `sample_indices_` attribute after resampling can be used. This covers the under-sampling methods and `RandomOverSampler`. Samplers that create synthetic samples (such as `SMOTE` and `ADASYN`) and the combination methods do not expose `sample_indices_` and will raise a `ValueError` if passed as `sampler`:

**Over-sampling Methods:**

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.keras import BalancedBatchGenerator

# RandomOverSampler selects existing samples, so it exposes sample_indices_
generator = BalancedBatchGenerator(X, y, sampler=RandomOverSampler(random_state=0))

# SMOTE and ADASYN generate synthetic samples, lack sample_indices_,
# and therefore cannot be passed as the sampler here
```

**Under-sampling Methods:**

```python
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, EditedNearestNeighbours

# Using random under-sampling
generator = BalancedBatchGenerator(X, y, sampler=RandomUnderSampler())

# Using Tomek links cleaning
generator = BalancedBatchGenerator(X, y, sampler=TomekLinks())
```

**Combination Methods:**

```python
from imblearn.combine import SMOTEENN

# SMOTEENN and SMOTETomek also lack sample_indices_. Apply them to the
# training set up front, then batch the resampled data instead:
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X, y)
generator = BalancedBatchGenerator(X_res, y_res)
```
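
If in doubt about a particular sampler, the requirement can be probed directly. A hedged sketch with a hypothetical helper (`supports_batch_generation` is not part of imblearn): it clones the sampler, runs a trial resample, and checks for the attribute:

```python
from sklearn.base import clone

def supports_batch_generation(sampler, X, y):
    """Hypothetical helper: True if `sampler` can drive the batch generators."""
    fitted = clone(sampler)
    fitted.fit_resample(X, y)  # trial resample; may be slow on large data
    return hasattr(fitted, "sample_indices_")
```

For example, `supports_batch_generation(RandomUnderSampler(), X, y)` returns `True`, while the same call with `SMOTE()` returns `False`.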

## Advanced Usage Patterns

### Multi-Class Classification

```python
from sklearn.datasets import make_classification
from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import RandomUnderSampler
import tensorflow.keras as keras

# Create multi-class imbalanced dataset
X, y = make_classification(
    n_classes=3,
    n_informative=5,
    weights=[0.7, 0.2, 0.1],
    n_samples=2000,
    random_state=42
)

# Convert to categorical
y_cat = keras.utils.to_categorical(y, 3)

# Create balanced generator (the sampler must expose sample_indices_)
generator = BalancedBatchGenerator(
    X, y_cat,
    sampler=RandomUnderSampler(random_state=42),
    batch_size=64,
    random_state=42
)

# Multi-class model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(3, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy', 'categorical_accuracy']
)

history = model.fit(generator, epochs=100, verbose=1)
```

### Sparse Data Handling

```python
from itertools import islice
from scipy import sparse
from imblearn.tensorflow import balanced_batch_generator

# Convert to sparse matrix
X_sparse = sparse.csr_matrix(X)

# Keep data sparse during batch generation
generator, steps = balanced_batch_generator(
    X_sparse, y,
    keep_sparse=True,
    batch_size=32
)

# Consume one epoch; the generator itself is infinite
for batch_X, batch_y in islice(generator, steps):
    if sparse.issparse(batch_X):
        batch_X = batch_X.toarray()  # densify only if the model requires it
    # Train with batch
```

### Sample Weight Integration

```python
from sklearn.utils.class_weight import compute_sample_weight
from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import RandomUnderSampler

# Compute sample weights
sample_weights = compute_sample_weight('balanced', y)

# Use with generator
generator = BalancedBatchGenerator(
    X, y,
    sample_weight=sample_weights,
    sampler=RandomUnderSampler(),
    batch_size=32
)

# Each batch now includes sample weights; index batches via the Sequence API
for i in range(len(generator)):
    X_batch, y_batch, weights_batch = generator[i]
    # Use weights in training
```

## Framework Comparison

### Keras vs TensorFlow Generators

| Feature | Keras `BalancedBatchGenerator` | TensorFlow `balanced_batch_generator` |
|---------|--------------------------------|---------------------------------------|
| **API** | Keras `Sequence` interface | Plain generator function |
| **Integration** | `model.fit(generator)` | `model.fit(generator, steps_per_epoch=steps)` |
| **Epoch handling** | Length known via `__len__` | Infinite; caller controls iteration |
| **Features** | Full Keras integration | More flexible, lower-level |
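
The two call patterns side by side; a sketch assuming `X`, `y`, and a compiled `model` already exist:

```python
# Keras Sequence: the epoch length is known from the object itself
from imblearn.keras import BalancedBatchGenerator
seq = BalancedBatchGenerator(X, y, batch_size=32, random_state=0)
model.fit(seq, epochs=10)

# TensorFlow generator: infinite, so steps_per_epoch must be passed
from imblearn.tensorflow import balanced_batch_generator
gen, steps = balanced_batch_generator(X, y, batch_size=32, random_state=0)
model.fit(gen, steps_per_epoch=steps, epochs=10)
```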

## Best Practices

1. **Choose an appropriate sampler**: Match the sampler to your problem characteristics, and make sure it exposes `sample_indices_`
2. **Batch size considerations**: Balance memory usage against training stability
3. **Reproducibility**: Always set `random_state` for consistent results
4. **Validation strategy**: Use separate validation data; never apply sampling to the validation set
5. **Monitor class distribution**: Verify that balanced batches are being generated (see the sketch after this list)
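
For item 5, a monitoring sketch under the same toy-data assumptions as the earlier examples: aggregate the label counts over one full epoch of batches and compare the classes:

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.keras import BalancedBatchGenerator

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
gen = BalancedBatchGenerator(X, y, batch_size=32, random_state=0)

# Aggregate label counts over one full epoch of batches
counts = np.zeros(2, dtype=int)
for i in range(len(gen)):
    _, y_batch = gen[i]
    counts += np.bincount(y_batch.astype(int), minlength=2)
print(counts)  # the two classes should appear in roughly equal numbers
```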

**Complete Training Example:**

```python
from imblearn.keras import BalancedBatchGenerator
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras

# Split data (sampling is applied to the training split only)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Create balanced training generator
train_generator = BalancedBatchGenerator(
    X_train, y_train,
    sampler=RandomOverSampler(random_state=42),
    batch_size=64,
    random_state=42
)

# Build model
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile with class-aware metrics
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()]
)

# Train with early stopping
callbacks = [
    keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
]

history = model.fit(
    train_generator,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=callbacks,
    verbose=1
)
```