# Ensemble Methods for Imbalanced Learning

## Overview

Ensemble methods combine multiple base learners to improve classification performance beyond what individual models can achieve. However, traditional ensemble methods often struggle with imbalanced datasets, where minority classes are underrepresented. The imbalanced-learn library provides specialized ensemble classifiers that integrate resampling directly into the ensemble learning process.

These ensemble methods address class imbalance by applying resampling during training, so that each base learner in the ensemble receives balanced training data. This approach aims to improve performance on minority classes while maintaining overall classification accuracy.

The ensemble module includes four main approaches:

- **BalancedBaggingClassifier**: Applies random under-sampling to each bootstrap sample in bagging
- **BalancedRandomForestClassifier**: Integrates random under-sampling into random forest construction
- **EasyEnsembleClassifier**: Combines multiple balanced AdaBoost classifiers
- **RUSBoostClassifier**: Integrates random under-sampling directly into the AdaBoost algorithm
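
To make "each base learner receives balanced training data" concrete, here is a minimal pure-Python sketch of random under-sampling. It is an illustration of the idea only, not the library's implementation; the helper name `random_under_sample` and the toy label list are assumptions made for this example.

```python
import random
from collections import Counter


def random_under_sample(labels, rng):
    """Return indices of a class-balanced subset of `labels`.

    Hypothetical helper for illustration only: every class is randomly
    reduced (without replacement) to the size of the smallest class.
    """
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)
    target = min(len(indices) for indices in indices_by_class.values())
    balanced = []
    for indices in indices_by_class.values():
        balanced.extend(rng.sample(indices, target))
    return sorted(balanced)


labels = [0] * 90 + [1] * 10  # 90% majority, 10% minority
rng = random.Random(0)

# Each base learner in a balanced ensemble would get its own balanced draw.
subsets = [random_under_sample(labels, rng) for _ in range(3)]
for subset in subsets:
    print(Counter(labels[i] for i in subset))
```

Because every draw keeps all minority samples but a different random slice of the majority class, the ensemble as a whole still sees most of the majority data even though each learner trains on a small balanced subset.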

## BalancedBaggingClassifier

A bagging classifier with additional balancing that applies resampling to each bootstrap sample before training base estimators.

```python { .api }
class BalancedBaggingClassifier(BaggingClassifier):
    def __init__(
        self,
        estimator=None,
        n_estimators=10,
        *,
        max_samples=1.0,
        max_features=1.0,
        bootstrap=True,
        bootstrap_features=False,
        oob_score=False,
        warm_start=False,
        sampling_strategy="auto",
        replacement=False,
        n_jobs=None,
        random_state=None,
        verbose=0,
        sampler=None,
    )
```

### Parameters

- **estimator** : estimator object, default=None
  - The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a DecisionTreeClassifier
- **n_estimators** : int, default=10
  - The number of base estimators in the ensemble
- **max_samples** : int or float, default=1.0
  - The number of samples to draw from X to train each base estimator
- **max_features** : int or float, default=1.0
  - The number of features to draw from X to train each base estimator
- **bootstrap** : bool, default=True
  - Whether samples are drawn with replacement (applied after resampling)
- **bootstrap_features** : bool, default=False
  - Whether features are drawn with replacement
- **oob_score** : bool, default=False
  - Whether to use out-of-bag samples to estimate generalization error
- **warm_start** : bool, default=False
  - When set to True, reuse the solution of the previous call to fit
- **sampling_strategy** : float, str, dict, callable, default="auto"
  - Sampling information to resample the dataset
- **replacement** : bool, default=False
  - Whether to sample randomly with replacement when using RandomUnderSampler
- **n_jobs** : int, default=None
  - The number of jobs to run in parallel for both fit and predict
- **random_state** : int or RandomState, default=None
  - Controls the random seed given to each base estimator
- **verbose** : int, default=0
  - Controls the verbosity of the building process
- **sampler** : sampler object, default=None
  - The sampler used to balance the dataset before bootstrapping. By default, RandomUnderSampler is used

### Methods

```python { .api }
def fit(self, X, y):
    """Build a Bagging ensemble of estimators from the training set (X, y).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)

    Returns
    -------
    self : object
        Fitted estimator
    """

def predict(self, X):
    """Predict class for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities of the input samples
    """
```

### Attributes

- **estimator_** : estimator - The base estimator from which the ensemble is grown
- **estimators_** : list of estimators - The collection of fitted base estimators
- **sampler_** : sampler object - The validated sampler created from the sampler parameter
- **estimators_samples_** : list of ndarray - The subset of drawn samples for each base estimator
- **estimators_features_** : list of ndarray - The subset of drawn features for each base estimator
- **classes_** : ndarray - The class labels
- **n_classes_** : int or list - The number of classes
- **oob_score_** : float - Score using out-of-bag estimate (if oob_score=True)

### Example Usage

```python
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create imbalanced dataset (about 10% class 0, 90% class 1)
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, n_features=20,
    n_clusters_per_class=1, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train balanced bagging classifier
bbc = BalancedBaggingClassifier(n_estimators=10, random_state=42)
bbc.fit(X_train, y_train)

# Make predictions
y_pred = bbc.predict(X_test)
y_proba = bbc.predict_proba(X_test)
```

## BalancedRandomForestClassifier

A balanced random forest classifier that applies random under-sampling to balance each bootstrap sample during forest construction.

```python { .api }
class BalancedRandomForestClassifier(RandomForestClassifier):
    def __init__(
        self,
        n_estimators=100,
        *,
        criterion="gini",
        max_depth=None,
        min_samples_split=2,
        min_samples_leaf=1,
        min_weight_fraction_leaf=0.0,
        max_features="sqrt",
        max_leaf_nodes=None,
        min_impurity_decrease=0.0,
        bootstrap=False,
        oob_score=False,
        sampling_strategy="all",
        replacement=True,
        n_jobs=None,
        random_state=None,
        verbose=0,
        warm_start=False,
        class_weight=None,
        ccp_alpha=0.0,
        max_samples=None,
        monotonic_cst=None,
    )
```

### Parameters

- **n_estimators** : int, default=100
  - The number of trees in the forest
- **criterion** : {"gini", "entropy"}, default="gini"
  - The function to measure the quality of a split
- **max_depth** : int, default=None
  - The maximum depth of the tree
- **min_samples_split** : int or float, default=2
  - The minimum number of samples required to split an internal node
- **min_samples_leaf** : int or float, default=1
  - The minimum number of samples required to be at a leaf node
- **min_weight_fraction_leaf** : float, default=0.0
  - The minimum weighted fraction of the sum total of weights required to be at a leaf node
- **max_features** : {"sqrt", "log2"}, int, float, or None, default="sqrt"
  - The number of features to consider when looking for the best split
- **max_leaf_nodes** : int, default=None
  - Grow trees with max_leaf_nodes in best-first fashion
- **min_impurity_decrease** : float, default=0.0
  - A node will be split if this split induces a decrease of impurity greater than or equal to this value
- **bootstrap** : bool, default=False
  - Whether bootstrap samples are used when building trees (applied after resampling)
- **oob_score** : bool, default=False
  - Whether to use out-of-bag samples to estimate generalization accuracy
- **sampling_strategy** : float, str, dict, callable, default="all"
  - Sampling information to resample the dataset
- **replacement** : bool, default=True
  - Whether to sample randomly with replacement or not
- **n_jobs** : int, default=None
  - The number of jobs to run in parallel
- **random_state** : int or RandomState, default=None
  - Controls the randomness of both the bootstrap and the feature sampling
- **verbose** : int, default=0
  - Controls the verbosity of the tree building process
- **warm_start** : bool, default=False
  - When set to True, reuse the solution of the previous call to fit
- **class_weight** : dict, list of dicts, {"balanced", "balanced_subsample"}, default=None
  - Weights associated with classes
- **ccp_alpha** : non-negative float, default=0.0
  - Complexity parameter used for Minimal Cost-Complexity Pruning
- **max_samples** : int or float, default=None
  - The number of samples to draw from X to train each base estimator
- **monotonic_cst** : array-like of int, default=None
  - Indicates the monotonicity constraint to enforce on each feature

### Methods

```python { .api }
def fit(self, X, y, sample_weight=None):
    """Build a forest of trees from the training set (X, y).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,) or (n_samples, n_outputs)
        The target values (class labels)
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights

    Returns
    -------
    self : object
        The fitted instance
    """

def predict(self, X):
    """Predict class for samples in X.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """
```

### Attributes

- **estimator_** : DecisionTreeClassifier - The child estimator template used to create the collection
- **estimators_** : list of DecisionTreeClassifier - The collection of fitted sub-estimators
- **base_sampler_** : RandomUnderSampler - The base sampler used to construct subsequent samplers
- **samplers_** : list of RandomUnderSampler - The collection of fitted samplers
- **pipelines_** : list of Pipeline - The collection of fitted pipelines (samplers + trees)
- **classes_** : ndarray - The class labels
- **n_classes_** : int or list - The number of classes
- **feature_importances_** : ndarray - The feature importances
- **oob_score_** : float - Score using out-of-bag estimate (if oob_score=True)

### Example Usage

```python
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=4,
    weights=[0.2, 0.3, 0.5], random_state=0
)

# Train balanced random forest
brf = BalancedRandomForestClassifier(
    n_estimators=10,
    sampling_strategy="all",
    replacement=True,
    max_depth=2,
    random_state=0,
    bootstrap=False
)
brf.fit(X, y)

# Make predictions
y_pred = brf.predict(X)
feature_importances = brf.feature_importances_
```

## EasyEnsembleClassifier

Bag of balanced boosted learners, also known as EasyEnsemble. This classifier is an ensemble of AdaBoost learners, each trained on a different balanced bootstrap sample obtained by random under-sampling.

```python { .api }
class EasyEnsembleClassifier(BaggingClassifier):
    def __init__(
        self,
        n_estimators=10,
        estimator=None,
        *,
        warm_start=False,
        sampling_strategy="auto",
        replacement=False,
        n_jobs=None,
        random_state=None,
        verbose=0,
    )
```

### Parameters

- **n_estimators** : int, default=10
  - Number of AdaBoost learners in the ensemble
- **estimator** : estimator object, default=AdaBoostClassifier()
  - The base AdaBoost classifier used in the inner ensemble. You can set the number of inner learners by passing your own instance
- **warm_start** : bool, default=False
  - When set to True, reuse the solution of the previous call to fit
- **sampling_strategy** : float, str, dict, callable, default="auto"
  - Sampling information to resample the dataset
- **replacement** : bool, default=False
  - Whether to sample randomly with replacement or not
- **n_jobs** : int, default=None
  - The number of jobs to run in parallel for both fit and predict
- **random_state** : int or RandomState, default=None
  - Controls the random seed given to each base estimator
- **verbose** : int, default=0
  - Controls the verbosity of the building process

### Methods

```python { .api }
def fit(self, X, y):
    """Build a Bagging ensemble of estimators from the training set (X, y).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)

    Returns
    -------
    self : object
        Fitted estimator
    """

def predict(self, X):
    """Predict class for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """
```

### Attributes

- **estimator_** : estimator - The base estimator from which the ensemble is grown
- **estimators_** : list of estimators - The collection of fitted base estimators
- **estimators_samples_** : list of arrays - The subset of drawn samples for each base estimator
- **estimators_features_** : list of arrays - The subset of drawn features for each base estimator
- **classes_** : ndarray - The class labels
- **n_classes_** : int or list - The number of classes

### Example Usage

```python
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create imbalanced dataset (about 10% class 0, 90% class 1)
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, n_features=20,
    n_clusters_per_class=1, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create custom AdaBoost estimator with 10 inner learners.
# The `algorithm` argument is omitted because it is deprecated in
# recent scikit-learn releases.
ada_estimator = AdaBoostClassifier(n_estimators=10)

# Train EasyEnsemble classifier
eec = EasyEnsembleClassifier(
    n_estimators=10,
    estimator=ada_estimator,
    random_state=42
)
eec.fit(X_train, y_train)

# Make predictions
y_pred = eec.predict(X_test)
y_proba = eec.predict_proba(X_test)
```

## RUSBoostClassifier

Random under-sampling integrated into the learning of AdaBoost. During training, class imbalance is alleviated by randomly under-sampling the dataset at each iteration of the boosting algorithm.

```python { .api }
class RUSBoostClassifier(AdaBoostClassifier):
    def __init__(
        self,
        estimator=None,
        *,
        n_estimators=50,
        learning_rate=1.0,
        algorithm="deprecated",
        sampling_strategy="auto",
        replacement=False,
        random_state=None,
    )
```

### Parameters

- **estimator** : estimator object, default=None
  - The base estimator from which the boosted ensemble is built. If None, then the base estimator is DecisionTreeClassifier(max_depth=1)
- **n_estimators** : int, default=50
  - The maximum number of estimators at which boosting is terminated
- **learning_rate** : float, default=1.0
  - Learning rate shrinks the contribution of each classifier
- **algorithm** : {"SAMME", "SAMME.R"}, default="deprecated"
  - The boosting algorithm to use. SAMME.R uses the real boosting algorithm, SAMME uses discrete boosting. The default value "deprecated" marks this parameter as deprecated, so new code should avoid setting it
- **sampling_strategy** : float, str, dict, callable, default="auto"
  - Sampling information to resample the dataset
- **replacement** : bool, default=False
  - Whether to sample randomly with replacement or not
- **random_state** : int or RandomState, default=None
  - Controls the random seed given to each base estimator

### Methods

```python { .api }
def fit(self, X, y, sample_weight=None):
    """Build a boosted classifier from the training set (X, y).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights

    Returns
    -------
    self : object
        Returns self
    """

def predict(self, X):
    """Predict classes for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """
```

### Attributes

- **estimator_** : estimator - The base estimator from which the ensemble is grown
- **estimators_** : list of classifiers - The collection of fitted sub-estimators
- **base_sampler_** : RandomUnderSampler - The base sampler used to generate subsequent samplers
- **samplers_** : list of RandomUnderSampler - The collection of fitted samplers
- **pipelines_** : list of Pipeline - The collection of fitted pipelines (samplers + trees)
- **classes_** : ndarray - The class labels
- **n_classes_** : int - The number of classes
- **estimator_weights_** : ndarray - Weights for each estimator in the boosted ensemble
- **estimator_errors_** : ndarray - Classification error for each estimator
- **feature_importances_** : ndarray - The feature importances (if supported by base estimator)

### Example Usage

```python
from imblearn.ensemble import RUSBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=4,
    weights=[0.2, 0.3, 0.5], random_state=0
)

# Use custom base estimator
base_estimator = DecisionTreeClassifier(max_depth=2)

# Train RUSBoost classifier
rusboost = RUSBoostClassifier(
    estimator=base_estimator,
    n_estimators=10,
    learning_rate=1.0,
    sampling_strategy="auto",
    random_state=0
)
rusboost.fit(X, y)

# Make predictions
y_pred = rusboost.predict(X)
y_proba = rusboost.predict_proba(X)

# Access ensemble information
print(f"Estimator weights: {rusboost.estimator_weights_}")
print(f"Estimator errors: {rusboost.estimator_errors_}")
```

## Algorithm Details and Relationships


### Relationship to Scikit-learn

All imbalanced-learn ensemble classifiers extend their corresponding scikit-learn base classes:

- **BalancedBaggingClassifier** extends `sklearn.ensemble.BaggingClassifier`
- **BalancedRandomForestClassifier** extends `sklearn.ensemble.RandomForestClassifier`
- **EasyEnsembleClassifier** extends `sklearn.ensemble.BaggingClassifier`
- **RUSBoostClassifier** extends `sklearn.ensemble.AdaBoostClassifier`

This inheritance ensures compatibility with scikit-learn's API while adding resampling capabilities.

### Resampling Integration

Each ensemble method integrates resampling differently:

1. **Bagging approaches** (BalancedBaggingClassifier, EasyEnsembleClassifier) apply resampling to each bootstrap sample before training individual estimators
2. **Random Forest** (BalancedRandomForestClassifier) applies resampling before constructing each tree, then optionally applies additional bootstrapping
3. **Boosting** (RUSBoostClassifier) applies resampling at each boosting iteration, ensuring balanced training data throughout the adaptive process

### Performance Considerations

- **BalancedRandomForestClassifier** typically provides the best balance of performance and training speed
- **RUSBoostClassifier** can be more sensitive to noise but often performs well on structured data
- **EasyEnsembleClassifier** provides good performance but requires more computational resources
- **BalancedBaggingClassifier** offers the most flexibility in base estimator selection

### Best Practices

1. **Start with BalancedRandomForestClassifier** for most imbalanced classification tasks
2. **Use sampling_strategy="all"** with replacement=True for BalancedRandomForestClassifier to follow the original algorithm
3. **Consider RUSBoostClassifier** for problems where boosting has shown advantages
4. **Tune n_estimators** based on dataset size and computational constraints
5. **Use cross-validation** with appropriate metrics (balanced accuracy, F1-score, geometric mean) for model selection

### Integration with Pipelines

All ensemble classifiers can be used within scikit-learn pipelines:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.ensemble import BalancedRandomForestClassifier

# Create and split an imbalanced dataset
X, y = make_classification(
    n_classes=2, weights=[0.1, 0.9], n_samples=1000, random_state=10
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', BalancedRandomForestClassifier(random_state=42))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```

This modular design enables easy integration into existing machine learning workflows while providing the benefits of balanced ensemble learning for imbalanced datasets.