# Ensemble Methods for Imbalanced Learning

## Overview

Ensemble methods combine multiple base learners to achieve better classification performance than a single model usually can. However, traditional ensemble methods often struggle with imbalanced datasets, where minority classes are underrepresented. The imbalanced-learn library provides specialized ensemble classifiers that integrate resampling directly into the ensemble learning process.

These classifiers address class imbalance by resampling during training, so that each base learner in the ensemble is fitted on balanced data. This usually improves performance on minority classes, although it can cost some accuracy on the majority class.

The ensemble module includes four main approaches:

- **BalancedBaggingClassifier**: Applies random under-sampling to each bootstrap sample in bagging
- **BalancedRandomForestClassifier**: Integrates random under-sampling into random forest construction
- **EasyEnsembleClassifier**: Combines multiple balanced AdaBoost classifiers
- **RUSBoostClassifier**: Integrates random under-sampling directly into the AdaBoost algorithm

## BalancedBaggingClassifier

A bagging classifier with additional balancing that applies resampling to each bootstrap sample before training base estimators.

```python { .api }
class BalancedBaggingClassifier(BaggingClassifier):
    def __init__(
        self,
        estimator=None,
        n_estimators=10,
        *,
        max_samples=1.0,
        max_features=1.0,
        bootstrap=True,
        bootstrap_features=False,
        oob_score=False,
        warm_start=False,
        sampling_strategy="auto",
        replacement=False,
        n_jobs=None,
        random_state=None,
        verbose=0,
        sampler=None,
    )
```

### Parameters

- **estimator** : estimator object, default=None
  - The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a DecisionTreeClassifier
- **n_estimators** : int, default=10
  - The number of base estimators in the ensemble
- **max_samples** : int or float, default=1.0
  - The number of samples to draw from X to train each base estimator
- **max_features** : int or float, default=1.0
  - The number of features to draw from X to train each base estimator
- **bootstrap** : bool, default=True
  - Whether samples are drawn with replacement (applied after resampling)
- **bootstrap_features** : bool, default=False
  - Whether features are drawn with replacement
- **oob_score** : bool, default=False
  - Whether to use out-of-bag samples to estimate generalization error
- **warm_start** : bool, default=False
  - When set to True, reuse the solution of the previous call to fit
- **sampling_strategy** : float, str, dict, callable, default="auto"
  - Sampling information to resample the dataset
- **replacement** : bool, default=False
  - Whether to sample randomly with replacement when using RandomUnderSampler
- **n_jobs** : int, default=None
  - The number of jobs to run in parallel for both fit and predict
- **random_state** : int or RandomState, default=None
  - Controls the random seed given to each base estimator
- **verbose** : int, default=0
  - Controls the verbosity of the building process
- **sampler** : sampler object, default=None
  - The sampler used to balance the dataset before bootstrapping. By default, RandomUnderSampler is used

### Methods

```python { .api }
def fit(self, X, y):
    """Build a Bagging ensemble of estimators from the training set (X, y).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)

    Returns
    -------
    self : object
        Fitted estimator
    """

def predict(self, X):
    """Predict class for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities of the input samples
    """
```

### Attributes

- **estimator_** : estimator - The base estimator from which the ensemble is grown
- **estimators_** : list of estimators - The collection of fitted base estimators
- **sampler_** : sampler object - The validated sampler created from the sampler parameter
- **estimators_samples_** : list of ndarray - The subset of drawn samples for each base estimator
- **estimators_features_** : list of ndarray - The subset of drawn features for each base estimator
- **classes_** : ndarray - The class labels
- **n_classes_** : int or list - The number of classes
- **oob_score_** : float - Score using out-of-bag estimate (if oob_score=True)

### Example Usage

```python
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, n_features=20,
    n_clusters_per_class=1, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train balanced bagging classifier
bbc = BalancedBaggingClassifier(n_estimators=10, random_state=42)
bbc.fit(X_train, y_train)

# Make predictions
y_pred = bbc.predict(X_test)
y_proba = bbc.predict_proba(X_test)
```

## BalancedRandomForestClassifier

A balanced random forest classifier that applies random under-sampling to balance each bootstrap sample during forest construction.

```python { .api }
class BalancedRandomForestClassifier(RandomForestClassifier):
    def __init__(
        self,
        n_estimators=100,
        *,
        criterion="gini",
        max_depth=None,
        min_samples_split=2,
        min_samples_leaf=1,
        min_weight_fraction_leaf=0.0,
        max_features="sqrt",
        max_leaf_nodes=None,
        min_impurity_decrease=0.0,
        bootstrap=False,
        oob_score=False,
        sampling_strategy="all",
        replacement=True,
        n_jobs=None,
        random_state=None,
        verbose=0,
        warm_start=False,
        class_weight=None,
        ccp_alpha=0.0,
        max_samples=None,
        monotonic_cst=None,
    )
```

### Parameters

- **n_estimators** : int, default=100
  - The number of trees in the forest
- **criterion** : {"gini", "entropy"}, default="gini"
  - The function to measure the quality of a split
- **max_depth** : int, default=None
  - The maximum depth of the tree
- **min_samples_split** : int or float, default=2
  - The minimum number of samples required to split an internal node
- **min_samples_leaf** : int or float, default=1
  - The minimum number of samples required to be at a leaf node
- **min_weight_fraction_leaf** : float, default=0.0
  - The minimum weighted fraction of the sum total of weights required to be at a leaf node
- **max_features** : {"sqrt", "log2"}, int, float, or None, default="sqrt"
  - The number of features to consider when looking for the best split
- **max_leaf_nodes** : int, default=None
  - Grow trees with max_leaf_nodes in best-first fashion
- **min_impurity_decrease** : float, default=0.0
  - A node will be split if this split induces a decrease of impurity greater than or equal to this value
- **bootstrap** : bool, default=False
  - Whether bootstrap samples are used when building trees (applied after resampling)
- **oob_score** : bool, default=False
  - Whether to use out-of-bag samples to estimate generalization accuracy
- **sampling_strategy** : float, str, dict, callable, default="all"
  - Sampling information to resample the dataset
- **replacement** : bool, default=True
  - Whether to sample randomly with replacement or not
- **n_jobs** : int, default=None
  - The number of jobs to run in parallel
- **random_state** : int or RandomState, default=None
  - Controls both the randomness of the bootstrap and feature sampling
- **verbose** : int, default=0
  - Controls the verbosity of the tree building process
- **warm_start** : bool, default=False
  - When set to True, reuse the solution of the previous call to fit
- **class_weight** : dict, list of dicts, {"balanced", "balanced_subsample"}, default=None
  - Weights associated with classes
- **ccp_alpha** : non-negative float, default=0.0
  - Complexity parameter used for Minimal Cost-Complexity Pruning
- **max_samples** : int or float, default=None
  - The number of samples to draw from X to train each base estimator
- **monotonic_cst** : array-like of int, default=None
  - Indicates the monotonicity constraint to enforce on each feature

### Methods

```python { .api }
def fit(self, X, y, sample_weight=None):
    """Build a forest of trees from the training set (X, y).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,) or (n_samples, n_outputs)
        The target values (class labels)
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights

    Returns
    -------
    self : object
        The fitted instance
    """

def predict(self, X):
    """Predict class for samples in X.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """
```

### Attributes

- **estimator_** : DecisionTreeClassifier - The child estimator template used to create the collection
- **estimators_** : list of DecisionTreeClassifier - The collection of fitted sub-estimators
- **base_sampler_** : RandomUnderSampler - The base sampler used to construct subsequent samplers
- **samplers_** : list of RandomUnderSampler - The collection of fitted samplers
- **pipelines_** : list of Pipeline - The collection of fitted pipelines (samplers + trees)
- **classes_** : ndarray - The class labels
- **n_classes_** : int or list - The number of classes
- **feature_importances_** : ndarray - The feature importances
- **oob_score_** : float - Score using out-of-bag estimate (if oob_score=True)

### Example Usage

```python
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=4,
    weights=[0.2, 0.3, 0.5], random_state=0
)

# Train balanced random forest
brf = BalancedRandomForestClassifier(
    n_estimators=10,
    sampling_strategy="all",
    replacement=True,
    max_depth=2,
    random_state=0,
    bootstrap=False
)
brf.fit(X, y)

# Make predictions
y_pred = brf.predict(X)
feature_importances = brf.feature_importances_
```

## EasyEnsembleClassifier

A bag of balanced boosted learners, also known as EasyEnsemble. This classifier is an ensemble of AdaBoost learners, each trained on a different balanced bootstrap sample obtained by random under-sampling.

```python { .api }
class EasyEnsembleClassifier(BaggingClassifier):
    def __init__(
        self,
        n_estimators=10,
        estimator=None,
        *,
        warm_start=False,
        sampling_strategy="auto",
        replacement=False,
        n_jobs=None,
        random_state=None,
        verbose=0,
    )
```

### Parameters

- **n_estimators** : int, default=10
  - Number of AdaBoost learners in the ensemble
- **estimator** : estimator object, default=AdaBoostClassifier()
  - The base AdaBoost classifier used in the inner ensemble. You can set the number of inner learners by passing your own instance
- **warm_start** : bool, default=False
  - When set to True, reuse the solution of the previous call to fit
- **sampling_strategy** : float, str, dict, callable, default="auto"
  - Sampling information to resample the dataset
- **replacement** : bool, default=False
  - Whether to sample randomly with replacement or not
- **n_jobs** : int, default=None
  - The number of jobs to run in parallel for both fit and predict
- **random_state** : int or RandomState, default=None
  - Controls the random seed given to each base estimator
- **verbose** : int, default=0
  - Controls the verbosity of the building process

### Methods

```python { .api }
def fit(self, X, y):
    """Build a Bagging ensemble of estimators from the training set (X, y).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)

    Returns
    -------
    self : object
        Fitted estimator
    """

def predict(self, X):
    """Predict class for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """
```

### Attributes

- **estimator_** : estimator - The base estimator from which the ensemble is grown
- **estimators_** : list of estimators - The collection of fitted base estimators
- **estimators_samples_** : list of arrays - The subset of drawn samples for each base estimator
- **estimators_features_** : list of arrays - The subset of drawn features for each base estimator
- **classes_** : ndarray - The class labels
- **n_classes_** : int or list - The number of classes

### Example Usage

```python
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, n_features=20,
    n_clusters_per_class=1, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create custom AdaBoost estimator (the `algorithm` argument is omitted
# because it is deprecated in recent scikit-learn releases)
ada_estimator = AdaBoostClassifier(n_estimators=10)

# Train EasyEnsemble classifier
eec = EasyEnsembleClassifier(
    n_estimators=10,
    estimator=ada_estimator,
    random_state=42
)
eec.fit(X_train, y_train)

# Make predictions
y_pred = eec.predict(X_test)
y_proba = eec.predict_proba(X_test)
```

## RUSBoostClassifier

Random under-sampling integrated into the learning of AdaBoost. During learning, class imbalance is alleviated by randomly under-sampling the dataset at each iteration of the boosting algorithm.

```python { .api }
class RUSBoostClassifier(AdaBoostClassifier):
    def __init__(
        self,
        estimator=None,
        *,
        n_estimators=50,
        learning_rate=1.0,
        algorithm="deprecated",
        sampling_strategy="auto",
        replacement=False,
        random_state=None,
    )
```

### Parameters

- **estimator** : estimator object, default=None
  - The base estimator from which the boosted ensemble is built. If None, then the base estimator is DecisionTreeClassifier(max_depth=1)
- **n_estimators** : int, default=50
  - The maximum number of estimators at which boosting is terminated
- **learning_rate** : float, default=1.0
  - Learning rate shrinks the contribution of each classifier
- **algorithm** : {"SAMME", "SAMME.R"}, default="deprecated"
  - The boosting algorithm to use. SAMME.R uses the real boosting algorithm, SAMME uses discrete boosting. The default value "deprecated" signals that this parameter is deprecated, so avoid setting it in new code
- **sampling_strategy** : float, str, dict, callable, default="auto"
  - Sampling information to resample the dataset
- **replacement** : bool, default=False
  - Whether to sample randomly with replacement or not
- **random_state** : int or RandomState, default=None
  - Controls the random seed given to each base estimator

### Methods

```python { .api }
def fit(self, X, y, sample_weight=None):
    """Build a boosted classifier from the training set (X, y).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights

    Returns
    -------
    self : object
        Returns self
    """

def predict(self, X):
    """Predict classes for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """
```

### Attributes

- **estimator_** : estimator - The base estimator from which the ensemble is grown
- **estimators_** : list of classifiers - The collection of fitted sub-estimators
- **base_sampler_** : RandomUnderSampler - The base sampler used to generate subsequent samplers
- **samplers_** : list of RandomUnderSampler - The collection of fitted samplers
- **pipelines_** : list of Pipeline - The collection of fitted pipelines (samplers + trees)
- **classes_** : ndarray - The class labels
- **n_classes_** : int - The number of classes
- **estimator_weights_** : ndarray - Weights for each estimator in the boosted ensemble
- **estimator_errors_** : ndarray - Classification error for each estimator
- **feature_importances_** : ndarray - The feature importances (if supported by base estimator)

### Example Usage

```python
from imblearn.ensemble import RUSBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=4,
    weights=[0.2, 0.3, 0.5], random_state=0
)

# Use custom base estimator
base_estimator = DecisionTreeClassifier(max_depth=2)

# Train RUSBoost classifier
rusboost = RUSBoostClassifier(
    estimator=base_estimator,
    n_estimators=10,
    learning_rate=1.0,
    sampling_strategy="auto",
    random_state=0
)
rusboost.fit(X, y)

# Make predictions
y_pred = rusboost.predict(X)
y_proba = rusboost.predict_proba(X)

# Access ensemble information
print(f"Estimator weights: {rusboost.estimator_weights_}")
print(f"Estimator errors: {rusboost.estimator_errors_}")
```

## Algorithm Details and Relationships

### Relationship to Scikit-learn

All imbalanced-learn ensemble classifiers extend their corresponding scikit-learn base classes:

- **BalancedBaggingClassifier** extends `sklearn.ensemble.BaggingClassifier`
- **BalancedRandomForestClassifier** extends `sklearn.ensemble.RandomForestClassifier`
- **EasyEnsembleClassifier** extends `sklearn.ensemble.BaggingClassifier`
- **RUSBoostClassifier** extends `sklearn.ensemble.AdaBoostClassifier`

This inheritance ensures compatibility with scikit-learn's API while adding resampling capabilities.

### Resampling Integration

Each ensemble method integrates resampling differently:

1. **Bagging approaches** (BalancedBaggingClassifier, EasyEnsembleClassifier) apply resampling to each bootstrap sample before training individual estimators

2. **Random Forest** (BalancedRandomForestClassifier) applies resampling before constructing each tree, then optionally applies additional bootstrapping

3. **Boosting** (RUSBoostClassifier) applies resampling at each boosting iteration, ensuring balanced training data throughout the adaptive process

### Performance Considerations

- **BalancedRandomForestClassifier** often offers a good trade-off between predictive performance and training speed, and its trees can be built in parallel via `n_jobs`
- **RUSBoostClassifier** can be more sensitive to noise but often performs well on structured data
- **EasyEnsembleClassifier** provides good performance but requires more computational resources, because each of its members is itself an AdaBoost ensemble
- **BalancedBaggingClassifier** offers the most flexibility in base estimator selection

### Best Practices

1. **Start with BalancedRandomForestClassifier** for most imbalanced classification tasks
2. **Use `sampling_strategy="all"`** with `replacement=True` for BalancedRandomForestClassifier to follow the original algorithm (these are the defaults in the signature shown above)
3. **Consider RUSBoostClassifier** for problems where boosting has shown advantages
4. **Tune `n_estimators`** based on dataset size and computational constraints
5. **Use cross-validation** with appropriate metrics (balanced accuracy, F1-score, geometric mean) for model selection

### Integration with Pipelines

All ensemble classifiers can be used within scikit-learn pipelines:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.ensemble import BalancedRandomForestClassifier

# Create and split an imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', BalancedRandomForestClassifier(random_state=42))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```

This modular design enables easy integration into existing machine learning workflows while providing the benefits of balanced ensemble learning for imbalanced datasets.