Tessl Tile for pypi/imbalanced-learn@0.14.0

# Ensemble Methods for Imbalanced Learning

## Overview

Ensemble methods combine multiple base learners to improve classification performance beyond what individual models can achieve. However, traditional ensemble methods often struggle with imbalanced datasets where minority classes are underrepresented. The imbalanced-learn library provides specialized ensemble classifiers that integrate resampling techniques directly into the ensemble learning process.

These ensemble methods address class imbalance by applying resampling strategies during training, ensuring that each base learner in the ensemble receives balanced training data. This approach leads to improved performance on minority classes while maintaining overall classification accuracy.

The ensemble module includes four main approaches:

- **BalancedBaggingClassifier**: Applies random under-sampling to each bootstrap sample in bagging
- **BalancedRandomForestClassifier**: Integrates random under-sampling into random forest construction
- **EasyEnsembleClassifier**: Combines multiple balanced AdaBoost classifiers
- **RUSBoostClassifier**: Integrates random under-sampling directly into the AdaBoost algorithm

## BalancedBaggingClassifier

A bagging classifier with additional balancing that applies resampling to each bootstrap sample before training base estimators.

```python { .api }
class BalancedBaggingClassifier(BaggingClassifier):
    def __init__(
        self,
        estimator=None,
        n_estimators=10,
        *,
        max_samples=1.0,
        max_features=1.0,
        bootstrap=True,
        bootstrap_features=False,
        oob_score=False,
        warm_start=False,
        sampling_strategy="auto",
        replacement=False,
        n_jobs=None,
        random_state=None,
        verbose=0,
        sampler=None,
    )
```

### Parameters

- **estimator** : estimator object, default=None
  - The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a DecisionTreeClassifier
- **n_estimators** : int, default=10
  - The number of base estimators in the ensemble
- **max_samples** : int or float, default=1.0
  - The number of samples to draw from X to train each base estimator
- **max_features** : int or float, default=1.0
  - The number of features to draw from X to train each base estimator
- **bootstrap** : bool, default=True
  - Whether samples are drawn with replacement (applied after resampling)
- **bootstrap_features** : bool, default=False
  - Whether features are drawn with replacement
- **oob_score** : bool, default=False
  - Whether to use out-of-bag samples to estimate generalization error
- **warm_start** : bool, default=False
  - When set to True, reuse the solution of the previous call to fit
- **sampling_strategy** : float, str, dict, callable, default="auto"
  - Sampling information to resample the dataset
- **replacement** : bool, default=False
  - Whether to sample randomly with replacement when using RandomUnderSampler
- **n_jobs** : int, default=None
  - The number of jobs to run in parallel for both fit and predict
- **random_state** : int or RandomState, default=None
  - Controls the random seed given to each base estimator
- **verbose** : int, default=0
  - Controls the verbosity of the building process
- **sampler** : sampler object, default=None
  - The sampler used to balance the dataset before bootstrapping. By default, RandomUnderSampler is used

### Methods

```python { .api }
def fit(self, X, y):
    """Build a Bagging ensemble of estimators from the training set (X, y).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)

    Returns
    -------
    self : object
        Fitted estimator
    """

def predict(self, X):
    """Predict class for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities of the input samples
    """
```

### Attributes

- **estimator_** : estimator - The base estimator from which the ensemble is grown
- **estimators_** : list of estimators - The collection of fitted base estimators
- **sampler_** : sampler object - The validated sampler created from the sampler parameter
- **estimators_samples_** : list of ndarray - The subset of drawn samples for each base estimator
- **estimators_features_** : list of ndarray - The subset of drawn features for each base estimator
- **classes_** : ndarray - The class labels
- **n_classes_** : int or list - The number of classes
- **oob_score_** : float - Score using out-of-bag estimate (if oob_score=True)

### Example Usage

```python
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, n_features=20,
    n_clusters_per_class=1, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train balanced bagging classifier
bbc = BalancedBaggingClassifier(n_estimators=10, random_state=42)
bbc.fit(X_train, y_train)

# Make predictions
y_pred = bbc.predict(X_test)
y_proba = bbc.predict_proba(X_test)
```

## BalancedRandomForestClassifier

A balanced random forest classifier that applies random under-sampling to balance each bootstrap sample during forest construction.

```python { .api }
class BalancedRandomForestClassifier(RandomForestClassifier):
    def __init__(
        self,
        n_estimators=100,
        *,
        criterion="gini",
        max_depth=None,
        min_samples_split=2,
        min_samples_leaf=1,
        min_weight_fraction_leaf=0.0,
        max_features="sqrt",
        max_leaf_nodes=None,
        min_impurity_decrease=0.0,
        bootstrap=False,
        oob_score=False,
        sampling_strategy="all",
        replacement=True,
        n_jobs=None,
        random_state=None,
        verbose=0,
        warm_start=False,
        class_weight=None,
        ccp_alpha=0.0,
        max_samples=None,
        monotonic_cst=None,
    )
```

### Parameters

- **n_estimators** : int, default=100
  - The number of trees in the forest
- **criterion** : {"gini", "entropy"}, default="gini"
  - The function to measure the quality of a split
- **max_depth** : int, default=None
  - The maximum depth of the tree
- **min_samples_split** : int or float, default=2
  - The minimum number of samples required to split an internal node
- **min_samples_leaf** : int or float, default=1
  - The minimum number of samples required to be at a leaf node
- **min_weight_fraction_leaf** : float, default=0.0
  - The minimum weighted fraction of the sum total of weights required to be at a leaf node
- **max_features** : {"auto", "sqrt", "log2"}, int, float, or None, default="sqrt"
  - The number of features to consider when looking for the best split
- **max_leaf_nodes** : int, default=None
  - Grow trees with max_leaf_nodes in best-first fashion
- **min_impurity_decrease** : float, default=0.0
  - A node will be split if this split induces a decrease of impurity greater than or equal to this value
- **bootstrap** : bool, default=False
  - Whether bootstrap samples are used when building trees (applied after resampling)
- **oob_score** : bool, default=False
  - Whether to use out-of-bag samples to estimate generalization accuracy
- **sampling_strategy** : float, str, dict, callable, default="all"
  - Sampling information to resample the dataset
- **replacement** : bool, default=True
  - Whether to sample randomly with replacement or not
- **n_jobs** : int, default=None
  - The number of jobs to run in parallel
- **random_state** : int or RandomState, default=None
  - Controls both the randomness of the bootstrap and feature sampling
- **verbose** : int, default=0
  - Controls the verbosity of the tree building process
- **warm_start** : bool, default=False
  - When set to True, reuse the solution of the previous call to fit
- **class_weight** : dict, list of dicts, {"balanced", "balanced_subsample"}, default=None
  - Weights associated with classes
- **ccp_alpha** : non-negative float, default=0.0
  - Complexity parameter used for Minimal Cost-Complexity Pruning
- **max_samples** : int or float, default=None
  - The number of samples to draw from X to train each base estimator
- **monotonic_cst** : array-like of int, default=None
  - Indicates the monotonicity constraint to enforce on each feature

### Methods

```python { .api }
def fit(self, X, y, sample_weight=None):
    """Build a forest of trees from the training set (X, y).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,) or (n_samples, n_outputs)
        The target values (class labels)
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights

    Returns
    -------
    self : object
        The fitted instance
    """

def predict(self, X):
    """Predict class for samples in X.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """
```

### Attributes

- **estimator_** : DecisionTreeClassifier - The child estimator template used to create the collection
- **estimators_** : list of DecisionTreeClassifier - The collection of fitted sub-estimators
- **base_sampler_** : RandomUnderSampler - The base sampler used to construct subsequent samplers
- **samplers_** : list of RandomUnderSampler - The collection of fitted samplers
- **pipelines_** : list of Pipeline - The collection of fitted pipelines (samplers + trees)
- **classes_** : ndarray - The class labels
- **n_classes_** : int or list - The number of classes
- **feature_importances_** : ndarray - The feature importances
- **oob_score_** : float - Score using out-of-bag estimate (if oob_score=True)

### Example Usage

```python
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=4,
    weights=[0.2, 0.3, 0.5], random_state=0
)

# Train balanced random forest
brf = BalancedRandomForestClassifier(
    n_estimators=10,
    sampling_strategy="all",
    replacement=True,
    max_depth=2,
    random_state=0,
    bootstrap=False
)
brf.fit(X, y)

# Make predictions
y_pred = brf.predict(X)
feature_importances = brf.feature_importances_
```

## EasyEnsembleClassifier

Bag of balanced boosted learners, also known as EasyEnsemble. This classifier is an ensemble of AdaBoost learners, each trained on a different balanced bootstrap sample obtained by random under-sampling.

```python { .api }
class EasyEnsembleClassifier(BaggingClassifier):
    def __init__(
        self,
        n_estimators=10,
        estimator=None,
        *,
        warm_start=False,
        sampling_strategy="auto",
        replacement=False,
        n_jobs=None,
        random_state=None,
        verbose=0,
    )
```

### Parameters

- **n_estimators** : int, default=10
  - Number of AdaBoost learners in the ensemble
- **estimator** : estimator object, default=None
  - The base AdaBoost classifier used in the inner ensemble. If None, an AdaBoostClassifier is used. You can set the number of inner learners by passing your own instance
- **warm_start** : bool, default=False
  - When set to True, reuse the solution of the previous call to fit
- **sampling_strategy** : float, str, dict, callable, default="auto"
  - Sampling information to resample the dataset
- **replacement** : bool, default=False
  - Whether to sample randomly with replacement or not
- **n_jobs** : int, default=None
  - The number of jobs to run in parallel for both fit and predict
- **random_state** : int or RandomState, default=None
  - Controls the random seed given to each base estimator
- **verbose** : int, default=0
  - Controls the verbosity of the building process

### Methods

```python { .api }
def fit(self, X, y):
    """Build a Bagging ensemble of estimators from the training set (X, y).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)

    Returns
    -------
    self : object
        Fitted estimator
    """

def predict(self, X):
    """Predict class for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """
```

### Attributes

- **estimator_** : estimator - The base estimator from which the ensemble is grown
- **estimators_** : list of estimators - The collection of fitted base estimators
- **estimators_samples_** : list of arrays - The subset of drawn samples for each base estimator
- **estimators_features_** : list of arrays - The subset of drawn features for each base estimator
- **classes_** : ndarray - The class labels
- **n_classes_** : int or list - The number of classes

### Example Usage

```python
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, n_features=20,
    n_clusters_per_class=1, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create custom AdaBoost estimator
# Note: `algorithm` is deprecated in recent scikit-learn releases;
# omit it if your installed version warns about or rejects it
ada_estimator = AdaBoostClassifier(n_estimators=10, algorithm="SAMME")

# Train EasyEnsemble classifier
eec = EasyEnsembleClassifier(
    n_estimators=10,
    estimator=ada_estimator,
    random_state=42
)
eec.fit(X_train, y_train)

# Make predictions
y_pred = eec.predict(X_test)
y_proba = eec.predict_proba(X_test)
```

## RUSBoostClassifier

Random under-sampling integrated into the learning of AdaBoost. During learning, class imbalance is alleviated by randomly under-sampling the dataset at each iteration of the boosting algorithm.

```python { .api }
class RUSBoostClassifier(AdaBoostClassifier):
    def __init__(
        self,
        estimator=None,
        *,
        n_estimators=50,
        learning_rate=1.0,
        algorithm="deprecated",
        sampling_strategy="auto",
        replacement=False,
        random_state=None,
    )
```

### Parameters

- **estimator** : estimator object, default=None
  - The base estimator from which the boosted ensemble is built. If None, then DecisionTreeClassifier(max_depth=1)
- **n_estimators** : int, default=50
  - The maximum number of estimators at which boosting is terminated
- **learning_rate** : float, default=1.0
  - Learning rate shrinks the contribution of each classifier
- **algorithm** : {"SAMME", "SAMME.R"}, default="deprecated"
  - The boosting algorithm to use. SAMME.R is the real-valued boosting algorithm, SAMME the discrete one. The default value "deprecated" marks this parameter as deprecated, so it is best left unset in new code
- **sampling_strategy** : float, str, dict, callable, default="auto"
  - Sampling information to resample the dataset
- **replacement** : bool, default=False
  - Whether to sample randomly with replacement or not
- **random_state** : int or RandomState, default=None
  - Controls the random seed given to each base estimator

### Methods

```python { .api }
def fit(self, X, y, sample_weight=None):
    """Build a boosted classifier from the training set (X, y).

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The training input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights

    Returns
    -------
    self : object
        Returns self
    """

def predict(self, X):
    """Predict classes for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    y : ndarray of shape (n_samples,)
        The predicted classes
    """

def predict_proba(self, X):
    """Predict class probabilities for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples

    Returns
    -------
    p : ndarray of shape (n_samples, n_classes)
        The class probabilities
    """
```

### Attributes

- **estimator_** : estimator - The base estimator from which the ensemble is grown
- **estimators_** : list of classifiers - The collection of fitted sub-estimators
- **base_sampler_** : RandomUnderSampler - The base sampler used to generate subsequent samplers
- **samplers_** : list of RandomUnderSampler - The collection of fitted samplers
- **pipelines_** : list of Pipeline - The collection of fitted pipelines (samplers + trees)
- **classes_** : ndarray - The class labels
- **n_classes_** : int - The number of classes
- **estimator_weights_** : ndarray - Weights for each estimator in the boosted ensemble
- **estimator_errors_** : ndarray - Classification error for each estimator
- **feature_importances_** : ndarray - The feature importances (if supported by base estimator)

### Example Usage

```python
from imblearn.ensemble import RUSBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=4,
    weights=[0.2, 0.3, 0.5], random_state=0
)

# Use custom base estimator
base_estimator = DecisionTreeClassifier(max_depth=2)

# Train RUSBoost classifier
rusboost = RUSBoostClassifier(
    estimator=base_estimator,
    n_estimators=10,
    learning_rate=1.0,
    sampling_strategy="auto",
    random_state=0
)
rusboost.fit(X, y)

# Make predictions
y_pred = rusboost.predict(X)
y_proba = rusboost.predict_proba(X)

# Access ensemble information
print(f"Estimator weights: {rusboost.estimator_weights_}")
print(f"Estimator errors: {rusboost.estimator_errors_}")
```

## Algorithm Details and Relationships

### Relationship to Scikit-learn

All imbalanced-learn ensemble classifiers extend their corresponding scikit-learn base classes:

- **BalancedBaggingClassifier** extends `sklearn.ensemble.BaggingClassifier`
- **BalancedRandomForestClassifier** extends `sklearn.ensemble.RandomForestClassifier`
- **EasyEnsembleClassifier** extends `sklearn.ensemble.BaggingClassifier`
- **RUSBoostClassifier** extends `sklearn.ensemble.AdaBoostClassifier`

This inheritance ensures compatibility with scikit-learn's API while adding resampling capabilities.

### Resampling Integration

Each ensemble method integrates resampling differently:

1. **Bagging approaches** (BalancedBaggingClassifier, EasyEnsembleClassifier) apply resampling to each bootstrap sample before training individual estimators

2. **Random Forest** (BalancedRandomForestClassifier) applies resampling before constructing each tree, then optionally applies additional bootstrapping

3. **Boosting** (RUSBoostClassifier) applies resampling at each boosting iteration, ensuring balanced training data throughout the adaptive process

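### Performance Considerations

- **BalancedRandomForestClassifier** is often a good trade-off between predictive performance and training speed, since its trees are independent and can be trained in parallel via `n_jobs`
- **RUSBoostClassifier** can be more sensitive to noise, because boosting concentrates on misclassified samples, but it can perform well on structured data; its iterations are sequential and it exposes no `n_jobs` parameter
- **EasyEnsembleClassifier** provides good performance but requires more computational resources, since each of its members is itself a boosted ensemble
- **BalancedBaggingClassifier** offers the most flexibility, as both the base `estimator` and the `sampler` are configurable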

### Best Practices

1. **Start with BalancedRandomForestClassifier** for most imbalanced classification tasks
2. **Use `sampling_strategy="all"` with `replacement=True`** for BalancedRandomForestClassifier to follow the original algorithm (these are the defaults in the signature documented above)
3. **Consider RUSBoostClassifier** for problems where boosting has shown advantages
4. **Tune n_estimators** based on dataset size and computational constraints
5. **Use cross-validation** with appropriate metrics (balanced accuracy, F1-score, geometric mean) for model selection

### Integration with Pipelines

All ensemble classifiers can be used within scikit-learn pipelines:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.ensemble import BalancedRandomForestClassifier

# Illustrative imbalanced dataset; substitute your own X and y
X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', BalancedRandomForestClassifier(random_state=42))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```

This modular design enables easy integration into existing machine learning workflows while providing the benefits of balanced ensemble learning for imbalanced datasets.
