# Combination Methods

Combination methods in imbalanced-learn handle imbalanced datasets by sequentially applying both over-sampling and under-sampling techniques. These hybrid methods first generate synthetic samples to balance the dataset, then remove noisy or problematic samples to improve data quality.

## Overview

Combination methods work by:

1. **Over-sampling phase**: Generate synthetic samples using techniques like SMOTE to increase minority class representation
2. **Under-sampling phase**: Remove noisy, borderline, or problematic samples using cleaning techniques like Edited Nearest Neighbours or Tomek links removal

This two-step approach aims to achieve both balanced class distribution and improved data quality, potentially leading to better classifier performance than using either technique alone.

## Available Methods

The `imblearn.combine` module provides two main combination methods:

- **SMOTEENN**: Combines SMOTE over-sampling with Edited Nearest Neighbours cleaning
- **SMOTETomek**: Combines SMOTE over-sampling with Tomek links removal

Both methods follow the same general pattern: apply SMOTE first to generate synthetic samples, then apply a cleaning technique to remove noisy samples from the augmented dataset.
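
The composition can be written out by hand, which makes the pattern explicit. Below is a minimal sketch that chains the two underlying samplers directly; it mirrors what `SMOTEENN` does with its defaults (per the parameter documentation further down), though the combined class should be preferred in practice:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=0)

# Phase 1: over-sample the minority class with SMOTE
X_os, y_os = SMOTE(random_state=0).fit_resample(X, y)

# Phase 2: clean the augmented dataset with ENN
# (sampling_strategy='all' matches the default ENN used by SMOTEENN)
enn = EditedNearestNeighbours(sampling_strategy='all')
X_clean, y_clean = enn.fit_resample(X_os, y_os)

print(Counter(y), '->', Counter(y_os), '->', Counter(y_clean))
```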

---

## SMOTEENN

```python { .api }
class SMOTEENN(
    *,
    sampling_strategy="auto",
    random_state=None,
    smote=None,
    enn=None,
    n_jobs=None
)
```

Over-sampling using SMOTE and cleaning using Edited Nearest Neighbours.

This method combines the SMOTE over-sampling technique with Edited Nearest Neighbours (ENN) cleaning. It first applies SMOTE to generate synthetic samples for minority classes, then uses ENN to remove noisy samples from the resulting dataset.

### Parameters

- **sampling_strategy** : `float`, `str`, `dict` or `callable`, default=`'auto'`

  Sampling information to resample the data set.

  - When `float`, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M, where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.

    **Warning**: `float` is only available for **binary** classification. An error is raised for multi-class classification.

  - When `str`, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
    - `'minority'`: resample only the minority class
    - `'not minority'`: resample all classes but the minority class
    - `'not majority'`: resample all classes but the majority class
    - `'all'`: resample all classes
    - `'auto'`: equivalent to `'not majority'`

  - When `dict`, the keys correspond to the targeted classes and the values to the desired number of samples for each targeted class.

  - When `callable`, a function taking `y` and returning a `dict`. The keys correspond to the targeted classes and the values to the desired number of samples for each class. (See the sketch after this parameter list.)

- **random_state** : `int`, `RandomState` instance, default=`None`

  Control the randomization of the algorithm.

  - If `int`, `random_state` is the seed used by the random number generator
  - If `RandomState` instance, `random_state` is the random number generator
  - If `None`, the random number generator is the `RandomState` instance used by `np.random`

- **smote** : sampler object, default=`None`

  The `SMOTE` object to use. If not given, a `SMOTE` object with default parameters will be used.

- **enn** : sampler object, default=`None`

  The `EditedNearestNeighbours` object to use. If not given, an `EditedNearestNeighbours` object with `sampling_strategy='all'` will be used.

- **n_jobs** : `int`, default=`None`

  Number of CPU cores used during the cross-validation loop. `None` means 1 unless in a `joblib.parallel_backend` context. `-1` means using all processors.
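
To make the `sampling_strategy` variants concrete, here is a minimal sketch of the `float`, `dict`, and `callable` forms. The target counts used (600, half the majority count) are arbitrary illustrative choices, and the final class counts will also reflect the ENN cleaning step:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=0)
# Class 0 is the minority (~100 samples), class 1 the majority (~900)

# float (binary only): over-sample class 0 up to half the size of class 1
sme_float = SMOTEENN(sampling_strategy=0.5, random_state=0)

# dict: request an explicit number of samples for class 0
sme_dict = SMOTEENN(sampling_strategy={0: 600}, random_state=0)

# callable: derive the target counts from y at fit time
def half_of_majority(y):
    counts = Counter(y)
    return {0: counts[1] // 2}

sme_callable = SMOTEENN(sampling_strategy=half_of_majority, random_state=0)

for name, sampler in [('float', sme_float), ('dict', sme_dict),
                      ('callable', sme_callable)]:
    _, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```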

### Attributes

- **sampling_strategy_** : `dict`

  Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.

- **smote_** : sampler object

  The validated `SMOTE` instance.

- **enn_** : sampler object

  The validated `EditedNearestNeighbours` instance.

- **n_features_in_** : `int`

  Number of features in the input dataset.

- **feature_names_in_** : `ndarray` of shape `(n_features_in_,)`

  Names of features seen during `fit`. Defined only when `X` has feature names that are all strings.
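
These fitted attributes become available after calling `fit_resample`. A quick inspection sketch; the values shown in comments are indicative and depend on the data:

```python
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=0)

sme = SMOTEENN(random_state=0)
X_res, y_res = sme.fit_resample(X, y)

print(sme.sampling_strategy_)  # e.g. {0: ...}: samples to generate for class 0
print(sme.smote_)              # the validated SMOTE instance
print(sme.enn_)                # the validated EditedNearestNeighbours instance
print(sme.n_features_in_)      # 20 (make_classification's default n_features)
```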

### Methods

```python { .api }
def fit_resample(X, y, **params)
```

Resample the dataset.

**Parameters:**

- **X** : `{array-like, dataframe, sparse matrix}` of shape `(n_samples, n_features)`

  Matrix containing the data which have to be sampled.

- **y** : `array-like` of shape `(n_samples,)`

  Corresponding label for each sample in `X`.

- ****params** : `dict`

  Extra parameters to use by the sampler.

**Returns:**

- **X_resampled** : `{array-like, dataframe, sparse matrix}` of shape `(n_samples_new, n_features)`

  The array containing the resampled data.

- **y_resampled** : `array-like` of shape `(n_samples_new,)`

  The corresponding label of `X_resampled`.

### Example Usage

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# Create an imbalanced dataset
X, y = make_classification(
    n_classes=2,
    class_sep=2,
    weights=[0.1, 0.9],
    n_informative=3,
    n_redundant=1,
    flip_y=0,
    n_features=20,
    n_clusters_per_class=1,
    n_samples=1000,
    random_state=10
)

print('Original dataset shape:', Counter(y))
# Original dataset shape: Counter({1: 900, 0: 100})

# Apply SMOTEENN
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)

print('Resampled dataset shape:', Counter(y_res))
# Resampled dataset shape: Counter({0: 900, 1: 881})

# Using custom SMOTE and ENN parameters
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

custom_smote = SMOTE(k_neighbors=3, random_state=42)
custom_enn = EditedNearestNeighbours(n_neighbors=5, kind_sel='mode')

sme_custom = SMOTEENN(
    smote=custom_smote,
    enn=custom_enn,
    random_state=42
)
X_res_custom, y_res_custom = sme_custom.fit_resample(X, y)
```

### Notes

- The method was first presented in Batista et al. (2004)
- Supports multi-class resampling, following the schemes used by SMOTE and ENN
- The ENN cleaning step removes samples that are misclassified by their nearest neighbours, which helps remove both noisy samples and borderline cases created by SMOTE
- The final dataset is typically smaller than what SMOTE alone would produce, due to the cleaning step; the sketch below illustrates this
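
As a quick illustration of the last point, one can compare SMOTE alone against SMOTEENN on the same data. A minimal sketch; the exact counts depend on the dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=10)

X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
X_se, y_se = SMOTEENN(random_state=42).fit_resample(X, y)

# SMOTE balances both classes; the ENN step then removes noisy samples,
# so the SMOTEENN result is typically smaller overall
print('SMOTE:   ', Counter(y_sm))
print('SMOTEENN:', Counter(y_se))
```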

---

## SMOTETomek

```python { .api }
class SMOTETomek(
    *,
    sampling_strategy="auto",
    random_state=None,
    smote=None,
    tomek=None,
    n_jobs=None
)
```

Over-sampling using SMOTE and cleaning using Tomek links.

This method combines the SMOTE over-sampling technique with Tomek links removal. It first applies SMOTE to generate synthetic samples for minority classes, then removes Tomek links (pairs of mutual nearest neighbours belonging to different classes) from the resulting dataset.
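
For intuition, a Tomek link is a pair of samples from different classes that are each other's nearest neighbour. A minimal NumPy/scikit-learn sketch of the definition, illustrative only and not the library's implementation:

```python
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_classification

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=200, random_state=0)

# Nearest neighbour of each sample (index 0 is the sample itself)
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X)
nearest = idx[:, 1]

# (i, j) is a Tomek link if i and j are mutual nearest neighbours
# and carry different labels; i < j deduplicates the pairs
links = [(i, j) for i, j in enumerate(nearest)
         if nearest[j] == i and y[i] != y[j] and i < j]
print('Tomek links found:', links)
```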

### Parameters

- **sampling_strategy** : `float`, `str`, `dict` or `callable`, default=`'auto'`

  Sampling information to resample the data set.

  - When `float`, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M, where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.

    **Warning**: `float` is only available for **binary** classification. An error is raised for multi-class classification.

  - When `str`, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
    - `'minority'`: resample only the minority class
    - `'not minority'`: resample all classes but the minority class
    - `'not majority'`: resample all classes but the majority class
    - `'all'`: resample all classes
    - `'auto'`: equivalent to `'not majority'`

  - When `dict`, the keys correspond to the targeted classes and the values to the desired number of samples for each targeted class.

  - When `callable`, a function taking `y` and returning a `dict`. The keys correspond to the targeted classes and the values to the desired number of samples for each class.

- **random_state** : `int`, `RandomState` instance, default=`None`

  Control the randomization of the algorithm.

  - If `int`, `random_state` is the seed used by the random number generator
  - If `RandomState` instance, `random_state` is the random number generator
  - If `None`, the random number generator is the `RandomState` instance used by `np.random`

- **smote** : sampler object, default=`None`

  The `SMOTE` object to use. If not given, a `SMOTE` object with default parameters will be used.

- **tomek** : sampler object, default=`None`

  The `TomekLinks` object to use. If not given, a `TomekLinks` object with `sampling_strategy='all'` will be used.

- **n_jobs** : `int`, default=`None`

  Number of CPU cores used during the cross-validation loop. `None` means 1 unless in a `joblib.parallel_backend` context. `-1` means using all processors.

### Attributes

- **sampling_strategy_** : `dict`

  Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.

- **smote_** : sampler object

  The validated `SMOTE` instance.

- **tomek_** : sampler object

  The validated `TomekLinks` instance.

- **n_features_in_** : `int`

  Number of features in the input dataset.

- **feature_names_in_** : `ndarray` of shape `(n_features_in_,)`

  Names of features seen during `fit`. Defined only when `X` has feature names that are all strings.

### Methods

```python { .api }
def fit_resample(X, y, **params)
```

Resample the dataset.

**Parameters:**

- **X** : `{array-like, dataframe, sparse matrix}` of shape `(n_samples, n_features)`

  Matrix containing the data which have to be sampled.

- **y** : `array-like` of shape `(n_samples,)`

  Corresponding label for each sample in `X`.

- ****params** : `dict`

  Extra parameters to use by the sampler.

**Returns:**

- **X_resampled** : `{array-like, dataframe, sparse matrix}` of shape `(n_samples_new, n_features)`

  The array containing the resampled data.

- **y_resampled** : `array-like` of shape `(n_samples_new,)`

  The corresponding label of `X_resampled`.

### Example Usage

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

# Create an imbalanced dataset
X, y = make_classification(
    n_classes=2,
    class_sep=2,
    weights=[0.1, 0.9],
    n_informative=3,
    n_redundant=1,
    flip_y=0,
    n_features=20,
    n_clusters_per_class=1,
    n_samples=1000,
    random_state=10
)

print('Original dataset shape:', Counter(y))
# Original dataset shape: Counter({1: 900, 0: 100})

# Apply SMOTETomek
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)

print('Resampled dataset shape:', Counter(y_res))
# Resampled dataset shape: Counter({0: 900, 1: 900})

# Using custom SMOTE and Tomek parameters
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

custom_smote = SMOTE(k_neighbors=5, random_state=42)
custom_tomek = TomekLinks(sampling_strategy='majority')

smt_custom = SMOTETomek(
    smote=custom_smote,
    tomek=custom_tomek,
    random_state=42
)
X_res_custom, y_res_custom = smt_custom.fit_resample(X, y)
```

### Notes

- The method was first presented in Batista et al. (2003)
- Supports multi-class resampling, following the schemes used by SMOTE and TomekLinks
- Tomek links removal focuses on cleaning the decision boundary by removing ambiguous samples
- Generally preserves more samples than SMOTEENN, since Tomek links removal is less aggressive than ENN

---

## Comparison: SMOTEENN vs SMOTETomek

| Aspect | SMOTEENN | SMOTETomek |
|--------|----------|------------|
| **Cleaning Method** | Edited Nearest Neighbours | Tomek Links |
| **Cleaning Aggressiveness** | More aggressive | Less aggressive |
| **Typical Sample Reduction** | Higher | Lower |
| **Focus** | Removes misclassified samples | Removes ambiguous boundary samples |
| **Best Use Case** | Noisy datasets | Relatively clean datasets with class overlap |

### When to Use Each Method

**Use SMOTEENN when:**
- Your dataset contains significant noise
- You want more aggressive cleaning
- Class boundaries are poorly defined
- You can afford to lose more samples for better quality

**Use SMOTETomek when:**
- Your dataset is relatively clean
- You want to preserve more samples
- You need to clean decision boundaries
- Class overlap is the main issue
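
To choose empirically, resample with both methods and compare the resulting class counts. A minimal sketch; the counts depend on the dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=10)

for sampler in (SMOTEENN(random_state=42), SMOTETomek(random_state=42)):
    _, y_res = sampler.fit_resample(X, y)
    # SMOTEENN typically returns fewer samples than SMOTETomek
    print(type(sampler).__name__, Counter(y_res))
```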

### Algorithm Workflow

Both methods follow the same general workflow:

1. **Input**: Imbalanced dataset (X, y)
2. **SMOTE Phase**: Apply SMOTE over-sampling to generate synthetic minority class samples
3. **Cleaning Phase**:
   - SMOTEENN: Apply ENN to remove misclassified samples
   - SMOTETomek: Remove Tomek links from the dataset
4. **Output**: Balanced and cleaned dataset

This sequential approach ensures that the benefits of both techniques are realized: balanced class distribution from SMOTE and improved data quality from the cleaning step.
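
Because resampling should be applied only to training data, a combination method is typically wrapped in an imblearn `Pipeline`, which resamples during `fit` but leaves test data untouched at prediction time. A minimal sketch; the random-forest classifier and `balanced_accuracy` scoring are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=10)

# Resampling happens inside each training fold only, avoiding leakage
model = Pipeline([
    ('resample', SMOTEENN(random_state=42)),
    ('classify', RandomForestClassifier(random_state=42)),
])
scores = cross_val_score(model, X, y, cv=5, scoring='balanced_accuracy')
print(scores.mean())
```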