# Combination Methods

Combination methods in imbalanced-learn handle imbalanced datasets by sequentially applying both over-sampling and under-sampling techniques. These hybrid methods first generate synthetic samples to balance the dataset, then remove noisy or problematic samples to improve data quality.

## Overview

Combination methods work by:

1. **Over-sampling phase**: Generate synthetic samples using techniques like SMOTE to increase minority class representation
2. **Under-sampling phase**: Remove noisy, borderline, or problematic samples using cleaning techniques like Edited Nearest Neighbours or Tomek links removal

This two-step approach aims to achieve both balanced class distribution and improved data quality, potentially leading to better classifier performance than using either technique alone.

## Available Methods

The `imblearn.combine` module provides two main combination methods:

- **SMOTEENN**: Combines SMOTE over-sampling with Edited Nearest Neighbours cleaning
- **SMOTETomek**: Combines SMOTE over-sampling with Tomek links removal

Both methods follow the same general pattern: apply SMOTE first to generate synthetic samples, then apply a cleaning technique to remove noisy samples from the augmented dataset.
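
The composition can be written out by hand, which makes the pattern explicit. Below is a minimal sketch that chains the two underlying samplers directly; it mirrors what `SMOTEENN` does with its defaults (per the parameter documentation further down), though the combined class should be preferred in practice:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=0)

# Phase 1: over-sample the minority class with SMOTE
X_os, y_os = SMOTE(random_state=0).fit_resample(X, y)

# Phase 2: clean the augmented dataset with ENN
# (sampling_strategy='all' matches the default ENN used by SMOTEENN)
enn = EditedNearestNeighbours(sampling_strategy='all')
X_clean, y_clean = enn.fit_resample(X_os, y_os)

print(Counter(y), '->', Counter(y_os), '->', Counter(y_clean))
```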

---

## SMOTEENN

```python { .api }
class SMOTEENN(
    *,
    sampling_strategy="auto",
    random_state=None,
    smote=None,
    enn=None,
    n_jobs=None
)
```

Over-sampling using SMOTE and cleaning using Edited Nearest Neighbours.

This method combines the SMOTE over-sampling technique with Edited Nearest Neighbours (ENN) cleaning. It first applies SMOTE to generate synthetic samples for minority classes, then uses ENN to remove noisy samples from the resulting dataset.

### Parameters

- **sampling_strategy** : `float`, `str`, `dict` or `callable`, default=`'auto'`

  Sampling information to resample the data set.

  - When `float`, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M, where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.

    **Warning**: `float` is only available for **binary** classification. An error is raised for multi-class classification.

  - When `str`, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
    - `'minority'`: resample only the minority class
    - `'not minority'`: resample all classes but the minority class
    - `'not majority'`: resample all classes but the majority class
    - `'all'`: resample all classes
    - `'auto'`: equivalent to `'not majority'`

  - When `dict`, the keys correspond to the targeted classes and the values to the desired number of samples for each targeted class.

  - When `callable`, a function taking `y` and returning a `dict`. The keys correspond to the targeted classes and the values to the desired number of samples for each class. (See the sketch after this parameter list.)

- **random_state** : `int`, `RandomState` instance, default=`None`

  Control the randomization of the algorithm.

  - If `int`, `random_state` is the seed used by the random number generator
  - If `RandomState` instance, `random_state` is the random number generator
  - If `None`, the random number generator is the `RandomState` instance used by `np.random`

- **smote** : sampler object, default=`None`

  The `SMOTE` object to use. If not given, a `SMOTE` object with default parameters will be used.

- **enn** : sampler object, default=`None`

  The `EditedNearestNeighbours` object to use. If not given, an `EditedNearestNeighbours` object with `sampling_strategy='all'` will be used.

- **n_jobs** : `int`, default=`None`

  Number of CPU cores used during the cross-validation loop. `None` means 1 unless in a `joblib.parallel_backend` context. `-1` means using all processors.
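
To make the `sampling_strategy` variants concrete, here is a minimal sketch of the `float`, `dict`, and `callable` forms. The target counts used (600, half the majority count) are arbitrary illustrative choices, and the final class counts will also reflect the ENN cleaning step:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=0)
# Class 0 is the minority (~100 samples), class 1 the majority (~900)

# float (binary only): over-sample class 0 up to half the size of class 1
sme_float = SMOTEENN(sampling_strategy=0.5, random_state=0)

# dict: request an explicit number of samples for class 0
sme_dict = SMOTEENN(sampling_strategy={0: 600}, random_state=0)

# callable: derive the target counts from y at fit time
def half_of_majority(y):
    counts = Counter(y)
    return {0: counts[1] // 2}

sme_callable = SMOTEENN(sampling_strategy=half_of_majority, random_state=0)

for name, sampler in [('float', sme_float), ('dict', sme_dict),
                      ('callable', sme_callable)]:
    _, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```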

### Attributes

- **sampling_strategy_** : `dict`

  Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.

- **smote_** : sampler object

  The validated `SMOTE` instance.

- **enn_** : sampler object

  The validated `EditedNearestNeighbours` instance.

- **n_features_in_** : `int`

  Number of features in the input dataset.

- **feature_names_in_** : `ndarray` of shape `(n_features_in_,)`

  Names of features seen during `fit`. Defined only when `X` has feature names that are all strings.
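
These fitted attributes become available after calling `fit_resample`. A quick inspection sketch; the values shown in comments are indicative and depend on the data:

```python
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=0)

sme = SMOTEENN(random_state=0)
X_res, y_res = sme.fit_resample(X, y)

print(sme.sampling_strategy_)  # e.g. {0: ...}: samples to generate for class 0
print(sme.smote_)              # the validated SMOTE instance
print(sme.enn_)                # the validated EditedNearestNeighbours instance
print(sme.n_features_in_)      # 20 (make_classification's default n_features)
```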

### Methods

```python { .api }
def fit_resample(X, y, **params)
```

Resample the dataset.

**Parameters:**

- **X** : `{array-like, dataframe, sparse matrix}` of shape `(n_samples, n_features)`

  Matrix containing the data which have to be sampled.

- **y** : `array-like` of shape `(n_samples,)`

  Corresponding label for each sample in `X`.

- ****params** : `dict`

  Extra parameters to use by the sampler.

**Returns:**

- **X_resampled** : `{array-like, dataframe, sparse matrix}` of shape `(n_samples_new, n_features)`

  The array containing the resampled data.

- **y_resampled** : `array-like` of shape `(n_samples_new,)`

  The corresponding label of `X_resampled`.

### Example Usage

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# Create an imbalanced dataset
X, y = make_classification(
    n_classes=2,
    class_sep=2,
    weights=[0.1, 0.9],
    n_informative=3,
    n_redundant=1,
    flip_y=0,
    n_features=20,
    n_clusters_per_class=1,
    n_samples=1000,
    random_state=10
)

print('Original dataset shape:', Counter(y))
# Original dataset shape: Counter({1: 900, 0: 100})

# Apply SMOTEENN
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)

print('Resampled dataset shape:', Counter(y_res))
# Resampled dataset shape: Counter({0: 900, 1: 881})

# Using custom SMOTE and ENN parameters
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

custom_smote = SMOTE(k_neighbors=3, random_state=42)
custom_enn = EditedNearestNeighbours(n_neighbors=5, kind_sel='mode')

sme_custom = SMOTEENN(
    smote=custom_smote,
    enn=custom_enn,
    random_state=42
)
X_res_custom, y_res_custom = sme_custom.fit_resample(X, y)
```

### Notes

- The method was first presented in Batista et al. (2004)
- Supports multi-class resampling, following the schemes used by SMOTE and ENN
- The ENN cleaning step removes samples that are misclassified by their nearest neighbours, which helps remove both noisy samples and borderline cases created by SMOTE
- The final dataset is typically smaller than what SMOTE alone would produce, due to the cleaning step; the sketch below illustrates this
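
As a quick illustration of the last point, one can compare SMOTE alone against SMOTEENN on the same data. A minimal sketch; the exact counts depend on the dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=10)

X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
X_se, y_se = SMOTEENN(random_state=42).fit_resample(X, y)

# SMOTE balances both classes; the ENN step then removes noisy samples,
# so the SMOTEENN result is typically smaller overall
print('SMOTE:   ', Counter(y_sm))
print('SMOTEENN:', Counter(y_se))
```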

---

## SMOTETomek

```python { .api }
class SMOTETomek(
    *,
    sampling_strategy="auto",
    random_state=None,
    smote=None,
    tomek=None,
    n_jobs=None
)
```

Over-sampling using SMOTE and cleaning using Tomek links.

This method combines the SMOTE over-sampling technique with Tomek links removal. It first applies SMOTE to generate synthetic samples for minority classes, then removes Tomek links (pairs of mutual nearest neighbours belonging to different classes) from the resulting dataset.
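
For intuition, a Tomek link is a pair of samples from different classes that are each other's nearest neighbour. A minimal NumPy/scikit-learn sketch of the definition, illustrative only and not the library's implementation:

```python
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_classification

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=200, random_state=0)

# Nearest neighbour of each sample (index 0 is the sample itself)
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X)
nearest = idx[:, 1]

# (i, j) is a Tomek link if i and j are mutual nearest neighbours
# and carry different labels; i < j deduplicates the pairs
links = [(i, j) for i, j in enumerate(nearest)
         if nearest[j] == i and y[i] != y[j] and i < j]
print('Tomek links found:', links)
```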

### Parameters

- **sampling_strategy** : `float`, `str`, `dict` or `callable`, default=`'auto'`

  Sampling information to resample the data set.

  - When `float`, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M, where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.

    **Warning**: `float` is only available for **binary** classification. An error is raised for multi-class classification.

  - When `str`, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
    - `'minority'`: resample only the minority class
    - `'not minority'`: resample all classes but the minority class
    - `'not majority'`: resample all classes but the majority class
    - `'all'`: resample all classes
    - `'auto'`: equivalent to `'not majority'`

  - When `dict`, the keys correspond to the targeted classes and the values to the desired number of samples for each targeted class.

  - When `callable`, a function taking `y` and returning a `dict`. The keys correspond to the targeted classes and the values to the desired number of samples for each class.

- **random_state** : `int`, `RandomState` instance, default=`None`

  Control the randomization of the algorithm.

  - If `int`, `random_state` is the seed used by the random number generator
  - If `RandomState` instance, `random_state` is the random number generator
  - If `None`, the random number generator is the `RandomState` instance used by `np.random`

- **smote** : sampler object, default=`None`

  The `SMOTE` object to use. If not given, a `SMOTE` object with default parameters will be used.

- **tomek** : sampler object, default=`None`

  The `TomekLinks` object to use. If not given, a `TomekLinks` object with `sampling_strategy='all'` will be used.

- **n_jobs** : `int`, default=`None`

  Number of CPU cores used during the cross-validation loop. `None` means 1 unless in a `joblib.parallel_backend` context. `-1` means using all processors.

### Attributes

- **sampling_strategy_** : `dict`

  Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.

- **smote_** : sampler object

  The validated `SMOTE` instance.

- **tomek_** : sampler object

  The validated `TomekLinks` instance.

- **n_features_in_** : `int`

  Number of features in the input dataset.

- **feature_names_in_** : `ndarray` of shape `(n_features_in_,)`

  Names of features seen during `fit`. Defined only when `X` has feature names that are all strings.

### Methods

```python { .api }
def fit_resample(X, y, **params)
```

Resample the dataset.

**Parameters:**

- **X** : `{array-like, dataframe, sparse matrix}` of shape `(n_samples, n_features)`

  Matrix containing the data which have to be sampled.

- **y** : `array-like` of shape `(n_samples,)`

  Corresponding label for each sample in `X`.

- ****params** : `dict`

  Extra parameters to use by the sampler.

**Returns:**

- **X_resampled** : `{array-like, dataframe, sparse matrix}` of shape `(n_samples_new, n_features)`

  The array containing the resampled data.

- **y_resampled** : `array-like` of shape `(n_samples_new,)`

  The corresponding label of `X_resampled`.

### Example Usage

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

# Create an imbalanced dataset
X, y = make_classification(
    n_classes=2,
    class_sep=2,
    weights=[0.1, 0.9],
    n_informative=3,
    n_redundant=1,
    flip_y=0,
    n_features=20,
    n_clusters_per_class=1,
    n_samples=1000,
    random_state=10
)

print('Original dataset shape:', Counter(y))
# Original dataset shape: Counter({1: 900, 0: 100})

# Apply SMOTETomek
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)

print('Resampled dataset shape:', Counter(y_res))
# Resampled dataset shape: Counter({0: 900, 1: 900})

# Using custom SMOTE and Tomek parameters
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

custom_smote = SMOTE(k_neighbors=5, random_state=42)
custom_tomek = TomekLinks(sampling_strategy='majority')

smt_custom = SMOTETomek(
    smote=custom_smote,
    tomek=custom_tomek,
    random_state=42
)
X_res_custom, y_res_custom = smt_custom.fit_resample(X, y)
```

### Notes

- The method was first presented in Batista et al. (2003)
- Supports multi-class resampling, following the schemes used by SMOTE and TomekLinks
- Tomek links removal focuses on cleaning the decision boundary by removing ambiguous samples
- Generally preserves more samples than SMOTEENN, since Tomek links removal is less aggressive than ENN

---

## Comparison: SMOTEENN vs SMOTETomek

| Aspect | SMOTEENN | SMOTETomek |
|--------|----------|------------|
| **Cleaning Method** | Edited Nearest Neighbours | Tomek Links |
| **Cleaning Aggressiveness** | More aggressive | Less aggressive |
| **Typical Sample Reduction** | Higher | Lower |
| **Focus** | Removes misclassified samples | Removes ambiguous boundary samples |
| **Best Use Case** | Noisy datasets | Relatively clean datasets with class overlap |

### When to Use Each Method

**Use SMOTEENN when:**
- Your dataset contains significant noise
- You want more aggressive cleaning
- Class boundaries are poorly defined
- You can afford to lose more samples for better quality

**Use SMOTETomek when:**
- Your dataset is relatively clean
- You want to preserve more samples
- You need to clean decision boundaries
- Class overlap is the main issue
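
To choose empirically, resample with both methods and compare the resulting class counts. A minimal sketch; the counts depend on the dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=10)

for sampler in (SMOTEENN(random_state=42), SMOTETomek(random_state=42)):
    _, y_res = sampler.fit_resample(X, y)
    # SMOTEENN typically returns fewer samples than SMOTETomek
    print(type(sampler).__name__, Counter(y_res))
```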

### Algorithm Workflow

Both methods follow the same general workflow:

1. **Input**: Imbalanced dataset (X, y)
2. **SMOTE Phase**: Apply SMOTE over-sampling to generate synthetic minority class samples
3. **Cleaning Phase**:
   - SMOTEENN: Apply ENN to remove misclassified samples
   - SMOTETomek: Remove Tomek links from the dataset
4. **Output**: Balanced and cleaned dataset

This sequential approach ensures that the benefits of both techniques are realized: balanced class distribution from SMOTE and improved data quality from the cleaning step.
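
Because resampling should be applied only to training data, a combination method is typically wrapped in an imblearn `Pipeline`, which resamples during `fit` but leaves test data untouched at prediction time. A minimal sketch; the random-forest classifier and `balanced_accuracy` scoring are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=10)

# Resampling happens inside each training fold only, avoiding leakage
model = Pipeline([
    ('resample', SMOTEENN(random_state=42)),
    ('classify', RandomForestClassifier(random_state=42)),
])
scores = cross_val_score(model, X, y, cv=5, scoring='balanced_accuracy')
print(scores.mean())
```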