# Under-Sampling Methods

Under-sampling methods reduce the size of the majority class(es) to address class imbalance. These techniques remove samples from the dataset, either randomly or using intelligent selection criteria to preserve important boundary information.

## Categories of Under-Sampling Methods

### Random Under-Sampling
Methods that randomly select samples to remove from majority classes.

### Prototype Generation
Methods that generate new synthetic samples to represent the original data distribution.

### Prototype Selection
Methods that intelligently select which samples to keep based on neighborhood analysis, distance metrics, or classification difficulty.

### Neighborhood Cleaning
Methods that remove noisy samples or samples that negatively affect classification performance.

---

## Random Under-Sampling

### RandomUnderSampler

Random under-sampling of majority class samples with or without replacement.

```python { .api }
class RandomUnderSampler:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        replacement=False
    ):
        ...
```

**Parameters:**
- `sampling_strategy` (float, str, dict, callable): Strategy to control sampling. Default is "auto".
- `random_state` (int, RandomState, None): Random number generator seed for reproducibility.
- `replacement` (bool): Whether sampling is with or without replacement. Default is False.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information per class.
- `sample_indices_` (ndarray): Indices of selected samples.
- `n_features_in_` (int): Number of input features.
- `feature_names_in_` (ndarray): Names of input features when available.

**Methods:**
- `fit_resample(X, y)`: Fit the sampler and resample the dataset.

**Usage Example:**
```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Build an imbalanced toy dataset (about 90% / 10%); the later examples reuse X and y
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Create random under-sampler
rus = RandomUnderSampler(random_state=42)

# Apply under-sampling
X_resampled, y_resampled = rus.fit_resample(X, y)
print(f"Original: {Counter(y)}")
print(f"Resampled: {Counter(y_resampled)}")
```
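
To make the mechanism concrete, the following is a minimal standard-library sketch of random under-sampling without replacement. It is an illustration only, not imbalanced-learn's implementation: the helper `random_under_sample` and the toy data are hypothetical, and the sketch always shrinks every class to the size of the smallest one. The returned index list plays the role that `sample_indices_` plays on the real sampler.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Keep a random subset of every class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_keep = min(counts.values())
    kept = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Draw without replacement, analogous to replacement=False
        kept.extend(rng.sample(idx, n_keep))
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept], kept


# Hypothetical toy data: 8 majority samples (label 0), 2 minority samples (label 1)
X_demo = [[float(i)] for i in range(10)]
y_demo = [0] * 8 + [1] * 2

X_res, y_res, kept_idx = random_under_sample(X_demo, y_demo, seed=42)
print(Counter(y_res))  # both classes now have the same number of samples
print(kept_idx)        # positions of the retained rows in the original data
```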

---

## Prototype Generation

### ClusterCentroids

Under-sample by generating centroids based on clustering methods. Replaces clusters of majority samples with their centroids.

```python { .api }
class ClusterCentroids:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        estimator=None,
        voting="auto"
    ):
        ...
```

**Parameters:**
- `sampling_strategy` (float, str, dict, callable): Strategy to control sampling. Default is "auto".
- `random_state` (int, RandomState, None): Random number generator seed.
- `estimator` (estimator object): Clustering estimator with `n_clusters` parameter and `cluster_centers_` attribute. Defaults to KMeans.
- `voting` (str): Voting strategy for generating new samples:
  - "hard": Use the original samples nearest to the centroids
  - "soft": Use the centroids directly
  - "auto": Use "hard" for sparse input and "soft" otherwise

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information per class.
- `estimator_` (estimator object): The validated clustering estimator.
- `voting_` (str): The validated voting strategy.
- `n_features_in_` (int): Number of input features.
- `feature_names_in_` (ndarray): Names of input features when available.

**Methods:**
- `fit_resample(X, y)`: Fit the sampler and resample the dataset.

**Usage Example:**
```python
from imblearn.under_sampling import ClusterCentroids
from sklearn.cluster import MiniBatchKMeans

# Create cluster centroids sampler with custom estimator
cc = ClusterCentroids(
    estimator=MiniBatchKMeans(n_init=1, random_state=0),
    random_state=42
)

# Apply cluster-based under-sampling
X_resampled, y_resampled = cc.fit_resample(X, y)
```
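
The difference between "soft" and "hard" voting can be sketched for a single cluster with the standard library alone. This is a simplified illustration under stated assumptions, not the library's code: the cluster membership is taken as given (the real sampler obtains it from the clustering estimator), and `centroid` and `represent_cluster` are hypothetical helper names.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def represent_cluster(points, voting="soft"):
    """Return one representative for a cluster of majority samples."""
    c = centroid(points)
    if voting == "soft":
        # Soft voting: the centroid itself, a synthetic point
        return c
    # Hard voting: the original sample closest to the centroid
    return min(points, key=lambda p: math.dist(p, c))


# Hypothetical cluster of four majority samples
cluster = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0], [4.0, 1.0]]

print(represent_cluster(cluster, voting="soft"))  # a new point, not in the data
print(represent_cluster(cluster, voting="hard"))  # one of the original samples
```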

---

## Prototype Selection Methods

### NearMiss

Under-sample with the NearMiss heuristics, which select majority samples based on their distance to minority class samples.

```python { .api }
class NearMiss:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        version=1,
        n_neighbors=3,
        n_neighbors_ver3=3,
        n_jobs=None
    ):
        ...
```

**Parameters:**
- `sampling_strategy` (float, str, dict, callable): Strategy to control sampling.
- `version` (int): NearMiss version (1, 2, or 3):
  - Version 1: Select majority samples with the smallest average distance to their closest minority class samples
  - Version 2: Select majority samples with the smallest average distance to their farthest minority class samples
  - Version 3: Two-step process: pre-select the majority neighbors of each minority sample, then keep those with the largest average distance to their nearest minority samples
- `n_neighbors` (int, estimator): Number of neighbors or KNN estimator.
- `n_neighbors_ver3` (int, estimator): Number of neighbors for version 3 pre-selection.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `nn_ver3_` (estimator object): K-nearest neighbors estimator for version 3.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import NearMiss

# NearMiss version 1 (select closest to minority)
nm1 = NearMiss(version=1)
X_res1, y_res1 = nm1.fit_resample(X, y)

# NearMiss version 3 (two-step selection)
nm3 = NearMiss(version=3, n_neighbors=3, n_neighbors_ver3=3)
X_res3, y_res3 = nm3.fit_resample(X, y)
```
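
The version 1 rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: `near_miss_1` is a hypothetical helper, the data are one-dimensional toy values, and distances are computed by brute force. Taking `dists[-n_neighbors:]` instead of `dists[:n_neighbors]` would give a version 2 style score based on the farthest minority samples.

```python
import math


def near_miss_1(X_maj, X_min, n_keep, n_neighbors=3):
    """Indices of the n_keep majority samples with the smallest mean
    distance to their n_neighbors nearest minority samples."""
    scores = []
    for i, x in enumerate(X_maj):
        dists = sorted(math.dist(x, m) for m in X_min)
        nearest = dists[:n_neighbors]
        scores.append((sum(nearest) / len(nearest), i))
    scores.sort()  # smallest mean distance first
    return sorted(i for _, i in scores[:n_keep])


# Hypothetical 1-D toy data
X_min = [[0.0], [1.0], [2.0]]
X_maj = [[3.0], [10.0], [4.0], [20.0]]

kept = near_miss_1(X_maj, X_min, n_keep=2, n_neighbors=2)
print(kept)  # indices of the majority samples that sit closest to the minority class
```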

### InstanceHardnessThreshold

Under-sample by removing the samples that a cross-validated classifier predicts with the lowest probability for their true class (the "hardest" instances).

```python { .api }
class InstanceHardnessThreshold:
    def __init__(
        self,
        *,
        estimator=None,
        sampling_strategy="auto",
        random_state=None,
        cv=5,
        n_jobs=None
    ):
        ...
```

**Parameters:**
- `estimator` (estimator object): Classifier with `predict_proba` method. Defaults to RandomForestClassifier.
- `sampling_strategy` (float, str, dict, callable): Strategy to control sampling.
- `random_state` (int, RandomState, None): Random number generator seed.
- `cv` (int): Number of cross-validation folds for hardness estimation.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `estimator_` (estimator object): The validated classifier.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import InstanceHardnessThreshold
from sklearn.ensemble import RandomForestClassifier

# Use custom classifier for hardness estimation
iht = InstanceHardnessThreshold(
    estimator=RandomForestClassifier(n_estimators=50),
    cv=3,
    random_state=42
)
X_resampled, y_resampled = iht.fit_resample(X, y)
```

### TomekLinks

Under-sample by removing Tomek links: pairs of samples from different classes that are each other's nearest neighbor. With the default strategy, only the member of each pair that does not belong to the minority class is removed.

```python { .api }
class TomekLinks:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_jobs=None
    ):
        ...
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control which classes to clean.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `sample_indices_` (ndarray): Indices of selected samples.

**Methods:**
- `fit_resample(X, y)`: Remove Tomek links from the dataset.
- `is_tomek(y, nn_index, class_type)`: Static method to detect Tomek pairs.

**Usage Example:**
```python
from imblearn.under_sampling import TomekLinks

# Remove Tomek links (noisy border samples)
tl = TomekLinks()
X_cleaned, y_cleaned = tl.fit_resample(X, y)
print(f"Removed {len(y) - len(y_cleaned)} samples that were part of Tomek links")
```
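
As a rough illustration of what counts as a Tomek link, the sketch below finds mutual nearest neighbors with different labels using brute-force distances, then drops only the majority-class member of each pair. It is a simplified stand-in for the library routine, and the helper names `nearest_neighbor` and `tomek_links` as well as the toy data are hypothetical.

```python
import math
from collections import Counter


def nearest_neighbor(i, X):
    """Index of the sample closest to X[i], excluding i itself."""
    others = (j for j in range(len(X)) if j != i)
    return min(others, key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


# Hypothetical 1-D toy data: class 0 is the majority
X_toy = [[0.0], [0.6], [1.0], [1.3], [3.0]]
y_toy = [0, 0, 0, 1, 1]

links = tomek_links(X_toy, y_toy)

# Drop only the majority-class member of each link
majority = Counter(y_toy).most_common(1)[0][0]
to_drop = {k for pair in links for k in pair if y_toy[k] == majority}
kept = [i for i in range(len(y_toy)) if i not in to_drop]
print(links, kept)
```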

### EditedNearestNeighbours

Under-sample by removing samples whose class disagrees with the classes of their nearest neighbors.

```python { .api }
class EditedNearestNeighbours:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_neighbors=3,
        kind_sel="all",
        n_jobs=None
    ):
        ...
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `n_neighbors` (int, estimator): Number of neighbors to examine or KNN estimator.
- `kind_sel` (str): Selection strategy:
  - "all": Remove if any neighbor is from a different class
  - "mode": Remove if most neighbors are from a different class
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import EditedNearestNeighbours

# Stricter cleaning (remove if any neighbor differs); usually removes more samples
enn_all = EditedNearestNeighbours(kind_sel="all", n_neighbors=3)
X_clean_all, y_clean_all = enn_all.fit_resample(X, y)

# Less aggressive cleaning (remove only if most neighbors differ)
enn_mode = EditedNearestNeighbours(kind_sel="mode", n_neighbors=5)
X_clean_mode, y_clean_mode = enn_mode.fit_resample(X, y)
```
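
The two `kind_sel` rules can be compared on a toy example with a brute-force sketch. This is a simplification under stated assumptions rather than the library's code: `enn_keep` is a hypothetical helper, only one target class is cleaned, and "mode" is approximated as "more than half of the neighbors share the sample's label", which matches a majority vote for binary labels and an odd neighbor count.

```python
import math


def enn_keep(X, y, target, n_neighbors=3, kind_sel="all"):
    """Indices kept after editing class `target` with the ENN rule."""
    kept = []
    for i in range(len(X)):
        if y[i] != target:
            kept.append(i)  # only the target class is cleaned
            continue
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )
        agree = sum(y[j] == y[i] for j in others[:n_neighbors])
        if kind_sel == "all":
            keep = agree == n_neighbors      # every neighbor must agree
        else:
            keep = agree > n_neighbors / 2   # most neighbors must agree
        if keep:
            kept.append(i)
    return kept


# Hypothetical 1-D toy data: class 0 is the majority and gets cleaned
X_line = [[0.0], [1.1], [2.3], [3.6], [4.0], [4.3], [4.7]]
y_line = [0, 0, 0, 0, 1, 1, 1]

kept_all = enn_keep(X_line, y_line, target=0, kind_sel="all")
kept_mode = enn_keep(X_line, y_line, target=0, kind_sel="mode")
print(kept_all)   # "all" drops every class-0 sample with any class-1 neighbor
print(kept_mode)  # "mode" drops only class-0 samples outvoted by class-1 neighbors
```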

### RepeatedEditedNearestNeighbours

Repeated application of EditedNearestNeighbours until no more samples are removed or another stopping criterion (such as `max_iter`) is met.

```python { .api }
class RepeatedEditedNearestNeighbours:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_neighbors=3,
        max_iter=100,
        kind_sel="all",
        n_jobs=None
    ):
        ...
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `n_neighbors` (int, estimator): Number of neighbors or KNN estimator.
- `max_iter` (int): Maximum number of iterations.
- `kind_sel` (str): Selection strategy ("all" or "mode").
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `enn_` (sampler object): The EditedNearestNeighbours instance.
- `sample_indices_` (ndarray): Indices of selected samples.
- `n_iter_` (int): Number of iterations performed.

**Usage Example:**
```python
from imblearn.under_sampling import RepeatedEditedNearestNeighbours

# Repeat ENN until it stops removing samples or max_iter is reached
renn = RepeatedEditedNearestNeighbours(
    n_neighbors=3,
    max_iter=50,
    kind_sel="all"
)
X_resampled, y_resampled = renn.fit_resample(X, y)
print(f"Stopped after {renn.n_iter_} iterations")
```

### AllKNN

Apply EditedNearestNeighbours repeatedly, increasing the neighborhood size from 1 to `n_neighbors`.

```python { .api }
class AllKNN:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_neighbors=3,
        kind_sel="all",
        allow_minority=False,
        n_jobs=None
    ):
        ...
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `n_neighbors` (int, estimator): Maximum number of neighbors or KNN estimator.
- `kind_sel` (str): Selection strategy ("all" or "mode").
- `allow_minority` (bool): Allow majority classes to become minority classes.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `enn_` (sampler object): The EditedNearestNeighbours instance.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import AllKNN

# Progressive neighborhood cleaning
allknn = AllKNN(n_neighbors=5, kind_sel="all")
X_resampled, y_resampled = allknn.fit_resample(X, y)
```

### OneSidedSelection

Under-sample using one-sided selection, which combines the condensed nearest neighbour (CNN) rule with the removal of Tomek links.

```python { .api }
class OneSidedSelection:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None
    ):
        ...
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `random_state` (int, RandomState, None): Random number generator seed.
- `n_neighbors` (int, estimator, None): Number of neighbors or KNN estimator. Defaults to 1-NN.
- `n_seeds_S` (int): Number of seed samples to extract for set S.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `estimators_` (list): List of KNN estimators used per class.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import OneSidedSelection

# One-sided selection with custom parameters
oss = OneSidedSelection(
    n_neighbors=3,
    n_seeds_S=1,
    random_state=42
)
X_resampled, y_resampled = oss.fit_resample(X, y)
```

### CondensedNearestNeighbour

Under-sample using the condensed nearest neighbour rule to find a consistent subset: a reduced set of samples on which a nearest-neighbour classifier still classifies the original samples correctly.

```python { .api }
class CondensedNearestNeighbour:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None
    ):
        ...
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `random_state` (int, RandomState, None): Random number generator seed.
- `n_neighbors` (int, estimator, None): Number of neighbors or KNN estimator. Defaults to 1-NN.
- `n_seeds_S` (int): Number of seed samples for set S initialization.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `estimators_` (list): List of KNN estimators used per class.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import CondensedNearestNeighbour

# Condensed nearest neighbor selection
cnn = CondensedNearestNeighbour(
    n_neighbors=1,
    n_seeds_S=1,
    random_state=42
)
X_resampled, y_resampled = cnn.fit_resample(X, y)
```
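
The idea behind the condensed rule can be sketched for a binary problem as follows. This is a minimal single-pass illustration under stated assumptions, not the library's implementation: `predict_1nn` and `condensed_nn` are hypothetical helpers, one random majority seed is used (as with `n_seeds_S=1`), and a majority sample is added to the store only when the current 1-NN store misclassifies it. Majority samples near the class border therefore tend to be retained, while redundant interior ones are dropped.

```python
import math
import random


def predict_1nn(x, store):
    """Label of the stored (point, label) pair closest to x."""
    return min(store, key=lambda s: math.dist(x, s[0]))[1]


def condensed_nn(X, y, minority, seed=0):
    """Indices of a condensed subset for a binary problem."""
    rng = random.Random(seed)
    maj_idx = [i for i, lab in enumerate(y) if lab != minority]
    # Start with every minority sample plus one random majority seed
    kept = [i for i, lab in enumerate(y) if lab == minority]
    first = rng.choice(maj_idx)
    kept.append(first)
    for i in maj_idx:
        if i == first:
            continue
        store = [(X[k], y[k]) for k in kept]
        # Keep a majority sample only if the current store gets it wrong
        if predict_1nn(X[i], store) != y[i]:
            kept.append(i)
    return sorted(kept)


# Hypothetical 1-D toy data: index 5 is a majority sample near the border
X_cnn = [[0.0], [0.5], [1.0], [1.5], [2.0], [9.0], [10.0], [11.0]]
y_cnn = [0, 0, 0, 0, 0, 0, 1, 1]

kept_cnn = condensed_nn(X_cnn, y_cnn, minority=1, seed=0)
print(kept_cnn)  # the border sample and the minority class are always retained
```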

---

## Neighborhood Cleaning Methods

### NeighbourhoodCleaningRule

Under-sample using the neighborhood cleaning rule, which combines ENN and KNN for noise removal.

```python { .api }
class NeighbourhoodCleaningRule:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        edited_nearest_neighbours=None,
        n_neighbors=3,
        threshold_cleaning=0.5,
        n_jobs=None
    ):
        ...
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `edited_nearest_neighbours` (estimator, None): ENN estimator for initial cleaning. Defaults to ENN with `kind_sel="mode"`.
- `n_neighbors` (int, estimator): Number of neighbors or KNN estimator.
- `threshold_cleaning` (float): Threshold for considering classes in second cleaning phase: `Ci > C × threshold`.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `edited_nearest_neighbours_` (estimator): The ENN object for first cleaning phase.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `classes_to_clean_` (list): Classes considered for second cleaning phase.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.under_sampling import EditedNearestNeighbours

# Default neighborhood cleaning
ncr = NeighbourhoodCleaningRule()
X_cleaned, y_cleaned = ncr.fit_resample(X, y)

# Custom ENN for first phase
custom_enn = EditedNearestNeighbours(kind_sel="all", n_neighbors=5)
ncr_custom = NeighbourhoodCleaningRule(
    edited_nearest_neighbours=custom_enn,
    threshold_cleaning=0.3
)
X_cleaned_custom, y_cleaned_custom = ncr_custom.fit_resample(X, y)
```

---

## Method Selection Guidelines

### When to Use Each Method

**Random Under-Sampling:**
- Simple baseline approach
- When computational resources are limited
- For initial experimentation

**Prototype Generation (ClusterCentroids):**
- When you want to preserve cluster structure
- For high-dimensional data where centroids can represent regions well
- When interpretability of synthetic samples is important

**Prototype Selection (NearMiss, ENN variants):**
- When preserving decision boundary information is crucial
- For datasets where border samples are informative
- When you want to remove noisy/outlier samples

**Neighborhood Cleaning:**
- When dataset contains significant noise
- For improving classifier performance through data cleaning
- When combining multiple cleaning strategies

### Computational Complexity

Rough orders of growth for n samples; the actual cost depends on the neighbor-search algorithm and the number of features.

- **RandomUnderSampler:** O(n) - fastest
- **ClusterCentroids:** O(n × k × iterations) for k clusters - depends on clustering algorithm
- **NearMiss:** O(n²) - distance calculations between majority and minority samples
- **ENN variants:** dominated by the nearest-neighbor queries, up to O(n²) with brute-force search; RENN and AllKNN repeat them several times
- **TomekLinks:** O(n²) - pairwise distance calculations
- **CNN/OSS:** O(n²) - iterative neighbor searches

### Multi-Class Support

All methods support multi-class resampling:
- **One-vs.-rest:** NearMiss, ENN variants, TomekLinks, NeighbourhoodCleaningRule
- **One-vs.-one:** OneSidedSelection, CondensedNearestNeighbour
- **Independent sampling:** RandomUnderSampler, ClusterCentroids, InstanceHardnessThreshold

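### Pipeline Integration

Samplers must be placed in `imblearn.pipeline.Pipeline`; scikit-learn's own `Pipeline` does not accept steps that implement `fit_resample`.

```python
from imblearn.pipeline import Pipeline  # not sklearn.pipeline.Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out a test set; X and y are the feature matrix and labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Create a pipeline that resamples, then classifies
pipeline = Pipeline([
    ('sampler', RandomUnderSampler(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit pipeline: resampling is applied during fit only, not at predict time
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```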