Tessl Tile for pypi/imbalanced-learn@0.14.0

# Under-Sampling Methods

Under-sampling methods reduce the size of the majority class(es) to address class imbalance. These techniques remove samples from the dataset, either randomly or using intelligent selection criteria to preserve important boundary information.

## Categories of Under-Sampling Methods

### Random Under-Sampling
Methods that randomly select samples to remove from majority classes.

### Prototype Generation
Methods that generate new synthetic samples to represent the original data distribution.

### Prototype Selection
Methods that intelligently select which samples to keep based on neighborhood analysis, distance metrics, or classification difficulty.

### Neighborhood Cleaning
Methods that remove noisy samples or samples that negatively affect classification performance.

---

## Random Under-Sampling

### RandomUnderSampler

Random under-sampling of majority class samples with or without replacement.

```python { .api }
class RandomUnderSampler:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        replacement=False
    ):
```

**Parameters:**
- `sampling_strategy` (float, str, dict, callable): Strategy to control sampling. Default is "auto".
- `random_state` (int, RandomState, None): Random number generator seed for reproducibility.
- `replacement` (bool): Whether sampling is with or without replacement. Default is False.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information per class.
- `sample_indices_` (ndarray): Indices of selected samples.
- `n_features_in_` (int): Number of input features.
- `feature_names_in_` (ndarray): Names of input features when available.

**Methods:**
- `fit_resample(X, y)`: Fit the sampler and resample the dataset.

**Usage Example:**
```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced dataset for illustration (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Create random under-sampler
rus = RandomUnderSampler(random_state=42)

# Apply under-sampling
X_resampled, y_resampled = rus.fit_resample(X, y)
print(f"Original: {Counter(y)}")
print(f"Resampled: {Counter(y_resampled)}")
```
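To make the idea concrete, here is a minimal pure-Python sketch of random under-sampling without replacement. It is not imbalanced-learn's implementation: the helper name `random_under_sample` and the toy data are assumptions made for illustration, and it only mimics the default behaviour of shrinking every class to the size of the smallest one.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=None):
    """Hypothetical helper: shrink every class to the smallest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_target = min(counts.values())
    kept = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample without replacement, comparable to replacement=False
        kept.extend(rng.sample(idx, n_target))
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept], kept


# Toy data: 9 majority samples (class 0) and 3 minority samples (class 1)
X = [[float(i)] for i in range(12)]
y = [0] * 9 + [1] * 3

X_res, y_res, kept = random_under_sample(X, y, seed=42)
print(Counter(y_res))
```

The minority class is kept whole because its size equals the target; only the majority class loses samples.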

---

## Prototype Generation

### ClusterCentroids

Under-sample by generating centroids based on clustering methods. Replaces clusters of majority samples with their centroids.

```python { .api }
class ClusterCentroids:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        estimator=None,
        voting="auto"
    ):
```

**Parameters:**
- `sampling_strategy` (float, str, dict, callable): Strategy to control sampling. Default is "auto".
- `random_state` (int, RandomState, None): Random number generator seed.
- `estimator` (estimator object): Clustering estimator with `n_clusters` parameter and `cluster_centers_` attribute. Defaults to KMeans.
- `voting` (str): Voting strategy for generating new samples:
  - "hard": Use nearest neighbors of centroids
  - "soft": Use centroids directly
  - "auto": Choose based on input sparsity

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information per class.
- `estimator_` (estimator object): The validated clustering estimator.
- `voting_` (str): The validated voting strategy.
- `n_features_in_` (int): Number of input features.
- `feature_names_in_` (ndarray): Names of input features when available.

**Methods:**
- `fit_resample(X, y)`: Fit the sampler and resample the dataset.

**Usage Example:**
```python
from imblearn.under_sampling import ClusterCentroids
from sklearn.cluster import MiniBatchKMeans

# Create cluster centroids sampler with custom estimator
cc = ClusterCentroids(
    estimator=MiniBatchKMeans(n_init=1, random_state=0),
    random_state=42
)

# Apply cluster-based under-sampling
X_resampled, y_resampled = cc.fit_resample(X, y)
```
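The "soft" voting idea, where centroids replace the majority samples directly, can be sketched with a small hand-written k-means loop. This is an illustrative assumption, not the library's code: `kmeans_centroids` is a hypothetical helper, it uses a deterministic initialisation (the first `k` points) instead of a proper seeding scheme, and the data are made up.

```python
import math


def kmeans_centroids(points, k, n_iter=20):
    """Hypothetical helper: plain Lloyd iterations with a fixed initialisation."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


majority = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
            [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]]
minority = [[2.0, 2.0], [3.0, 3.0]]

# Replace the six majority samples by as many centroids as there are minority samples
centroids = kmeans_centroids(majority, k=len(minority))
X_res = centroids + minority
y_res = [0] * len(centroids) + [1] * len(minority)
print(centroids)
```

Note that the resulting majority rows are synthetic points, so there are no original sample indices to report for them.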

---

## Prototype Selection Methods

### NearMiss

Under-sample based on NearMiss methods that select samples based on distance to minority class samples.

```python { .api }
class NearMiss:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        version=1,
        n_neighbors=3,
        n_neighbors_ver3=3,
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (float, str, dict, callable): Strategy to control sampling.
- `version` (int): NearMiss version (1, 2, or 3):
  - Version 1: Select majority samples with the smallest average distance to their nearest minority class samples
  - Version 2: Select majority samples with the smallest average distance to their farthest minority class samples
  - Version 3: Two-step process with neighborhood selection
- `n_neighbors` (int, estimator): Number of neighbors or KNN estimator.
- `n_neighbors_ver3` (int, estimator): Number of neighbors for version 3 pre-selection.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `nn_ver3_` (estimator object): K-nearest neighbors estimator for version 3.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import NearMiss

# NearMiss version 1 (select closest to minority)
nm1 = NearMiss(version=1)
X_res1, y_res1 = nm1.fit_resample(X, y)

# NearMiss version 3 (two-step selection)
nm3 = NearMiss(version=3, n_neighbors=3, n_neighbors_ver3=3)
X_res3, y_res3 = nm3.fit_resample(X, y)
```
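The selection rule of version 1 can be illustrated with a short pure-Python sketch. It covers version 1 only and is a simplification under stated assumptions: `near_miss_1` is a hypothetical helper, the problem is binary, and the majority class is reduced to the size of the minority class.

```python
import math


def near_miss_1(X, y, majority, minority, n_neighbors=3):
    """Hypothetical helper: keep the majority samples closest, on average,
    to their n_neighbors nearest minority samples."""
    min_pts = [x for x, lab in zip(X, y) if lab == minority]
    min_idx = [i for i, lab in enumerate(y) if lab == minority]
    maj_idx = [i for i, lab in enumerate(y) if lab == majority]

    def mean_dist(i):
        nearest = sorted(math.dist(X[i], m) for m in min_pts)[:n_neighbors]
        return sum(nearest) / len(nearest)

    # Rank majority samples by average distance and keep as many as the minority has
    ranked = sorted(maj_idx, key=mean_dist)
    return sorted(ranked[:len(min_idx)] + min_idx)


# Toy 1-D data: minority at 0 and 1, majority spread from 2 to 12
X = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [1, 1, 0, 0, 0, 0, 0]

keep = near_miss_1(X, y, majority=0, minority=1, n_neighbors=2)
print(keep)
```

The two majority samples nearest to the minority cluster survive; the distant ones are dropped, which is why NearMiss-1 tends to concentrate the retained majority samples near the class boundary.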

### InstanceHardnessThreshold

Under-sample based on instance hardness threshold using cross-validation predictions.

```python { .api }
class InstanceHardnessThreshold:
    def __init__(
        self,
        *,
        estimator=None,
        sampling_strategy="auto",
        random_state=None,
        cv=5,
        n_jobs=None
    ):
```

**Parameters:**
- `estimator` (estimator object): Classifier with `predict_proba` method. Defaults to RandomForestClassifier.
- `sampling_strategy` (float, str, dict, callable): Strategy to control sampling.
- `random_state` (int, RandomState, None): Random number generator seed.
- `cv` (int): Number of cross-validation folds for hardness estimation.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `estimator_` (estimator object): The validated classifier.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import InstanceHardnessThreshold
from sklearn.ensemble import RandomForestClassifier

# Use custom classifier for hardness estimation
iht = InstanceHardnessThreshold(
    estimator=RandomForestClassifier(n_estimators=50),
    cv=3,
    random_state=42
)
X_resampled, y_resampled = iht.fit_resample(X, y)
```

### TomekLinks

Under-sample by removing Tomek links: pairs of samples from different classes that are each other's nearest neighbor.

```python { .api }
class TomekLinks:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control which classes to clean.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `sample_indices_` (ndarray): Indices of selected samples.

**Methods:**
- `fit_resample(X, y)`: Remove Tomek links from the dataset.
- `is_tomek(y, nn_index, class_type)`: Static method to detect Tomek pairs.

**Usage Example:**
```python
from imblearn.under_sampling import TomekLinks

# Remove samples that take part in Tomek links (border or noisy samples)
tl = TomekLinks()
X_cleaned, y_cleaned = tl.fit_resample(X, y)
# The difference counts removed samples, not links
print(f"Removed {len(y) - len(y_cleaned)} samples")
```
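As a rough illustration of the definition, the following pure-Python sketch finds Tomek links by brute force and then drops the majority-class member of each link. The function names and toy data are assumptions for this example; the library itself relies on a nearest-neighbors estimator rather than this quadratic loop.

```python
import math


def nearest_neighbor(X, i):
    """Index of the sample closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Hypothetical helper: pairs (i, j) that are mutual nearest neighbors
    and carry different labels."""
    nn = [nearest_neighbor(X, i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


# Toy 1-D data: class 0 is the majority; samples 3 and 4 sit on the border
X = [[0.0], [1.5], [2.5], [4.5], [5.0], [8.0], [9.0]]
y = [0, 0, 0, 0, 1, 1, 1]

links = tomek_links(X, y)

# Drop only the majority-class member of each link
majority = 0
drop = {i if y[i] == majority else j for i, j in links}
kept = [i for i in range(len(X)) if i not in drop]
print(links, kept)
```

Pairs such as samples 5 and 6 are also mutual nearest neighbors, but they share a label and therefore do not form a link.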

### EditedNearestNeighbours

Under-sample by removing samples whose neighborhood contains samples from different classes.

```python { .api }
class EditedNearestNeighbours:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_neighbors=3,
        kind_sel="all",
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `n_neighbors` (int, estimator): Number of neighbors to examine or KNN estimator.
- `kind_sel` (str): Selection strategy:
  - "all": Remove if any neighbor is from different class
  - "mode": Remove if most neighbors are from different class
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import EditedNearestNeighbours

# Strict cleaning (remove if any neighbor differs); removes more samples
enn_all = EditedNearestNeighbours(kind_sel="all", n_neighbors=3)
X_clean_all, y_clean_all = enn_all.fit_resample(X, y)

# Less aggressive cleaning (remove only if most neighbors differ)
enn_mode = EditedNearestNeighbours(kind_sel="mode", n_neighbors=5)
X_clean_mode, y_clean_mode = enn_mode.fit_resample(X, y)
```
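The difference between the two `kind_sel` rules can be seen on a single border sample. The sketch below is an assumption-laden illustration, not the library's code: `enn_keep` is a hypothetical helper that decides whether one sample would survive, using brute-force neighbor search on made-up data.

```python
import math
from collections import Counter


def enn_keep(X, y, i, n_neighbors=3, kind_sel="all"):
    """Hypothetical helper: would sample i survive the ENN rule?"""
    order = sorted((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    labels = [y[j] for j in order[:n_neighbors]]
    if kind_sel == "all":
        # Every neighbor must agree with the sample's own label
        return all(lab == y[i] for lab in labels)
    # "mode": the most frequent neighbor label must agree
    return Counter(labels).most_common(1)[0][0] == y[i]


# Toy 1-D data: sample 3 (class 0) has one class-1 neighbor and two class-0 neighbors
X = [[0.0], [1.0], [2.0], [3.0], [3.4], [6.0], [7.0]]
y = [0, 0, 0, 0, 1, 1, 1]

keep_all = enn_keep(X, y, 3, kind_sel="all")
keep_mode = enn_keep(X, y, 3, kind_sel="mode")
print(keep_all, keep_mode)
```

With "all" the border sample is removed because one of its three neighbors disagrees; with "mode" it is kept because the majority of its neighbors agree.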

### RepeatedEditedNearestNeighbours

Repeated application of EditedNearestNeighbours until convergence or stopping criteria.

```python { .api }
class RepeatedEditedNearestNeighbours:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_neighbors=3,
        max_iter=100,
        kind_sel="all",
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `n_neighbors` (int, estimator): Number of neighbors or KNN estimator.
- `max_iter` (int): Maximum number of iterations.
- `kind_sel` (str): Selection strategy ("all" or "mode").
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `enn_` (sampler object): The EditedNearestNeighbours instance.
- `sample_indices_` (ndarray): Indices of selected samples.
- `n_iter_` (int): Number of iterations performed.

**Usage Example:**
```python
from imblearn.under_sampling import RepeatedEditedNearestNeighbours

# Repeat ENN until convergence or until max_iter is reached
renn = RepeatedEditedNearestNeighbours(
    n_neighbors=3,
    max_iter=50,
    kind_sel="all"
)
X_resampled, y_resampled = renn.fit_resample(X, y)
print(f"Stopped after {renn.n_iter_} iterations")
```

### AllKNN

Apply EditedNearestNeighbours with increasing neighborhood sizes from 1 to n_neighbors.

```python { .api }
class AllKNN:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_neighbors=3,
        kind_sel="all",
        allow_minority=False,
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `n_neighbors` (int, estimator): Maximum number of neighbors or KNN estimator.
- `kind_sel` (str): Selection strategy ("all" or "mode").
- `allow_minority` (bool): Allow majority classes to become minority classes.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `enn_` (sampler object): The EditedNearestNeighbours instance.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import AllKNN

# Progressive neighborhood cleaning
allknn = AllKNN(n_neighbors=5, kind_sel="all")
X_resampled, y_resampled = allknn.fit_resample(X, y)
```

### OneSidedSelection

Under-sample using the one-sided selection method, which combines the condensed nearest neighbour rule (CNN) with Tomek link removal.

```python { .api }
class OneSidedSelection:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `random_state` (int, RandomState, None): Random number generator seed.
- `n_neighbors` (int, estimator, None): Number of neighbors or KNN estimator. Defaults to 1-NN.
- `n_seeds_S` (int): Number of seed samples to extract for set S.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `estimators_` (list): List of KNN estimators used per class.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import OneSidedSelection

# One-sided selection with custom parameters
oss = OneSidedSelection(
    n_neighbors=3,
    n_seeds_S=1,
    random_state=42
)
X_resampled, y_resampled = oss.fit_resample(X, y)
```

### CondensedNearestNeighbour

Under-sample using the condensed nearest neighbour rule to find a consistent subset of the data.

```python { .api }
class CondensedNearestNeighbour:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `random_state` (int, RandomState, None): Random number generator seed.
- `n_neighbors` (int, estimator, None): Number of neighbors or KNN estimator. Defaults to 1-NN.
- `n_seeds_S` (int): Number of seed samples for set S initialization.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `estimators_` (list): List of KNN estimators used per class.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import CondensedNearestNeighbour

# Condensed nearest neighbor selection
cnn = CondensedNearestNeighbour(
    n_neighbors=1,
    n_seeds_S=1,
    random_state=42
)
X_resampled, y_resampled = cnn.fit_resample(X, y)
```
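The "consistent subset" idea can be sketched as a single pass with a 1-NN rule. This is a simplified illustration under assumptions, not the library's implementation: `condensed_nn` is a hypothetical helper for a binary problem, it starts the store with all minority samples plus one random majority seed, and it adds a majority sample only when the current store misclassifies it.

```python
import math
import random


def condensed_nn(X, y, majority, seed=None):
    """Hypothetical helper: one-pass condensed nearest neighbour with 1-NN."""
    rng = random.Random(seed)
    maj_idx = [i for i, lab in enumerate(y) if lab == majority]
    store = [i for i, lab in enumerate(y) if lab != majority]
    seed_idx = rng.choice(maj_idx)
    store.append(seed_idx)
    for i in maj_idx:
        if i == seed_idx:
            continue
        nearest = min(store, key=lambda j: math.dist(X[i], X[j]))
        if y[nearest] != y[i]:
            # Misclassified by the current store, so the sample is informative
            store.append(i)
    return sorted(store)


# Toy 1-D data: class 0 is the majority
X = [[0.0], [0.7], [1.5], [2.6], [3.1], [5.0], [5.9]]
y = [0, 0, 0, 0, 0, 1, 1]

kept = condensed_nn(X, y, majority=0, seed=42)
print(kept)
```

Because only majority samples are ever added to the store, a majority sample that was classified correctly once stays correctly classified, so one pass already yields a subset that classifies every majority sample correctly with 1-NN.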

---

## Neighborhood Cleaning Methods

### NeighbourhoodCleaningRule

Under-sample using the neighborhood cleaning rule, which combines ENN and KNN for noise removal.

```python { .api }
class NeighbourhoodCleaningRule:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        edited_nearest_neighbours=None,
        n_neighbors=3,
        threshold_cleaning=0.5,
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `edited_nearest_neighbours` (estimator, None): ENN estimator for initial cleaning. Defaults to ENN with `kind_sel="mode"`.
- `n_neighbors` (int, estimator): Number of neighbors or KNN estimator.
- `threshold_cleaning` (float): Threshold for considering classes in second cleaning phase: `Ci > C × threshold`.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `edited_nearest_neighbours_` (estimator): The ENN object for first cleaning phase.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `classes_to_clean_` (list): Classes considered for second cleaning phase.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.under_sampling import EditedNearestNeighbours

# Default neighborhood cleaning
ncr = NeighbourhoodCleaningRule()
X_cleaned, y_cleaned = ncr.fit_resample(X, y)

# Custom ENN for first phase
custom_enn = EditedNearestNeighbours(kind_sel="all", n_neighbors=5)
ncr_custom = NeighbourhoodCleaningRule(
    edited_nearest_neighbours=custom_enn,
    threshold_cleaning=0.3
)
X_cleaned_custom, y_cleaned_custom = ncr_custom.fit_resample(X, y)
```

---

## Method Selection Guidelines

### When to Use Each Method

**Random Under-Sampling:**
- Simple baseline approach
- When computational resources are limited
- For initial experimentation

**Prototype Generation (ClusterCentroids):**
- When you want to preserve cluster structure
- For high-dimensional data where centroids can represent regions well
- When interpretability of synthetic samples is important

**Prototype Selection (NearMiss, ENN variants):**
- When preserving decision boundary information is crucial
- For datasets where border samples are informative
- When you want to remove noisy/outlier samples

**Neighborhood Cleaning:**
- When dataset contains significant noise
- For improving classifier performance through data cleaning
- When combining multiple cleaning strategies

### Computational Complexity

- **RandomUnderSampler:** O(n) - fastest
- **ClusterCentroids:** O(n × k × iterations) - depends on clustering algorithm
- **NearMiss:** O(n²) - distance calculations between all samples
- **ENN variants:** O(n × k × neighbors) - depends on neighborhood size
- **TomekLinks:** O(n²) - pairwise distance calculations
- **CNN/OSS:** O(n²) - iterative neighbor searches

### Multi-Class Support

All methods support multi-class resampling:
- **One-vs.-rest:** NearMiss, ENN variants, TomekLinks, NeighbourhoodCleaningRule
- **One-vs.-one:** OneSidedSelection, CondensedNearestNeighbour
- **Independent sampling:** RandomUnderSampler, ClusterCentroids, InstanceHardnessThreshold

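### Pipeline Integration

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Use imblearn's Pipeline: scikit-learn's Pipeline does not accept samplers as steps
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced dataset for illustration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Create pipeline: the sampler is applied during fit only, never at predict time
pipeline = Pipeline([
    ('sampler', RandomUnderSampler(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit pipeline
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```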
