# Over-sampling Methods

Over-sampling techniques address class imbalance by adding samples to the minority classes, either by repeating existing samples or by generating synthetic ones. Unlike under-sampling, which removes samples, over-sampling increases the dataset size by creating new instances that follow the distribution of the existing minority class samples.

## Overview

The imbalanced-learn library provides several over-sampling algorithms that use different strategies for sample generation:

- **SMOTE family**: Generate synthetic samples along the line segments that join nearest neighbors in feature space
- **Adaptive methods**: Adjust sample generation based on local class distributions
- **Categorical handling**: Specialized algorithms for datasets with categorical features
- **Filtering approaches**: Select specific boundary regions for enhanced sample generation

All over-sampling methods inherit from the `BaseOverSampler` class and implement the standard `fit_resample(X, y)` interface.

## Basic Over-sampling

### RandomOverSampler

Random over-sampling with optional smoothed bootstrap generation.

```python { .api }
class RandomOverSampler(BaseOverSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        shrinkage=None,
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        shrinkage : float or dict, default=None
            Parameter controlling the shrinkage applied to the covariance matrix
            when a smoothed bootstrap is generated. If None, normal bootstrap
            without perturbation. If float, same shrinkage for all classes.
            If dict, class-specific shrinkage factors.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

The `RandomOverSampler` performs basic over-sampling by selecting minority samples at random with replacement. When `shrinkage` is set, it draws a smoothed bootstrap instead: each selected sample is perturbed with a small amount of random noise, an approach known as Random Over-Sampling Examples (ROSE).
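
The difference between a plain and a smoothed bootstrap can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the library's implementation: the helper name `smoothed_bootstrap` is hypothetical, and the noise scale (shrinkage times the per-feature standard deviation) is an assumption that stands in for the bandwidth formula used by imbalanced-learn.

```python
import random
import statistics


def smoothed_bootstrap(samples, n_new, shrinkage=0.5, seed=0):
    """Draw n_new rows with replacement and jitter each feature.

    Simplified: noise scale = shrinkage * per-feature standard deviation.
    With shrinkage=0.0 this reduces to a plain bootstrap (exact copies).
    """
    rng = random.Random(seed)
    n_features = len(samples[0])
    # Per-feature spread of the minority class, used to scale the noise.
    scales = [
        shrinkage * statistics.pstdev([row[j] for row in samples])
        for j in range(n_features)
    ]
    new_rows = []
    for _ in range(n_new):
        base = rng.choice(samples)  # plain bootstrap draw
        new_rows.append([v + rng.gauss(0.0, s) for v, s in zip(base, scales)])
    return new_rows


minority = [[1.0, 2.0], [1.5, 1.8], [0.8, 2.4], [1.2, 2.1]]
plain = smoothed_bootstrap(minority, n_new=6, shrinkage=0.0)
smooth = smoothed_bootstrap(minority, n_new=6, shrinkage=0.5)
```

With `shrinkage=0.0` every returned row is an exact copy of an input row; with a positive value the rows scatter around the originals.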

## SMOTE Family

### SMOTE

Synthetic Minority Over-sampling Technique: the original algorithm for generating synthetic samples.

```python { .api }
class SMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used to define the neighborhood of samples
            for generating synthetic samples. Can be an int giving the number
            of neighbors, or a nearest-neighbors estimator exposing
            kneighbors and kneighbors_graph methods.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

SMOTE generates synthetic samples by interpolating between a minority sample and one of its k nearest minority-class neighbors. For each selected minority sample, it picks one of those neighbors at random and creates a synthetic sample at a random position on the line segment between the two.
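
The interpolation step can be written out directly. The following is a minimal standard-library sketch, assuming Euclidean distance and a brute-force neighbor search; the function name `smote_sketch` is hypothetical and the code is not the library's implementation.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating toward random neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base row (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sketch(minority, n_new=5)
```

Because each new row is a convex combination of two existing rows, the synthetic samples never leave the bounding box of the minority class.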

### SMOTENC

SMOTE for datasets containing both numerical and categorical features.

```python { .api }
class SMOTENC(SMOTE):
    def __init__(
        self,
        categorical_features,
        *,
        categorical_encoder=None,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
    ):
        """
        Parameters
        ----------
        categorical_features : "auto" or array-like of shape (n_cat_features,) or (n_features,)
            Specifies which features are categorical. Can be:
            - "auto" to automatically detect from pandas DataFrame with CategoricalDtype
            - array of int corresponding to categorical feature indices
            - array of str corresponding to feature names (requires pandas DataFrame)
            - boolean mask array of shape (n_features,)

        categorical_encoder : estimator, default=None
            One-hot encoder used to encode categorical features. If None,
            uses OneHotEncoder with handle_unknown='ignore'.

        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used for generating synthetic samples.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

SMOTENC handles mixed-type datasets by applying standard SMOTE interpolation to numerical features while assigning each categorical feature the most frequent category among the nearest neighbors. Categorical features are one-hot encoded during processing.
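
A single generation step for mixed data might look like the sketch below. It assumes the neighbors have already been found and leaves out the encoding and distance details; `mixed_synthetic_sample` and its parameters are hypothetical names used only for illustration.

```python
from collections import Counter


def mixed_synthetic_sample(base, neighbours, chosen, gap, categorical):
    """Blend one base row with its neighbours (simplified SMOTENC-style step).

    base        -- the minority row being over-sampled
    neighbours  -- its nearest minority rows
    chosen      -- index into neighbours used for numerical interpolation
    gap         -- interpolation factor in [0, 1]
    categorical -- set of column indices that hold categories
    """
    target = neighbours[chosen]
    row = []
    for j, value in enumerate(base):
        if j in categorical:
            # Categorical column: most frequent category among the neighbours.
            votes = Counter(n[j] for n in neighbours)
            row.append(votes.most_common(1)[0][0])
        else:
            # Numerical column: move part of the way toward the chosen neighbour.
            row.append(value + gap * (target[j] - value))
    return row


base = [1.0, "a"]
neighbours = [[2.0, "b"], [3.0, "b"], [4.0, "a"]]
sample = mixed_synthetic_sample(base, neighbours, chosen=0, gap=0.5, categorical={1})
print(sample)  # [1.5, 'b']
```

The numerical value lands halfway between the base row and the chosen neighbor, while the category follows the majority vote of all three neighbors.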

### SMOTEN

SMOTE variant specifically designed for categorical features only.

```python { .api }
class SMOTEN(SMOTE):
    def __init__(
        self,
        categorical_encoder=None,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
    ):
        """
        Parameters
        ----------
        categorical_encoder : estimator, default=None
            Ordinal encoder used to encode categorical features. If None,
            uses OrdinalEncoder with default parameters.

        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used for generating synthetic samples.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

SMOTEN works exclusively with categorical features and uses the Value Difference Metric (VDM) to compute distances between categorical samples. Synthetic samples are generated by selecting the most frequent category among nearest neighbors for each feature.
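
To make the Value Difference Metric more concrete, the sketch below computes the per-feature distance between two categories as the summed absolute difference of their class-conditional probabilities. It is a simplified standard-library illustration; the helper name `vdm_table` and the `exponent` parameter are assumptions, and the library's own metric may combine features differently.

```python
from collections import Counter, defaultdict


def vdm_table(column, labels, exponent=1):
    """Pairwise VDM distances between the categories of one feature.

    delta(a, b) = sum over classes c of |P(c | a) - P(c | b)| ** exponent
    """
    counts = defaultdict(Counter)
    for value, label in zip(column, labels):
        counts[value][label] += 1
    classes = sorted(set(labels))
    table = {}
    for a in counts:
        total_a = sum(counts[a].values())
        for b in counts:
            total_b = sum(counts[b].values())
            table[a, b] = sum(
                abs(counts[a][c] / total_a - counts[b][c] / total_b) ** exponent
                for c in classes
            )
    return table


column = ["red", "red", "blue", "blue", "green", "green"]
labels = [0, 0, 1, 1, 0, 1]
table = vdm_table(column, labels)
print(table["red", "blue"])   # 2.0
print(table["red", "green"])  # 1.0
```

Categories that predict the class in the same way end up close together, even though they have no numerical ordering.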

## Boundary-focused Methods

### BorderlineSMOTE

SMOTE variant that focuses on samples near class boundaries.

```python { .api }
class BorderlineSMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
        m_neighbors=10,
        kind="borderline-1",
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used for generating synthetic samples.

        m_neighbors : int or object, default=10
            The nearest neighbors used to determine if a minority sample
            is in "danger" (near the boundary).

        kind : {"borderline-1", "borderline-2"}, default='borderline-1'
            The type of borderline SMOTE algorithm:
            - "borderline-1": considers only positive class for neighbor selection
            - "borderline-2": considers whole dataset, applies weight adjustments
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

BorderlineSMOTE identifies "danger" samples: minority samples close to the decision boundary, for which at least half, but not all, of the `m_neighbors` nearest neighbors belong to another class. It generates synthetic samples only from these borderline cases, focusing over-sampling where it is most needed.
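
The "danger" test can be sketched as a neighbor count. The version below is a standard-library illustration under stated assumptions: Euclidean distance, a brute-force search, and the common rule that a sample is in danger when at least half but not all of its `m` neighbors come from another class (all of them means noise). The name `classify_minority` is hypothetical.

```python
import math


def classify_minority(X, y, minority_label, m=3):
    """Label each minority sample as 'safe', 'danger' or 'noise'."""
    status = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # Indices of all other samples, closest first.
        order = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        n_other = sum(y[j] != minority_label for j in order[:m])
        if n_other == m:
            status[i] = "noise"    # surrounded entirely by other classes
        elif n_other >= m / 2:
            status[i] = "danger"   # borderline: seed for new samples
        else:
            status[i] = "safe"
    return status


# Seven minority rows (label 1) followed by five majority rows (label 0).
X = [[0.0], [1.0], [2.0], [3.0], [6.0], [10.0], [30.0],
     [11.0], [12.0], [29.0], [31.0], [32.0]]
y = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
status = classify_minority(X, y, minority_label=1, m=3)
```

In this toy data only the sample at `10.0` is flagged as "danger", so it would be the only seed for synthetic samples; the isolated sample at `30.0` is treated as noise.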

### SVMSMOTE

SVM-based SMOTE that uses support vectors to identify critical samples.

```python { .api }
class SVMSMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
        m_neighbors=10,
        svm_estimator=None,
        out_step=0.5,
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used for generating synthetic samples.

        m_neighbors : int or object, default=10
            The nearest neighbors used to determine sample safety/danger status.

        svm_estimator : estimator object, default=SVC()
            SVM classifier used to identify support vectors. Must expose
            support_ attribute after fitting.

        out_step : float, default=0.5
            Step size when extrapolating from safe support vectors.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

SVMSMOTE trains an SVM classifier and uses the minority class support vectors as seed points for synthetic sample generation. It classifies each support vector as "safe" or "danger" and applies a different generation strategy to each group: interpolation toward a neighbor for "danger" vectors and extrapolation, scaled by `out_step`, for "safe" ones.
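
The two generation strategies differ only in the direction and size of the step taken from the support vector. The sketch below shows that step in isolation; it assumes the support vector and its neighbor are already known, and the sign convention (a negative step moves away from the neighbor) is an assumption about how extrapolation is realized, not a statement of the library's code. `step_from` is a hypothetical helper.

```python
def step_from(base, neighbour, gap, step_size):
    """Move from base along the direction of neighbour.

    step_size = 1.0        -> interpolation (lands between base and neighbour)
    step_size = -out_step  -> extrapolation (moves away from neighbour)
    gap is a random factor in [0, 1] drawn per synthetic sample.
    """
    return [b + step_size * gap * (n - b) for b, n in zip(base, neighbour)]


support_vector = [0.0, 0.0]
neighbour = [2.0, 0.0]
out_step = 0.5

# "danger" support vector: stay inside the segment toward the neighbour.
interpolated = step_from(support_vector, neighbour, gap=0.5, step_size=1.0)
# "safe" support vector: push outward, away from the neighbour.
extrapolated = step_from(support_vector, neighbour, gap=0.5, step_size=-out_step)

print(interpolated)  # [1.0, 0.0]
print(extrapolated)  # [-0.5, 0.0]
```

A larger `out_step` pushes samples generated from "safe" support vectors further out, which expands the minority region.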

## Adaptive Methods

### ADASYN

Adaptive Synthetic Sampling approach that adjusts generation density based on local distributions.

```python { .api }
class ADASYN(BaseOverSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=5,
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        n_neighbors : int or estimator object, default=5
            The nearest neighbors used to determine local distribution and
            generate synthetic samples. Can be an int giving the number of
            neighbors, or a nearest-neighbors estimator.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

ADASYN calculates a difficulty coefficient for each minority sample: the fraction of its nearest neighbors that belong to other classes. Samples in more difficult regions (surrounded by majority samples) receive a larger share of the synthetic samples, so generation adapts to the local class distribution.
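
The allocation step can be sketched as a normalization of those difficulty coefficients. The example assumes the neighbor counts are already available; `adasyn_allocation` and its arguments are hypothetical names, and rounding behavior in a real implementation may differ.

```python
def adasyn_allocation(majority_counts, k, n_to_generate):
    """Split n_to_generate synthetic samples across minority samples.

    majority_counts[i] is the number of non-minority rows among the k
    nearest neighbours of minority sample i.
    """
    ratios = [count / k for count in majority_counts]  # difficulty per sample
    total = sum(ratios)
    if total == 0:
        raise ValueError("no minority sample has neighbours from another class")
    # Normalise so the weights sum to one, then scale to the requested total.
    return [round(r / total * n_to_generate) for r in ratios]


# Three minority samples with 0, 1 and 4 majority neighbours out of k=5.
allocation = adasyn_allocation([0, 1, 4], k=5, n_to_generate=10)
print(allocation)  # [0, 2, 8]
```

The sample with no majority neighbors generates nothing, while the one that is almost surrounded by the majority class receives most of the budget.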

## Cluster-based Methods

### KMeansSMOTE

Applies K-Means clustering before SMOTE generation to handle complex data distributions.

```python { .api }
class KMeansSMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=2,
        n_jobs=None,
        kmeans_estimator=None,
        cluster_balance_threshold="auto",
        density_exponent="auto",
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=2
            The nearest neighbors used for generating synthetic samples.

        n_jobs : int, default=None
            Number of CPU cores to use for parallel computation.

        kmeans_estimator : int or object, default=None
            K-Means clustering estimator or number of clusters. If None,
            uses MiniBatchKMeans. If int, creates MiniBatchKMeans with
            that number of clusters.

        cluster_balance_threshold : "auto" or float, default="auto"
            Threshold for determining balanced clusters. If "auto",
            determined by class ratios. Manual threshold can be set.

        density_exponent : "auto" or float, default="auto"
            Exponent for cluster density calculation. If "auto", uses
            a feature-length based exponent.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

KMeansSMOTE first clusters the data, then keeps only the clusters in which the share of minority samples reaches the balance threshold. It applies SMOTE within these clusters and gives sparser clusters a larger share of the synthetic samples, which helps on complex, multimodal datasets.
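
The cluster selection and allocation can be sketched as a filter followed by a weighted split. This is an illustration under stated assumptions, not the library's code: a cluster is assumed to qualify when its minority share is at least the threshold, each cluster's sparsity is taken as a precomputed number, and `allocate_by_cluster` with its dictionary keys are hypothetical names.

```python
def allocate_by_cluster(clusters, n_to_generate, balance_threshold=0.5):
    """Pick clusters to over-sample and split the sample budget among them.

    clusters maps a cluster name to a dict with:
      n_minority -- minority rows in the cluster
      n_total    -- all rows in the cluster
      sparsity   -- larger means minority rows are more spread out
    """
    # Assumption: keep clusters whose minority share reaches the threshold.
    selected = {
        name: c
        for name, c in clusters.items()
        if c["n_minority"] / c["n_total"] >= balance_threshold
    }
    total_sparsity = sum(c["sparsity"] for c in selected.values())
    # Sparser clusters receive proportionally more synthetic samples.
    return {
        name: round(n_to_generate * c["sparsity"] / total_sparsity)
        for name, c in selected.items()
    }


clusters = {
    "A": {"n_minority": 8, "n_total": 10, "sparsity": 1.0},
    "B": {"n_minority": 6, "n_total": 10, "sparsity": 3.0},
    "C": {"n_minority": 1, "n_total": 10, "sparsity": 5.0},
}
allocation = allocate_by_cluster(clusters, n_to_generate=20)
print(allocation)  # {'A': 5, 'B': 15}
```

Cluster `C` is dominated by the majority class and is skipped, so no synthetic samples are placed in a region where they would overlap with majority data.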

## Usage Examples

### Basic SMOTE

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Create imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)

print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({1: 900, 0: 100})

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({0: 900, 1: 900})
```

### Mixed-type Data with SMOTENC

```python
from collections import Counter
from numpy.random import RandomState
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTENC

# Simulate mixed dataset with categorical features
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)

# Make last 2 columns categorical (integer codes 0 to 3)
X[:, -2:] = RandomState(10).randint(0, 4, size=(1000, 2))

sm = SMOTENC(random_state=42, categorical_features=[18, 19])
X_res, y_res = sm.fit_resample(X, y)

print(f'Resampled dataset samples per class {Counter(y_res)}')
# Resampled dataset samples per class Counter({0: 900, 1: 900})
```

### Boundary-focused Oversampling

```python
from collections import Counter
from imblearn.over_sampling import BorderlineSMOTE

# X, y: an imbalanced dataset, such as the one built in the Basic SMOTE example

# Focus on borderline samples
sm = BorderlineSMOTE(random_state=42, kind='borderline-1')
X_res, y_res = sm.fit_resample(X, y)

print('Borderline SMOTE result %s' % Counter(y_res))
# Generates samples only from minority samples near decision boundary
```

## Type Definitions

```python { .api }
from typing import Union, Dict, Callable, Optional, Any
import numpy as np
from numpy import ndarray
from scipy.sparse import spmatrix
from sklearn.base import BaseEstimator

ArrayLike = Union[ndarray, spmatrix]
SamplingStrategy = Union[float, str, Dict[Any, int], Callable[[ndarray], Dict[Any, int]]]
NeighborsLike = Union[int, BaseEstimator]
RandomState = Union[int, np.random.RandomState, None]
```

All over-sampling methods share common characteristics:

- Support for multi-class resampling using one-vs-rest approach
- Handling of both dense and sparse matrices
- Configurable sampling strategies for fine-tuned class balancing
- Integration with scikit-learn pipelines and cross-validation