Tessl Tile for pypi/imbalanced-learn@0.14.0

# Over-sampling Methods

Over-sampling techniques address class imbalance by adding samples to the minority classes, either by duplicating existing samples or by generating synthetic ones. Unlike under-sampling, which removes samples, over-sampling increases the dataset size; the synthetic methods create new instances that follow the distribution of existing minority-class samples.

## Overview

The imbalanced-learn library provides several over-sampling algorithms that use different strategies for synthetic sample generation:

- **SMOTE family**: Generate synthetic samples along the line segments in feature space that join a sample to its nearest neighbors
- **Adaptive methods**: Adjust sample generation based on local class distributions
- **Categorical handling**: Specialized algorithms for datasets with categorical features
- **Filtering approaches**: Restrict sample generation to specific regions, such as class boundaries

All over-sampling methods inherit from the `BaseOverSampler` class and implement the standard `fit_resample(X, y)` interface.

## Basic Over-sampling

### RandomOverSampler

Random over-sampling with optional smoothed bootstrap generation.

```python
{ .api }
class RandomOverSampler(BaseOverSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        shrinkage=None,
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        shrinkage : float or dict, default=None
            Parameter controlling the shrinkage applied to the covariance matrix
            when a smoothed bootstrap is generated. If None, normal bootstrap
            without perturbation. If float, same shrinkage for all classes.
            If dict, class-specific shrinkage factors.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

The `RandomOverSampler` performs basic over-sampling by selecting samples at random with replacement. When `shrinkage` is specified, it generates a smoothed bootstrap instead: each resampled point is perturbed by a small amount of random noise, a technique also known as Random Over-Sampling Examples (ROSE).

## SMOTE Family

### SMOTE

Synthetic Minority Over-sampling Technique: the original algorithm for generating synthetic samples.

```python
{ .api }
class SMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used to define the neighborhood of samples
            for generating synthetic samples. Can be int for number of neighbors
            or a fitted neighbors estimator with kneighbors and kneighbors_graph methods.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

SMOTE generates synthetic samples by interpolating between a minority sample and its k nearest neighbors. For each synthetic sample, it picks a minority sample, randomly selects one of its k nearest minority-class neighbors, and places the new point at a random position on the line segment between them.
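
The interpolation step can be sketched directly in NumPy. This is an illustration of the idea only, not the library's internal code; the toy data and the variable names (`X_min`, `base`, `chosen`, `gap`) are assumptions for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))  # toy minority-class samples
k = 5
n_new = 10

# Ask for k + 1 neighbors because each point is returned as its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
neighbor_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]

base = rng.integers(0, len(X_min), size=n_new)               # seed sample per synthetic point
chosen = neighbor_idx[base, rng.integers(0, k, size=n_new)]  # one random neighbor per seed
gap = rng.uniform(0, 1, size=(n_new, 1))                     # position along the segment

# x_new = x_seed + gap * (x_neighbor - x_seed)
X_new = X_min[base] + gap * (X_min[chosen] - X_min[base])
```

Because `gap` lies between 0 and 1, every synthetic point falls on the segment joining a seed sample to one of its neighbors.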

### SMOTENC

SMOTE for datasets containing both numerical and categorical features.

```python
{ .api }
class SMOTENC(SMOTE):
    def __init__(
        self,
        categorical_features,
        *,
        categorical_encoder=None,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
    ):
        """
        Parameters
        ----------
        categorical_features : "auto" or array-like of shape (n_cat_features,) or (n_features,)
            Specifies which features are categorical. Can be:
            - "auto" to automatically detect from pandas DataFrame with CategoricalDtype
            - array of int corresponding to categorical feature indices
            - array of str corresponding to feature names (requires pandas DataFrame)
            - boolean mask array of shape (n_features,)

        categorical_encoder : estimator, default=None
            One-hot encoder used to encode categorical features. If None,
            uses OneHotEncoder with handle_unknown='ignore'.

        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used for generating synthetic samples.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

SMOTENC handles mixed-type datasets by applying standard SMOTE interpolation to numerical features while setting each categorical feature to the most frequent category among the nearest neighbors. Categorical features are one-hot encoded during processing.

### SMOTEN

SMOTE variant specifically designed for categorical features only.

```python
{ .api }
class SMOTEN(SMOTE):
    def __init__(
        self,
        categorical_encoder=None,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
    ):
        """
        Parameters
        ----------
        categorical_encoder : estimator, default=None
            Ordinal encoder used to encode categorical features. If None,
            uses OrdinalEncoder with default parameters.

        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used for generating synthetic samples.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

SMOTEN works exclusively with categorical features and uses the Value Difference Metric (VDM) to compute distances between categorical samples. Synthetic samples are generated by selecting the most frequent category among nearest neighbors for each feature.

## Boundary-focused Methods

### BorderlineSMOTE

SMOTE variant that focuses on samples near class boundaries.

```python
{ .api }
class BorderlineSMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
        m_neighbors=10,
        kind="borderline-1",
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used for generating synthetic samples.

        m_neighbors : int or object, default=10
            The nearest neighbors used to determine if a minority sample
            is in "danger" (near the boundary).

        kind : {"borderline-1", "borderline-2"}, default='borderline-1'
            The type of borderline SMOTE algorithm:
            - "borderline-1": considers only positive class for neighbor selection
            - "borderline-2": considers whole dataset, applies weight adjustments
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

BorderlineSMOTE identifies "danger" samples close to the decision boundary: minority samples for which at least half, but not all, of the `m_neighbors` nearest neighbors belong to another class. It generates synthetic samples only from these borderline cases, focusing over-sampling where it is most needed.
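
The "danger" test can be sketched with a plain nearest-neighbors query. This is an illustrative reconstruction, not the library's internal code; the thresholds (at least half but not all of the `m` neighbors from another class) and the `noise` and `safe` labels are assumptions taken from the usual description of the algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

# Overlapping toy classes; class 0 is the minority (parameters are illustrative).
X, y = make_classification(n_samples=500, weights=[0.1, 0.9], flip_y=0,
                           class_sep=0.8, random_state=0)
minority = 0
m = 10

# Neighbors are searched in the whole dataset; drop column 0 (the sample itself).
nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
X_min = X[y == minority]
neighbor_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
n_other = (y[neighbor_idx] != minority).sum(axis=1)

danger = (n_other >= m / 2) & (n_other < m)  # borderline: used as seeds
noise = n_other == m                         # fully surrounded by other classes
safe = n_other < m / 2                       # mostly minority neighbors

print(danger.sum(), noise.sum(), safe.sum())
```

Only the samples flagged in `danger` would act as seeds for synthetic generation under this rule.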

### SVMSMOTE

SVM-based SMOTE that uses support vectors to identify critical samples.

```python
{ .api }
class SVMSMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
        m_neighbors=10,
        svm_estimator=None,
        out_step=0.5,
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used for generating synthetic samples.

        m_neighbors : int or object, default=10
            The nearest neighbors used to determine sample safety/danger status.

        svm_estimator : estimator object, default=SVC()
            SVM classifier used to identify support vectors. Must expose
            support_ attribute after fitting.

        out_step : float, default=0.5
            Step size when extrapolating from safe support vectors.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

SVMSMOTE trains an SVM classifier and uses the minority class support vectors as seed points for synthetic sample generation. It classifies support vectors as "safe" or "danger": new samples are interpolated between a "danger" support vector and its neighbors, and extrapolated away from the neighbors of a "safe" one, with `out_step` controlling the extrapolation step.

## Adaptive Methods

### ADASYN

Adaptive Synthetic Sampling approach that adjusts generation density based on local distributions.

```python
{ .api }
class ADASYN(BaseOverSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=5,
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        n_neighbors : int or estimator object, default=5
            The nearest neighbors used to determine local distribution and
            generate synthetic samples. Can be int for number of neighbors
            or fitted neighbors estimator.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

ADASYN calculates a difficulty coefficient for each minority sample based on the ratio of majority class neighbors. Samples in more difficult regions (surrounded by majority samples) generate more synthetic samples, adapting to local class distributions.
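
The allocation described above can be sketched as follows. This is an illustration of the idea rather than the library's code; the toy dataset, the rounding step, and the names `ratio`, `weights` and `per_sample` are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

# Overlapping toy classes; class 0 is the minority (parameters are illustrative).
X, y = make_classification(n_samples=500, weights=[0.1, 0.9], flip_y=0,
                           class_sep=0.8, random_state=0)
minority = 0
k = 5

# For each minority sample, count neighbors from other classes (skip the sample itself).
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
X_min = X[y == minority]
neighbor_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
ratio = (y[neighbor_idx] != minority).sum(axis=1) / k  # difficulty per sample

# Normalize to a distribution, then split the required number of new samples.
weights = ratio / ratio.sum()
n_to_generate = (y != minority).sum() - (y == minority).sum()
per_sample = np.rint(weights * n_to_generate).astype(int)

print(per_sample.sum(), n_to_generate)
```

Because each share is rounded separately, the total in this sketch may differ slightly from `n_to_generate`, which is one reason an adaptive scheme of this kind need not produce perfectly balanced classes.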

## Cluster-based Methods

### KMeansSMOTE

Applies K-Means clustering before SMOTE generation to handle complex data distributions.

```python
{ .api }
class KMeansSMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=2,
        n_jobs=None,
        kmeans_estimator=None,
        cluster_balance_threshold="auto",
        density_exponent="auto",
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=2
            The nearest neighbors used for generating synthetic samples.

        n_jobs : int, default=None
            Number of CPU cores used during the cross-validation loop.

        kmeans_estimator : int or object, default=None
            K-Means clustering estimator or number of clusters. If None,
            uses MiniBatchKMeans. If int, creates MiniBatchKMeans with
            that number of clusters.

        cluster_balance_threshold : "auto" or float, default="auto"
            Threshold for determining balanced clusters. If "auto",
            determined by class ratios. Manual threshold can be set.

        density_exponent : "auto" or float, default="auto"
            Exponent for cluster density calculation. If "auto", uses
            feature-length based exponent.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

KMeansSMOTE first clusters the data, then keeps only the clusters in which the share of minority-class samples reaches `cluster_balance_threshold`. It applies SMOTE within these clusters, assigning more synthetic samples to clusters where the minority class is sparse, which helps with complex, multimodal datasets.
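
The cluster-filtering step can be illustrated with plain scikit-learn. This is a sketch of the idea only, not the library's implementation; the blob layout, the manual threshold of 0.5, and the names `share` and `selected` are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs: two majority regions and one minority region.
X, blob = make_blobs(n_samples=[150, 150, 50],
                     centers=[[-4, 0], [4, 0], [0, 5]],
                     cluster_std=0.8, random_state=0)
y = (blob == 2).astype(int)            # class 1 is the minority
y[np.flatnonzero(blob == 0)[:5]] = 1   # a few stray minority points in a majority blob

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Share of minority samples in each cluster.
share = np.array([(y[clusters == c] == 1).mean() for c in range(3)])

threshold = 0.5  # hypothetical manual threshold for this sketch
selected = share >= threshold
print(share.round(2), selected)
```

Only the cluster dominated by the minority class passes the filter here, so synthetic samples would be generated inside it and the stray minority points in the majority blob would not attract new samples.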

## Usage Examples

### Basic SMOTE

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Create imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)

print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({1: 900, 0: 100})

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({0: 900, 1: 900})
```

### Mixed-type Data with SMOTENC

```python
from collections import Counter
from numpy.random import RandomState
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTENC

# Simulate mixed dataset with categorical features
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)

# Make last 2 columns categorical
X[:, -2:] = RandomState(10).randint(0, 4, size=(1000, 2))

sm = SMOTENC(random_state=42, categorical_features=[18, 19])
X_res, y_res = sm.fit_resample(X, y)

print(f'Resampled dataset samples per class {Counter(y_res)}')
# Resampled dataset samples per class Counter({0: 900, 1: 900})
```

### Boundary-focused Oversampling

```python
from collections import Counter
from imblearn.over_sampling import BorderlineSMOTE

# X and y are the imbalanced dataset created in the Basic SMOTE example

# Focus on borderline samples
sm = BorderlineSMOTE(random_state=42, kind='borderline-1')
X_res, y_res = sm.fit_resample(X, y)

print('Borderline SMOTE result %s' % Counter(y_res))
# Generates samples only from minority samples near decision boundary
```

## Type Definitions

```python
{ .api }
from typing import Union, Dict, Callable, Optional, Any
import numpy as np
from numpy import ndarray
from scipy.sparse import spmatrix
from sklearn.base import BaseEstimator

ArrayLike = Union[ndarray, spmatrix]
SamplingStrategy = Union[float, str, Dict[Any, int], Callable[[ndarray], Dict[Any, int]]]
NeighborsLike = Union[int, BaseEstimator]
RandomState = Union[int, np.random.RandomState, None]
```

All over-sampling methods share common characteristics:
- Support for multi-class resampling using a one-vs-rest approach
- Handling of both dense and sparse matrices
- Configurable sampling strategies for fine-tuned class balancing
- Integration with scikit-learn pipelines and cross-validation
