
# Over-sampling Methods


Over-sampling techniques address class imbalance by generating synthetic samples for minority classes. Unlike under-sampling, which removes samples, over-sampling increases the dataset size by creating new instances that follow the distribution patterns of existing minority class samples.

## Overview

The imbalanced-learn library provides several over-sampling algorithms that differ in how they generate new samples:

- **SMOTE family**: Generate synthetic samples by interpolating along line segments in feature space between a minority sample and its nearest neighbors
- **Adaptive methods**: Adjust sample generation based on local class distributions
- **Categorical handling**: Specialized algorithms for datasets with categorical features
- **Filtering approaches**: Restrict sample generation to minority samples in selected regions, such as those near the class boundary

All over-sampling methods inherit from the `BaseOverSampler` class and implement the standard `fit_resample(X, y)` interface.

## Basic Over-sampling

### RandomOverSampler

Random over-sampling with optional smoothed bootstrap generation.

```python
{ .api }
class RandomOverSampler(BaseOverSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        shrinkage=None,
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        shrinkage : float or dict, default=None
            Parameter controlling the shrinkage applied to the covariance matrix
            when a smoothed bootstrap is generated. If None, normal bootstrap
            without perturbation. If float, same shrinkage for all classes.
            If dict, class-specific shrinkage factors.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

The `RandomOverSampler` performs basic over-sampling by selecting samples at random with replacement. When `shrinkage` is specified, it generates smoothed bootstrap samples by adding small perturbations, also known as Random Over-Sampling Examples (ROSE).

## SMOTE Family

### SMOTE

Synthetic Minority Over-sampling Technique - the original algorithm for generating synthetic samples.

```python
{ .api }
class SMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used to define the neighborhood of samples
            for generating synthetic samples. Can be int for number of neighbors
            or a fitted neighbors estimator with kneighbors and kneighbors_graph methods.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

SMOTE generates synthetic samples by interpolating between a minority sample and its k nearest neighbors. For each minority sample, it selects one of its k nearest neighbors randomly and creates a synthetic sample somewhere along the line segment between them.
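The interpolation step can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the library's code: `smote_sketch` and its parameters are hypothetical names, and it assumes the minority samples contain no duplicate rows.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Interpolate n_new synthetic points from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point comes back as its own
    # nearest neighbor; the first column is dropped below.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbors = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_min), size=n_new)  # seed sample for each new point
    choice = rng.integers(0, k, size=n_new)         # which of its k neighbors to use
    picked = neighbors[base, choice]
    gap = rng.random((n_new, 1))                    # position on the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=30)
print(X_new.shape)  # (30, 2)
```

Because each new point is a convex combination of two existing minority points, the synthetic samples never leave the bounding box of the minority class.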

### SMOTENC

SMOTE for datasets containing both numerical and categorical features.

```python
{ .api }
class SMOTENC(SMOTE):
    def __init__(
        self,
        categorical_features,
        *,
        categorical_encoder=None,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
    ):
        """
        Parameters
        ----------
        categorical_features : "auto" or array-like of shape (n_cat_features,) or (n_features,)
            Specifies which features are categorical. Can be:
            - "auto" to automatically detect from pandas DataFrame with CategoricalDtype
            - array of int corresponding to categorical feature indices
            - array of str corresponding to feature names (requires pandas DataFrame)
            - boolean mask array of shape (n_features,)

        categorical_encoder : estimator, default=None
            One-hot encoder used to encode categorical features. If None,
            uses OneHotEncoder with handle_unknown='ignore'.

        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used for generating synthetic samples.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

SMOTENC handles mixed-type datasets by applying standard SMOTE interpolation to numerical features while using mode-based selection for categorical features. Categorical features are encoded with one-hot encoding during processing.
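The mode-based selection for a single categorical feature can be illustrated as follows. This is a simplified sketch with a hypothetical helper, `most_frequent_category`; in particular, its tie-breaking (first value in sorted order) is an assumption of this example and may differ from the library's behavior.

```python
import numpy as np


def most_frequent_category(neighbor_categories):
    """Return the category that occurs most often among the neighbors."""
    values, counts = np.unique(neighbor_categories, return_counts=True)
    # np.unique returns sorted values, so ties resolve to the first in sorted order.
    return values[np.argmax(counts)]


# Categorical values of one feature across the 5 nearest neighbors of a sample.
neighbor_colors = np.array(["red", "blue", "red", "green", "red"])
chosen = most_frequent_category(neighbor_colors)
print(chosen)  # red
```

Unlike interpolation, this always yields a category that already exists in the data, so no invalid in-between values are created.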

### SMOTEN

SMOTE variant specifically designed for categorical features only.

```python
{ .api }
class SMOTEN(SMOTE):
    def __init__(
        self,
        categorical_encoder=None,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
    ):
        """
        Parameters
        ----------
        categorical_encoder : estimator, default=None
            Ordinal encoder used to encode categorical features. If None,
            uses OrdinalEncoder with default parameters.

        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used for generating synthetic samples.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

SMOTEN works exclusively with categorical features and uses the Value Difference Metric (VDM) to compute distances between categorical samples. Synthetic samples are generated by selecting the most frequent category among nearest neighbors for each feature.

## Boundary-focused Methods

### BorderlineSMOTE

SMOTE variant that focuses on samples near class boundaries.

```python
{ .api }
class BorderlineSMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
        m_neighbors=10,
        kind="borderline-1",
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used for generating synthetic samples.

        m_neighbors : int or object, default=10
            The nearest neighbors used to determine if a minority sample
            is in "danger" (near the boundary).

        kind : {"borderline-1", "borderline-2"}, default='borderline-1'
            The type of borderline SMOTE algorithm:
            - "borderline-1": considers only positive class for neighbor selection
            - "borderline-2": considers whole dataset, applies weight adjustments
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

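BorderlineSMOTE identifies "danger" samples: minority samples close to the decision boundary, meaning that at least half, but not all, of their `m_neighbors` nearest neighbors belong to another class. Minority samples whose neighbors all belong to another class are treated as noise and skipped. Synthetic samples are generated only from the danger samples, which concentrates over-sampling where it is most needed.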
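The danger test can be sketched with a plain nearest-neighbor query. The helper `danger_mask` below is a hypothetical illustration; it assumes the "at least half, but not all" neighbor rule from the original Borderline-SMOTE formulation, and the library's exact thresholds may differ in detail.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def danger_mask(X, y, minority_label, m=10):
    """Flag minority samples whose m nearest neighbors are mostly, but not all, other-class."""
    nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
    X_min = X[y == minority_label]
    # Drop the first column: each query point is its own nearest neighbor.
    idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    n_other = (y[idx] != minority_label).sum(axis=1)
    return (n_other >= m / 2) & (n_other < m)


# 1-D toy data: majority class at x = 0..5, minority class at x = 6..20.
X = np.arange(21, dtype=float).reshape(-1, 1)
y = np.array([0] * 6 + [1] * 15)

mask = danger_mask(X, y, minority_label=1, m=4)
print(X[y == 1][mask].ravel())  # [6.]
```

Only the minority point at x = 6 is flagged: two of its four nearest neighbors (x = 4 and x = 5) are majority samples, while every other minority point has at most one majority neighbor.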

### SVMSMOTE

SVM-based SMOTE that uses support vectors to identify critical samples.

```python
{ .api }
class SVMSMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=5,
        m_neighbors=10,
        svm_estimator=None,
        out_step=0.5,
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=5
            The nearest neighbors used for generating synthetic samples.

        m_neighbors : int or object, default=10
            The nearest neighbors used to determine sample safety/danger status.

        svm_estimator : estimator object, default=SVC()
            SVM classifier used to identify support vectors. Must expose
            support_ attribute after fitting.

        out_step : float, default=0.5
            Step size when extrapolating from safe support vectors.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

SVMSMOTE trains an SVM classifier and uses the minority class support vectors as seed points for synthetic sample generation. It classifies support vectors as "safe" or "danger" and applies different generation strategies accordingly.

## Adaptive Methods

### ADASYN

Adaptive Synthetic Sampling approach that adjusts generation density based on local distributions.

```python
{ .api }
class ADASYN(BaseOverSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=5,
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        n_neighbors : int or estimator object, default=5
            The nearest neighbors used to determine local distribution and
            generate synthetic samples. Can be int for number of neighbors
            or fitted neighbors estimator.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

ADASYN calculates a difficulty coefficient for each minority sample based on the ratio of majority class neighbors. Samples in more difficult regions (surrounded by majority samples) generate more synthetic samples, adapting to local class distributions.
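The allocation step can be sketched as follows: compute the fraction of other-class neighbors for each minority sample, normalize those fractions, and distribute the requested number of new samples proportionally. `adasyn_allocation` is a hypothetical helper written for this illustration, not a function of the library.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def adasyn_allocation(X, y, minority_label, n_new, k=5):
    """Return how many synthetic samples each minority sample should seed."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    X_min = X[y == minority_label]
    # Drop the first column: each query point is its own nearest neighbor.
    idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    ratio = (y[idx] != minority_label).mean(axis=1)  # difficulty per sample
    if ratio.sum() == 0:
        raise ValueError("No minority sample has other-class neighbors.")
    weights = ratio / ratio.sum()                    # normalize to sum to 1
    return np.rint(weights * n_new).astype(int)


# 1-D toy data: majority class at x = 0..5, minority class at x = 6..20.
X = np.arange(21, dtype=float).reshape(-1, 1)
y = np.array([0] * 6 + [1] * 15)

alloc = adasyn_allocation(X, y, minority_label=1, n_new=9, k=4)
print(alloc[:3])  # [6 3 0]
```

Here the point at x = 6 has two majority neighbors out of four and the point at x = 7 has one, so they receive the nine new samples in a 2:1 split, while minority points farther from the boundary receive none.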

## Cluster-based Methods

### KMeansSMOTE

Applies K-Means clustering before SMOTE generation to handle complex data distributions.

```python
{ .api }
class KMeansSMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=2,
        n_jobs=None,
        kmeans_estimator=None,
        cluster_balance_threshold="auto",
        density_exponent="auto",
    ):
        """
        Parameters
        ----------
        sampling_strategy : float, str, dict or callable, default='auto'
            Sampling information to resample the data set.

        random_state : int, RandomState instance or None, default=None
            Control the randomization of the algorithm.

        k_neighbors : int or object, default=2
            The nearest neighbors used for generating synthetic samples.

        n_jobs : int, default=None
            Number of CPU cores used during the cross-validation loop.

        kmeans_estimator : int or object, default=None
            K-Means clustering estimator or number of clusters. If None,
            uses MiniBatchKMeans. If int, creates MiniBatchKMeans with
            that number of clusters.

        cluster_balance_threshold : "auto" or float, default="auto"
            Threshold for determining balanced clusters. If "auto",
            determined by class ratios. Manual threshold can be set.

        density_exponent : "auto" or float, default="auto"
            Exponent for cluster density calculation. If "auto", uses
            feature-length based exponent.
        """

    def fit_resample(self, X, y):
        """
        Resample the dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input samples.
        y : array-like of shape (n_samples,)
            The input targets.

        Returns
        -------
        X_resampled : {array-like, sparse matrix} of shape (n_samples_new, n_features)
            The array containing the resampled data.
        y_resampled : array-like of shape (n_samples_new,)
            The corresponding label of `X_resampled`.
        """
```

KMeansSMOTE first clusters the data, then identifies imbalanced clusters where the minority class representation falls below a threshold. It applies SMOTE within these clusters, distributing synthetic samples based on cluster sparsity to achieve better balance in complex, multimodal datasets.

## Usage Examples

### Basic SMOTE

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Create imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)

print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({1: 900, 0: 100})

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({0: 900, 1: 900})
```

### Mixed-type Data with SMOTENC

```python
from collections import Counter
from numpy.random import RandomState
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTENC

# Simulate mixed dataset with categorical features
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)

# Make last 2 columns categorical (integer codes 0..3)
X[:, -2:] = RandomState(10).randint(0, 4, size=(1000, 2))

# Columns 18 and 19 are declared categorical by index
sm = SMOTENC(random_state=42, categorical_features=[18, 19])
X_res, y_res = sm.fit_resample(X, y)

print(f'Resampled dataset samples per class {Counter(y_res)}')
# Resampled dataset samples per class Counter({0: 900, 1: 900})
```

### Boundary-focused Oversampling

```python
from collections import Counter
from imblearn.over_sampling import BorderlineSMOTE

# Reuses X and y from the Basic SMOTE example above.
# Focus on borderline samples
sm = BorderlineSMOTE(random_state=42, kind='borderline-1')
X_res, y_res = sm.fit_resample(X, y)

print('Borderline SMOTE result %s' % Counter(y_res))
# Generates samples only from minority samples near decision boundary
```

## Type Definitions

```python
{ .api }
from typing import Union, Dict, Callable, Optional, Any

import numpy as np
from numpy import ndarray
from scipy.sparse import spmatrix
from sklearn.base import BaseEstimator

ArrayLike = Union[ndarray, spmatrix]
SamplingStrategy = Union[float, str, Dict[Any, int], Callable[[ndarray], Dict[Any, int]]]
NeighborsLike = Union[int, BaseEstimator]
RandomState = Union[int, np.random.RandomState, None]
```

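All over-sampling methods share common characteristics:

- Support for multi-class resampling; the SMOTE variants and ADASYN use a one-vs.-rest scheme, while `RandomOverSampler` resamples each class independently
- Handling of both dense and sparse matrices
- Configurable sampling strategies for fine-tuned class balancing
- Integration with pipelines (through `imblearn.pipeline.Pipeline`, since scikit-learn's own `Pipeline` does not call `fit_resample`) and with scikit-learn cross-validation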