
# Under-Sampling Methods

Under-sampling methods reduce the size of the majority class(es) to address class imbalance. These techniques remove samples from the dataset, either randomly or using intelligent selection criteria to preserve important boundary information.

## Categories of Under-Sampling Methods

### Random Under-Sampling

Methods that randomly select samples to remove from majority classes.

### Prototype Generation

Methods that generate new synthetic samples to represent the original data distribution.

### Prototype Selection

Methods that intelligently select which samples to keep based on neighborhood analysis, distance metrics, or classification difficulty.

### Neighborhood Cleaning

Methods that remove noisy samples or samples that negatively affect classification performance.

---

## Random Under-Sampling

### RandomUnderSampler

Random under-sampling of majority class samples with or without replacement.

```python { .api }
class RandomUnderSampler:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        replacement=False
    ):
```

**Parameters:**
- `sampling_strategy` (float, str, dict, callable): Strategy to control sampling. Default is "auto".
- `random_state` (int, RandomState, None): Random number generator seed for reproducibility.
- `replacement` (bool): Whether sampling is with or without replacement. Default is False.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information per class.
- `sample_indices_` (ndarray): Indices of selected samples.
- `n_features_in_` (int): Number of input features.
- `feature_names_in_` (ndarray): Names of input features when available.

**Methods:**
- `fit_resample(X, y)`: Fit the sampler and resample the dataset.

**Usage Example:**
```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Build an imbalanced toy dataset (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Create random under-sampler
rus = RandomUnderSampler(random_state=42)

# Apply under-sampling
X_resampled, y_resampled = rus.fit_resample(X, y)
print(f"Original: {Counter(y)}")
print(f"Resampled: {Counter(y_resampled)}")
```
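To make the mechanics concrete, the following is a minimal pure-Python sketch of random under-sampling without replacement. It is not imbalanced-learn's implementation: the helper name `random_under_sample` and the toy labels are assumptions made for illustration. Every class is reduced to the size of the smallest class, which is the effect the default strategy has on the non-minority classes.

```python
import random
from collections import Counter, defaultdict


def random_under_sample(y, seed=0):
    """Return sorted indices that keep every class at the minority-class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sample without replacement; the minority class is kept whole.
        kept.extend(rng.sample(indices, n_min))
    return sorted(kept)


# Hypothetical labels: 8 majority samples, 3 minority samples.
y_toy = [0] * 8 + [1] * 3
kept = random_under_sample(y_toy)
print(Counter(y_toy[i] for i in kept))
```

The returned indices play the same role as `sample_indices_`: they can be used to slice both the feature matrix and the label vector.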

---

## Prototype Generation

### ClusterCentroids

Under-sample by generating centroids based on clustering methods. Replaces clusters of majority samples with their centroids.

```python { .api }
class ClusterCentroids:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        estimator=None,
        voting="auto"
    ):
```

**Parameters:**
- `sampling_strategy` (float, str, dict, callable): Strategy to control sampling. Default is "auto".
- `random_state` (int, RandomState, None): Random number generator seed.
- `estimator` (estimator object): Clustering estimator with `n_clusters` parameter and `cluster_centers_` attribute. Defaults to KMeans.
- `voting` (str): Voting strategy for generating new samples:
  - "hard": Use the nearest neighbors of the centroids (real samples)
  - "soft": Use the centroids directly (synthetic samples)
  - "auto": Choose based on input sparsity ("hard" for sparse input, "soft" otherwise)

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information per class.
- `estimator_` (estimator object): The validated clustering estimator.
- `voting_` (str): The validated voting strategy.
- `n_features_in_` (int): Number of input features.
- `feature_names_in_` (ndarray): Names of input features when available.

**Methods:**
- `fit_resample(X, y)`: Fit the sampler and resample the dataset.

**Usage Example:**
```python
from imblearn.under_sampling import ClusterCentroids
from sklearn.cluster import MiniBatchKMeans

# Create cluster centroids sampler with custom estimator
cc = ClusterCentroids(
    estimator=MiniBatchKMeans(n_init=1, random_state=0),
    random_state=42
)

# Apply cluster-based under-sampling
X_resampled, y_resampled = cc.fit_resample(X, y)
```
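The difference between the "soft" and "hard" voting strategies can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the library's code: the cluster is a hypothetical group of majority samples that a clustering step has already found, and the helper names `centroid` and `nearest_sample` are invented here.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def nearest_sample(target, points):
    """Original sample closest (Euclidean distance) to the target point."""
    return min(points, key=lambda p: math.dist(p, target))


# Hypothetical cluster of majority-class samples.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (1.0, 1.0)]

soft_prototype = centroid(cluster)                        # synthetic point
hard_prototype = nearest_sample(soft_prototype, cluster)  # existing sample
print(soft_prototype, hard_prototype)
```

With "soft" voting the cluster is represented by a point that may not exist in the data, whereas "hard" voting always returns one of the original samples.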

---

## Prototype Selection Methods

### NearMiss

Under-sample based on NearMiss methods, which select majority samples according to their distance to minority class samples.

```python { .api }
class NearMiss:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        version=1,
        n_neighbors=3,
        n_neighbors_ver3=3,
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (float, str, dict, callable): Strategy to control sampling.
- `version` (int): NearMiss version (1, 2, or 3):
  - Version 1: Keep the majority samples with the smallest average distance to their `n_neighbors` closest minority samples
  - Version 2: Keep the majority samples with the smallest average distance to their `n_neighbors` farthest minority samples
  - Version 3: Two-step process: first pre-select the `n_neighbors_ver3` nearest majority neighbors of each minority sample, then keep those with the largest average distance to their `n_neighbors` nearest minority samples
- `n_neighbors` (int, estimator): Number of neighbors or KNN estimator.
- `n_neighbors_ver3` (int, estimator): Number of neighbors for version 3 pre-selection.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `nn_ver3_` (estimator object): K-nearest neighbors estimator for version 3.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import NearMiss

# NearMiss version 1 (select closest to minority)
nm1 = NearMiss(version=1)
X_res1, y_res1 = nm1.fit_resample(X, y)

# NearMiss version 3 (two-step selection)
nm3 = NearMiss(version=3, n_neighbors=3, n_neighbors_ver3=3)
X_res3, y_res3 = nm3.fit_resample(X, y)
```
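The selection rule of version 1 can be sketched in pure Python as follows. This is a simplified illustration rather than the library's implementation; the function name `near_miss_1` and the toy points are assumptions. Each majority sample is scored by its mean distance to its closest minority samples, and the lowest-scoring samples are kept.

```python
import math


def near_miss_1(majority, minority, n_keep, n_neighbors=3):
    """Keep the majority samples with the smallest mean distance
    to their n_neighbors closest minority samples."""
    def score(point):
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:n_neighbors]
        return sum(nearest) / len(nearest)

    return sorted(majority, key=score)[:n_keep]


# Hypothetical 2-D points.
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (3.0, 3.0), (1.0, 1.0), (6.0, 6.0), (2.0, 2.0)]

selected = near_miss_1(majority, minority, n_keep=len(minority))
print(selected)
```

Replacing `dists[:n_neighbors]` with `dists[-n_neighbors:]` would score against the farthest minority samples instead, which is the idea behind version 2.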

### InstanceHardnessThreshold

Under-sample by instance hardness: samples that a classifier, evaluated with cross-validation, assigns a low probability for their true class are removed first.

```python { .api }
class InstanceHardnessThreshold:
    def __init__(
        self,
        *,
        estimator=None,
        sampling_strategy="auto",
        random_state=None,
        cv=5,
        n_jobs=None
    ):
```

**Parameters:**
- `estimator` (estimator object): Classifier with `predict_proba` method. Defaults to RandomForestClassifier.
- `sampling_strategy` (float, str, dict, callable): Strategy to control sampling.
- `random_state` (int, RandomState, None): Random number generator seed.
- `cv` (int): Number of cross-validation folds for hardness estimation.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `estimator_` (estimator object): The validated classifier.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import InstanceHardnessThreshold
from sklearn.ensemble import RandomForestClassifier

# Use custom classifier for hardness estimation
iht = InstanceHardnessThreshold(
    estimator=RandomForestClassifier(n_estimators=50),
    cv=3,
    random_state=42
)
X_resampled, y_resampled = iht.fit_resample(X, y)
```
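The core idea, keeping the samples a classifier finds easiest, can be illustrated without any model at all. The sketch below assumes the out-of-fold probabilities have already been computed; the probabilities, the helper name `keep_easiest`, and the fixed `n_keep` are hypothetical. The library derives a probability threshold instead, so the number of samples it retains may not match a target count exactly.

```python
def keep_easiest(indices, proba_true_class, n_keep):
    """Keep the n_keep samples with the highest probability of their true class."""
    ranked = sorted(indices, key=lambda i: proba_true_class[i], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical out-of-fold probabilities P(true class | x) for six majority samples.
proba_true_class = {0: 0.95, 1: 0.40, 2: 0.88, 3: 0.15, 4: 0.70, 5: 0.99}

kept = keep_easiest(list(proba_true_class), proba_true_class, n_keep=3)
print(kept)
```

Samples 1 and 3, which the hypothetical classifier gets wrong most often, are the first to be dropped.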

### TomekLinks

Under-sample by removing Tomek links: pairs of samples from different classes that are each other's nearest neighbor.

```python { .api }
class TomekLinks:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control which classes to clean.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `sample_indices_` (ndarray): Indices of selected samples.

**Methods:**
- `fit_resample(X, y)`: Remove Tomek links from the dataset.
- `is_tomek(y, nn_index, class_type)`: Static method to detect Tomek pairs.

**Usage Example:**
```python
from imblearn.under_sampling import TomekLinks

# Remove Tomek links (noisy border samples)
tl = TomekLinks()
X_cleaned, y_cleaned = tl.fit_resample(X, y)
print(f"Removed {len(y) - len(y_cleaned)} samples that were part of Tomek links")
```
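A Tomek link is easy to detect by brute force, as the following pure-Python sketch shows. It is meant only to illustrate the definition; the function names and toy data are assumptions, and the quadratic search is not how a production implementation would find neighbors.

```python
import math


def nearest_index(i, X):
    """Index of the sample closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    nn = [nearest_index(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


# Hypothetical 2-D samples lying on a line.
X_toy = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.2, 0.0), (5.0, 0.0)]
y_toy = [0, 0, 0, 1, 1]

links = tomek_links(X_toy, y_toy)
print(links)
```

Samples 0 and 1 are also mutual nearest neighbors, but they share a label and therefore do not form a link. Cleaning only the majority class would remove sample 2 and keep sample 3.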

### EditedNearestNeighbours

Under-sample by removing samples whose nearest neighbors do not agree with their own class.

```python { .api }
class EditedNearestNeighbours:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_neighbors=3,
        kind_sel="all",
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `n_neighbors` (int, estimator): Number of neighbors to examine or KNN estimator.
- `kind_sel` (str): Selection strategy:
  - "all": Remove if any neighbor is from a different class
  - "mode": Remove if most neighbors are from a different class
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import EditedNearestNeighbours

# Stricter cleaning (remove if any neighbor differs)
enn_all = EditedNearestNeighbours(kind_sel="all", n_neighbors=3)
X_clean_all, y_clean_all = enn_all.fit_resample(X, y)

# Less aggressive cleaning (remove only if most neighbors differ)
enn_mode = EditedNearestNeighbours(kind_sel="mode", n_neighbors=5)
X_clean_mode, y_clean_mode = enn_mode.fit_resample(X, y)
```
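The effect of `kind_sel` is easiest to see on a tiny dataset. The sketch below is a simplified pure-Python rendering of the editing rule, not the library's code; the function name `enn_keep`, the `target_class` argument, and the 1-D toy data are assumptions for illustration.

```python
import math
from collections import Counter


def enn_keep(X, y, target_class, n_neighbors=3, kind_sel="all"):
    """Indices kept after editing samples of target_class by neighbour agreement."""
    kept = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != target_class:
            kept.append(i)  # other classes are left untouched
            continue
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(xi, X[j]))
        labels = [y[j] for j in others[:n_neighbors]]
        if kind_sel == "all":
            agrees = all(label == yi for label in labels)
        else:  # "mode": majority vote of the neighbours
            agrees = Counter(labels).most_common(1)[0][0] == yi
        if agrees:
            kept.append(i)
    return kept


# Hypothetical 1-D samples; class 0 is the majority class.
X_toy = [(0.0,), (1.0,), (2.0,), (3.0,), (3.5,), (4.0,)]
y_toy = [0, 0, 0, 0, 1, 1]

kept_all = enn_keep(X_toy, y_toy, target_class=0, kind_sel="all")
kept_mode = enn_keep(X_toy, y_toy, target_class=0, kind_sel="mode")
print(kept_all, kept_mode)
```

Sample 2 has one neighbor out of three from the other class, so "all" removes it while "mode" keeps it. Sample 3 is outvoted under both strategies.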

### RepeatedEditedNearestNeighbours

Repeated application of EditedNearestNeighbours until convergence or until a stopping criterion such as `max_iter` is met.

```python { .api }
class RepeatedEditedNearestNeighbours:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_neighbors=3,
        max_iter=100,
        kind_sel="all",
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `n_neighbors` (int, estimator): Number of neighbors or KNN estimator.
- `max_iter` (int): Maximum number of iterations.
- `kind_sel` (str): Selection strategy ("all" or "mode").
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `enn_` (sampler object): The EditedNearestNeighbours instance.
- `sample_indices_` (ndarray): Indices of selected samples.
- `n_iter_` (int): Number of iterations performed.

**Usage Example:**
```python
from imblearn.under_sampling import RepeatedEditedNearestNeighbours

# Repeat ENN until convergence or max_iter
renn = RepeatedEditedNearestNeighbours(
    n_neighbors=3,
    max_iter=50,
    kind_sel="all"
)
X_resampled, y_resampled = renn.fit_resample(X, y)
print(f"Stopped after {renn.n_iter_} iterations")
```
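The repetition can be sketched as a loop around a single editing pass. This pure-Python version is an illustration only: `enn_pass` and `repeated_enn` are hypothetical helpers, the pass uses the "all" rule, and the only stopping checks are "nothing was removed" and `max_iter`. The library's own implementation may apply further stopping checks that are omitted here.

```python
import math


def enn_pass(X, y, target_class, n_neighbors=3):
    """One editing pass with the "all" rule: indices into X that survive."""
    kept = []
    for i in range(len(X)):
        if y[i] == target_class:
            others = sorted((j for j in range(len(X)) if j != i),
                            key=lambda j: math.dist(X[i], X[j]))
            if any(y[j] != y[i] for j in others[:n_neighbors]):
                continue  # at least one neighbour disagrees: drop the sample
        kept.append(i)
    return kept


def repeated_enn(X, y, target_class, n_neighbors=3, max_iter=100):
    """Apply enn_pass until nothing is removed or max_iter is reached."""
    for n_iter in range(1, max_iter + 1):
        kept = enn_pass(X, y, target_class, n_neighbors)
        if len(kept) == len(X):
            break  # converged: the last pass removed nothing
        X = [X[i] for i in kept]
        y = [y[i] for i in kept]
    return X, y, n_iter


# Hypothetical 1-D samples; class 0 is the majority class.
X_toy = [(0.0,), (1.0,), (2.5,), (3.0,), (3.4,), (4.0,), (4.5,)]
y_toy = [0, 0, 0, 0, 1, 1, 1]

X_final, y_final, n_iter = repeated_enn(X_toy, y_toy, target_class=0, n_neighbors=1)
print(y_final, n_iter)
```

The first pass removes the sample at 3.0, which exposes the sample at 2.5 to the other class in the second pass. A single editing pass would have stopped after the first removal.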

### AllKNN

Apply EditedNearestNeighbours repeatedly with increasing neighborhood sizes from 1 to `n_neighbors`.

```python { .api }
class AllKNN:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        n_neighbors=3,
        kind_sel="all",
        allow_minority=False,
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `n_neighbors` (int, estimator): Maximum number of neighbors or KNN estimator.
- `kind_sel` (str): Selection strategy ("all" or "mode").
- `allow_minority` (bool): Allow majority classes to become minority classes.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `enn_` (sampler object): The EditedNearestNeighbours instance.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import AllKNN

# Progressive neighborhood cleaning
allknn = AllKNN(n_neighbors=5, kind_sel="all")
X_resampled, y_resampled = allknn.fit_resample(X, y)
```

### OneSidedSelection

Under-sample using the one-sided selection method, which combines the condensed nearest neighbor (CNN) rule with Tomek link removal.

```python { .api }
class OneSidedSelection:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `random_state` (int, RandomState, None): Random number generator seed.
- `n_neighbors` (int, estimator, None): Number of neighbors or KNN estimator. Defaults to 1-NN.
- `n_seeds_S` (int): Number of seed samples to extract for set S.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `estimators_` (list): List of KNN estimators used per class.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import OneSidedSelection

# One-sided selection with custom parameters
oss = OneSidedSelection(
    n_neighbors=3,
    n_seeds_S=1,
    random_state=42
)
X_resampled, y_resampled = oss.fit_resample(X, y)
```

### CondensedNearestNeighbour

Under-sample using the condensed nearest neighbor rule to find a consistent subset, that is, a subset that still classifies the remaining samples correctly with a nearest-neighbor rule.

```python { .api }
class CondensedNearestNeighbour:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `random_state` (int, RandomState, None): Random number generator seed.
- `n_neighbors` (int, estimator, None): Number of neighbors or KNN estimator. Defaults to 1-NN.
- `n_seeds_S` (int): Number of seed samples for set S initialization.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `estimators_` (list): List of KNN estimators used per class.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import CondensedNearestNeighbour

# Condensed nearest neighbor selection
cnn = CondensedNearestNeighbour(
    n_neighbors=1,
    n_seeds_S=1,
    random_state=42
)
X_resampled, y_resampled = cnn.fit_resample(X, y)
```
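A single-pass, binary-class version of the condensing idea can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the library's algorithm: the function name `condensed_nn`, the one-seed initialisation, and the toy data are invented here. The store starts with every minority sample plus one random majority sample, and a majority sample is added only when a 1-NN rule on the current store would misclassify it.

```python
import math
import random


def condensed_nn(X, y, minority_label, seed=0):
    """Indices of a condensed subset: all minority samples plus the majority
    samples that a 1-NN rule on the growing store would misclassify."""
    rng = random.Random(seed)
    store = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    store.append(rng.choice(majority))  # a single seed sample
    for i in majority:
        if i in store:
            continue
        nearest = min(store, key=lambda j: math.dist(X[i], X[j]))
        if y[nearest] != y[i]:
            store.append(i)  # misclassified, so it carries boundary information
    return sorted(store)


# Hypothetical 1-D samples; class 1 is the minority class.
X_toy = [(0.0,), (0.5,), (1.0,), (1.5,), (2.0,), (2.4,), (3.0,)]
y_toy = [0, 0, 0, 0, 0, 1, 1]

condensed = condensed_nn(X_toy, y_toy, minority_label=1)
print(condensed)
```

Which interior majority samples survive depends on the random seed, but the majority sample closest to the class boundary (index 4) is retained for every choice of seed in this toy dataset.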

---

## Neighborhood Cleaning Methods

### NeighbourhoodCleaningRule

Under-sample using the neighborhood cleaning rule, which combines ENN and KNN for noise removal.

```python { .api }
class NeighbourhoodCleaningRule:
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        edited_nearest_neighbours=None,
        n_neighbors=3,
        threshold_cleaning=0.5,
        n_jobs=None
    ):
```

**Parameters:**
- `sampling_strategy` (str, list, callable): Strategy to control sampling.
- `edited_nearest_neighbours` (estimator, None): ENN estimator for initial cleaning. Defaults to ENN with `kind_sel="mode"`.
- `n_neighbors` (int, estimator): Number of neighbors or KNN estimator.
- `threshold_cleaning` (float): Threshold for considering classes in second cleaning phase: `Ci > C × threshold`.
- `n_jobs` (int): Number of parallel jobs.

**Attributes:**
- `sampling_strategy_` (dict): Dictionary containing sampling information.
- `edited_nearest_neighbours_` (estimator): The ENN object for first cleaning phase.
- `nn_` (estimator object): Validated K-nearest neighbors estimator.
- `classes_to_clean_` (list): Classes considered for second cleaning phase.
- `sample_indices_` (ndarray): Indices of selected samples.

**Usage Example:**
```python
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.under_sampling import EditedNearestNeighbours

# Default neighborhood cleaning
ncr = NeighbourhoodCleaningRule()
X_cleaned, y_cleaned = ncr.fit_resample(X, y)

# Custom ENN for first phase
custom_enn = EditedNearestNeighbours(kind_sel="all", n_neighbors=5)
ncr_custom = NeighbourhoodCleaningRule(
    edited_nearest_neighbours=custom_enn,
    threshold_cleaning=0.3
)
X_cleaned_custom, y_cleaned_custom = ncr_custom.fit_resample(X, y)
```

---

## Method Selection Guidelines

### When to Use Each Method

**Random Under-Sampling:**
- Simple baseline approach
- When computational resources are limited
- For initial experimentation

**Prototype Generation (ClusterCentroids):**
- When you want to preserve cluster structure
- For high-dimensional data where centroids can represent regions well
- When interpretability of synthetic samples is important

**Prototype Selection (NearMiss, ENN variants):**
- When preserving decision boundary information is crucial
- For datasets where border samples are informative
- When you want to remove noisy/outlier samples

**Neighborhood Cleaning:**
- When dataset contains significant noise
- For improving classifier performance through data cleaning
- When combining multiple cleaning strategies

### Computational Complexity

- **RandomUnderSampler:** O(n) - fastest
- **ClusterCentroids:** O(n × k × iterations) - depends on clustering algorithm
- **NearMiss:** O(n²) - distance calculations between all samples
- **ENN variants:** O(n × k × neighbors) - depends on neighborhood size
- **TomekLinks:** O(n²) - pairwise distance calculations
- **CNN/OSS:** O(n²) - iterative neighbor searches

### Multi-Class Support

All methods support multi-class resampling:
- **One-vs.-rest:** NearMiss, ENN variants, TomekLinks, NeighbourhoodCleaningRule
- **One-vs.-one:** OneSidedSelection, CondensedNearestNeighbour
- **Independent sampling:** RandomUnderSampler, ClusterCentroids, InstanceHardnessThreshold

### Pipeline Integration

```python
# Use imbalanced-learn's Pipeline: scikit-learn's Pipeline rejects samplers,
# because they implement fit_resample rather than transform.
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler

# Create preprocessing pipeline
pipeline = Pipeline([
    ('sampler', RandomUnderSampler(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit pipeline (resampling is applied to the training data only)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```