# Combination Methods

Combination methods in imbalanced-learn provide a powerful approach to handling imbalanced datasets by sequentially applying both over-sampling and under-sampling techniques. These hybrid methods first generate synthetic samples to balance the dataset, then remove noisy or problematic samples to improve data quality.

## Overview

Combination methods work by:

1. **Over-sampling phase**: Generate synthetic samples using techniques like SMOTE to increase minority class representation
2. **Under-sampling phase**: Remove noisy, borderline, or problematic samples using cleaning techniques like Edited Nearest Neighbours or Tomek Links removal

This two-step approach aims to achieve both balanced class distribution and improved data quality, potentially leading to better classifier performance than using either technique alone.
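To make the two phases concrete, the sketch below applies them by hand: over-sample with SMOTE, then clean with Edited Nearest Neighbours. The combination classes documented below bundle this kind of sequence; the dataset and parameter values here are purely illustrative.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

# Illustrative imbalanced dataset: roughly 10% minority class
X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=0)

# Phase 1: over-sample the minority class with SMOTE
X_os, y_os = SMOTE(random_state=0).fit_resample(X, y)

# Phase 2: clean the augmented dataset with ENN
enn = EditedNearestNeighbours(sampling_strategy="all")
X_clean, y_clean = enn.fit_resample(X_os, y_os)

print(Counter(y))        # imbalanced
print(Counter(y_os))     # balanced after SMOTE
print(Counter(y_clean))  # balanced and cleaned after ENN
```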
## Available Methods

The `imblearn.combine` module provides two main combination methods:

- **SMOTEENN**: Combines SMOTE over-sampling with Edited Nearest Neighbours cleaning
- **SMOTETomek**: Combines SMOTE over-sampling with Tomek Links removal

Both methods follow the same general pattern: apply SMOTE first to generate synthetic samples, then apply a cleaning technique to remove noisy samples from the augmented dataset.
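Both classes expose the same resampler interface, so switching between them is a one-line change. A minimal side-by-side on arbitrary toy data:

```python
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(weights=[0.2, 0.8], random_state=0)

# Identical interface; only the cleaning step differs.
X_enn, y_enn = SMOTEENN(random_state=0).fit_resample(X, y)        # SMOTE + ENN
X_tomek, y_tomek = SMOTETomek(random_state=0).fit_resample(X, y)  # SMOTE + Tomek links
```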
---

## SMOTEENN

```python { .api }
class SMOTEENN(
    *,
    sampling_strategy="auto",
    random_state=None,
    smote=None,
    enn=None,
    n_jobs=None
)
```

Over-sampling using SMOTE and cleaning using Edited Nearest Neighbours.

This method combines the SMOTE over-sampling technique with Edited Nearest Neighbours (ENN) cleaning. It first applies SMOTE to generate synthetic samples for minority classes, then uses ENN to remove noisy samples from the resulting dataset.
### Parameters

- **sampling_strategy** : `float`, `str`, `dict` or `callable`, default=`'auto'`

  Sampling information to resample the dataset. The accepted forms are illustrated in the sketch after this parameter list.

  - When `float`, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M, where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.

    **Warning**: `float` is only available for **binary** classification. An error is raised for multi-class classification.

  - When `str`, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
    - `'minority'`: resample only the minority class
    - `'not minority'`: resample all classes but the minority class
    - `'not majority'`: resample all classes but the majority class
    - `'all'`: resample all classes
    - `'auto'`: equivalent to `'not majority'`

  - When `dict`, the keys correspond to the targeted classes and the values to the desired number of samples for each targeted class.

  - When `callable`, a function taking `y` and returning a `dict`. The keys correspond to the targeted classes and the values to the desired number of samples for each class.

- **random_state** : `int`, `RandomState` instance, default=`None`

  Control the randomization of the algorithm.

  - If `int`, `random_state` is the seed used by the random number generator
  - If `RandomState` instance, `random_state` is the random number generator
  - If `None`, the random number generator is the `RandomState` instance used by `np.random`

- **smote** : sampler object, default=`None`

  The `SMOTE` object to use. If not given, a `SMOTE` object with default parameters will be used.

- **enn** : sampler object, default=`None`

  The `EditedNearestNeighbours` object to use. If not given, an `EditedNearestNeighbours` object with `sampling_strategy='all'` will be used.

- **n_jobs** : `int`, default=`None`

  Number of CPU cores used during the cross-validation loop. `None` means 1 unless in a `joblib.parallel_backend` context. `-1` means using all processors.
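As a quick illustration of the accepted `sampling_strategy` forms (the values are arbitrary, not defaults):

```python
from imblearn.combine import SMOTEENN

# float (binary only): desired minority/majority ratio after over-sampling
SMOTEENN(sampling_strategy=0.5)

# str: named resampling target
SMOTEENN(sampling_strategy="minority")

# dict: desired number of samples per targeted class
SMOTEENN(sampling_strategy={0: 600, 1: 900})

# callable: computed from y at fit time
SMOTEENN(sampling_strategy=lambda y: {0: 2 * sum(y == 0)})
```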
### Attributes

- **sampling_strategy_** : `dict`

  Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.

- **smote_** : sampler object

  The validated `SMOTE` instance.

- **enn_** : sampler object

  The validated `EditedNearestNeighbours` instance.

- **n_features_in_** : `int`

  Number of features in the input dataset.

- **feature_names_in_** : `ndarray` of shape `(n_features_in_,)`

  Names of features seen during `fit`. Defined only when `X` has feature names that are all strings.
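These attributes become available once the sampler has been fitted; a brief sketch on toy data:

```python
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

X, y = make_classification(weights=[0.1, 0.9], random_state=0)

sme = SMOTEENN(random_state=0)
sme.fit_resample(X, y)

print(sme.sampling_strategy_)  # samples to generate per targeted class
print(sme.smote_)              # the validated SMOTE instance
print(sme.enn_)                # the validated EditedNearestNeighbours instance
print(sme.n_features_in_)      # 20 (make_classification's default n_features)
```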
### Methods

```python { .api }
def fit_resample(X, y, **params)
```

Resample the dataset.

**Parameters:**

- **X** : `{array-like, dataframe, sparse matrix}` of shape `(n_samples, n_features)`

  Matrix containing the data to be resampled.

- **y** : `array-like` of shape `(n_samples,)`

  Corresponding label for each sample in `X`.

- ****params** : `dict`

  Extra parameters passed to the underlying samplers.

**Returns:**

- **X_resampled** : `{array-like, dataframe, sparse matrix}` of shape `(n_samples_new, n_features)`

  The array containing the resampled data.

- **y_resampled** : `array-like` of shape `(n_samples_new,)`

  The corresponding labels of `X_resampled`.
### Example Usage

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# Create an imbalanced dataset
X, y = make_classification(
    n_classes=2,
    class_sep=2,
    weights=[0.1, 0.9],
    n_informative=3,
    n_redundant=1,
    flip_y=0,
    n_features=20,
    n_clusters_per_class=1,
    n_samples=1000,
    random_state=10
)

print('Original dataset shape:', Counter(y))
# Original dataset shape: Counter({1: 900, 0: 100})

# Apply SMOTEENN
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)

print('Resampled dataset shape:', Counter(y_res))
# Resampled dataset shape: Counter({0: 900, 1: 881})

# Using custom SMOTE and ENN parameters
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

custom_smote = SMOTE(k_neighbors=3, random_state=42)
custom_enn = EditedNearestNeighbours(n_neighbors=5, kind_sel='mode')

sme_custom = SMOTEENN(
    smote=custom_smote,
    enn=custom_enn,
    random_state=42
)
X_res_custom, y_res_custom = sme_custom.fit_resample(X, y)
```
### Notes

- The method was first presented in Batista et al. (2004)
- Supports multi-class resampling following the schemes used by SMOTE and ENN
- The ENN cleaning step removes samples that are misclassified by their nearest neighbors, which can help remove both noisy samples and borderline cases created by SMOTE
- The final dataset size is typically smaller than what SMOTE alone would produce due to the cleaning step

---
## SMOTETomek

```python { .api }
class SMOTETomek(
    *,
    sampling_strategy="auto",
    random_state=None,
    smote=None,
    tomek=None,
    n_jobs=None
)
```

Over-sampling using SMOTE and cleaning using Tomek links.

This method combines the SMOTE over-sampling technique with Tomek links removal. It first applies SMOTE to generate synthetic samples for minority classes, then removes Tomek links (pairs of samples from opposite classes that are each other's nearest neighbors) from the resulting dataset.
### Parameters

- **sampling_strategy** : `float`, `str`, `dict` or `callable`, default=`'auto'`

  Sampling information to resample the dataset.

  - When `float`, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M, where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.

    **Warning**: `float` is only available for **binary** classification. An error is raised for multi-class classification.

  - When `str`, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
    - `'minority'`: resample only the minority class
    - `'not minority'`: resample all classes but the minority class
    - `'not majority'`: resample all classes but the majority class
    - `'all'`: resample all classes
    - `'auto'`: equivalent to `'not majority'`

  - When `dict`, the keys correspond to the targeted classes and the values to the desired number of samples for each targeted class.

  - When `callable`, a function taking `y` and returning a `dict`. The keys correspond to the targeted classes and the values to the desired number of samples for each class.

- **random_state** : `int`, `RandomState` instance, default=`None`

  Control the randomization of the algorithm.

  - If `int`, `random_state` is the seed used by the random number generator
  - If `RandomState` instance, `random_state` is the random number generator
  - If `None`, the random number generator is the `RandomState` instance used by `np.random`

- **smote** : sampler object, default=`None`

  The `SMOTE` object to use. If not given, a `SMOTE` object with default parameters will be used.

- **tomek** : sampler object, default=`None`

  The `TomekLinks` object to use. If not given, a `TomekLinks` object with `sampling_strategy='all'` will be used.

- **n_jobs** : `int`, default=`None`

  Number of CPU cores used during the cross-validation loop. `None` means 1 unless in a `joblib.parallel_backend` context. `-1` means using all processors.
### Attributes

- **sampling_strategy_** : `dict`

  Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.

- **smote_** : sampler object

  The validated `SMOTE` instance.

- **tomek_** : sampler object

  The validated `TomekLinks` instance.

- **n_features_in_** : `int`

  Number of features in the input dataset.

- **feature_names_in_** : `ndarray` of shape `(n_features_in_,)`

  Names of features seen during `fit`. Defined only when `X` has feature names that are all strings.
### Methods

```python { .api }
def fit_resample(X, y, **params)
```

Resample the dataset.

**Parameters:**

- **X** : `{array-like, dataframe, sparse matrix}` of shape `(n_samples, n_features)`

  Matrix containing the data to be resampled.

- **y** : `array-like` of shape `(n_samples,)`

  Corresponding label for each sample in `X`.

- ****params** : `dict`

  Extra parameters passed to the underlying samplers.

**Returns:**

- **X_resampled** : `{array-like, dataframe, sparse matrix}` of shape `(n_samples_new, n_features)`

  The array containing the resampled data.

- **y_resampled** : `array-like` of shape `(n_samples_new,)`

  The corresponding labels of `X_resampled`.
### Example Usage

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

# Create an imbalanced dataset
X, y = make_classification(
    n_classes=2,
    class_sep=2,
    weights=[0.1, 0.9],
    n_informative=3,
    n_redundant=1,
    flip_y=0,
    n_features=20,
    n_clusters_per_class=1,
    n_samples=1000,
    random_state=10
)

print('Original dataset shape:', Counter(y))
# Original dataset shape: Counter({1: 900, 0: 100})

# Apply SMOTETomek
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)

print('Resampled dataset shape:', Counter(y_res))
# Resampled dataset shape: Counter({0: 900, 1: 900})

# Using custom SMOTE and Tomek parameters
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

custom_smote = SMOTE(k_neighbors=5, random_state=42)
custom_tomek = TomekLinks(sampling_strategy='majority')

smt_custom = SMOTETomek(
    smote=custom_smote,
    tomek=custom_tomek,
    random_state=42
)
X_res_custom, y_res_custom = smt_custom.fit_resample(X, y)
```
### Notes

- The method was first presented in Batista et al. (2003)
- Supports multi-class resampling following the schemes used by SMOTE and TomekLinks
- Tomek links removal focuses on cleaning the decision boundary by removing ambiguous samples
- Generally preserves more samples than SMOTEENN since Tomek links removal is less aggressive than ENN

---
## Comparison: SMOTEENN vs SMOTETomek

| Aspect | SMOTEENN | SMOTETomek |
|--------|----------|------------|
| **Cleaning Method** | Edited Nearest Neighbours | Tomek Links |
| **Cleaning Aggressiveness** | More aggressive | Less aggressive |
| **Typical Sample Reduction** | Higher | Lower |
| **Focus** | Removes misclassified samples | Removes ambiguous boundary samples |
| **Best Use Case** | Noisy datasets | Refining decision boundaries |
### When to Use Each Method

**Use SMOTEENN when:**
- Your dataset contains significant noise
- You want more aggressive cleaning
- Class boundaries are poorly defined
- You can afford to lose more samples for better quality

**Use SMOTETomek when:**
- Your dataset is relatively clean
- You want to preserve more samples
- You need to clean decision boundaries
- Class overlap is the main issue
### Algorithm Workflow

Both methods follow the same general workflow:

1. **Input**: Imbalanced dataset (X, y)
2. **SMOTE Phase**: Apply SMOTE over-sampling to generate synthetic minority class samples
3. **Cleaning Phase**:
   - SMOTEENN: Apply ENN to remove misclassified samples
   - SMOTETomek: Remove Tomek links from the dataset
4. **Output**: Balanced and cleaned dataset

This sequential approach ensures that the benefits of both techniques are realized: balanced class distribution from SMOTE and improved data quality from the cleaning step.
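The same workflow can also be composed explicitly, for instance with `imblearn.pipeline.Pipeline`. The sketch below expresses the SMOTETomek sequence by hand; it illustrates the data flow, not the classes' internal implementation.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=10)

# Steps 2 and 3 of the workflow, chained explicitly
manual = Pipeline(steps=[
    ("smote", SMOTE(random_state=42)),               # SMOTE phase
    ("tomek", TomekLinks(sampling_strategy="all")),  # cleaning phase
])
X_res, y_res = manual.fit_resample(X, y)
print(Counter(y_res))
```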