
# Datasets

Functions for creating imbalanced datasets and fetching benchmark datasets for testing and evaluating imbalanced learning algorithms.

## Overview

Imbalanced-learn provides utilities for working with imbalanced datasets, including functions to create artificially imbalanced datasets from balanced ones and to fetch real-world benchmark datasets specifically curated for imbalanced learning research.

### Key Features

- **Dataset creation**: Transform balanced datasets into imbalanced ones with controlled class distributions
- **Benchmark datasets**: Access to 27 curated real-world imbalanced datasets
- **Flexible sampling strategies**: Support for various imbalance ratios and class targeting
- **Research reproducibility**: Consistent datasets for comparing imbalanced learning methods
- **Easy integration**: Compatible with scikit-learn data formats and workflows
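
As a quick orientation, here is a minimal sketch that uses both entry points together in an ordinary scikit-learn workflow; the classifier choice and the per-class counts are illustrative:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from imblearn.datasets import fetch_datasets, make_imbalance

# Turn a balanced toy dataset into an imbalanced one
X, y = load_iris(return_X_y=True)
X_imb, y_imb = make_imbalance(X, y, sampling_strategy={0: 25, 1: 50, 2: 50}, random_state=0)
print(Counter(y_imb))  # e.g. Counter({1: 50, 2: 50, 0: 25})

# Fetch one benchmark dataset and fit any scikit-learn estimator on it
ecoli = fetch_datasets(filter_data=("ecoli",))["ecoli"]
clf = LogisticRegression(max_iter=1000).fit(ecoli.data, ecoli.target)
```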

## Dataset Creation

### make_imbalance

#### make_imbalance

```python { .api }
def make_imbalance(
    X,
    y,
    *,
    sampling_strategy=None,
    random_state=None,
    verbose=False,
    **kwargs
) -> tuple[ndarray, ndarray]
```

Turn a dataset into an imbalanced dataset with a specific sampling strategy.

**Parameters:**

- **X** (`{array-like, dataframe}` of shape `(n_samples, n_features)`): Matrix containing the data to be imbalanced
- **y** (`array-like` of shape `(n_samples,)`): Corresponding label for each sample in `X`
- **sampling_strategy** (`dict` or `callable`, default=`None`): Ratio to use for resampling the dataset
  - When `dict`: the keys correspond to the targeted classes, and the values to the desired number of samples for each targeted class
  - When `callable`: a function taking `y` and returning a `dict` with the same semantics as above
- **random_state** (`int`, `RandomState` instance or `None`, default=`None`): Controls the randomization. If `int`, it is the seed used by the random number generator; if a `RandomState` instance, it is the random number generator; if `None`, the generator is the `RandomState` instance used by `np.random`
- **verbose** (`bool`, default=`False`): Show information regarding the sampling
- **kwargs** (`dict`): Additional keyword arguments passed to `sampling_strategy` when it is a callable

**Returns:**

- **X_resampled** (`{ndarray, dataframe}` of shape `(n_samples_new, n_features)`): The array containing the imbalanced data
- **y_resampled** (`ndarray` of shape `(n_samples_new,)`): The corresponding labels of `X_resampled`

**Algorithm:**

The function uses `RandomUnderSampler` internally to reduce the number of samples in the specified classes, creating an imbalanced distribution from a balanced dataset.
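
Since `make_imbalance` delegates to `RandomUnderSampler`, a `dict` strategy behaves roughly like the direct call sketched below (illustrative, not the exact internal code path):

```python
from sklearn.datasets import load_iris
from imblearn.under_sampling import RandomUnderSampler

X, y = load_iris(return_X_y=True)

# Approximately what make_imbalance does for a dict strategy:
# keep only the requested number of samples per class
sampler = RandomUnderSampler(sampling_strategy={0: 10, 1: 20, 2: 30}, random_state=42)
X_res, y_res = sampler.fit_resample(X, y)
```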

**Basic Usage:**

```python
from collections import Counter
from sklearn.datasets import load_iris
from imblearn.datasets import make_imbalance

# Load a balanced dataset
data = load_iris()
X, y = data.data, data.target
print(f'Distribution before imbalancing: {Counter(y)}')
# Distribution before imbalancing: Counter({0: 50, 1: 50, 2: 50})

# Create an imbalanced dataset
X_res, y_res = make_imbalance(
    X, y,
    sampling_strategy={0: 10, 1: 20, 2: 30},
    random_state=42
)
print(f'Distribution after imbalancing: {Counter(y_res)}')
# Distribution after imbalancing: Counter({2: 30, 1: 20, 0: 10})
```

**Using Callable Strategies:**

```python
def progressive_imbalance(y):
    """Create progressively more imbalanced classes."""
    from collections import Counter

    counter = Counter(y)
    classes = sorted(counter.keys())

    # Use the smallest original class size as the base so that every
    # target count is feasible (make_imbalance only under-samples)
    base_size = min(counter.values())

    # Create exponentially decreasing class sizes
    target_sizes = {}
    for i, cls in enumerate(classes):
        target_sizes[cls] = base_size // (2 ** i)

    return target_sizes

# Apply the progressive imbalance
X_prog, y_prog = make_imbalance(
    X, y,
    sampling_strategy=progressive_imbalance,
    random_state=42,
    verbose=True
)
```
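
Because the extra keyword arguments of `make_imbalance` are forwarded to a callable `sampling_strategy`, strategies can be parameterized. A small sketch, reusing `X, y` from the basic example; `base_size` and `decay` are hypothetical parameters of the user-defined callable:

```python
from collections import Counter

def decaying_strategy(y, base_size=50, decay=2):
    """Target counts shrink geometrically across the sorted classes."""
    counts = Counter(y)
    # Clip to the original class size so under-sampling stays feasible
    return {cls: min(counts[cls], base_size // (decay ** i))
            for i, cls in enumerate(sorted(counts))}

# base_size and decay are passed through **kwargs to the callable
X_k, y_k = make_imbalance(X, y, sampling_strategy=decaying_strategy,
                          base_size=40, decay=2, random_state=0)
```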

**Multi-class Imbalance Patterns:**

```python
from sklearn.datasets import make_classification

# Create a multi-class dataset (~200 samples per class)
X, y = make_classification(
    n_classes=5,
    n_samples=1000,
    n_features=10,
    n_informative=8,
    n_redundant=1,
    n_clusters_per_class=1,
    weights=[0.2, 0.2, 0.2, 0.2, 0.2],  # Initially balanced
    random_state=42
)

print(f"Original distribution: {Counter(y)}")

# Create different imbalance patterns; target counts must not exceed
# the original class sizes, since make_imbalance only under-samples
strategies = {
    'mild_imbalance': {0: 150, 1: 120, 2: 100, 3: 80, 4: 50},
    'severe_imbalance': {0: 180, 1: 50, 2: 25, 3: 15, 4: 10},
    'binary_like': {0: 180, 1: 180, 2: 10, 3: 10, 4: 10}
}

for name, strategy in strategies.items():
    X_imb, y_imb = make_imbalance(X, y, sampling_strategy=strategy, random_state=42)
    print(f"{name}: {Counter(y_imb)}")
```

## Benchmark Datasets

### fetch_datasets

#### fetch_datasets

```python { .api }
def fetch_datasets(
    *,
    data_home=None,
    filter_data=None,
    download_if_missing=True,
    random_state=None,
    shuffle=False,
    verbose=False
) -> OrderedDict
```

Load the benchmark datasets from Zenodo, downloading them if necessary.

**Parameters:**

- **data_home** (`str`, default=`None`): Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in `~/scikit_learn_data` subfolders
- **filter_data** (`tuple` of `str`/`int`, default=`None`): A tuple containing the IDs or names of the datasets to return. Refer to the dataset table below for the ID and name of each dataset
- **download_if_missing** (`bool`, default=`True`): If `False`, raise an `IOError` if the data is not locally available instead of trying to download it from the source site
- **random_state** (`int`, `RandomState` instance or `None`, default=`None`): Random state for shuffling the dataset. If `int`, it is the seed used by the random number generator; if a `RandomState` instance, it is the random number generator; if `None`, the generator is the `RandomState` instance used by `np.random`
- **shuffle** (`bool`, default=`False`): Whether to shuffle the dataset
- **verbose** (`bool`, default=`False`): Show information regarding the fetching

**Returns:**

- **datasets** (`OrderedDict` of `Bunch` objects): The ordering is defined by `filter_data`. Each `Bunch` object (referred to as a dataset) has the following attributes:
  - **dataset.data** (`ndarray` of shape `(n_samples, n_features)`): The input data
  - **dataset.target** (`ndarray` of shape `(n_samples,)`): The target values
  - **dataset.DESCR** (`str`): Description of the dataset
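
Every benchmark dataset is a binarized two-class problem: the majority class is labeled `-1` and the minority class `1`. A quick check:

```python
from collections import Counter
from imblearn.datasets import fetch_datasets

ecoli = fetch_datasets(filter_data=("ecoli",))["ecoli"]
print(Counter(ecoli.target))  # Counter({-1: 301, 1: 35})
```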

## Available Benchmark Datasets

The collection contains 27 real-world imbalanced datasets from various domains:

| ID | Name | Repository & Target | Ratio | #Samples | #Features |
|----|------|---------------------|-------|----------|-----------|
| 1 | ecoli | UCI, target: imU | 8.6:1 | 336 | 7 |
| 2 | optical_digits | UCI, target: 8 | 9.1:1 | 5,620 | 64 |
| 3 | satimage | UCI, target: 4 | 9.3:1 | 6,435 | 36 |
| 4 | pen_digits | UCI, target: 5 | 9.4:1 | 10,992 | 16 |
| 5 | abalone | UCI, target: 7 | 9.7:1 | 4,177 | 10 |
| 6 | sick_euthyroid | UCI, target: sick euthyroid | 9.8:1 | 3,163 | 42 |
| 7 | spectrometer | UCI, target: >=44 | 11:1 | 531 | 93 |
| 8 | car_eval_34 | UCI, target: good, v good | 12:1 | 1,728 | 21 |
| 9 | isolet | UCI, target: A, B | 12:1 | 7,797 | 617 |
| 10 | us_crime | UCI, target: >0.65 | 12:1 | 1,994 | 100 |
| 11 | yeast_ml8 | LIBSVM, target: 8 | 13:1 | 2,417 | 103 |
| 12 | scene | LIBSVM, target: >one label | 13:1 | 2,407 | 294 |
| 13 | libras_move | UCI, target: 1 | 14:1 | 360 | 90 |
| 14 | thyroid_sick | UCI, target: sick | 15:1 | 3,772 | 52 |
| 15 | coil_2000 | KDD, CoIL, target: minority | 16:1 | 9,822 | 85 |
| 16 | arrhythmia | UCI, target: 06 | 17:1 | 452 | 278 |
| 17 | solar_flare_m0 | UCI, target: M->0 | 19:1 | 1,389 | 32 |
| 18 | oil | UCI, target: minority | 22:1 | 937 | 49 |
| 19 | car_eval_4 | UCI, target: vgood | 26:1 | 1,728 | 21 |
| 20 | wine_quality | UCI, wine, target: <=4 | 26:1 | 4,898 | 11 |
| 21 | letter_img | UCI, target: Z | 26:1 | 20,000 | 16 |
| 22 | yeast_me2 | UCI, target: ME2 | 28:1 | 1,484 | 8 |
| 23 | webpage | LIBSVM, w7a, target: minority | 33:1 | 34,780 | 300 |
| 24 | ozone_level | UCI, ozone, data | 34:1 | 2,536 | 72 |
| 25 | mammography | UCI, target: minority | 42:1 | 11,183 | 6 |
| 26 | protein_homo | KDD CUP 2004, minority | 111:1 | 145,751 | 74 |
| 27 | abalone_19 | UCI, target: 19 | 130:1 | 4,177 | 10 |

**Dataset Categories:**

### Small Datasets (< 1,000 samples)

Suitable for quick experimentation and algorithm development:

```python
# Fetch small datasets for rapid prototyping
small_datasets = fetch_datasets(filter_data=('ecoli', 'libras_move', 'arrhythmia'))

for name, dataset in small_datasets.items():
    n_samples, n_features = dataset.data.shape
    print(f"{name}: {n_samples} samples, {n_features} features")
```

### Medium Datasets (1,000 - 10,000 samples)

Good balance of complexity and computational efficiency:

```python
# Medium-sized datasets for thorough evaluation
medium_datasets = fetch_datasets(
    filter_data=('satimage', 'abalone', 'sick_euthyroid', 'coil_2000')
)
```

### Large Datasets (> 10,000 samples)

For scalability testing and real-world performance evaluation:

```python
# Large datasets for scalability testing
large_datasets = fetch_datasets(
    filter_data=('pen_digits', 'isolet', 'letter_img', 'webpage', 'protein_homo')
)
```
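
Downloads are cached, so repeated fetches do not hit the network. A sketch of pinning the cache location with `data_home` and forcing cache-only access with `download_if_missing` (the `./imblearn_cache` path is illustrative):

```python
# First run downloads into an explicit cache directory
fetch_datasets(data_home="./imblearn_cache", filter_data=("protein_homo",))

# Later runs can insist on the local copy; raises an IOError if absent
cached = fetch_datasets(
    data_home="./imblearn_cache",
    filter_data=("protein_homo",),
    download_if_missing=False,
)
```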

**Usage Examples:**

##### Fetch All Datasets

```python
from imblearn.datasets import fetch_datasets
from collections import Counter

# Download all benchmark datasets
all_datasets = fetch_datasets(verbose=True)

# Analyze dataset characteristics
for name, dataset in all_datasets.items():
    counter = Counter(dataset.target)
    n_samples, n_features = dataset.data.shape
    ratio = max(counter.values()) / min(counter.values())

    print(f"{name}:")
    print(f"  Samples: {n_samples}, Features: {n_features}")
    print(f"  Classes: {len(counter)}, Ratio: {ratio:.1f}:1")
    print(f"  Distribution: {dict(counter)}")
    print()
```

##### Fetch Specific Datasets

```python
# Fetch datasets by name
datasets_by_name = fetch_datasets(
    filter_data=('ecoli', 'mammography', 'abalone_19'),
    shuffle=True,
    random_state=42
)

# Fetch datasets by ID
datasets_by_id = fetch_datasets(
    filter_data=(1, 25, 27),  # Same as above
    shuffle=True,
    random_state=42
)

# Access individual datasets
ecoli = datasets_by_name['ecoli']
X, y = ecoli.data, ecoli.target
print(f"Ecoli dataset: {X.shape}, classes: {Counter(y)}")
```

##### Cross-Dataset Evaluation

```python
from collections import Counter

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Evaluate an algorithm across multiple datasets
def evaluate_on_datasets(dataset_names):
    """Evaluate sampling + classification across datasets."""
    datasets = fetch_datasets(filter_data=dataset_names)

    # Create the pipeline once and reuse it for every dataset
    pipeline = Pipeline([
        ('sampling', SMOTE(random_state=42)),
        ('classifier', RandomForestClassifier(random_state=42))
    ])

    results = {}
    for name, dataset in datasets.items():
        scores = cross_val_score(
            pipeline, dataset.data, dataset.target,
            cv=5, scoring='f1_macro'
        )
        results[name] = {
            'mean_score': scores.mean(),
            'std_score': scores.std(),
            'dataset_info': {
                'n_samples': dataset.data.shape[0],
                'n_features': dataset.data.shape[1],
                'n_classes': len(Counter(dataset.target))
            }
        }

    return results

# Run the evaluation
results = evaluate_on_datasets((
    'ecoli', 'optical_digits', 'satimage', 'abalone', 'mammography'
))

for name, result in results.items():
    info = result['dataset_info']
    print(f"{name}:")
    print(f"  F1-macro: {result['mean_score']:.3f} ± {result['std_score']:.3f}")
    print(f"  Dataset: {info['n_samples']} samples, {info['n_features']} features, {info['n_classes']} classes")
```

## Research and Benchmarking

### Systematic Evaluation

```python
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

from imblearn.datasets import fetch_datasets
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler, EditedNearestNeighbours
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.pipeline import Pipeline

def comprehensive_benchmark():
    """Systematic evaluation across datasets and methods."""

    # Select representative datasets across different characteristics
    dataset_selection = {
        'small_mild': 'ecoli',            # Small, mild imbalance
        'medium_moderate': 'abalone',     # Medium, moderate imbalance
        'large_mild': 'pen_digits',       # Large, mild imbalance
        'small_severe': 'libras_move',    # Small, severe imbalance
        'medium_severe': 'car_eval_4',    # Medium, severe imbalance
        'large_extreme': 'mammography'    # Large, extreme imbalance
    }

    # Define sampling methods
    samplers = {
        'baseline': None,
        'smote': SMOTE(random_state=42),
        'adasyn': ADASYN(random_state=42),
        'borderline': BorderlineSMOTE(random_state=42),
        'under_random': RandomUnderSampler(random_state=42),
        'under_enn': EditedNearestNeighbours(),
        'smoteenn': SMOTEENN(random_state=42),
        'smotetomek': SMOTETomek(random_state=42)
    }

    # Fetch datasets
    datasets = fetch_datasets(filter_data=tuple(dataset_selection.values()))

    results = []

    for category, dataset_name in dataset_selection.items():
        dataset = datasets[dataset_name]
        X, y = dataset.data, dataset.target

        print(f"Evaluating on {dataset_name} ({category})...")

        for sampler_name, sampler in samplers.items():
            if sampler is None:
                # Baseline without sampling
                pipeline = RandomForestClassifier(random_state=42)
            else:
                # Pipeline with sampling
                pipeline = Pipeline([
                    ('sampling', sampler),
                    ('classifier', RandomForestClassifier(random_state=42))
                ])

            # Cross-validation
            cv_results = cross_validate(
                pipeline, X, y,
                cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
                scoring=['accuracy', 'f1_macro', 'precision_macro', 'recall_macro'],
                return_train_score=False
            )

            # Store results
            results.append({
                'dataset': dataset_name,
                'category': category,
                'sampler': sampler_name,
                'accuracy': cv_results['test_accuracy'].mean(),
                'f1_macro': cv_results['test_f1_macro'].mean(),
                'precision_macro': cv_results['test_precision_macro'].mean(),
                'recall_macro': cv_results['test_recall_macro'].mean(),
                'accuracy_std': cv_results['test_accuracy'].std(),
                'f1_std': cv_results['test_f1_macro'].std()
            })

    # Convert to a DataFrame for analysis
    results_df = pd.DataFrame(results)
    return results_df

# Run the benchmark
benchmark_results = comprehensive_benchmark()

# Analyze the results
print("\nBest F1-macro scores by dataset:")
best_by_dataset = benchmark_results.loc[benchmark_results.groupby('dataset')['f1_macro'].idxmax()]
print(best_by_dataset[['dataset', 'sampler', 'f1_macro', 'f1_std']])

print("\nAverage performance by sampler:")
avg_by_sampler = benchmark_results.groupby('sampler')[['accuracy', 'f1_macro']].mean()
print(avg_by_sampler.round(3))
```

### Custom Dataset Creation for Research

```python
from collections import Counter

from sklearn.datasets import make_classification

from imblearn.datasets import make_imbalance

def create_research_dataset_suite():
    """Create controlled imbalanced datasets for research."""

    # Define dataset configurations
    configs = {
        'binary_mild': {
            'n_classes': 2, 'weights': [0.7, 0.3], 'n_samples': 1000,
            'n_features': 20, 'n_informative': 15, 'n_redundant': 2
        },
        'binary_severe': {
            'n_classes': 2, 'weights': [0.9, 0.1], 'n_samples': 1000,
            'n_features': 20, 'n_informative': 15, 'n_redundant': 2
        },
        'multiclass_progressive': {
            'n_classes': 5, 'weights': [0.4, 0.25, 0.2, 0.1, 0.05], 'n_samples': 2000,
            'n_features': 30, 'n_informative': 20, 'n_redundant': 5
        },
        'high_dimensional': {
            'n_classes': 3, 'weights': [0.6, 0.3, 0.1], 'n_samples': 1500,
            'n_features': 100, 'n_informative': 50, 'n_redundant': 20
        }
    }

    research_datasets = {}

    for name, config in configs.items():
        # Generate the base dataset
        X, y = make_classification(random_state=42, **config)

        # Further imbalance using make_imbalance if needed
        if name == 'multiclass_progressive':
            # Create an even more extreme imbalance
            imbalance_strategy = {0: 600, 1: 300, 2: 150, 3: 75, 4: 25}
            X, y = make_imbalance(X, y, sampling_strategy=imbalance_strategy, random_state=42)

        research_datasets[name] = {'data': X, 'target': y}

        # Print dataset characteristics
        counter = Counter(y)
        ratio = max(counter.values()) / min(counter.values())
        print(f"{name}:")
        print(f"  Shape: {X.shape}")
        print(f"  Classes: {dict(counter)}")
        print(f"  Imbalance ratio: {ratio:.1f}:1")
        print()

    return research_datasets

# Create the research datasets
research_data = create_research_dataset_suite()
```

## Best Practices

### Dataset Selection Guidelines

1. **Start with diverse datasets**: Use datasets with different sizes, feature counts, and imbalance ratios (see the sketch after this list)
2. **Consider domain relevance**: Choose datasets similar to your application domain
3. **Validate on multiple datasets**: Don't rely on results from a single dataset
4. **Report comprehensive metrics**: Use multiple evaluation metrics beyond accuracy
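
A small sketch of screening candidates programmatically by computing each dataset's imbalance ratio; the severity thresholds are arbitrary illustrative cut-offs:

```python
from collections import Counter
from imblearn.datasets import fetch_datasets

candidates = fetch_datasets(filter_data=('ecoli', 'satimage', 'wine_quality', 'mammography'))

for name, dataset in candidates.items():
    counts = Counter(dataset.target)
    ratio = max(counts.values()) / min(counts.values())
    # Bucket by severity using illustrative thresholds
    severity = 'mild' if ratio < 15 else 'moderate' if ratio < 40 else 'extreme'
    print(f"{name}: {dataset.data.shape[0]} samples, ratio {ratio:.1f}:1 ({severity})")
```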

### Reproducible Research

```python
# Ensure reproducible results
def reproducible_evaluation(dataset_names, random_state=42):
    """Reproducible benchmark evaluation."""

    # Set the random state for dataset fetching
    datasets = fetch_datasets(
        filter_data=dataset_names,
        shuffle=True,
        random_state=random_state
    )

    # Use a consistent random state across all components
    for name, dataset in datasets.items():
        print(f"Dataset: {name}")
        print(f"  Original shape: {dataset.data.shape}")

        # Create a reproducible imbalanced version; benchmark targets are
        # labeled -1 (majority) and 1 (minority), and the requested counts
        # must not exceed the original class sizes
        X_imb, y_imb = make_imbalance(
            dataset.data, dataset.target,
            sampling_strategy={-1: 100, 1: 30},  # Example strategy
            random_state=random_state,
            verbose=True
        )

        print(f"  Imbalanced shape: {X_imb.shape}")
        print(f"  Class distribution: {Counter(y_imb)}")
        print()

# Run the reproducible evaluation
reproducible_evaluation(('ecoli', 'abalone'), random_state=42)
```


The datasets module provides essential tools for both creating controlled imbalanced datasets and accessing real-world benchmark datasets, enabling comprehensive evaluation and research in imbalanced learning.