# Datasets

Functions for creating imbalanced datasets and fetching benchmark datasets for testing and evaluating imbalanced learning algorithms.

## Overview

Imbalanced-learn provides utilities for working with imbalanced datasets, including functions to create artificially imbalanced datasets from balanced ones and to fetch real-world benchmark datasets curated for imbalanced learning research.

### Key Features

- **Dataset creation**: Transform balanced datasets into imbalanced ones with controlled class distributions
- **Benchmark datasets**: Access to 27 curated real-world imbalanced datasets
- **Flexible sampling strategies**: Support for various imbalance ratios and class targeting
- **Research reproducibility**: Consistent datasets for comparing imbalanced learning methods
- **Easy integration**: Compatible with scikit-learn data formats and workflows (see the quick-start sketch below)

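As a quick-start illustration of that scikit-learn compatibility, here is a minimal sketch; the class counts and the choice of classifier are arbitrary, not a recommendation:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from imblearn.datasets import make_imbalance

# Turn the balanced iris data into an imbalanced problem
X, y = load_iris(return_X_y=True)
X_imb, y_imb = make_imbalance(
    X, y, sampling_strategy={0: 10, 1: 20, 2: 50}, random_state=0
)
print(Counter(y_imb))  # e.g. Counter({2: 50, 1: 20, 0: 10})

# The result plugs into ordinary scikit-learn workflows
X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, stratify=y_imb, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```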
## Dataset Creation

### make_imbalance

```python { .api }
def make_imbalance(
    X,
    y,
    *,
    sampling_strategy=None,
    random_state=None,
    verbose=False,
    **kwargs
) -> tuple[ndarray, ndarray]
```
Turn a dataset into an imbalanced dataset with a specific sampling strategy.

**Parameters:**

- **X** (`{array-like, dataframe}` of shape `(n_samples, n_features)`): Matrix containing the data to be imbalanced
- **y** (`array-like` of shape `(n_samples,)`): Corresponding label for each sample in `X`
- **sampling_strategy** (`dict` or `callable`, default=`None`): Ratio to use for resampling the dataset
  - When `dict`: the keys correspond to the targeted classes and the values to the desired number of samples for each targeted class
  - When `callable`: a function taking `y` and returning a `dict` whose keys correspond to the targeted classes and whose values are the desired number of samples for each class
- **random_state** (`int`, `RandomState` instance or `None`, default=`None`): If int, `random_state` is the seed used by the random number generator; if `RandomState` instance, it is the random number generator; if `None`, the random number generator is the `RandomState` instance used by `np.random`
- **verbose** (`bool`, default=`False`): Show information regarding the sampling
- **kwargs** (`dict`): Additional keyword arguments passed to `sampling_strategy` when it is a callable (demonstrated in the callable example below)

**Returns:**

- **X_resampled** (`{ndarray, dataframe}` of shape `(n_samples_new, n_features)`): The array containing the imbalanced data
- **y_resampled** (`ndarray` of shape `(n_samples_new,)`): The corresponding labels of `X_resampled`

**Algorithm:**

The function uses `RandomUnderSampler` internally to reduce the number of samples in the specified classes, creating imbalanced distributions from balanced datasets.

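Because `make_imbalance` delegates the sample removal to `RandomUnderSampler`, the same class counts can be obtained by calling the under-sampler directly. A minimal sketch of that relationship (the class counts match; whether exactly the same rows are kept is an implementation detail, so treat it as illustrative):

```python
from collections import Counter

from sklearn.datasets import load_iris
from imblearn.datasets import make_imbalance
from imblearn.under_sampling import RandomUnderSampler

X, y = load_iris(return_X_y=True)
strategy = {0: 10, 1: 20, 2: 30}

# make_imbalance: convenience wrapper for creating an imbalanced dataset
X_a, y_a = make_imbalance(X, y, sampling_strategy=strategy, random_state=0)

# RandomUnderSampler: the sampler used internally
X_b, y_b = RandomUnderSampler(
    sampling_strategy=strategy, random_state=0
).fit_resample(X, y)

print(Counter(y_a))  # Counter({2: 30, 1: 20, 0: 10})
print(Counter(y_b))  # same class counts
```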
**Basic Usage:**

```python
from collections import Counter
from sklearn.datasets import load_iris
from imblearn.datasets import make_imbalance

# Load balanced dataset
data = load_iris()
X, y = data.data, data.target
print(f'Distribution before imbalancing: {Counter(y)}')
# Distribution before imbalancing: Counter({0: 50, 1: 50, 2: 50})

# Create imbalanced dataset
X_res, y_res = make_imbalance(
    X, y,
    sampling_strategy={0: 10, 1: 20, 2: 30},
    random_state=42
)
print(f'Distribution after imbalancing: {Counter(y_res)}')
# Distribution after imbalancing: Counter({2: 30, 1: 20, 0: 10})
```
**Using Callable Strategies:**

```python
from collections import Counter

def progressive_imbalance(y, base_size=50):
    """Create progressively more imbalanced classes."""
    counter = Counter(y)
    classes = sorted(counter.keys())

    # Create exponentially decreasing class sizes; base_size must not
    # exceed the smallest original class count (50 per class for iris)
    target_sizes = {}
    for i, cls in enumerate(classes):
        target_sizes[cls] = base_size // (2 ** i)

    return target_sizes

# Apply progressive imbalance; extra keyword arguments such as
# base_size are forwarded to the callable via **kwargs
X_prog, y_prog = make_imbalance(
    X, y,
    sampling_strategy=progressive_imbalance,
    base_size=50,
    random_state=42,
    verbose=True
)
```
**Multi-class Imbalance Patterns:**

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.datasets import make_imbalance

# Create multi-class dataset
X, y = make_classification(
    n_classes=5,
    n_samples=1000,
    n_features=10,
    n_informative=8,
    n_redundant=1,
    n_clusters_per_class=1,
    weights=[0.2, 0.2, 0.2, 0.2, 0.2],  # Initially balanced
    flip_y=0,  # No label noise, so each class has exactly 200 samples
    random_state=42
)

print(f"Original distribution: {Counter(y)}")

# Create different imbalance patterns (targets must not exceed the
# 200 samples available per class)
strategies = {
    'mild_imbalance': {0: 150, 1: 120, 2: 100, 3: 80, 4: 50},
    'severe_imbalance': {0: 200, 1: 50, 2: 25, 3: 15, 4: 10},
    'binary_like': {0: 200, 1: 200, 2: 10, 3: 10, 4: 10}
}

for name, strategy in strategies.items():
    X_imb, y_imb = make_imbalance(X, y, sampling_strategy=strategy, random_state=42)
    print(f"{name}: {Counter(y_imb)}")
```
## Benchmark Datasets

### fetch_datasets

```python { .api }
def fetch_datasets(
    *,
    data_home=None,
    filter_data=None,
    download_if_missing=True,
    random_state=None,
    shuffle=False,
    verbose=False
) -> OrderedDict
```
Load the benchmark datasets from Zenodo, downloading them if necessary.

**Parameters:**

- **data_home** (`str`, default=`None`): Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in `~/scikit_learn_data` subfolders
- **filter_data** (`tuple` of `str`/`int`, default=`None`): A tuple containing the IDs or the names of the datasets to be returned. Refer to the dataset table below for the ID and name of each dataset
- **download_if_missing** (`bool`, default=`True`): If False, raise an IOError if the data is not locally available instead of trying to download it from the source site
- **random_state** (`int`, `RandomState` instance or `None`, default=`None`): Random state for shuffling the dataset. If int, `random_state` is the seed used by the random number generator; if `RandomState` instance, it is the random number generator; if `None`, the random number generator is the `RandomState` instance used by `np.random`
- **shuffle** (`bool`, default=`False`): Whether to shuffle the datasets
- **verbose** (`bool`, default=`False`): Show information regarding the fetching

**Returns:**

- **datasets** (`OrderedDict` of `Bunch` objects): The ordering is defined by `filter_data`. Each Bunch object (referred to as a dataset) has the following attributes:
  - **dataset.data** (`ndarray` of shape `(n_samples, n_features)`): The input data
  - **dataset.target** (`ndarray` of shape `(n_samples,)`): The target values
  - **dataset.DESCR** (`str`): Description of the dataset

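A small sketch of these attributes and of the caching parameters; the first call downloads the data from Zenodo, and the custom cache directory used below is purely illustrative:

```python
from imblearn.datasets import fetch_datasets

# First call downloads and caches the data; later calls reuse the cache.
# data_home is optional; "/tmp/imblearn_data" is just an example location.
datasets = fetch_datasets(
    filter_data=("ecoli",),
    data_home="/tmp/imblearn_data",
    download_if_missing=True,
    verbose=True,
)

ecoli = datasets["ecoli"]
print(ecoli.data.shape)    # (n_samples, n_features)
print(ecoli.target.shape)  # (n_samples,)
print(ecoli.DESCR)         # dataset description
```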
## Available Benchmark Datasets

The collection contains 27 real-world imbalanced datasets from various domains:

| ID | Name | Repository & Target | Ratio | #Samples | #Features |
|----|------|---------------------|-------|----------|-----------|
| 1 | ecoli | UCI, target: imU | 8.6:1 | 336 | 7 |
| 2 | optical_digits | UCI, target: 8 | 9.1:1 | 5,620 | 64 |
| 3 | satimage | UCI, target: 4 | 9.3:1 | 6,435 | 36 |
| 4 | pen_digits | UCI, target: 5 | 9.4:1 | 10,992 | 16 |
| 5 | abalone | UCI, target: 7 | 9.7:1 | 4,177 | 10 |
| 6 | sick_euthyroid | UCI, target: sick euthyroid | 9.8:1 | 3,163 | 42 |
| 7 | spectrometer | UCI, target: >=44 | 11:1 | 531 | 93 |
| 8 | car_eval_34 | UCI, target: good, v good | 12:1 | 1,728 | 21 |
| 9 | isolet | UCI, target: A, B | 12:1 | 7,797 | 617 |
| 10 | us_crime | UCI, target: >0.65 | 12:1 | 1,994 | 100 |
| 11 | yeast_ml8 | LIBSVM, target: 8 | 13:1 | 2,417 | 103 |
| 12 | scene | LIBSVM, target: >one label | 13:1 | 2,407 | 294 |
| 13 | libras_move | UCI, target: 1 | 14:1 | 360 | 90 |
| 14 | thyroid_sick | UCI, target: sick | 15:1 | 3,772 | 52 |
| 15 | coil_2000 | KDD, CoIL, target: minority | 16:1 | 9,822 | 85 |
| 16 | arrhythmia | UCI, target: 06 | 17:1 | 452 | 278 |
| 17 | solar_flare_m0 | UCI, target: M->0 | 19:1 | 1,389 | 32 |
| 18 | oil | UCI, target: minority | 22:1 | 937 | 49 |
| 19 | car_eval_4 | UCI, target: vgood | 26:1 | 1,728 | 21 |
| 20 | wine_quality | UCI, wine, target: <=4 | 26:1 | 4,898 | 11 |
| 21 | letter_img | UCI, target: Z | 26:1 | 20,000 | 16 |
| 22 | yeast_me2 | UCI, target: ME2 | 28:1 | 1,484 | 8 |
| 23 | webpage | LIBSVM, w7a, target: minority | 33:1 | 34,780 | 300 |
| 24 | ozone_level | UCI, ozone, data | 34:1 | 2,536 | 72 |
| 25 | mammography | UCI, target: minority | 42:1 | 11,183 | 6 |
| 26 | protein_homo | KDD CUP 2004, minority | 111:1 | 145,751 | 74 |
| 27 | abalone_19 | UCI, target: 19 | 130:1 | 4,177 | 10 |

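If you want to verify the Ratio, #Samples, and #Features columns for a given entry, they can be recomputed from the fetched data. A small sketch (the first call downloads and caches the dataset):

```python
from collections import Counter
from imblearn.datasets import fetch_datasets

# Recompute the table columns for one entry
ecoli = fetch_datasets(filter_data=("ecoli",))["ecoli"]
counts = Counter(ecoli.target)                        # benchmark targets are binarized as -1 / 1
n_samples, n_features = ecoli.data.shape              # -> #Samples, #Features
ratio = max(counts.values()) / min(counts.values())   # -> Ratio (majority:minority)
print(f"ecoli: {n_samples} samples, {n_features} features, {ratio:.1f}:1")
```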
**Dataset Categories:**

### Small Datasets (< 1,000 samples)

Suitable for quick experimentation and algorithm development:

```python
from imblearn.datasets import fetch_datasets

# Fetch small datasets for rapid prototyping
small_datasets = fetch_datasets(filter_data=('ecoli', 'libras_move', 'arrhythmia'))

for name, dataset in small_datasets.items():
    n_samples, n_features = dataset.data.shape
    print(f"{name}: {n_samples} samples, {n_features} features")
```
### Medium Datasets (1,000 - 10,000 samples)

Good balance of complexity and computational efficiency:

```python
# Medium-sized datasets for thorough evaluation
medium_datasets = fetch_datasets(
    filter_data=('satimage', 'abalone', 'sick_euthyroid', 'coil_2000')
)
```
### Large Datasets (> 10,000 samples)

For scalability testing and real-world performance evaluation:

```python
# Large datasets (more than 10,000 samples) for scalability testing
large_datasets = fetch_datasets(
    filter_data=('pen_digits', 'mammography', 'letter_img', 'webpage', 'protein_homo')
)
```
**Usage Examples:**

##### Fetch All Datasets

```python
from imblearn.datasets import fetch_datasets
from collections import Counter

# Download all benchmark datasets
all_datasets = fetch_datasets(verbose=True)

# Analyze dataset characteristics
for name, dataset in all_datasets.items():
    counter = Counter(dataset.target)
    n_samples, n_features = dataset.data.shape
    ratio = max(counter.values()) / min(counter.values())

    print(f"{name}:")
    print(f"  Samples: {n_samples}, Features: {n_features}")
    print(f"  Classes: {len(counter)}, Ratio: {ratio:.1f}:1")
    print(f"  Distribution: {dict(counter)}")
    print()
```
##### Fetch Specific Datasets

```python
# Fetch datasets by name
datasets_by_name = fetch_datasets(
    filter_data=('ecoli', 'mammography', 'abalone_19'),
    shuffle=True,
    random_state=42
)

# Fetch datasets by ID
datasets_by_id = fetch_datasets(
    filter_data=(1, 25, 27),  # Same datasets as above
    shuffle=True,
    random_state=42
)

# Access individual datasets
ecoli = datasets_by_name['ecoli']
X, y = ecoli.data, ecoli.target
print(f"Ecoli dataset: {X.shape}, classes: {Counter(y)}")
```
##### Cross-Dataset Evaluation

```python
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

from imblearn.datasets import fetch_datasets
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Evaluate algorithm across multiple datasets
def evaluate_on_datasets(dataset_names):
    """Evaluate sampling + classification across datasets."""
    datasets = fetch_datasets(filter_data=tuple(dataset_names))

    # Create pipeline
    pipeline = Pipeline([
        ('sampling', SMOTE(random_state=42)),
        ('classifier', RandomForestClassifier(random_state=42))
    ])

    results = {}
    for name, dataset in datasets.items():
        scores = cross_val_score(
            pipeline, dataset.data, dataset.target,
            cv=5, scoring='f1_macro'
        )
        results[name] = {
            'mean_score': scores.mean(),
            'std_score': scores.std(),
            'dataset_info': {
                'n_samples': dataset.data.shape[0],
                'n_features': dataset.data.shape[1],
                'n_classes': len(Counter(dataset.target))
            }
        }

    return results

# Run evaluation
results = evaluate_on_datasets([
    'ecoli', 'optical_digits', 'satimage', 'abalone', 'mammography'
])

for name, result in results.items():
    info = result['dataset_info']
    print(f"{name}:")
    print(f"  F1-macro: {result['mean_score']:.3f} ± {result['std_score']:.3f}")
    print(f"  Dataset: {info['n_samples']} samples, {info['n_features']} features, {info['n_classes']} classes")
```
## Research and Benchmarking

### Systematic Evaluation

```python
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.datasets import fetch_datasets
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler, EditedNearestNeighbours

def comprehensive_benchmark():
    """Systematic evaluation across datasets and methods."""

    # Select representative datasets across different characteristics
    dataset_selection = {
        'small_mild': 'ecoli',            # Small, mild imbalance
        'medium_moderate': 'abalone',     # Medium, moderate imbalance
        'large_mild': 'pen_digits',       # Large, mild imbalance
        'small_severe': 'libras_move',    # Small, severe imbalance
        'medium_severe': 'car_eval_4',    # Medium, severe imbalance
        'large_extreme': 'mammography'    # Large, extreme imbalance
    }

    # Define sampling methods
    samplers = {
        'baseline': None,
        'smote': SMOTE(random_state=42),
        'adasyn': ADASYN(random_state=42),
        'borderline': BorderlineSMOTE(random_state=42),
        'under_random': RandomUnderSampler(random_state=42),
        'under_enn': EditedNearestNeighbours(),
        'smoteenn': SMOTEENN(random_state=42),
        'smotetomek': SMOTETomek(random_state=42)
    }

    # Fetch datasets
    datasets = fetch_datasets(filter_data=tuple(dataset_selection.values()))

    results = []

    for category, dataset_name in dataset_selection.items():
        dataset = datasets[dataset_name]
        X, y = dataset.data, dataset.target

        print(f"Evaluating on {dataset_name} ({category})...")

        for sampler_name, sampler in samplers.items():
            if sampler is None:
                # Baseline without sampling
                pipeline = RandomForestClassifier(random_state=42)
            else:
                # Pipeline with sampling
                pipeline = Pipeline([
                    ('sampling', sampler),
                    ('classifier', RandomForestClassifier(random_state=42))
                ])

            # Cross-validation
            cv_results = cross_validate(
                pipeline, X, y,
                cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
                scoring=['accuracy', 'f1_macro', 'precision_macro', 'recall_macro'],
                return_train_score=False
            )

            # Store results
            results.append({
                'dataset': dataset_name,
                'category': category,
                'sampler': sampler_name,
                'accuracy': cv_results['test_accuracy'].mean(),
                'f1_macro': cv_results['test_f1_macro'].mean(),
                'precision_macro': cv_results['test_precision_macro'].mean(),
                'recall_macro': cv_results['test_recall_macro'].mean(),
                'accuracy_std': cv_results['test_accuracy'].std(),
                'f1_std': cv_results['test_f1_macro'].std()
            })

    # Convert to DataFrame for analysis
    results_df = pd.DataFrame(results)
    return results_df

# Run benchmark
benchmark_results = comprehensive_benchmark()

# Analyze results
print("\nBest F1-macro scores by dataset:")
best_by_dataset = benchmark_results.loc[benchmark_results.groupby('dataset')['f1_macro'].idxmax()]
print(best_by_dataset[['dataset', 'sampler', 'f1_macro', 'f1_std']])

print("\nAverage performance by sampler:")
avg_by_sampler = benchmark_results.groupby('sampler')[['accuracy', 'f1_macro']].mean()
print(avg_by_sampler.round(3))
```
### Custom Dataset Creation for Research

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.datasets import make_imbalance

def create_research_dataset_suite():
    """Create controlled imbalanced datasets for research."""

    # Define dataset configurations
    configs = {
        'binary_mild': {
            'n_classes': 2, 'weights': [0.7, 0.3], 'n_samples': 1000,
            'n_features': 20, 'n_informative': 15, 'n_redundant': 2
        },
        'binary_severe': {
            'n_classes': 2, 'weights': [0.9, 0.1], 'n_samples': 1000,
            'n_features': 20, 'n_informative': 15, 'n_redundant': 2
        },
        'multiclass_progressive': {
            'n_classes': 5, 'weights': [0.4, 0.25, 0.2, 0.1, 0.05], 'n_samples': 2000,
            'n_features': 30, 'n_informative': 20, 'n_redundant': 5
        },
        'high_dimensional': {
            'n_classes': 3, 'weights': [0.6, 0.3, 0.1], 'n_samples': 1500,
            'n_features': 100, 'n_informative': 50, 'n_redundant': 20
        }
    }

    research_datasets = {}

    for name, config in configs.items():
        # Generate base dataset
        X, y = make_classification(random_state=42, **config)

        # Further imbalance using make_imbalance if needed
        if name == 'multiclass_progressive':
            # Create even more extreme imbalance
            imbalance_strategy = {0: 600, 1: 300, 2: 150, 3: 75, 4: 25}
            X, y = make_imbalance(X, y, sampling_strategy=imbalance_strategy, random_state=42)

        research_datasets[name] = {'data': X, 'target': y}

        # Print dataset characteristics
        counter = Counter(y)
        ratio = max(counter.values()) / min(counter.values())
        print(f"{name}:")
        print(f"  Shape: {X.shape}")
        print(f"  Classes: {dict(counter)}")
        print(f"  Imbalance ratio: {ratio:.1f}:1")
        print()

    return research_datasets

# Create research datasets
research_data = create_research_dataset_suite()
```
## Best Practices

### Dataset Selection Guidelines

1. **Start with diverse datasets**: Use datasets with different sizes, feature counts, and imbalance ratios
2. **Consider domain relevance**: Choose datasets similar to your application domain
3. **Validate on multiple datasets**: Don't rely on results from a single dataset
4. **Report comprehensive metrics**: Use multiple evaluation metrics beyond accuracy (see the sketch below)

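To act on the last guideline, here is a minimal sketch that reports several imbalance-aware metrics on one benchmark dataset; the classifier and the particular metrics chosen are illustrative, not prescriptive:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split

from imblearn.datasets import fetch_datasets
from imblearn.metrics import classification_report_imbalanced, geometric_mean_score

# One benchmark dataset; targets are binarized as -1 (majority) / 1 (minority)
ecoli = fetch_datasets(filter_data=("ecoli",))["ecoli"]
X_train, X_test, y_train, y_test = train_test_split(
    ecoli.data, ecoli.target, stratify=ecoli.target, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Several complementary metrics instead of plain accuracy
print("Balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("F1 (minority class):", f1_score(y_test, y_pred, pos_label=1))
print("Geometric mean:", geometric_mean_score(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```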
### Reproducible Research

```python
from collections import Counter

from imblearn.datasets import fetch_datasets, make_imbalance

# Ensure reproducible results
def reproducible_evaluation(dataset_names, random_state=42):
    """Reproducible benchmark evaluation."""

    # Set random state for dataset fetching
    datasets = fetch_datasets(
        filter_data=tuple(dataset_names),
        shuffle=True,
        random_state=random_state
    )

    # Use consistent random state across all components
    for name, dataset in datasets.items():
        print(f"Dataset: {name}")
        print(f"  Original shape: {dataset.data.shape}")

        # Create a reproducible, further imbalanced version
        # (benchmark targets are -1/1; requested counts must not exceed
        # the original class counts)
        X_imb, y_imb = make_imbalance(
            dataset.data, dataset.target,
            sampling_strategy={-1: 100, 1: 20},  # Example strategy
            random_state=random_state,
            verbose=True
        )

        print(f"  Imbalanced shape: {X_imb.shape}")
        print(f"  Class distribution: {Counter(y_imb)}")
        print()

# Run reproducible evaluation
reproducible_evaluation(['ecoli', 'abalone'], random_state=42)
```
The datasets module provides essential tools for both creating controlled imbalanced datasets and accessing real-world benchmark datasets, enabling comprehensive evaluation and research in imbalanced learning.