# Model Selection

Cross-validation and model selection tools adapted for imbalanced datasets, providing specialized splitting strategies that consider instance hardness and class distribution to ensure more reliable model evaluation.

## Overview

Imbalanced-learn extends scikit-learn's model selection capabilities with specialized cross-validation strategies that account for class imbalance. These tools help ensure fair evaluation of models on imbalanced datasets by considering instance difficulty and maintaining appropriate class distributions across folds.

### Key Features

- **Instance hardness awareness**: Cross-validation that considers sample difficulty
- **Balanced fold distribution**: Ensures minority class representation across all folds
- **Compatible with scikit-learn**: Seamless integration with existing model selection workflows
- **Binary classification focus**: Specialized for binary imbalanced problems

## Cross-Validation Strategies

### InstanceHardnessCV

```python { .api }
class InstanceHardnessCV:
    def __init__(
        self,
        estimator,
        *,
        n_splits=5,
        pos_label=None
    ): ...
    def split(self, X, y, groups=None): ...
    def get_n_splits(self, X=None, y=None, groups=None): ...
```

Instance-hardness cross-validation splitter that distributes samples with large instance hardness equally over the folds.

**Parameters:**
- **estimator** (`estimator object`): Classifier used to estimate the instance hardness of the samples. It must implement `predict_proba`.
- **n_splits** (`int`, default=`5`): Number of folds. Must be at least 2.
- **pos_label** (`int`, `float`, `bool` or `str`, default=`None`): The class treated as the positive class when selecting the probability that represents instance hardness. If `None`, the positive class is taken to be `estimator.classes_[1]`.
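
To make the `pos_label` default concrete, the sketch below shows how scikit-learn orders `classes_` and how the probability column for a chosen positive class can be looked up. It uses only scikit-learn and NumPy; the string labels and the helper name `positive_class_proba` are illustrative assumptions and are not part of imbalanced-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def positive_class_proba(fitted_clf, X, pos_label=None):
    """Return the predict_proba column for pos_label (hypothetical helper)."""
    classes = fitted_clf.classes_
    if pos_label is None:
        # Mirror the documented default: the second entry of classes_.
        pos_label = classes[1]
    column = int(np.flatnonzero(classes == pos_label)[0])
    return fitted_clf.predict_proba(X)[:, column]


X, y_int = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
# String labels show that classes_ is sorted: "majority" comes before "minority".
y = np.where(y_int == 1, "minority", "majority")

clf = LogisticRegression(max_iter=1000).fit(X, y)
default_proba = positive_class_proba(clf, X)  # column of classes_[1]
majority_proba = positive_class_proba(clf, X, pos_label="majority")
```

Because `classes_` is sorted, the default positive class depends on label ordering rather than on which class is rarer, so passing `pos_label` explicitly can be the safer choice when labels are strings.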

**Methods:**

##### split

```python
def split(self, X, y, groups=None) -> Generator[tuple[ndarray, ndarray], None, None]: ...
```

Generate indices to split data into training and test sets.

**Parameters:**
- **X** (`array-like` of shape `(n_samples, n_features)`): Training data, where `n_samples` is the number of samples and `n_features` is the number of features
- **y** (`array-like` of shape `(n_samples,)`): The target variable for supervised learning problems
- **groups** (`object`): Always ignored, exists for compatibility

**Yields:**
- **train** (`ndarray`): The training set indices for that split
- **test** (`ndarray`): The testing set indices for that split
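
The index arrays yielded by `split` can be sanity-checked with a small helper. The sketch below assumes the usual K-fold-style contract (train and test indices of a split do not overlap, and every sample appears in exactly one test fold) and uses scikit-learn's `StratifiedKFold` as a stand-in splitter, since it exposes the same `split(X, y)` interface. The helper name `check_partition` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def check_partition(cv, X, y):
    """Check that test folds are disjoint from train and cover each sample once."""
    n_samples = len(y)
    seen = np.zeros(n_samples, dtype=int)
    for train_idx, test_idx in cv.split(X, y):
        # Train and test indices of one split must not overlap.
        assert np.intersect1d(train_idx, test_idx).size == 0
        seen[test_idx] += 1
    # Every sample should land in exactly one test fold.
    return bool(np.all(seen == 1))


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)

# StratifiedKFold stands in for any splitter exposing split(X, y).
is_partition = check_partition(StratifiedKFold(n_splits=5), X, y)
```

The same helper should accept an `InstanceHardnessCV` instance in place of `StratifiedKFold`, provided the installed imbalanced-learn version ships that class.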

##### get_n_splits

```python
def get_n_splits(self, X=None, y=None, groups=None) -> int: ...
```

Returns the number of splitting iterations in the cross-validator.

**Parameters:**
- **X** (`object`): Always ignored, exists for compatibility
- **y** (`object`): Always ignored, exists for compatibility
- **groups** (`object`): Always ignored, exists for compatibility

**Returns:**
- **n_splits** (`int`): The number of splitting iterations in the cross-validator

**Instance Hardness Concept:**
Instance hardness is estimated internally using the provided `estimator` and stratified cross-validation. Samples with higher instance hardness (those that are harder to classify correctly) are spread evenly across folds so that each fold contains a representative mix of easy and difficult samples.
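
One way to picture this estimate is with out-of-fold probabilities from scikit-learn's `cross_val_predict`. The sketch below is an illustration under an assumed definition (hardness as one minus the predicted probability of the true class); it is not taken from the imbalanced-learn source, and the dataset parameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(
    n_samples=500, weights=[0.9, 0.1], flip_y=0.05, random_state=0
)

# Out-of-fold probabilities: each sample is scored by a model that never saw it.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=StratifiedKFold(n_splits=5),
    method="predict_proba",
)

# Assumed definition: hardness = 1 - predicted probability of the true class.
# Columns of proba follow the sorted labels [0, 1], so y indexes them directly.
hardness = 1.0 - proba[np.arange(len(y)), y]

# Indices of the five hardest samples under this definition.
hardest = np.argsort(hardness)[-5:]
```

Under this definition, mislabeled or borderline samples tend to receive values close to 1, while samples deep inside their own class receive values close to 0.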

**Algorithm:**
1. Estimates instance hardness from cross-validated `predict_proba` output
2. Sorts samples first by class label, then by instance hardness
3. Distributes the sorted samples across folds to balance both class distribution and hardness levels
4. Ensures each fold has similar difficulty characteristics
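
The sorting and distribution steps above can be sketched with NumPy alone. This is a minimal illustration that assumes a round-robin assignment over the sorted samples; the function name `assign_folds` and the random placeholder hardness scores are hypothetical, and the library's actual implementation may differ.

```python
import numpy as np


def assign_folds(y, hardness, n_splits=5):
    """Assign each sample a fold id, balancing class and hardness (sketch)."""
    y = np.asarray(y)
    hardness = np.asarray(hardness)
    # lexsort treats the LAST key as primary: sort by class, then by hardness.
    order = np.lexsort((hardness, y))
    folds = np.empty(len(y), dtype=int)
    # Deal the sorted samples round-robin, like cards, so that neighbours in
    # hardness end up in different folds.
    folds[order] = np.arange(len(y)) % n_splits
    return folds


rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)
hardness = rng.random(100)  # placeholder scores; real ones come from predict_proba

folds = assign_folds(y, hardness, n_splits=5)
minority_per_fold = np.bincount(folds[y == 1], minlength=5)
mean_hardness_per_fold = np.array([hardness[folds == k].mean() for k in range(5)])
```

Because samples of the same class with adjacent hardness values are dealt to different folds, both the per-fold class counts and the per-fold mean hardness stay close to each other.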

**Example:**
```python
from imblearn.model_selection import InstanceHardnessCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

# Create imbalanced dataset
X, y = make_classification(
    weights=[0.9, 0.1],
    class_sep=2,
    n_informative=3,
    n_redundant=1,
    flip_y=0.05,
    n_samples=1000,
    random_state=10
)

# Create instance hardness CV
estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(estimator)

# Use in cross-validation
cv_result = cross_validate(estimator, X, y, cv=ih_cv)
print(f"Standard deviation of test_scores: {cv_result['test_score'].std():.3f}")

# Manual splitting
for train_idx, test_idx in ih_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate model
```

## Integration with scikit-learn

### Compatible Workflows

**Cross-validation Functions:**
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV
from imblearn.model_selection import InstanceHardnessCV

# Illustrative data, estimator, and grid so the snippet runs on its own
X, y = make_classification(weights=[0.9, 0.1], n_samples=1000, random_state=0)
estimator = LogisticRegression()
param_grid = {"C": [0.1, 1.0, 10.0]}

# Use with cross_val_score
scores = cross_val_score(estimator, X, y, cv=InstanceHardnessCV(estimator))

# Use with cross_validate
cv_results = cross_validate(estimator, X, y, cv=InstanceHardnessCV(estimator))

# Use with GridSearchCV
grid_search = GridSearchCV(
    estimator,
    param_grid,
    cv=InstanceHardnessCV(estimator)
)
grid_search.fit(X, y)
```

**Pipeline Integration:**
```python
from imblearn.model_selection import InstanceHardnessCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X, y: an imbalanced binary dataset, e.g. built with make_classification
# as in the earlier example

# Create pipeline with sampling
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Use instance hardness CV for evaluation
ih_cv = InstanceHardnessCV(LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=ih_cv)
```

## Comparison with Standard CV

### Advantages over Standard Cross-Validation

**Standard StratifiedKFold:**
- Only considers class distribution
- May create folds with varying difficulty levels
- Can yield optimistic or pessimistic estimates on individual folds when hard samples cluster in a few of them

**InstanceHardnessCV:**
- Considers both class distribution and sample difficulty
- Creates folds with balanced hardness levels
- Provides more reliable performance estimates on imbalanced data

**When to Use:**
- **Binary classification problems** with class imbalance
- When sample difficulty varies significantly within classes
- For more reliable model selection on imbalanced datasets
- When you need more consistent scores across folds

**Limitations:**
- Currently supports only **binary classification**
- Requires additional computation for hardness estimation
- The base estimator must implement `predict_proba`
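
The binary-target and `predict_proba` requirements can be checked up front. The sketch below is a hypothetical pre-flight helper built only from scikit-learn utilities (`type_of_target`, `CalibratedClassifierCV`); the name `check_preconditions` is an assumption for illustration and is not an imbalanced-learn function.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
from sklearn.utils.multiclass import type_of_target


def check_preconditions(estimator, y):
    """Return a list of problems; empty if both requirements hold (hypothetical)."""
    problems = []
    if not hasattr(estimator, "predict_proba"):
        problems.append("estimator lacks predict_proba")
    if type_of_target(y) != "binary":
        problems.append("target is not binary")
    return problems


y_binary = np.array([0] * 90 + [1] * 10)
y_multiclass = np.array([0] * 80 + [1] * 10 + [2] * 10)

# LinearSVC exposes decision_function only, so it fails the first check.
svc_problems = check_preconditions(LinearSVC(), y_binary)

# Wrapping it in CalibratedClassifierCV adds predict_proba.
calibrated = CalibratedClassifierCV(LinearSVC())
calibrated_problems = check_preconditions(calibrated, y_binary)

# A three-class target fails the second check.
multiclass_problems = check_preconditions(calibrated, y_multiclass)
```

Wrapping a margin-based classifier in `CalibratedClassifierCV` is one way to obtain `predict_proba`, at the cost of an extra internal cross-validation during fitting.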

## Best Practices

1. **Choose an appropriate base estimator**: Use a fast, reasonable classifier for hardness estimation
2. **Consider computational cost**: Instance hardness estimation adds overhead
3. **Validate assumptions**: Ensure your problem benefits from hardness-aware splitting
4. **Combine with sampling**: Use alongside imblearn sampling techniques for a more comprehensive approach

**Complete Example:**
```python
from imblearn.model_selection import InstanceHardnessCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2,
    weights=[0.8, 0.2],
    n_samples=1000,
    random_state=42
)

# Create pipeline
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Use instance hardness CV
base_estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(base_estimator, n_splits=5)

# Evaluate model
cv_results = cross_validate(
    pipeline, X, y,
    cv=ih_cv,
    scoring=['accuracy', 'f1', 'roc_auc'],
    return_train_score=True
)

# Report mean and standard deviation of the per-fold test accuracy
print(f"Test accuracy: {cv_results['test_accuracy'].mean():.3f} ± {cv_results['test_accuracy'].std():.3f}")
```