Tessl Tile for pypi/imbalanced-learn@0.14.0

# Model Selection

Cross-validation and model selection tools adapted for imbalanced datasets. The splitting strategies take instance hardness and class distribution into account so that model evaluation is more reliable.

## Overview

Imbalanced-learn extends scikit-learn's model selection capabilities with cross-validation strategies that account for class imbalance. These tools support fairer evaluation of models on imbalanced datasets by considering instance difficulty and keeping class distributions comparable across folds.

### Key Features
- **Instance hardness awareness**: Cross-validation that considers how difficult each sample is to classify
- **Balanced fold distribution**: Keeps the minority class represented in every fold
- **Compatible with scikit-learn**: Works as the `cv` argument in existing model selection workflows
- **Binary classification focus**: Designed for binary imbalanced problems

## Cross-Validation Strategies

### InstanceHardnessCV

```python
{ .api }
class InstanceHardnessCV:
    def __init__(
        self,
        estimator,
        *,
        n_splits=5,
        pos_label=None
    ): ...
    def split(self, X, y, groups=None): ...
    def get_n_splits(self, X=None, y=None, groups=None): ...
```

Instance-hardness cross-validation splitter that distributes samples with large instance hardness equally over the folds.

**Parameters:**
- **estimator** (`estimator object`): Classifier used to estimate the instance hardness of the samples. This classifier should implement `predict_proba`
- **n_splits** (`int`, default=`5`): Number of folds. Must be at least 2
- **pos_label** (`int`, `float`, `bool` or `str`, default=`None`): The class considered the positive class when selecting the probability representing the instance hardness. If `None`, the positive class is automatically inferred by the estimator as `estimator.classes_[1]`

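Because the `estimator` should implement `predict_proba`, a quick check before building the splitter can save a failed run. The sketch below uses scikit-learn only. The helper name `ensure_predict_proba` is hypothetical, and wrapping in `CalibratedClassifierCV` is one possible workaround, not something imbalanced-learn prescribes.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC


def ensure_predict_proba(estimator):
    """Hypothetical helper: return an estimator that exposes predict_proba."""
    if hasattr(estimator, "predict_proba"):
        return estimator
    # LinearSVC, for example, has no predict_proba; calibration adds one.
    return CalibratedClassifierCV(estimator)


proba_ready = ensure_predict_proba(LogisticRegression())  # returned unchanged
wrapped = ensure_predict_proba(LinearSVC())               # wrapped for probabilities
print(type(proba_ready).__name__, type(wrapped).__name__)
```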
**Methods:**

##### split

```python
def split(self, X, y, groups=None) -> Generator[tuple[ndarray, ndarray], None, None]
```

Generate indices to split data into training and test sets.

**Parameters:**
- **X** (`array-like` of shape `(n_samples, n_features)`): Training data, where `n_samples` is the number of samples and `n_features` is the number of features
- **y** (`array-like` of shape `(n_samples,)`): The target variable for supervised learning problems
- **groups** (`object`): Always ignored, exists for compatibility

**Yields:**
- **train** (`ndarray`): The training set indices for that split
- **test** (`ndarray`): The testing set indices for that split

##### get_n_splits

```python
def get_n_splits(self, X=None, y=None, groups=None) -> int
```

Returns the number of splitting iterations in the cross-validator.

**Parameters:**
- **X** (`object`): Always ignored, exists for compatibility
- **y** (`object`): Always ignored, exists for compatibility
- **groups** (`object`): Always ignored, exists for compatibility

**Returns:**
- **n_splits** (`int`): The number of splitting iterations in the cross-validator

**Instance Hardness Concept:**
The instance hardness is estimated internally using the provided `estimator` and stratified cross-validation. Samples with higher instance hardness (those that are harder to classify correctly) are spread evenly across folds so that each fold contains a representative mix of easy and difficult samples.

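One way to picture this estimate, sketched with scikit-learn only: collect out-of-fold probabilities with `cross_val_predict` and take the probability of the positive class as the per-sample score, in line with the `pos_label` description. This is an illustration under those assumptions, not the library's internal code, and the variable names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(weights=[0.9, 0.1], n_samples=500, random_state=0)

# Out-of-fold probabilities: each sample is scored by a model that never saw it.
# With an integer cv and a classifier, scikit-learn uses stratified folds.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# Assumption: column 1 corresponds to classes_[1], treated as the positive class.
hardness_score = proba[:, 1]

print(hardness_score.shape)
print("mean score, majority class:", round(float(hardness_score[y == 0].mean()), 3))
print("mean score, minority class:", round(float(hardness_score[y == 1].mean()), 3))
```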
**Algorithm:**
1. Estimates instance hardness from the cross-validated `predict_proba` output of the `estimator`
2. Sorts samples first by class label, then by instance hardness
3. Distributes the sorted samples across folds to balance both class distribution and hardness levels
4. As a result, each fold has similar difficulty characteristics

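The sorting and distribution steps can be sketched with NumPy alone. The function below is a hypothetical illustration: the name `assign_folds` and the round-robin rule are assumptions, not taken from imbalanced-learn's source. After sorting by class and then by a hardness score, dealing the samples out in turn gives every fold a similar share of each class and of each hardness level.

```python
import numpy as np


def assign_folds(y, hardness, n_splits=5):
    """Hypothetical sketch: return a fold index for every sample."""
    y = np.asarray(y)
    hardness = np.asarray(hardness)
    # np.lexsort uses the LAST key as the primary key:
    # sort by y first, then break ties by hardness.
    order = np.lexsort((hardness, y))
    folds = np.empty(len(y), dtype=int)
    # Deal the sorted samples out round-robin: 0, 1, ..., n_splits - 1, 0, 1, ...
    folds[order] = np.arange(len(y)) % n_splits
    return folds


rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)   # 90 majority and 10 minority samples
hardness = rng.random(100)          # stand-in hardness scores for the demo
folds = assign_folds(y, hardness, n_splits=5)

for k in range(5):
    in_fold = folds == k
    # Fold size, number of minority samples, and mean hardness per fold
    print(k, int(in_fold.sum()), int(y[in_fold].sum()),
          round(float(hardness[in_fold].mean()), 3))
```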
**Example:**
```python
from imblearn.model_selection import InstanceHardnessCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

# Create imbalanced dataset
X, y = make_classification(
    weights=[0.9, 0.1],
    class_sep=2,
    n_informative=3,
    n_redundant=1,
    flip_y=0.05,
    n_samples=1000,
    random_state=10
)

# Create instance hardness CV
estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(estimator)

# Use in cross-validation
cv_result = cross_validate(estimator, X, y, cv=ih_cv)
print(f"Standard deviation of test_scores: {cv_result['test_score'].std():.3f}")

# Manual splitting
for train_idx, test_idx in ih_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate model
```

## Integration with scikit-learn

### Compatible Workflows

**Cross-validation Functions:**
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV
from imblearn.model_selection import InstanceHardnessCV

# Illustrative data, estimator, and grid so the snippet runs on its own
X, y = make_classification(weights=[0.9, 0.1], n_samples=1000, random_state=10)
estimator = LogisticRegression()
param_grid = {"C": [0.1, 1.0, 10.0]}

# Use with cross_val_score
scores = cross_val_score(estimator, X, y, cv=InstanceHardnessCV(estimator))

# Use with cross_validate
cv_results = cross_validate(estimator, X, y, cv=InstanceHardnessCV(estimator))

# Use with GridSearchCV
grid_search = GridSearchCV(
    estimator,
    param_grid,
    cv=InstanceHardnessCV(estimator)
)
```

**Pipeline Integration:**
```python
from imblearn.model_selection import InstanceHardnessCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X, y: an imbalanced binary dataset, such as the one built in the example above

# Create pipeline with sampling
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Use instance hardness CV for evaluation
ih_cv = InstanceHardnessCV(LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=ih_cv)
```

## Comparison with Standard CV

### Advantages over Standard Cross-Validation

**Standard StratifiedKFold:**
- Only considers class distribution
- May create folds with varying difficulty levels
- Individual folds can therefore give overly optimistic or pessimistic performance estimates

**InstanceHardnessCV:**
- Considers both class distribution and sample difficulty
- Creates folds with balanced hardness levels
- Provides more reliable performance estimates on imbalanced data

**When to Use:**
- **Binary classification problems** with class imbalance
- When sample difficulty varies significantly within classes
- For more reliable model selection on imbalanced datasets
- When you need scores that are consistent from fold to fold

**Limitations:**
- Currently supports only **binary classification**
- Requires additional computation for hardness estimation
- The base estimator must implement `predict_proba`

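Since only binary targets are supported, it can help to check the target type up front. The guard below is a hypothetical helper built on scikit-learn's `type_of_target`; the function name and error message are illustrative and not part of imbalanced-learn.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target


def check_binary_target(y):
    """Hypothetical guard: raise early if y is not a binary target."""
    target_type = type_of_target(y)
    if target_type != "binary":
        raise ValueError(f"Expected a binary target, got {target_type!r}.")
    return target_type


print(check_binary_target(np.array([0, 1, 1, 0, 0])))  # binary

try:
    check_binary_target(np.array([0, 1, 2, 1, 0]))  # three classes
except ValueError as exc:
    print(exc)
```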
## Best Practices

1. **Choose an appropriate base estimator**: Use a fast classifier that gives reasonable probability estimates for hardness estimation
2. **Consider computational cost**: Hardness estimation runs an additional internal cross-validation of the base estimator
3. **Validate assumptions**: Check that your problem benefits from hardness-aware splitting, for example by comparing fold-to-fold score variation against `StratifiedKFold`
4. **Combine with sampling**: Use alongside imblearn sampling techniques, such as `SMOTE` in a `Pipeline`, for a more complete approach

**Complete Example:**
```python
from imblearn.model_selection import InstanceHardnessCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2,
    weights=[0.8, 0.2],
    n_samples=1000,
    random_state=42
)

# Create pipeline
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Use instance hardness CV
base_estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(base_estimator, n_splits=5)

# Evaluate model
cv_results = cross_validate(
    pipeline, X, y,
    cv=ih_cv,
    scoring=['accuracy', 'f1', 'roc_auc'],
    return_train_score=True
)

# Report the mean and spread of the per-fold test accuracy
print(f"Test accuracy: {cv_results['test_accuracy'].mean():.3f} ± {cv_results['test_accuracy'].std():.3f}")
```
