# Model Selection

Cross-validation and model selection tools adapted for imbalanced datasets, providing specialized splitting strategies that consider instance hardness and class distribution to ensure more reliable model evaluation.

## Overview

Imbalanced-learn extends scikit-learn's model selection capabilities with specialized cross-validation strategies that account for class imbalance. These tools help ensure fair evaluation of models on imbalanced datasets by considering instance difficulty and maintaining appropriate class distributions across folds.

### Key Features

- **Instance hardness awareness**: Cross-validation that considers sample difficulty
- **Balanced fold distribution**: Ensures minority class representation across all folds
- **Compatible with scikit-learn**: Seamless integration with existing model selection workflows
- **Binary classification focus**: Specialized for binary imbalanced problems

## Cross-Validation Strategies

### InstanceHardnessCV

#### InstanceHardnessCV

```python { .api }
class InstanceHardnessCV:
    def __init__(
        self,
        estimator,
        *,
        n_splits=5,
        pos_label=None
    ): ...
    def split(self, X, y, groups=None): ...
    def get_n_splits(self, X=None, y=None, groups=None): ...
```

Instance-hardness cross-validation splitter that distributes samples with large instance hardness equally over the folds.

**Parameters:**
- **estimator** (`estimator object`): Classifier to be used to estimate instance hardness of the samples. This classifier should implement `predict_proba`
- **n_splits** (`int`, default=`5`): Number of folds. Must be at least 2
- **pos_label** (`int`, `float`, `bool` or `str`, default=`None`): The class considered the positive class when selecting the probability representing the instance hardness. If None, the positive class is automatically inferred by the estimator as `estimator.classes_[1]`

**Methods:**

##### split

```python
def split(self, X, y, groups=None) -> Generator[tuple[ndarray, ndarray], None, None]
```

Generate indices to split data into training and test set.

**Parameters:**
- **X** (`array-like` of shape `(n_samples, n_features)`): Training data, where `n_samples` is the number of samples and `n_features` is the number of features
- **y** (`array-like` of shape `(n_samples,)`): The target variable for supervised learning problems
- **groups** (`object`): Always ignored, exists for compatibility

**Yields:**
- **train** (`ndarray`): The training set indices for that split
- **test** (`ndarray`): The testing set indices for that split

##### get_n_splits

```python
def get_n_splits(self, X=None, y=None, groups=None) -> int
```

Returns the number of splitting iterations in the cross-validator.

**Parameters:**
- **X** (`object`): Always ignored, exists for compatibility
- **y** (`object`): Always ignored, exists for compatibility
- **groups** (`object`): Always ignored, exists for compatibility

**Returns:**
- **n_splits** (`int`): The number of splitting iterations in the cross-validator

**Instance Hardness Concept:**
The instance hardness is internally estimated using the provided `estimator` and stratified cross-validation. Samples with higher instance hardness (those that are harder to classify correctly) are distributed more evenly across folds to ensure each fold contains a representative mix of easy and difficult samples.
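
The estimation step described above can be approximated with plain scikit-learn. The following is a minimal sketch, not the library's implementation: it assumes hardness is defined as one minus the out-of-fold probability of a sample's true class, and the dataset settings are illustrative. The `pos_label` description suggests the library keys on the positive-class probability instead; within a single class, that ordering carries the same information.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Illustrative imbalanced dataset (roughly 90% / 10%).
X, y = make_classification(weights=[0.9, 0.1], n_samples=1000, random_state=10)

# Out-of-fold class probabilities: each sample is scored by a model that
# did not see it during training. With a classifier and an integer cv,
# scikit-learn uses stratified folds.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# Assumed hardness definition for this sketch: one minus the probability
# assigned to the sample's true class (labels here are 0 and 1, matching
# the column order of predict_proba).
hardness = 1.0 - proba[np.arange(len(y)), y]

print(f"mean hardness, majority class: {hardness[y == 0].mean():.3f}")
print(f"mean hardness, minority class: {hardness[y == 1].mean():.3f}")
```

Values near 0 mark samples the classifier gets right with high confidence; values near 1 mark samples it confidently misclassifies.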

**Algorithm:**
1. Uses cross-validation to estimate instance hardness via `predict_proba`
2. Sorts samples first by class label, then by instance hardness
3. Distributes samples across folds to balance both class distribution and hardness levels
4. Ensures each fold has similar difficulty characteristics
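
Steps 2 to 4 can be sketched with NumPy alone. This is an illustration under stated assumptions rather than the library's code: the hardness scores are random stand-ins for estimated values, and dealing the sorted samples out round-robin is one plausible way to do the distribution step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_splits = 5

# Toy data: 40 majority samples, 10 minority samples, and a made-up
# hardness score per sample standing in for the estimated one.
y = np.array([0] * 40 + [1] * 10)
hardness = rng.random(y.size)

# Step 2: sort by class label first, then by hardness within each class.
# np.lexsort treats the LAST key as the primary sort key.
order = np.lexsort((hardness, y))

# Step 3: deal the sorted samples out to the folds in round-robin fashion.
fold = np.empty(y.size, dtype=int)
fold[order] = np.arange(y.size) % n_splits

# Step 4: inspect class counts and mean hardness per fold.
for k in range(n_splits):
    in_fold = fold == k
    print(
        f"fold {k}: size={in_fold.sum()}, "
        f"minority={y[in_fold].sum()}, "
        f"mean hardness={hardness[in_fold].mean():.3f}"
    )
```

With these toy sizes every fold receives 8 majority and 2 minority samples, and because neighbours in the sorted order land in different folds, each fold gets a similar spread of easy and hard samples.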

**Example:**
```python
from imblearn.model_selection import InstanceHardnessCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

# Create imbalanced dataset
X, y = make_classification(
    weights=[0.9, 0.1],
    class_sep=2,
    n_informative=3,
    n_redundant=1,
    flip_y=0.05,
    n_samples=1000,
    random_state=10
)

# Create instance hardness CV
estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(estimator)

# Use in cross-validation
cv_result = cross_validate(estimator, X, y, cv=ih_cv)
print(f"Standard deviation of test_scores: {cv_result['test_score'].std():.3f}")

# Manual splitting
for train_idx, test_idx in ih_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate model
```

## Integration with scikit-learn

### Compatible Workflows

**Cross-validation Functions:**
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV
from imblearn.model_selection import InstanceHardnessCV

# Illustrative data and estimator so the snippet runs on its own
X, y = make_classification(weights=[0.9, 0.1], n_samples=1000, random_state=10)
estimator = LogisticRegression()

# Use with cross_val_score
scores = cross_val_score(estimator, X, y, cv=InstanceHardnessCV(estimator))

# Use with cross_validate
cv_results = cross_validate(estimator, X, y, cv=InstanceHardnessCV(estimator))

# Use with GridSearchCV (the grid values are illustrative)
param_grid = {"C": [0.1, 1.0, 10.0]}
grid_search = GridSearchCV(
    estimator,
    param_grid,
    cv=InstanceHardnessCV(estimator)
)
```

**Pipeline Integration:**
```python
from imblearn.model_selection import InstanceHardnessCV
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Create pipeline with sampling
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Use instance hardness CV for evaluation
# (X and y are the imbalanced dataset from the previous example)
ih_cv = InstanceHardnessCV(LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=ih_cv)
```

## Comparison with Standard CV

### Advantages over Standard Cross-Validation

**Standard StratifiedKFold:**
- Only considers class distribution
- May create folds with varying difficulty levels
- Can lead to optimistic or pessimistic performance estimates

**InstanceHardnessCV:**
- Considers both class distribution and sample difficulty
- Creates folds with balanced hardness levels
- Provides more reliable performance estimates on imbalanced data
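
One way to see the difference is to compare the spread of per-fold scores under both strategies. The sketch below uses scikit-learn only: the hardness-aware folds are built by hand (out-of-fold probabilities, then a sort and round-robin deal) and passed through `PredefinedSplit`, so this is an approximation of the idea rather than `InstanceHardnessCV` itself. Dataset settings and the choice of `LogisticRegression` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    PredefinedSplit,
    StratifiedKFold,
    cross_val_predict,
    cross_val_score,
)

n_splits = 5
X, y = make_classification(
    weights=[0.9, 0.1], flip_y=0.05, n_samples=1000, random_state=10
)
clf = LogisticRegression(max_iter=1000)

# Hand-built hardness-aware fold labels (illustrative approximation).
proba = cross_val_predict(clf, X, y, cv=n_splits, method="predict_proba")
hardness = 1.0 - proba[np.arange(len(y)), y]
order = np.lexsort((hardness, y))  # class first, then hardness
ih_fold = np.empty(len(y), dtype=int)
ih_fold[order] = np.arange(len(y)) % n_splits

# Score the same classifier under both splitting strategies.
skf_scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=n_splits))
ih_scores = cross_val_score(clf, X, y, cv=PredefinedSplit(ih_fold))

print(f"stratified:     mean={skf_scores.mean():.3f}, std={skf_scores.std():.3f}")
print(f"hardness-aware: mean={ih_scores.mean():.3f}, std={ih_scores.std():.3f}")
```

If the claims above hold for a given dataset, the hardness-aware standard deviation should tend to be the smaller of the two, but this is worth checking per problem rather than assuming.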

**When to Use:**
- **Binary classification problems** with class imbalance
- When sample difficulty varies significantly within classes
- For more reliable model selection on imbalanced datasets
- When you need consistent cross-validation performance

**Limitations:**
- Currently supports only **binary classification**
- Requires additional computation for hardness estimation
- The base estimator must implement `predict_proba`
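
Both the binary-only and the `predict_proba` constraints can be checked before building the splitter. The sketch below uses standard scikit-learn utilities; wrapping a margin-based classifier in `CalibratedClassifierCV` is one possible workaround suggested here, not something the source prescribes.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.utils.multiclass import type_of_target

X, y = make_classification(weights=[0.9, 0.1], n_samples=500, random_state=0)

# Binary-only: type_of_target reports "binary" for two-class label vectors
# and "multiclass" for three or more classes.
print(type_of_target(y))             # binary
print(type_of_target([0, 1, 2, 1]))  # multiclass

# predict_proba requirement: LinearSVC only exposes decision_function.
svm = LinearSVC()
print(hasattr(svm, "predict_proba"))  # False

# One option is to wrap it in a calibrator, which provides predict_proba.
calibrated = CalibratedClassifierCV(svm, cv=3)
print(hasattr(calibrated, "predict_proba"))  # True
```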

## Best Practices

1. **Choose appropriate base estimator**: Use a fast, reasonable classifier for hardness estimation
2. **Consider computational cost**: Instance hardness estimation adds overhead
3. **Validate assumptions**: Ensure your problem benefits from hardness-aware splitting
4. **Combine with sampling**: Use alongside imblearn sampling techniques for a comprehensive approach
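
When several models are compared on the same folds, one way to limit the overhead is to materialize the splits once and reuse them, since scikit-learn accepts an iterable of `(train, test)` index pairs as `cv`. The sketch below uses `StratifiedKFold` as a stand-in so that it runs with scikit-learn alone; substituting an `InstanceHardnessCV` instance is assumed to work the same way, and whether it saves time depends on hardness being recomputed on each `split` call, which the source does not state.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(weights=[0.9, 0.1], n_samples=500, random_state=0)

# Stand-in splitter; an InstanceHardnessCV instance could be used instead.
splitter = StratifiedKFold(n_splits=5)

# Materialize the (train, test) index pairs once...
splits = list(splitter.split(X, y))

# ...and reuse exactly the same folds for every candidate model.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=splits)
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```

Reusing the same folds also makes the per-model scores directly comparable fold by fold.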

**Complete Example:**
```python
from imblearn.model_selection import InstanceHardnessCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2,
    weights=[0.8, 0.2],
    n_samples=1000,
    random_state=42
)

# Create pipeline
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Use instance hardness CV
base_estimator = LogisticRegression()
ih_cv = InstanceHardnessCV(base_estimator, n_splits=5)

# Evaluate model
cv_results = cross_validate(
    pipeline, X, y,
    cv=ih_cv,
    scoring=['accuracy', 'f1', 'roc_auc'],
    return_train_score=True
)

# Report mean and standard deviation of test accuracy across folds
print(f"Test accuracy: {cv_results['test_accuracy'].mean():.3f} ± {cv_results['test_accuracy'].std():.3f}")
```