
# Pipeline Integration

Advanced pipeline functionality that extends scikit-learn's `Pipeline` class to integrate sampling algorithms into machine learning workflows. It applies resampling only while the model is being trained and stays compatible with cross-validation and model selection procedures.

## Overview

The imbalanced-learn pipeline system addresses key challenges when combining sampling methods with machine learning pipelines:

- **Resampling Integration**: Native support for `fit_resample()` methods in pipeline steps
- **Cross-validation Safety**: Prevents data leakage by applying sampling only during training phases
- **Memory Management**: Optional caching of expensive transformations and sampling operations
- **Parameter Routing**: Metadata routing for passing parameters to specific pipeline steps
- **Transform Input**: Ability to transform selected `fit()` arguments other than `X` through the pipeline stages

The pipeline components extend scikit-learn's pipeline functionality while maintaining full API compatibility.

## Pipeline Class

### Pipeline

Extended pipeline class that supports both transformers and samplers in a unified workflow.

```python { .api }
class Pipeline(pipeline.Pipeline):
    def __init__(
        self,
        steps,
        *,
        transform_input=None,
        memory=None,
        verbose=False,
    ):
        """
        Parameters
        ----------
        steps : list of (str, transformer/sampler) tuples
            List of (name, transform) tuples implementing fit/transform/fit_resample
            that are chained in order, with the last object an estimator.

        transform_input : list of str, default=None
            Names of metadata parameters that should be transformed by the pipeline
            before passing to the step consuming them. Enables transforming input
            arguments to fit() other than X. Only available with metadata routing enabled.

        memory : None, str or object with joblib.Memory interface, default=None
            Used to cache fitted transformers of the pipeline. If string, path to
            caching directory. Caching triggers cloning of transformers before fitting.

        verbose : bool, default=False
            If True, time elapsed while fitting each step will be printed.
        """

    def fit(self, X, y=None, **params):
        """
        Fit the model.

        Fits all transformers/samplers sequentially and transforms/samples the data,
        then fits the final estimator on the transformed/sampled data.

        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first pipeline step.

        y : iterable, default=None
            Training targets. Must fulfill label requirements for all pipeline steps.

        **params : dict of str -> object
            Parameters passed to fit method of each step. Parameter names prefixed
            with step name and '__' separator (e.g., 'step__parameter').
            With metadata routing, parameters are forwarded based on step requests.

        Returns
        -------
        self : Pipeline
            Fitted pipeline instance.
        """

    def fit_transform(self, X, y=None, **params):
        """
        Fit the model and transform with the final estimator.

        Fits all transformers/samplers sequentially, then uses fit_transform
        on transformed data with the final estimator.

        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first pipeline step.

        y : iterable, default=None
            Training targets. Must fulfill label requirements for all pipeline steps.

        **params : dict of str -> object
            Parameters for fit method of each step using 'step__parameter' format.

        Returns
        -------
        Xt : array-like of shape (n_samples, n_transformed_features)
            Transformed samples from final estimator.
        """

    def fit_resample(self, X, y=None, **params):
        """
        Fit the model and resample with the final estimator.

        Fits all transformers/samplers sequentially, then uses fit_resample
        on transformed data with the final estimator.

        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first pipeline step.

        y : iterable, default=None
            Training targets. Must fulfill label requirements for all pipeline steps.

        **params : dict of str -> object
            Parameters for fit method of each step using 'step__parameter' format.

        Returns
        -------
        Xt : array-like of shape (n_samples_new, n_transformed_features)
            Resampled and transformed samples.

        yt : array-like of shape (n_samples_new,)
            Resampled target labels.
        """

    def predict(self, X, **params):
        """
        Transform data and apply predict with final estimator.

        Parameters
        ----------
        X : iterable
            Data to predict on. Must fulfill input requirements of first step.

        **params : dict of str -> object
            Parameters for predict method of final estimator.

        Returns
        -------
        y_pred : ndarray
            Predictions from final estimator.
        """

    def predict_proba(self, X, **params):
        """
        Transform data and apply predict_proba with final estimator.

        Parameters
        ----------
        X : iterable
            Data to predict probabilities for.

        **params : dict of str -> object
            Parameters for predict_proba method of final estimator.

        Returns
        -------
        y_proba : ndarray of shape (n_samples, n_classes)
            Class probability predictions.
        """

    def transform(self, X, **params):
        """
        Transform data through all pipeline steps.

        Parameters
        ----------
        X : iterable
            Data to transform through pipeline steps.

        **params : dict of str -> object
            Parameters for transform methods of pipeline steps.

        Returns
        -------
        Xt : ndarray
            Transformed data.
        """

    def inverse_transform(self, Xt, **params):
        """
        Apply inverse_transform for each step in reverse order.

        Parameters
        ----------
        Xt : array-like
            Transformed data to inverse transform.

        **params : dict of str -> object
            Parameters for inverse_transform methods.

        Returns
        -------
        X : ndarray
            Data in original feature space.
        """
```

### Attributes

```python { .api }
# Pipeline attributes after fitting
pipeline.named_steps        # Bunch object for accessing steps by name
pipeline.classes_           # Class labels from final estimator
pipeline.n_features_in_     # Number of input features
pipeline.feature_names_in_  # Input feature names (if available)
```

## Helper Functions

### make_pipeline

Construct a Pipeline from estimators without explicit naming.

```python { .api }
def make_pipeline(
    *steps,
    memory=None,
    transform_input=None,
    verbose=False,
):
    """
    Construct Pipeline from given estimators.

    Shorthand for Pipeline constructor that automatically names estimators
    based on their class names in lowercase.

    Parameters
    ----------
    *steps : list of estimators
        Sequence of estimators to chain in pipeline.

    memory : None, str or object with joblib.Memory interface, default=None
        Used to cache fitted transformers. If string, path to caching directory.

    transform_input : list of str, default=None
        Names of metadata parameters to transform through pipeline steps.
        Only available with metadata routing enabled.

    verbose : bool, default=False
        If True, print time elapsed while fitting each step.

    Returns
    -------
    p : Pipeline
        Imbalanced-learn Pipeline instance that handles samplers.
    """
```

## Key Differences from sklearn.pipeline.Pipeline

The imbalanced-learn Pipeline class extends scikit-learn's Pipeline with several important enhancements:

### 1. Sampler Support
- **fit_resample() Integration**: Native support for samplers that implement the `fit_resample()` method
- **Resampling During Fit**: Samplers are applied only during fit stages, not during transform/predict
- **Mixed Steps**: Can combine transformers (fit/transform) and samplers (fit_resample) in the same pipeline

### 2. Enhanced Validation
- **Step Validation**: Ensures intermediate steps implement either transform or fit_resample, but not both
- **Pipeline Nesting**: Prevents nesting of Pipeline objects within steps to avoid complexity
- **Passthrough Support**: Supports 'passthrough' and None values for skipping steps

### 3. Fit/Transform Behavior Warning
The pipeline breaks scikit-learn's usual contract where `fit_transform(X, y)` equals `fit(X, y).transform(X)`:
- **fit_transform()**: Applies resampling, so the output contains the resampled rows
- **fit().transform()**: Skips samplers during the transform phase, so the output has one row per input row
- This keeps resampling confined to training, which cross-validation relies on, but it can be surprising

## Usage Examples

### Basic Pipeline Creation

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, and X_test are assumed to be defined already

# Create pipeline with preprocessing, sampling, and classification
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('sampling', SMOTE(random_state=42)),
    ('pca', PCA(n_components=10)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit pipeline - resampling happens during fit
pipeline.fit(X_train, y_train)

# Make predictions - no resampling during prediction
y_pred = pipeline.predict(X_test)
```

### Pipeline with Multiple Sampling Steps

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Combine over-sampling and under-sampling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('over_sampling', SMOTE(random_state=42)),
    ('under_sampling', EditedNearestNeighbours()),
    ('classifier', SVC(probability=True))
])

pipeline.fit(X_train, y_train)
probabilities = pipeline.predict_proba(X_test)
```

### Using make_pipeline

```python
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# Automatic step naming based on class names
pipeline = make_pipeline(
    MinMaxScaler(),
    ADASYN(random_state=42),
    LogisticRegression(random_state=42),
    verbose=True  # Print timing information
)

pipeline.fit(X_train, y_train)
print(f"Pipeline steps: {list(pipeline.named_steps.keys())}")
# Output: Pipeline steps: ['minmaxscaler', 'adasyn', 'logisticregression']
```

### Cross-validation with Pipeline

```python
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Create pipeline for cross-validation
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Cross-validation applies sampling within each fold
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(f"Cross-validation F1 scores: {scores}")
print(f"Mean F1 score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

### Memory Caching for Expensive Operations

```python
from joblib import Memory  # sklearn.externals.joblib was removed from scikit-learn
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Cache expensive transformations
cachedir = '/tmp/joblib_cache'
memory = Memory(cachedir, verbose=0)

pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('pca', PCA(n_components=50)),  # Expensive for large datasets
    ('classifier', RandomForestClassifier(random_state=42))
], memory=memory)

# First fit caches transformations
pipeline.fit(X_train, y_train)

# Subsequent fits with the same data and step parameters use the cache
pipeline.set_params(classifier__n_estimators=200)
pipeline.fit(X_train, y_train)  # Reuses cached SMOTE and PCA results
```

### Parameter Grid Search

```python
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

pipeline = Pipeline([
    ('sampling', SMOTE()),
    ('classifier', SVC())
])

# Define parameter grid with step prefixes
param_grid = {
    'sampling__k_neighbors': [3, 5, 7],
    'sampling__random_state': [42],
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['rbf', 'linear']
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
```

### Advanced: Transform Input Parameters

```python
from sklearn import set_config  # set_config lives in the top-level sklearn package
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Enable metadata routing (sklearn >= 1.4)
set_config(enable_metadata_routing=True)

# Pipeline that transforms validation set through preprocessing
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier())
], transform_input=['X_val'])

# Fit with validation set that gets transformed.
# Note: X_val and y_val only reach a step whose fit() accepts and requests
# them. RandomForestClassifier.fit() does not, so it stands in here for an
# estimator that takes a validation set.
pipeline.fit(X_train, y_train, X_val=X_val, y_val=y_val)
```

### Custom Step Access and Inspection

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

pipeline.fit(X_train, y_train)

# Access individual steps
scaler = pipeline.named_steps['scaler']
sampler = pipeline.named_steps['sampling']
classifier = pipeline.named_steps['classifier']

# Get feature importance from final estimator
feature_importance = pipeline.named_steps['classifier'].feature_importances_

# Get resampling information
print(f"Original samples: {len(y_train)}")
# Note: the resampled data is not stored on the pipeline; sampling only occurs inside fit

# Access pipeline properties
print(f"Number of pipeline steps: {len(pipeline.steps)}")
print(f"Step names: {list(pipeline.named_steps.keys())}")
print(f"Classes: {pipeline.classes_}")
```

## Best Practices

### 1. Cross-validation Safety
Always use the pipeline for cross-validation to prevent data leakage:
```python
# Assumes pipeline, smote, classifier, X, and y are already defined
from sklearn.model_selection import cross_val_score

# Correct: Sampling happens within each CV fold
scores = cross_val_score(pipeline, X, y, cv=5)

# Incorrect: Sampling applied to entire dataset first
X_resampled, y_resampled = smote.fit_resample(X, y)
scores = cross_val_score(classifier, X_resampled, y_resampled, cv=5)
```

### 2. Parameter Naming
Use double underscore notation for step-specific parameters:
```python
# Correct parameter naming
pipeline.set_params(
    sampling__k_neighbors=7,
    classifier__n_estimators=100
)

# Access parameters
params = pipeline.get_params()
print(params['sampling__random_state'])
```

### 3. Memory Management
Use caching for expensive operations in iterative workflows:
```python
# Cache expensive transformations
# ExpensiveTransformer is a placeholder for any costly transformer
pipeline = Pipeline([
    ('expensive_transform', ExpensiveTransformer()),
    ('sampling', SMOTE()),
    ('classifier', RandomForestClassifier())
], memory='/tmp/cache')
```

### 4. Debugging and Monitoring
Use verbose mode and step inspection for debugging:
```python
# Enable timing information
# steps is a list of (name, estimator) tuples defined elsewhere
pipeline = Pipeline(steps, verbose=True)

# Inspect individual steps after fitting
for name, step in pipeline.named_steps.items():
    print(f"Step {name}: {step}")
```