# Pipeline Integration

The imbalanced-learn `Pipeline` extends scikit-learn's `Pipeline` class so that sampling algorithms can be used as pipeline steps. Resampling is applied only while the model is being fit, which keeps the pipeline compatible with cross-validation and model selection procedures.

## Overview

The imbalanced-learn pipeline system addresses key challenges when combining sampling methods with machine learning pipelines:

- **Resampling Integration**: Native support for `fit_resample()` methods in pipeline steps
- **Cross-validation Safety**: Prevents data leakage by applying sampling only during training phases
- **Memory Management**: Optional caching of expensive transformations and sampling operations
- **Parameter Routing**: Metadata routing for passing parameters to specific pipeline steps
- **Transform Input**: Ability to transform extra `fit()` inputs (for example a validation set) with the preceding pipeline steps before they reach the step that consumes them

The pipeline components extend scikit-learn's pipeline functionality while maintaining API compatibility.

## Pipeline Class

### Pipeline

Extended pipeline class that supports both transformers and samplers in a unified workflow.
```python { .api }
class Pipeline(pipeline.Pipeline):
    def __init__(
        self,
        steps,
        *,
        transform_input=None,
        memory=None,
        verbose=False,
    ):
        """
        Parameters
        ----------
        steps : list of (str, transformer/sampler) tuples
            List of (name, transform) tuples implementing fit/transform/fit_resample
            that are chained in order, with the last object an estimator.

        transform_input : list of str, default=None
            Names of metadata parameters that should be transformed by the pipeline
            before passing to the step consuming them. Enables transforming input
            arguments to fit() other than X. Only available with metadata routing enabled.

        memory : None, str or object with joblib.Memory interface, default=None
            Used to cache fitted transformers of the pipeline. If string, path to
            caching directory. Caching triggers cloning of transformers before fitting.

        verbose : bool, default=False
            If True, time elapsed while fitting each step will be printed.
        """

    def fit(self, X, y=None, **params):
        """
        Fit the model.

        Fits all transformers/samplers sequentially and transforms/samples the data,
        then fits the final estimator on the transformed/sampled data.

        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first pipeline step.

        y : iterable, default=None
            Training targets. Must fulfill label requirements for all pipeline steps.

        **params : dict of str -> object
            Parameters passed to the fit method of each step. Parameter names are
            prefixed with the step name and a '__' separator (e.g., 'step__parameter').
            With metadata routing, parameters are forwarded based on step requests.

        Returns
        -------
        self : Pipeline
            Fitted pipeline instance.
        """

    def fit_transform(self, X, y=None, **params):
        """
        Fit the model and transform with the final estimator.

        Fits all transformers/samplers sequentially, then uses fit_transform
        on transformed data with the final estimator.

        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first pipeline step.

        y : iterable, default=None
            Training targets. Must fulfill label requirements for all pipeline steps.

        **params : dict of str -> object
            Parameters for fit method of each step using 'step__parameter' format.

        Returns
        -------
        Xt : array-like of shape (n_samples, n_transformed_features)
            Transformed samples from final estimator.
        """

    def fit_resample(self, X, y=None, **params):
        """
        Fit the model and resample with the final estimator.

        Fits all transformers/samplers sequentially, then uses fit_resample
        on transformed data with the final estimator.

        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first pipeline step.

        y : iterable, default=None
            Training targets. Must fulfill label requirements for all pipeline steps.

        **params : dict of str -> object
            Parameters for fit method of each step using 'step__parameter' format.

        Returns
        -------
        Xt : array-like of shape (n_samples_new, n_transformed_features)
            Resampled and transformed samples.

        yt : array-like of shape (n_samples_new,)
            Resampled target labels.
        """

    def predict(self, X, **params):
        """
        Transform data and apply predict with final estimator.

        Parameters
        ----------
        X : iterable
            Data to predict on. Must fulfill input requirements of first step.

        **params : dict of str -> object
            Parameters for predict method of final estimator.

        Returns
        -------
        y_pred : ndarray
            Predictions from final estimator.
        """

    def predict_proba(self, X, **params):
        """
        Transform data and apply predict_proba with final estimator.

        Parameters
        ----------
        X : iterable
            Data to predict probabilities for.

        **params : dict of str -> object
            Parameters for predict_proba method of final estimator.

        Returns
        -------
        y_proba : ndarray of shape (n_samples, n_classes)
            Class probability predictions.
        """

    def transform(self, X, **params):
        """
        Transform data through all pipeline steps.

        Parameters
        ----------
        X : iterable
            Data to transform through pipeline steps.

        **params : dict of str -> object
            Parameters for transform methods of pipeline steps.

        Returns
        -------
        Xt : ndarray
            Transformed data.
        """

    def inverse_transform(self, Xt, **params):
        """
        Apply inverse_transform for each step in reverse order.

        Parameters
        ----------
        Xt : array-like
            Transformed data to inverse transform.

        **params : dict of str -> object
            Parameters for inverse_transform methods.

        Returns
        -------
        X : ndarray
            Data in original feature space.
        """
```

### Attributes

```python { .api }
# Pipeline attributes after fitting
pipeline.named_steps        # Bunch object for accessing steps by name
pipeline.classes_           # Class labels from final estimator
pipeline.n_features_in_     # Number of input features
pipeline.feature_names_in_  # Input feature names (if available)
```

## Helper Functions

### make_pipeline

Construct a Pipeline from estimators without explicit naming.

```python { .api }
def make_pipeline(
    *steps,
    memory=None,
    transform_input=None,
    verbose=False,
):
    """
    Construct Pipeline from given estimators.

    Shorthand for Pipeline constructor that automatically names estimators
    based on their class names in lowercase.

    Parameters
    ----------
    *steps : list of estimators
        Sequence of estimators to chain in pipeline.

    memory : None, str or object with joblib.Memory interface, default=None
        Used to cache fitted transformers. If string, path to caching directory.

    transform_input : list of str, default=None
        Names of metadata parameters to transform through pipeline steps.
        Only available with metadata routing enabled.

    verbose : bool, default=False
        If True, print time elapsed while fitting each step.

    Returns
    -------
    p : Pipeline
        Imbalanced-learn Pipeline instance that handles samplers.
    """
```

## Key Differences from sklearn.pipeline.Pipeline

The imbalanced-learn Pipeline class extends scikit-learn's Pipeline with several important enhancements:

### 1. Sampler Support
- **fit_resample() Integration**: Native support for samplers that implement the `fit_resample()` method
- **Resampling During Fit**: Samplers are applied only during fit stages, not during transform/predict
- **Mixed Steps**: Can combine transformers (fit/transform) and samplers (fit_resample) in the same pipeline

### 2. Enhanced Validation
- **Step Validation**: Ensures intermediate steps implement either transform or fit_resample, but not both
- **Pipeline Nesting**: Prevents nesting of Pipeline objects within steps to avoid complexity
- **Passthrough Support**: Supports 'passthrough' and None values for skipping steps

### 3. Fit/Transform Behavior Warning
The pipeline breaks scikit-learn's usual contract where `fit_transform(X, y)` equals `fit(X, y).transform(X)`:
- **fit_transform()**: Resamples the data while fitting, so the output contains the resampled rows
- **fit().transform()**: Skips the samplers during the transform phase, so the output has one row per input row
- This ensures proper cross-validation behavior but can be surprising

## Usage Examples

### Basic Pipeline Creation

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data so the example runs end to end (an assumption for
# illustration; substitute your own X and y)
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Create pipeline with preprocessing, sampling, and classification
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('sampling', SMOTE(random_state=42)),
    ('pca', PCA(n_components=10)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit pipeline - resampling happens during fit
pipeline.fit(X_train, y_train)

# Make predictions - no resampling during prediction
y_pred = pipeline.predict(X_test)
```

### Pipeline with Multiple Sampling Steps

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Combine over-sampling and under-sampling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('over_sampling', SMOTE(random_state=42)),
    ('under_sampling', EditedNearestNeighbours()),
    ('classifier', SVC(probability=True))
])

pipeline.fit(X_train, y_train)
probabilities = pipeline.predict_proba(X_test)
```

### Using make_pipeline

```python
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# Automatic step naming based on class names
pipeline = make_pipeline(
    MinMaxScaler(),
    ADASYN(random_state=42),
    LogisticRegression(random_state=42),
    verbose=True  # Print timing information
)

pipeline.fit(X_train, y_train)
print(f"Pipeline steps: {list(pipeline.named_steps.keys())}")
# Pipeline steps: ['minmaxscaler', 'adasyn', 'logisticregression']
```

### Cross-validation with Pipeline

```python
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Create pipeline for cross-validation
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Cross-validation applies sampling within each fold
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(f"Cross-validation F1 scores: {scores}")
print(f"Mean F1 score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

### Memory Caching for Expensive Operations

```python
from joblib import Memory  # sklearn.externals.joblib was removed from scikit-learn
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Cache expensive transformations
cachedir = '/tmp/joblib_cache'
memory = Memory(location=cachedir, verbose=0)

# PCA(n_components=50) assumes X_train has at least 50 features and 50 samples
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('pca', PCA(n_components=50)),  # Expensive for large datasets
    ('classifier', RandomForestClassifier(random_state=42))
], memory=memory)

# First fit caches transformations
pipeline.fit(X_train, y_train)

# Subsequent fits with same parameters use cache
pipeline.set_params(classifier__n_estimators=200)
pipeline.fit(X_train, y_train)  # Reuses cached SMOTE and PCA results
```

### Parameter Grid Search

```python
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

pipeline = Pipeline([
    ('sampling', SMOTE()),
    ('classifier', SVC())
])

# Define parameter grid with step prefixes
param_grid = {
    'sampling__k_neighbors': [3, 5, 7],
    'sampling__random_state': [42],
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['rbf', 'linear']
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
```

### Advanced: Transform Input Parameters

```python
from sklearn import set_config
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Enable metadata routing (sklearn >= 1.4)
set_config(enable_metadata_routing=True)

# Pipeline that transforms validation set through preprocessing
# NOTE: the final estimator must accept X_val and y_val in fit() and request
# them through metadata routing. RandomForestClassifier does not, so treat it
# here as a placeholder for an estimator that consumes a validation set.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier())
], transform_input=['X_val'])

# Fit with validation set that gets transformed
# (X_val is scaled by the fitted scaler; y_val is passed through unchanged)
pipeline.fit(X_train, y_train, X_val=X_val, y_val=y_val)
```

### Custom Step Access and Inspection

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

pipeline.fit(X_train, y_train)

# Access individual steps
scaler = pipeline.named_steps['scaler']
sampler = pipeline.named_steps['sampling']
classifier = pipeline.named_steps['classifier']

# Get feature importance from final estimator
feature_importance = pipeline.named_steps['classifier'].feature_importances_

# Get resampling information
print(f"Original samples: {len(y_train)}")
# Note: Cannot directly get resampled data as sampling only occurs during fit

# Access pipeline properties
print(f"Number of pipeline steps: {len(pipeline.steps)}")
print(f"Step names: {list(pipeline.named_steps.keys())}")
print(f"Classes: {pipeline.classes_}")
```

## Best Practices

### 1. Cross-validation Safety
Always use the pipeline for cross-validation to prevent data leakage:
```python
# Correct: Sampling happens within each CV fold
scores = cross_val_score(pipeline, X, y, cv=5)

# Incorrect: Sampling applied to entire dataset first, so synthetic samples
# derived from validation rows leak into the training folds
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
scores = cross_val_score(classifier, X_resampled, y_resampled, cv=5)
```

### 2. Parameter Naming
Use double underscore notation for step-specific parameters:
```python
# Correct parameter naming
pipeline.set_params(
    sampling__k_neighbors=7,
    classifier__n_estimators=100
)

# Access parameters
params = pipeline.get_params()
print(params['sampling__random_state'])
```

### 3. Memory Management
Use caching for expensive operations in iterative workflows:
```python
# Cache expensive transformations
# (ExpensiveTransformer is a hypothetical placeholder for a costly transformer)
pipeline = Pipeline([
    ('expensive_transform', ExpensiveTransformer()),
    ('sampling', SMOTE()),
    ('classifier', RandomForestClassifier())
], memory='/tmp/cache')
```

### 4. Debugging and Monitoring
Use verbose mode and step inspection for debugging:
```python
# Enable timing information
# (steps is a list of (name, estimator) tuples, as in the examples above)
pipeline = Pipeline(steps, verbose=True)

# Inspect individual steps after fitting
for name, step in pipeline.named_steps.items():
    print(f"Step {name}: {step}")
```