Tessl Tile for pypi/imbalanced-learn@0.14.0

# Pipeline Integration

The imbalanced-learn `Pipeline` extends scikit-learn's `Pipeline` class so that sampling algorithms can be used as pipeline steps. Resampling is applied only while fitting, which keeps the pipeline compatible with cross-validation and model selection procedures.

## Overview

The imbalanced-learn pipeline system addresses key challenges when combining sampling methods with machine learning pipelines:

- **Resampling Integration**: Native support for `fit_resample()` methods in pipeline steps
- **Cross-validation Safety**: Prevents data leakage by applying sampling only during training phases
- **Memory Management**: Optional caching of expensive transformations and sampling operations
- **Parameter Routing**: Metadata routing for passing parameters to specific pipeline steps
- **Transform Input**: Ability to transform input parameters through pipeline stages

The pipeline components extend scikit-learn's pipeline functionality while keeping the same API.

## Pipeline Class

### Pipeline

Extended pipeline class that supports both transformers and samplers in a unified workflow.

```python { .api }
class Pipeline(pipeline.Pipeline):
    def __init__(
        self,
        steps,
        *,
        transform_input=None,
        memory=None,
        verbose=False,
    ):
        """
        Parameters
        ----------
        steps : list of (str, transformer/sampler) tuples
            List of (name, transform) tuples implementing fit/transform/fit_resample
            that are chained in order, with the last object an estimator.

        transform_input : list of str, default=None
            Names of metadata parameters that should be transformed by the pipeline
            before passing to the step consuming them. Enables transforming input
            arguments to fit() other than X. Only available with metadata routing enabled.

        memory : None, str or object with joblib.Memory interface, default=None
            Used to cache fitted transformers of the pipeline. If string, path to
            caching directory. Caching triggers cloning of transformers before fitting.

        verbose : bool, default=False
            If True, time elapsed while fitting each step will be printed.
        """

    def fit(self, X, y=None, **params):
        """
        Fit the model.

        Fits all transforms/samplers sequentially and transform/sample the data,
        then fits the final estimator on the transformed/sampled data.

        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first pipeline step.

        y : iterable, default=None
            Training targets. Must fulfill label requirements for all pipeline steps.

        **params : dict of str -> object
            Parameters passed to fit method of each step. Parameter names prefixed
            with step name and '__' separator (e.g., 'step__parameter').
            With metadata routing, parameters are forwarded based on step requests.

        Returns
        -------
        self : Pipeline
            Fitted pipeline instance.
        """

    def fit_transform(self, X, y=None, **params):
        """
        Fit the model and transform with the final estimator.

        Fits all transformers/samplers sequentially, then uses fit_transform
        on transformed data with the final estimator.

        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first pipeline step.

        y : iterable, default=None
            Training targets. Must fulfill label requirements for all pipeline steps.

        **params : dict of str -> object
            Parameters for fit method of each step using 'step__parameter' format.

        Returns
        -------
        Xt : array-like of shape (n_samples, n_transformed_features)
            Transformed samples from final estimator.
        """

    def fit_resample(self, X, y=None, **params):
        """
        Fit the model and resample with the final estimator.

        Fits all transformers/samplers sequentially, then uses fit_resample
        on transformed data with the final estimator.

        Parameters
        ----------
        X : iterable
            Training data. Must fulfill input requirements of first pipeline step.

        y : iterable, default=None
            Training targets. Must fulfill label requirements for all pipeline steps.

        **params : dict of str -> object
            Parameters for fit method of each step using 'step__parameter' format.

        Returns
        -------
        Xt : array-like of shape (n_samples_new, n_transformed_features)
            Resampled and transformed samples.

        yt : array-like of shape (n_samples_new,)
            Resampled target labels.
        """

    def predict(self, X, **params):
        """
        Transform data and apply predict with final estimator.

        Parameters
        ----------
        X : iterable
            Data to predict on. Must fulfill input requirements of first step.

        **params : dict of str -> object
            Parameters for predict method of final estimator.

        Returns
        -------
        y_pred : ndarray
            Predictions from final estimator.
        """

    def predict_proba(self, X, **params):
        """
        Transform data and apply predict_proba with final estimator.

        Parameters
        ----------
        X : iterable
            Data to predict probabilities for.

        **params : dict of str -> object
            Parameters for predict_proba method of final estimator.

        Returns
        -------
        y_proba : ndarray of shape (n_samples, n_classes)
            Class probability predictions.
        """

    def transform(self, X, **params):
        """
        Transform data through all pipeline steps.

        Parameters
        ----------
        X : iterable
            Data to transform through pipeline steps.

        **params : dict of str -> object
            Parameters for transform methods of pipeline steps.

        Returns
        -------
        Xt : ndarray
            Transformed data.
        """

    def inverse_transform(self, Xt, **params):
        """
        Apply inverse_transform for each step in reverse order.

        Parameters
        ----------
        Xt : array-like
            Transformed data to inverse transform.

        **params : dict of str -> object
            Parameters for inverse_transform methods.

        Returns
        -------
        X : ndarray
            Data in original feature space.
        """
```

### Attributes

```python { .api }
# Pipeline attributes after fitting
pipeline.named_steps  # Bunch object for accessing steps by name
pipeline.classes_     # Class labels from final estimator
pipeline.n_features_in_     # Number of input features
pipeline.feature_names_in_  # Input feature names (if available)
```

## Helper Functions

### make_pipeline

Construct a Pipeline from estimators without explicit naming.

```python { .api }
def make_pipeline(
    *steps,
    memory=None,
    transform_input=None,
    verbose=False,
):
    """
    Construct Pipeline from given estimators.

    Shorthand for Pipeline constructor that automatically names estimators
    based on their class names in lowercase.

    Parameters
    ----------
    *steps : list of estimators
        Sequence of estimators to chain in pipeline.

    memory : None, str or object with joblib.Memory interface, default=None
        Used to cache fitted transformers. If string, path to caching directory.

    transform_input : list of str, default=None
        Names of metadata parameters to transform through pipeline steps.
        Only available with metadata routing enabled.

    verbose : bool, default=False
        If True, print time elapsed while fitting each step.

    Returns
    -------
    p : Pipeline
        Imbalanced-learn Pipeline instance that handles samplers.
    """
```

## Key Differences from sklearn.pipeline.Pipeline

The imbalanced-learn Pipeline class extends scikit-learn's Pipeline with several important enhancements:

### 1. Sampler Support
- **fit_resample() Integration**: Native support for samplers that implement `fit_resample()` method
- **Resampling During Fit**: Samplers are applied only during fit stages, not during transform/predict
- **Mixed Steps**: Can combine transformers (fit/transform) and samplers (fit_resample) in same pipeline

### 2. Enhanced Validation
- **Step Validation**: Ensures intermediate steps implement either transform or fit_resample, but not both
- **Pipeline Nesting**: Prevents nesting of Pipeline objects within steps to avoid complexity
- **Passthrough Support**: Supports 'passthrough' and None values for skipping steps

### 3. Fit/Transform Behavior Warning
The pipeline breaks scikit-learn's usual contract where `fit_transform(X, y)` equals `fit(X, y).transform(X)`:
- **fit_transform()**: Applies resampling during the process
- **fit().transform()**: No resampling applied during transform phase
- This ensures proper cross-validation behavior but can be surprising

## Usage Examples

### Basic Pipeline Creation

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train and X_test are assumed to be defined already;
# PCA(n_components=10) requires at least 10 input features

# Create pipeline with preprocessing, sampling, and classification
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('sampling', SMOTE(random_state=42)),
    ('pca', PCA(n_components=10)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit pipeline - resampling happens during fit
pipeline.fit(X_train, y_train)

# Make predictions - no resampling during prediction
y_pred = pipeline.predict(X_test)
```

### Pipeline with Multiple Sampling Steps

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Combine over-sampling and under-sampling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('over_sampling', SMOTE(random_state=42)),
    ('under_sampling', EditedNearestNeighbours()),
    ('classifier', SVC(probability=True))
])

pipeline.fit(X_train, y_train)
probabilities = pipeline.predict_proba(X_test)
```

### Using make_pipeline

```python
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# Automatic step naming based on class names
pipeline = make_pipeline(
    MinMaxScaler(),
    ADASYN(random_state=42),
    LogisticRegression(random_state=42),
    verbose=True  # Print timing information
)

pipeline.fit(X_train, y_train)
print(f"Pipeline steps: {list(pipeline.named_steps.keys())}")
# Prints: Pipeline steps: ['minmaxscaler', 'adasyn', 'logisticregression']
```

### Cross-validation with Pipeline

```python
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Create pipeline for cross-validation
pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Cross-validation applies sampling within each training fold only.
# scoring='f1' assumes a binary target; use e.g. 'f1_macro' for multiclass.
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(f"Cross-validation F1 scores: {scores}")
print(f"Mean F1 score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

### Memory Caching for Expensive Operations

```python
from joblib import Memory
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Cache expensive transformations
cachedir = '/tmp/joblib_cache'
memory = Memory(cachedir, verbose=0)

pipeline = Pipeline([
    ('sampling', SMOTE(random_state=42)),
    ('pca', PCA(n_components=50)),  # Expensive for large datasets
    ('classifier', RandomForestClassifier(random_state=42))
], memory=memory)

# First fit caches the fitted sampler and transformer steps
pipeline.fit(X_train, y_train)

# Changing only the final estimator leaves the earlier steps unchanged,
# so refitting on the same data reuses the cached SMOTE and PCA results
pipeline.set_params(classifier__n_estimators=200)
pipeline.fit(X_train, y_train)
```

### Parameter Grid Search

```python
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

pipeline = Pipeline([
    ('sampling', SMOTE()),
    ('classifier', SVC())
])

# Define parameter grid with step prefixes
param_grid = {
    'sampling__k_neighbors': [3, 5, 7],
    'sampling__random_state': [42],
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['rbf', 'linear']
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
```

### Advanced: Transform Input Parameters

```python
from sklearn import set_config
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# Enable metadata routing (required for transform_input)
set_config(enable_metadata_routing=True)

# The final estimator must accept and request the routed parameters.
# Assumption: HistGradientBoostingClassifier.fit accepts X_val/y_val,
# which needs a recent scikit-learn release (1.7 or later).
classifier = HistGradientBoostingClassifier(early_stopping=True)
classifier.set_fit_request(X_val=True, y_val=True)

# X_val is passed through the scaler before it reaches the classifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('sampling', SMOTE(random_state=42)),
    ('classifier', classifier)
], transform_input=['X_val'])

# Fit with a validation set that gets transformed
pipeline.fit(X_train, y_train, X_val=X_val, y_val=y_val)
```

### Custom Step Access and Inspection

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('sampling', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

pipeline.fit(X_train, y_train)

# Access individual steps
scaler = pipeline.named_steps['scaler']
sampler = pipeline.named_steps['sampling']
classifier = pipeline.named_steps['classifier']

# Get feature importance from final estimator
feature_importance = pipeline.named_steps['classifier'].feature_importances_

# Get resampling information
print(f"Original samples: {len(y_train)}")
# Note: Cannot directly get resampled data as sampling only occurs during fit

# Access pipeline properties
print(f"Number of pipeline steps: {len(pipeline.steps)}")
print(f"Step names: {list(pipeline.named_steps.keys())}")
print(f"Classes: {pipeline.classes_}")
```

## Best Practices

### 1. Cross-validation Safety
Always use the pipeline for cross-validation to prevent data leakage:
```python
# Correct: Sampling happens within each CV fold
scores = cross_val_score(pipeline, X, y, cv=5)

# Incorrect: Sampling applied to entire dataset first
X_resampled, y_resampled = smote.fit_resample(X, y)
scores = cross_val_score(classifier, X_resampled, y_resampled, cv=5)
```

### 2. Parameter Naming
Use double underscore notation for step-specific parameters:
```python
# Correct parameter naming
pipeline.set_params(
    sampling__k_neighbors=7,
    classifier__n_estimators=100
)

# Access parameters
params = pipeline.get_params()
print(params['sampling__random_state'])
```

### 3. Memory Management
Use caching for expensive operations in iterative workflows:
```python
# Cache expensive transformations
# (ExpensiveTransformer is a placeholder for any costly transformer)
pipeline = Pipeline([
    ('expensive_transform', ExpensiveTransformer()),
    ('sampling', SMOTE()),
    ('classifier', RandomForestClassifier())
], memory='/tmp/cache')
```

### 4. Debugging and Monitoring
Use verbose mode and step inspection for debugging:
```python
# Enable timing information
pipeline = Pipeline(steps, verbose=True)

# Inspect individual steps after fitting
for name, step in pipeline.named_steps.items():
    print(f"Step {name}: {step}")
```
