Tessl Tile for pypi/feature-engine@1.2.0

# Feature Selection

Transformers that remove or select features based on criteria such as variance, correlation, model performance, and statistical tests, in order to improve model performance and reduce dimensionality.

## Capabilities

### Drop Features by Name

Drops a user-specified list of variables from the dataframe.

```python { .api }
class DropFeatures:
    def __init__(self, features_to_drop):
        """
        Initialize DropFeatures.

        Parameters:
        - features_to_drop (list): Variable names to be dropped from dataframe
        """

    def fit(self, X, y=None):
        """
        Validate that features exist in dataset (no parameters learned).

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Drop indicated features from dataset.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with specified features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.selection import DropFeatures
import pandas as pd

# Sample data
data = {'var1': [1, 2, 3], 'var2': [4, 5, 6], 'var3': [7, 8, 9]}
df = pd.DataFrame(data)

# Drop specific features
selector = DropFeatures(['var1', 'var3'])
df_reduced = selector.fit_transform(df)
# Result: only var2 remains

print(selector.features_to_drop_)  # Shows features that will be dropped
```

### Drop Constant Features

Removes constant and quasi-constant features, which carry little information.

```python { .api }
class DropConstantFeatures:
    def __init__(self, variables=None, tol=1, missing_values='raise'):
        """
        Initialize DropConstantFeatures.

        Parameters:
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        - tol (float): Threshold for quasi-constant detection (0-1). Variables whose most frequent value appears in at least this fraction of observations are dropped
        - missing_values (str): How to handle missing values - 'raise' or 'ignore'
        """

    def fit(self, X, y=None):
        """
        Identify constant and quasi-constant features.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove constant and quasi-constant features.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with constant features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.selection import DropConstantFeatures

# df: a pandas DataFrame (not defined in this snippet)

# Drop truly constant features (default, tol=1)
selector = DropConstantFeatures()
df_reduced = selector.fit_transform(df)

# Drop quasi-constant features (most frequent value in at least 95% of rows)
selector = DropConstantFeatures(tol=0.95)
df_reduced = selector.fit_transform(df)

print(selector.features_to_drop_)  # Features identified as constant/quasi-constant
```
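To make the `tol` parameter concrete, the check can be approximated in plain pandas. This is a minimal sketch, assuming a feature is flagged when its most frequent value covers at least `tol` of the rows. The helper `quasi_constant_columns` and the `demo` frame are hypothetical names for illustration and are not part of feature-engine.

```python
import pandas as pd

def quasi_constant_columns(df: pd.DataFrame, tol: float = 1.0) -> list:
    """Return columns whose most frequent value covers at least `tol` of the rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts in descending order, so the first entry is the
        # share of rows taken by the single most frequent value
        top_share = df[col].value_counts(normalize=True).iloc[0]
        if top_share >= tol:
            flagged.append(col)
    return flagged

# Hypothetical demo data: one constant, one quasi-constant, one varied column
demo = pd.DataFrame({
    "constant": [1] * 20,
    "quasi": [0] * 19 + [1],
    "varied": list(range(20)),
})

strict = quasi_constant_columns(demo, tol=1.0)  # only fully constant columns
loose = quasi_constant_columns(demo, tol=0.9)   # also catches the 95%-zero column
print(strict)
print(loose)
```

With `tol=1.0` only the fully constant column is flagged; lowering `tol` widens the net to columns dominated by a single value.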

### Drop Duplicate Features

Removes features whose values are identical to those of another feature in the dataframe.

```python { .api }
class DropDuplicateFeatures:
    def __init__(self, variables=None, missing_values='raise'):
        """
        Initialize DropDuplicateFeatures.

        Parameters:
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        - missing_values (str): How to handle missing values - 'raise' or 'ignore'
        """

    def fit(self, X, y=None):
        """
        Identify duplicate features.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove duplicate features, keeping first occurrence.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with duplicate features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```
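The "keep first occurrence" behaviour described above can be illustrated with plain pandas. This is only a sketch of the idea, not feature-engine's implementation; `duplicate_columns` and the `demo` frame are hypothetical names introduced for the example.

```python
import pandas as pd

def duplicate_columns(df: pd.DataFrame) -> list:
    """Return columns whose values repeat an earlier column exactly."""
    # Transposing turns columns into rows, so duplicated() marks every
    # column that matches one already seen; the first occurrence is kept.
    mask = df.T.duplicated(keep="first")
    return mask[mask].index.tolist()

# Hypothetical demo data: "a_copy" repeats "a"
demo = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "a_copy": [1, 2, 3],
})

dupes = duplicate_columns(demo)
deduped = demo.drop(columns=dupes)
print(dupes)
print(list(deduped.columns))
```

Transposing is convenient for small frames; on wide or mixed-type data it can be slow, so treat this as an illustration rather than a recipe.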

### Drop Correlated Features

Removes correlated features from the dataframe to reduce multicollinearity.

```python { .api }
class DropCorrelatedFeatures:
    def __init__(self, variables=None, method='pearson', threshold=0.8, missing_values='raise'):
        """
        Initialize DropCorrelatedFeatures.

        Parameters:
        - variables (list): List of numerical variables to evaluate. If None, selects all numerical variables
        - method (str): Correlation method - 'pearson', 'spearman', or 'kendall'
        - threshold (float): Correlation threshold (0-1) above which features are considered correlated
        - missing_values (str): How to handle missing values - 'raise' or 'ignore'
        """

    def fit(self, X, y=None):
        """
        Identify correlated features to remove.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove correlated features.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with correlated features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.selection import DropCorrelatedFeatures

# df: a pandas DataFrame with numerical columns (not defined in this snippet)

# Drop features with Pearson correlation > 0.8
selector = DropCorrelatedFeatures(threshold=0.8, method='pearson')
df_reduced = selector.fit_transform(df)

# Use Spearman correlation
selector = DropCorrelatedFeatures(threshold=0.9, method='spearman')
df_reduced = selector.fit_transform(df)

print(selector.correlated_feature_sets_)  # Shows groups of correlated features
print(selector.features_to_drop_)  # Features selected for removal
```
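The thresholding idea can be sketched with pandas alone. The version below assumes that columns are scanned left to right, that the absolute correlation is compared against the threshold, and that a column is dropped when it correlates with one already kept. These are illustrative assumptions, and `correlated_to_drop` is a hypothetical helper, not a feature-engine function.

```python
import pandas as pd

def correlated_to_drop(df: pd.DataFrame, threshold: float = 0.8,
                       method: str = "pearson") -> list:
    """Flag columns correlated above `threshold` with a column already kept."""
    corr = df.corr(method=method).abs()
    kept, dropped = [], []
    for col in corr.columns:
        # Compare the candidate only against columns retained so far
        if any(corr.loc[col, k] > threshold for k in kept):
            dropped.append(col)
        else:
            kept.append(col)
    return dropped

# Hypothetical demo data: "x_scaled" is an exact multiple of "x"
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "x_scaled": [2, 4, 6, 8, 10],
    "noise": [3, 1, 4, 1, 5],
})

to_drop = correlated_to_drop(demo, threshold=0.8)
print(to_drop)
```

Because the scan is order-dependent, reordering the columns changes which member of a correlated pair survives.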

### Smart Correlated Selection

Selects one feature from each group of correlated features, using a criterion such as variance or model performance against the target variable.

```python { .api }
class SmartCorrelatedSelection:
    def __init__(self, variables=None, method='pearson', threshold=0.8,
                 selection_method='variance', estimator=None, scoring='accuracy', cv=3):
        """
        Initialize SmartCorrelatedSelection.

        Parameters:
        - variables (list): List of numerical variables to evaluate. If None, selects all numerical variables
        - method (str): Correlation method - 'pearson', 'spearman', or 'kendall'
        - threshold (float): Correlation threshold (0-1) for grouping correlated features
        - selection_method (str): Method to select from correlated groups - 'variance' or 'model_performance'
        - estimator: Sklearn estimator for performance-based selection
        - scoring (str): Scoring metric for model performance evaluation
        - cv (int): Cross-validation folds
        """

    def fit(self, X, y=None):
        """
        Identify correlated groups and select best feature from each group.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required for model_performance selection)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Keep only selected features from correlated groups.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with smart feature selection applied
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```
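For the `'variance'` selection method, the choice within a correlated group can be pictured as follows. This sketch assumes the feature with the highest variance in each group is the one retained. The grouping itself is supplied by hand, and `pick_by_variance` is a hypothetical helper rather than library code.

```python
import pandas as pd

def pick_by_variance(df: pd.DataFrame, groups: list) -> list:
    """From each group of correlated columns, keep the one with the highest variance."""
    variances = df.var()
    return [max(group, key=lambda col: variances[col]) for group in groups]

# Hypothetical demo data: "x" and "x_scaled" are perfectly correlated
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "x_scaled": [2, 4, 6, 8, 10],
    "noise": [3, 1, 4, 1, 5],
})

# Hypothetical grouping, e.g. what a correlation threshold of 0.8 might produce
groups = [{"x", "x_scaled"}]
selected = pick_by_variance(demo, groups)
print(selected)
```

Variance is scale-dependent, so features measured in larger units tend to win; the `'model_performance'` option avoids this by scoring each candidate against the target instead.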

### Performance-Based Selection

Selects features based on the performance of a model trained on each feature individually.

```python { .api }
class SelectBySingleFeaturePerformance:
    def __init__(self, estimator, scoring='accuracy', cv=3, threshold=0.5, variables=None):
        """
        Initialize SelectBySingleFeaturePerformance.

        Parameters:
        - estimator: Sklearn estimator to evaluate feature performance
        - scoring (str): Scoring metric for performance evaluation
        - cv (int): Cross-validation folds
        - threshold (float): Performance threshold for feature selection
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        """

    def fit(self, X, y):
        """
        Evaluate individual performance of each feature.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features that meet performance threshold.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with only high-performing features
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.selection import SelectBySingleFeaturePerformance
from sklearn.ensemble import RandomForestClassifier

# df: feature DataFrame, y: matching target Series (not defined in this snippet)

# Select features based on individual performance
selector = SelectBySingleFeaturePerformance(
    estimator=RandomForestClassifier(n_estimators=10),
    scoring='accuracy',
    cv=3,
    threshold=0.6
)
df_selected = selector.fit_transform(df, y)

print(selector.feature_performance_)  # Performance score per feature
print(selector.features_to_drop_)  # Features below threshold
```

### Recursive Feature Elimination

Selects features by recursively eliminating the worst-performing features.

```python { .api }
class RecursiveFeatureElimination:
    def __init__(self, estimator, scoring='accuracy', cv=3, threshold=0.01, variables=None):
        """
        Initialize RecursiveFeatureElimination.

        Parameters:
        - estimator: Sklearn estimator with feature_importances_ or coef_ attribute
        - scoring (str): Scoring metric for performance evaluation
        - cv (int): Cross-validation folds
        - threshold (float): Maximum performance drop tolerated when a feature is removed; a feature whose removal costs more than this is kept
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        """

    def fit(self, X, y):
        """
        Perform recursive feature elimination.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features identified by recursive elimination.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with selected features only
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```
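The elimination loop can be sketched with NumPy alone. The version below makes several simplifying assumptions that differ from the class above: it uses an ordinary least-squares fit instead of a scikit-learn estimator, the in-sample R² instead of cross-validated scoring, and absolute coefficients as the importance ranking. The names `fit_r2`, `names`, and the synthetic data are hypothetical.

```python
import numpy as np

def fit_r2(X, y):
    """Fit least squares with an intercept; return (in-sample R^2, feature coefficients)."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return r2, coef[:-1]

# Synthetic data: columns 0 and 1 drive the target, column 2 is pure noise
rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
names = ["strong", "weak", "noise"]

threshold = 0.01
baseline, coefs = fit_r2(X, y)

# Visit features from least to most important
order = np.argsort(np.abs(coefs))
kept = list(range(X.shape[1]))
for j in order:
    candidate = [k for k in kept if k != j]
    if not candidate:
        break
    score, _ = fit_r2(X[:, candidate], y)
    # Remove the feature only if the score barely changes without it
    if baseline - score < threshold:
        kept = candidate
        baseline = score

selected = [names[k] for k in kept]
print(selected)
```

Here the noise column is removed because dropping it costs almost nothing, while removing either informative column would lower R² by far more than the threshold.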

### Recursive Feature Addition

Selects features by recursively adding the best-performing features.

```python { .api }
class RecursiveFeatureAddition:
    def __init__(self, estimator, scoring='accuracy', cv=3, threshold=0.01, variables=None):
        """
        Initialize RecursiveFeatureAddition.

        Parameters:
        - estimator: Sklearn estimator for performance evaluation
        - scoring (str): Scoring metric for performance evaluation
        - cv (int): Cross-validation folds
        - threshold (float): Minimum performance improvement required for an added feature to be kept
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        """

    def fit(self, X, y):
        """
        Perform recursive feature addition.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features identified by recursive addition.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with selected features only
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

### Selection by Shuffling

Selects features by measuring how much model performance drops after the values of each feature are shuffled.

```python { .api }
class SelectByShuffling:
    def __init__(self, estimator, scoring='accuracy', cv=3, threshold=0.01, variables=None):
        """
        Initialize SelectByShuffling.

        Parameters:
        - estimator: Sklearn estimator for performance evaluation
        - scoring (str): Scoring metric for performance evaluation
        - cv (int): Cross-validation folds
        - threshold (float): Minimum performance drop after shuffling for a feature to be considered important
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        """

    def fit(self, X, y):
        """
        Evaluate feature importance by shuffling.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features that show significant performance drop when shuffled.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with important features only
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```
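The shuffling idea can be demonstrated without any estimator library. The sketch below fits a least-squares model once, permutes one column at a time, and records the drop in R². It scores in-sample rather than with cross-validation, and the data and names (`r2`, `drops`, `selected`) are hypothetical, so it illustrates the principle rather than the class's exact procedure.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data: column 0 is informative, column 1 is unrelated to y
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

# Fit a plain least-squares model once (with intercept)
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
baseline = r2(y, A @ coef)

drops = {}
for j, name in enumerate(["informative", "irrelevant"]):
    X_shuffled = X.copy()
    # Permuting a column breaks its link to y but keeps its distribution
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    A_shuffled = np.column_stack([X_shuffled, np.ones(n)])
    drops[name] = baseline - r2(y, A_shuffled @ coef)

threshold = 0.01
selected = [name for name, drop in drops.items() if drop > threshold]
print(drops)
print(selected)
```

Shuffling the informative column destroys most of the model's accuracy, while shuffling the unrelated column changes the score by a negligible amount, so only the former passes the threshold.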

### Drop High PSI Features

Removes features with a high Population Stability Index (PSI), which indicates a significant shift in the feature's distribution between two samples.

```python { .api }
class DropHighPSIFeatures:
    def __init__(self, variables=None, split_frac=0.5, threshold=0.25,
                 missing_values='raise', switch=False):
        """
        Initialize DropHighPSIFeatures.

        Parameters:
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        - split_frac (float): Fraction of observations assigned to the reference set; the remainder forms the comparison set
        - threshold (float): PSI threshold above which features are dropped
        - missing_values (str): How to handle missing values - 'raise' or 'ignore'
        - switch (bool): Whether to switch reference and comparison datasets
        """

    def fit(self, X, y=None):
        """
        Calculate PSI for each variable and identify features to drop.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove features with high PSI.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with high PSI features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.selection import DropHighPSIFeatures

# df: a pandas DataFrame with enough rows to split into two samples (not defined in this snippet)

# Drop features with PSI > 0.25 indicating significant data drift
selector = DropHighPSIFeatures(threshold=0.25, split_frac=0.6)
df_stable = selector.fit_transform(df)

print(selector.features_to_drop_)  # Features with high PSI
print(selector.psi_values_)  # PSI values per feature
```
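For readers unfamiliar with the statistic, PSI compares the binned distribution of a variable in two samples: for bin shares p (reference) and q (comparison), PSI = Σ (q − p) · ln(q / p). The sketch below computes it with NumPy, assuming equal-frequency bins taken from the reference sample and a small floor `eps` to avoid empty bins. The helpers `bin_shares` and `psi` and the bin count are illustrative choices, not the class's internals.

```python
import numpy as np

def bin_shares(values, inner_edges):
    """Share of observations falling in each bin delimited by `inner_edges`."""
    idx = np.searchsorted(inner_edges, values, side="right")
    return np.bincount(idx, minlength=len(inner_edges) + 1) / len(values)

def psi(reference, comparison, bins=10, eps=1e-4):
    """Population Stability Index of `comparison` relative to `reference`."""
    # Equal-frequency bins: interior cut points are quantiles of the reference sample
    inner_edges = np.quantile(reference, np.linspace(0, 1, bins + 1)[1:-1])
    # Floor the shares so that empty bins do not produce log(0)
    ref = np.clip(bin_shares(reference, inner_edges), eps, None)
    cmp_ = np.clip(bin_shares(comparison, inner_edges), eps, None)
    return float(np.sum((cmp_ - ref) * np.log(cmp_ / ref)))

# Synthetic samples: one drawn from the same distribution, one with a shifted mean
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)
same_dist = rng.normal(0.0, 1.0, size=5000)
shifted = rng.normal(1.0, 1.0, size=5000)

psi_same = psi(reference, same_dist)
psi_shifted = psi(reference, shifted)
print(f"same distribution: {psi_same:.4f}")
print(f"shifted mean:      {psi_shifted:.4f}")
```

A sample from the same distribution yields a PSI close to zero, whereas a one-standard-deviation shift in the mean lands well above the 0.25 threshold used in the example above.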

### Select by Target Mean Performance

Selects features in a univariate fashion, scoring each one by how well the target mean per category or bin predicts the target.

```python { .api }
class SelectByTargetMeanPerformance:
    def __init__(self, variables=None, scoring='roc_auc', threshold=0.5, bins=5):
        """
        Initialize SelectByTargetMeanPerformance.

        Parameters:
        - variables (list): List of variables to evaluate. If None, evaluates all numerical variables
        - scoring (str): Performance metric to use for feature evaluation
        - threshold (float): Performance threshold for feature selection
        - bins (int): Number of bins for discretizing continuous variables
        """

    def fit(self, X, y):
        """
        Evaluate target mean performance for each variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features that meet target mean performance threshold.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with selected features only
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.selection import SelectByTargetMeanPerformance

# df: feature DataFrame, y: matching target Series (not defined in this snippet)

# Select features based on target mean performance
selector = SelectByTargetMeanPerformance(
    scoring='roc_auc',
    threshold=0.6,
    bins=5
)
df_selected = selector.fit_transform(df, y)

print(selector.feature_performance_)  # Performance scores per feature
print(selector.features_to_drop_)  # Features below threshold
```

## Usage Patterns

### Sequential Feature Selection Pipeline

```python
from sklearn.pipeline import Pipeline
from feature_engine.selection import (
    DropConstantFeatures,
    DropCorrelatedFeatures,
    SelectBySingleFeaturePerformance
)
from sklearn.ensemble import RandomForestClassifier

# Multi-step feature selection pipeline
selection_pipeline = Pipeline([
    ('drop_constant', DropConstantFeatures(tol=0.99)),
    ('drop_correlated', DropCorrelatedFeatures(threshold=0.95)),
    ('performance_selection', SelectBySingleFeaturePerformance(
        estimator=RandomForestClassifier(n_estimators=10),
        threshold=0.6
    ))
])

df_selected = selection_pipeline.fit_transform(df, y)
```

### Feature Selection with Cross-Validation

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from feature_engine.selection import RecursiveFeatureElimination

# Assumes X_train, X_test and y_train already exist (e.g. from train_test_split)

# Feature selection followed by evaluation
selector = RecursiveFeatureElimination(
    estimator=RandomForestClassifier(),
    cv=5,
    threshold=0.01
)

# Fit selector on the training data only
selector.fit(X_train, y_train)

# Transform datasets
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Evaluate selected features
scores = cross_val_score(
    RandomForestClassifier(),
    X_train_selected,
    y_train,
    cv=5
)
print(f"CV Score with selected features: {scores.mean():.3f}")
```

## Common Attributes

All selection transformers share these fitted attributes:

- `features_to_drop_` (list): Features identified for removal
- `n_features_in_` (int): Number of features in training set

Selector-specific attributes:

- `correlated_feature_sets_` (list): Groups of correlated features (correlation-based selectors)
- `feature_performance_` (dict): Performance scores per feature (performance-based selectors)
- `performance_drifts_` (dict): Performance changes during selection process (recursive selectors)
