# Feature Selection

Transformers that drop or select features based on criteria such as variance, correlation, performance metrics, and statistical tests, in order to reduce dimensionality and improve model performance.

## Capabilities

### Drop Features by Name

Drops a list of variables indicated by the user from the dataframe.

```python { .api }
class DropFeatures:
    def __init__(self, features_to_drop):
        """
        Initialize DropFeatures.

        Parameters:
        - features_to_drop (list): Variable names to be dropped from dataframe
        """

    def fit(self, X, y=None):
        """
        Validate that features exist in dataset (no parameters learned).

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Drop indicated features from dataset.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with specified features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.selection import DropFeatures
import pandas as pd

# Sample data
data = {'var1': [1, 2, 3], 'var2': [4, 5, 6], 'var3': [7, 8, 9]}
df = pd.DataFrame(data)

# Drop specific features
selector = DropFeatures(['var1', 'var3'])
df_reduced = selector.fit_transform(df)
# Result: only var2 remains

print(selector.features_to_drop_)  # Shows features that will be dropped
```

### Drop Constant Features

Removes constant and quasi-constant features, which carry little information.

```python { .api }
class DropConstantFeatures:
    def __init__(self, variables=None, tol=1, missing_values='raise'):
        """
        Initialize DropConstantFeatures.

        Parameters:
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        - tol (float): Threshold for quasi-constant detection (0-1). Variables whose most frequent value accounts for at least this fraction of observations are dropped; the default of 1 drops only truly constant variables
        - missing_values (str): How to handle missing values - 'raise' or 'ignore'
        """

    def fit(self, X, y=None):
        """
        Identify constant and quasi-constant features.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove constant and quasi-constant features.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with constant features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.selection import DropConstantFeatures

# Drop truly constant features (default)
selector = DropConstantFeatures()
df_reduced = selector.fit_transform(df)

# Drop quasi-constant features (most frequent value in at least 95% of rows)
selector = DropConstantFeatures(tol=0.95)
df_reduced = selector.fit_transform(df)

print(selector.features_to_drop_)  # Features identified as constant/quasi-constant
```
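
The `tol` rule can be reproduced in plain pandas to see which columns it would flag. The sketch below illustrates the rule as described above and is not feature_engine's implementation; the helper name `quasi_constant_columns` and the sample columns are hypothetical.

```python
import pandas as pd

def quasi_constant_columns(df: pd.DataFrame, tol: float = 1.0) -> list:
    """Return columns whose most frequent value covers at least `tol` of the rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the dominant value
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= tol:
            flagged.append(col)
    return flagged

# Hypothetical data: one constant, one quasi-constant, one varied column
df_demo = pd.DataFrame({
    "const": [1] * 20,
    "flag": [0] * 19 + [1],
    "varied": list(range(20)),
})

only_constant = quasi_constant_columns(df_demo, tol=1.0)
quasi_constant = quasi_constant_columns(df_demo, tol=0.9)
print(only_constant)
print(quasi_constant)
```

With `tol=1.0` only the fully constant column is flagged; lowering `tol` also catches the column dominated by a single value.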

### Drop Duplicate Features

Removes duplicate features from dataframe based on identical values.

```python { .api }
class DropDuplicateFeatures:
    def __init__(self, variables=None, missing_values='raise'):
        """
        Initialize DropDuplicateFeatures.

        Parameters:
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        - missing_values (str): How to handle missing values - 'raise' or 'ignore'
        """

    def fit(self, X, y=None):
        """
        Identify duplicate features.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove duplicate features, keeping first occurrence.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with duplicate features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

### Drop Correlated Features

Removes correlated features from dataframe to reduce multicollinearity.

```python { .api }
class DropCorrelatedFeatures:
    def __init__(self, variables=None, method='pearson', threshold=0.8, missing_values='raise'):
        """
        Initialize DropCorrelatedFeatures.

        Parameters:
        - variables (list): List of numerical variables to evaluate. If None, selects all numerical variables
        - method (str): Correlation method - 'pearson', 'spearman', or 'kendall'
        - threshold (float): Correlation threshold (0-1) above which features are considered correlated
        - missing_values (str): How to handle missing values - 'raise' or 'ignore'
        """

    def fit(self, X, y=None):
        """
        Identify correlated features to remove.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove correlated features.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with correlated features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.selection import DropCorrelatedFeatures

# Drop features with Pearson correlation > 0.8
selector = DropCorrelatedFeatures(threshold=0.8, method='pearson')
df_reduced = selector.fit_transform(df)

# Use Spearman correlation
selector = DropCorrelatedFeatures(threshold=0.9, method='spearman')
df_reduced = selector.fit_transform(df)

print(selector.correlated_feature_sets_)  # Shows groups of correlated features
print(selector.features_to_drop_)  # Features selected for removal
```

### Smart Correlated Selection

Groups correlated features and keeps one feature from each group, chosen either by variance or by model performance against the target variable.

```python { .api }
class SmartCorrelatedSelection:
    def __init__(self, variables=None, method='pearson', threshold=0.8,
                 selection_method='variance', estimator=None, scoring='accuracy', cv=3):
        """
        Initialize SmartCorrelatedSelection.

        Parameters:
        - variables (list): List of numerical variables to evaluate. If None, selects all numerical variables
        - method (str): Correlation method - 'pearson', 'spearman', or 'kendall'
        - threshold (float): Correlation threshold (0-1) for grouping correlated features
        - selection_method (str): Method to select from correlated groups - 'variance' or 'model_performance'
        - estimator: Sklearn estimator for performance-based selection
        - scoring (str): Scoring metric for model performance evaluation
        - cv (int): Cross-validation folds
        """

    def fit(self, X, y=None):
        """
        Identify correlated groups and select best feature from each group.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required for model_performance selection)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Keep only selected features from correlated groups.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with smart feature selection applied
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

### Performance-Based Selection

Selects features based on the performance of a model trained on each feature individually.

```python { .api }
class SelectBySingleFeaturePerformance:
    def __init__(self, estimator, scoring='accuracy', cv=3, threshold=0.5, variables=None):
        """
        Initialize SelectBySingleFeaturePerformance.

        Parameters:
        - estimator: Sklearn estimator to evaluate feature performance
        - scoring (str): Scoring metric for performance evaluation
        - cv (int): Cross-validation folds
        - threshold (float): Performance threshold for feature selection
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        """

    def fit(self, X, y):
        """
        Evaluate individual performance of each feature.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features that meet performance threshold.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with only high-performing features
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.selection import SelectBySingleFeaturePerformance
from sklearn.ensemble import RandomForestClassifier

# Select features based on individual performance
# (y is assumed to be a target Series aligned with df)
selector = SelectBySingleFeaturePerformance(
    estimator=RandomForestClassifier(n_estimators=10),
    scoring='accuracy',
    cv=3,
    threshold=0.6
)
df_selected = selector.fit_transform(df, y)

print(selector.feature_performance_)  # Performance score per feature
print(selector.features_to_drop_)  # Features below threshold
```

### Recursive Feature Elimination

Selects features by recursively eliminating the worst-performing features.

```python { .api }
class RecursiveFeatureElimination:
    def __init__(self, estimator, scoring='accuracy', cv=3, threshold=0.01, variables=None):
        """
        Initialize RecursiveFeatureElimination.

        Parameters:
        - estimator: Sklearn estimator with feature_importances_ or coef_ attribute
        - scoring (str): Scoring metric for performance evaluation
        - cv (int): Cross-validation folds
        - threshold (float): Performance drop threshold for stopping elimination
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        """

    def fit(self, X, y):
        """
        Perform recursive feature elimination.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features identified by recursive elimination.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with selected features only
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

### Recursive Feature Addition

Selects features by recursively adding the best-performing features.

```python { .api }
class RecursiveFeatureAddition:
    def __init__(self, estimator, scoring='accuracy', cv=3, threshold=0.01, variables=None):
        """
        Initialize RecursiveFeatureAddition.

        Parameters:
        - estimator: Sklearn estimator for performance evaluation
        - scoring (str): Scoring metric for performance evaluation
        - cv (int): Cross-validation folds
        - threshold (float): Performance improvement threshold for stopping addition
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        """

    def fit(self, X, y):
        """
        Perform recursive feature addition.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features identified by recursive addition.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with selected features only
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

### Selection by Shuffling

Selects features by evaluating the drop in performance after shuffling each feature's values.

```python { .api }
class SelectByShuffling:
    def __init__(self, estimator, scoring='accuracy', cv=3, threshold=0.01, variables=None):
        """
        Initialize SelectByShuffling.

        Parameters:
        - estimator: Sklearn estimator for performance evaluation
        - scoring (str): Scoring metric for performance evaluation
        - cv (int): Cross-validation folds
        - threshold (float): Performance drop threshold for feature importance
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        """

    def fit(self, X, y):
        """
        Evaluate feature importance by shuffling.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features that show significant performance drop when shuffled.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with important features only
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```
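
The idea behind shuffling-based selection can be sketched with scikit-learn alone: train a model once, then permute one column at a time and record how much the score falls. This is an illustration of the technique under simplifying assumptions (a single train/test split instead of cross-validation, hypothetical `signal` and `noise` columns), not the code `SelectByShuffling` runs.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: the target depends only on "signal"
X = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
y = (X["signal"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
baseline = model.score(X_test, y_test)

# Shuffle one column at a time and measure the drop in accuracy
drifts = {}
for col in X_test.columns:
    X_shuffled = X_test.copy()
    X_shuffled[col] = rng.permutation(X_shuffled[col].to_numpy())
    drifts[col] = baseline - model.score(X_shuffled, y_test)

threshold = 0.01
selected = [col for col, drift in drifts.items() if drift > threshold]
print(drifts)
print(selected)
```

Shuffling breaks the link between a column and the target while keeping its distribution, so a large drop suggests the model relies on that column.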

### Drop High PSI Features

Removes features with a high Population Stability Index (PSI), which indicates a significant shift in the feature's distribution (data drift).

```python { .api }
class DropHighPSIFeatures:
    def __init__(self, variables=None, split_frac=0.5, threshold=0.25,
                 missing_values='raise', switch=False):
        """
        Initialize DropHighPSIFeatures.

        Parameters:
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        - split_frac (float): Fraction of observations assigned to the reference set; the remainder forms the comparison set
        - threshold (float): PSI threshold above which features are dropped
        - missing_values (str): How to handle missing values - 'raise' or 'ignore'
        - switch (bool): Whether to switch reference and comparison datasets
        """

    def fit(self, X, y=None):
        """
        Calculate PSI for each variable and identify features to drop.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove features with high PSI.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with high PSI features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.selection import DropHighPSIFeatures

# Drop features with PSI > 0.25 indicating significant data drift
selector = DropHighPSIFeatures(threshold=0.25, split_frac=0.6)
df_stable = selector.fit_transform(df)

print(selector.features_to_drop_)  # Features with high PSI
print(selector.psi_values_)  # PSI values per feature
```
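
For intuition about the threshold, PSI is commonly computed by binning a feature and summing `(cmp_i - ref_i) * ln(cmp_i / ref_i)` over the bins, where `ref_i` and `cmp_i` are the fractions of the reference and comparison sets falling in bin `i`. The NumPy sketch below follows that formula under stated assumptions (equal-frequency bins taken from the reference set, a small floor to avoid `log(0)`); the function name `psi` and its parameters are hypothetical and this is not the library's implementation.

```python
import numpy as np

def psi(reference, comparison, bins=10, min_pct=1e-4):
    """Population Stability Index of `comparison` relative to `reference`."""
    reference = np.asarray(reference, dtype=float)
    comparison = np.asarray(comparison, dtype=float)
    # Interior bin edges from reference quantiles (equal-frequency bins)
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_counts = np.bincount(
        np.searchsorted(edges, reference, side="right"), minlength=bins
    )
    cmp_counts = np.bincount(
        np.searchsorted(edges, comparison, side="right"), minlength=bins
    )
    # Floor the fractions so empty bins do not produce log(0)
    ref_pct = np.clip(ref_counts / ref_counts.sum(), min_pct, None)
    cmp_pct = np.clip(cmp_counts / cmp_counts.sum(), min_pct, None)
    return float(np.sum((cmp_pct - ref_pct) * np.log(cmp_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)
stable = reference.copy()                     # identical distribution
shifted = rng.normal(1.0, 1.0, size=5000)     # mean shifted by one standard deviation

psi_stable = psi(reference, stable)
psi_shifted = psi(reference, shifted)
print(psi_stable, psi_shifted)
```

An unchanged distribution gives a PSI of zero, while the shifted sample lands well above the 0.25 threshold used in the example above.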

### Select by Target Mean Performance

Selects features one at a time (univariate analysis), scoring each by how well the mean of the target per category or bin of that feature predicts the target.

```python { .api }
class SelectByTargetMeanPerformance:
    def __init__(self, variables=None, scoring='roc_auc', threshold=0.5, bins=5):
        """
        Initialize SelectByTargetMeanPerformance.

        Parameters:
        - variables (list): List of variables to evaluate. If None, evaluates all numerical variables
        - scoring (str): Performance metric to use for feature evaluation
        - threshold (float): Performance threshold for feature selection
        - bins (int): Number of bins for discretizing continuous variables
        """

    def fit(self, X, y):
        """
        Evaluate target mean performance for each variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features that meet target mean performance threshold.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with selected features only
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.selection import SelectByTargetMeanPerformance

# Select features based on target mean performance
# (y is assumed to be a target Series aligned with df)
selector = SelectByTargetMeanPerformance(
    scoring='roc_auc',
    threshold=0.6,
    bins=5
)
df_selected = selector.fit_transform(df, y)

print(selector.feature_performance_)  # Performance scores per feature
print(selector.features_to_drop_)  # Features below threshold
```

## Usage Patterns

### Sequential Feature Selection Pipeline

```python
from sklearn.pipeline import Pipeline
from feature_engine.selection import (
    DropConstantFeatures,
    DropCorrelatedFeatures,
    SelectBySingleFeaturePerformance
)
from sklearn.ensemble import RandomForestClassifier

# Multi-step feature selection pipeline
selection_pipeline = Pipeline([
    ('drop_constant', DropConstantFeatures(tol=0.99)),
    ('drop_correlated', DropCorrelatedFeatures(threshold=0.95)),
    ('performance_selection', SelectBySingleFeaturePerformance(
        estimator=RandomForestClassifier(n_estimators=10),
        threshold=0.6
    ))
])

df_selected = selection_pipeline.fit_transform(df, y)
```

### Feature Selection with Cross-Validation

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from feature_engine.selection import RecursiveFeatureElimination

# X_train, X_test, y_train are assumed to come from an earlier train/test split

# Feature selection fitted on the training data only
selector = RecursiveFeatureElimination(
    estimator=RandomForestClassifier(),
    cv=5,
    threshold=0.01
)

# Fit selector
selector.fit(X_train, y_train)

# Transform datasets
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Evaluate selected features
scores = cross_val_score(
    RandomForestClassifier(),
    X_train_selected,
    y_train,
    cv=5
)
print(f"CV Score with selected features: {scores.mean():.3f}")
```

## Common Attributes

All selection transformers share these fitted attributes:

- `features_to_drop_` (list): Features identified for removal
- `n_features_in_` (int): Number of features in training set

Selector-specific attributes:
- `correlated_feature_sets_` (list): Groups of correlated features (correlation-based selectors)
- `feature_performance_` (dict): Performance scores per feature (performance-based selectors)
- `performance_drifts_` (dict): Performance changes during selection process (recursive selectors)