
# Feature Selection

Transformers for removing or selecting features based on various criteria, including variance, correlation, performance metrics, and statistical tests, to improve model performance and reduce dimensionality.

## Capabilities

### Drop Features by Name

Drops a user-specified list of variables from the dataframe.

```python { .api }
class DropFeatures:
    def __init__(self, features_to_drop):
        """
        Initialize DropFeatures.

        Parameters:
        - features_to_drop (list): Variable names to be dropped from dataframe
        """

    def fit(self, X, y=None):
        """
        Validate that features exist in dataset (no parameters learned).

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Drop indicated features from dataset.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with specified features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.selection import DropFeatures
import pandas as pd

# Sample data
data = {'var1': [1, 2, 3], 'var2': [4, 5, 6], 'var3': [7, 8, 9]}
df = pd.DataFrame(data)

# Drop specific features
selector = DropFeatures(['var1', 'var3'])
df_reduced = selector.fit_transform(df)
# Result: only var2 remains

print(selector.features_to_drop_)  # Shows features that will be dropped
```

### Drop Constant Features

Removes constant and quasi-constant features that provide little information.

```python { .api }
class DropConstantFeatures:
    def __init__(self, variables=None, tol=1, missing_values='raise'):
        """
        Initialize DropConstantFeatures.

        Parameters:
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        - tol (float): Threshold for quasi-constant detection (0-1). Variables whose most frequent value makes up this fraction of observations are dropped
        - missing_values (str): How to handle missing values - 'raise' or 'ignore'
        """

    def fit(self, X, y=None):
        """
        Identify constant and quasi-constant features.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove constant and quasi-constant features.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with constant features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.selection import DropConstantFeatures

# Drop truly constant features (default)
selector = DropConstantFeatures()
df_reduced = selector.fit_transform(df)

# Drop quasi-constant features (>95% same value)
selector = DropConstantFeatures(tol=0.95)
df_reduced = selector.fit_transform(df)

print(selector.features_to_drop_)  # Features identified as constant/quasi-constant
```
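To make the `tol` threshold concrete, the sketch below computes the share of the most frequent value in each column using only the standard library. It illustrates the idea and is not the library's implementation: the helper name `quasi_constant_columns`, the dict-of-lists input, and the `>=` comparison against `tol` are assumptions made for illustration.

```python
from collections import Counter


def quasi_constant_columns(columns, tol=1.0):
    """Return names of columns whose most frequent value covers at least
    `tol` of the rows (hypothetical helper, not part of feature_engine)."""
    flagged = []
    for name, values in columns.items():
        # Share of rows taken by the single most common value
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) >= tol:
            flagged.append(name)
    return flagged


columns = {
    "constant": [1] * 20,
    "mostly_zero": [0] * 19 + [1],  # 95% zeros
    "varied": list(range(20)),
}

print(quasi_constant_columns(columns))            # ['constant']
print(quasi_constant_columns(columns, tol=0.95))  # ['constant', 'mostly_zero']
```

With the default `tol=1.0` only the fully constant column is flagged; lowering the threshold also catches the column dominated by a single value.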

### Drop Duplicate Features

Removes duplicate features from dataframe based on identical values.

```python { .api }
class DropDuplicateFeatures:
    def __init__(self, variables=None, missing_values='raise'):
        """
        Initialize DropDuplicateFeatures.

        Parameters:
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        - missing_values (str): How to handle missing values - 'raise' or 'ignore'
        """

    def fit(self, X, y=None):
        """
        Identify duplicate features.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove duplicate features, keeping first occurrence.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with duplicate features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```
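The "keep first occurrence" behaviour described above can be pictured with a small standard-library sketch. This is an illustration under stated assumptions, not the library's code: `duplicate_columns` and the dict-of-lists input are hypothetical, and two columns count as duplicates only when every value matches in order.

```python
def duplicate_columns(columns):
    """Map each duplicated column name to the first column it repeats
    (hypothetical helper, not part of feature_engine)."""
    first_seen = {}  # tuple of values -> name of the first column holding them
    duplicates = {}
    for name, values in columns.items():
        key = tuple(values)
        if key in first_seen:
            duplicates[name] = first_seen[key]
        else:
            first_seen[key] = name
    return duplicates


columns = {
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "a_copy": [1, 2, 3],
    "b_copy": [4, 5, 6],
}

dupes = duplicate_columns(columns)
kept = [name for name in columns if name not in dupes]
print(dupes)  # {'a_copy': 'a', 'b_copy': 'b'}
print(kept)   # ['a', 'b']
```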

### Drop Correlated Features

Removes correlated features from dataframe to reduce multicollinearity.

```python { .api }
class DropCorrelatedFeatures:
    def __init__(self, variables=None, method='pearson', threshold=0.8, missing_values='raise'):
        """
        Initialize DropCorrelatedFeatures.

        Parameters:
        - variables (list): List of numerical variables to evaluate. If None, selects all numerical variables
        - method (str): Correlation method - 'pearson', 'spearman', or 'kendall'
        - threshold (float): Correlation threshold (0-1) above which features are considered correlated
        - missing_values (str): How to handle missing values - 'raise' or 'ignore'
        """

    def fit(self, X, y=None):
        """
        Identify correlated features to remove.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove correlated features.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with correlated features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.selection import DropCorrelatedFeatures

# Drop features with Pearson correlation > 0.8
selector = DropCorrelatedFeatures(threshold=0.8, method='pearson')
df_reduced = selector.fit_transform(df)

# Use Spearman correlation
selector = DropCorrelatedFeatures(threshold=0.9, method='spearman')
df_reduced = selector.fit_transform(df)

print(selector.correlated_feature_sets_)  # Shows groups of correlated features
print(selector.features_to_drop_)  # Features selected for removal
```
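As a rough picture of what threshold-based correlation filtering does, the following standard-library sketch scans columns in order and drops any column that is too strongly correlated with one already kept. It is not the library's algorithm: the helper names, the first-come-first-kept ordering, and the use of the absolute Pearson coefficient are all assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_to_drop(columns, threshold=0.8):
    """Drop any column whose absolute correlation with an already-kept
    column exceeds `threshold` (hypothetical helper)."""
    kept, dropped = [], []
    for name, values in columns.items():
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return dropped


columns = {
    "x": [1, 2, 3, 4, 5],
    "x_scaled": [2, 4, 6, 8, 10],  # perfectly correlated with x
    "z": [2, 1, 3, 1, 2],          # essentially uncorrelated with x
}
print(correlated_to_drop(columns))  # ['x_scaled']
```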

### Smart Correlated Selection

Selects one feature from each group of correlated features, based on variance or on model performance against the target variable.

```python { .api }
class SmartCorrelatedSelection:
    def __init__(self, variables=None, method='pearson', threshold=0.8,
                 selection_method='variance', estimator=None, scoring='accuracy', cv=3):
        """
        Initialize SmartCorrelatedSelection.

        Parameters:
        - variables (list): List of numerical variables to evaluate. If None, selects all numerical variables
        - method (str): Correlation method - 'pearson', 'spearman', or 'kendall'
        - threshold (float): Correlation threshold (0-1) for grouping correlated features
        - selection_method (str): Method to select from correlated groups - 'variance' or 'model_performance'
        - estimator: Sklearn estimator for performance-based selection
        - scoring (str): Scoring metric for model performance evaluation
        - cv (int): Cross-validation folds
        """

    def fit(self, X, y=None):
        """
        Identify correlated groups and select best feature from each group.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required for model_performance selection)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Keep only selected features from correlated groups.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with smart feature selection applied
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```
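The `'variance'` selection method can be sketched as "keep the highest-variance member of each correlated group". The snippet below is a standard-library illustration only: `pick_by_variance` is a hypothetical helper, the correlated groups are supplied by hand, and population variance is used as an arbitrary choice.

```python
from statistics import pvariance


def pick_by_variance(columns, correlated_groups):
    """For each group of correlated column names, keep the one with the
    highest variance and mark the rest for removal (hypothetical helper)."""
    to_drop = []
    for group in correlated_groups:
        best = max(group, key=lambda name: pvariance(columns[name]))
        to_drop.extend(name for name in group if name != best)
    return to_drop


columns = {
    "height_cm": [150, 160, 170, 180],
    "height_m": [1.5, 1.6, 1.7, 1.8],
    "age": [30, 25, 41, 37],
}
# Groups are given directly here; in practice they would come from a correlation step
groups = [["height_cm", "height_m"]]

print(pick_by_variance(columns, groups))  # ['height_m']
```

Swapping the `key` function for a cross-validated score would give the flavour of the `'model_performance'` method, which is why that method needs `y` and an estimator.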

### Performance-Based Selection

Selects features based on individual performance metrics.

```python { .api }
class SelectBySingleFeaturePerformance:
    def __init__(self, estimator, scoring='accuracy', cv=3, threshold=0.5, variables=None):
        """
        Initialize SelectBySingleFeaturePerformance.

        Parameters:
        - estimator: Sklearn estimator to evaluate feature performance
        - scoring (str): Scoring metric for performance evaluation
        - cv (int): Cross-validation folds
        - threshold (float): Performance threshold for feature selection
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        """

    def fit(self, X, y):
        """
        Evaluate individual performance of each feature.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features that meet performance threshold.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with only high-performing features
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.selection import SelectBySingleFeaturePerformance
from sklearn.ensemble import RandomForestClassifier

# `y` is the target Series aligned with `df` (not defined in this snippet)
# Select features based on individual performance
selector = SelectBySingleFeaturePerformance(
    estimator=RandomForestClassifier(n_estimators=10),
    scoring='accuracy',
    cv=3,
    threshold=0.6
)
df_selected = selector.fit_transform(df, y)

print(selector.feature_performance_)  # Performance score per feature
print(selector.features_to_drop_)  # Features below threshold
```

### Recursive Feature Elimination

Selects features by recursively eliminating the worst-performing features.

```python { .api }
class RecursiveFeatureElimination:
    def __init__(self, estimator, scoring='accuracy', cv=3, threshold=0.01, variables=None):
        """
        Initialize RecursiveFeatureElimination.

        Parameters:
        - estimator: Sklearn estimator with feature_importances_ or coef_ attribute
        - scoring (str): Scoring metric for performance evaluation
        - cv (int): Cross-validation folds
        - threshold (float): Performance drop threshold for stopping elimination
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        """

    def fit(self, X, y):
        """
        Perform recursive feature elimination.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features identified by recursive elimination.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with selected features only
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```
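The elimination loop can be sketched in plain Python to show how a performance-drop threshold decides which removals stick. This is a simplified illustration, not the library's implementation: `recursive_elimination`, `toy_score`, the importance ranking, and the contribution numbers are all hypothetical, and `score_fn` stands in for a cross-validated model score.

```python
def recursive_elimination(ranked_features, score_fn, threshold=0.01):
    """Try removing features from least to most important; a removal sticks
    only if the score drops by no more than `threshold`."""
    selected = list(ranked_features)
    baseline = score_fn(selected)
    drifts = {}
    for feature in ranked_features:
        candidate = [f for f in selected if f != feature]
        if not candidate:
            break  # never remove the last remaining feature
        drift = baseline - score_fn(candidate)
        drifts[feature] = drift
        if drift <= threshold:
            # Removal costs (almost) nothing: drop the feature, reset the baseline
            selected = candidate
            baseline = score_fn(selected)
    return selected, drifts


# Toy score: each feature adds a fixed amount (illustrative numbers only)
contribution = {"noise": 0.0, "weak": 0.005, "signal_b": 0.2, "signal_a": 0.3}


def toy_score(features):
    return 0.5 + sum(contribution[f] for f in features)


ranked = ["noise", "weak", "signal_b", "signal_a"]  # least to most important
selected, drifts = recursive_elimination(ranked, toy_score)
print(selected)  # ['signal_b', 'signal_a']
```

Here `drifts` plays a role similar to the per-feature performance changes a recursive selector records during fitting.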

### Recursive Feature Addition

Selects features by recursively adding the best-performing features.

```python { .api }
class RecursiveFeatureAddition:
    def __init__(self, estimator, scoring='accuracy', cv=3, threshold=0.01, variables=None):
        """
        Initialize RecursiveFeatureAddition.

        Parameters:
        - estimator: Sklearn estimator for performance evaluation
        - scoring (str): Scoring metric for performance evaluation
        - cv (int): Cross-validation folds
        - threshold (float): Performance improvement threshold for stopping addition
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        """

    def fit(self, X, y):
        """
        Perform recursive feature addition.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features identified by recursive addition.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with selected features only
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```
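Forward addition works in the opposite direction from elimination: start small and keep a feature only if it improves the score enough. The sketch below is a simplified illustration under stated assumptions, not the library's implementation: `recursive_addition`, `toy_score`, the ranking, and the contribution numbers are hypothetical, and `score_fn` stands in for a cross-validated model score.

```python
def recursive_addition(ranked_features, score_fn, threshold=0.01):
    """Start from the most important feature and add the others one at a
    time, keeping an addition only if the score improves by more than
    `threshold`."""
    selected = [ranked_features[0]]
    baseline = score_fn(selected)
    for feature in ranked_features[1:]:
        candidate_score = score_fn(selected + [feature])
        if candidate_score - baseline > threshold:
            selected.append(feature)
            baseline = candidate_score
    return selected


# Toy score: each feature adds a fixed amount (illustrative numbers only)
contribution = {"signal_a": 0.3, "signal_b": 0.2, "weak": 0.005, "noise": 0.0}


def toy_score(features):
    return 0.5 + sum(contribution[f] for f in features)


ranked = ["signal_a", "signal_b", "weak", "noise"]  # most to least important
print(recursive_addition(ranked, toy_score))  # ['signal_a', 'signal_b']
```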

### Selection by Shuffling

Selects features by evaluating performance drop after shuffling feature values.

```python { .api }
class SelectByShuffling:
    def __init__(self, estimator, scoring='accuracy', cv=3, threshold=0.01, variables=None):
        """
        Initialize SelectByShuffling.

        Parameters:
        - estimator: Sklearn estimator for performance evaluation
        - scoring (str): Scoring metric for performance evaluation
        - cv (int): Cross-validation folds
        - threshold (float): Performance drop threshold for feature importance
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        """

    def fit(self, X, y):
        """
        Evaluate feature importance by shuffling.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features that show significant performance drop when shuffled.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with important features only
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```
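The idea behind shuffling can be shown without any modelling library: permute one feature at a time and measure how much accuracy falls. The sketch below is an illustration under stated assumptions, not the library's implementation: `shuffle_drifts` is a hypothetical helper, `predict` stands in for an already-fitted model, and no cross-validation is performed.

```python
import random


def shuffle_drifts(rows, targets, predict, seed=0):
    """Accuracy drop per feature after shuffling that feature's values.
    `predict` maps a row dict to a predicted label."""
    rng = random.Random(seed)

    def accuracy(data):
        return sum(predict(r) == t for r, t in zip(data, targets)) / len(targets)

    baseline = accuracy(rows)
    drifts = {}
    for feature in rows[0]:
        values = [r[feature] for r in rows]
        rng.shuffle(values)
        # Copy the rows with only this feature's column permuted
        shuffled = [{**r, feature: v} for r, v in zip(rows, values)]
        drifts[feature] = baseline - accuracy(shuffled)
    return drifts


def toy_model(row):
    # Stand-in for a fitted model: it looks at `signal` and ignores `noise`
    return int(row["signal"] >= 10)


rows = [{"signal": i, "noise": (i * 7) % 5} for i in range(20)]
targets = [int(r["signal"] >= 10) for r in rows]

drifts = shuffle_drifts(rows, targets, toy_model)
threshold = 0.01
selected = [f for f, d in drifts.items() if d > threshold]
print(drifts["noise"])  # 0.0, because the model never reads `noise`
```

A feature whose shuffling leaves the score unchanged carries no information for the model, so it falls below any positive threshold.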

### Drop High PSI Features

Removes features with high Population Stability Index, indicating significant data drift.

```python { .api }
class DropHighPSIFeatures:
    def __init__(self, variables=None, split_frac=0.5, threshold=0.25,
                 missing_values='raise', switch=False):
        """
        Initialize DropHighPSIFeatures.

        Parameters:
        - variables (list): List of variables to evaluate. If None, evaluates all variables
        - split_frac (float): Fraction of data to use for reference vs comparison
        - threshold (float): PSI threshold above which features are dropped
        - missing_values (str): How to handle missing values - 'raise' or 'ignore'
        - switch (bool): Whether to switch reference and comparison datasets
        """

    def fit(self, X, y=None):
        """
        Calculate PSI for each variable and identify features to drop.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Remove features with high PSI.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with high PSI features removed
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.selection import DropHighPSIFeatures

# Drop features with PSI > 0.25 indicating significant data drift
selector = DropHighPSIFeatures(threshold=0.25, split_frac=0.6)
df_stable = selector.fit_transform(df)

print(selector.features_to_drop_)  # Features with high PSI
print(selector.psi_values_)  # PSI values per feature
```
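For readers unfamiliar with the metric, the Population Stability Index compares the binned distribution of a variable in a reference sample with its distribution in a comparison sample, commonly written as the sum over bins of `(actual - expected) * ln(actual / expected)`. The sketch below computes it from pre-binned fractions; the function name `psi`, the `eps` guard for empty bins, and the example distributions are assumptions for illustration, not the library's implementation.

```python
from math import log


def psi(expected_fracs, actual_fracs, eps=1e-4):
    """Population Stability Index between two binned distributions.
    Inputs are per-bin fractions that each sum to 1; `eps` avoids log(0)."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * log(a / e)
    return total


reference = [0.25, 0.25, 0.25, 0.25]
stable = [0.25, 0.25, 0.25, 0.25]
shifted = [0.05, 0.15, 0.30, 0.50]

print(psi(reference, stable))          # 0.0
print(psi(reference, shifted) > 0.25)  # True
```

Identical distributions give a PSI of zero, while the strongly shifted one lands well above the 0.25 threshold used in the example above.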

### Select by Target Mean Performance

Selects features based on target mean performance for univariate analysis.

```python { .api }
class SelectByTargetMeanPerformance:
    def __init__(self, variables=None, scoring='roc_auc', threshold=0.5, bins=5):
        """
        Initialize SelectByTargetMeanPerformance.

        Parameters:
        - variables (list): List of variables to evaluate. If None, evaluates all numerical variables
        - scoring (str): Performance metric to use for feature evaluation
        - threshold (float): Performance threshold for feature selection
        - bins (int): Number of bins for discretizing continuous variables
        """

    def fit(self, X, y):
        """
        Evaluate target mean performance for each variable.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series): Target variable (required)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Select features that meet target mean performance threshold.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with selected features only
        """

    def fit_transform(self, X, y):
        """Fit to data, then transform it."""
```

**Usage Example**:

```python
from feature_engine.selection import SelectByTargetMeanPerformance

# Select features based on target mean performance
selector = SelectByTargetMeanPerformance(
    scoring='roc_auc',
    threshold=0.6,
    bins=5
)
df_selected = selector.fit_transform(df, y)

print(selector.feature_performance_)  # Performance scores per feature
print(selector.features_to_drop_)  # Features below threshold
```

## Usage Patterns

### Sequential Feature Selection Pipeline

```python
from sklearn.pipeline import Pipeline
from feature_engine.selection import (
    DropConstantFeatures,
    DropCorrelatedFeatures,
    SelectBySingleFeaturePerformance
)
from sklearn.ensemble import RandomForestClassifier

# Multi-step feature selection pipeline
selection_pipeline = Pipeline([
    ('drop_constant', DropConstantFeatures(tol=0.99)),
    ('drop_correlated', DropCorrelatedFeatures(threshold=0.95)),
    ('performance_selection', SelectBySingleFeaturePerformance(
        estimator=RandomForestClassifier(n_estimators=10),
        threshold=0.6
    ))
])

df_selected = selection_pipeline.fit_transform(df, y)
```

### Feature Selection with Cross-Validation

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from feature_engine.selection import RecursiveFeatureElimination

# X_train, X_test, y_train are assumed to come from an earlier train/test split

# Feature selection with proper evaluation
selector = RecursiveFeatureElimination(
    estimator=RandomForestClassifier(),
    cv=5,
    threshold=0.01
)

# Fit selector on the training data only
selector.fit(X_train, y_train)

# Transform datasets
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Evaluate selected features
scores = cross_val_score(
    RandomForestClassifier(),
    X_train_selected,
    y_train,
    cv=5
)
print(f"CV Score with selected features: {scores.mean():.3f}")
```

## Common Attributes

All selection transformers share these fitted attributes:

- `features_to_drop_` (list): Features identified for removal
- `n_features_in_` (int): Number of features in training set

Selector-specific attributes:

- `correlated_feature_sets_` (list): Groups of correlated features (correlation-based selectors)
- `feature_performance_` (dict): Performance scores per feature (performance-based selectors)
- `performance_drifts_` (dict): Performance changes during selection process (recursive selectors)