# Scikit-learn Wrappers

Wrappers that apply scikit-learn transformers to specific subsets of variables while keeping the DataFrame structure and column names, so scikit-learn functionality can be used directly within feature-engine workflows.

## Capabilities

### Scikit-learn Transformer Wrapper

Wrapper to apply a scikit-learn transformer to a selected group of variables while preserving DataFrame structure.

```python { .api }
class SklearnTransformerWrapper:
    def __init__(self, transformer, variables=None):
        """
        Initialize SklearnTransformerWrapper.

        Parameters:
        - transformer: Instance of a scikit-learn transformer (must have fit, transform methods)
        - variables (list): List of variables to be transformed. If None, transforms all numerical variables
        """

    def fit(self, X, y=None):
        """
        Fit the scikit-learn transformer on selected variables.

        Parameters:
        - X (pandas.DataFrame): Training dataset
        - y (pandas.Series, optional): Target variable (passed to transformer if needed)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Transform data using the fitted scikit-learn transformer.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with transformed variables, maintaining DataFrame structure
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""

    def inverse_transform(self, X):
        """
        Inverse transform using the scikit-learn transformer (if supported).

        Parameters:
        - X (pandas.DataFrame): Dataset with transformed values

        Returns:
        - pandas.DataFrame: Dataset with original scale restored
        """
```

**Usage Examples**:

### Standard Scaling
```python
from feature_engine.wrappers import SklearnTransformerWrapper
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Sample numerical data
data = {
    'feature1': np.random.normal(100, 20, 1000),
    'feature2': np.random.normal(50, 10, 1000),
    'feature3': np.random.normal(200, 50, 1000),
    'categorical': np.random.choice(['A', 'B', 'C'], 1000)
}
df = pd.DataFrame(data)

# Apply StandardScaler to specific numerical variables
scaler_wrapper = SklearnTransformerWrapper(
    transformer=StandardScaler(),
    variables=['feature1', 'feature2']
)
df_scaled = scaler_wrapper.fit_transform(df)

# feature3 and categorical remain unchanged
# feature1 and feature2 are standardized
print(df_scaled.describe())
print(df_scaled.dtypes)  # DataFrame structure preserved
```

### Principal Component Analysis
```python
from sklearn.decomposition import PCA

# Apply PCA to selected variables.
# Caveat: some feature-engine releases restrict the wrapper to a fixed list of
# supported transformers and may reject PCA with an error; check the
# documentation of the installed version.
pca_wrapper = SklearnTransformerWrapper(
    transformer=PCA(n_components=2),
    variables=['feature1', 'feature2', 'feature3']
)
df_pca = pca_wrapper.fit_transform(df)

# Note: PCA creates new features, original variables are replaced
# with principal components (PC1, PC2, etc.)
print("PCA explained variance ratio:",
      pca_wrapper.transformer_.explained_variance_ratio_)
```

### Robust Scaling
```python
from sklearn.preprocessing import RobustScaler

# Apply RobustScaler (less sensitive to outliers)
robust_wrapper = SklearnTransformerWrapper(
    transformer=RobustScaler(),
    variables=['feature1', 'feature3']
)
df_robust = robust_wrapper.fit_transform(df)

# Inverse transformation
df_original = robust_wrapper.inverse_transform(df_robust)
```

### Polynomial Features
```python
from sklearn.preprocessing import PolynomialFeatures

# Generate polynomial features
poly_wrapper = SklearnTransformerWrapper(
    transformer=PolynomialFeatures(degree=2, include_bias=False),
    variables=['feature1', 'feature2']
)
df_poly = poly_wrapper.fit_transform(df)

# Creates additional polynomial combination features
print(f"Original features: {len(df.columns)}")
print(f"With polynomial features: {len(df_poly.columns)}")
```

### Quantile Transformation
```python
from sklearn.preprocessing import QuantileTransformer

# Apply quantile transformation for normalization
quantile_wrapper = SklearnTransformerWrapper(
    transformer=QuantileTransformer(output_distribution='normal'),
    variables=['feature1', 'feature2', 'feature3']
)
df_quantile = quantile_wrapper.fit_transform(df)

# Each selected variable is mapped to an approximately normal distribution
```

## Advanced Usage Patterns

### Pipeline Integration

```python
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from feature_engine.wrappers import SklearnTransformerWrapper
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier

# Complex preprocessing pipeline
preprocessing_pipeline = Pipeline([
    ('imputer', MeanMedianImputer()),
    ('polynomial', SklearnTransformerWrapper(
        transformer=PolynomialFeatures(degree=2),
        variables=['feature1', 'feature2']
    )),
    ('scaler', SklearnTransformerWrapper(
        transformer=StandardScaler(),
        variables=None  # Scale all numerical variables
    )),
    ('classifier', RandomForestClassifier())
])

# Fit and predict
# X_train, y_train and X_test are assumed to be existing train/test splits
# with numerical features; they are not defined in this snippet.
preprocessing_pipeline.fit(X_train, y_train)
predictions = preprocessing_pipeline.predict(X_test)
```

### Multiple Transformer Application

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Apply different scalers to different variable groups
standard_scaler_wrapper = SklearnTransformerWrapper(
    transformer=StandardScaler(),
    variables=['feature1', 'feature2']
)

minmax_scaler_wrapper = SklearnTransformerWrapper(
    transformer=MinMaxScaler(),
    variables=['feature3']
)

# Sequential application
df_multi_scaled = standard_scaler_wrapper.fit_transform(df)
df_multi_scaled = minmax_scaler_wrapper.fit_transform(df_multi_scaled)
```

### Custom Scikit-learn Transformer

```python
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# Custom transformer
class LogTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.log1p(X)  # log(1 + x)

    def inverse_transform(self, X):
        return np.expm1(X)  # exp(x) - 1

# Use with wrapper.
# Caveat: some feature-engine releases only accept a fixed list of
# scikit-learn transformers, in which case a custom class may be rejected.
log_wrapper = SklearnTransformerWrapper(
    transformer=LogTransformer(),
    variables=['feature1', 'feature2']
)
df_log = log_wrapper.fit_transform(df)
df_original = log_wrapper.inverse_transform(df_log)
```

### Handling Categorical Variables

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# For categorical variables with sklearn transformers
categorical_data = {
    'category1': ['A', 'B', 'C', 'A', 'B'],
    'category2': ['X', 'Y', 'Z', 'X', 'Y'],
    'numerical': [1, 2, 3, 4, 5]
}
df_cat = pd.DataFrame(categorical_data)

# Use OrdinalEncoder for multiple categorical variables
ordinal_wrapper = SklearnTransformerWrapper(
    transformer=OrdinalEncoder(),
    variables=['category1', 'category2']
)
df_encoded = ordinal_wrapper.fit_transform(df_cat)
```

### Cross-Validation with Wrapper

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper

# Create pipeline with wrapper
pipeline_with_wrapper = Pipeline([
    ('scaler', SklearnTransformerWrapper(
        transformer=StandardScaler(),
        variables=['feature1', 'feature2', 'feature3']
    )),
    ('regressor', RandomForestRegressor())
])

# Cross-validation
# X_train and y_train are assumed to be an existing numerical training set
# and a continuous target; they are not defined in this snippet.
cv_scores = cross_val_score(
    pipeline_with_wrapper,
    X_train,
    y_train,
    cv=5,
    scoring='neg_mean_squared_error'
)
print(f"CV RMSE: {np.sqrt(-cv_scores.mean()):.3f}")
```

## Benefits of Using SklearnTransformerWrapper

### Maintains DataFrame Structure
- Preserves column names and indices
- Keeps non-transformed columns unchanged
- Returns a pandas DataFrame instead of a NumPy array

### Variable Selection
- Apply transformers to specific subsets of variables
- Leave categorical or irrelevant variables untouched
- Pass an explicit list of variables, or leave `variables=None` to select all numerical variables

### Pipeline Compatibility
- Works seamlessly with feature-engine transformers
- Integrates with scikit-learn pipelines
- Maintains consistent API across transformers

### Inverse Transformation Support
- Provides inverse transformation when the wrapped transformer supports it
- Recovers values on the original scale
- Useful for interpretability and debugging

## Common Use Cases

1. **Preprocessing specific variable types**: Apply StandardScaler only to continuous variables
2. **Dimensionality reduction**: Use PCA on high-dimensional feature subsets
3. **Distribution transformation**: Apply QuantileTransformer to skewed variables
4. **Feature generation**: Create polynomial features from selected variables
5. **Robust scaling**: Use RobustScaler for variables with outliers

## Common Attributes

SklearnTransformerWrapper has these fitted attributes:

- `transformer_` (sklearn transformer): Fitted scikit-learn transformer instance
- `variables_` (list): Variables that were transformed
- `n_features_in_` (int): Number of features in training set

The wrapper exposes the fitted scikit-learn transformer through `transformer_`, which gives access to its learned parameters, such as feature names or explained variance.