# Preprocessing Utilities

General preprocessing functions and transformers for data preparation, including variable matching between datasets to keep training and prediction data consistent in machine learning workflows.

## Capabilities

### Variable Matching

Ensures that the variables in a dataset match those in a reference dataset, adding missing columns and dropping extra ones so that training and prediction datasets share the same structure.

```python { .api }
class MatchVariables:
    def __init__(self, fill_value=np.nan, missing_values='raise', verbose=True):
        """
        Initialize MatchVariables.

        Parameters:
        - fill_value: Value used to fill variables that are in the reference
          set but absent from the dataset being transformed (np.nan by default)
        - missing_values (str): Whether to 'raise' an error or 'ignore' when
          the datasets contain NaN values
        - verbose (bool): Whether to print the names of the variables added
          and dropped during transform
        """

    def fit(self, X, y=None):
        """
        Learn the reference set of variables from the training data.

        Parameters:
        - X (pandas.DataFrame): Reference dataset (typically training data)
        - y (pandas.Series, optional): Target variable (not used)

        Returns:
        - self
        """

    def transform(self, X):
        """
        Transform the dataset to match the reference variables.

        Parameters:
        - X (pandas.DataFrame): Dataset to transform

        Returns:
        - pandas.DataFrame: Dataset with variables matching the reference set
        """

    def fit_transform(self, X, y=None):
        """Fit to data, then transform it."""
```

**Usage Example**:
```python
from feature_engine.preprocessing import MatchVariables
import pandas as pd
import numpy as np

# Training dataset
train_data = {
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
    'feature3': np.random.randn(100),
    'target': np.random.randint(0, 2, 100)
}
df_train = pd.DataFrame(train_data)

# Test dataset with a missing feature and an extra feature
test_data = {
    'feature1': np.random.randn(50),
    'feature2': np.random.randn(50),
    # feature3 is missing
    'feature4': np.random.randn(50)  # Extra feature
}
df_test = pd.DataFrame(test_data)

# Match test data to the training data structure
matcher = MatchVariables(missing_values='ignore')
matcher.fit(df_train.drop('target', axis=1))  # Fit on features only
df_test_matched = matcher.transform(df_test)

print("Training features:", df_train.drop('target', axis=1).columns.tolist())
print("Original test features:", df_test.columns.tolist())
print("Matched test features:", df_test_matched.columns.tolist())
# Result: df_test_matched has feature1, feature2, feature3 (filled with NaN);
# feature4 is dropped
```

## Usage Patterns

### Model Deployment Pipeline

```python
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from feature_engine.preprocessing import MatchVariables
from feature_engine.encoding import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Training pipeline
training_pipeline = Pipeline([
    ('imputer', MeanMedianImputer()),
    ('encoder', OneHotEncoder()),
    ('classifier', RandomForestClassifier())
])

# Fit on training data
training_pipeline.fit(X_train, y_train)

# Fit a matcher on the training features to learn the reference schema
matcher = MatchVariables()
matcher.fit(X_train)

# Deployment pipeline: prepend the fitted matcher and reuse the
# already-fitted steps from the training pipeline
deployment_pipeline = Pipeline([
    ('matcher', matcher),  # Ensure consistent variables
    ('imputer', training_pipeline.named_steps['imputer']),
    ('encoder', training_pipeline.named_steps['encoder']),
    ('classifier', training_pipeline.named_steps['classifier'])
])

# New data with a different column structure can now be scored
predictions = deployment_pipeline.predict(X_new)
```

### Cross-Dataset Validation

```python
# Different datasets with potentially different features
dataset1 = pd.DataFrame({
    'age': [25, 30, 35],
    'income': [50000, 60000, 70000],
    'education': ['BS', 'MS', 'PhD']
})

dataset2 = pd.DataFrame({
    'age': [28, 32],
    'income': [55000, 65000],
    'experience': [3, 5]  # Different feature
})

dataset3 = pd.DataFrame({
    'income': [45000, 75000],
    'education': ['BS', 'MS'],
    'location': ['NYC', 'LA']  # Different feature
})

# Use the first dataset as the reference
matcher = MatchVariables(missing_values='ignore')
matcher.fit(dataset1)

# Transform the other datasets to match
dataset2_matched = matcher.transform(dataset2)
dataset3_matched = matcher.transform(dataset3)

print("Reference columns:", dataset1.columns.tolist())
print("Dataset2 matched:", dataset2_matched.columns.tolist())
print("Dataset3 matched:", dataset3_matched.columns.tolist())
# All have: age, income, education (with NaN where missing)
```

### Feature Engineering Consistency

```python
from sklearn.pipeline import Pipeline
from feature_engine.creation import MathFeatures
from feature_engine.datetime import DatetimeFeatures

# Complex feature engineering pipeline
feature_pipeline = Pipeline([
    ('datetime_features', DatetimeFeatures(
        features_to_extract=['month', 'day_of_week']
    )),
    ('math_combinations', MathFeatures(
        variables=['feature1', 'feature2'],
        func=['sum', 'prod']
    )),
    ('matcher', MatchVariables())  # Ensure final consistency
])

# Fit on training data
feature_pipeline.fit(X_train)

# Apply to validation/test data that may be missing features
X_val_processed = feature_pipeline.transform(X_val)
X_test_processed = feature_pipeline.transform(X_test)

# All datasets now share a consistent feature structure
```

### Handling Schema Changes

```python
# Original model trained on the v1 data schema
v1_schema = ['customer_id', 'purchase_amount', 'product_category', 'region']
v1_data = pd.DataFrame({col: np.random.randn(100) for col in v1_schema})

# New data has an updated schema
v2_schema = ['customer_id', 'purchase_amount', 'product_category', 'region', 'channel', 'discount']
v2_data = pd.DataFrame({col: np.random.randn(50) for col in v2_schema})

# Legacy data is missing newer columns
legacy_schema = ['customer_id', 'purchase_amount', 'product_category']  # Missing region
legacy_data = pd.DataFrame({col: np.random.randn(25) for col in legacy_schema})

# Fit the matcher on the original schema
schema_matcher = MatchVariables(missing_values='ignore')
schema_matcher.fit(v1_data)

# All datasets can be processed consistently
v2_matched = schema_matcher.transform(v2_data)          # Extra columns dropped
legacy_matched = schema_matcher.transform(legacy_data)  # Missing column added as NaN

print("V1 schema:", v1_data.columns.tolist())
print("V2 matched:", v2_matched.columns.tolist())
print("Legacy matched:", legacy_matched.columns.tolist())
# All have the same columns: customer_id, purchase_amount, product_category, region
```

### API Integration

```python
import json

def preprocess_api_data(api_response, trained_matcher):
    """
    Preprocess data from an API response to match model expectations.
    """
    # Parse the API response
    data = json.loads(api_response)
    df = pd.DataFrame([data])  # Single row from the API

    # Match to the expected schema
    df_matched = trained_matcher.transform(df)

    return df_matched

# Example API responses with different structures
api_response_1 = '{"feature1": 1.0, "feature2": 2.0, "feature3": 3.0}'
api_response_2 = '{"feature1": 1.5, "feature2": 2.5}'  # Missing feature3
api_response_3 = '{"feature1": 2.0, "feature2": 3.0, "feature3": 4.0, "extra_field": 5.0}'

# The matcher learns the expected schema (feature1, feature2, feature3) from
# a reference DataFrame; fit needs actual data, not just column labels
reference = pd.DataFrame({'feature1': [0.0], 'feature2': [0.0], 'feature3': [0.0]})
matcher = MatchVariables()
matcher.fit(reference)

# All API responses can be handled consistently
for i, response in enumerate([api_response_1, api_response_2, api_response_3], 1):
    processed = preprocess_api_data(response, matcher)
    print(f"API response {i} processed shape:", processed.shape)
    print(f"Columns: {processed.columns.tolist()}")
```

### Error Handling Modes

```python
# A dataset that contains NaN values
df_with_nan = df_test.copy()
df_with_nan.loc[0, 'feature1'] = np.nan

# Strict mode - raise an error when the data contain NaN values
strict_matcher = MatchVariables(missing_values='raise')
strict_matcher.fit(df_train)

try:
    result = strict_matcher.transform(df_with_nan)
except ValueError as e:
    print(f"Strict mode error: {e}")

# Lenient mode - tolerate NaN values
lenient_matcher = MatchVariables(missing_values='ignore')
lenient_matcher.fit(df_train)
result = lenient_matcher.transform(df_with_nan)  # Succeeds

# In both modes, variables absent from the data to transform are added
# (filled with fill_value, NaN by default) and extra variables are dropped
```

## Best Practices

### 1. Use in Production Pipelines
Always include MatchVariables in production pipelines to handle schema changes gracefully (see the Model Deployment Pipeline pattern above).

### 2. Fit on Training Data Only
Fit the matcher on the training data to establish the canonical variable set.

### 3. Handle Missing Data Downstream
Columns added by the matcher are filled with NaN by default, so use missing_values='ignore' and pair the matcher with an appropriate imputation step downstream, as in the sketch below.

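A minimal sketch of this pattern, assuming numerical features; the matcher runs first so that any columns it adds as NaN are filled by the imputer with statistics learned during training:

```python
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer
from feature_engine.preprocessing import MatchVariables
import pandas as pd
import numpy as np

df_train = pd.DataFrame({
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
})

prep = Pipeline([
    ('matcher', MatchVariables(missing_values='ignore')),
    ('imputer', MeanMedianImputer(imputation_method='median')),
])
prep.fit(df_train)

# A dataset missing feature2 gets the column back, filled with the
# median of feature2 learned from the training data
df_new = pd.DataFrame({'feature1': np.random.randn(10)})
print(prep.transform(df_new).head())
```
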
### 4. Version Control Schemas
Keep track of the expected schema when deploying models to different environments; see the sketch below.

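One lightweight option, sketched here as a general pattern rather than a feature of the library (the file name is illustrative): persist the learned schema next to the model artifacts and check it at deployment time.

```python
import json

# Record the reference schema of a fitted matcher
with open('schema_v1.json', 'w') as f:
    json.dump({'version': 1, 'columns': list(matcher.feature_names_in_)}, f)

# At deployment time, verify the matcher against the recorded schema
with open('schema_v1.json') as f:
    expected = json.load(f)['columns']
assert list(matcher.feature_names_in_) == expected, "Schema mismatch"
```
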
### 5. Monitor Schema Drift
Log when MatchVariables adds or removes columns to detect schema drift; see the sketch below.

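With `verbose=True` (the default) MatchVariables prints the variables it adds and drops; for structured monitoring you can log the column differences yourself. A minimal sketch (the wrapper function is illustrative, not part of the library):

```python
import logging

logger = logging.getLogger('schema_drift')

def transform_with_drift_log(matcher, df):
    """Transform df, logging any columns the matcher adds or drops."""
    expected = set(matcher.feature_names_in_)
    incoming = set(df.columns)
    added = expected - incoming    # columns that will be filled in
    dropped = incoming - expected  # columns that will be discarded
    if added:
        logger.warning("Input missing columns, added: %s", sorted(added))
    if dropped:
        logger.warning("Input had unexpected columns, dropped: %s", sorted(dropped))
    return matcher.transform(df)
```
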
## Common Attributes

MatchVariables has these fitted attributes:

- `feature_names_in_` (list): Reference set of variables learned during fit
- `n_features_in_` (int): Number of features in the training set

The transformer ensures that output datasets always have exactly the variables in `feature_names_in_`, adding missing variables as new columns (filled with `fill_value`, NaN by default) and dropping extra variables.

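A quick way to inspect the fitted state, reusing `df_train` from the usage example above:

```python
matcher = MatchVariables(missing_values='ignore')
matcher.fit(df_train.drop('target', axis=1))

print(matcher.feature_names_in_)  # ['feature1', 'feature2', 'feature3']
print(matcher.n_features_in_)     # 3
```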